Sparse Markov Decision Processes with Causal Sparse Tsallis Entropy Regularization for Reinforcement Learning

Kyungjae Lee, Sungjoon Choi, and Songhwai Oh

arXiv v3 [cs.LG], 3 Oct 2017

Abstract—In this paper, a sparse Markov decision process (MDP) with a novel causal sparse Tsallis entropy regularization is proposed. The proposed policy regularization induces a sparse and multi-modal optimal policy distribution of a sparse MDP. A full mathematical analysis of the proposed sparse MDP is provided. We first analyze the optimality condition of a sparse MDP. Then, we propose a sparse value iteration method which solves a sparse MDP and prove its convergence and optimality using the Banach fixed-point theorem. The proposed sparse MDP is compared to a soft MDP, which utilizes causal entropy regularization. We show that the performance error of a sparse MDP has a constant bound, while the error of a soft MDP increases logarithmically with respect to the number of actions, where this performance error is caused by the introduced regularization term. In experiments, we apply sparse MDPs to reinforcement learning problems. The proposed method outperforms existing methods in terms of convergence speed and performance.

I. INTRODUCTION

Markov decision processes (MDPs) have been widely used as a mathematical framework to solve stochastic sequential decision problems, such as autonomous driving [1], path planning [2], and quadrotor control [3]. In general, the goal of an MDP is to find the optimal policy function which maximizes the expected return. The expected return is a performance measure of a policy function and is often defined as the expected sum of discounted rewards. An MDP is often used to formulate reinforcement learning (RL) [4], which aims to find the optimal policy without an explicit specification of the stochasticity of an environment, and inverse reinforcement learning (IRL) [5], whose goal is to search for a proper reward function that can explain the behavior of an expert who follows the underlying optimal policy. While the optimal solution of an MDP is a deterministic policy, it is not desirable to apply an MDP to problems with multiple optimal actions. From the perspective of RL, the knowledge of multiple optimal actions makes it possible to cope with unexpected situations. For example, suppose that an autonomous vehicle has multiple optimal routes to reach a given goal. If a traffic accident occurs on the currently selected optimal route, it is possible to avoid the accident by choosing another safe optimal route without additional computation of a new optimal route. For this reason, it is more desirable to learn all possible optimal actions in terms of the robustness of a policy function. From the perspective of IRL, since an expert often makes multiple decisions in the same situation, a deterministic policy has limitations in expressing the expert's behavior. For this reason, it is indispensable to model the policy function of an expert as a multi-modal distribution. These reasons give rise to the necessity of a multi-modal policy model.

K. Lee, S. Choi, and S. Oh are with the Department of Electrical and Computer Engineering and ASRI, Seoul National University, Seoul, Korea (e-mail: {kyungjae.lee, sungjoon.choi, songhwai.oh}@cpslab.snu.ac.kr).

In order to address the issues with a deterministic policy function, causal entropy regularization methods have been utilized [6]–[10]. This is mainly due to the fact that the optimal solution of an MDP with causal entropy regularization becomes a softmax distribution of the state-action value Q(s,a), i.e., π(a|s) ∝ exp(Q(s,a)), which is often referred to as a soft MDP [11]. While a softmax distribution has been widely used to model a stochastic policy, it has a weakness in modeling a policy function when the number of actions is large. In other words, a policy function modeled by a softmax distribution is prone to assign non-negligible probability mass to non-optimal actions even if the state-action values of these actions are dismissible.
This tendency gets worse as the number of actions increases, as demonstrated in Figure 1. In this paper, we propose a sparse MDP by presenting a novel causal sparse Tsallis entropy regularization method, which can be interpreted as a special case of the generalized Tsallis entropy [12]. The proposed regularization method has a unique property in that the resulting policy distribution becomes a sparse distribution. In other words, the supporting action set, which has non-zero probability mass, contains a sparse subset of the action space. We provide a full mathematical analysis of the proposed sparse MDP. We first derive the optimality condition of a sparse MDP, which is named the sparse Bellman equation. We show that the sparse Bellman equation is an approximation of the original Bellman equation. Interestingly, we further find a connection between the optimality condition of a sparse MDP and the probability simplex projection problem [13]. We present a sparse value iteration method for solving a sparse MDP problem, whose optimality and convergence are proven using the Banach fixed-point theorem [14]. We further analyze the performance gap of the expected return of the optimal policies obtained by a sparse MDP and a soft MDP compared to that of the original MDP. In particular, we prove that the performance gap between the proposed sparse MDP and the original MDP has a constant bound as the number of actions increases, whereas the performance gap between a soft MDP and the original MDP grows logarithmically. From this property, sparse MDPs have benefits over soft MDPs when it comes to solving problems in robotics with continuous action spaces.

To validate the effectiveness of sparse MDPs, we apply the proposed method to the exploration strategy and the update rule of Q-learning and compare it to the ε-greedy method and the softmax policy [9]. The proposed method is also compared to the deep deterministic policy gradient (DDPG) method [15], which is designed to operate in continuous action spaces without discretization. The proposed method shows state-of-the-art performance compared to other methods as the discretization level of an action space increases.

II. BACKGROUND

A. Markov Decision Processes

A Markov decision process (MDP) has been widely used to formulate sequential decision making problems. An MDP can be characterized by a tuple M = {S, F, A, d₀, T, γ, r}, where S is the state space, F is the corresponding feature space, A is the action space, d₀ is the distribution of the initial state, T(s'|s,a) is the transition probability from s ∈ S to s' ∈ S by taking a ∈ A, γ ∈ (0, 1) is a discount factor, and r is the reward function. The objective of an MDP is to find a policy which maximizes E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | π, d₀, T ], where a policy π is a mapping from the state space to the action space. For notational simplicity, we denote the expectation of the discounted summation of a function f(s,a), i.e., E[ Σ_{t=0}^∞ γ^t f(s_t, a_t) | π, d₀, T ], by E_π[ f(s,a) ], where f(s,a) is a function of state and action, such as the reward function r(s,a) or an indicator function 1{s = s'}. We also denote the expectation of the discounted summation of f(s,a) conditioned on the initial state, i.e., E[ Σ_{t=0}^∞ γ^t f(s_t, a_t) | π, s₀ = s, T ], by E_π[ f(s,a) | s₀ = s ]. Finding an optimal policy for an MDP can be formulated as follows:

maximize_π  E_π[ r(s_t, a_t) ]
subject to  Σ_a π(a|s) = 1 ∀s,  π(a|s) ≥ 0 ∀s,a.    (1)

The necessary condition for the optimal solution of (1) is called the Bellman equation. The Bellman equation is derived from the Bellman optimality principle as follows:

Q(s,a) = r(s,a) + γ Σ_{s'} V(s') T(s'|s,a),
V(s) = max_a Q(s,a),
π(s) = argmax_a Q(s,a),    (2)

where V(s) is the value function of s, which is the expected sum of discounted rewards when the initial state s is given, and Q(s,a) is the state-action value function, which is the expected sum of discounted rewards when the initial state s and action a are given. Note that the optimal solution is a deterministic function, which is referred to as a deterministic policy.

(a) Reward map and action values at state s. (b) Proposed policy model and value differences (darker is better). (c) Softmax policy model and value differences (darker is better).
Fig. 1: A two-dimensional multi-objective environment with point-mass dynamics. The state is a location and the action is a velocity bounded by [−3, 3] × [−3, 3]. (a) The left figure shows the reward map with four maxima as multiple objectives. The action space is discretized at two levels: 9 actions (low resolution) and 49 actions (high resolution). The middle (resp. right) figure shows the optimal action values at the state indicated by the red cross when the number of actions is 9 (resp. 49). (b) The first and third figures indicate the proposed policy distribution at the state, induced by the action values in (a). The second and fourth figures show maps of the performance difference between the proposed policy and the optimal policy at each state when the number of actions is 9 and 49, respectively; the larger the error, the brighter the color of the state. (c) All figures are obtained in the same way as (b) by replacing the proposed policy with the softmax policy. This example shows that the proposed policy model is less affected when the number of actions increases.

B. Entropy-Regularized Markov Decision Processes

In order to obtain a multi-modal policy function, an entropy-regularized MDP, also known as a soft MDP, has been widely used [8]–[11]. In soft MDPs, causal entropy regularization over π is introduced to obtain a multi-modal policy distribution. Since causal entropy regularization penalizes a deterministic distribution, it makes the optimal policy of a soft MDP a softmax distribution.
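The following short Python sketch (our illustration, not part of the paper; the one-dimensional action grid and the Gaussian-shaped action values are made up) shows how a softmax policy spreads its probability mass as the same action range is discretized more finely, which is the behavior illustrated in Figure 1.

```python
import numpy as np

def softmax_policy(q, alpha=1.0):
    """Softmax (Boltzmann) policy: pi(a) proportional to exp(q(a) / alpha)."""
    z = (q - q.max()) / alpha          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical 1-D action space in [-3, 3] with a smooth action-value landscape.
for n_actions in (9, 49, 199):
    a = np.linspace(-3.0, 3.0, n_actions)
    q = np.exp(-(a - 1.0) ** 2)        # made-up Q-values peaked at a = 1
    pi = softmax_policy(q, alpha=1.0)
    print(n_actions, pi.max())         # mass on the best action keeps shrinking
```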
A soft MDP is formulated as follows:

maximize_π  E_π[ r(s_t, a_t) ] + α H(π)
subject to  Σ_a π(a|s) = 1 ∀s,  π(a|s) ≥ 0 ∀s,a,

where H(π) ≜ E[ −Σ_{t=0}^∞ γ^t log π(a_t|s_t) | π, d₀, T ] is the γ-discounted causal entropy and α is a regularization coefficient. This problem has been extensively studied in [6], [8], [11].

In [11], the soft Bellman equation and the optimal policy distribution are derived from the Karush-Kuhn-Tucker (KKT) conditions as follows:

Q_soft(s,a) = r(s,a) + γ Σ_{s'} V_soft(s') T(s'|s,a),
V_soft(s) = α log Σ_a exp( Q_soft(s,a)/α ),
π(a|s) = exp( ( Q_soft(s,a) − V_soft(s) ) / α ),

where V_soft^π(s) = E_π[ r(s_t, a_t) − α log π(a_t|s_t) | s₀ = s ] and Q_soft^π(s,a) = E_π[ r(s_t, a_t) − α log π(a_t|s_t) | s₀ = s, a₀ = a ]. V_soft^π(s) is the soft value of π, indicating the expected sum of rewards including the entropy of the policy, obtained by starting at state s, and Q_soft^π(s,a) is the soft state-action value of π, which is the expected sum of rewards obtained by starting at state s and taking action a. Note that the optimal policy distribution is a softmax distribution. In [11], a soft value iteration method is also proposed and the optimality of soft value iteration is proved. By using causal entropy regularization, the optimal policy distribution of a soft MDP is able to represent a multi-modal distribution. However, causal entropy regularization has the effect of making the resulting policy of a soft MDP closer to a uniform distribution as the number of actions increases. To handle this issue, we propose a novel regularization method whose resulting policy distribution still has multiple modes as a stochastic policy, but whose performance loss is smaller than that of a softmax policy distribution.

III. SPARSE MARKOV DECISION PROCESSES

We propose a sparse Markov decision process by introducing a novel causal sparse Tsallis entropy regularizer:

W(π) ≜ E[ Σ_{t=0}^∞ γ^t (1/2)(1 − π(a_t|s_t)) | π, d₀, T ] = E_π[ (1/2)(1 − π(a|s)) ].

By adding αW(π) to the objective function of (1), we aim to solve the following optimization problem:

maximize_π  E_π[ r(s,a) ] + α W(π)
subject to  Σ_a π(a|s) = 1 ∀s,  π(a|s) ≥ 0 ∀s,a,    (3)

where α > 0 is a regularization coefficient. We first derive the sparse Bellman equation from the necessary conditions of (3). Then, by observing the connection between the sparse Bellman equation and the probability simplex projection, we show that the optimal policy becomes a sparsemax distribution, where the sparsity can be controlled by α. In addition, we present a sparse value iteration algorithm whose optimality is guaranteed using the Banach fixed-point theorem. The detailed derivations of the lemmas and theorems in this paper can be found in Appendix A.

A. Notation and Properties

We first introduce the notation and properties used in the paper. In Table I, all notations and definitions are summarized. The utility, value, and state visitation can be compactly expressed in terms of vectors and matrices as follows:

J_sp(π) = d₀ᵀ G_π r_π^sp,   J_soft(π) = d₀ᵀ G_π r_π^soft,
V_sp^π = G_π r_π^sp,   V_soft^π = G_π r_π^soft,
ρ_π = G_πᵀ d₀,

where xᵀ is the transpose of a vector x, G_π = (I − γ T_π)⁻¹, sp indicates the sparse MDP problem, and soft indicates the soft MDP problem.

B. Sparse Bellman Equation from Karush-Kuhn-Tucker Conditions

The sparse Bellman equation can be derived from the necessary conditions for an optimal solution of a sparse MDP. We carefully investigate the Karush-Kuhn-Tucker (KKT) conditions, which indicate necessary conditions for a solution to be optimal when some regularity conditions about the feasible set are satisfied. The feasible set of a sparse MDP satisfies the linearity constraint qualification [16], since the feasible set consists of linear affine functions. In this regard, the optimal solution of a sparse MDP necessarily satisfies the KKT conditions as follows.

Theorem 1. If a policy distribution π is the optimal solution of the sparse MDP (3), then π and the corresponding sparse value function V_sp necessarily satisfy the following equations for all state and action pairs:

Q_sp(s,a) = r(s,a) + γ Σ_{s'} V_sp(s') T(s'|s,a),
V_sp(s) = α [ (1/2) Σ_{a∈S(s)} ( (Q_sp(s,a)/α)² − τ(Q_sp(s,·)/α)² ) + 1/2 ],
π(a|s) = max( Q_sp(s,a)/α − τ(Q_sp(s,·)/α), 0 ),    (4)

where τ(Q_sp(s,·)/α) = ( Σ_{a∈S(s)} Q_sp(s,a)/α − 1 ) / |S(s)|, S(s) is the set of actions satisfying 1 + i Q_sp(s, a_(i))/α > Σ_{j=1}^{i} Q_sp(s, a_(j))/α, with a_(i) indicating the action with the ith largest action value Q_sp(s, a_(i)), and |S(s)| is the cardinality of S(s).

The full proof of Theorem 1 is provided in Appendix A-A. The proof relies on the KKT stationarity condition, i.e., the derivative of the Lagrangian objective function with respect to the policy becomes zero at the optimal solution.
From (4), it can be shown that the optimal solution obtained from the sparse MDP assigns zero probability to actions whose action values Q_sp(s,a) are below the threshold α τ(Q_sp(s,·)/α), and the optimal policy assigns positive probability to near-optimal actions in proportion to their action values, where the threshold τ(Q_sp(s,·)/α) determines the range of near-optimal actions. This property makes the optimal policy a sparse distribution and prevents the performance drop caused by assigning non-negligible positive probabilities to non-optimal actions, which often occurs in soft MDPs. From the definition of S(s) and τ, we can further observe an interesting connection between the sparse Bellman equation and the probability simplex projection problem [13].
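As a concrete reading of (4), the following Python sketch (our own reconstruction from the equations above, not code released with the paper) computes the supporting set S(s), the threshold τ, the sparsemax policy, and the sparse value for a single state from a made-up vector of action values.

```python
import numpy as np

def sparse_bellman_policy_value(q, alpha=1.0):
    """Sparsemax policy and sparse value for one state, following Eq. (4)."""
    z = np.asarray(q, dtype=float) / alpha
    z_sorted = np.sort(z)[::-1]                           # z_(1) >= z_(2) >= ...
    k = np.arange(1, len(z) + 1)
    support = 1.0 + k * z_sorted > np.cumsum(z_sorted)    # condition defining S(s)
    K = support.sum()                                     # cardinality of the supporting set
    tau = (np.cumsum(z_sorted)[K - 1] - 1.0) / K          # threshold tau(Q(s,.)/alpha)
    pi = np.maximum(z - tau, 0.0)                         # sparsemax distribution
    v = alpha * (0.5 * np.sum(z[pi > 0] ** 2 - tau ** 2) + 0.5)  # sparse value V_sp(s)
    return pi, v

q = np.array([2.0, 1.9, 0.5, -1.0])                       # made-up action values
pi, v = sparse_bellman_policy_value(q, alpha=1.0)
print(pi, pi.sum(), v)   # only near-optimal actions receive nonzero probability
```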

TABLE I: Notations and Properties

Utility: J_sp(π) = E_π[ r(s,a) + (α/2)(1 − π(a|s)) ] = d₀ᵀ V_sp^π   /   J_soft(π) = E_π[ r(s,a) − α log π(a|s) ] = d₀ᵀ V_soft^π
Value: V_sp^π(s) = E_π[ r(s,a) + (α/2)(1 − π(a|s)) | s₀ = s ]   /   V_soft^π(s) = E_π[ r(s,a) − α log π(a|s) | s₀ = s ]
Action value: Q_sp^π(s,a) = r(s,a) + γ Σ_{s'} V_sp^π(s') T(s'|s,a)   /   Q_soft^π(s,a) = r(s,a) + γ Σ_{s'} V_soft^π(s') T(s'|s,a)
Expected state reward: r_π^sp(s) = Σ_a π(a|s)[ r(s,a) + (α/2)(1 − π(a|s)) ]   /   r_π^soft(s) = Σ_a π(a|s)[ r(s,a) − α log π(a|s) ]
Policy regularization: W(π) = E_π[ (1/2)(1 − π(a|s)) ] = (1/2) Σ_{s,a} ρ_π(s,a)(1 − π(a|s))   /   H(π) = E_π[ −log π(a|s) ] = −Σ_{s,a} ρ_π(s,a) log π(a|s)
Max approximation: spmax(z) = (1/2) Σ_{i∈S(z)} ( z_i² − τ(z)² ) + 1/2   /   logsumexp(z) = log Σ_i exp(z_i)
Value iteration operator: U_sp(x)(s) = α spmax_a( (r(s,a) + γ Σ_{s'} x(s') T(s'|s,a))/α )   /   U_soft(x)(s) = α logsumexp_a( (r(s,a) + γ Σ_{s'} x(s') T(s'|s,a))/α )
State visitation: ρ_π(s) = E_π[ 1{s_t = s} ] = d₀(s) + γ Σ_{s', a'} T(s|s', a') π(a'|s') ρ_π(s')
State-action visitation: ρ_π(s, a) = E_π[ 1{s_t = s, a_t = a} ] = π(a|s) ρ_π(s)
Transition probability given π: T_π(s'|s) = Σ_a T(s'|s, a) π(a|s)

C. Probability Simplex Projection and the Sparsemax Operation

The probability simplex projection [13] is a well-known problem of projecting a d-dimensional vector onto the (d−1)-dimensional probability simplex in the Euclidean metric sense. A probability simplex projection problem is defined as follows:

minimize_p  ‖p − z‖²
subject to  Σ_{i=1}^{d} p_i = 1,  p_i ≥ 0, i = 1, ..., d,    (5)

where z is a given d-dimensional vector, d is the dimension of p and z, and p_i is the ith element of p. Let z_(i) be the ith largest element of z and supp(z) be the supporting set of the optimal solution, defined by supp(z) = { z_(i) : 1 + i z_(i) > Σ_{j=1}^{i} z_(j) }. It is a well-known fact that problem (5) has the closed-form solution

p_i(z) = max( z_i − τ(z), 0 ),

where i indicates the ith dimension, p_i(z) is the ith element of the optimal solution for a fixed z, and τ(z) = ( Σ_{z_(i)∈supp(z)} z_(i) − 1 ) / K with K = |supp(z)| [13], [17]. Interestingly, the optimal solution p(z), the threshold τ, and the supporting set supp(z) of (5) can be precisely matched to those of the sparse Bellman equation (4). From this observation, it can be shown that the optimal policy distribution of a sparse MDP is the projection of Q_sp(s,·)/α onto the probability simplex. Note that we refer to p(z) as the sparsemax distribution.

More surprisingly, V_sp can be represented as an approximation of the max operation derived from p(z). A differentiable approximation of the max operation is defined as follows:

spmax(z) ≜ (1/2) Σ_{z_i ∈ supp(z)} ( z_i² − τ(z)² ) + 1/2.    (6)

We call spmax(z) sparsemax. In [17], it is proven that spmax(z) is an indefinite integral of p(z), i.e., spmax(z) = ∫ p(z) dz + C, where C is a constant and, in our case, C = 1/2. We provide simple upper and lower bounds of spmax(z) with respect to max(z):

max(z) ≤ spmax(z) ≤ max(z) + (d − 1)/(2d).    (7)

The lower bound of sparsemax is shown in [17]; however, we provide another proof of the lower bound and a proof of the upper bound in Appendix A-B. The bounds (7) show that sparsemax is a bounded and smooth approximation of max and, from this fact, (4) can be interpreted as an approximation of the original Bellman equation (2). Using this notation, V_sp can be rewritten as V_sp(s) = α spmax_a( Q_sp(s,a)/α ).
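The sketch below (ours; the test vectors are random) evaluates spmax(z) directly from (6) and numerically checks the bound (7) for a few dimensions.

```python
import numpy as np

def spmax(z):
    """Sparsemax approximation of max(z), Eq. (6): 0.5 * sum_{i in S}(z_i^2 - tau^2) + 0.5."""
    z = np.asarray(z, dtype=float)
    zs = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    K = (1.0 + k * zs > np.cumsum(zs)).sum()
    tau = (np.cumsum(zs)[K - 1] - 1.0) / K
    return 0.5 * np.sum(zs[:K] ** 2 - tau ** 2) + 0.5

rng = np.random.default_rng(0)
for d in (2, 10, 100):
    z = rng.normal(size=d)
    gap = spmax(z) - z.max()
    # bound (7): max(z) <= spmax(z) <= max(z) + (d - 1) / (2d)
    assert 0.0 <= gap <= (d - 1) / (2.0 * d) + 1e-9
    print(d, round(gap, 4), (d - 1) / (2.0 * d))
```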

D. Supporting Set of the Sparse Optimal Policy

The supporting set S(s) of a sparse MDP is the set of actions with nonzero probability, and the cardinality of S(s) can be controlled by the regularization coefficient α, whereas the supporting set of a soft MDP is always the entire action space. In a sparse MDP, an action assigned a non-zero probability must satisfy the following inequality:

1 + i Q_sp(s, a_(i))/α > Σ_{j=1}^{i} Q_sp(s, a_(j))/α,    (8)

where a_(i) indicates the action with the ith largest action value. From this inequality, it can be shown that α controls the margin between the largest action value and the other values included in the supporting set. In other words, as α increases, the cardinality of the supporting set increases, since more action values satisfy (8). Conversely, as α decreases, the supporting set shrinks. In the extreme cases, if α goes to zero, only argmax_a Q_sp(s,a) is included in S(s), and if α goes to infinity, the entire action space is included in S(s). On the other hand, in a soft MDP, the supporting set of the softmax distribution cannot be controlled by the regularization coefficient, even though the sharpness of the softmax distribution can be adjusted. This property gives a sparse MDP an advantage over a soft MDP, since we can assign zero probability to non-optimal actions by controlling α.

E. Connection to the Tsallis Generalized Entropy

The notion of the Tsallis entropy was introduced by C. Tsallis as a general extension of entropy [12], and the Tsallis entropy has been widely used to describe thermodynamic systems and molecular motion. Surprisingly, the proposed regularization is closely related to a special case of the Tsallis entropy. The Tsallis entropy is defined as follows:

S_{q,k}(p) = (k/(q − 1)) ( 1 − Σ_i p_i^q ),

where p is a probability mass function, q is a parameter called the entropic index, and k is a positive real constant. Note that, as q → 1 with k = 1, S_{1,1}(p) becomes the Shannon entropy, i.e., −Σ_i p_i log(p_i). In [11], [18], it is shown that H(π) is an extension of S_{1,1}, since H(π) = E_π[ S_{1,1}(π(·|s)) ] = −Σ_{s,a} ρ_π(s,a) log π(a|s). We discover the connection between the Tsallis entropy and the proposed regularization when q = 2 and k = 1/2.

Theorem 2. The proposed policy regularization W(π) is an extension of the Tsallis entropy with parameters q = 2 and k = 1/2 to the causal-entropy setting, i.e., W(π) = E_π[ S_{2,1/2}(π(·|s)) ].

The proof is provided in Appendix A-D. From this theorem, W(π) can be interpreted as an extension of S_{2,1/2}(p) to the case of causally conditioned distributions, similarly to the causal entropy.

IV. SPARSE VALUE ITERATION

In this section, we propose an algorithm for solving the causal sparse Tsallis entropy regularized MDP problem. Similar to the original MDP and the soft MDP, a sparse version of value iteration can be induced from the sparse Bellman equation. We first define the sparse Bellman operator U_sp: R^|S| → R^|S|: for all s,

U_sp(x)(s) = α spmax_a( ( r(s,a) + γ Σ_{s'} x(s') T(s'|s,a) ) / α ),

where x is a vector in R^|S|, U_sp(x) is the resulting vector after applying U_sp to x, and U_sp(x)(s) is the element for state s in U_sp(x). Then, the sparse value iteration algorithm can be described simply as x_{i+1} = U_sp(x_i), where i is the number of iterations. In the following section, we show the convergence and optimality of the proposed sparse value iteration method.

A. Optimality of Sparse Value Iteration

In this section, we prove the convergence and optimality of the sparse value iteration method. We first show that U_sp has monotonic and discounting properties and, using those properties, we prove that U_sp is a contraction. Then, by the Banach fixed-point theorem, repeatedly applying U_sp from an arbitrary initial point always converges to the unique fixed point.

Lemma 1. U_sp is monotone: for x, y ∈ R^|S|, if x ≤ y, then U_sp(x) ≤ U_sp(y), where ≤ indicates an element-wise inequality.

Lemma 2. For any constant c ∈ R, U_sp(x + c·1) = U_sp(x) + γc·1, where 1 ∈ R^|S| is a vector of all ones.

The full proofs can be found in Appendix A-E. The proofs of Lemma 1 and Lemma 2 rely on the properties of the sparsemax operation. It is possible to prove that the sparse Bellman operator U_sp is a contraction using Lemma 1 and Lemma 2 as follows.

Lemma 3. U_sp is a γ-contraction mapping and has a unique fixed point, where γ is in (0, 1) by definition.

Using Lemma 1, Lemma 2, and Lemma 3, the optimality and convergence of sparse value iteration can be proven.

Theorem 3. Sparse value iteration converges to the optimal value of (3).

The proof can be found in Appendix A-E. Theorem 3 is proven using the uniqueness of the fixed point of U_sp and the sparse Bellman equation.
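The following Python sketch (our own illustration on a small randomly generated MDP; the reward and transition tensors are made up) implements the operator U_sp and iterates x_{i+1} = U_sp(x_i) until the iterates stop changing, which Lemma 3 and Theorem 3 guarantee to happen at the unique optimal value.

```python
import numpy as np

def spmax(z):
    """Sparsemax approximation of max(z) as in Eq. (6)."""
    zs = np.sort(np.asarray(z, dtype=float))[::-1]
    k = np.arange(1, len(zs) + 1)
    K = (1.0 + k * zs > np.cumsum(zs)).sum()
    tau = (np.cumsum(zs)[K - 1] - 1.0) / K
    return 0.5 * np.sum(zs[:K] ** 2 - tau ** 2) + 0.5

def sparse_value_iteration(r, T, gamma=0.9, alpha=1.0, n_iter=200):
    """Iterate x <- U_sp(x) on a tabular MDP with rewards r[s, a] and transitions T[s, a, s']."""
    n_states, _ = r.shape
    x = np.zeros(n_states)
    for _ in range(n_iter):
        q = r + gamma * T @ x          # q[s, a] = r(s, a) + gamma * sum_s' T(s'|s, a) x(s')
        x_new = np.array([alpha * spmax(q[s] / alpha) for s in range(n_states)])
        if np.max(np.abs(x_new - x)) < 1e-10:
            break
        x = x_new
    return x

# Small random MDP used only for illustration.
rng = np.random.default_rng(1)
n_s, n_a = 5, 4
r = rng.normal(size=(n_s, n_a))
T = rng.random((n_s, n_a, n_s))
T /= T.sum(axis=2, keepdims=True)      # normalize to valid transition probabilities
print(sparse_value_iteration(r, T))
```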
V. PERFORMANCE ERROR BOUNDS FOR SPARSE VALUE ITERATION

We prove bounds on the performance gap between the policy obtained by a sparse MDP and the policy obtained by the original MDP, where this performance error is caused by the regularization. The boundedness property (7) plays a crucial role in proving the error bound: the performance bound can be derived from the bounds of sparsemax. A similar approach can be applied to prove the error bound of a soft MDP, since the log-sum-exp function is also a bounded approximation of the max operation; a comparison of the log-sum-exp and sparsemax operations is provided in Appendix A-C. Before explaining the performance error bounds, we introduce two useful propositions which are employed to prove the performance error bounds of sparse MDPs and soft MDPs. We first prove an important fact which shows that the optimal values of sparse value iteration and soft value iteration are greater than that of the original MDP.

Algorithm 1 Sparse Deep Q-Learning
1: Initialize a prioritized replay memory M = ∅ and Q-network parameters θ and θ'
2: for i = 0 to N do
3:   Sample an initial state s₀ ~ d₀
4:   for t = 0 to T do
5:     Sample an action a_t ~ π(·|s_t), computed by (4) from Q(s_t, ·; θ)
6:     Execute a_t and observe the next state s_{t+1} and reward r_t
7:     Add the experience to the replay memory M with an initial importance weight w₀: M ← M ∪ {(s_t, a_t, r_t, s_{t+1}, w₀)}
8:     Sample a mini-batch B from M based on the importance weights
9:     Set the target value y_j of each (s_j, a_j, r_j, s_{j+1}, w_j) in B: y_j = r_j + γ α spmax_a( Q(s_{j+1}, a; θ')/α )
10:    Minimize Σ_j w_j ( y_j − Q(s_j, a_j; θ) )² using a gradient descent method
11:    Update the importance weights {w_j} based on the temporal-difference errors δ_j = y_j − Q(s_j, a_j; θ) [19]
12:  end for
13:  Update θ' ← θ every c iterations
14: end for

Lemma 4. Let U and U_soft be the Bellman operations of the original MDP and a soft MDP, respectively, such that, for a state s and x ∈ R^|S|,

U(x)(s) = max_a [ r(s,a) + γ Σ_{s'} x(s') T(s'|s,a) ],
U_soft(x)(s) = α log Σ_a exp( ( r(s,a) + γ Σ_{s'} x(s') T(s'|s,a) ) / α ).

Then the following inequalities hold for every integer n > 0: U^n x ≤ U_sp^n x and U^n x ≤ U_soft^n x, where U^n (resp. U_sp^n, U_soft^n) is the result after applying U (resp. U_sp, U_soft) n times. In addition, let x*, x*_sp, and x*_soft be the fixed points of U, U_sp, and U_soft, respectively. Then the following inequalities also hold: x* ≤ x*_sp and x* ≤ x*_soft.

The detailed proof is provided in Appendix A-F. Lemma 4 shows that the optimal values V*_sp and V*_soft, obtained by sparse value iteration and soft value iteration, are always greater than the original optimal value V*. Intuitively speaking, the reason for this inequality is the regularization term, i.e., αW(π) or αH(π), added to the objective function. Now, we discuss other useful properties of the proposed causal sparse Tsallis entropy regularization W(π) and the causal entropy regularization H(π).

Lemma 5. W(π) and H(π) have the following upper bounds:

W(π) ≤ (1/(1 − γ)) (|A| − 1)/(2|A|),   H(π) ≤ log|A| / (1 − γ),

where |A| is the cardinality of the action space A.

The proof is provided in Appendix A-F. Lemma 5 can be induced by extending the upper bounds of S_{2,1/2}(π(·|s)) and S_{1,1}(π(·|s)) to the causal sparse Tsallis entropy and the causal entropy. By using Lemma 4 and Lemma 5, the performance bounds for sparse MDPs and soft MDPs can be derived as follows.

Theorem 4. The following inequalities hold:

E_{π*}[ r(s,a) ] − (α/(1 − γ)) (|A| − 1)/(2|A|) ≤ E_{π*_sp}[ r(s,a) ] ≤ E_{π*}[ r(s,a) ],

where π* and π*_sp are the optimal policies obtained by the original MDP and the sparse MDP, respectively.

Theorem 5. The following inequalities hold:

E_{π*}[ r(s,a) ] − α log|A| / (1 − γ) ≤ E_{π*_soft}[ r(s,a) ] ≤ E_{π*}[ r(s,a) ],

where π* and π*_soft are the optimal policies obtained by the original MDP and the soft MDP, respectively.

The proofs of Theorem 4 and Theorem 5 can be found in Appendix A-F. These error bounds show that the expected return of the optimal policy of a sparse MDP always has a tighter error bound than that of a soft MDP. Moreover, the bound for the proposed sparse MDP converges to the constant α/(2(1 − γ)) as the number of actions increases, whereas the error bound of a soft MDP grows logarithmically. This property has a clear benefit when a sparse MDP is applied to robotic problems with continuous action spaces. To apply an MDP to a continuous action space, discretization of the action space is essential, and a fine discretization is required to obtain a solution closer to the underlying continuous optimal policy. Accordingly, the number of actions becomes larger as the level of discretization increases. In this case, a sparse MDP has an advantage over a soft MDP in that the performance error of a sparse MDP is bounded by a constant factor as the number of actions increases, whereas the performance error of the optimal policy of a soft MDP grows logarithmically.
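To get a feel for how the two bounds behave, the following short Python sketch (the values of α and γ are illustrative choices of ours, not taken from the paper) evaluates the sparse-MDP bound α(|A|−1)/(2|A|)/(1−γ) from Theorem 4 and the soft-MDP bound α·log|A|/(1−γ) from Theorem 5 as the number of actions grows.

```python
import numpy as np

alpha, gamma = 1.0, 0.9                     # illustrative regularization coefficient and discount
for n_actions in (2, 10, 100, 1000, 10000):
    sparse_bound = alpha * (n_actions - 1) / (2.0 * n_actions) / (1.0 - gamma)
    soft_bound = alpha * np.log(n_actions) / (1.0 - gamma)
    print(n_actions, round(sparse_bound, 3), round(soft_bound, 3))
# The sparse bound approaches alpha / (2 * (1 - gamma)) = 5.0, while the soft bound keeps growing.
```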
VI. SPARSE EXPLORATION AND UPDATE RULE FOR SPARSE DEEP Q-LEARNING

In this section, we first propose sparse Q-learning and further extend it to sparse deep Q-learning, where the sparsemax policy and the sparse Bellman equation are employed as the exploration method and the update rule, respectively. Sparse Q-learning is a model-free method to solve the proposed sparse MDP without knowledge of the transition probabilities. In other words, when the transition probability T(s'|s,a) is unknown but sampling from T is possible, sparse Q-learning estimates the optimal Q of the sparse MDP using sampling, just as Q-learning finds an approximation of the optimal Q of the conventional MDP.

(a) Performance bound. (b) Supporting set comparison.
Fig. 2: (a) The performance gap is calculated as the absolute value of the difference between the performance of a sparse MDP (or a soft MDP) and the performance of the original MDP. (b) The ratio of the number of supporting actions to the total number of actions is shown. The action space of the unicycle dynamics is discretized into 25 actions.

Similar to Q-learning, the update equation of sparse Q-learning is derived from the sparse Bellman equation:

Q_{i+1}(s_i, a_i) ← Q_i(s_i, a_i) + η(i) [ r(s_i, a_i) + γ α spmax_a( Q_i(s_{i+1}, a)/α ) − Q_i(s_i, a_i) ],

where i indicates the number of iterations and η(i) is a learning rate. If the learning rate η(i) satisfies Σ_{i=0}^∞ η(i) = ∞ and Σ_{i=0}^∞ η(i)² < ∞, then, as the number of samples increases to infinity, sparse Q-learning converges to the optimal solution of the sparse MDP. The proof of the convergence and optimality of sparse Q-learning is the same as that of standard Q-learning [20].

The proposed sparse Q-learning can easily be extended to sparse deep Q-learning by using a deep neural network as an estimator of the sparse Q value. In each iteration, sparse deep Q-learning performs a gradient descent step to minimize the squared loss ( y − Q(s, a; θ) )², where θ denotes the parameters of the Q-network. Here, y is the target value defined as follows:

y = r(s, a) + γ α spmax_a( Q(s', a; θ')/α ),

where s' is the next state sampled by taking action a at state s and θ' indicates the target network parameters. Moreover, we employ the sparsemax policy as the exploration strategy, where the policy distribution is computed by (4) with action values estimated by the deep Q-network. The sparsemax policy excludes the actions whose estimated action values are too low to be worth re-exploring by assigning them zero probability mass. The effectiveness of sparsemax exploration is investigated in Section VII. For stable convergence of the Q-network, we utilize double Q-learning [21], where the parameters θ for obtaining the policy and the parameters θ' for computing the target value are separated, and θ' is updated to θ every predetermined number of iterations. In other words, double Q-learning prevents instability of deep Q-learning by slowly updating the target value. Prioritized experience replay [19] is also applied, where the optimization of the network proceeds in consideration of the importance of each experience. The whole process of sparse deep Q-learning is summarized in Algorithm 1.
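A minimal tabular version of this procedure is sketched below in Python (our own illustration; it assumes a Gym-style environment whose reset() returns a discrete state index and whose step(a) returns (next_state, reward, done, info), and the hyperparameters are placeholders). The deep variant in Algorithm 1 replaces the table with a Q-network and adds a target network and prioritized replay.

```python
import numpy as np

def sparsemax_policy(q, alpha=1.0):
    """Sparsemax exploration policy of Eq. (4) and the value alpha * spmax(q / alpha)."""
    z = np.asarray(q, dtype=float) / alpha
    zs = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    K = (1.0 + k * zs > np.cumsum(zs)).sum()
    tau = (np.cumsum(zs)[K - 1] - 1.0) / K
    pi = np.maximum(z - tau, 0.0)
    v = alpha * (0.5 * (pi @ z) + 0.5 * tau + 0.5)   # equals alpha * spmax(q / alpha)
    return pi, v

def sparse_q_learning(env, n_states, n_actions, alpha=1.0, gamma=0.99,
                      lr=0.1, n_episodes=500, horizon=200):
    """Tabular sparse Q-learning with sparsemax exploration and the sparse Bellman update."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()                                    # assumed to return a state index
        for _ in range(horizon):
            pi, _ = sparsemax_policy(Q[s], alpha)
            a = np.random.choice(n_actions, p=pi / pi.sum())   # sparsemax exploration
            s_next, r, done, _ = env.step(a)               # assumed Gym-style step
            _, v_next = sparsemax_policy(Q[s_next], alpha)
            Q[s, a] += lr * (r + gamma * v_next - Q[s, a]) # sparse Bellman update
            s = s_next
            if done:
                break
    return Q
```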
VII. EXPERIMENTS

We first verify Theorem 4, Theorem 5, and the effect of (8) in simulation. For the verification of Theorem 4 and Theorem 5, we measure the performance in terms of the expected return while increasing the number of actions |A|. For the verification of the effect of (8), the cardinalities of the supporting sets of the optimal policies of the sparse and soft MDPs are compared at different values of α. To investigate the effectiveness of the proposed method, we test sparsemax exploration and the sparse Bellman update rule on reinforcement learning with continuous action spaces. To apply Q-learning to a continuous action space, a fine discretization is necessary to obtain a solution closer to the original continuous optimal policy. As the level of discretization increases, the number of actions to be explored becomes larger. In this regard, an efficient exploration method is required to obtain high performance. We compare our method to other exploration methods with respect to the convergence speed and the expected sum of rewards. We further check the effect of the update rule.

A. Experiments on Performance Bounds and Supporting Set

To verify our theorems about the performance error bounds, we create a transition model T by discretizing unicycle dynamics defined in continuous state and action spaces and solve the original MDP, soft MDP, and sparse MDP under a predefined reward while increasing the discretization level of the action space. The reward function is defined as a linear combination of two squared exponential functions, i.e., r(x) = exp(−‖x − x_g‖²/σ_g) − exp(−‖x − x_a‖²/σ_a), where x is the location of the unicycle, x_g is a goal point, x_a is a point to avoid, and σ_g and σ_a are scale parameters. The reward function is designed to let an agent navigate toward x_g while avoiding x_a. The absolute value of the difference between the expected return of the original MDP and that of the sparse MDP (or soft MDP) is measured. As shown in Figure 2(a), the performance gap of the sparse MDP converges to a constant bound while the performance gap of the soft MDP grows logarithmically. Note that the performance gaps of the sparse MDP and the soft MDP are always smaller than their error bounds. The supporting set experiments are conducted using the discretized unicycle dynamics. The cardinalities of the supporting sets of the optimal policies are measured while α is varied. In Figure 2(b), while the ratio of the supporting set for the soft MDP changes from 0.79 to 1.00, the ratio for the sparse MDP changes from 0.4 to 0.99, demonstrating the sparseness of the proposed sparse MDP compared to the soft MDP.

B. Reinforcement Learning in Continuous Action Spaces

We test our method in MuJoCo [22], a physics-based simulator, using two problems with continuous action spaces: Inverted Pendulum and Reacher. The action space is discretized to apply Q-learning to the continuous action space, and experiments are conducted with four different discretization levels to validate the effectiveness of sparsemax exploration and the sparse Bellman update rule.
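As a small illustration of this discretization step (our own sketch; the bounds and grid sizes are examples rather than the exact values used in the experiments), a low-dimensional continuous action box can be turned into a finite action set by taking the Cartesian product of per-dimension grids.

```python
import numpy as np
from itertools import product

def discretize_action_space(low, high, bins_per_dim):
    """Return an array of discrete actions covering a box [low, high] per dimension."""
    grids = [np.linspace(l, h, n) for l, h, n in zip(low, high, bins_per_dim)]
    return np.array(list(product(*grids)))

# Example: 2-D action in [-1, 1]^2 at two resolutions (3x3 = 9 and 7x7 = 49 actions).
coarse = discretize_action_space([-1.0, -1.0], [1.0, 1.0], [3, 3])
fine = discretize_action_space([-1.0, -1.0], [1.0, 1.0], [7, 7])
print(len(coarse), len(fine))   # the action set grows quickly with the resolution
```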

We compare the sparsemax exploration method to the ε-greedy method and softmax exploration [10], and we further compare the sparse Bellman update rule to the original Bellman update rule [20] and the soft Bellman update rule [11]. In addition, three different regularization coefficient settings are experimented. In total, we test 27 combinations of variants of deep Q-learning by combining three exploration methods, three update rules, and three regularization coefficients. The deep deterministic policy gradient (DDPG) method [15], which operates in a continuous action space without discretization of the action space, is also compared. Hence, a total of 28 algorithms are tested.¹ Results are shown in Figure 3 and Figure 4 for the inverted pendulum and reacher, respectively, where only the top five algorithms are plotted and each point in the graphs is obtained by averaging the values from three independent runs with different random seeds. Results of all 28 algorithms are provided in Appendix B. A Q-network with two hidden layers is used for the inverted pendulum problem and a network with four hidden layers is used for the reacher problem. Each Q-learning algorithm utilizes the same network topology.

¹To test DDPG, we used the code from OpenAI available at github.com/openai/baselines.

(a) Expected return. (b) Required episodes.
Fig. 3: Inverted pendulum problem. Algorithms are named <exploration method>+<update rule>+<α>. (a) The average performance of each algorithm after 3000 episodes. The performance of DDPG is out of scale. (b) The average number of episodes required to reach the threshold value 980.

(a) Expected return. (b) Required episodes.
Fig. 4: Reacher problem. (a) The average performance of each algorithm after 10000 episodes. (b) The average number of episodes required to reach the threshold value −6.

For the inverted pendulum, since the problem is easier than the reacher problem, most of the top five algorithms converge to the maximum return of 1000 at each discretization level, as shown in Figure 3. Four of the top five algorithms utilize the proposed sparsemax exploration; only one of the top five methods utilizes softmax exploration. In Figure 3(b), the number of episodes required to reach a near-optimal return, 980, is shown. Sparsemax exploration requires fewer episodes to obtain a near-optimal value than ε-greedy and softmax exploration. For the reacher problem, the algorithms with sparsemax exploration slightly outperform the ε-greedy method, and softmax exploration is not included in the top five, as shown in Figure 4. In terms of the number of required episodes, sparsemax exploration outperforms the ε-greedy method as shown in Figure 4(b), where we set the threshold return to be −6. DDPG shows poor performance in both problems since the number of sampled episodes is insufficient. In this regard, deep Q-learning with sparsemax exploration outperforms DDPG with a smaller number of episodes. From these experiments, it can be seen that the sparsemax exploration method has an advantage over softmax exploration, the ε-greedy method, and DDPG with respect to the number of episodes required to reach optimal performance.

VIII. CONCLUSION

In this paper, we have proposed a new MDP with a novel causal sparse Tsallis entropy regularization, which induces a sparse and multi-modal optimal policy distribution. In addition, we have provided a full mathematical analysis of the proposed sparse MDP: the optimality condition of a sparse MDP given as the sparse Bellman equation, sparse value iteration with its convergence and optimality properties, and the performance bounds between the proposed MDP and the original MDP. We have also proven that the performance gap of a sparse MDP is strictly smaller than that of a soft MDP. In experiments, we have verified that the theoretical performance gaps of the sparse MDP and the soft MDP from the original MDP are correct.
We have applied the sparsemax policy and the sparse Bellman equation to deep Q-learning as the exploration strategy and the update rule, respectively, and have shown that the proposed exploration method achieves significantly better performance compared to ε-greedy, softmax exploration, and DDPG as the number of actions increases. From the analysis and experiments, we have demonstrated that the proposed sparse MDP can be an efficient alternative for problems with a large number of possible actions and even continuous action spaces.

APPENDIX A

A. Sparse Bellman Equation from Karush-Kuhn-Tucker Conditions

The following proof explains the optimality condition of the sparse MDP derived from the Karush-Kuhn-Tucker (KKT) conditions.

Proof of Theorem 1: The KKT conditions of (3) are as follows:

Σ_a π(a|s) − 1 = 0,  π(a|s) ≥ 0    (9)
λ(s,a) ≥ 0    (10)
λ(s,a) π(a|s) = 0    (11)
∂L(π, c, λ)/∂π(a|s) = 0    (12)

where c(s) and λ(s,a) are Lagrangian multipliers for the equality and inequality constraints, respectively; (9) is the feasibility of the primal variables, (10) is the feasibility of the dual variables, (11) is complementary slackness, and (12) is the stationarity condition. The Lagrangian function of (3) is written as follows:

L(π, c, λ) = −J_sp(π) + Σ_s c(s) ( Σ_a π(a|s) − 1 ) − Σ_{s,a} λ(s,a) π(a|s),

where the maximization of (3) is changed into a minimization problem, i.e., min_π −J_sp(π). First, the derivative of J_sp(π) can be obtained by using the chain rule:

∂J_sp(π)/∂π(a|s) = ρ_π(s) [ r(s,a) + α(1/2 − π(a|s)) + γ Σ_{s'} V_sp^π(s') T(s'|s,a) ]
                 = ρ_π(s) [ Q_sp^π(s,a) + α/2 − α π(a|s) ].

Here, the partial derivative of the Lagrangian is obtained as follows:

∂L(π, c, λ)/∂π(a|s) = −ρ_π(s) [ Q_sp^π(s,a) + α/2 − α π(a|s) ] + c(s) − λ(s,a) = 0.

First, consider π(a|s) > 0, where the corresponding Lagrangian multiplier λ(s,a) is zero due to complementary slackness. By summing the stationarity condition over the actions with positive probability, the Lagrangian multiplier c(s) can be obtained as follows:

0 = Σ_{a: π(a|s)>0} [ −ρ_π(s) ( Q_sp^π(s,a) + α/2 − α π(a|s) ) + c(s) ]
⇒ c(s) = ρ_π(s) [ ( Σ_{a: π(a|s)>0} Q_sp^π(s,a) − α ) / K + α/2 ],

where K is the number of positive elements of π(·|s). By replacing c(s) with this result, the optimal policy distribution is induced as follows:

π(a|s) = Q_sp^π(s,a)/α − ( Σ_{a': π(a'|s)>0} Q_sp^π(s,a')/α − 1 ) / K.

Since this equation is derived under the assumption that π(a|s) is positive, for π(a|s) > 0 the following condition must be fulfilled:

Q_sp^π(s,a)/α > ( Σ_{a': π(a'|s)>0} Q_sp^π(s,a')/α − 1 ) / K.

We denote the supporting set by S(s) = { a : Q_sp^π(s,a)/α > τ(Q_sp^π(s,·)/α) }, which contains the actions whose action values are larger than the threshold τ(Q_sp^π(s,·)/α) = ( Σ_{a'∈S(s)} Q_sp^π(s,a')/α − 1 ) / |S(s)|. Using this notation, the optimal policy distribution can be rewritten as follows:

π(a|s) = max( Q_sp^π(s,a)/α − τ(Q_sp^π(s,·)/α), 0 ).

By substituting π(a|s) back into the definition of the value function, the following optimality equation for V_sp is induced:

V_sp(s) = Σ_a π(a|s) [ Q_sp(s,a) + (α/2)(1 − π(a|s)) ]
        = α [ (1/2) Σ_{a∈S(s)} ( (Q_sp(s,a)/α)² − τ(Q_sp(s,·)/α)² ) + 1/2 ].

To summarize, we obtain the sparse Bellman equation as follows:

Q_sp(s,a) = r(s,a) + γ Σ_{s'} V_sp(s') T(s'|s,a),
V_sp(s) = α [ (1/2) Σ_{a∈S(s)} ( (Q_sp(s,a)/α)² − τ(Q_sp(s,·)/α)² ) + 1/2 ],
π(a|s) = max( Q_sp(s,a)/α − τ(Q_sp(s,·)/α), 0 ).
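The closed-form solution derived above can be sanity-checked numerically. The following Python sketch (ours; it assumes NumPy and SciPy are available) compares the sparsemax distribution obtained from the KKT derivation with the Euclidean projection onto the probability simplex computed by a general-purpose constrained optimizer.

```python
import numpy as np
from scipy.optimize import minimize

def sparsemax(z):
    """Closed-form projection of z onto the probability simplex (the KKT solution above)."""
    zs = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    K = (1.0 + k * zs > np.cumsum(zs)).sum()
    tau = (np.cumsum(zs)[K - 1] - 1.0) / K
    return np.maximum(z - tau, 0.0)

rng = np.random.default_rng(3)
z = rng.normal(size=6)                        # stands in for Q(s, .) / alpha
closed_form = sparsemax(z)

# Solve min_p ||p - z||^2 s.t. sum(p) = 1, p >= 0 with a generic solver.
res = minimize(lambda p: np.sum((p - z) ** 2),
               x0=np.full(len(z), 1.0 / len(z)),
               method="SLSQP",
               bounds=[(0.0, None)] * len(z),
               constraints={"type": "eq", "fun": lambda p: p.sum() - 1.0})
print(np.max(np.abs(closed_form - res.x)))    # should be close to zero
```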

B. Upper and Lower Bounds for the Sparsemax Operation

In this section, we prove the lower and upper bounds of spmax(z) defined in (6). We would like to mention that the proof of the lower bound of (7) is provided in [17]; however, we find another interesting way to prove (7) by using the Cauchy-Schwarz inequality and the nonnegativity of quadratic terms. We first prove max(z) ≤ spmax(z) and next prove spmax(z) ≤ max(z) + (d−1)/(2d). Without loss of generality, we assume that α = 1; the original inequalities can be obtained simply by replacing z with z/α.

Lower Bound of the Sparsemax Operation. For all z ∈ R^d, max(z) ≤ spmax(z) holds.

Proof: We prove that, for all z, spmax(z) − z_(1) ≥ 0, where z_(1) = max(z) by definition. The proof is done by simply rearranging the terms in (6). Writing K = |supp(z)| and τ(z) = ( Σ_{i=1}^{K} z_(i) − 1 ) / K,

spmax(z) − z_(1) = (1/2) Σ_{i=1}^{K} ( z_(i)² − τ(z)² ) + 1/2 − z_(1).

Expanding τ(z) and completing the square, this difference can be rearranged into the sum of three terms, where the first and third terms are squares and hence always nonnegative. The second term is also nonnegative by the Cauchy-Schwarz inequality |pᵀq| ≤ ‖p‖‖q‖: letting z_{1:K} = [z_(1), ..., z_(K)] and setting p = z_{1:K} + 1 and q = 1, where 1 is a K-dimensional vector of ones, it can be shown that the second term is nonnegative. Therefore, spmax(z) − z_(1) is always nonnegative for all z, since all three terms are nonnegative, completing the proof.

Now, we prove the upper bound of the sparsemax operation.

Upper Bound of the Sparsemax Operation. For all z ∈ R^d, spmax(z) ≤ max(z) + (d−1)/(2d) holds.

Proof: First, we decompose the summation in (6) into two terms as follows:

spmax(z) = (1/2) Σ_{i∈supp(z)} ( z_i² − τ(z)² ) + 1/2
         = (1/2) Σ_i p_i(z) ( z_i + τ(z) ) + 1/2
         = (1/2) Σ_i p_i(z) z_i + (1/2) τ(z) + 1/2,

where p_i(z) = max( z_i − τ(z), 0 ) is the optimal solution of the simplex projection problem (5) and Σ_i p_i(z) = 1 by definition. Now, we use the fact that, for every p on the d-dimensional probability simplex, Σ_i p_i z_i ≤ max(z) for all z ∈ R^d. Since p(z) is on the probability simplex, and since τ(z) = (1/K) Σ_{i∈supp(z)} z_i − 1/K is, up to the constant −1/K, a weighted sum of z under the uniform weights 1/K over supp(z), which also lie on the probability simplex, the following inequality is induced:

spmax(z) = (1/2) Σ_i p_i(z) z_i + (1/2) ( (1/K) Σ_{i∈supp(z)} z_i ) + (1/2)(1 − 1/K)
         ≤ (1/2) max(z) + (1/2) max(z) + (K − 1)/(2K)
         ≤ max(z) + (d − 1)/(2d),

where K = |supp(z)| ≤ d by the definition of supp(z). Therefore, spmax(z) ≤ max(z) + (d−1)/(2d) holds.

C. Comparison to Log-Sum-Exp

We explain the error bound for the log-sum-exp operation and compare it to the bound of the sparsemax operation. The log-sum-exp operation has the widely known bound

max(z) ≤ logsumexp(z) ≤ max(z) + log(d).

We would like to note that sparsemax has a tighter bound than log-sum-exp, as (d−1)/(2d) ≤ log(d) is satisfied for all d > 1. Intuitively, the approximation error of log-sum-exp increases as the dimension of the input space increases, whereas the approximation error of sparsemax approaches 1/2 as the dimension of the input space goes to infinity. This fact plays a crucial role in comparing the performance error bounds of the sparse MDP and the soft MDP.

D. Causal Sparse Tsallis Entropy

The following proof shows that W(π) is equivalent to the discounted expected sum of a special case of the Tsallis entropy with q = 2 and k = 1/2.

Proof of Theorem 2: The proof is done simply by rewriting our regularization as follows:

W(π) = E[ Σ_{t=0}^∞ γ^t (1/2)(1 − π(a_t|s_t)) | π, d₀, T ]
     = (1/2) Σ_{s,a} E[ Σ_{t=0}^∞ γ^t 1{s_t = s, a_t = a} | π, d₀, T ] (1 − π(a|s))
     = (1/2) Σ_{s,a} ρ_π(s,a) (1 − π(a|s))
     = (1/2) Σ_s ρ_π(s) Σ_a π(a|s)(1 − π(a|s))
     = Σ_s ρ_π(s) S_{2,1/2}(π(·|s))
     = E_π[ S_{2,1/2}(π(·|s)) ].

E. Convergence and Optimality of Sparse Value Iteration

In this section, the monotonicity, discounting property, and contraction of the sparse Bellman operator U_sp are proved.

Proof of Lemma 1: In [17], the monotonicity of (6) is proved. Then, the monotonicity of U_sp can be proved using (6). Let x and y be given such that x ≤ y. Then

r(s,a) + γ Σ_{s'} x(s') T(s'|s,a) ≤ r(s,a) + γ Σ_{s'} y(s') T(s'|s,a),

where T(s'|s,a) is a transition probability, which is always nonnegative. Since the sparsemax operation is monotone, the following inequality is induced:

α spmax_a( (r(s,a) + γ Σ_{s'} x(s') T(s'|s,a))/α ) ≤ α spmax_a( (r(s,a) + γ Σ_{s'} y(s') T(s'|s,a))/α ).

Finally, we obtain U_sp(x) ≤ U_sp(y).

Proof of Lemma 2: In [17], it is shown that, for c ∈ R and x ∈ R^d, spmax(x + c·1) = spmax(x) + c. Using this property,

U_sp(x + c·1)(s) = α spmax_a( (r(s,a) + γ Σ_{s'} (x(s') + c) T(s'|s,a))/α )
                = α spmax_a( (r(s,a) + γ Σ_{s'} x(s') T(s'|s,a))/α + γc/α )
                = α spmax_a( (r(s,a) + γ Σ_{s'} x(s') T(s'|s,a))/α ) + γc
                = U_sp(x)(s) + γc.

Hence U_sp(x + c·1) = U_sp(x) + γc·1.

Proof of Lemma 3: First, we prove that U_sp is a γ-contraction mapping with respect to the max norm d_∞(x, y) = max_s |x(s) − y(s)|. Without loss of generality, the proof is given for a general mapping φ: R^|S| → R^|S| with the discounting and monotonicity properties. Let d_∞(x, y) = M. Then y − M·1 ≤ x ≤ y + M·1 is satisfied. By the monotonicity and discounting properties, the following inequality between the mappings φ(x) and φ(y) is established:

φ(y) − γM·1 ≤ φ(x) ≤ φ(y) + γM·1,

where γ is the discounting factor of φ. From this inequality, d_∞(φ(x), φ(y)) ≤ γM = γ d_∞(x, y) with γ ∈ (0, 1). Therefore, φ is a γ-contraction mapping, and in our case U_sp is a γ-contraction mapping. As R^|S| with d_∞ is a non-empty complete metric space, by the Banach fixed-point theorem the γ-contraction mapping U_sp has a unique fixed point.

Using Lemma 1, Lemma 2, and Lemma 3, we can prove the convergence and optimality of sparse value iteration.

Proof of Theorem 3: Sparse value iteration converges to a fixed point of U_sp by the contraction property. Let x* be the fixed point of U_sp; by the definition of U_sp, x* is the point that satisfies the sparse Bellman equation, i.e., x* = U_sp(x*). Hence, by Theorem 1, x* satisfies the necessary condition for an optimal solution. By the Banach fixed-point theorem, x* is the unique point that satisfies this necessary condition; in particular, x* = U_sp(x*) is precisely equivalent to the sparse Bellman equation, so there is no other point that satisfies the sparse Bellman equation. Therefore, x* is the optimal value of the sparse MDP.

F. Performance Error Bounds for Sparse Value Iteration

In this section, we prove the performance error bounds for sparse value iteration and soft value iteration. We first show that the optimal values of the sparse MDP and the soft MDP are greater than that of the original MDP.

Proof of Lemma 4: We first prove the inequality for the sparse Bellman operation, U^n x ≤ U_sp^n x, by mathematical induction. When n = 1, the inequality is proven as follows:

max_a ( r(s,a) + γ Σ_{s'} x(s') T(s'|s,a) ) ≤ α spmax_a( (r(s,a) + γ Σ_{s'} x(s') T(s'|s,a))/α ),

since max(z) ≤ spmax(z). Therefore, Ux ≤ U_sp x. For some positive integer k, let us assume that U^k x ≤ U_sp^k x holds for every x ∈ R^|S|. Then, when n = k + 1,

U^{k+1} x = U^k (Ux) ≤ U_sp^k (Ux) ≤ U_sp^k (U_sp x) = U_sp^{k+1} x,

where the first inequality follows from the induction hypothesis and the second follows from Ux ≤ U_sp x together with the monotonicity of U_sp. Therefore, by mathematical induction, U^n x ≤ U_sp^n x holds for every positive integer n. The inequality between the fixed points of U and U_sp is obtained by letting n → ∞: x* ≤ x*_sp, where * indicates the fixed point. The above argument also holds when U_sp and sparsemax are replaced with U_soft and the log-sum-exp operation, respectively.

Before showing the performance error bounds, the upper bounds of W(π) and H(π) are proved first.

Proof of Lemma 5: For W(π),

W(π) = (1/2) Σ_s ρ_π(s) Σ_a π(a|s)(1 − π(a|s)) ≤ (1/2) Σ_s ρ_π(s) (|A| − 1)/|A| = (1/(1 − γ)) (|A| − 1)/(2|A|),

where the inequality Σ_a π(a|s)(1 − π(a|s)) ≤ (|A| − 1)/|A| can be obtained by finding the point where the derivative of Σ_a x_a(1 − x_a) over the probability simplex is zero, i.e., the uniform distribution, and Σ_s ρ_π(s) = 1/(1 − γ). Similarly, for H(π),

H(π) = −Σ_s ρ_π(s) Σ_a π(a|s) log π(a|s) ≤ Σ_s ρ_π(s) log|A| = log|A| / (1 − γ),

where the inequality −Σ_a π(a|s) log π(a|s) ≤ log|A| can also be obtained by finding the point where the derivative of −Σ_a x_a log x_a over the probability simplex is zero, i.e., the uniform distribution.

Using Lemma 4 and Lemma 5, the error bounds of sparse and soft value iteration can be proved.

Proof of Theorem 4: Let π* be the optimal policy of the original MDP, where the problem is defined as max_π E_π[ r(s,a) ]. Then

E_{π*_sp}[ r(s,a) ] ≤ max_π E_π[ r(s,a) ] = E_{π*}[ r(s,a) ].

The right-side inequality is by the definition of optimality. Before proving the left-side inequality, we first derive the following inequality from Lemma 4:

V* ≤ V*_sp,    (13)

where * indicates the optimal value. Since the fixed points of U and U_sp are the optimal values of the original MDP and the sparse MDP, respectively, (13) can be derived from Lemma 4.

The left-side inequality is proved using (13) as follows:

E_{π*}[ r(s,a) ] = d₀ᵀ V* ≤ d₀ᵀ V*_sp = J_sp(π*_sp) = E_{π*_sp}[ r(s,a) ] + α W(π*_sp)
               ≤ E_{π*_sp}[ r(s,a) ] + (α/(1 − γ)) (|A| − 1)/(2|A|),

where the last inequality follows from Lemma 5.

Proof of Theorem 5: Let π* be the optimal policy of the original MDP, defined as max_π E_π[ r(s,a) ]. The right-side inequality is by the definition of optimality:

E_{π*_soft}[ r(s,a) ] ≤ max_π E_π[ r(s,a) ] = E_{π*}[ r(s,a) ].

Before proving the left-side inequality, we first derive the following inequality from Lemma 4:

V* ≤ V*_soft,    (14)

where * indicates the optimal value. Then the proof of the left-side inequality is done by using (14) as follows:

E_{π*}[ r(s,a) ] = d₀ᵀ V* ≤ d₀ᵀ V*_soft = J_soft(π*_soft) = E_{π*_soft}[ r(s,a) ] + α H(π*_soft)
               ≤ E_{π*_soft}[ r(s,a) ] + α log|A| / (1 − γ),

where the last inequality follows from Lemma 5.

APPENDIX B

In this section, we present the full experimental results of reinforcement learning with continuous action spaces. We performed experiments on Inverted Pendulum and Reacher, and 28 algorithms are tested, including our sparse exploration method and sparse Bellman update rule.

REFERENCES

[1] S. Brechtel, T. Gindele, and R. Dillmann, "Probabilistic decision-making under uncertainty for autonomous driving using continuous POMDPs," in 17th International Conference on Intelligent Transportation Systems, October 2014.
[2] S. Ragi and E. K. P. Chong, "UAV path planning in a dynamic environment via partially observable Markov decision process," IEEE Trans. Aerospace and Electronic Systems, vol. 49, no. 4, 2013.
[3] J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter, "Control of a quadrotor with reinforcement learning," IEEE Robotics and Automation Letters, vol. 2, no. 4, 2017.
[4] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," International Journal of Robotics Research, vol. 32, no. 11, 2013.
[5] A. Y. Ng and S. J. Russell, "Algorithms for inverse reinforcement learning," in Proc. of the 17th International Conference on Machine Learning, June 2000.
[6] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, "Reinforcement learning with deep energy-based policies," in Proc. of the 34th International Conference on Machine Learning, August 2017.
[7] N. Heess, D. Silver, and Y. W. Teh, "Actor-critic reinforcement learning with energy-based policies," in Proc. of the Tenth European Workshop on Reinforcement Learning, June 2012.
[8] J. Schulman, P. Abbeel, and X. Chen, "Equivalence between policy gradients and soft Q-learning," arXiv preprint, 2017.
[9] M. Tokic and G. Palm, "Value-difference based exploration: Adaptive control between epsilon-greedy and softmax," in KI 2011: Advances in Artificial Intelligence, 34th Annual German Conference on AI, October 2011.
[10] P. Vamplew, R. Dazeley, and C. Foale, "Softmax exploration strategies for multiobjective reinforcement learning," Neurocomputing, vol. 263, 2017.
[11] M. Bloem and N. Bambos, "Infinite time horizon maximum causal entropy inverse reinforcement learning," in 53rd IEEE Conference on Decision and Control, December 2014.
[12] C. Tsallis, "Possible generalization of Boltzmann-Gibbs statistics," Journal of Statistical Physics, vol. 52, 1988.
[13] W. Wang and M. A. Carreira-Perpinán, "Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application," arXiv preprint arXiv:1309.1541, 2013.
[14] D. R. Smart, Fixed Point Theorems. CUP Archive, 1980, vol. 66.
[15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint, 2015.
[16] J. Ye, "Constraint qualifications and necessary optimality conditions for optimization problems with variational inequality constraints," SIAM Journal on Optimization, vol. 10, no. 4, 2000.
[17] A. Martins and R. Astudillo, "From softmax to sparsemax: A sparse model of attention and multi-label classification," in International Conference on Machine Learning, June 2016.
[18] B. D. Ziebart, "Modeling purposeful adaptive behavior with the principle of maximum causal entropy," Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA, USA, 2010.
[19] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," arXiv preprint arXiv:1511.05952, 2015.
[20] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.
[21] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. of the Thirtieth AAAI Conference on Artificial Intelligence, February 2016.
[22] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in International Conference on Intelligent Robots and Systems, October 2012.

TABLE II: Expected return on Inverted Pendulum for each number of actions and on average, for the 27 deep Q-learning variants (exploration method + update rule + regularization coefficient) and DDPG. The top five performances are marked in bold.

TABLE III: The number of episodes required to reach the threshold return, 980, for each number of actions, for the 27 deep Q-learning variants.

TABLE IV: Expected return on Reacher for each number of actions and on average, for the 27 deep Q-learning variants and DDPG. The top five performances are marked in bold.

TABLE V: The number of episodes required to reach the threshold return, -6, for each number of actions, for the 27 deep Q-learning variants.


More information

THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS.

THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS. THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS RADON ROSBOROUGH https://intuitiveexplntionscom/picrd-lindelof-theorem/ This document is proof of the existence-uniqueness theorem

More information

Analysis of Variance and Design of Experiments-II

Analysis of Variance and Design of Experiments-II Anlyi of Vrince nd Deign of Experiment-II MODULE VI LECTURE - 7 SPLIT-PLOT AND STRIP-PLOT DESIGNS Dr. Shlbh Deprtment of Mthemtic & Sttitic Indin Intitute of Technology Knpur Anlyi of covrince ith one

More information

Research Article Generalized Hyers-Ulam Stability of the Second-Order Linear Differential Equations

Research Article Generalized Hyers-Ulam Stability of the Second-Order Linear Differential Equations Hindwi Publihing Corportion Journl of Applied Mthemtic Volume 011, Article ID 813137, 10 pge doi:10.1155/011/813137 Reerch Article Generlized Hyer-Ulm Stbility of the Second-Order Liner Differentil Eqution

More information

On the Adders with Minimum Tests

On the Adders with Minimum Tests Proceeding of the 5th Ain Tet Sympoium (ATS '97) On the Adder with Minimum Tet Seiji Kjihr nd Tutomu So Dept. of Computer Science nd Electronic, Kyuhu Intitute of Technology Atrct Thi pper conider two

More information

LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS

LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS Communiction on Stochtic Anlyi Vol. 7, No. 2 213 245-253 Seril Publiction www.erilpubliction.com LINEA STOCHASTIC DIFFEENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS NAJESS KHALIFA, HUI-HSIUNG KUO,

More information

19 Optimal behavior: Game theory

19 Optimal behavior: Game theory Intro. to Artificil Intelligence: Dle Schuurmns, Relu Ptrscu 1 19 Optiml behvior: Gme theory Adversril stte dynmics hve to ccount for worst cse Compute policy π : S A tht mximizes minimum rewrd Let S (,

More information

Package ifs. R topics documented: August 21, Version Title Iterated Function Systems. Author S. M. Iacus.

Package ifs. R topics documented: August 21, Version Title Iterated Function Systems. Author S. M. Iacus. Pckge if Augut 21, 2015 Verion 0.1.5 Title Iterted Function Sytem Author S. M. Icu Dte 2015-08-21 Mintiner S. M. Icu Iterted Function Sytem Etimtor. Licene GPL (>= 2) NeedCompiltion

More information

The ifs Package. December 28, 2005

The ifs Package. December 28, 2005 The if Pckge December 28, 2005 Verion 0.1-1 Title Iterted Function Sytem Author S. M. Icu Mintiner S. M. Icu Iterted Function Sytem Licene GPL Verion 2 or lter. R topic documented:

More information

APPROXIMATE INTEGRATION

APPROXIMATE INTEGRATION APPROXIMATE INTEGRATION. Introduction We hve seen tht there re functions whose nti-derivtives cnnot be expressed in closed form. For these resons ny definite integrl involving these integrnds cnnot be

More information

Advanced Calculus: MATH 410 Notes on Integrals and Integrability Professor David Levermore 17 October 2004

Advanced Calculus: MATH 410 Notes on Integrals and Integrability Professor David Levermore 17 October 2004 Advnced Clculus: MATH 410 Notes on Integrls nd Integrbility Professor Dvid Levermore 17 October 2004 1. Definite Integrls In this section we revisit the definite integrl tht you were introduced to when

More information

Notes on length and conformal metrics

Notes on length and conformal metrics Notes on length nd conforml metrics We recll how to mesure the Eucliden distnce of n rc in the plne. Let α : [, b] R 2 be smooth (C ) rc. Tht is α(t) (x(t), y(t)) where x(t) nd y(t) re smooth rel vlued

More information

Review of Calculus, cont d

Review of Calculus, cont d Jim Lmbers MAT 460 Fll Semester 2009-10 Lecture 3 Notes These notes correspond to Section 1.1 in the text. Review of Clculus, cont d Riemnn Sums nd the Definite Integrl There re mny cses in which some

More information

Excerpted Section. Consider the stochastic diffusion without Poisson jumps governed by the stochastic differential equation (SDE)

Excerpted Section. Consider the stochastic diffusion without Poisson jumps governed by the stochastic differential equation (SDE) ? > ) 1 Technique in Computtionl Stochtic Dynmic Progrmming Floyd B. Hnon niverity of Illinoi t Chicgo Chicgo, Illinoi 60607-705 Excerpted Section A. MARKOV CHAI APPROXIMATIO Another pproch to finite difference

More information

Accelerator Physics. G. A. Krafft Jefferson Lab Old Dominion University Lecture 5

Accelerator Physics. G. A. Krafft Jefferson Lab Old Dominion University Lecture 5 Accelertor Phyic G. A. Krfft Jefferon L Old Dominion Univerity Lecture 5 ODU Accelertor Phyic Spring 15 Inhomogeneou Hill Eqution Fundmentl trnvere eqution of motion in prticle ccelertor for mll devition

More information

1 Online Learning and Regret Minimization

1 Online Learning and Regret Minimization 2.997 Decision-Mking in Lrge-Scle Systems My 10 MIT, Spring 2004 Hndout #29 Lecture Note 24 1 Online Lerning nd Regret Minimiztion In this lecture, we consider the problem of sequentil decision mking in

More information

MORE FUNCTION GRAPHING; OPTIMIZATION. (Last edited October 28, 2013 at 11:09pm.)

MORE FUNCTION GRAPHING; OPTIMIZATION. (Last edited October 28, 2013 at 11:09pm.) MORE FUNCTION GRAPHING; OPTIMIZATION FRI, OCT 25, 203 (Lst edited October 28, 203 t :09pm.) Exercise. Let n be n rbitrry positive integer. Give n exmple of function with exctly n verticl symptotes. Give

More information

The Regulated and Riemann Integrals

The Regulated and Riemann Integrals Chpter 1 The Regulted nd Riemnn Integrls 1.1 Introduction We will consider severl different pproches to defining the definite integrl f(x) dx of function f(x). These definitions will ll ssign the sme vlue

More information

arxiv: v1 [stat.ml] 9 Aug 2016

arxiv: v1 [stat.ml] 9 Aug 2016 On Lower Bounds for Regret in Reinforcement Lerning In Osbnd Stnford University, Google DeepMind iosbnd@stnford.edu Benjmin Vn Roy Stnford University bvr@stnford.edu rxiv:1608.02732v1 [stt.ml 9 Aug 2016

More information

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below.

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below. Dulity #. Second itertion for HW problem Recll our LP emple problem we hve been working on, in equlity form, is given below.,,,, 8 m F which, when written in slightly different form, is 8 F Recll tht we

More information

NUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by.

NUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by. NUMERICAL INTEGRATION 1 Introduction The inverse process to differentition in clculus is integrtion. Mthemticlly, integrtion is represented by f(x) dx which stnds for the integrl of the function f(x) with

More information

1.9 C 2 inner variations

1.9 C 2 inner variations 46 CHAPTER 1. INDIRECT METHODS 1.9 C 2 inner vritions So fr, we hve restricted ttention to liner vritions. These re vritions of the form vx; ǫ = ux + ǫφx where φ is in some liner perturbtion clss P, for

More information

Scientific notation is a way of expressing really big numbers or really small numbers.

Scientific notation is a way of expressing really big numbers or really small numbers. Scientific Nottion (Stndrd form) Scientific nottion is wy of expressing relly big numbers or relly smll numbers. It is most often used in scientific clcultions where the nlysis must be very precise. Scientific

More information

COUNTING DESCENTS, RISES, AND LEVELS, WITH PRESCRIBED FIRST ELEMENT, IN WORDS

COUNTING DESCENTS, RISES, AND LEVELS, WITH PRESCRIBED FIRST ELEMENT, IN WORDS COUNTING DESCENTS, RISES, AND LEVELS, WITH PRESCRIBED FIRST ELEMENT, IN WORDS Sergey Kitev The Mthemtic Intitute, Reykvik Univerity, IS-03 Reykvik, Icelnd ergey@rui Toufik Mnour Deprtment of Mthemtic,

More information

Review of basic calculus

Review of basic calculus Review of bsic clculus This brief review reclls some of the most importnt concepts, definitions, nd theorems from bsic clculus. It is not intended to tech bsic clculus from scrtch. If ny of the items below

More information

Module 6: LINEAR TRANSFORMATIONS

Module 6: LINEAR TRANSFORMATIONS Module 6: LINEAR TRANSFORMATIONS. Trnsformtions nd mtrices Trnsformtions re generliztions of functions. A vector x in some set S n is mpped into m nother vector y T( x). A trnsformtion is liner if, for

More information

Zero-Sum Magic Graphs and Their Null Sets

Zero-Sum Magic Graphs and Their Null Sets Zero-Sum Mgic Grphs nd Their Null Sets Ebrhim Slehi Deprtment of Mthemticl Sciences University of Nevd Ls Vegs Ls Vegs, NV 89154-4020. ebrhim.slehi@unlv.edu Abstrct For ny h N, grph G = (V, E) is sid to

More information

New Expansion and Infinite Series

New Expansion and Infinite Series Interntionl Mthemticl Forum, Vol. 9, 204, no. 22, 06-073 HIKARI Ltd, www.m-hikri.com http://dx.doi.org/0.2988/imf.204.4502 New Expnsion nd Infinite Series Diyun Zhng College of Computer Nnjing University

More information

Oracular Partially Observable Markov Decision Processes: A Very Special Case

Oracular Partially Observable Markov Decision Processes: A Very Special Case Orculr Prtilly Obervble Mrkov Deciion Procee: A Very Specil Ce Nichol Armtrong-Crew nd Mnuel Veloo Robotic Intitute, Crnegie Mellon Univerity {nrmtro,veloo}@c.cmu.edu Abtrct We introduce the Orculr Prtilly

More information

Acceptance Sampling by Attributes

Acceptance Sampling by Attributes Introduction Acceptnce Smpling by Attributes Acceptnce smpling is concerned with inspection nd decision mking regrding products. Three spects of smpling re importnt: o Involves rndom smpling of n entire

More information

A Fast and Reliable Policy Improvement Algorithm

A Fast and Reliable Policy Improvement Algorithm A Fst nd Relible Policy Improvement Algorithm Ysin Abbsi-Ydkori Peter L. Brtlett Stephen J. Wright Queenslnd University of Technology UC Berkeley nd QUT University of Wisconsin-Mdison Abstrct We introduce

More information

Best Approximation in the 2-norm

Best Approximation in the 2-norm Jim Lmbers MAT 77 Fll Semester 1-11 Lecture 1 Notes These notes correspond to Sections 9. nd 9.3 in the text. Best Approximtion in the -norm Suppose tht we wish to obtin function f n (x) tht is liner combintion

More information

Multi-Armed Bandits: Non-adaptive and Adaptive Sampling

Multi-Armed Bandits: Non-adaptive and Adaptive Sampling CSE 547/Stt 548: Mchine Lerning for Big Dt Lecture Multi-Armed Bndits: Non-dptive nd Adptive Smpling Instructor: Shm Kkde 1 The (stochstic) multi-rmed bndit problem The bsic prdigm is s follows: K Independent

More information

Decision Networks. CS 188: Artificial Intelligence Fall Example: Decision Networks. Decision Networks. Decisions as Outcome Trees

Decision Networks. CS 188: Artificial Intelligence Fall Example: Decision Networks. Decision Networks. Decisions as Outcome Trees CS 188: Artificil Intelligence Fll 2011 Decision Networks ME: choose the ction which mximizes the expected utility given the evidence mbrell Lecture 17: Decision Digrms 10/27/2011 Cn directly opertionlize

More information

New data structures to reduce data size and search time

New data structures to reduce data size and search time New dt structures to reduce dt size nd serch time Tsuneo Kuwbr Deprtment of Informtion Sciences, Fculty of Science, Kngw University, Hirtsuk-shi, Jpn FIT2018 1D-1, No2, pp1-4 Copyright (c)2018 by The Institute

More information

positive definite (symmetric with positive eigenvalues) positive semi definite (symmetric with nonnegative eigenvalues)

positive definite (symmetric with positive eigenvalues) positive semi definite (symmetric with nonnegative eigenvalues) Chter Liner Qudrtic Regultor Problem inimize the cot function J given by J x' Qx u' Ru dt R > Q oitive definite ymmetric with oitive eigenvlue oitive emi definite ymmetric with nonnegtive eigenvlue ubject

More information

Lecture 14: Quadrature

Lecture 14: Quadrature Lecture 14: Qudrture This lecture is concerned with the evlution of integrls fx)dx 1) over finite intervl [, b] The integrnd fx) is ssumed to be rel-vlues nd smooth The pproximtion of n integrl by numericl

More information

Generation of Lyapunov Functions by Neural Networks

Generation of Lyapunov Functions by Neural Networks WCE 28, July 2-4, 28, London, U.K. Genertion of Lypunov Functions by Neurl Networks Nvid Noroozi, Pknoosh Krimghee, Ftemeh Sfei, nd Hmed Jvdi Abstrct Lypunov function is generlly obtined bsed on tril nd

More information

Lecture notes. Fundamental inequalities: techniques and applications

Lecture notes. Fundamental inequalities: techniques and applications Lecture notes Fundmentl inequlities: techniques nd pplictions Mnh Hong Duong Mthemtics Institute, University of Wrwick Emil: m.h.duong@wrwick.c.uk Februry 8, 207 2 Abstrct Inequlities re ubiquitous in

More information

Low-order simultaneous stabilization of linear bicycle models at different forward speeds

Low-order simultaneous stabilization of linear bicycle models at different forward speeds 203 Americn Control Conference (ACC) Whington, DC, USA, June 7-9, 203 Low-order imultneou tbiliztion of liner bicycle model t different forwrd peed A. N. Gündeş nd A. Nnngud 2 Abtrct Liner model of bicycle

More information

Variational Techniques for Sturm-Liouville Eigenvalue Problems

Variational Techniques for Sturm-Liouville Eigenvalue Problems Vritionl Techniques for Sturm-Liouville Eigenvlue Problems Vlerie Cormni Deprtment of Mthemtics nd Sttistics University of Nebrsk, Lincoln Lincoln, NE 68588 Emil: vcormni@mth.unl.edu Rolf Ryhm Deprtment

More information

STOCHASTIC REGULAR LANGUAGE: A MATHEMATICAL MODEL FOR THE LANGUAGE OF SEQUENTIAL ACTIONS FOR DECISION MAKING UNDER UNCERTAINTY

STOCHASTIC REGULAR LANGUAGE: A MATHEMATICAL MODEL FOR THE LANGUAGE OF SEQUENTIAL ACTIONS FOR DECISION MAKING UNDER UNCERTAINTY Interntionl Journl of Mthemtic nd Computer Appliction Reerch (IJMCAR) ISSN 49-6955 Vol. 3, Iue, Mr 3, -8 TJPRC Pvt. Ltd. STOCHASTIC REGULAR LANGUAGE: A MATHEMATICAL MODEL FOR THE LANGUAGE OF SEQUENTIAL

More information

Uncertain Dynamic Systems on Time Scales

Uncertain Dynamic Systems on Time Scales Journl of Uncertin Sytem Vol.9, No.1, pp.17-30, 2015 Online t: www.ju.org.uk Uncertin Dynmic Sytem on Time Scle Umber Abb Hhmi, Vile Lupulecu, Ghu ur Rhmn Abdu Slm School of Mthemticl Science, GCU Lhore

More information

SUMMER KNOWHOW STUDY AND LEARNING CENTRE

SUMMER KNOWHOW STUDY AND LEARNING CENTRE SUMMER KNOWHOW STUDY AND LEARNING CENTRE Indices & Logrithms 2 Contents Indices.2 Frctionl Indices.4 Logrithms 6 Exponentil equtions. Simplifying Surds 13 Opertions on Surds..16 Scientific Nottion..18

More information

221B Lecture Notes WKB Method

221B Lecture Notes WKB Method Clssicl Limit B Lecture Notes WKB Method Hmilton Jcobi Eqution We strt from the Schrödinger eqution for single prticle in potentil i h t ψ x, t = [ ] h m + V x ψ x, t. We cn rewrite this eqution by using

More information

SPACE VECTOR PULSE- WIDTH-MODULATED (SV-PWM) INVERTERS

SPACE VECTOR PULSE- WIDTH-MODULATED (SV-PWM) INVERTERS CHAPTER 7 SPACE VECTOR PULSE- WIDTH-MODULATED (SV-PWM) INVERTERS 7-1 INTRODUCTION In Chpter 5, we briefly icue current-regulte PWM inverter uing current-hyterei control, in which the witching frequency

More information

Chapters 4 & 5 Integrals & Applications

Chapters 4 & 5 Integrals & Applications Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions

More information

VSS CONTROL OF STRIP STEERING FOR HOT ROLLING MILLS. M.Okada, K.Murayama, Y.Anabuki, Y.Hayashi

VSS CONTROL OF STRIP STEERING FOR HOT ROLLING MILLS. M.Okada, K.Murayama, Y.Anabuki, Y.Hayashi V ONTROL OF TRIP TEERING FOR OT ROLLING MILL M.Okd.Murym Y.Anbuki Y.yhi Wet Jpn Work (urhiki Ditrict) JFE teel orportion wkidori -chome Mizuhim urhiki 7-85 Jpn Abtrct: trip teering i one of the mot eriou

More information

The use of a so called graphing calculator or programmable calculator is not permitted. Simple scientific calculators are allowed.

The use of a so called graphing calculator or programmable calculator is not permitted. Simple scientific calculators are allowed. ERASMUS UNIVERSITY ROTTERDAM Informtion concerning the Entrnce exmintion Mthemtics level 1 for Interntionl Bchelor in Communiction nd Medi Generl informtion Avilble time: 2 hours 30 minutes. The exmintion

More information

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as Improper Integrls Two different types of integrls cn qulify s improper. The first type of improper integrl (which we will refer to s Type I) involves evluting n integrl over n infinite region. In the grph

More information

A New Grey-rough Set Model Based on Interval-Valued Grey Sets

A New Grey-rough Set Model Based on Interval-Valued Grey Sets Proceedings of the 009 IEEE Interntionl Conference on Systems Mn nd Cybernetics Sn ntonio TX US - October 009 New Grey-rough Set Model sed on Intervl-Vlued Grey Sets Wu Shunxing Deprtment of utomtion Ximen

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 8, August ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 8, August ISSN Interntionl Journl of Scientific & Engineering Reerc Volume Iue 8 ugut- 68 ISSN 9-558 n Inventory Moel wit llowble Sortge Uing rpezoil Fuzzy Number P. Prvti He & ocite Profeor eprtment of Mtemtic ui- E

More information

A sequence is a list of numbers in a specific order. A series is a sum of the terms of a sequence.

A sequence is a list of numbers in a specific order. A series is a sum of the terms of a sequence. Core Module Revision Sheet The C exm is hour 30 minutes long nd is in two sections. Section A (36 mrks) 8 0 short questions worth no more thn 5 mrks ech. Section B (36 mrks) 3 questions worth mrks ech.

More information

1 The Riemann Integral

1 The Riemann Integral The Riemnn Integrl. An exmple leding to the notion of integrl (res) We know how to find (i.e. define) the re of rectngle (bse height), tringle ( (sum of res of tringles). But how do we find/define n re

More information

MATH34032: Green s Functions, Integral Equations and the Calculus of Variations 1

MATH34032: Green s Functions, Integral Equations and the Calculus of Variations 1 MATH34032: Green s Functions, Integrl Equtions nd the Clculus of Vritions 1 Section 1 Function spces nd opertors Here we gives some brief detils nd definitions, prticulrly relting to opertors. For further

More information

AMATH 731: Applied Functional Analysis Fall Additional notes on Fréchet derivatives

AMATH 731: Applied Functional Analysis Fall Additional notes on Fréchet derivatives AMATH 731: Applied Functionl Anlysis Fll 214 Additionl notes on Fréchet derivtives (To ccompny Section 3.1 of the AMATH 731 Course Notes) Let X,Y be normed liner spces. The Fréchet derivtive of n opertor

More information

Analytical Methods Exam: Preparatory Exercises

Analytical Methods Exam: Preparatory Exercises Anlyticl Methods Exm: Preprtory Exercises Question. Wht does it men tht (X, F, µ) is mesure spce? Show tht µ is monotone, tht is: if E F re mesurble sets then µ(e) µ(f). Question. Discuss if ech of the

More information

Coalgebra, Lecture 15: Equations for Deterministic Automata

Coalgebra, Lecture 15: Equations for Deterministic Automata Colger, Lecture 15: Equtions for Deterministic Automt Julin Slmnc (nd Jurrin Rot) Decemer 19, 2016 In this lecture, we will study the concept of equtions for deterministic utomt. The notes re self contined

More information

Math 1B, lecture 4: Error bounds for numerical methods

Math 1B, lecture 4: Error bounds for numerical methods Mth B, lecture 4: Error bounds for numericl methods Nthn Pflueger 4 September 0 Introduction The five numericl methods descried in the previous lecture ll operte by the sme principle: they pproximte the

More information

Jim Lambers MAT 169 Fall Semester Lecture 4 Notes

Jim Lambers MAT 169 Fall Semester Lecture 4 Notes Jim Lmbers MAT 169 Fll Semester 2009-10 Lecture 4 Notes These notes correspond to Section 8.2 in the text. Series Wht is Series? An infinte series, usully referred to simply s series, is n sum of ll of

More information

Information Leakage as a Model for Quality of Anonymity Networks

Information Leakage as a Model for Quality of Anonymity Networks Clevelnd tte Univerity Enggedcholrhip@CU Electricl Engineering & Computer cience Fculty Publiction Electricl Engineering & Computer cience Deprtment 4-2009 Informtion Lekge Model for Qulity of Anonymity

More information

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature CMDA 4604: Intermedite Topics in Mthemticl Modeling Lecture 19: Interpoltion nd Qudrture In this lecture we mke brief diversion into the res of interpoltion nd qudrture. Given function f C[, b], we sy

More information