The Essential Dynamics Algorithm: Essential Results

MIT — Massachusetts Institute of Technology, Artificial Intelligence Laboratory. The Essential Dynamics Algorithm: Essential Results. Martin C. Martin. AI Memo 2003-014, May 2003. © 2003 Massachusetts Institute of Technology, Cambridge, MA 02139 USA. www.ai.mit.edu

Abstract. This paper presents a novel algorithm for learning in a class of stochastic Markov decision processes (MDPs) with continuous state and action spaces that trades speed for accuracy. A transform of the stochastic MDP into a deterministic one is presented which captures the essence of the original dynamics, in a sense made precise. In this transformed MDP, the calculation of values is greatly simplified. The online algorithm estimates the model of the transformed MDP and simultaneously does policy search against it. Bounds on the error of this approximation are proven, and experimental results in a bicycle riding domain are presented. The algorithm learns near optimal policies in orders of magnitude fewer interactions with the stochastic MDP, using less domain knowledge. All code used in the experiments is available on the project web site. This work was funded by DARPA as part of the "Natural Tasking of Robots Based on Human Interaction Cues" project under contract number DABT 63-00-C-10102.

1 Introduction

There is currently much interest in the problem of learning in stochastic Markov decision processes (MDPs) with continuous state and action spaces [2, 9, 10]. For such domains, especially when the state or action spaces are of high dimension, the value and Q-functions may be quite complicated and difficult to approximate. However, there may be relatively simple policies which perform well. This has led to recent interest in policy search algorithms, in which the reinforcement signal is used to modify the policy directly [5, 6, 10].

For many problems, a positive reward is only achieved at the end of a task, if the agent reaches a goal state. For complex problems, the probability that an initial, random policy would reach such a state could be vanishingly small. A widely used methodology to overcome this is shaping [1, 3, 4, 8]. Shaping is the introduction of small rewards to reward partial progress toward the goal. A shaping function eases the problem of backing up rewards, since actions are rewarded or punished sooner.

When a policy changes, estimating the resulting change in value can be difficult, requiring the new policy to interact with the MDP for many episodes. In this paper we introduce a method of transforming a stochastic MDP into a deterministic one. Under certain conditions on the original MDP, and given a shaping reward of the proper form, the deterministic MDP can be used to estimate the value of any policy with respect to the original MDP. This leads to an online algorithm for policy search: simultaneously estimate the parameters of a model of the transformed, deterministic MDP, and use this model to estimate both the value of a policy and the gradient of that value with respect to the policy parameters. Then, using these estimates, perform gradient descent search on the policy parameters. Since the transformation captures what is important about the original MDP for planning, we call our method the essential dynamics algorithm.

The next section gives an overview of the technique, developing the intuition behind it. In section 3 we describe the mathematical foundation of the algorithm, including bounds on the difference between values in the original and transformed MDPs. Section 4 describes an application of this technique to learning to ride a bicycle. The last section discusses these results, comparing them to previous work. On the bicycle riding task, given the simulator, the only domain knowledge needed is a shaping reward that decreases as the lean angle increases and as the angle to the goal increases. Compared to previous work on this problem, a near optimal policy is found in dramatically less simulated time, and with less domain knowledge.

2 Overview of the Essential Dynamics Algorithm

In the essential dynamics algorithm we learn a model of how states evolve with time, and then use this model to compute the value of the current policy. In addition, if the policy and model are from a parameterized family, we can compute the gradient of the value with respect to the parameters.

In putting this plan into practice, one difficulty is that state transitions are stochastic, so expected rewards must be computed. One way to compute them is to generate many trajectories and average over them, but this can be very time consuming. Instead we might be tempted to estimate only the mean of the state at each future time, and use the reward associated with that. However, we can do better. If the reward is quadratic, the expected reward is particularly simple. Given knowledge of the state at time t, we can then talk about the distribution of possible states at some later time. For a given distribution of states, let $\bar{s}$ denote the expected state. Then

$$E[r(s)] = \int \bigl( a(s - \bar{s})^2 + b(s - \bar{s}) + c \bigr)\, p(s)\, ds = a \operatorname{var}(s) + b(\bar{s} - \bar{s}) + c = a \operatorname{var}(s) + c \qquad (1)$$

where $a$, $b$ and $c$ depend on $\bar{s}$.
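As a concrete instance of Eq. (1) (an illustration, not an example taken from the memo), consider a purely quadratic penalty on the distance to a goal state g, $r(s) = -(s - g)^2$. Expanding about the mean state $\bar{s}$ gives

$$E[r(s)] = -E\bigl[(s - g)^2\bigr] = -(\bar{s} - g)^2 - \operatorname{var}(s),$$

which is Eq. (1) with $a = -1$, $b = -2(\bar{s} - g)$ and $c = -(\bar{s} - g)^2$: the expected reward depends on the state distribution only through its mean and variance.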

Suppose the policy depends on a vector of parameters θ. When interacting with the MDP, at every time t, after having taken action $a_{t-1}$ in state $s_{t-1}$ and arriving in state $s_t$:

1. $\tilde{\mu}(s_{t-1}, a_{t-1}) \Leftarrow s_t$
2. $\tilde{\nu}(s_{t-1}, a_{t-1}) \Leftarrow (s_t - \tilde{\mu}(s_{t-1}, a_{t-1}))^2$
3. $\bar{s}_t = s_t$
4. $\sigma_t^2 = 0$
5. $\tilde{V} = 0$
6. For every τ in t+1 .. t+n:
   a. $\bar{s}_\tau = \tilde{\mu}(\bar{s}_{\tau-1}, \pi(\bar{s}_{\tau-1}))$
   b. $\sigma_\tau^2 = \tilde{\nu}(\bar{s}_{\tau-1}, \pi(\bar{s}_{\tau-1})) + \sigma_{\tau-1}^2\, \bigl(\tilde{\mu}'_\pi(\bar{s}_{\tau-1})\bigr)^2$
   c. $\tilde{r}_\tau = r(\bar{s}_\tau) + \tfrac{1}{2} r''(\bar{s}_\tau)\, \sigma_\tau^2$
   d. $\tilde{V} = \tilde{V} + \gamma^{\tau - t}\, \tilde{r}_\tau$
7. Update the policy in the direction that increases $\tilde{V}$: $\theta = \theta + \alpha\, \partial\tilde{V}/\partial\theta$

Figure 1: The essential dynamics algorithm for a one dimensional state space. The notation $f(x) \Leftarrow a$ means "adjust the parameters that determine f to make f(x) closer to a", e.g. by gradient descent. $\tilde{\mu}'_\pi$ is the derivative of $\tilde{\mu}(s, \pi(s))$ with respect to s.

Thus, to calculate the expected reward, we don't need to know the full state distribution, but simply its mean and variance. Our model should therefore describe how the mean and variance evolve over time. If the state transitions are smooth, they can be approximated by a Taylor series. Let π be the current policy, and let $\mu_\pi(s)$ denote the expected state that results from taking action $\pi(s)$ in state s. If $\bar{s}_t$ denotes the mean state at time t, and $\sigma_t^2$ the variance, and if state transitions were deterministic, then to first order we would have

$$\bar{s}_{t+1} \approx \mu_\pi(\bar{s}_t), \qquad \sigma_{t+1}^2 \approx \left( \frac{d\mu_\pi(\bar{s}_t)}{ds} \right)^2 \sigma_t^2$$

where $\mu_\pi'$ is the derivative of $\mu_\pi$ with respect to state. For stochastic state transitions, let $\nu_\pi(s)$ be the variance of the state that results from taking action $\pi(s)$ in state s. It turns out that the variance at the next time step is simply $\nu_\pi(\bar{s}_t)$ plus the transformed variance from above, leading to

$$\bar{s}_{t+1} \approx \mu_\pi(\bar{s}_t), \qquad \sigma_{t+1}^2 \approx \nu_\pi(\bar{s}_t) + \left( \frac{d\mu_\pi(\bar{s}_t)}{ds} \right)^2 \sigma_t^2 \qquad (2)$$

Thus, we learn estimates $\tilde{\mu}$ and $\tilde{\nu}$ of $\mu$ and $\nu$ respectively, use Eq. (2) to estimate the mean and variance of future states, and Eq. (1) to calculate the expected reward. The resulting algorithm, which we call the essential dynamics algorithm, is presented in Figure 1.
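The loop in Figure 1 can be made concrete with a short sketch. The following Python is illustrative only and makes assumptions the memo does not: $\tilde{\mu}$ and $\tilde{\nu}$ are linear in a hand-picked feature vector, the reward r and its second derivative r'' are known in closed form, and the gradient in step 7 is approximated by central finite differences. All names (features, policy, EssentialDynamics1D, and the numeric constants) are hypothetical.

```python
import numpy as np

# Assumed, illustrative setup: scalar state, linear policy and linear models.
def features(s, a):
    return np.array([s, a, 1.0])            # phi(s, a)

def policy(theta, s):
    return theta[0] * s + theta[1]          # a simple parameterized policy

class EssentialDynamics1D:
    """Sketch of Figure 1 for a one-dimensional state space."""
    def __init__(self, r, r2, gamma=0.99, n=30, alpha=0.01, lr=0.1):
        self.w_mu = np.zeros(3)   # weights of the model mu~(s, a)
        self.w_nu = np.zeros(3)   # weights of the model nu~(s, a)
        self.r, self.r2 = r, r2   # reward and its second derivative
        self.gamma, self.n, self.alpha, self.lr = gamma, n, alpha, lr

    def update_model(self, s_prev, a_prev, s_t):
        # Steps 1-2: nudge mu~(s_prev, a_prev) toward s_t and
        # nu~(s_prev, a_prev) toward the squared residual (gradient descent).
        phi = features(s_prev, a_prev)
        mu = self.w_mu @ phi
        self.w_mu += self.lr * (s_t - mu) * phi
        nu = self.w_nu @ phi
        self.w_nu += self.lr * ((s_t - mu) ** 2 - nu) * phi

    def value_estimate(self, theta, s_t):
        # Steps 3-6: roll the model forward, accumulating discounted
        # expected rewards computed from the mean and variance.
        s_bar, var, V = s_t, 0.0, 0.0
        for k in range(1, self.n + 1):
            a = policy(theta, s_bar)
            phi = features(s_bar, a)
            # numerical derivative of mu~(s, pi(s)) with respect to s
            eps = 1e-4
            a_eps = policy(theta, s_bar + eps)
            dmu = (self.w_mu @ features(s_bar + eps, a_eps) - self.w_mu @ phi) / eps
            var = max(self.w_nu @ phi, 0.0) + var * dmu ** 2   # keep variance nonnegative
            s_bar = self.w_mu @ phi
            V += self.gamma ** k * (self.r(s_bar) + 0.5 * self.r2(s_bar) * var)
        return V

    def policy_step(self, theta, s_t):
        # Step 7: gradient ascent on V~ via central finite differences.
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            d = np.zeros_like(theta); d[i] = 1e-3
            grad[i] = (self.value_estimate(theta + d, s_t)
                       - self.value_estimate(theta - d, s_t)) / 2e-3
        return theta + self.alpha * grad
```

After each real transition $(s_{t-1}, a_{t-1}, s_t)$ one would call update_model and then policy_step, mirroring the simultaneous model estimation and policy search described above.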

The next section gives a formal derivation of the algorithm, and proves error bounds on the estimated state, variance, reward and value for the general n-dimensional case, where the reward is only approximately quadratic.

3 Derivation of the Essential Dynamics Algorithm

A Markov decision process (MDP) is a tuple $\langle S, D, A, P_{s,a}, r, \gamma \rangle$ where: S is a set of states; $D: S \to \mathbb{R}$ is the initial-state distribution; A is a set of actions; $P_{s,a}: S \to \mathbb{R}$ are the transition probabilities; $r: S \times A \to \mathbb{R}$ is the reward; and γ is the discount factor. This paper is concerned with continuous state and action spaces; in particular we assume $S = \mathbb{R}^{n_s}$ and $A = \mathbb{R}^{n_a}$. We use subscripts to denote time and superscripts to denote components of vectors and matrices. Thus, $s_t^i$ denotes the i-th component of the vector s at time t.

A (deterministic) policy is a mapping from a state to the action to be taken in that state, $\pi: S \to A$. Given a policy and a distribution $P_t$ of states at time t, such as the initial state distribution or the observed state, the distribution of states at future times is defined by the recursive relation

$$P_{\tau+1}(s) = \int_S P_{s', \pi(s')}(s)\, P_\tau(s')\, ds' \qquad \text{for } \tau \ge t.$$

Given such a distribution, we can define the expectation and the covariance matrix of a random vector x with respect to it, which we denote $E_t[x]$ and $\mathrm{cov}_t(x)$ respectively. Thus, $E_t[x] = \int x\, P_t(x)\, dx$ and $\mathrm{cov}_t(x)^{i,j} = E_t[(x^i - E_t[x^i])(x^j - E_t[x^j])]$. When $P_t$ is zero except for a single state $s_t$, we introduce $E[x \mid s_t]$ as a synonym for $E_t[x]$ which makes the distribution explicit.

Given an MDP, we define the limited horizon value function for a given policy as

$$V_\pi(s_t) = \sum_{\tau = t}^{t+n} \gamma^{\tau - t}\, E[\, r(s_\tau, \pi(s_\tau)) \mid s_t \,]$$

where the probability density at time t is zero except for state $s_t$. Also given a policy, we define two functions, the mean $\mu_\pi(s)$ and covariance matrix $\nu_\pi(s)$ of the next state. Thus, $\mu_\pi(s_t) = E[s_{t+1} \mid s_t]$ and $\nu_\pi(s_t) = E[(s_{t+1} - \mu_\pi(s_t))(s_{t+1} - \mu_\pi(s_t))^T \mid s_t]$. In policy search, we have a fixed set of policies Π and we try to find one that results in a value function with high value.

We transform the stochastic MDP M into a deterministic one $M' = \langle S', s_0', A', f', r', \gamma' \rangle$ as follows. A state in the new MDP is an ordered pair consisting of a state from S and a covariance matrix, denoted $(s, \Sigma)$. The new initial state is $s_0' = (E_D[s], \mathrm{cov}_D(s))$. The new action space is the set of all possible policies for M, that is, $A' = \{\pi \mid \pi: S \to A\}$. The state transition probabilities are replaced with a (deterministic) state transition function $f'(s_t', a_t')$, which gives the unique successor state that results from taking action $a_t' = \pi$ in state $s_t' = (s_t, \Sigma_t)$. We set

$$f'(s_t', a_t') = f'(s_t, \Sigma_t, \pi) = \bigl( \mu_\pi(s_t),\; \nu_\pi(s_t) + (\nabla \mu_\pi)\, \Sigma_t\, (\nabla \mu_\pi)^T \bigr).$$

The reward is $r'(s, \Sigma, \pi) = r(s) + \frac{1}{2}\,\mathrm{tr}\bigl( \nabla^2 r(s)\, \Sigma \bigr)$, where $\nabla^2 r(s)$ denotes the matrix of second derivatives of r with respect to each pair of state variables, $[\nabla^2 r(s)]^{i,j} = \partial^2 r(s)/\partial s^i \partial s^j$. Finally, $\gamma' = \gamma$.

The strength of the method comes from the theorems below, which state that the above transform approximately captures the dynamics of the original probabilistic MDP to the extent that the original dynamics are smooth. The first theorem bounds the error in approximating the state, the second the covariance, the third the reward, and the fourth the value.
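Before turning to the theorems, the transform can be made concrete with a small sketch (not from the memo; the function names and the finite-difference Jacobian are illustrative assumptions). Given the pair $(s, \Sigma)$, a policy, models of $\mu_\pi$ and $\nu_\pi$, and the reward with its Hessian, it computes one deterministic step of M′ and the transformed reward r′.

```python
import numpy as np

def transformed_step(s, Sigma, pi, mu, nu, r, hess_r, eps=1e-5):
    """One deterministic transition of M': (s, Sigma) -> (s', Sigma'), plus r'.

    mu(s, a)   -> mean next state (vector), a model of mu_pi
    nu(s, a)   -> covariance of the next state (matrix), a model of nu_pi
    r(s)       -> scalar reward; hess_r(s) -> its Hessian matrix
    """
    a = pi(s)
    s_next = mu(s, a)

    # Finite-difference Jacobian of s -> mu(s, pi(s)), i.e. grad mu_pi.
    n = len(s)
    J = np.zeros((n, n))
    for j in range(n):
        d = np.zeros(n)
        d[j] = eps
        J[:, j] = (mu(s + d, pi(s + d)) - s_next) / eps

    Sigma_next = nu(s, a) + J @ Sigma @ J.T
    r_prime = r(s) + 0.5 * np.trace(hess_r(s) @ Sigma)
    return s_next, Sigma_next, r_prime
```

Iterating this step for n steps and summing $\gamma^{\tau-t} r'$ gives the deterministic value V′ that stands in for the expected value of the stochastic MDP; Figure 1 is the scalar special case.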

Theorem 1. Fix a time t, a policy π, and a distribution of states $P_t$. Choose $M_{\mu''}$ and M such that $\left|\dfrac{\partial^2\mu^i_\pi(s)}{\partial s^j\partial s^k}\right| < M_{\mu''}$ for all $i, j, k = 1, \dots, n_s$ and all s, $\|\nabla\mu_\pi(\bar{s}_t)\| < M$, and $\|\mathrm{cov}_t(s_t, s_t)\|_F < M$, where $\|\cdot\|_F$ denotes the Frobenius norm. Let $\bar{s}_t$ be given, and define $\bar{s}_{t+1} = \mu_\pi(\bar{s}_t)$, $\varepsilon_t = \|E_t[s_t] - \bar{s}_t\|$ and $\varepsilon_{t+1} = \|E_t[s_{t+1}] - \bar{s}_{t+1}\|$. Then

$$\varepsilon_{t+1} < (\varepsilon_t + M_{\mu''})\left(\tfrac{3}{2}M + \tfrac{1}{2}\varepsilon_t^2\right).$$

Theorem 2. Suppose $M_\nu$ and $M_k$ are chosen so that $\left|\dfrac{\partial\nu^{i,j}(s)}{\partial s^k}\right| < M_\nu$ for all $i, j, k = 1, \dots, n_s$ and all s, $\bigl\|E_t[(s_t - E_t[s_t])^k]\bigr\|_F < M_k$ for k = 1, 2, 3, 4, $\|\bar{s}_{t+1}\| = \|\mu_\pi(\bar{s}_t)\| < M$, and all the conditions of Theorem 1 hold. Let $\Sigma_t$ be given, and define $\Sigma^{i,j}_{t+1} = \nu^{i,j}_\pi(\bar{s}_t) + \nabla\mu^i(\bar{s}_t)^T\,\Sigma_t\,\nabla\mu^j(\bar{s}_t)$. Let $\varepsilon^\Sigma_t = \mathrm{cov}_t(s_t, s_t) - \Sigma_t$, and similarly for $\varepsilon^\Sigma_{t+1}$. Then

$$\|\varepsilon^\Sigma_{t+1}\|_F \le \bigl(\|\varepsilon^\Sigma_t\|_F + \varepsilon_t + M_{\mu''} + M_\nu\bigr)\, M\, \bigl(10 + O(\varepsilon_t)\bigr).$$

Theorem 3. Suppose $\left|\dfrac{\partial^3 r(s)}{\partial s^i\partial s^j\partial s^k}\right| < M_r$ for all $i, j, k = 1, \dots, n_s$ and all s, $\sum_{i,j=1}^{n_s}\left|\dfrac{\partial^2 r(\bar{s}_t)}{\partial s^i\partial s^j}\right| < M$, $\|\nabla r(\bar{s}_t)\| < M$, and the conditions of the previous two theorems hold. Let $\varepsilon^r_t = E_t[r(s_t)] - r'(\bar{s}_t, \Sigma_t)$. Then

$$E_t[r(s_t)] = r'(\bar{s}_t, \Sigma_t) + \varepsilon^r_t = r(\bar{s}_t) + \tfrac{1}{2}\mathrm{tr}\bigl(\nabla^2 r(\bar{s}_t)\,\Sigma_t\bigr) + \varepsilon^r_t, \qquad \text{where} \quad |\varepsilon^r_t| < \bigl(\|\varepsilon^\Sigma_t\|_F + \varepsilon_t + M_r\bigr)\left(\tfrac{5}{3}M + O(\varepsilon_t)\right).$$

Theorem 4. Fix a time t, a policy π, and a distribution of states $P_t$. Let $\bar{s}_t$ and $\Sigma_t$ be given, and define $\bar{s}_\tau$ and $\Sigma_\tau$ for $\tau = t+1, \dots, t+n$ recursively as in Theorems 1 and 2 above. Let $M_{\varepsilon r}$ be an upper bound on $|\varepsilon^r_\tau|$ for all $\tau \in [t, t+n]$. Then, under the conditions of the above three theorems, $E[V(s_t)] = V'(\bar{s}_t, \Sigma_t) + \varepsilon^V_t$, where

$$|\varepsilon^V_t| < \frac{1 - \gamma^{n+1}}{1 - \gamma}\, M_{\varepsilon r}.$$

Proof: First, some preliminaries. In the first three theorems, which deal only with a single transition and a single distribution of states at time t, namely $P_t$, let $\bar{x} = E_{P_t}[x]$ for any random variable x. Note that for any vector x and square matrices A and B, $x^T A x = \mathrm{tr}(A(xx^T))$, where tr(·) denotes the trace of a matrix, $\mathrm{tr}(AB) \le \|A\|_F\|B\|_F$, and $\|xx^T\|_F = \|x\|^2$. In the statement of Theorem 2, $E_t[(s_t - E_t[s_t])^3]$ is a three dimensional matrix whose i, j, k element is $E_t[(s^i_t - \bar{s}^i_t)(s^j_t - \bar{s}^j_t)(s^k_t - \bar{s}^k_t)]$. Similarly, $E_t[(s_t - E_t[s_t])^4]$ is a four dimensional matrix, and if all of its elements are finite, then the lower powers must also be finite. The Frobenius norm of such matrices is simply the square root of the sum of the squares of all their elements. Also, if a, b, c and d are real numbers that are greater than zero, then $ab + cd < (a + c)(b + d)$. Note that, since $\mu_\pi$ is a vector valued function, $\nabla\mu_\pi(s)$ is a matrix. Since $\mu^i_\pi$, the i-th component of $\mu_\pi$, is a real valued function, $\nabla\mu^i_\pi(s) \in \mathbb{R}^{n_s}$. Because ν(s) is a matrix, $\nu^{i,j}(s) \in \mathbb{R}$. Let $\nabla^2\mu^i_\pi(x)$ denote the matrix of second partial derivatives of $\mu^i_\pi$ evaluated at x, with j, k element $\partial^2\mu^i_\pi(x)/\partial s^j\partial s^k$. For any s, let $\Delta_1 = \bar{s} - \bar{s}_t$, $\Delta_2 = s - \bar{s}$ and $\Delta = \Delta_1 + \Delta_2 = s - \bar{s}_t$.

Thus, $E_{P_t}[\Delta_2] = 0$ and

$$E_t[\Delta\Delta^T] = E_t[\Delta_1\Delta_1^T] + E_t[\Delta_2\Delta_2^T] = \Delta_1\Delta_1^T + \mathrm{cov}_t(s_t, s_t) = \Delta_1\Delta_1^T + \Sigma_t + \varepsilon^\Sigma_t.$$

Note that $\|\Delta_1\| = \varepsilon_t$.

Proof of Theorem 1: Expand $\mu^i_\pi(s)$ using a first order Taylor series with the Lagrange form of the remainder, namely

$$\mu^i_\pi(s) = \mu^i_\pi(\bar{s}_t) + \nabla\mu^i_\pi(\bar{s}_t)^T(s - \bar{s}_t) + \tfrac{1}{2}(s - \bar{s}_t)^T\,\nabla^2\mu^i_\pi(x)\,(s - \bar{s}_t) = \mu^i_\pi(\bar{s}_t) + \nabla\mu^i_\pi(\bar{s}_t)^T\Delta + \tfrac{1}{2}\Delta^T\,\nabla^2\mu^i_\pi(x)\,\Delta \qquad (3)$$

for some x on the line joining s and $\bar{s}_t$. Then

$$E_{P_t}[s^i_{t+1}] - \bar{s}^i_{t+1} = E_{P_t}[\mu^i_\pi(s_t)] - \mu^i_\pi(\bar{s}_t) = \nabla\mu^i_\pi(\bar{s}_t)^T\Delta_1 + \tfrac{1}{2}\mathrm{tr}\bigl(\nabla^2\mu^i_\pi(x)\,(\Sigma_t + \varepsilon^\Sigma_t + \Delta_1\Delta_1^T)\bigr). \qquad (4)$$

So

$$\varepsilon_{t+1} < \varepsilon_t M + \tfrac{1}{2}M_{\mu''}(M + \varepsilon_t^2) < (\varepsilon_t + M_{\mu''})\bigl(M + \tfrac{1}{2}(M + \varepsilon_t^2)\bigr).$$

Proof of Theorem 2: Let $M_k' = \|E_t[(s_t - \bar{s}_t)^k]\|_F$. By the mean value theorem, $\nu^{i,j}(s) = \nu^{i,j}(\bar{s}_t) + \nabla\nu^{i,j}(x)^T\Delta$ for some x on the line joining s and $\bar{s}_t$. Also, $\nu^{i,j}(s_t) = E[s^i_{t+1}s^j_{t+1}\mid s_t] - \mu^i(s_t)\mu^j(s_t)$, so that

$$\mathrm{cov}_{P_t}(s^i_{t+1}, s^j_{t+1}) = E[s^i_{t+1}s^j_{t+1}] - \bar{s}^i_{t+1}\bar{s}^j_{t+1} = E_{P_t}\bigl[E[s^i_{t+1}s^j_{t+1}\mid s_t]\bigr] - \bar{s}^i_{t+1}\bar{s}^j_{t+1} = \nu^{i,j}(\bar{s}_t) + E_{P_t}[\nabla\nu^{i,j}(x)^T\Delta] + E_{P_t}[\mu^i(s_t)\mu^j(s_t)] - \bar{s}^i_{t+1}\bar{s}^j_{t+1}. \qquad (5)$$

The second term is an error term; call it $\varepsilon'_{i,j}$. We have $|\varepsilon'_{i,j}| < M_\nu M_1'$. For the third term, we expand both $\mu^i$ and $\mu^j$ using Eq. (3) and multiply out the terms, obtaining

$$E_{P_t}[\mu^i(s_t)\mu^j(s_t)] = \mu^i(\bar{s}_t)\mu^j(\bar{s}_t) + \mu^i(\bar{s}_t)\,\nabla\mu^j(\bar{s}_t)^T\Delta_1 + \mu^j(\bar{s}_t)\,\nabla\mu^i(\bar{s}_t)^T\Delta_1 + \nabla\mu^i(\bar{s}_t)^T(\Sigma_t + \varepsilon^\Sigma_t + \Delta_1\Delta_1^T)\,\nabla\mu^j(\bar{s}_t)$$
$$\qquad + \tfrac{1}{2}\mu^i(\bar{s}_t)\,E_{P_t}[\Delta^T\nabla^2\mu^j_\pi(x)\Delta] + \tfrac{1}{2}\mu^j(\bar{s}_t)\,E_{P_t}[\Delta^T\nabla^2\mu^i_\pi(x)\Delta] + \tfrac{1}{2}\nabla\mu^i(\bar{s}_t)^T E_{P_t}[\Delta\,(\Delta^T\nabla^2\mu^j_\pi(x)\Delta)] + \tfrac{1}{2}\nabla\mu^j(\bar{s}_t)^T E_{P_t}[\Delta\,(\Delta^T\nabla^2\mu^i_\pi(x)\Delta)] + \tfrac{1}{4}E_{P_t}[(\Delta^T\nabla^2\mu^i_\pi(x)\Delta)(\Delta^T\nabla^2\mu^j_\pi(x)\Delta)].$$

All terms other than the first and the one involving $\Sigma_t$ are error terms; call their sum $\varepsilon''_{i,j}$, so that

$$E_{P_t}[\mu^i(s_t)\mu^j(s_t)] = \mu^i(\bar{s}_t)\mu^j(\bar{s}_t) + \nabla\mu^i(\bar{s}_t)^T\,\Sigma_t\,\nabla\mu^j(\bar{s}_t) + \varepsilon''_{i,j},$$

where

$$|\varepsilon''_{i,j}| < |\bar{s}^i_{t+1}|\,\varepsilon_t + |\bar{s}^j_{t+1}|\,\varepsilon_t + \|\nabla\mu^i(\bar{s}_t)\|\,\|\nabla\mu^j(\bar{s}_t)\|\,(\varepsilon_t^2 + \|\varepsilon^\Sigma_t\|_F) + \bigl(|\bar{s}^i_{t+1}| + |\bar{s}^j_{t+1}|\bigr)\,M_{\mu''}\,M_2' + \|\nabla\mu(\bar{s}_t)\|\,M_{\mu''}\,M_3' + \tfrac{1}{4}M_{\mu''}^2\,M_4'.$$

Lastly, let $\varepsilon'''_{i,j} = \mu^i(\bar{s}_t)\mu^j(\bar{s}_t) - \bar{s}^i_{t+1}\bar{s}^j_{t+1}$. By Theorem 1,

$$\varepsilon'''_{i,j} = \mu^i(\bar{s}_t)\mu^j(\bar{s}_t) - \bigl(\mu^i(\bar{s}_t) + \varepsilon^i_{t+1}\bigr)\bigl(\mu^j(\bar{s}_t) + \varepsilon^j_{t+1}\bigr) = -\mu^i(\bar{s}_t)\,\varepsilon^j_{t+1} - \mu^j(\bar{s}_t)\,\varepsilon^i_{t+1} - \varepsilon^i_{t+1}\varepsilon^j_{t+1},$$

where $\varepsilon^i_{t+1} = E_{P_t}[s^i_{t+1}] - \mu^i(\bar{s}_t)$ is the componentwise error bounded in Theorem 1. Substituting into Eq. (5), we obtain

$$\mathrm{cov}_{P_t}(s^i_{t+1}, s^j_{t+1}) = \nu^{i,j}(\bar{s}_t) + \varepsilon'_{i,j} + \nabla\mu^i(\bar{s}_t)^T\,\Sigma_t\,\nabla\mu^j(\bar{s}_t) + \varepsilon''_{i,j} + \varepsilon'''_{i,j},$$

so $\varepsilon^\Sigma_{t+1} = \varepsilon' + \varepsilon'' + \varepsilon'''$ and

$$\|\varepsilon^\Sigma_{t+1}\|_F < M_\nu M_1' + 2M\varepsilon_t + M(\varepsilon_t^2 + \|\varepsilon^\Sigma_t\|_F) + 2MM_{\mu''}M_2' + MM_{\mu''}M_3' + \tfrac{1}{4}M_{\mu''}^2M_4' + 2M(\varepsilon_t + M_{\mu''})\bigl(\tfrac{3}{2}M + \tfrac{1}{2}\varepsilon_t^2\bigr) + \Bigl((\varepsilon_t + M_{\mu''})\bigl(\tfrac{3}{2}M + \tfrac{1}{2}\varepsilon_t^2\bigr)\Bigr)^2.$$

Each term has at least one of the small bounds $\varepsilon_t$, $\|\varepsilon^\Sigma_t\|_F$, $M_{\mu''}$ or $M_\nu$. Using the inequality from the preliminaries, we can factor them out. The four $M_k'$ are bounded by $M_k + O(\varepsilon_t)$, as can be shown using the binomial theorem, e.g.

$$E_t[\|\Delta_1 + \Delta_2\|^3] \le E_t[\|\Delta_2\|^3] + 3\|\Delta_1\|\,E_t[\|\Delta_2\|^2] + 3\|\Delta_1\|^2\,E_t[\|\Delta_2\|] + \|\Delta_1\|^3 = E_t[\|\Delta_2\|^3] + O(\varepsilon_t).$$

Proof of Theorem 3: Expand r(s) using a second order Taylor series with the Lagrange form of the remainder, namely

$$r(s) = r(\bar{s}_t) + \nabla r(\bar{s}_t)^T\Delta + \tfrac{1}{2}\Delta^T\,\nabla^2 r(\bar{s}_t)\,\Delta + \tfrac{1}{6}\sum_{i,j,k=1}^{n_s}\frac{\partial^3 r(x)}{\partial s^i\partial s^j\partial s^k}\,\Delta^i\Delta^j\Delta^k \qquad (6)$$

for some x on the line joining s and $\bar{s}_t$. Call the last term ε'. Thus,

$$E_t[r(s_t)] = r(\bar{s}_t) + \nabla r(\bar{s}_t)^T\Delta_1 + \tfrac{1}{2}\mathrm{tr}\bigl(\nabla^2 r(\bar{s}_t)\,(\Sigma_t + \varepsilon^\Sigma_t + \Delta_1\Delta_1^T)\bigr) + E_t[\varepsilon'] = r'(\bar{s}_t, \Sigma_t) + \varepsilon^r_t$$

and

$$|\varepsilon^r_t| < \|\nabla r(\bar{s}_t)\|\,\varepsilon_t + \tfrac{1}{2}\bigl(\|\varepsilon^\Sigma_t\|_F + \varepsilon_t^2\bigr)M + \tfrac{1}{6}M_rM_3' < \bigl(\varepsilon_t + \|\varepsilon^\Sigma_t\|_F + M_r\bigr)\bigl(M + \tfrac{1}{2}M + \tfrac{1}{2}M\varepsilon_t + \tfrac{1}{6}M_3'\bigr).$$

Proof of Theorem 4:

$$E[V(s_t)] = \sum_{\tau=t}^{t+n}\gamma^{\tau-t}\,E_\tau[r(s_\tau)] = \sum_{\tau=t}^{t+n}\gamma^{\tau-t}\bigl(r'(\bar{s}_\tau, \Sigma_\tau) + \varepsilon^r_\tau\bigr) = V'(\bar{s}_t, \Sigma_t) + \sum_{\tau=t}^{t+n}\gamma^{\tau-t}\,\varepsilon^r_\tau.$$

So,

$$|\varepsilon^V_t| \le \sum_{\tau=t}^{t+n}\gamma^{\tau-t}\,|\varepsilon^r_\tau| \le M_{\varepsilon r}\sum_{\tau=t}^{t+n}\gamma^{\tau-t} = M_{\varepsilon r}\,\frac{1-\gamma^{n+1}}{1-\gamma}.$$

The above theorems state that as long as $\varepsilon_t$, $\|\varepsilon^\Sigma_t\|_F$, $M_{\mu''}$, $M_\nu$ and $M_r$ are small and M is finite, and given a good estimate of the mean and covariance of the state at some time, the transformed MDP will produce good estimates at later times, and hence the rewards and value function will also be good estimates. Note that no particular distribution of states is assumed, only that, essentially, the first four moments are bounded at every time. The most unusual conditions are that the reward r be roughly quadratic, and that the value function include only a limited number of future rewards. This motivates the use of shaping rewards.

4 Experiments

The code used for all experiments in this paper is available from www.metahuman.org/martin/research.html. The essential dynamics algorithm was applied to Randløv and Alstrøm's bicycle riding task [8], with the objective of riding a bicycle to a goal 1 km away. The five state variables were simply the lean angle, the handlebar angle, their time derivatives, and the angle to the goal. The two actions were the torque to apply to the handlebars and the horizontal displacement of the rider's center of mass from the bicycle's center line. The stochasticity of state transitions came from a uniform random number added to the rider displacement. If the lean angle exceeded π/15, the bicycle fell over and the run terminated.

If the variance of the state is not too large at every time step, then the variance term in the transformed reward can simply be considered another form of error, and only $\tilde\mu$ need be estimated. This was done here. A continuous time formulation was used where, instead of estimating the values of the state variables at the next time step, their derivatives were estimated. The model was of the form

$$\tilde\mu_w(s, a) = w \cdot \varphi(s, a)$$

where φ(s, a) was a vector of features and w was a vector of weights. The features were simply the state and action variables themselves. The derivative of each state variable was estimated using gradient descent on w with the error measure $\mathrm{err} = |\dot{s}^i - w \cdot \varphi(s, a)|$ and a learning rate of 1.0. This error measure was found to work better than the more traditional squared error. The squared error is minimized by the mean of the observed values, whereas the absolute value is minimized by the median [7]. The median is a more robust estimate of central tendency, i.e. less susceptible to outliers, and therefore may be a better choice in many practical situations. Model estimation was done online, simultaneously with policy search.

In the continuous formulation, the value function is the time integral of the reward times the discount factor. The future state was estimated using Euler integration [7]. While the bicycle simulator also used Euler integration, these choices were unrelated. In fact, Δt = 0.01 for the bicycle simulator and 0.051 for integrating the estimated reward. It was integrated for 30 time steps.
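A minimal sketch of this model-estimation and value-integration step follows. It makes assumptions the memo leaves unstated (how the observed state derivative is obtained, the exact feature layout, and the discount per step); all names are illustrative.

```python
import numpy as np

DT_MODEL = 0.051     # integration step used for the estimated reward (from the text)
N_STEPS = 30         # number of Euler steps over which the reward is integrated

def features(s, a):
    # Features are simply the state and action variables themselves.
    return np.concatenate([s, a])

class DerivativeModel:
    """Online linear model of the state derivatives, one weight row per state variable."""
    def __init__(self, n_state, n_action, lr=1.0):
        self.W = np.zeros((n_state, n_state + n_action))
        self.lr = lr

    def predict(self, s, a):
        return self.W @ features(s, a)      # estimated ds/dt

    def update(self, s, a, s_dot_observed):
        # Gradient descent on the absolute error |s_dot - W phi|; its gradient
        # with respect to W follows the sign of the residual only.
        phi = features(s, a)
        resid = s_dot_observed - self.W @ phi
        self.W += self.lr * np.outer(np.sign(resid), phi)

def estimated_value(model, policy, s0, reward, gamma=0.99):
    # Roll the learned derivatives forward with Euler integration and sum the
    # discounted rewards; the variance term is treated as part of the error here.
    s, V = s0.copy(), 0.0
    for k in range(N_STEPS):
        a = policy(s)
        s = s + DT_MODEL * model.predict(s, a)
        V += gamma ** (k + 1) * reward(s)
    return V
```

The policy parameters θ are then adjusted using the gradient of this value estimate, with the normalized step size described below.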

[Figure 2: The left graph shows length of episode (sec) vs. training time (sec) for 10 runs. The dashed line indicates the optimal policy. Stable riding was achieved within 200 simulated seconds. The right graph shows angle to goal (radians) vs. simulated time (sec) for a single episode starting after 3000 simulated seconds of training.]

The shaping reward was the square of the angle to the goal plus 10 times the square of the lean angle. The policy was a weighted sum of features, with a small Gaussian added for exploration, $\pi(s) = \theta \cdot \varphi(s) + N(0, 0.05)$. The features were simply the state variables themselves.

When the model is poor or the policy parameters are far from a local optimum, $\partial V/\partial\theta$ can be quite large, resulting in a large gradient descent step which may overshoot its region of applicability. This can be addressed by reducing the learning rate, but then learning becomes interminably slow. Thus, the gradient descent rule was modified to

$$\Delta\theta_t = \alpha\,\frac{\partial V/\partial\theta}{\beta + \|\partial V/\partial\theta\|}.$$

Near an optimum, when $\|\partial V/\partial\theta\| \ll \beta$, this reduces to the usual rule with a learning rate of α/β. In these experiments, α = 0.01 and β = 1.0.

A graph of episode time vs. learning time is shown in Figure 2. After falling over between 40 and 60 times, the controller was able to ride to the goal, or until the time limit, without falling over. After a single such episode, it consistently rode directly to the goal in a near minimum amount of time. The resulting policy was essentially an optimal policy.

5 Discussion

For learning and planning in complex worlds with continuous, high dimensional state and action spaces, the goal is not so much to converge on a perfect solution, but to find a good solution within a reasonable time. Such problems often use a shaping reward to accelerate learning. For a large class of such problems, this paper proposes approximating the problem dynamics in such a way that the mean and covariance of the future state can be estimated from the observed current state. We have shown that, under certain conditions, the rewards in the approximate MDP are close to those in the original, with an error that grows boundedly as time increases. Thus, if the rewards are only summed for a limited number of steps ahead, the resulting value will approximate the value of the original system. Learning in this transformed problem is considerably easier than in the original, and both model estimation and policy search can be achieved online.

The simulation of bicycle riding is a good example of a problem where the value function is complex and hard to approximate, yet simple policies produce near optimal solutions. Using a traditional value function approximation approach, Randløv needed to augment the state with the second derivative of the lean angle (Ω) and provide shaping rewards [8]. The resulting algorithm took 1700 episodes to ride stably, and 400 episodes to get to the goal for the first time. The resulting policies tended to ride in circles and precess toward the goal, riding roughly 7 km to get to a goal 1 km away.

In contrast, when the action is a weighted sum of (very simple) features, random search can find near optimal policies. This was tested experimentally; 0.55% of random policies consistently reached the goal when Ω was included in the state, and 0.30% did when it wasn't.¹

What is more, over half of these policies had a path length within 1% of the best reported solution. Policies that rode stably but not to the goal were obtained 0.89% and 0.4% of the time respectively. Thus, a random search of policies needs only a few hundred episodes to find a near optimal policy. The essential dynamics algorithm consistently finds such near optimal policies, and the author is aware of only one other algorithm which does, the PEGASUS algorithm of [5].

The experiments in this paper took 40 to 60 episodes to ride stably, that is, to the goal or until the time limit without falling over. After a single such episode, the policy consistently rode directly to the goal in a near minimum amount of time. In contrast, PEGASUS used at least 450 episodes to evaluate each policy.² One reasonable initial policy is to always apply zero torque to the handlebars and zero displacement of body position. This falls over in an average of 1.74 seconds, so PEGASUS would need 780 simulated seconds to evaluate such a policy. The essential dynamics algorithm learns to ride stably in approximately 200 simulated seconds, and in the second 780 simulated seconds it will have found a near optimal policy.

This was achieved using very little domain knowledge. Ω was not needed in the state, and the features were trivial. The essential dynamics algorithm can be used for online learning, or can learn from trajectories provided by other policies; that is, it can learn by watching. In the bicycle experiments, the essential dynamics algorithm needed many times more computing power per simulated second than PEGASUS, although it was still faster than real time on a 1 GHz mobile Pentium III, and therefore could presumably be used for learning on a real bicycle. The experiments in section 4 added the square of the lean angle to the shaping reward, but did not use any information about dynamics (i.e. velocities or accelerations), nor about the handlebars. In fact, the shaping reward simply corresponded to the common sense advice "stay upright and head toward the goal."

However, these advantages do not come without drawbacks. The essential dynamics algorithm only does policy search in an approximation to the original MDP, so an optimal policy for this approximate MDP won't, in general, be optimal for the original MDP. The theorems in section 3 give bounds on this error, and for bicycle riding this error is small.

Conclusion

This paper has presented an algorithm for online policy search in MDPs with continuous state and action spaces. A stochastic MDP is transformed into a deterministic MDP which captures the essential dynamics of the original. Policy search can then be performed in this transformed MDP. Error bounds were given, and the technique was applied to a simulation of bicycle riding. The algorithm found near optimal solutions with less domain knowledge and orders of magnitude less time than existing techniques.

Acknowledgements

The author would like to thank Leslie Kaelbling, Ali Rahimi and especially Kevin Murphy for enlightening comments and discussions of this work.

1. Our experiments contained two conditions, namely with or without Ω in the state, resulting in 5 or 6 state variables. The features were the state variables themselves, state and action variables were scaled to roughly the range [-1, +1], weights were chosen uniformly from [-2, +2], and each policy was run 30 times. In 100,000 policies per condition, 549 (0.55%) reached the goal all 30 times when Ω was included, and 300 (0.30%) when it wasn't. For such policies, the median riding distance was 1009 m and 1008 m respectively. The code used is available on the web site.

2. [5] evaluated a given policy by simulating it 30 times. The derivative with respect to each of the 15 weights was evaluated using finite differences, requiring another 30 simulations per weight, for a total of 30 × 15 = 450 simulations. Often, the starting weights at a given stage were evaluated during the previous stage, so only the derivatives need to be calculated.

References

[1] Colombetti, M. & Dorigo, M. (1994) Training agents to perform sequential behavior. Adaptive Behavior, 2(3), pp. 247-275.
[2] Forbes, J. & Andre, D. (2000) Real-time reinforcement learning in continuous domains. In AAAI Spring Symposium on Real-Time Autonomous Systems.
[3] Mataric, M.J. (1994) Reward functions for accelerated learning. In W.W. Cohen and H. Hirsch (eds.), Proc. 11th Intl. Conf. on Machine Learning.
[4] Ng, A. et al. (1999) Policy invariance under reward transformations: Theory and application to reward shaping. In Proc. 16th Intl. Conf. on Machine Learning, pp. 278-287.
[5] Ng, A. & Jordan, M. (2000) PEGASUS: A policy search method for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence (UAI), Proc. of the Sixteenth Conf., pp. 406-415.
[6] Peshkin, L. et al. (2000) Learning to Cooperate via Policy Search. In Uncertainty in Artificial Intelligence (UAI), Proc. of the Sixteenth Conf., pp. 307-314.
[7] Press, W.H. et al. (1992) Numerical Recipes: The Art of Scientific Computing. Cambridge University Press.
[8] Randløv, J. (2000) Shaping in Reinforcement Learning by Changing the Physics of the Problem. In Proc. Intl. Conf. on Machine Learning, pp. 767-774.
[9] Santamaría, J.C. et al. (1998) Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces. Adaptive Behavior, 6(2), 1998.
[10] Strens, M.J.A. & Moore, A.W. (2002) Policy Search using Paired Comparisons. Journal of Machine Learning Research, v. 3, pp. 921-950.