A Generalized Reinforcement-Learning Model: Convergence and Applications

URL: ftp://iserv.iki.kfki.hu/pub/papers/icml96/szepes.greinf.ps.Z

Michael L. Littman
Department of Computer Science
Brown University
Providence, RI, USA
mlittman@cs.brown.edu

Csaba Szepesvári*
Research Group of Artificial Intelligence
"Jozsef Attila" University, Szeged
Szeged 6720, Aradi vértanúk tere 1., HUNGARY
szepes@math.u-szeged.hu

* Also: Department of Adaptive Systems, Joint Department of the "Jozsef Attila" University, Szeged, and the Institute of Isotopes of the Hungarian Academy of Sciences, Budapest 1525, P.O. Box 77, HUNGARY.

Abstract

Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (MDP) model is a popular way of formalizing the reinforcement-learning problem, but it is by no means the only way. In this paper, we show how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games and MDPs under a worst-case optimality criterion as special cases. The basis of this extension is a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.

1 INTRODUCTION

Reinforcement learning is the process by which an agent improves its behavior in an environment via experience. A reinforcement-learning scenario is defined by the experience presented to the agent at each step, and by the criterion for evaluating the agent's behavior. One particularly well-studied reinforcement-learning scenario is that of a single agent maximizing expected discounted total reward in a finite-state environment; in this scenario experiences are of the form $\langle x, a, y, r \rangle$, with $x$ a state, $a$ an action, $y$ the resulting state, and $r$ the agent's scalar immediate reward. A discount parameter $0 \le \gamma < 1$ controls the degree to which future rewards are significant compared to immediate rewards.

The theory of Markov decision processes has been used as a theoretical foundation for important results concerning this reinforcement-learning scenario. A (finite) Markov decision process (MDP) is defined by the tuple $\langle S, A, P, R \rangle$, where $S$ is a finite set of states, $A$ a finite set of actions, $P$ a transition function, and $R$ a reward function. The optimal behavior for an agent in an MDP depends on the optimality criterion; for the infinite-horizon expected discounted total-reward criterion, the optimal behavior can be found by identifying the optimal value function, defined recursively by

$$V^*(x) = \max_{a \in A} \Big( R(x,a) + \gamma \sum_{y \in S} P(x,a,y)\, V^*(y) \Big)$$

for all states $x \in S$, where $R(x,a)$ is the immediate reward for taking action $a$ from state $x$, $\gamma$ the discount factor, and $P(x,a,y)$ the probability that state $y$ is reached from state $x$ when action $a \in A$ is chosen. These simultaneous equations, known as the Bellman equations, can be solved using a variety of techniques ranging from successive approximation to linear programming (Puterman, 1994).

In the absence of complete information regarding the transition and reward functions, reinforcement-learning methods can be used to find optimal value functions. Researchers have explored model-free (direct) methods, such as Q-learning (Watkins and Dayan, 1992), and model-based (indirect) methods, such as prioritized sweeping (Moore and Atkeson, 1993), and many converge to optimal value functions under the proper conditions (Tsitsiklis, 1994; Jaakkola et al., 1994; Gullapalli and Barto, 1994).
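To make the successive-approximation approach concrete, here is a minimal sketch (an illustration, not from the original paper) of value iteration on a small tabular MDP; the array layout and stopping tolerance are assumptions made for this example.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Successive approximation to the optimal value function V*.

    P: transition probabilities, shape (S, A, S), P[x, a, y].
    R: immediate rewards, shape (S, A), R[x, a].
    gamma: discount factor with 0 <= gamma < 1.
    """
    V = np.zeros(R.shape[0])
    while True:
        # Bellman backup: Q(x, a) = R(x, a) + gamma * sum_y P(x, a, y) V(y)
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)          # maximize over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Because the backup is a contraction with index $\gamma$, the iterates converge geometrically to the unique fixed point $V^*$; this is the sense in which "successive approximation" solves the Bellman equations.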

Not all reinforcement-learning scenarios of interest can be modeled as MDPs. For example, a great deal of reinforcement-learning research has been directed to the problem of solving two-player games (e.g., Tesauro, 1995), and the reinforcement-learning algorithms for solving MDPs and their convergence proofs do not apply directly to games. In one form of two-player game, experiences are of the form $\langle x, a, y, r \rangle$, where states $x$ and $y$ contain additional information concerning which player (maximizer or minimizer) gets to choose the action in that state. There are deep similarities between MDPs and this type of game; for example, it is possible to define a set of Bellman equations for the optimal minimax value of a two-player zero-sum game,

$$V^*(x) = \begin{cases} \max_{a \in A} \big( R(x,a) + \gamma \sum_y P(x,a,y)\, V^*(y) \big), & \text{if the maximizer moves in } x \\ \min_{a \in A} \big( R(x,a) + \gamma \sum_y P(x,a,y)\, V^*(y) \big), & \text{if the minimizer moves in } x, \end{cases}$$

where $R(x,a)$ is the reward to the maximizing player. When $0 \le \gamma < 1$, these equations have a unique solution and can be solved by successive-approximation methods. In addition, we show that simple extensions of several reinforcement-learning algorithms for MDPs converge to optimal value functions in these games.
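As an illustration of the minimax Bellman equations above, the following sketch (not part of the paper; the data layout is an assumption) performs one successive-approximation backup for an alternating Markov game, maximizing in the maximizer's states and minimizing in the minimizer's states.

```python
import numpy as np

def alternating_game_backup(P, R, gamma, V, is_max_state):
    """One application of the Bellman operator for an alternating Markov game.

    P: shape (S, A, S), transition probabilities P[x, a, y].
    R: shape (S, A), reward to the maximizing player for action a in state x.
    V: shape (S,), current value estimate.
    is_max_state: boolean array of shape (S,); True where the maximizer moves.
    """
    Q = R + gamma * (P @ V)                      # shape (S, A)
    # Maximizer states take the max over actions, minimizer states the min.
    return np.where(is_max_state, Q.max(axis=1), Q.min(axis=1))
```

Iterating this backup plays the same role as value iteration does for MDPs: the operator is a contraction for $\gamma < 1$, so repeated application converges to the unique minimax value function.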
In this paper, we introduce a generalized Markov decision process model with applications to reinforcement learning, and list some important results concerning the model. Generalized MDPs provide a foundation for the use of reinforcement learning in MDPs and games, as well as in risk-sensitive reinforcement learning (Heger, 1994), exploration-sensitive reinforcement learning (John, 1995), and reinforcement learning in simultaneous-action games (Littman, 1994). Our main theorem addresses conditions for the convergence of asynchronous stochastic processes and shows how these conditions relate to conditions for convergence of a corresponding synchronous process; it can be used to prove the convergence of model-free and model-based reinforcement-learning algorithms under a variety of reinforcement-learning scenarios.

In Section 2, we present generalized MDPs and motivate their form via two detailed examples. In Section 3, we describe a stochastic-approximation theorem, and in Section 4 we show several applications of the theorem that prove the convergence of learning processes in generalized MDPs.

2 THE GENERALIZED MODEL

In this section, we introduce our generalized MDP model. We begin by summarizing some of the more significant results regarding the standard MDP model and some important results for two-player games.

2.1 MARKOV DECISION PROCESSES

To provide a point of departure for our generalization of Markov decision processes, we first describe the use of reinforcement learning in MDPs; proofs of the unattributed claims can be found in Puterman's (1994) MDP book.

The ultimate target of learning is an optimal policy. A policy is some function that tells the agent which actions should be chosen under which circumstances. A policy is optimal under the expected discounted total reward criterion if, with respect to the space of all possible policies, it maximizes the expected discounted total reward from all states. Directly maximizing over the space of all possible policies is impractical. However, MDPs have an important property that makes it unnecessary to consider such a broad space of possibilities. We say a policy is stationary and deterministic if it maps directly from states to actions, ignoring everything else, and we write $\pi(x)$ as the action chosen by $\pi$ when the current state is $x$. In expected discounted total reward MDP environments, there is always a stationary deterministic policy that is optimal; we will use the word "policy" to mean stationary deterministic policy, unless otherwise stated.

The value function for a policy $\pi$, $V^\pi$, maps states to their expected discounted total reward under policy $\pi$. It can be defined by the simultaneous equations

$$V^\pi(x) = R(x, \pi(x)) + \gamma \sum_y P(x, \pi(x), y)\, V^\pi(y)$$

for all $x \in S$. The optimal value function $V^*$ is the value function of an optimal policy; it is unique for $0 \le \gamma < 1$. The myopic policy with respect to a value function $V$ is the policy $\pi_V$ such that

$$\pi_V(x) = \arg\max_a \Big( R(x,a) + \gamma \sum_y P(x,a,y)\, V(y) \Big).$$

A myopic policy with respect to the optimal value function is optimal.
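The myopic (greedy) policy can be read off directly from a one-step backup; a minimal sketch under the same assumed array layout as above:

```python
import numpy as np

def myopic_policy(P, R, gamma, V):
    """Return the myopic (greedy) policy pi_V with respect to V.

    pi_V(x) = argmax_a [ R(x, a) + gamma * sum_y P(x, a, y) V(y) ].
    """
    Q = R + gamma * (P @ V)
    return Q.argmax(axis=1)        # one action index per state
```

Applied to the optimal value function $V^*$, this recovers an optimal policy, which is what makes computing $V^*$ sufficient for control.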

The Bellman equations can be operationalized in the form of the dynamic-programming operator $T$, which maps value functions to value functions:

$$[TV](x) = \max_a \Big( R(x,a) + \gamma \sum_y P(x,a,y)\, V(y) \Big).$$

For $0 \le \gamma < 1$, successive applications of $T$ to a value function bring it closer and closer to the optimal value function $V^*$, which is the unique fixed point of $T$: $V^* = TV^*$.

In reinforcement learning, $R$ and $P$ are not known in advance. In model-based reinforcement learning, $R$ and $P$ are estimated on-line, and the value function is updated according to the approximate dynamic-programming operator derived from these estimates; this algorithm converges to the optimal value function under a wide variety of choices of the order in which states are updated (Gullapalli and Barto, 1994).

The method of Q-learning (Watkins and Dayan, 1992) uses experience to estimate the optimal value function without ever explicitly approximating $R$ and $P$. The algorithm estimates the optimal Q function

$$Q^*(x,a) = R(x,a) + \gamma \sum_y P(x,a,y)\, V^*(y),$$

from which the optimal value function can be computed via $V^*(x) = \max_a Q^*(x,a)$. Given the experience at step $t$, $\langle x_t, a_t, y_t, r_t \rangle$, and the current estimate $Q_t(x,a)$ of the optimal Q function, Q-learning updates

$$Q_{t+1}(x_t, a_t) := (1 - \alpha_t(x_t, a_t))\, Q_t(x_t, a_t) + \alpha_t(x_t, a_t) \big( r_t + \gamma \max_a Q_t(y_t, a) \big),$$

where $0 \le \alpha_t(x,a) \le 1$ is a time-dependent learning rate controlling the blending of new estimates with old estimates for each state-action pair. The estimated Q function converges to $Q^*$ under the proper conditions (Watkins and Dayan, 1992).
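A minimal sketch of the tabular Q-learning update just described (an illustration only, not the paper's code; the table layout and learning-rate handling are assumptions):

```python
import numpy as np

def q_learning_update(Q, x, a, y, r, gamma, alpha):
    """One Q-learning step for the experience <x, a, y, r>.

    Q: array of shape (S, A) holding the current estimate of Q*.
    alpha: learning rate alpha_t(x, a) in [0, 1] for this state-action pair.
    """
    target = r + gamma * Q[y].max()            # r_t + gamma * max_a' Q_t(y_t, a')
    Q[x, a] = (1.0 - alpha) * Q[x, a] + alpha * target
    return Q
```

Under the usual Robbins-Monro conditions on the learning rates (every state-action pair updated infinitely often, rates summing to infinity but square-summable), the table converges to $Q^*$, as discussed in Section 4.1.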
2.2 ALTERNATING MARKOV GAMES

In alternating Markov games, two players take turns issuing actions to maximize their expected discounted total reward. The model is defined by the tuple $\langle S_1, S_2, A, B, P, R \rangle$, where $S_1$ is the set of states in which player 1 issues actions from the set $A$, $S_2$ is the set of states in which player 2 issues actions from the set $B$, $P$ is the transition function, and $R$ is the reward function for player 1. In the zero-sum games we consider, the rewards to player 2 (the minimizer) are simply the additive inverse of the rewards to player 1 (the maximizer). Markov decision processes are a special case of alternating Markov games in which $S_2 = \emptyset$. Condon (1992) proves this and the other unattributed results in this section.

A popular optimality criterion for alternating Markov games is discounted minimax optimality. Under this criterion, the maximizer chooses actions to maximize its reward against the minimizer's best possible counter-policy. A pair of policies is in equilibrium if neither player has an incentive to change policies if the other player's policy remains fixed. The value function for a pair of equilibrium policies is the optimal value function for the game; it is unique when $0 \le \gamma < 1$, and can be found by successive approximation. For both players, there is always a deterministic stationary optimal policy. A myopic policy with respect to the optimal value function is optimal.

Dynamic-programming operators, Bellman equations, and reinforcement-learning algorithms can be defined for alternating Markov games by starting with the definitions used in MDPs and changing the maximum operators to either maximums or minimums conditioned on the state. We show below that the resulting algorithms share their convergence properties with the analogous algorithms for MDPs.

2.3 GENERALIZED MDPS

In alternating Markov games and MDPs, optimal behavior can be specified by the Bellman equations; a myopic policy with respect to the optimal value function is optimal. In this section, we generalize the Bellman equations to define optimal behavior for a broad class of reinforcement-learning models. The objective criterion used in these models is additive in that the value of a policy is some measure of the total reward received.

The generalized Bellman equations can be written

$$V^*(x) = \bigotimes \Big( R(x,a) + \gamma \bigoplus_{x,a} V^* \Big). \qquad (1)$$

Here "$\bigotimes$" is an operator that summarizes values over actions as a function of the state, and "$\bigoplus_{x,a}$" is an operator that summarizes values over next states as a function of the state and action. For Markov decision processes, $\bigotimes f(x, \cdot) = \max_a f(x,a)$ and $\bigoplus_{x,a} g = \sum_y P(x,a,y)\, g(y)$. For alternating Markov games, $\bigoplus_{x,a}$ is the same and $\bigotimes f(x, \cdot) = \max_a f(x,a)$ or $\min_a f(x,a)$ depending on whether $x$ is in $S_1$ or $S_2$. Many models can be represented in this framework; see Section 4.
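To make the role of the two summary operators concrete, here is a small sketch (an illustration only; the function signatures are assumptions) of one generalized backup in which both $\bigotimes$ and $\bigoplus$ are supplied as arguments:

```python
import numpy as np

def generalized_backup(R, gamma, V, next_state_summary, action_summary):
    """One application of the generalized Bellman operator (Equation 1).

    next_state_summary(x, a, V) implements the next-state summary, e.g. an
    expectation over next states for MDPs or a worst case for risk-sensitive
    models. action_summary(x, values) implements the summary over actions,
    e.g. max for MDPs or min in the minimizer's states of a game.
    """
    S, A = R.shape
    V_new = np.empty(S)
    for x in range(S):
        values = np.array([R[x, a] + gamma * next_state_summary(x, a, V)
                           for a in range(A)])
        V_new[x] = action_summary(x, values)
    return V_new

# Example instantiations (assuming a transition array P of shape (S, A, S)):
# MDP:            next_state_summary = lambda x, a, V: P[x, a] @ V
#                 action_summary     = lambda x, vals: vals.max()
# Risk-sensitive: next_state_summary = lambda x, a, V: V[P[x, a] > 0].min()
```

Any choice of non-expansion operators plugged in here yields a contraction for $\gamma < 1$, which is what makes the generalized model tractable.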

From a reinforcement-learning perspective, the value functions defined by the generalized MDP model can be interpreted as the total value of the rewards received by an agent selecting actions in a stochastic environment. The agent begins in state $x$, takes action $a$, and ends up in state $y$. The $\bigoplus_{x,a}$ operator defines how the value of the next state should be used in assigning value to the current state. The $\bigotimes$ operator defines how an optimal agent should choose actions.

When $0 \le \gamma < 1$ and $\bigotimes$ and $\bigoplus_{x,a}$ are non-expansions, the generalized Bellman equations have a unique optimal solution, and therefore, the optimal value function is well defined. The $\bigotimes$ operator is a non-expansion if

$$\Big| \bigotimes f_1(x, \cdot) - \bigotimes f_2(x, \cdot) \Big| \le \max_a \big| f_1(x,a) - f_2(x,a) \big|$$

for all $f_1$, $f_2$, and $x$. An analogous condition defines when $\bigoplus_{x,a}$ is a non-expansion. Many natural operators are non-expansions, such as max, min, midpoint, median, mean, and fixed weighted averages of these operations. Several previously described reinforcement-learning scenarios are special cases of this generalized MDP model, including computing the expected return of a fixed policy (Sutton, 1988), finding the optimal risk-averse policy (Heger, 1994), and finding the optimal exploration-sensitive policy (John, 1995).

As with MDPs, we can define a dynamic-programming operator

$$[TV](x) = \bigotimes \Big( R(x,a) + \gamma \bigoplus_{x,a} V \Big). \qquad (2)$$

The operator $T$ is a contraction mapping for $0 \le \gamma < 1$. This means

$$\sup_x \big| [TV_1](x) - [TV_2](x) \big| \le \gamma \sup_x \big| V_1(x) - V_2(x) \big|,$$

where $V_1$ and $V_2$ are arbitrary value functions and $0 \le \gamma < 1$ is the index of contraction.

We can define a notion of stationary myopic policies with respect to a value function $V$; it is a (stochastic) policy $\pi_V$ for which $T^{\pi_V} V = TV$, where

$$[T^\pi V](x) = \sum_a \pi(x,a) \Big( R(x,a) + \gamma \bigoplus_{x,a} V \Big).$$

Here $\pi(x,a)$ represents the probability that an agent following $\pi$ would choose action $a$ in state $x$. To be certain that every value function possesses a myopic policy, we require that the operator $\bigotimes$ satisfy the following property: for all functions $f$ and states $x$, $\min_a f(x,a) \le \bigotimes f(x, \cdot) \le \max_a f(x,a)$.

The value function with respect to a policy $\pi$, $V^\pi$, can be defined by the simultaneous equations $V^\pi = T^\pi V^\pi$; it is unique when $T^\pi$ is a contraction mapping. A policy is optimal if it is myopic with respect to its own value function. If $\pi$ is an optimal policy, then $V^\pi = V^*$ because it solves the Bellman equation: $V^\pi = T^\pi V^\pi = T V^\pi$.

The next section describes a general theorem that can be used to prove the convergence of several reinforcement-learning algorithms for these and other models.

3 CONVERGENCE THEOREM

The process of finding an optimal value function can be viewed in the following general way. At any moment in time, there is a set of values representing the current approximation of the optimal value function. On each iteration, we apply some dynamic-programming operator, perhaps modified by experience, to the current approximation to generate a new approximation. Over time, we would like the approximation to tend toward the optimal value function.

In this process, there are two types of approximation going on simultaneously. The first is an approximation of the dynamic-programming operator for the underlying model, and the second is the use of the approximate dynamic-programming operator to find the optimal value function. This section presents a theorem that gives a set of conditions under which this type of simultaneous stochastic approximation converges to an optimal value function.

First, we need to define the general stochastic process. Let the set $X$ be the states of the model, and the set $B(X)$ of bounded, real-valued functions over $X$ be the set of value functions. Let $T : B(X) \to B(X)$ be an arbitrary contraction mapping with fixed point $V^*$. If we had direct access to the contraction mapping $T$, we could use it to successively approximate $V^*$. In most reinforcement-learning scenarios, $T$ is not available and we must use experience to construct approximations of $T$.
Consider a sequence of random operators $T_t : B(X) \to (B(X) \to B(X))$ and define $U_{t+1} = [T_t U_t]V$, where $V$ and $U_0 \in B(X)$ are arbitrary value functions. We say $T_t$ approximates $T$ at $V$ if $U_t$ converges to $TV$ with probability 1 uniformly over $X$ (a sequence of functions $f_n$ converges to $f$ with probability 1 uniformly over $X$ if, for the events $w$ for which $f_n(w, \cdot) \to f$, the convergence is uniform in $x$).

The idea is that $T_t$ is a randomized version of $T$ that uses $U_t$ as "memory" to converge to $TV$. The following theorem shows that, under the proper conditions, we can use the sequence $T_t$ to estimate the fixed point $V^*$ of $T$.

Theorem 1. Let $T$ be an arbitrary mapping with fixed point $V^*$, and let $T_t$ approximate $T$ at $V^*$. Let $V_0$ be an arbitrary value function, and define $V_{t+1} = [T_t V_t]V_t$. If there exist functions $0 \le F_t(x) \le 1$ and $0 \le G_t(x) \le 1$ satisfying the conditions below with probability one, then $V_t$ converges to $V^*$ with probability 1 uniformly over $X$:

1. for all $U_1$ and $U_2 \in B(X)$, and all $x \in X$,
   $|([T_t U_1]V^*)(x) - ([T_t U_2]V^*)(x)| \le G_t(x)\, |U_1(x) - U_2(x)|$;
2. for all $U$ and $V \in B(X)$, and all $x \in X$,
   $|([T_t U]V^*)(x) - ([T_t U]V)(x)| \le F_t(x)\, \sup_{x'} |V^*(x') - V(x')|$;
3. for all $k > 0$, $\prod_{t=k}^{n} G_t(x)$ converges to zero uniformly in $x$ as $n$ increases; and,
4. there exists $0 \le \gamma < 1$ such that for all $x \in X$ and large enough $t$, $F_t(x) \le \gamma\, (1 - G_t(x))$.

Note that from the conditions of the theorem, it follows that $T$ is a contraction operator at $V^*$ with index of contraction $\gamma$. The theorem is proven in a more detailed version of this paper (Szepesvári and Littman, 1996).

We next describe some of the intuition behind the statement of the theorem and its conditions. The iterative approximation of $V^*$ is performed by computing $V_{t+1} = [T_t V_t]V_t$, where $T_t$ approximates $T$ with the help of the "memory" present in $V_t$. Because of Conditions 1 and 2, $G_t(x)$ is the extent to which the estimated value function depends on its present value and $F_t(x) \le 1 - G_t(x)$ is the extent to which the estimated value function is based on "new" information (this reasoning becomes clearer in the context of the applications in Section 4).

In some applications, such as Q-learning, the contribution of new information needs to decay over time to ensure that the process converges. In this case, $G_t(x)$ needs to converge to one; Condition 3 allows this as long as the convergence is slow enough to incorporate sufficient information for the process to converge. Condition 4 links the values of $G_t(x)$ and $F_t(x)$ through some quantity $\gamma < 1$. If it were somehow possible to update the values synchronously over the entire state space, the process would converge to $V^*$ even when $\gamma = 1$. In the more interesting asynchronous case, when $\gamma = 1$ the long-term behavior of $V_t$ is not immediately clear; it may even be that $V_t$ converges to something other than $V^*$. The requirement that $\gamma < 1$ ensures that the use of outdated information in the asynchronous updates does not cause a problem for convergence.

One of the most noteworthy aspects of this theorem is that it shows how to reduce the problem of approximating $V^*$ to the problem of approximating $T$ at a particular point (in particular, it is enough if $T$ can be approximated at $V^*$); in many cases, the latter is much easier to achieve and also to prove. For example, the theorem makes the convergence of Q-learning a consequence of the classical Robbins-Monro theorem (Robbins and Monro, 1951).

4 APPLICATIONS

This section makes use of Theorem 1 to prove the convergence of various reinforcement-learning algorithms.

4.1 GENERALIZED Q-LEARNING FOR EXPECTED VALUE MODELS

Consider the family of finite state and action generalized MDPs defined by the Bellman equations

$$V^*(x) = \bigotimes \Big( R(x,a) + \gamma \sum_y P(x,a,y)\, V^*(y) \Big),$$

where the definition of $\bigotimes$ does not depend on $R$ or $P$. A Q-learning algorithm for this class of models can be defined as follows. Given experience $\langle x_t, a_t, y_t, r_t \rangle$ at time $t$ and an estimate $Q_t(x,a)$ of the optimal Q function, let

$$Q_{t+1}(x_t, a_t) := (1 - \alpha_t(x_t, a_t))\, Q_t(x_t, a_t) + \alpha_t(x_t, a_t) \Big( r_t + \gamma \bigotimes Q_t(y_t, \cdot) \Big).$$

We can derive the assumptions necessary for this learning algorithm to satisfy the conditions of Theorem 1 and therefore converge to the optimal Q values. The dynamic-programming operator defining the optimal Q function is

$$[TQ](x,a) = R(x,a) + \gamma \sum_y P(x,a,y) \bigotimes_{a'} Q(y, a').$$

The randomized approximate dynamic-programming operator that gives rise to the Q-learning rule is

$$([T_t Q']Q)(x,a) = \begin{cases} (1 - \alpha_t(x,a))\, Q'(x,a) + \alpha_t(x,a) \big( r_t + \gamma \bigotimes_{a'} Q(y_t, a') \big), & \text{if } x = x_t \text{ and } a = a_t \\ Q'(x,a), & \text{otherwise.} \end{cases}$$

If $y_t$ is randomly selected according to the probability distribution defined by $P(x_t, a_t, \cdot)$; $\bigotimes$ is a non-expansion, and both the expected value and the variance of $\bigotimes Q(y_t, \cdot)$ exist given the way $y_t$ is sampled; $r_t$ has finite variance and expected value, given $x_t$ and $a_t$, equal to $R(x_t, a_t)$; and the learning rates are decayed so that $\sum_t \chi(x_t = x, a_t = a)\, \alpha_t(x,a) = \infty$ and $\sum_t \chi(x_t = x, a_t = a)\, \alpha_t(x,a)^2 < \infty$ with probability 1 uniformly over $X \times A$ (this condition implies, among other things, that every state-action pair is updated infinitely often; here, $\chi$ denotes the characteristic function), then a standard result from the theory of stochastic approximation (Robbins and Monro, 1951) states that $T_t$ approximates $T$ everywhere. That is, this method of using a decayed, exponentially weighted average correctly computes the average one-step reward.

Let

$$G_t(x,a) = \begin{cases} 1 - \alpha_t(x,a), & \text{if } x = x_t \text{ and } a = a_t \\ 1, & \text{otherwise,} \end{cases} \qquad F_t(x,a) = \begin{cases} \gamma\, \alpha_t(x,a), & \text{if } x = x_t \text{ and } a = a_t \\ 0, & \text{otherwise.} \end{cases}$$

These functions satisfy the conditions of Theorem 1 (Condition 3 is implied by the restrictions placed on the sequence of learning rates $\alpha_t$). Theorem 1 therefore implies that this generalized Q-learning algorithm converges to the optimal Q function with probability 1 uniformly over $X \times A$. The convergence of Q-learning for discounted MDPs and alternating Markov games follows trivially from this. Extensions of this result for undiscounted "all-policies-proper" MDPs (Bertsekas and Tsitsiklis, 1989), a soft state aggregation learning rule (Singh et al., 1995), and a "spreading" learning rule are given in a more detailed version of this paper (Szepesvári and Littman, 1996).

4.2 Q-LEARNING FOR MARKOV GAMES

Markov games are a generalization of MDPs and alternating Markov games in which both players simultaneously choose actions at each step. The basic model is defined by the tuple $\langle S, A, B, P, R \rangle$ and a discount factor $\gamma$. As in alternating Markov games, the optimality criterion is one of discounted minimax optimality, but because the players move simultaneously, the Bellman equations take on a more complex form:

$$V^*(x) = \max_{\pi \in \Pi(A)} \min_{b \in B} \sum_{a \in A} \pi(a) \Big( R(x,a,b) + \gamma \sum_{y \in S} P(x,a,b,y)\, V^*(y) \Big).$$

In these equations, $R(x,a,b)$ is the immediate reward for the maximizer for taking action $a$ in state $x$ at the same time the minimizer takes action $b$, $P(x,a,b,y)$ is the probability that state $y$ is reached from state $x$ when the maximizer takes action $a$ and the minimizer takes action $b$, and $\Pi(A)$ represents the set of discrete probability distributions over the set $A$. The sets $S$, $A$, and $B$ are finite.

Once again, optimal policies are policies that are in equilibrium, and there is always a pair of optimal policies that are stationary. Unlike MDPs and alternating Markov games, the optimal policies are sometimes stochastic; there are Markov games in which no deterministic policy is optimal. The stochastic nature of optimal policies explains the need for the optimization over probability distributions in the Bellman equations, and stems from the fact that players must avoid being "second guessed" during action selection. An equivalent set of equations can be written with stochastic choice for the minimizer, and also with the roles of the maximizer and minimizer reversed.
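The maximizing player's summary in these equations requires solving a small matrix game at each state. The sketch below (an illustration only; it relies on scipy's general-purpose linear-programming routine rather than anything specific to the paper) computes $\max_{\pi} \min_{b} \sum_a \pi(a)\, M[a,b]$ for a payoff matrix $M$, such as $M[a,b] = Q(x,a,b)$:

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Minimax value of the zero-sum matrix game M (rows: maximizer's actions).

    Solves: maximize v subject to sum_a pi(a) * M[a, b] >= v for every column b,
    with pi a probability distribution over the rows.
    """
    n_a, n_b = M.shape
    # Variables: [pi_1, ..., pi_{n_a}, v]; linprog minimizes, so minimize -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For each opponent action b: v - sum_a pi(a) M[a, b] <= 0.
    A_ub = np.hstack([-M.T, np.ones((n_b, 1))])
    b_ub = np.zeros(n_b)
    # pi must sum to one.
    A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    pi, v = res.x[:-1], res.x[-1]
    return v, pi
```

Using $M[a,b] = Q_t(y_t, a, b)$ as the payoff matrix gives the summary operator applied in the minimax Q-learning update described next.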

The Q-learning update rule for Markov games (Littman, 1994), given the step-$t$ experience $\langle x_t, a_t, b_t, y_t, r_t \rangle$, has the form

$$Q_{t+1}(x_t, a_t, b_t) := (1 - \alpha_t(x_t, a_t, b_t))\, Q_t(x_t, a_t, b_t) + \alpha_t(x_t, a_t, b_t) \Big( r_t + \gamma \bigotimes_{a,b} Q_t(y_t, \cdot, \cdot) \Big),$$

where

$$\bigotimes_{a,b} f(x, \cdot, \cdot) = \max_{\pi \in \Pi(A)} \min_{b \in B} \sum_{a \in A} \pi(a)\, f(x,a,b).$$

The results of the previous section prove that this rule converges to the optimal Q function under the proper conditions.

4.3 RISK-SENSITIVE MODELS

Heger (1994) described an optimality criterion for MDPs in which only the worst possible value of the next state makes a contribution to the value of a state. An optimal policy under this criterion is one that avoids states for which a bad outcome is possible, even if it is not probable; for this reason, the criterion has a risk-averse quality to it. The generalized Bellman equations for this criterion are

$$V^*(x) = \bigotimes \Big( R(x,a) + \gamma \min_{y : P(x,a,y) > 0} V^*(y) \Big).$$

The argument in Section 4.5 shows that model-based reinforcement learning can be used to find optimal policies in risk-sensitive models, as long as $\bigotimes$ does not depend on $R$ or $P$, and $P$ is estimated in a way that preserves its zero vs. non-zero nature in the limit. For the model in which $\bigotimes f(x, \cdot) = \max_a f(x,a)$, Heger defined a Q-learning-like algorithm that converges to optimal policies without estimating $R$ and $P$ online. In essence, the learning algorithm uses an update rule analogous to the rule in Q-learning with the additional requirement that the initial Q function be set optimistically; that is, $Q_0(x,a)$ must be larger than $Q^*(x,a)$ for all $x$ and $a$.

Using Theorem 1 it is possible to prove the convergence of a generalization of Heger's algorithm to models where $\bigotimes f(x, \cdot) = f(x, \sigma(f,x))$ for some function $\sigma(\cdot)$; that is, as long as the summary value of $f(x, \cdot)$ is equal to $f(x,a)$ for some $a$. The proof is based on estimating the Q-learning algorithm from above by an appropriate process where the Q function is updated only if the received experience tuple is an extremity according to the optimality equation; details are given elsewhere (Szepesvári and Littman, 1996).

4.4 EXPLORATION-SENSITIVE MODELS

John (1995) considered the implications of insisting that reinforcement-learning agents keep exploring forever; he found that better learning performance can be achieved if the Q-learning rule is changed to incorporate the condition of persistent exploration. In John's formulation, the agent is forced to adopt a policy from a restricted set; in one example, the agent must choose a stochastic stationary policy that selects actions at random 5% of the time. This approach requires that the definition of optimality be changed to reflect the restriction on policies. The optimal value function is given by $V^*(x) = \sup_{\pi \in P_0} V^\pi(x)$, where $P_0$ is the set of permitted (stationary) policies, and the associated Bellman equations are

$$V^*(x) = \sup_{\pi \in P_0} \sum_a \pi(x,a) \Big( R(x,a) + \gamma \sum_y P(x,a,y)\, V^*(y) \Big),$$

which corresponds to a generalized MDP model with $\bigoplus_{x,a} g = \sum_y P(x,a,y)\, g(y)$ and $\bigotimes f(x, \cdot) = \sup_{\pi \in P_0} \sum_a \pi(x,a)\, f(x,a)$. Because $\pi(x, \cdot)$ is a probability distribution over actions for any given state $x$, $\bigotimes$ is a non-expansion and, thus, the convergence of the associated Q-learning algorithm follows from the arguments in Section 4.1. As a result, John's learning rule gives the optimal policy under the revised optimality criterion.

4.5 MODEL-BASED METHODS

The defining assumption in reinforcement learning is that the reward and transition functions, $R$ and $P$, are not known in advance. Although Q-learning shows that optimal value functions can be estimated without ever explicitly learning $R$ and $P$, learning $R$ and $P$ makes more efficient use of experience at the expense of additional storage and computation (Moore and Atkeson, 1993). The parameters of $R$ and $P$ can be gleaned from experience by keeping statistics for each state-action pair on the expected reward and the proportion of transitions to each next state.
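A minimal sketch (an illustration under assumed data structures) of keeping those statistics, producing the empirical estimates $R_t$ and $P_t$ used by the model-based methods discussed below:

```python
import numpy as np

class EmpiricalModel:
    """Maintains empirical estimates of R and P from experience tuples."""

    def __init__(self, num_states, num_actions):
        self.counts = np.zeros((num_states, num_actions, num_states))
        self.reward_sums = np.zeros((num_states, num_actions))

    def update(self, x, a, y, r):
        """Record one experience <x, a, y, r>."""
        self.counts[x, a, y] += 1
        self.reward_sums[x, a] += r

    def estimates(self):
        """Return (R_t, P_t) as empirical means; unvisited pairs stay at zero."""
        n = np.maximum(self.counts.sum(axis=2), 1)   # visits to (x, a), floored at 1
        R_t = self.reward_sums / n
        P_t = self.counts / n[..., None]
        return R_t, P_t
```

Note that these empirical frequencies place zero probability on transitions that have never been observed, which is the property needed for estimating the risk-sensitive operator discussed in Section 4.5.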
In model-based reinforcement learning, $R$ and $P$ are estimated on-line, and the value function is updated according to the approximate dynamic-programming operator derived from these estimates. Theorem 1 implies the convergence of a wide variety of model-based reinforcement-learning methods. The dynamic-programming operator defining the optimal value function for generalized MDPs is given in Equation 2.

Here we assume that $\bigoplus_{x,a}$ may depend on $P$ and/or $R$, but $\bigotimes$ may not. It is possible to extend the following argument to allow $\bigotimes$ to depend on $P$ and $R$ as well. In model-based reinforcement learning, $R$ and $P$ are estimated by the quantities $R_t$ and $P_t$, and $\bigoplus_{x,a,t}$ is an estimate of the $\bigoplus_{x,a}$ operator defined using $R_t$ and $P_t$. As long as every state-action pair is visited infinitely often, there are a number of simple methods for computing $R_t$ and $P_t$ that converge to $R$ and $P$. A bit more care is needed to ensure that $\bigoplus_{x,a,t}$ converges to $\bigoplus_{x,a}$, however. For example, in expected-reward models, $\bigoplus_{x,a} g = \sum_y P(x,a,y)\, g(y)$ and the convergence of $P_t$ to $P$ guarantees the convergence of $\bigoplus_{x,a,t}$ to $\bigoplus_{x,a}$. On the other hand, in a risk-sensitive model, $\bigoplus_{x,a} g = \min_{y : P(x,a,y) > 0} g(y)$ and it is necessary to approximate $P$ in a way that ensures that the set of $y$ such that $P_t(x,a,y) > 0$ converges to the set of $y$ such that $P(x,a,y) > 0$. This can be accomplished easily, for example, by setting $P_t(x,a,y) = 0$ if no transition from $x$ to $y$ under $a$ has been observed.

Assuming $P$ and $R$ are estimated in a way that results in the convergence of $\bigoplus_{x,a,t}$ to $\bigoplus_{x,a}$, the sequence of dynamic-programming operators $T_t$ defined by

$$([T_t U]V)(x) = \begin{cases} \bigotimes \big( R_t(x,a) + \gamma \bigoplus_{x,a,t} V \big), & \text{if } x \in X_t \\ U(x), & \text{otherwise,} \end{cases}$$

approximates $T$ for all value functions. The set $X_t \subseteq S$ represents the set of states whose values are updated on step $t$; one popular choice is to set $X_t = \{x_t\}$. The functions

$$G_t(x) = \begin{cases} 0, & \text{if } x \in X_t \\ 1, & \text{otherwise,} \end{cases} \qquad F_t(x) = \begin{cases} \gamma, & \text{if } x \in X_t \\ 0, & \text{otherwise,} \end{cases}$$

satisfy the conditions of Theorem 1 as long as each $x$ is in infinitely many $X_t$ sets (Condition 3) and the discount factor $\gamma$ is less than 1 (Condition 4). As a consequence of this argument and Theorem 1, model-based methods can be used to find optimal policies in MDPs, alternating Markov games, Markov games, risk-sensitive MDPs, and exploration-sensitive MDPs. Also, letting $R_t = R$ and $P_t = P$ for all $t$, this result implies that real-time dynamic programming (Barto et al., 1995) converges to the optimal value function.

5 CONCLUSIONS

In this paper, we presented a generalized model of Markov decision processes, and proved the convergence of several reinforcement-learning algorithms in the generalized model.

Other Results. We have derived a collection of results (Szepesvári and Littman, 1996) for the generalized MDP model that demonstrate its general applicability: the Bellman equations can be solved by value iteration; a myopic policy with respect to an approximately optimal value function gives an approximately optimal policy; when $\bigotimes$ has a particular "maximization" property, policy iteration converges to the optimal value function; and, for models with the maximization property and finite state and action spaces, both value iteration and policy iteration identify optimal policies in pseudopolynomial time.

Related Work. The work presented here is closely related to several previous research efforts. Szepesvári (1995) described a related generalized reinforcement-learning model and presented conditions under which there is an optimal (stationary) policy that is myopic with respect to the optimal value function. Tsitsiklis (1994) developed the connection between stochastic-approximation theory and reinforcement learning in MDPs. Our work is similar in spirit to that of Jaakkola, Jordan, and Singh (1994). We believe the form of Theorem 1 makes it particularly convenient for proving the convergence of reinforcement-learning algorithms; our theorem reduces the proof of the convergence of an asynchronous process to a simpler proof of convergence of a corresponding synchronized one. This idea enables us to prove the convergence of asynchronous stochastic processes whose underlying synchronous process is not of the Robbins-Monro type (e.g., risk-sensitive MDPs, model-based algorithms, etc.).
Future Work. There are many areas of interest in the theory of reinforcement learning that we would like to address in future work. The results in this paper primarily concern reinforcement learning in contractive models ($\gamma < 1$ or all-policies-proper), and there are important non-contractive reinforcement-learning scenarios, for example, reinforcement learning under an average-reward criterion (Mahadevan, 1996). It would be interesting to develop a TD($\lambda$) algorithm (Sutton, 1988) for generalized MDPs.

Theorem 1 is not restricted to finite state spaces, and it might be valuable to prove the convergence of a reinforcement-learning algorithm for an infinite state-space model.

Conclusion. By identifying common elements among several reinforcement-learning scenarios, we created a new class of models that generalizes existing models in an interesting way. In the generalized framework, we replicated the established convergence proofs for reinforcement learning in Markov decision processes, and proved new results concerning the convergence of reinforcement-learning algorithms in game environments, under a risk-sensitive assumption, and under an exploration-sensitive assumption. At the heart of our results is a new stochastic-approximation theorem that is easy to apply to new situations.

Acknowledgements

Research supported by PHARE H /1022 and OTKA Grant no. F , and by Bellcore's Support for Doctoral Education Program.

References

Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81-138.

Bertsekas, D. P. and Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.

Condon, A. (1992). The complexity of stochastic games. Information and Computation, 96(2):203-224.

Gullapalli, V. and Barto, A. G. (1994). Convergence of indirect adaptive asynchronous value iteration algorithms. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 695-702, San Mateo, CA. Morgan Kaufmann.

Heger, M. (1994). Consideration of risk in reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 105-111, San Francisco, CA. Morgan Kaufmann.

Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6).

John, G. H. (1995). When the best move isn't optimal: Q-learning with exploration. Unpublished manuscript.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157-163, San Francisco, CA. Morgan Kaufmann.

Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1/2/3):159-196.

Moore, A. W. and Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407.

Singh, S., Jaakkola, T., and Jordan, M. (1995). Reinforcement learning with soft state aggregation. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, Cambridge, MA. The MIT Press.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3(1):9-44.

Szepesvári, C. (1995). General framework for reinforcement learning. In Proceedings of ICANN'95, Paris.

Szepesvári, C. and Littman, M. L. (1996). Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. Technical Report CS-96-11, Brown University, Providence, RI.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, pages 58-67.

Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3).

Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279-292.


More information

3.4 Numerical integration

3.4 Numerical integration 3.4. Numericl integrtion 63 3.4 Numericl integrtion In mny economic pplictions it is necessry to compute the definite integrl of relvlued function f with respect to "weight" function w over n intervl [,

More information

Theoretical foundations of Gaussian quadrature

Theoretical foundations of Gaussian quadrature Theoreticl foundtions of Gussin qudrture 1 Inner product vector spce Definition 1. A vector spce (or liner spce) is set V = {u, v, w,...} in which the following two opertions re defined: (A) Addition of

More information

ODE: Existence and Uniqueness of a Solution

ODE: Existence and Uniqueness of a Solution Mth 22 Fll 213 Jerry Kzdn ODE: Existence nd Uniqueness of Solution The Fundmentl Theorem of Clculus tells us how to solve the ordinry dierentil eqution (ODE) du f(t) dt with initil condition u() : Just

More information

Mathematics Number: Logarithms

Mathematics Number: Logarithms plce of mind F A C U L T Y O F E D U C A T I O N Deprtment of Curriculum nd Pedgogy Mthemtics Numer: Logrithms Science nd Mthemtics Eduction Reserch Group Supported y UBC Teching nd Lerning Enhncement

More information

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature CMDA 4604: Intermedite Topics in Mthemticl Modeling Lecture 19: Interpoltion nd Qudrture In this lecture we mke brief diversion into the res of interpoltion nd qudrture. Given function f C[, b], we sy

More information

Chapter 5 : Continuous Random Variables

Chapter 5 : Continuous Random Variables STAT/MATH 395 A - PROBABILITY II UW Winter Qurter 216 Néhémy Lim Chpter 5 : Continuous Rndom Vribles Nottions. N {, 1, 2,...}, set of nturl numbers (i.e. ll nonnegtive integers); N {1, 2,...}, set of ll

More information

FUZZY HOMOTOPY CONTINUATION METHOD FOR SOLVING FUZZY NONLINEAR EQUATIONS

FUZZY HOMOTOPY CONTINUATION METHOD FOR SOLVING FUZZY NONLINEAR EQUATIONS VOL NO 6 AUGUST 6 ISSN 89-668 6-6 Asin Reserch Publishing Networ (ARPN) All rights reserved wwwrpnjournlscom FUZZY HOMOTOPY CONTINUATION METHOD FOR SOLVING FUZZY NONLINEAR EQUATIONS Muhmmd Zini Ahmd Nor

More information

A ROLLOUT CONTROL ALGORITHM FOR DISCRETE-TIME STOCHASTIC SYSTEMS

A ROLLOUT CONTROL ALGORITHM FOR DISCRETE-TIME STOCHASTIC SYSTEMS Proceedings of the ASE 2 Dynmic Systems nd Control Conference DSCC2 September 2-5, 2, Cmbridge, sschusetts, USA DSCC2- A ROLLOUT CONTROL ALGORITH FOR DISCRETE-TIE STOCHASTIC SYSTES Andres A. liopoulos

More information

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior Reversls of Signl-Posterior Monotonicity for Any Bounded Prior Christopher P. Chmbers Pul J. Hely Abstrct Pul Milgrom (The Bell Journl of Economics, 12(2): 380 391) showed tht if the strict monotone likelihood

More information

Bernoulli Numbers Jeff Morton

Bernoulli Numbers Jeff Morton Bernoulli Numbers Jeff Morton. We re interested in the opertor e t k d k t k, which is to sy k tk. Applying this to some function f E to get e t f d k k tk d k f f + d k k tk dk f, we note tht since f

More information

Math 113 Exam 2 Practice

Math 113 Exam 2 Practice Mth Em Prctice Februry, 8 Em will cover sections 6.5, 7.-7.5 nd 7.8. This sheet hs three sections. The first section will remind you bout techniques nd formuls tht you should know. The second gives number

More information

APPROXIMATE INTEGRATION

APPROXIMATE INTEGRATION APPROXIMATE INTEGRATION. Introduction We hve seen tht there re functions whose nti-derivtives cnnot be expressed in closed form. For these resons ny definite integrl involving these integrnds cnnot be

More information

SCHOOL OF ENGINEERING & BUILT ENVIRONMENT. Mathematics. Basic Algebra

SCHOOL OF ENGINEERING & BUILT ENVIRONMENT. Mathematics. Basic Algebra SCHOOL OF ENGINEERING & BUILT ENVIRONMENT Mthemtics Bsic Algebr Opertions nd Epressions Common Mistkes Division of Algebric Epressions Eponentil Functions nd Logrithms Opertions nd their Inverses Mnipulting

More information

SUMMER KNOWHOW STUDY AND LEARNING CENTRE

SUMMER KNOWHOW STUDY AND LEARNING CENTRE SUMMER KNOWHOW STUDY AND LEARNING CENTRE Indices & Logrithms 2 Contents Indices.2 Frctionl Indices.4 Logrithms 6 Exponentil equtions. Simplifying Surds 13 Opertions on Surds..16 Scientific Nottion..18

More information

and that at t = 0 the object is at position 5. Find the position of the object at t = 2.

and that at t = 0 the object is at position 5. Find the position of the object at t = 2. 7.2 The Fundmentl Theorem of Clculus 49 re mny, mny problems tht pper much different on the surfce but tht turn out to be the sme s these problems, in the sense tht when we try to pproimte solutions we

More information

Multiple Integrals. Review of Single Integrals. Planar Area. Volume of Solid of Revolution

Multiple Integrals. Review of Single Integrals. Planar Area. Volume of Solid of Revolution Multiple Integrls eview of Single Integrls eding Trim 7.1 eview Appliction of Integrls: Are 7. eview Appliction of Integrls: Volumes 7.3 eview Appliction of Integrls: Lengths of Curves Assignment web pge

More information

A Signal-Level Fusion Model for Image-Based Change Detection in DARPA's Dynamic Database System

A Signal-Level Fusion Model for Image-Based Change Detection in DARPA's Dynamic Database System SPIE Aerosense 001 Conference on Signl Processing, Sensor Fusion, nd Trget Recognition X, April 16-0, Orlndo FL. (Minor errors in published version corrected.) A Signl-Level Fusion Model for Imge-Bsed

More information

Chapters 4 & 5 Integrals & Applications

Chapters 4 & 5 Integrals & Applications Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions

More information

An approximation to the arithmetic-geometric mean. G.J.O. Jameson, Math. Gazette 98 (2014), 85 95

An approximation to the arithmetic-geometric mean. G.J.O. Jameson, Math. Gazette 98 (2014), 85 95 An pproximtion to the rithmetic-geometric men G.J.O. Jmeson, Mth. Gzette 98 (4), 85 95 Given positive numbers > b, consider the itertion given by =, b = b nd n+ = ( n + b n ), b n+ = ( n b n ) /. At ech

More information

Research Article On Existence and Uniqueness of Solutions of a Nonlinear Integral Equation

Research Article On Existence and Uniqueness of Solutions of a Nonlinear Integral Equation Journl of Applied Mthemtics Volume 2011, Article ID 743923, 7 pges doi:10.1155/2011/743923 Reserch Article On Existence nd Uniqueness of Solutions of Nonliner Integrl Eqution M. Eshghi Gordji, 1 H. Bghni,

More information

Lecture 09: Myhill-Nerode Theorem

Lecture 09: Myhill-Nerode Theorem CS 373: Theory of Computtion Mdhusudn Prthsrthy Lecture 09: Myhill-Nerode Theorem 16 Ferury 2010 In this lecture, we will see tht every lnguge hs unique miniml DFA We will see this fct from two perspectives

More information

Learning Moore Machines from Input-Output Traces

Learning Moore Machines from Input-Output Traces Lerning Moore Mchines from Input-Output Trces Georgios Gintmidis 1 nd Stvros Tripkis 1,2 1 Alto University, Finlnd 2 UC Berkeley, USA Motivtion: lerning models from blck boxes Inputs? Lerner Forml Model

More information