Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms


Machine Learning, 39, 287–308, 2000. © 2000 Kluwer Academic Publishers. Printed in The Netherlands.

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

SATINDER SINGH, AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ 07932, USA. baveja@research.att.com
TOMMI JAAKKOLA, Department of Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. tommi@ai.mit.edu
MICHAEL L. LITTMAN, Department of Computer Science, Duke University, Durham, NC, USA. mlittman@cs.duke.edu
CSABA SZEPESVÁRI, Mindmaker Ltd., Konkoly Thege M. u., Budapest 1121, Hungary. szepes@mindmaker.hu

Editor: Sridhar Mahadevan

Abstract. An important application of reinforcement learning (RL) is to finite-state control problems, and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.

Keywords: reinforcement learning, on-policy, convergence, Markov decision processes

1. Introduction

Most reinforcement-learning (RL) algorithms (Kaelbling et al., 1996; Sutton & Barto, 1998) for solving discrete optimal control problems use evaluation or value functions to cache the results of experience. This is useful because close approximations to optimal value functions lead directly to good control policies (Williams & Baird, 1993; Singh & Yee, 1994). Different RL algorithms combine new experience with old value functions to produce new and statistically improved value functions in different ways. All such algorithms face a tradeoff between exploitation and exploration (Thrun, 1992; Kumar & Varaiya, 1986; Dayan & Sejnowski, 1996), i.e., between choosing actions that are best according to the current state of knowledge, and actions that are not the current best but improve the state of knowledge and potentially yield higher payoffs in the future.

Following Sutton and Barto (1998), we distinguish between two types of RL algorithms: on-policy and off-policy. Off-policy algorithms may update estimated value functions on the
basis of hypothetical actions, i.e., actions other than those actually executed; in this sense, Q-learning (Watkins & Dayan, 1992) is an off-policy algorithm. On-policy algorithms, on the other hand, update value functions strictly on the basis of the experience gained from executing some (possibly non-stationary) policy. This distinction is important because off-policy algorithms can (at least conceptually) separate exploration from control while on-policy algorithms cannot. More precisely, in the case of on-policy algorithms, a convergence proof requires more details of the exploration to be specified than for off-policy algorithms, since the update rule depends a great deal on the actions taken by the system.

On-policy algorithms may prove to be important for several reasons. The analogue of the on-policy/off-policy distinction for RL prediction problems is the trajectory-based/trajectory-free distinction. Trajectory-based algorithms appear superior to trajectory-free algorithms for prediction when parameterized function approximators are used (Tsitsiklis & Van Roy, 1996). These results carry over empirically to the control case as well (Boyan & Moore, 1995; Sutton, 1996). In addition, multi-step prediction algorithms such as TD(λ) (Sutton, 1988) are more flexible and data efficient than single-step algorithms (TD(0)), and most natural multi-step algorithms for control are on-policy.

Another motivation for studying on-policy algorithms is the consideration of the interaction between exploration and optimal actions, identified by Sutton and Barto (1998) and John (1994). Consider a robot learning to maximize reward in a dangerous environment. Throughout its existence, it will need to execute exploration actions to help it learn about its options. However, some of these exploration actions will lead to bad outcomes. An on-policy learner will factor in the costs of exploration, and tend to avoid entering parts of the state space where exploration is more dangerous.

A suggestive example appears in figure 1. This 3-state deterministic MDP has two actions: l and r. For a discount factor of γ = 0.9, the optimal action choice from state y is r (value 75.8 as opposed to 74.9 for action l). On the other hand, if exploratory actions are taken 50% of the time, the risk of picking the dangerous action r in state z becomes too great. Now, for a discount factor of γ = 0.9, the optimal action choice from state y is l (value 58.8 as opposed to 58.6 for action r). In other environments, the difference can be even greater, necessitating the application of on-policy learning methods.

Figure 1. The optimal action to take in this small, deterministic MDP depends on the exploration strategy. If no exploration actions are taken, the optimal action from state y is r. If exploration actions are taken, the costly action r will sometimes be chosen from state z, making l the best choice from y.

In this paper, we examine the convergence of single-step (value updates based on the value of the next timestep only), on-policy RL algorithms for control. We do not address either
function approximation or multi-step algorithms; this is the subject of our ongoing research. Earlier work has shown that there are off-policy RL algorithms that converge to optimal value functions (Watkins & Dayan, 1992; Dayan, 1992; Jaakkola et al., 1994; Tsitsiklis, 1994; Gullapalli & Barto, 1994; Littman & Szepesvári, 1996); we prove convergence results for several related on-policy algorithms. We also provide examples of policies that can be followed during learning that result in convergence to both optimal values and optimal policies. These results generalize naturally to off-policy algorithms, such as Q-learning, showing the convergence of many RL algorithms to optimal policies.

2. Solving Markov decision problems

Markov decision processes (MDPs) are widely used to model controlled dynamical systems in control theory, operations research and artificial intelligence (Puterman, 1994; Bertsekas, 1995; Barto et al., 1995; Sutton & Barto, 1998). Let S = {1, 2, ..., N} denote the discrete set of states of the system, and let A be the discrete set of actions available to the system. The probability of making a transition from state s to state s' on action a is denoted P^a_{ss'} and the random payoff associated with that transition is denoted r(s, a). A policy maps each state to a probability distribution over actions; this mapping can be invariant over time (stationary) or change as a function of the interaction history (non-stationary). For any policy π, we define a value function

V^π(s) = E_π { Σ_{t=0}^∞ γ^t r_t | s_0 = s },

which is the expected value of the infinite-horizon sum of the discounted payoffs when the system is started in state s and the policy π is followed forever. Note that r_t and s_t are the payoff and state respectively at timestep t, and (r_t, s_t) is a stochastic process, where (r_t, s_{t+1}) depends only on (s_t, a_t), governed by the rules that r_t is distributed as r(s_t, a_t) and the probability that s_{t+1} = s' is P^{a_t}_{s_t s'}. Here, a_t is the action taken by the system at timestep t. The discount factor, 0 ≤ γ < 1, makes payoffs in the future less valuable than more immediate payoffs.

The solution of an MDP is an optimal policy π* that simultaneously maximizes the value of every state s ∈ S. It is known that a stationary deterministic optimal policy exists for every MDP (c.f. Bertsekas (1995)). Hereafter, unless explicitly noted, all policies are assumed to be stationary. The value function associated with π* is denoted V*. Often it is convenient to associate values not with states but with state-action pairs, called Q values as in Watkins' Q-learning (Watkins, 1989):

Q^π(s, a) = R(s, a) + γ E{V^π(s')}   and   Q*(s, a) = R(s, a) + γ E{V*(s')},
where s' is the random next state on executing action a in state s, and R(s, a) is the expected value of r(s, a). Clearly, π*(s) = argmax_a Q*(s, a), and V*(s) = max_a Q*(s, a). The optimal Q values satisfy the recursive Bellman optimality equations (Bellman, 1957),

Q*(s, a) = R(s, a) + γ Σ_{s'} P^a_{ss'} max_b Q*(s', b),   ∀s, a.   (1)

In reinforcement learning, the quantities that define the MDP, P and R, are not known in advance. An RL algorithm must find an optimal policy by interacting with the MDP directly; because effective learning typically requires the algorithm to revisit every state many times, we assume the MDP is communicating (every state can be reached from every other state).

2.1. Off-policy and on-policy algorithms

Most RL algorithms for solving MDPs are iterative, producing a sequence of estimates of either the optimal (Q-)value function or the optimal policy or both by repeatedly combining old estimates with the results of a new trial to produce new estimates. An RL algorithm can be decomposed into two components. The learning policy is a non-stationary policy that maps experience (states visited, actions chosen, rewards received) into a current choice of action. The update rule is how the algorithm uses experience to change its estimate of the optimal value function. In an off-policy algorithm, the update rule need not have any relation to the learning policy. Q-learning (Watkins, 1989) is an off-policy algorithm that estimates the optimal Q-value function as follows:

Q_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Q_t(s_t, a_t) + α_t(s_t, a_t) [ r_t + γ max_b Q_t(s_{t+1}, b) ],   (2)

where Q_t is the estimate at the beginning of the tth timestep, and s_t, a_t, r_t, and α_t are the state, action, reward, and step size (learning rate) at timestep t. Equation (2) is an off-policy algorithm as the update of Q_t(s_t, a_t) depends on max_b Q_t(s_{t+1}, b), which relies on comparing various hypothetical actions b. The convergence of the Q-learning algorithm does not put any strong requirements on the learning policy other than that every action is experienced in every state infinitely often. This can be accomplished, for example, using the random-walk learning policy, which chooses actions uniformly at random. Later, we describe several other learning policies that result in convergence when combined with the Q-learning update rule.

The update rule for SARSA(0) (Rummery, 1994; also see Rummery & Niranjan, 1994; John, 1994, 1995; Singh & Sutton, 1996; Sutton, 1996):

Q_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Q_t(s_t, a_t) + α_t(s_t, a_t) [ r_t + γ Q_t(s_{t+1}, a_{t+1}) ],   (3)
a special case of SARSA(λ) with λ = 0, is quite similar to the update rule for Q-learning. The main difference is that Q-learning makes an update based on the greedy Q value of the successor state, s_{t+1}, while SARSA(0)¹ uses the Q value of the action a_{t+1} actually chosen by the learning policy. This makes SARSA(0) an on-policy algorithm, and therefore its conditions for convergence depend a great deal on the learning policy. In particular, because SARSA(0) learns the value of its own actions, the Q values can converge to optimality in the limit only if the learning policy chooses actions optimally in the limit. Section 3 provides some positive convergence results for two significant classes of learning policies.

Under a greedy learning policy (i.e., always select the action that is best according to the current estimate), the update rules for Q-learning and SARSA(0) are identical. The resulting RL algorithm would not converge to optimal solutions, in general, because the need for infinite exploration would not be satisfied. This helps illustrate the tension between adequate exploration and exploitation with regard to convergence to optimality. It is worth noting, however, that the approach of using a greedy learning policy has yielded some impressive successes, including the world's finest backgammon-playing program (Tesauro, 1995), and state-of-the-art systems for space shuttle scheduling (Zhang & Dietterich, 1995), elevator control (Crites & Barto, 1996), and cellular telephone resource allocation (Singh & Bertsekas, 1997). All these applications can be viewed as exploiting on-policy algorithms, although the on-policy versus off-policy distinction is not meaningful when no explicit exploration is used.

2.2. Learning policies

A learning policy selects an action at timestep t as a function of the history of states, actions, and rewards experienced so far. In this paper, we consider several learning policies that make decisions based on a summary of history consisting of the current timestep t, the current state s, the current estimate Q of the optimal Q-value function, and the number of times state s has been visited before time t, n_t(s). Such a learning policy can be expressed as the probabilities Pr(a | s, t, Q, n_t(s)), the probability that action a is selected given the history.

We divide learning policies for MDPs into two broad categories: decaying exploration, a learning policy that becomes more and more like the greedy learning policy over time, and persistent exploration, a learning policy that does not. The advantage of decaying exploration policies is that the actions taken by the system may converge to the optimal ones eventually, but with the price that their ability to adapt slows down. In contrast to this, persistent exploration learning policies can retain their adaptivity forever, but with the price that the actions of the system will not converge to optimality in the standard sense. We prove the convergence of SARSA(0) to optimal policies in the standard sense for a class of decaying exploration learning policies, and to optimal policies in a special sense defined below for a class of persistent exploration learning policies.

Consider the class of decaying exploration learning policies characterized by the following two properties:

1. each action is executed infinitely often in every state that is visited infinitely often, and
2. in the limit, the learning policy is greedy with respect to the Q-value function with probability 1.

We label learning policies satisfying the above conditions as GLIE, which stands for "greedy in the limit with infinite exploration." An example of such a learning policy is a form of Boltzmann exploration:

Pr(a | s, t, Q) = e^{β_t(s) Q(s,a)} / Σ_{b∈A} e^{β_t(s) Q(s,b)},

where β_t(s) is the state-specific exploration coefficient for time t, which controls the rate of exploration in the learning policy. To meet Condition 2 above, we would like β_t to be infinite in the limit, while to meet Condition 1 above we would like β_t to not approach infinity too fast. In Appendix B, we show that β_t(s) = ln n_t(s) / C_t(s) satisfies the above requirements (where n_t(s) ≤ t is the number of times state s has been visited in t timesteps, and C_t(s) is defined in Appendix B). Another example of a GLIE learning policy is a form of ε-greedy exploration (Sutton, 1996), which at timestep t in state s picks a random exploration action with probability ε_t(s) and the greedy action with probability 1 − ε_t(s). In Appendix B, we show that if ε_t(s) = c/n_t(s) for 0 < c < 1, then ε-greedy exploration is GLIE.

We also analyze restricted rank-based randomized (RRR) learning policies, a class of persistent exploration learning policies commonly used in practice. An RRR learning policy selects actions probabilistically according to the ranks of their Q values, choosing the greedy action with the highest probability and the action with the lowest Q value with the lowest probability. Different learning policies can be specified by different choices of the function T: {1, ..., m} → ℝ that maps action ranks to probabilities. Here, m is the number of actions. For consistency, we require that T(1) ≥ T(2) ≥ ⋯ ≥ T(m) and Σ_{i=1}^m T(i) = 1.

At timestep t, the RRR learning policy chooses an action by first ranking the available actions according to the Q values assigned by the current Q-value function Q_t for the current state s_t. We use the notation ρ(Q, s, a) to be the rank of action a in state s based on Q(s, ·) (e.g., if ρ(Q, s, a) = 1 then a = argmax_b Q(s, b)), with ties broken arbitrarily. Once the actions are ranked, the ith ranked action is chosen with probability T(i); that is, action a is chosen with probability T(ρ(Q, s, a)). The RRR learning policy is restricted in that it does not directly choose actions; it simply assigns probabilities to actions according to their ranks. Therefore, an RRR learning policy has the form Pr(a | s, t, Q) = T(ρ(Q_t, s, a)).

To illustrate the use of the T function, we specify three well-known learning policies as RRR learning policies by the appropriate definition of T. The random-walk learning policy chooses each action in state s with probability 1/m. To achieve this behavior with the RRR learning policy, simply define T(i) = 1/m for all i; actions will be chosen uniformly at random regardless of their rank. The greedy learning policy can be specified by T(1) = 1, T(i) = 0 when 1 < i ≤ m; it deterministically selects the action with the highest Q value. Similarly, ε-greedy exploration can be specified by defining T(1) = 1 − ε + ε/m, T(i) = ε/m for 1 < i ≤ m. This policy takes the greedy action with probability 1 − ε and a random action otherwise. To satisfy the condition that T(1) ≥ T(2) ≥ ⋯ ≥ T(m), we require that 0 ≤ ε ≤ 1.

Another commonly used persistent exploration learning policy is Boltzmann exploration with a fixed exploration parameter. Note there is no choice of T that specifies Boltzmann exploration; Boltzmann exploration is not an RRR learning policy as the probability of choosing an action depends on the actual Q values and not only on the ranks of the actions in Q(s, ·).
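As a concrete illustration of the learning policies above, the following sketch (ours, not from the paper; function and variable names are assumptions) implements the RRR selection rule: actions are ranked by their current Q values and the rank-i action is drawn with probability T(i). The three T vectors shown recover the random-walk, greedy, and ε-greedy policies as special cases.

```python
import numpy as np

def rrr_action(Q, s, T, rng):
    """RRR learning policy: choose the rank-(i+1) action of state s with probability T[i]."""
    q = Q[s]                              # Q values of the m actions in state s
    order = np.argsort(-q)                # order[i] = action whose rank is i+1 (ties arbitrary)
    rank = rng.choice(len(q), p=T)        # sample a rank according to T
    return int(order[rank])

m, eps = 4, 0.1                           # illustrative number of actions and epsilon
T_random_walk = np.full(m, 1.0 / m)       # T(i) = 1/m for all i
T_greedy = np.eye(m)[0]                   # T(1) = 1, T(i) = 0 otherwise
T_eps_greedy = np.full(m, eps / m)        # T(i) = eps/m for i > 1, ...
T_eps_greedy[0] += 1.0 - eps              # ... and T(1) = 1 - eps + eps/m

rng = np.random.default_rng(0)
Q = np.array([[1.0, 0.5, 0.2, -0.3]])     # a single state with four actions, for illustration
print([rrr_action(Q, 0, T, rng) for T in (T_random_walk, T_greedy, T_eps_greedy)])
```

Note that T itself stays fixed; only which action occupies each rank changes as the Q values are updated.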

3. Results

Below we prove results on the convergence of SARSA(0) under the two separate cases of GLIE and RRR learning policies.

3.1. Convergence of SARSA(0) under GLIE learning policies

To ensure the convergence of SARSA(0), we require a lookup-table representation for the Q values and infinite visits to every state-action pair, just as for Q-learning. Unlike Q-learning, however, SARSA(0) is an on-policy algorithm and, in order to achieve its convergence to optimality, we have to further assume that the learning policy becomes greedy in the limit. To state these assumptions and the resulting convergence more formally, we note first that due to the dependence on the learning policy, SARSA(0) does not directly fall under the previously published convergence theorems (Dayan & Sejnowski, 1994; Jaakkola et al., 1994; Tsitsiklis, 1994; Szepesvári & Littman, 1996). Only a slight extension is needed, however, and this is presented in the form of Lemma 1 below (extending Theorem 1 of Jaakkola et al., 1994, and Lemma 12 of Szepesvári & Littman (1996)). For clarity, we will not present the lemma in full generality.

Lemma 1. Consider a stochastic process (α_t, Δ_t, F_t), t ≥ 0, where α_t, Δ_t, F_t: X → ℝ satisfy the equations

Δ_{t+1}(x) = (1 − α_t(x)) Δ_t(x) + α_t(x) F_t(x),   x ∈ X, t = 0, 1, 2, ...

Let P_t be a sequence of increasing σ-fields such that α_0 and Δ_0 are P_0-measurable and α_t, Δ_t and F_{t−1} are P_t-measurable, t = 1, 2, ... Assume that the following hold:

1. the set X is finite;
2. 0 ≤ α_t(x) ≤ 1, Σ_t α_t(x) = ∞, Σ_t α_t²(x) < ∞ w.p.1;
3. ‖E{F_t(·) | P_t}‖_W ≤ κ ‖Δ_t‖_W + c_t, where κ ∈ [0, 1) and c_t converges to zero w.p.1;
4. Var{F_t(x) | P_t} ≤ K(1 + ‖Δ_t‖_W)², where K is some constant.

Then, Δ_t converges to zero with probability one (w.p.1).

Let us first clarify how this lemma relates to the learning algorithms that are the focus of this paper. We can capture the sequence of visited states s_t and selected actions a_t in the definition of the learning rates α_t as follows: define x_t = (s_t, a_t) and further require that α_t(x) = 0 whenever x ≠ x_t. With these definitions, the iterative process reduces to

Δ_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Δ_t(s_t, a_t) + α_t(s_t, a_t) F_t(s_t, a_t),

which resembles more closely the updates of on-line algorithms such as SARSA(0) (Eq. (3)). Also, note that the lemma shows the convergence of Δ to zero rather than to some non-zero optimal values. The intended meaning of Δ is Q_t − Q*, i.e., the difference between the current Q values, Q_t, and the target Q values, Q*, that are attained asymptotically.
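For intuition only (this is not part of the paper), here is a small numerical instance of the process in Lemma 1: F_t has conditional mean κΔ_t + c_t, which satisfies condition 3 with a vanishing c_t, plus zero-mean bounded-variance noise, and the step sizes α_t = t^(−0.6) satisfy condition 2. Running it shows ‖Δ_t‖ becoming small, consistent with the lemma; all constants are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, kappa = 5, 0.9                        # |X| = 5, contraction factor kappa < 1
delta = 10.0 * rng.normal(size=n)        # Delta_0, an arbitrary starting point

for t in range(1, 100001):
    alpha = t ** -0.6                    # sum(alpha_t) = inf, sum(alpha_t^2) < inf
    c_t = 1.0 / np.sqrt(t)               # the vanishing term c_t -> 0
    noise = rng.normal(size=n)           # zero-mean noise with bounded variance
    F = kappa * delta + c_t + noise      # so ||E[F_t | past]|| <= kappa ||Delta_t|| + c_t
    delta = (1 - alpha) * delta + alpha * F

print(np.max(np.abs(delta)))             # small for large t, illustrating Delta_t -> 0
```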

The extension provided by our formulation of the lemma is the fact that the contraction property (the third condition) need not be strict; a strict contraction is now required to hold only asymptotically. This relaxation makes the theorem more widely applicable.

Proof: While we have stated that the lemma extends previous results such as Theorem 1 of Jaakkola et al. (1994) and Lemma 12 of Szepesvári & Littman (1996), the proof of our lemma is, however, already almost fully contained in the proofs of these results (requiring only minor, largely notational changes). Moreover, the lemma also follows from Proposition 4.5 of Bertsekas (1995), and in Appendix A we present a proof based on this proposition.

We can now use Lemma 1 to show the convergence of SARSA(0).

Theorem 1. Consider a finite state-action MDP and fix a GLIE learning policy π given as a set of probabilities Pr(a | s, t, n_t(s), Q). Assume that a_t is chosen according to π and that at time step t, π uses Q = Q_t, where the Q_t values are computed by the SARSA(0) rule (see Eq. (3)). Then Q_t converges to Q* and the learning policy π_t converges to an optimal policy π*, provided that the conditions on the immediate rewards, state transitions and learning rates listed in Section 2 hold and the following additional conditions are satisfied:

1. The Q values are stored in a lookup table.
2. The learning rates satisfy 0 ≤ α_t(s, a) ≤ 1, Σ_t α_t(s, a) = ∞, Σ_t α_t²(s, a) < ∞, and α_t(s, a) = 0 unless (s, a) = (s_t, a_t).
3. Var{r(s, a)} < ∞.

Proof: The correspondence to Lemma 1 follows from associating X with the set of state-action pairs (s, a), α_t(x) with α_t(s, a), and Δ_t(s, a) with Q_t(s, a) − Q*(s, a). It follows that

Δ_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Δ_t(s_t, a_t) + α_t(s_t, a_t) F_t(s_t, a_t),

where

F_t(s_t, a_t) = r_t + γ max_{b∈A} Q_t(s_{t+1}, b) − Q*(s_t, a_t) + γ [ Q_t(s_{t+1}, a_{t+1}) − max_{b∈A} Q_t(s_{t+1}, b) ]
             ≝ r_t + γ max_{b∈A} Q_t(s_{t+1}, b) − Q*(s_t, a_t) + C_t(Q)
             ≝ F_t^Q(s_t, a_t) + C_t(s_t, a_t),

where F_t^Q would be the corresponding F_t in Lemma 1 if the algorithm under consideration were Q-learning. We define F_t(s, a) = F_t^Q(s, a) = C_t(s, a) = 0 if (s, a) ≠ (s_t, a_t) (so F_t(s, a) = F_t^Q(s, a) + C_t(s, a) for all (s, a)) and denote the σ-field generated by the
random variables {s_t, α_t, a_t, r_{t−1}, ..., s_1, α_1, a_1, Q_0} by P_t. Note that Q_t, Q_{t−1}, ..., Q_0 are P_t-measurable and, thus, both Δ_t and F_{t−1} are P_t-measurable, satisfying the measurability conditions of Lemma 1.

It is well known that for Q-learning ‖E{F_t^Q(·, ·) | P_t}‖ ≤ γ ‖Δ_t‖ for all t, where ‖·‖ is the maximum norm. In other words, the expected update operator is a contraction mapping. The only difference between the current F_t and the F_t^Q for Q-learning is the presence of C_t. Therefore,

‖E{F_t(·, ·) | P_t}‖ ≤ ‖E{F_t^Q(·, ·) | P_t}‖ + ‖E{C_t(·, ·) | P_t}‖   (4)
                     ≤ γ ‖Δ_t‖ + ‖E{C_t(·, ·) | P_t}‖.   (5)

Identifying c_t = ‖E{C_t(·, ·) | P_t}‖ in Lemma 1, we are left with showing that c_t converges to zero w.p.1. This, however, follows (a) from our assumption of a GLIE policy (i.e., that non-greedy actions are chosen with vanishing probabilities), (b) the assumption of finiteness of the MDP, and (c) the fact that Q_t(s, a) stays bounded during learning. To verify the boundedness property, we note that the SARSA(0) Q values can be upper bounded by the Q values of a Q-learning process that updates exactly the same state-action pairs in the same order as the SARSA(0) process. Similarly, the SARSA(0) Q values are lower bounded by the Q values of a Q-learning process that uses min instead of max in the update rule (c.f. Eq. (2)) and updates exactly the same state-action pairs in the same order as the SARSA(0) process. Both the lower-bounding and the upper-bounding Q-learning processes are convergent and have bounded Q values. The condition on the variance of F_t follows from the similar property of F_t^Q.

Note that if a GLIE learning policy is used with the Q-learning update rule, one gets convergence to both the optimal Q-value function and an optimal policy. This begins to address a significant outstanding question in the theory of reinforcement learning: How do you learn a policy that achieves high reward in the limit and during learning? Previous convergence results for Q-learning guarantee that the optimal Q-value function is reached in the limit; this is important because the longer the learning process goes on, the closer to optimal the greedy policy with respect to the learned Q-value function will be. However, this provides no useful guidance for selecting actions during learning. Our results, in contrast, show that it is possible to follow a policy during learning that approaches optimality over time.

The properties of GLIE policies imply that for any RL algorithm that converges to the optimal value function and whose estimates stay bounded (e.g., Q-learning, and the ARTDP of Barto et al. (1995)), using GLIE learning policies will ensure concurrent convergence to an optimal policy. However, to get an implementable RL algorithm, one still has to specify a suitable learning policy that guarantees that every action is attempted in every state infinitely often (i.e., Σ_t α_t(s, a) = ∞). In Appendix B, we prove that, if the probability of choosing any particular action in any given state sums up to infinity, then the above condition is indeed satisfied. To illustrate this, in Appendix B we derive two learning strategies that are GLIE.
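To make the setting of Theorem 1 concrete, here is a minimal tabular sketch (ours, not the paper's) of SARSA(0) driven by the GLIE ε-greedy policy ε_t(s) = c/n_t(s) from Appendix B, with step sizes α_t(s, a) = 1/n_t(s, a), which satisfy the step-size conditions when every pair is visited infinitely often. The environment interface (reset/step) and all constants are assumptions for illustration.

```python
import numpy as np

def sarsa0_glie(env, n_states, n_actions, gamma=0.9, c=0.5, episodes=5000, seed=0):
    """Tabular SARSA(0) with the GLIE epsilon-greedy policy eps_t(s) = c / n_t(s), 0 < c < 1."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    n_s = np.zeros(n_states)                   # visit counts n_t(s)
    n_sa = np.zeros((n_states, n_actions))     # visit counts n_t(s, a)

    def policy(s):
        n_s[s] += 1
        if rng.random() < c / n_s[s]:          # decaying exploration probability
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))            # greedy in the limit

    for _ in range(episodes):
        s = env.reset()                        # assumed environment interface
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = policy(s2)
            n_sa[s, a] += 1
            alpha = 1.0 / n_sa[s, a]           # sum alpha = inf, sum alpha^2 < inf
            # SARSA(0) update, Eq. (3): bootstrap on the action actually chosen next.
            target = r + (0.0 if done else gamma * Q[s2, a2])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```

The same GLIE schedule combined with the Q-learning update (Eq. (2)) gives the concurrent convergence to an optimal policy noted above.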

3.2. Convergence of SARSA(0) under RRR learning policies

This section proves two separate results concerning a class of persistent exploration learning policies: (1) the SARSA(0) update rule combined with an RRR learning policy converges to a well-defined Q-value function and policy, and (2) the resulting policy is optimal, in a sense we will define. As mentioned earlier, an RRR learning policy chooses actions probabilistically by their ranking according to the current Q-value function; a specific learning policy is specified by the function T, a probability distribution over action ranks.

A restricted policy π̄: S → (A → {1, ..., m}) ranks actions in each state (recall that m denotes the number of actions), i.e., π̄(s) is a bijection between A and {1, ..., m}. For convenience, we use the notation π̄(s, a) to denote the assigned rank of action a in state s, i.e., to denote π̄(s)(a). The mapping π̄ represents a policy in the sense that an agent following restricted policy π̄ from state s chooses action a with probability T(π̄(s, a)), the probability assigned to the rank, π̄(s, a), of action a in state s.

Consider what happens when the SARSA(0) update rule is used to learn the value of a fixed restricted policy π̄. Standard convergence results for Q-learning can easily be used to show that the Q_t values will converge to the Q-value function of π̄. Specifically, Q_t will converge to Q^π̄, defined as the unique solution to

Q^π̄(s, a) = R(s, a) + γ Σ_{s'∈S} P^a_{ss'} Σ_{a'∈A} T(π̄(s', a')) Q^π̄(s', a'),   ∀(s, a) ∈ S × A.   (6)

When an RRR learning policy is followed, the situation becomes a bit more complex. Upon entering state s, the probability that the learning policy will choose, for example, the rank-1 action is fixed at T(1); however, the identity of that action changes according to the current Q-value function estimate Q_t(·, ·). The natural extension of Eq. (6) to an RRR learning policy would be for the target of convergence of Q_t in SARSA(0) to be

Q̃(s, a) = R(s, a) + γ Σ_{s'∈S} P^a_{ss'} Σ_{a'∈A} T(ρ(Q̃, s', a')) Q̃(s', a'),   ∀(s, a) ∈ S × A.   (7)

Recall that ρ(Q̃, s', a') represents the rank of action a' according to the Q values Q̃ of state s'. The only change between Eqs. (6) and (7) is that the latter uses an assignment of ranks that is based upon the recursively defined Q-value function Q̃, whereas the former uses a fixed assignment of ranks. Using the theory of generalized MDPs (Szepesvári & Littman, 1996), we can show that this difference is not important from the perspective of proving the existence and uniqueness of the solution to Eq. (7). Define

⊗Q(s, ·) = Σ_{a∈A} T(ρ(Q, s, a)) Q(s, a);   (8)

now Eq. (7) can be rewritten

Q̃(s, a) = R(s, a) + γ Σ_{s'∈S} P^a_{ss'} ⊗Q̃(s', ·),   ∀(s, a) ∈ S × A.   (9)

As long as ⊗ satisfies the non-expansion property that |⊗Q(s, ·) − ⊗Q'(s, ·)| ≤ max_a |Q(s, a) − Q'(s, a)| for all Q-value functions Q and Q' and all states s, then Eq. (9) has a solution and it is unique (Szepesvári & Littman, 1996); this is proven in Appendix C. The non-expansion property of ⊗ can be verified by the following argument. Consider the family of operators ⊗^i Q(s, ·) = ith largest value of Q(s, ·), for each 1 ≤ i ≤ m. These are all non-expansions (see Appendix C). Define the operator that maps Q to Σ_i T(i) ⊗^i Q(s, ·); it is a non-expansion as long as every ⊗^i is and T is a fixed probability distribution (see Appendix C). It is clear that Σ_i T(i) ⊗^i Q(s, ·) = ⊗Q(s, ·) as defined in Eq. (8), so ⊗ is a non-expansion also. Therefore, Q̃ exists and is unique.

We next show that Q̃ is, in fact, the target of convergence for SARSA(0).

Theorem 2. In finite state-action MDPs, the Q_t values computed by the SARSA(0) rule (see Eq. (3)) converge to Q̃ if the learning policy is RRR, the conditions on the immediate rewards and state transitions listed in Section 2 hold, and if the following additional conditions are satisfied:

1. Pr(a_{t+1} = a | Q_t, s_{t+1}) = T(ρ(Q_t, s_{t+1}, a)).
2. The Q values are stored in a lookup table.
3. The learning rates satisfy 0 ≤ α_t(s, a) ≤ 1, Σ_t α_t(s, a) = ∞, Σ_t α_t²(s, a) < ∞, and α_t(s, a) = 0 unless (s, a) = (s_t, a_t).
4. Var{r(s, a)} < ∞.

Proof: The result readily follows from Lemma 1 (or Theorem 1 of Jaakkola et al. (1994)) and the proof follows nearly identical lines as that of Theorem 1. First, we associate X (of Lemma 1) with the set of state-action pairs (s, a) and α_t(x) with α_t(s, a), but here we set Δ_t(s, a) = Q_t(s, a) − Q̃(s, a). Again, it follows that

Δ_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Δ_t(s_t, a_t) + α_t(s_t, a_t) F_t(s_t, a_t),

where now

F_t(s_t, a_t) = r_t + γ Q_t(s_{t+1}, a_{t+1}) − Q̃(s_t, a_t).

Further, we define F_t(s, a) = C_t(s, a) = 0 if (s, a) ≠ (s_t, a_t) and denote the σ-field generated by the random variables {s_t, α_t, a_t, r_{t−1}, ..., s_1, α_1, a_1, Q_0} by P_t. Note that Q_t, Q_{t−1}, ..., Q_0 are P_t-measurable and, thus, both Δ_t and F_{t−1} are P_t-measurable, satisfying the measurability conditions of Lemma 1.

Substituting the right-hand side of Eq. (7) for Q̃(s_t, a_t) in the definition of F_t, together with the properties of sampling r_t, s_{t+1} and a_{t+1}, yields that

E{F_t(s_t, a_t) | P_t}
  = γ ( E{Q_t(s_{t+1}, a_{t+1}) | P_t} − Σ_{s'∈S} P^{a_t}_{s_t s'} Σ_{a∈A} T(ρ(Q̃, s', a)) Q̃(s', a) )
  = γ ( Σ_{s'∈S} P^{a_t}_{s_t s'} Σ_{a∈A} T(ρ(Q_t, s', a)) Q_t(s', a) − Σ_{s'∈S} P^{a_t}_{s_t s'} Σ_{a∈A} T(ρ(Q̃, s', a)) Q̃(s', a) )
  ≤ γ ‖Q_t − Q̃‖ = γ ‖Δ_t‖,

where in the first equation we have exploited the fact that E{r_t | s_t, a_t} = R(s_t, a_t), in the second equation that Pr(s_{t+1} = s' | s_t, a_t) = P^{a_t}_{s_t s'} and that Pr(a_{t+1} = a | Q_t, s_{t+1}) = T(ρ(Q_t, s_{t+1}, a)) (Condition 1), whereas the inequality comes from the properties of rank-based averaging (see Lemma 7 and Theorems 9 and 10 of Szepesvári & Littman (1996), also Appendix C). Finally, it is not hard to prove that the variance of F_t given the past P_t satisfies Condition 4 and, therefore, we do not include it here.

We have shown that SARSA(0) with an RRR learning policy converges to Q̃. Next, we show that Q̃ is, in a sense, an optimal Q-value function. An optimal restricted policy is one that has the highest expected total discounted reward of all restricted policies. The greedy restricted policy for a Q-value function Q is π̄(s, a) = ρ(Q, s, a); it assigns each action the rank of its corresponding Q value. Note that this is the policy dictated by the RRR learning policy for a fixed Q-value function Q. The greedy restricted policy for Q* (the optimal Q-value function of the MDP) is not an optimal restricted policy in general, so the Q-learning rule in Eq. (2) does not find an optimal restricted policy. However, the next theorem shows that the greedy restricted policy for Q̃ (Eq. (7)) is an optimal restricted policy. This Q̃ function is very similar to Q*, except that actions are weighted according to the greedy restricted policy instead of the standard greedy policy.
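When the model (R, P) is known, Q̃ of Eq. (7) can also be computed directly by fixed-point iteration, since the operator on the right-hand side of Eq. (9) is a γ-contraction (Appendix C). The sketch below is ours; the array shapes and tolerance are assumptions made for illustration.

```python
import numpy as np

def q_tilde(R, P, T, gamma=0.9, tol=1e-10):
    """Fixed point of Eq. (7).  R[s, a]: expected reward; P[s, a, s2]: transition
    probabilities; T[i]: probability of choosing the rank-(i+1) action."""
    Q = np.zeros_like(R)
    while True:
        # Rank-based weighting of Eq. (8): the i-th largest Q(s, .) gets weight T(i).
        sorted_desc = -np.sort(-Q, axis=1)                   # Q values of each state in rank order
        otimes = (sorted_desc * T).sum(axis=1)               # (⊗Q)(s) for every state s
        Q_new = R + gamma * np.einsum('sat,t->sa', P, otimes)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```

With T = (1, 0, ..., 0) this reduces to ordinary value iteration for Q*; with the ε-greedy T it produces the target to which SARSA(0) converges under that persistent-exploration policy (Theorem 2).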

Theorem 3. The greedy restricted policy with respect to Q̃ is an optimal restricted policy.

Proof: We construct an alternate MDP so that every restricted policy in the original MDP is in one-to-one correspondence with (and has the same value as) a deterministic stationary policy in the alternate MDP. Note that, as a result of the equality of value functions, the optimal policy of the alternate MDP will correspond to an optimal restricted policy of the original MDP (the restricted policy that achieves the best values for each of the states) and, thus, the theorem will follow if we show that the optimal policy in the alternate MDP corresponds to the greedy restricted policy with respect to Q̃.

The alternate MDP is defined by ⟨S, Ā, R̄, P̄, γ⟩. Its action space, Ā, is the set of all bijections from A to {1, ..., m}, i.e., Ā = (A → {1, ..., m}). The rewards are R̄(s, µ) = Σ_{a∈A} T(µ(a)) R(s, a), and the transition probabilities are given by P̄^µ_{ss'} = Σ_{a∈A} T(µ(a)) P^a_{ss'}. Here, µ is an element of Ā. One can readily check that the value of a restricted policy π̄ is just the value of π̄ in the alternate MDP.

The value of the greedy restricted policy with respect to Q̃ in the original MDP is

Ṽ(s) = Σ_{a∈A} T(ρ(Q̃, s, a)) Q̃(s, a).   (10)

Substituting the definition of Q̃ from Eq. (7) into Eq. (10) results in

Ṽ(s) = Σ_{a∈A} T(ρ(Q̃, s, a)) ( R(s, a) + γ Σ_{s'∈S} P^a_{ss'} Σ_{a'∈A} T(ρ(Q̃, s', a')) Q̃(s', a') ).

Using Eq. (10) once again, we find that Ṽ satisfies the recurrence equation

Ṽ(s) = Σ_{a∈A} T(ρ(Q̃, s, a)) ( R(s, a) + γ Σ_{s'∈S} P^a_{ss'} Ṽ(s') ).   (11)

Meanwhile, the optimum value of the alternate MDP satisfies

V̄*(s) = max_{µ∈Ā} ( R̄(s, µ) + γ Σ_{s'∈S} P̄^µ_{ss'} V̄*(s') )
      = max_{µ∈Ā} ( Σ_{a∈A} T(µ(a)) R(s, a) + γ Σ_{s'∈S} Σ_{a∈A} T(µ(a)) P^a_{ss'} V̄*(s') )
      = max_{µ∈Ā} Σ_{a∈A} T(µ(a)) ( R(s, a) + γ Σ_{s'∈S} P^a_{ss'} V̄*(s') ).   (12)

The highest-value permutation is the one that assigns the highest probabilities to the actions with the highest Q values and the lowest probabilities to the actions with the lowest Q values. Therefore, the recurrence in Eq. (12) is the same as that in Eq. (11), so, by uniqueness, Ṽ = V̄*. This means the greedy restricted policy with respect to Q̃ is the optimal restricted policy.

As a corollary of Theorem 2, given a communicating MDP and an RL algorithm that follows an RRR learning policy specified by a T where T(i) > 0 for all 1 ≤ i ≤ m, SARSA(0) converges to an optimal restricted policy.³

The results of this section show that RRR learning policies with the SARSA(0) update rule converge to optimal restricted policies. In contrast to Q-learning, this means that the learner can adopt its asymptotic policy at any time during learning and still converge to optimality in this modified sense. However, the fact that convergence depends on decaying the learning rate to zero means that this approach is somewhat self-contradictory; in the limit, the learner is still exploring, but it is not able to learn anything new from its discoveries.

4. Conclusion

In this paper, we have provided convergence results for SARSA(0) under two different learning policy classes; one ensures optimal behavior in the limit and the other ensures behavior optimal with respect to constraints imposed by the exploration strategy. To the best of our knowledge, these constitute the first convergence results for any on-policy algorithm. However, these are very basic results because they apply only to the lookup-table case, and more importantly because they do not seem to extend naturally to general multi-step on-policy algorithms.

Appendix A: proof of Lemma 1

For completeness we present Proposition 4.5 of Bertsekas (1995).

Lemma 2. Let

r_{t+1}(i) = (1 − γ_t(i)) r_t(i) + γ_t(i) { (H_t r_t)(i) + w_t(i) + u_t(i) },

where i = 1, 2, ..., n and t = 0, 1, 2, ..., let F_t be an increasing sequence of σ-fields, and assume the following:

1. γ_t ≥ 0, Σ_{t=0}^∞ γ_t(i) = ∞, Σ_{t=0}^∞ γ_t²(i) < ∞ (a.s.);
2. for all i, t: E[w_t(i) | F_t] = 0;
3. there exist A, B ∈ ℝ such that for all i, t: E[w_t²(i) | F_t] ≤ A + B ‖r_t‖²;
4. there exist an r* ∈ ℝⁿ, a positive vector ξ, and a scalar β ∈ [0, 1) such that for all t ≥ 0, ‖H_t r_t − r*‖_ξ ≤ β ‖r_t − r*‖_ξ;
5. there exists θ_t ≥ 0, θ_t → 0 w.p.1, such that for all i, t: |u_t(i)| ≤ θ_t (‖r_t‖_ξ + 1).

Then r_t → r* w.p.1.

For convenience, we repeat Lemma 1.

Lemma 1. Consider a stochastic process (α_t, Δ_t, F_t), t ≥ 0, where α_t, Δ_t, F_t: X → ℝ, which satisfies the equations

Δ_{t+1}(x) = (1 − α_t(x)) Δ_t(x) + α_t(x) F_t(x),   x ∈ X, t = 0, 1, 2, ...

Let P_t be a sequence of increasing σ-fields such that α_0, Δ_0 are P_0-measurable and α_t, Δ_t and F_{t−1} are P_t-measurable, t = 1, 2, ... Assume that the following hold:

1. the set of possible states X is finite;
2. 0 ≤ α_t(x) ≤ 1, Σ_t α_t(x) = ∞, Σ_t α_t²(x) < ∞ w.p.1;
3. ‖E{F_t(·) | P_t}‖_W ≤ κ ‖Δ_t‖_W + c_t, where κ ∈ [0, 1) and c_t converges to zero w.p.1;
4. Var{F_t(x) | P_t} ≤ K(1 + ‖Δ_t‖_W)², where K is some constant.

Then, Δ_t converges to zero with probability one (w.p.1).

Proof: We apply Lemma 2. For simplicity, we present the proof for the case when W = (1, 1, ..., 1). Let

F̄_t = { F_t,                        if |E[F_t | P_t]| ≤ κ ‖Δ_t‖;
      { sign(E[F_t | P_t]) κ ‖Δ_t‖,  otherwise.

Further, let b_t = F_t − F̄_t. Then, by the construction of F̄_t, |E[F̄_t | P_t]| ≤ κ ‖Δ_t‖ and |E[b_t | P_t]| ≤ c_t. Now, if we identify {1, 2, ..., n} with X, take the σ-fields F_t of Lemma 2 to be P_t, and define γ_t = α_t, r_t = Δ_t, H_t r_t = E[F̄_t | P_t], w_t = F̄_t − E[F̄_t | P_t] + b_t − E[b_t | P_t], u_t = E[b_t | P_t], and r* = 0, then we see that the conditions of Lemma 2 are satisfied and thus r_t = Δ_t converges to r* = 0 w.p.1.

Appendix B: GLIE learning policies

Here, we present conditions on the exploration parameter in the commonly used Boltzmann exploration and ε-greedy exploration strategies to ensure that both the infinite exploration and greedy in the limit conditions are satisfied.

In a communicating MDP, every state gets visited infinitely often as long as each action is chosen infinitely often in each state (this is a consequence of the Borel-Cantelli Lemma (Breiman, 1992)); all we have to ensure is that in each state each action gets chosen infinitely often in the limit. Consider some state s. Let t_s(i) represent the timestep at which the ith visit to state s occurs. Consider some action a. The probability with which action a is executed at the ith visit to state s is denoted Pr(a | s, t_s(i)) (i.e., Pr(a_t = a | s_t = s, t_s(i) = t)). We would like to show that if the sum of the probabilities with which action a is chosen is infinite, i.e., Σ_{i=1}^∞ Pr(a | s, t_s(i)) = ∞, then the number of times action a gets executed in state s is infinite w.p.1. This would follow directly from the Borel-Cantelli Lemma if the probabilities of selecting action a at the different i were independent. However, in our case the random choice of action at the ith visit to state s affects the probabilities at the (i + 1)st visit to state s (through the evolution of the Q-value function), so we need an extension of the Borel-Cantelli Lemma (c.f. Corollary 5.29 of Breiman (1992)):

Lemma 3 (Extended Borel-Cantelli Lemma). Let F_i be an increasing sequence of σ-fields and let A_i be F_i-measurable. Then

{ ω : Σ_{i=0}^∞ Pr(A_i | F_{i−1}) = ∞ } = { ω : ω ∈ A_i i.o. }

holds w.p.1.

We have the following:

Lemma 4. Consider a communicating MDP and the reinforced decision process (x_0, a_0, r_0, ..., x_t, a_t, r_t, ...). Let n_t(s) denote the number of visits to state s up to time t, n_t(s, a) denote the number of times action a has been chosen in state s during the first t timesteps (n_t(s, a) ≤ n_t(s)), and t_s(i) denote the time when state s was visited the ith time. Assume that the action at time step t, a_t, is selected purely on the basis of the statistics D_t:

Pr(a_t = a | D_t, a_{t−1}, D_{t−1}, ..., a_0, D_0) = Pr(a_t = a | D_t),   (B.1)

where D_t is computed from the full t-step history (x_0, a_0, r_0, ..., x_t). Further, assume that the action selection policy π is such that

{ ω : lim_t n_t(s)(ω) = ∞ } ⊆ { ω : Σ_{i=0}^∞ Pr( a_{t_s(i)} = a | D_{t_s(i)} )(ω) = ∞ }   a.s.   (B.2)

Then, for all (s, a) pairs, n_t(s) → ∞ a.s. and n_t(s, a) → ∞ a.s.

The statistics D_t could be, for example, (s_t, t, n_t(s), Q_t), where Q_t is computed by the SARSA(0) update rule (3).

Proof: Fix an arbitrary pair (s, a) and let F_i be the sigma field generated by the random variables {D_{t_s(i+1)}, a_{t_s(i)}, D_{t_s(i)}, ..., a_{t_s(0)}, D_{t_s(0)}}. Let A_i = {a_{t_s(i)} = a}. Then A_i is F_i-measurable. Further, by Eq. (B.1),

Pr(A_i | F_{i−1}) = Pr( a_{t_s(i)} = a | D_{t_s(i)}, a_{t_s(i−1)}, D_{t_s(i−1)}, ..., a_{t_s(0)}, D_{t_s(0)} ) = Pr( a_{t_s(i)} = a | D_{t_s(i)} ),
and thus, by Eq. (B.2) and Lemma 3, almost surely

{ ω : lim_t n_t(s)(ω) = ∞ } ⊆ { ω : Σ_{i=0}^∞ Pr( a_{t_s(i)} = a | D_{t_s(i)} )(ω) = ∞ }
                           = { ω : ω ∈ A_i for infinitely many i }
                           = { ω : lim_t n_t(s, a) = ∞ }.

This proves that if state s is visited infinitely often then action a is also chosen infinitely often in that state.

Now let S* be the set of states visited i.o. by s_t, i.e., if S*(ω) = S_0 then S_0 is the set of states which occur i.o. in the sequence {s_0(ω), s_1(ω), ..., s_t(ω), ...}. Clearly, the events {S* = S_0}, S_0 ⊆ S, form a complete event system. Thus, Σ_{S_0⊆S} P(S* = S_0) = 1. Now let S_0 be a nontrivial subset of S. Then, since the MDP is communicating, there exists a pair of states s, s' and an action a such that s ∈ S_0, s' ∉ S_0 and P^a_{ss'} > 0. Then, Pr(S* = S_0) = Pr(S* = S_0, s' ∈ S*) + Pr(S* = S_0, s' ∉ S*). Here, both events are impossible, so Pr(S* = S_0) = 0. Since the MDP is finite, also Pr(S* = ∅) = 0 and so Pr(S* = S) = 1. This yields that Pr(lim_t n_t(s) = ∞) = 1 for all s, thus finishing the proof.

B.1. Boltzmann exploration

In Boltzmann exploration,

Pr(a | s, t, Q, n_t(s)) = e^{β_t(s) Q(s,a)} / Σ_{b∈A} e^{β_t(s) Q(s,b)},

where β_t(s) is the state-specific exploration coefficient for time t. Let the number of visits to state s in t timesteps be denoted as n_t(s) and assume that r(s, a) has a finite range. We know that Σ_{i=1}^∞ c/i = ∞; therefore, to meet the conditions of Lemma 4, we will ensure that for all actions a ∈ A, Pr(a | s, t_s(i)) ≥ c/i (with c ≤ 1). To do that we need, for all a:

e^{β_t(s) Q_t(s,a)} / Σ_{b∈A} e^{β_t(s) Q_t(s,b)} ≥ c / n_t(s)
⇔ Σ_{b∈A} e^{β_t(s) Q_t(s,b)} ≤ (n_t(s)/c) e^{β_t(s) Q_t(s,a)}
⇐ n_t(s) e^{β_t(s) Q_t(s,a)} ≥ c m e^{β_t(s) Q_t(s,b_max)}   (since Σ_{b∈A} e^{β_t(s) Q_t(s,b)} ≤ m e^{β_t(s) Q_t(s,b_max)})
⇔ n_t(s) ≥ c m e^{β_t(s) (Q_t(s,b_max) − Q_t(s,a))}
⇔ ln n_t(s) − ln(cm) ≥ β_t(s) (Q_t(s, b_max) − Q_t(s, a)),

where b_max = argmax_{b∈A} Q_t(s, b) above and m is the number of actions. Further, let c = 1/m, so that ln(cm) = 0. Taken together, this means that we want β_t(s) ≤ ln n_t(s) / C_t(s), where C_t(s) =
max_a (Q_t(s, b_max) − Q_t(s, a)). Note that C_t(s) is bounded because the Q values remain bounded (since r(s, a) has a bounded range). Since for every s, lim_t n_t(s) = ∞, also

lim_t β_t(s) = lim_t ln n_t(s) / C_t(s) = ∞;

this means that Boltzmann exploration with β_t(s) = ln n_t(s) / C_t(s) will be greedy in the limit.

B.2. ε-greedy exploration

In ε-greedy exploration we pick a random exploration action with probability ε_t(s) and the greedy action with probability 1 − ε_t(s). Let ε_t(s) = c/n_t(s) with 0 < c < 1. Then, Pr(a | s, t_s(i)) ≥ ε_t(s)/m, where m is the number of actions. Therefore, Lemma 4 combined with the fact that Σ_{i=1}^∞ c/i = ∞ implies that for all s and a, Σ_{i=1}^∞ Pr(a | s, t_s(i)) = ∞. Also, by Lemma 4, for all s, lim_t n_t(s) = ∞ and, therefore, lim_t ε_t(s) = 0, ensuring that the learning policy is greedy in the limit. Therefore, if ε_t(s) = c/n_t(s), then ε-greedy exploration is GLIE for 0 < c < 1.

Appendix C: generalized Markov decision processes

In this section, we give proofs of several properties associated with generalized MDPs, which are described in more detail by Szepesvári & Littman (1996). Define the Q-value function

Q(s, a) = R(s, a) + γ Σ_{s'∈S} P^a_{ss'} ⊗Q(s', ·),   ∀(s, a) ∈ S × A.   (C.1)

Here, we assume 0 ≤ γ < 1. The important property for ⊗ to satisfy is the non-expansion property:

|⊗Q(s, ·) − ⊗Q'(s, ·)| ≤ max_a |Q(s, a) − Q'(s, a)|

for all Q-value functions Q and Q' and all states s. We begin by showing that an average over actions with a fixed set of weights satisfies the non-expansion property.

Lemma 5. The function ⊗Q(s, ·) = Σ_a p_a Q(s, a) satisfies the non-expansion property, where 0 ≤ p_a ≤ 1 and Σ_a p_a = 1.

Proof: This follows directly from definitions. If Q and Q' are Q-value functions, we have

|⊗Q(s, ·) − ⊗Q'(s, ·)| = |Σ_a p_a (Q(s, a) − Q'(s, a))| ≤ Σ_a p_a |Q(s, a) − Q'(s, a)| ≤ max_a |Q(s, a) − Q'(s, a)|.

A corollary is that a fixed-weight average of functions that satisfy the non-expansion property also satisfies the non-expansion property. We can use Lemma 5 to prove the existence and uniqueness of the Q-value function.

Lemma 6. As long as ⊗ satisfies the non-expansion property, Eq. (C.1) has a solution and it is unique.

Proof: Define the operator L on Q-value functions as

(LQ)(s, a) = R(s, a) + γ Σ_{s'∈S} P^a_{ss'} ⊗Q(s', ·),   for all (s, a) ∈ S × A.

We can rewrite Eq. (C.1) as Q(s, a) = (LQ)(s, a), which has a unique solution if L is a contraction with respect to the max norm. To see that L is a contraction, consider two Q-value functions Q and Q'. We have

‖LQ − LQ'‖ ≤ γ max_{s'} |⊗Q(s', ·) − ⊗Q'(s', ·)| < ‖Q − Q'‖,

where we have used Lemma 5, the fact that γ < 1, and the non-expansion property of ⊗.

Finally, define a family of rank-based operators: ⊗^i Q(s, ·) = ith largest value of Q(s, ·), for each 1 ≤ i ≤ m. We show that these operators satisfy the non-expansion property.

Lemma 7. The ⊗^i Q(s, ·) operators satisfy the non-expansion property.

Proof: Let Q and Q' be Q-value functions and fix s ∈ S. Without loss of generality, assume ⊗^i Q(s, ·) ≥ ⊗^i Q'(s, ·). Let a be the ith largest value of Q(s, ·): Q(s, a) = ⊗^i Q(s, ·). We examine two cases separately and show that the non-expansion property is satisfied either way. If Q'(s, a) ≤ ⊗^i Q'(s, ·), then
⊗^i Q(s, ·) − ⊗^i Q'(s, ·) = Q(s, a) − ⊗^i Q'(s, ·) ≤ Q(s, a) − Q'(s, a) ≤ max_a |Q(s, a) − Q'(s, a)|.

On the other hand, if Q'(s, a) > ⊗^i Q'(s, ·), that means that the rank of a in Q', ρ(Q', s, a), is smaller than i. This implies that there is some a' such that ρ(Q, s, a') < i and ρ(Q', s, a') ≥ i (otherwise there would be i actions with ranks less than i in Q'). For this a',

⊗^i Q(s, ·) − ⊗^i Q'(s, ·) ≤ Q(s, a') − Q'(s, a') ≤ max_a |Q(s, a) − Q'(s, a)|.

Acknowledgments

We thank Richard S. Sutton for help and encouragement. We also thank Nicolas Meuleau and the anonymous reviewers for comments and suggestions. This research was partially supported by NSF grant IIS (Satinder Singh), OTKA Grant No. F20132 (Csaba Szepesvári), Hungarian Ministry of Education Grant No. FKFP 1354/1997 (Csaba Szepesvári), and NSF CAREER grant IRI (Michael Littman).

Notes

1. The name is a reference to the fact that it is a single-step algorithm that makes updates on the basis of a State, Action, Reward, State, Action 5-tuple.
2. Here ‖·‖_W denotes a weighted maximum norm with weight W = (w_1, ..., w_n), w_i > 0: if x ∈ ℝⁿ then ‖x‖_W = max_i (|x_i| / w_i).
3. We conjecture that the same result does not hold for persistent Boltzmann exploration because related asynchronous algorithms do not have a unique target of convergence (Littman, 1996).

References

Barto, A. G., Bradtke, S. J., & Singh, S. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1).
Bellman, R. (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.
Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control (Vols. 1 and 2). Belmont, Massachusetts: Athena Scientific.

Boyan, J. A. & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems (Vol. 7). Cambridge, MA: The MIT Press.
Breiman, L. (1992). Probability. Philadelphia, Pennsylvania: Society for Industrial and Applied Mathematics.
Crites, R. H. & Barto, A. G. (1996). Improving elevator performance using reinforcement learning. Advances in neural information processing systems (Vol. 8). MIT Press.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8(3).
Dayan, P. & Sejnowski, T. J. (1994). TD(λ) converges with probability 1. Machine Learning, 14(3).
Dayan, P. & Sejnowski, T. J. (1996). Exploration bonuses and dual control. Machine Learning, 25.
Gullapalli, V. & Barto, A. G. (1994). Convergence of indirect adaptive asynchronous value iteration algorithms. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems (Vol. 6). San Mateo, CA: Morgan Kaufmann.
Jaakkola, T., Jordan, M. I., & Singh, S. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6).
John, G. H. (1994). When the best move isn't optimal: Q-learning with exploration. In Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle, WA.
John, G. H. (1995). When the best move isn't optimal: Q-learning with exploration. Unpublished manuscript, available at URL ftp://starry.stanford.edu/pub/gjohn/papers/rein-nips.ps.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4.
Kumar, P. R. & Varaiya, P. P. (1986). Stochastic Systems: Estimation, Identification, and Adaptive Control. Englewood Cliffs, NJ: Prentice Hall.
Littman, M. L. & Szepesvári, C. (1996). A generalized reinforcement-learning model: Convergence and applications. In L. Saitta (Ed.), Proceedings of the Thirteenth International Conference on Machine Learning.
Littman, M. L. (1996). Algorithms for sequential decision making. Ph.D. Thesis, Department of Computer Science, Brown University, February. Also Technical Report CS.
Puterman, M. L. (1994). Markov Decision Processes—Discrete Stochastic Dynamic Programming. New York, NY: John Wiley & Sons, Inc.
Rummery, G. A. (1994). Problem solving with reinforcement learning. Ph.D. Thesis, Cambridge University Engineering Department.
Rummery, G. A. & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.
Singh, S. & Bertsekas, D. P. (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. Advances in neural information processing systems (Vol. 9). MIT Press.
Singh, S. & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1–3).
Singh, S. & Yee, R. C. (1994). An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16.
Sutton, R. S. & Barto, A. G. (1998). An Introduction to Reinforcement Learning. The MIT Press.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1).
Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems (Vol. 8). Cambridge, MA: The MIT Press.
Szepesvári, C. & Littman, M. L. (1996). Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. Technical Report CS-96-11, Brown University, Providence, RI.
Tesauro, G. J. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3).
Thrun, S. B. (1992). The role of exploration in learning control. In D. A. White & D. A. Sofge (Eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. New York, NY: Van Nostrand Reinhold.
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3), September.
Tsitsiklis, J. N. & Van Roy, B. (1996). An analysis of temporal-difference learning with function approximation. Technical Report LIDS-P-2322, Massachusetts Institute of Technology, March. Available through URL. To appear in IEEE Transactions on Automatic Control.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. Thesis, King's College, Cambridge, UK.
Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. Machine Learning, 8(3).
Williams, R. J. & Baird, L. C., III (1993). Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-14, Northeastern University, College of Computer Science, Boston, MA.
Zhang, W. & Dietterich, T. G. (1995). High-performance job-shop scheduling with a time-delay TD(λ) network. Advances in neural information processing systems (Vol. 8). MIT Press.

Received July 9, 1997
Accepted August 12, 1999
Final manuscript August 12, 1999


Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1 Exm, Mthemtics 471, Section ETY6 6:5 pm 7:4 pm, Mrch 1, 16, IH-115 Instructor: Attil Máté 1 17 copies 1. ) Stte the usul sufficient condition for the fixed-point itertion to converge when solving the eqution

More information

Properties of Integrals, Indefinite Integrals. Goals: Definition of the Definite Integral Integral Calculations using Antiderivatives

Properties of Integrals, Indefinite Integrals. Goals: Definition of the Definite Integral Integral Calculations using Antiderivatives Block #6: Properties of Integrls, Indefinite Integrls Gols: Definition of the Definite Integrl Integrl Clcultions using Antiderivtives Properties of Integrls The Indefinite Integrl 1 Riemnn Sums - 1 Riemnn

More information

ODE: Existence and Uniqueness of a Solution

ODE: Existence and Uniqueness of a Solution Mth 22 Fll 213 Jerry Kzdn ODE: Existence nd Uniqueness of Solution The Fundmentl Theorem of Clculus tells us how to solve the ordinry differentil eqution (ODE) du = f(t) dt with initil condition u() =

More information

Chapter 0. What is the Lebesgue integral about?

Chapter 0. What is the Lebesgue integral about? Chpter 0. Wht is the Lebesgue integrl bout? The pln is to hve tutoril sheet ech week, most often on Fridy, (to be done during the clss) where you will try to get used to the ides introduced in the previous

More information

Entropy and Ergodic Theory Notes 10: Large Deviations I

Entropy and Ergodic Theory Notes 10: Large Deviations I Entropy nd Ergodic Theory Notes 10: Lrge Devitions I 1 A chnge of convention This is our first lecture on pplictions of entropy in probbility theory. In probbility theory, the convention is tht ll logrithms

More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 203 Outline Riemnn Sums Riemnn Integrls Properties Abstrct

More information

Online Supplements to Performance-Based Contracts for Outpatient Medical Services

Online Supplements to Performance-Based Contracts for Outpatient Medical Services Jing, Png nd Svin: Performnce-bsed Contrcts Article submitted to Mnufcturing & Service Opertions Mngement; mnuscript no. MSOM-11-270.R2 1 Online Supplements to Performnce-Bsed Contrcts for Outptient Medicl

More information

Lecture 14: Quadrature

Lecture 14: Quadrature Lecture 14: Qudrture This lecture is concerned with the evlution of integrls fx)dx 1) over finite intervl [, b] The integrnd fx) is ssumed to be rel-vlues nd smooth The pproximtion of n integrl by numericl

More information

An approximation to the arithmetic-geometric mean. G.J.O. Jameson, Math. Gazette 98 (2014), 85 95

An approximation to the arithmetic-geometric mean. G.J.O. Jameson, Math. Gazette 98 (2014), 85 95 An pproximtion to the rithmetic-geometric men G.J.O. Jmeson, Mth. Gzette 98 (4), 85 95 Given positive numbers > b, consider the itertion given by =, b = b nd n+ = ( n + b n ), b n+ = ( n b n ) /. At ech

More information

Recitation 3: More Applications of the Derivative

Recitation 3: More Applications of the Derivative Mth 1c TA: Pdric Brtlett Recittion 3: More Applictions of the Derivtive Week 3 Cltech 2012 1 Rndom Question Question 1 A grph consists of the following: A set V of vertices. A set E of edges where ech

More information

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior Reversls of Signl-Posterior Monotonicity for Any Bounded Prior Christopher P. Chmbers Pul J. Hely Abstrct Pul Milgrom (The Bell Journl of Economics, 12(2): 380 391) showed tht if the strict monotone likelihood

More information

20 MATHEMATICS POLYNOMIALS

20 MATHEMATICS POLYNOMIALS 0 MATHEMATICS POLYNOMIALS.1 Introduction In Clss IX, you hve studied polynomils in one vrible nd their degrees. Recll tht if p(x) is polynomil in x, the highest power of x in p(x) is clled the degree of

More information

CS 188: Artificial Intelligence Spring 2007

CS 188: Artificial Intelligence Spring 2007 CS 188: Artificil Intelligence Spring 2007 Lecture 3: Queue-Bsed Serch 1/23/2007 Srini Nrynn UC Berkeley Mny slides over the course dpted from Dn Klein, Sturt Russell or Andrew Moore Announcements Assignment

More information

8 Laplace s Method and Local Limit Theorems

8 Laplace s Method and Local Limit Theorems 8 Lplce s Method nd Locl Limit Theorems 8. Fourier Anlysis in Higher DImensions Most of the theorems of Fourier nlysis tht we hve proved hve nturl generliztions to higher dimensions, nd these cn be proved

More information

Best Approximation. Chapter The General Case

Best Approximation. Chapter The General Case Chpter 4 Best Approximtion 4.1 The Generl Cse In the previous chpter, we hve seen how n interpolting polynomil cn be used s n pproximtion to given function. We now wnt to find the best pproximtion to given

More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 2013 Outline 1 Riemnn Sums 2 Riemnn Integrls 3 Properties

More information

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams Chpter 4 Contrvrince, Covrince, nd Spcetime Digrms 4. The Components of Vector in Skewed Coordintes We hve seen in Chpter 3; figure 3.9, tht in order to show inertil motion tht is consistent with the Lorentz

More information

Jim Lambers MAT 169 Fall Semester Lecture 4 Notes

Jim Lambers MAT 169 Fall Semester Lecture 4 Notes Jim Lmbers MAT 169 Fll Semester 2009-10 Lecture 4 Notes These notes correspond to Section 8.2 in the text. Series Wht is Series? An infinte series, usully referred to simply s series, is n sum of ll of

More information

Numerical Integration

Numerical Integration Chpter 5 Numericl Integrtion Numericl integrtion is the study of how the numericl vlue of n integrl cn be found. Methods of function pproximtion discussed in Chpter??, i.e., function pproximtion vi the

More information

Near-Bayesian Exploration in Polynomial Time

Near-Bayesian Exploration in Polynomial Time J. Zico Kolter kolter@cs.stnford.edu Andrew Y. Ng ng@cs.stnford.edu Computer Science Deprtment, Stnford University, CA 94305 Abstrct We consider the explortion/exploittion problem in reinforcement lerning

More information

Sufficient condition on noise correlations for scalable quantum computing

Sufficient condition on noise correlations for scalable quantum computing Sufficient condition on noise correltions for sclble quntum computing John Presill, 2 Februry 202 Is quntum computing sclble? The ccurcy threshold theorem for quntum computtion estblishes tht sclbility

More information

UNIFORM CONVERGENCE. Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3

UNIFORM CONVERGENCE. Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3 UNIFORM CONVERGENCE Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3 Suppose f n : Ω R or f n : Ω C is sequence of rel or complex functions, nd f n f s n in some sense. Furthermore,

More information

Strong Bisimulation. Overview. References. Actions Labeled transition system Transition semantics Simulation Bisimulation

Strong Bisimulation. Overview. References. Actions Labeled transition system Transition semantics Simulation Bisimulation Strong Bisimultion Overview Actions Lbeled trnsition system Trnsition semntics Simultion Bisimultion References Robin Milner, Communiction nd Concurrency Robin Milner, Communicting nd Mobil Systems 32

More information

Acceptance Sampling by Attributes

Acceptance Sampling by Attributes Introduction Acceptnce Smpling by Attributes Acceptnce smpling is concerned with inspection nd decision mking regrding products. Three spects of smpling re importnt: o Involves rndom smpling of n entire

More information

Math 270A: Numerical Linear Algebra

Math 270A: Numerical Linear Algebra Mth 70A: Numericl Liner Algebr Instructor: Michel Holst Fll Qurter 014 Homework Assignment #3 Due Give to TA t lest few dys before finl if you wnt feedbck. Exercise 3.1. (The Bsic Liner Method for Liner

More information

Research Article Moment Inequalities and Complete Moment Convergence

Research Article Moment Inequalities and Complete Moment Convergence Hindwi Publishing Corportion Journl of Inequlities nd Applictions Volume 2009, Article ID 271265, 14 pges doi:10.1155/2009/271265 Reserch Article Moment Inequlities nd Complete Moment Convergence Soo Hk

More information

Non-Linear & Logistic Regression

Non-Linear & Logistic Regression Non-Liner & Logistic Regression If the sttistics re boring, then you've got the wrong numbers. Edwrd R. Tufte (Sttistics Professor, Yle University) Regression Anlyses When do we use these? PART 1: find

More information

Applicable Analysis and Discrete Mathematics available online at

Applicable Analysis and Discrete Mathematics available online at Applicble Anlysis nd Discrete Mthemtics vilble online t http://pefmth.etf.rs Appl. Anl. Discrete Mth. 4 (2010), 23 31. doi:10.2298/aadm100201012k NUMERICAL ANALYSIS MEETS NUMBER THEORY: USING ROOTFINDING

More information

UNIT 1 FUNCTIONS AND THEIR INVERSES Lesson 1.4: Logarithmic Functions as Inverses Instruction

UNIT 1 FUNCTIONS AND THEIR INVERSES Lesson 1.4: Logarithmic Functions as Inverses Instruction Lesson : Logrithmic Functions s Inverses Prerequisite Skills This lesson requires the use of the following skills: determining the dependent nd independent vribles in n exponentil function bsed on dt from

More information

Applying Q-Learning to Flappy Bird

Applying Q-Learning to Flappy Bird Applying Q-Lerning to Flppy Bird Moritz Ebeling-Rump, Mnfred Ko, Zchry Hervieux-Moore Abstrct The field of mchine lerning is n interesting nd reltively new re of reserch in rtificil intelligence. In this

More information

Reinforcement learning

Reinforcement learning Reinforcement lerning Regulr MDP Given: Trnition model P Rewrd function R Find: Policy π Reinforcement lerning Trnition model nd rewrd function initilly unknown Still need to find the right policy Lern

More information

LECTURE NOTE #12 PROF. ALAN YUILLE

LECTURE NOTE #12 PROF. ALAN YUILLE LECTURE NOTE #12 PROF. ALAN YUILLE 1. Clustering, K-mens, nd EM Tsk: set of unlbeled dt D = {x 1,..., x n } Decompose into clsses w 1,..., w M where M is unknown. Lern clss models p(x w)) Discovery of

More information

Euler, Ioachimescu and the trapezium rule. G.J.O. Jameson (Math. Gazette 96 (2012), )

Euler, Ioachimescu and the trapezium rule. G.J.O. Jameson (Math. Gazette 96 (2012), ) Euler, Iochimescu nd the trpezium rule G.J.O. Jmeson (Mth. Gzette 96 (0), 36 4) The following results were estblished in recent Gzette rticle [, Theorems, 3, 4]. Given > 0 nd 0 < s

More information

Multi-Armed Bandits: Non-adaptive and Adaptive Sampling

Multi-Armed Bandits: Non-adaptive and Adaptive Sampling CSE 547/Stt 548: Mchine Lerning for Big Dt Lecture Multi-Armed Bndits: Non-dptive nd Adptive Smpling Instructor: Shm Kkde 1 The (stochstic) multi-rmed bndit problem The bsic prdigm is s follows: K Independent

More information

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as Improper Integrls Two different types of integrls cn qulify s improper. The first type of improper integrl (which we will refer to s Type I) involves evluting n integrl over n infinite region. In the grph

More information

A Fast and Reliable Policy Improvement Algorithm

A Fast and Reliable Policy Improvement Algorithm A Fst nd Relible Policy Improvement Algorithm Ysin Abbsi-Ydkori Peter L. Brtlett Stephen J. Wright Queenslnd University of Technology UC Berkeley nd QUT University of Wisconsin-Mdison Abstrct We introduce

More information

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a).

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a). The Fundmentl Theorems of Clculus Mth 4, Section 0, Spring 009 We now know enough bout definite integrls to give precise formultions of the Fundmentl Theorems of Clculus. We will lso look t some bsic emples

More information

Numerical integration

Numerical integration 2 Numericl integrtion This is pge i Printer: Opque this 2. Introduction Numericl integrtion is problem tht is prt of mny problems in the economics nd econometrics literture. The orgniztion of this chpter

More information

CS667 Lecture 6: Monte Carlo Integration 02/10/05

CS667 Lecture 6: Monte Carlo Integration 02/10/05 CS667 Lecture 6: Monte Crlo Integrtion 02/10/05 Venkt Krishnrj Lecturer: Steve Mrschner 1 Ide The min ide of Monte Crlo Integrtion is tht we cn estimte the vlue of n integrl by looking t lrge number of

More information

Decomposition of terms in Lucas sequences

Decomposition of terms in Lucas sequences Journl of Logic & Anlysis 1:4 009 1 3 ISSN 1759-9008 1 Decomposition of terms in Lucs sequences ABDELMADJID BOUDAOUD Let P, Q be non-zero integers such tht D = P 4Q is different from zero. The sequences

More information

( dg. ) 2 dt. + dt. dt j + dh. + dt. r(t) dt. Comparing this equation with the one listed above for the length of see that

( dg. ) 2 dt. + dt. dt j + dh. + dt. r(t) dt. Comparing this equation with the one listed above for the length of see that Arc Length of Curves in Three Dimensionl Spce If the vector function r(t) f(t) i + g(t) j + h(t) k trces out the curve C s t vries, we cn mesure distnces long C using formul nerly identicl to one tht we

More information

and that at t = 0 the object is at position 5. Find the position of the object at t = 2.

and that at t = 0 the object is at position 5. Find the position of the object at t = 2. 7.2 The Fundmentl Theorem of Clculus 49 re mny, mny problems tht pper much different on the surfce but tht turn out to be the sme s these problems, in the sense tht when we try to pproimte solutions we

More information

13: Diffusion in 2 Energy Groups

13: Diffusion in 2 Energy Groups 3: Diffusion in Energy Groups B. Rouben McMster University Course EP 4D3/6D3 Nucler Rector Anlysis (Rector Physics) 5 Sept.-Dec. 5 September Contents We study the diffusion eqution in two energy groups

More information

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007 A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H Thoms Shores Deprtment of Mthemtics University of Nebrsk Spring 2007 Contents Rtes of Chnge nd Derivtives 1 Dierentils 4 Are nd Integrls 5 Multivrite Clculus

More information

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite Unit #8 : The Integrl Gols: Determine how to clculte the re described by function. Define the definite integrl. Eplore the reltionship between the definite integrl nd re. Eplore wys to estimte the definite

More information

Solution for Assignment 1 : Intro to Probability and Statistics, PAC learning

Solution for Assignment 1 : Intro to Probability and Statistics, PAC learning Solution for Assignment 1 : Intro to Probbility nd Sttistics, PAC lerning 10-701/15-781: Mchine Lerning (Fll 004) Due: Sept. 30th 004, Thursdy, Strt of clss Question 1. Bsic Probbility ( 18 pts) 1.1 (

More information

Intuitionistic Fuzzy Lattices and Intuitionistic Fuzzy Boolean Algebras

Intuitionistic Fuzzy Lattices and Intuitionistic Fuzzy Boolean Algebras Intuitionistic Fuzzy Lttices nd Intuitionistic Fuzzy oolen Algebrs.K. Tripthy #1, M.K. Stpthy *2 nd P.K.Choudhury ##3 # School of Computing Science nd Engineering VIT University Vellore-632014, TN, Indi

More information

f(x) dx, If one of these two conditions is not met, we call the integral improper. Our usual definition for the value for the definite integral

f(x) dx, If one of these two conditions is not met, we call the integral improper. Our usual definition for the value for the definite integral Improper Integrls Every time tht we hve evluted definite integrl such s f(x) dx, we hve mde two implicit ssumptions bout the integrl:. The intervl [, b] is finite, nd. f(x) is continuous on [, b]. If one

More information

Riemann is the Mann! (But Lebesgue may besgue to differ.)

Riemann is the Mann! (But Lebesgue may besgue to differ.) Riemnn is the Mnn! (But Lebesgue my besgue to differ.) Leo Livshits My 2, 2008 1 For finite intervls in R We hve seen in clss tht every continuous function f : [, b] R hs the property tht for every ɛ >

More information

Chapter 14. Matrix Representations of Linear Transformations

Chapter 14. Matrix Representations of Linear Transformations Chpter 4 Mtrix Representtions of Liner Trnsformtions When considering the Het Stte Evolution, we found tht we could describe this process using multipliction by mtrix. This ws nice becuse computers cn

More information

SUMMER KNOWHOW STUDY AND LEARNING CENTRE

SUMMER KNOWHOW STUDY AND LEARNING CENTRE SUMMER KNOWHOW STUDY AND LEARNING CENTRE Indices & Logrithms 2 Contents Indices.2 Frctionl Indices.4 Logrithms 6 Exponentil equtions. Simplifying Surds 13 Opertions on Surds..16 Scientific Nottion..18

More information

Bernoulli Numbers Jeff Morton

Bernoulli Numbers Jeff Morton Bernoulli Numbers Jeff Morton. We re interested in the opertor e t k d k t k, which is to sy k tk. Applying this to some function f E to get e t f d k k tk d k f f + d k k tk dk f, we note tht since f

More information

Convergence of Fourier Series and Fejer s Theorem. Lee Ricketson

Convergence of Fourier Series and Fejer s Theorem. Lee Ricketson Convergence of Fourier Series nd Fejer s Theorem Lee Ricketson My, 006 Abstrct This pper will ddress the Fourier Series of functions with rbitrry period. We will derive forms of the Dirichlet nd Fejer

More information

Tests for the Ratio of Two Poisson Rates

Tests for the Ratio of Two Poisson Rates Chpter 437 Tests for the Rtio of Two Poisson Rtes Introduction The Poisson probbility lw gives the probbility distribution of the number of events occurring in specified intervl of time or spce. The Poisson

More information

7.2 The Definite Integral

7.2 The Definite Integral 7.2 The Definite Integrl the definite integrl In the previous section, it ws found tht if function f is continuous nd nonnegtive, then the re under the grph of f on [, b] is given by F (b) F (), where

More information

Natural examples of rings are the ring of integers, a ring of polynomials in one variable, the ring

Natural examples of rings are the ring of integers, a ring of polynomials in one variable, the ring More generlly, we define ring to be non-empty set R hving two binry opertions (we ll think of these s ddition nd multipliction) which is n Abelin group under + (we ll denote the dditive identity by 0),

More information

Czechoslovak Mathematical Journal, 55 (130) (2005), , Abbotsford. 1. Introduction

Czechoslovak Mathematical Journal, 55 (130) (2005), , Abbotsford. 1. Introduction Czechoslovk Mthemticl Journl, 55 (130) (2005), 933 940 ESTIMATES OF THE REMAINDER IN TAYLOR S THEOREM USING THE HENSTOCK-KURZWEIL INTEGRAL, Abbotsford (Received Jnury 22, 2003) Abstrct. When rel-vlued

More information

MAA 4212 Improper Integrals

MAA 4212 Improper Integrals Notes by Dvid Groisser, Copyright c 1995; revised 2002, 2009, 2014 MAA 4212 Improper Integrls The Riemnn integrl, while perfectly well-defined, is too restrictive for mny purposes; there re functions which

More information

1.9 C 2 inner variations

1.9 C 2 inner variations 46 CHAPTER 1. INDIRECT METHODS 1.9 C 2 inner vritions So fr, we hve restricted ttention to liner vritions. These re vritions of the form vx; ǫ = ux + ǫφx where φ is in some liner perturbtion clss P, for

More information

On the degree of regularity of generalized van der Waerden triples

On the degree of regularity of generalized van der Waerden triples On the degree of regulrity of generlized vn der Werden triples Jcob Fox Msschusetts Institute of Technology, Cmbridge, MA 02139, USA Rdoš Rdoičić Deprtment of Mthemtics, Rutgers, The Stte University of

More information

REPRESENTATION THEORY OF PSL 2 (q)

REPRESENTATION THEORY OF PSL 2 (q) REPRESENTATION THEORY OF PSL (q) YAQIAO LI Following re notes from book [1]. The im is to show the qusirndomness of PSL (q), i.e., the group hs no low dimensionl representtion. 1. Representtion Theory

More information

New data structures to reduce data size and search time

New data structures to reduce data size and search time New dt structures to reduce dt size nd serch time Tsuneo Kuwbr Deprtment of Informtion Sciences, Fculty of Science, Kngw University, Hirtsuk-shi, Jpn FIT2018 1D-1, No2, pp1-4 Copyright (c)2018 by The Institute

More information

Cf. Linn Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Series in Probability & Statistics, 1999.

Cf. Linn Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Series in Probability & Statistics, 1999. Cf. Linn Sennott, Stochstic Dynmic Progrmming nd the Control of Queueing Systems, Wiley Series in Probbility & Sttistics, 1999. D.L.Bricker, 2001 Dept of Industril Engineering The University of Iow MDP

More information

Math 61CM - Solutions to homework 9

Math 61CM - Solutions to homework 9 Mth 61CM - Solutions to homework 9 Cédric De Groote November 30 th, 2018 Problem 1: Recll tht the left limit of function f t point c is defined s follows: lim f(x) = l x c if for ny > 0 there exists δ

More information

Coalgebra, Lecture 15: Equations for Deterministic Automata

Coalgebra, Lecture 15: Equations for Deterministic Automata Colger, Lecture 15: Equtions for Deterministic Automt Julin Slmnc (nd Jurrin Rot) Decemer 19, 2016 In this lecture, we will study the concept of equtions for deterministic utomt. The notes re self contined

More information

arxiv:math/ v2 [math.ho] 16 Dec 2003

arxiv:math/ v2 [math.ho] 16 Dec 2003 rxiv:mth/0312293v2 [mth.ho] 16 Dec 2003 Clssicl Lebesgue Integrtion Theorems for the Riemnn Integrl Josh Isrlowitz 244 Ridge Rd. Rutherford, NJ 07070 jbi2@njit.edu Februry 1, 2008 Abstrct In this pper,

More information

Student Activity 3: Single Factor ANOVA

Student Activity 3: Single Factor ANOVA MATH 40 Student Activity 3: Single Fctor ANOVA Some Bsic Concepts In designed experiment, two or more tretments, or combintions of tretments, is pplied to experimentl units The number of tretments, whether

More information

The steps of the hypothesis test

The steps of the hypothesis test ttisticl Methods I (EXT 7005) Pge 78 Mosquito species Time of dy A B C Mid morning 0.0088 5.4900 5.5000 Mid Afternoon.3400 0.0300 0.8700 Dusk 0.600 5.400 3.000 The Chi squre test sttistic is the sum of

More information

Estimation of Binomial Distribution in the Light of Future Data

Estimation of Binomial Distribution in the Light of Future Data British Journl of Mthemtics & Computer Science 102: 1-7, 2015, Article no.bjmcs.19191 ISSN: 2231-0851 SCIENCEDOMAIN interntionl www.sciencedomin.org Estimtion of Binomil Distribution in the Light of Future

More information

Lecture Note 9: Orthogonal Reduction

Lecture Note 9: Orthogonal Reduction MATH : Computtionl Methods of Liner Algebr 1 The Row Echelon Form Lecture Note 9: Orthogonl Reduction Our trget is to solve the norml eution: Xinyi Zeng Deprtment of Mthemticl Sciences, UTEP A t Ax = A

More information