Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms


Machine Learning, 39, 287–308, 2000. © 2000 Kluwer Academic Publishers. Printed in The Netherlands.

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

SATINDER SINGH, AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ 07932, USA. baveja@research.att.com
TOMMI JAAKKOLA, Department of Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. tommi@ai.mit.edu
MICHAEL L. LITTMAN, Department of Computer Science, Duke University, Durham, NC, USA. mlittman@cs.duke.edu
CSABA SZEPESVÁRI, Mindmaker Ltd., Konkoly Thege M. u., Budapest 1121, Hungary. szepes@mindmaker.hu

Editor: Sridhar Mahadevan

Abstract. An important application of reinforcement learning (RL) is to finite-state control problems, and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.

Keywords: reinforcement learning, on-policy, convergence, Markov decision processes

1. Introduction

Most reinforcement-learning (RL) algorithms (Kaelbling et al., 1996; Sutton & Barto, 1998) for solving discrete optimal control problems use evaluation or value functions to cache the results of experience. This is useful because close approximations to optimal value functions lead directly to good control policies (Williams & Baird, 1993; Singh & Yee, 1994). Different RL algorithms combine new experience with old value functions to produce new and statistically improved value functions in different ways. All such algorithms face a tradeoff between exploitation and exploration (Thrun, 1992; Kumar & Varaiya, 1986; Dayan & Sejnowski, 1996), i.e., between choosing actions that are best according to the current state of knowledge, and actions that are not the current best but improve the state of knowledge and potentially yield higher payoffs in the future.

Following Sutton and Barto (1998), we distinguish between two types of RL algorithms: on-policy and off-policy. Off-policy algorithms may update estimated value functions on the
basis of hypothetical actions, i.e., actions other than those actually executed; in this sense, Q-learning (Watkins & Dayan, 1992) is an off-policy algorithm. On-policy algorithms, on the other hand, update value functions strictly on the basis of the experience gained from executing some (possibly non-stationary) policy. This distinction is important because off-policy algorithms can (at least conceptually) separate exploration from control while on-policy algorithms cannot. More precisely, in the case of on-policy algorithms, a convergence proof requires more details of the exploration to be specified than for off-policy algorithms, since the update rule depends a great deal on the actions taken by the system.

On-policy algorithms may prove to be important for several reasons. The analogue of the on-policy/off-policy distinction for RL prediction problems is the trajectory-based/trajectory-free distinction. Trajectory-based algorithms appear superior to trajectory-free algorithms for prediction when parameterized function approximators are used (Tsitsiklis & Van Roy, 1996). These results carry over empirically to the control case as well (Boyan & Moore, 1995; Sutton, 1996). In addition, multi-step prediction algorithms such as TD(λ) (Sutton, 1988) are more flexible and data efficient than single-step algorithms (TD(0)), and most natural multi-step algorithms for control are on-policy.

Another motivation for studying on-policy algorithms is the consideration of the interaction between exploration and optimal actions, identified by Sutton and Barto (1998) and John (1994). Consider a robot learning to maximize reward in a dangerous environment. Throughout its existence, it will need to execute exploration actions to help it learn about its options. However, some of these exploration actions will lead to bad outcomes. An on-policy learner will factor in the costs of exploration, and tend to avoid entering parts of the state space where exploration is more dangerous.

A suggestive example appears in figure 1. This 3-state deterministic MDP has two actions: l and r. For a discount factor of γ = 0.9, the optimal action choice from state y is r (value 75.8 as opposed to 74.9 for action l). On the other hand, if exploratory actions are taken 50% of the time, the risk of picking the dangerous action r in state z becomes too great. Now, for a discount factor of γ = 0.9, the optimal action choice from state y is l (value 58.8 as opposed to 58.6 for action r). In other environments, the difference can be even greater, necessitating the application of on-policy learning methods.

Figure 1. The optimal action to take in this small, deterministic MDP depends on the exploration strategy. If no exploration actions are taken, the optimal action from state y is r. If exploration actions are taken, the costly action r will sometimes be chosen from state z, making l the best choice from y.

In this paper, we examine the convergence of single-step (value updates based on the value of the next timestep only), on-policy RL algorithms for control. We do not address either
function approximation or multi-step algorithms; this is the subject of our ongoing research. Earlier work has shown that there are off-policy RL algorithms that converge to optimal value functions (Watkins & Dayan, 1992; Dayan, 1992; Jaakkola et al., 1994; Tsitsiklis, 1994; Gullapalli & Barto, 1994; Littman & Szepesvári, 1996); we prove convergence results for several related on-policy algorithms. We also provide examples of policies that can be followed during learning that result in convergence to both optimal values and optimal policies. These results generalize naturally to off-policy algorithms, such as Q-learning, showing the convergence of many RL algorithms to optimal policies.

2. Solving Markov decision problems

Markov decision processes (MDPs) are widely used to model controlled dynamical systems in control theory, operations research and artificial intelligence (Puterman, 1994; Bertsekas, 1995; Barto et al., 1995; Sutton & Barto, 1998). Let S = {1, 2, ..., N} denote the discrete set of states of the system, and let A be the discrete set of actions available to the system. The probability of making a transition from state s to state s' on action a is denoted P^a_{ss'} and the random payoff associated with that transition is denoted r(s, a). A policy maps each state to a probability distribution over actions; this mapping can be invariant over time (stationary) or change as a function of the interaction history (non-stationary). For any policy π, we define a value function

V^π(s) = E_π { Σ_{t=0}^∞ γ^t r_t | s_0 = s },

which is the expected value of the infinite-horizon sum of the discounted payoffs when the system is started in state s and the policy π is followed forever. Note that r_t and s_t are the payoff and state respectively at timestep t, and (r_t, s_t) is a stochastic process, where (r_t, s_{t+1}) depends only on (s_t, a_t), governed by the rules that r_t is distributed as r(s_t, a_t) and the probability that s_{t+1} = s' is P^{a_t}_{s_t s'}. Here, a_t is the action taken by the system at timestep t. The discount factor, 0 ≤ γ < 1, makes payoffs in the future less valuable than more immediate payoffs.

The solution of an MDP is an optimal policy π* that simultaneously maximizes the value of every state s ∈ S. It is known that a stationary deterministic optimal policy exists for every MDP (c.f. Bertsekas (1995)). Hereafter, unless explicitly noted, all policies are assumed to be stationary. The value function associated with π* is denoted V*. Often it is convenient to associate values not with states but with state-action pairs, called Q values as in Watkins' Q-learning (Watkins, 1989):

Q^π(s, a) = R(s, a) + γ E{V^π(s')}   and   Q*(s, a) = R(s, a) + γ E{V*(s')},
where s' is the random next state on executing action a in state s, and R(s, a) is the expected value of r(s, a). Clearly, π*(s) = argmax_a Q*(s, a), and V*(s) = max_a Q*(s, a). The optimal Q values satisfy the recursive Bellman optimality equations (Bellman, 1957),

Q*(s, a) = R(s, a) + γ Σ_{s'} P^a_{ss'} max_b Q*(s', b),   ∀s, a.   (1)

In reinforcement learning, the quantities that define the MDP, P and R, are not known in advance. An RL algorithm must find an optimal policy by interacting with the MDP directly; because effective learning typically requires the algorithm to revisit every state many times, we assume the MDP is communicating (every state can be reached from every other state).

2.1. Off-policy and on-policy algorithms

Most RL algorithms for solving MDPs are iterative, producing a sequence of estimates of either the optimal (Q-)value function or the optimal policy or both by repeatedly combining old estimates with the results of a new trial to produce new estimates. An RL algorithm can be decomposed into two components. The learning policy is a non-stationary policy that maps experience (states visited, actions chosen, rewards received) into a current choice of action. The update rule is how the algorithm uses experience to change its estimate of the optimal value function. In an off-policy algorithm, the update rule need not have any relation to the learning policy. Q-learning (Watkins, 1989) is an off-policy algorithm that estimates the optimal Q-value function as follows:

Q_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Q_t(s_t, a_t) + α_t(s_t, a_t) [ r_t + γ max_b Q_t(s_{t+1}, b) ],   (2)

where Q_t is the estimate at the beginning of the tth timestep, and s_t, a_t, r_t, and α_t are the state, action, reward, and step size (learning rate) at timestep t. Equation (2) is an off-policy algorithm as the update of Q_t(s_t, a_t) depends on max_b Q_t(s_{t+1}, b), which relies on comparing various hypothetical actions b. The convergence of the Q-learning algorithm does not put any strong requirements on the learning policy other than that every action is experienced in every state infinitely often. This can be accomplished, for example, using the random-walk learning policy, which chooses actions uniformly at random. Later, we describe several other learning policies that result in convergence when combined with the Q-learning update rule.

The update rule for SARSA(0) (Rummery, 1994; also see Rummery & Niranjan, 1994; John, 1994, 1995; Singh & Sutton, 1996; Sutton, 1996):

Q_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Q_t(s_t, a_t) + α_t(s_t, a_t) [ r_t + γ Q_t(s_{t+1}, a_{t+1}) ],   (3)
a special case of SARSA(λ) with λ = 0, is quite similar to the update rule for Q-learning. The main difference is that Q-learning makes an update based on the greedy Q value of the successor state, s_{t+1}, while SARSA(0)¹ uses the Q value of the action a_{t+1} actually chosen by the learning policy. This makes SARSA(0) an on-policy algorithm, and therefore its conditions for convergence depend a great deal on the learning policy. In particular, because SARSA(0) learns the value of its own actions, the Q values can converge to optimality in the limit only if the learning policy chooses actions optimally in the limit. Section 3 provides some positive convergence results for two significant classes of learning policies.

Under a greedy learning policy (i.e., always select the action that is best according to the current estimate), the update rules for Q-learning and SARSA(0) are identical. The resulting RL algorithm would not converge to optimal solutions, in general, because the need for infinite exploration would not be satisfied. This helps illustrate the tension between adequate exploration and exploitation with regard to convergence to optimality. It is worth noting, however, that the approach of using a greedy learning policy has yielded some impressive successes, including the world's finest backgammon-playing program (Tesauro, 1995), and state-of-the-art systems for space shuttle scheduling (Zhang & Dietterich, 1995), elevator control (Crites & Barto, 1996), and cellular telephone resource allocation (Singh & Bertsekas, 1997). All these applications can be viewed as exploiting on-policy algorithms, although the on-policy versus off-policy distinction is not meaningful when no explicit exploration is used.

2.2. Learning policies

A learning policy selects an action at timestep t as a function of the history of states, actions, and rewards experienced so far. In this paper, we consider several learning policies that make decisions based on a summary of history consisting of the current timestep t, the current state s, the current estimate Q of the optimal Q-value function, and the number of times state s has been visited before time t, n_t(s). Such a learning policy can be expressed as the probabilities Pr(a | s, t, Q, n_t(s)), the probability that action a is selected given the history.

We divide learning policies for MDPs into two broad categories: decaying exploration, a learning policy that becomes more and more like the greedy learning policy over time, and persistent exploration, a learning policy that does not. The advantage of decaying exploration policies is that the actions taken by the system may converge to the optimal ones eventually, but with the price that their ability to adapt slows down. In contrast to this, persistent exploration learning policies can retain their adaptivity forever, but with the price that the actions of the system will not converge to optimality in the standard sense. We prove the convergence of SARSA(0) to optimal policies in the standard sense for a class of decaying exploration learning policies, and to optimal policies in a special sense defined below for a class of persistent exploration learning policies.

Consider the class of decaying exploration learning policies characterized by the following two properties:

1. each action is executed infinitely often in every state that is visited infinitely often, and
2. in the limit, the learning policy is greedy with respect to the Q-value function with probability 1.

We label learning policies satisfying the above conditions as GLIE, which stands for "greedy in the limit with infinite exploration." An example of such a learning policy is a form of Boltzmann exploration:

Pr(a | s, t, Q) = e^{β_t(s) Q(s,a)} / Σ_{b∈A} e^{β_t(s) Q(s,b)},

where β_t(s) is the state-specific exploration coefficient for time t, which controls the rate of exploration in the learning policy. To meet Condition 2 above, we would like β_t to be infinite in the limit, while to meet Condition 1 above we would like β_t to not approach infinity too fast. In Appendix B, we show that β_t(s) = ln n_t(s) / C_t(s) satisfies the above requirements (where n_t(s) ≤ t is the number of times state s has been visited in t timesteps, and C_t(s) is defined in Appendix B). Another example of a GLIE learning policy is a form of ε-greedy exploration (Sutton, 1996), which at timestep t in state s picks a random exploration action with probability ε_t(s) and the greedy action with probability 1 − ε_t(s). In Appendix B, we show that if ε_t(s) = c/n_t(s) for 0 < c < 1, then ε-greedy exploration is GLIE.

We also analyze restricted rank-based randomized (RRR) learning policies, a class of persistent exploration learning policies commonly used in practice. An RRR learning policy selects actions probabilistically according to the ranks of their Q values, choosing the greedy action with the highest probability and the action with the lowest Q value with the lowest probability. Different learning policies can be specified by different choices of the function T: {1, ..., m} → ℝ that maps action ranks to probabilities. Here, m is the number of actions. For consistency, we require that T(1) ≥ T(2) ≥ ⋯ ≥ T(m) and Σ_{i=1}^m T(i) = 1.

At timestep t, the RRR learning policy chooses an action by first ranking the available actions according to the Q values assigned by the current Q-value function Q_t for the current state s_t. We use the notation ρ(Q, s, a) to be the rank of action a in state s based on Q(s, ·) (e.g., if ρ(Q, s, a) = 1 then a = argmax_b Q(s, b)), with ties broken arbitrarily. Once the actions are ranked, the ith ranked action is chosen with probability T(i); that is, action a is chosen with probability T(ρ(Q, s, a)). The RRR learning policy is restricted in that it does not directly choose actions; it simply assigns probabilities to actions according to their ranks. Therefore, an RRR learning policy has the form Pr(a | s, t, Q) = T(ρ(Q_t, s, a)).

To illustrate the use of the T function, we specify three well-known learning policies as RRR learning policies by the appropriate definition of T. The random-walk learning policy chooses each action in state s with probability 1/m. To achieve this behavior with the RRR learning policy, simply define T(i) = 1/m for all i; actions will be chosen uniformly at random regardless of their rank. The greedy learning policy can be specified by T(1) = 1, T(i) = 0 when 1 < i ≤ m; it deterministically selects the action with the highest Q value. Similarly, ε-greedy exploration can be specified by defining T(1) = 1 − ε + ε/m, T(i) = ε/m for 1 < i ≤ m. This policy takes the greedy action with probability 1 − ε and a random action otherwise. To satisfy the condition that T(1) ≥ T(2) ≥ ⋯ ≥ T(m), we require that 0 ≤ ε ≤ 1.

Another commonly used persistent exploration learning policy is Boltzmann exploration with a fixed exploration parameter. Note there is no choice of T that specifies Boltzmann exploration; Boltzmann exploration is not an RRR learning policy as the probability of choosing an action depends on the actual Q values and not only on the ranks of the actions in Q(s, ·).
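As a concrete illustration of the learning policies above, the following sketch (ours, not from the paper; function and variable names are assumptions) implements the RRR selection rule: actions are ranked by their current Q values and the rank-i action is drawn with probability T(i). The three T vectors shown recover the random-walk, greedy, and ε-greedy policies as special cases.

```python
import numpy as np

def rrr_action(Q, s, T, rng):
    """RRR learning policy: choose the rank-(i+1) action of state s with probability T[i]."""
    q = Q[s]                              # Q values of the m actions in state s
    order = np.argsort(-q)                # order[i] = action whose rank is i+1 (ties arbitrary)
    rank = rng.choice(len(q), p=T)        # sample a rank according to T
    return int(order[rank])

m, eps = 4, 0.1                           # illustrative number of actions and epsilon
T_random_walk = np.full(m, 1.0 / m)       # T(i) = 1/m for all i
T_greedy = np.eye(m)[0]                   # T(1) = 1, T(i) = 0 otherwise
T_eps_greedy = np.full(m, eps / m)        # T(i) = eps/m for i > 1, ...
T_eps_greedy[0] += 1.0 - eps              # ... and T(1) = 1 - eps + eps/m

rng = np.random.default_rng(0)
Q = np.array([[1.0, 0.5, 0.2, -0.3]])     # a single state with four actions, for illustration
print([rrr_action(Q, 0, T, rng) for T in (T_random_walk, T_greedy, T_eps_greedy)])
```

Note that T itself stays fixed; only which action occupies each rank changes as the Q values are updated.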

3. Results

Below we prove results on the convergence of SARSA(0) under the two separate cases of GLIE and RRR learning policies.

3.1. Convergence of SARSA(0) under GLIE learning policies

To ensure the convergence of SARSA(0), we require a lookup-table representation for the Q values and infinite visits to every state-action pair, just as for Q-learning. Unlike Q-learning, however, SARSA(0) is an on-policy algorithm and, in order to achieve its convergence to optimality, we have to further assume that the learning policy becomes greedy in the limit. To state these assumptions and the resulting convergence more formally, we note first that due to the dependence on the learning policy, SARSA(0) does not directly fall under the previously published convergence theorems (Dayan & Sejnowski, 1994; Jaakkola et al., 1994; Tsitsiklis, 1994; Szepesvári & Littman, 1996). Only a slight extension is needed, however, and this is presented in the form of Lemma 1 below (extending Theorem 1 of Jaakkola et al., 1994, and Lemma 12 of Szepesvári & Littman (1996)). For clarity, we will not present the lemma in full generality.

Lemma 1. Consider a stochastic process (α_t, Δ_t, F_t), t ≥ 0, where α_t, Δ_t, F_t: X → ℝ satisfy the equations

Δ_{t+1}(x) = (1 − α_t(x)) Δ_t(x) + α_t(x) F_t(x),   x ∈ X, t = 0, 1, 2, ...

Let P_t be a sequence of increasing σ-fields such that α_0 and Δ_0 are P_0-measurable and α_t, Δ_t and F_{t−1} are P_t-measurable, t = 1, 2, ... Assume that the following hold:

1. the set X is finite;
2. 0 ≤ α_t(x) ≤ 1, Σ_t α_t(x) = ∞, Σ_t α_t²(x) < ∞ w.p.1;
3. ‖E{F_t(·) | P_t}‖_W ≤ κ ‖Δ_t‖_W + c_t, where κ ∈ [0, 1) and c_t converges to zero w.p.1;
4. Var{F_t(x) | P_t} ≤ K(1 + ‖Δ_t‖_W)², where K is some constant.

Then, Δ_t converges to zero with probability one (w.p.1).

Let us first clarify how this lemma relates to the learning algorithms that are the focus of this paper. We can capture the sequence of visited states s_t and selected actions a_t in the definition of the learning rates α_t as follows: define x_t = (s_t, a_t) and further require that α_t(x) = 0 whenever x ≠ x_t. With these definitions, the iterative process reduces to

Δ_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Δ_t(s_t, a_t) + α_t(s_t, a_t) F_t(s_t, a_t),

which resembles more closely the updates of on-line algorithms such as SARSA(0) (Eq. (3)). Also, note that the lemma shows the convergence of Δ to zero rather than to some non-zero optimal values. The intended meaning of Δ is Q_t − Q*, i.e., the difference between the current Q values, Q_t, and the target Q values, Q*, that are attained asymptotically.
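For intuition only (this is not part of the paper), here is a small numerical instance of the process in Lemma 1: F_t has conditional mean κΔ_t + c_t, which satisfies condition 3 with a vanishing c_t, plus zero-mean bounded-variance noise, and the step sizes α_t = t^(−0.6) satisfy condition 2. Running it shows ‖Δ_t‖ becoming small, consistent with the lemma; all constants are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, kappa = 5, 0.9                        # |X| = 5, contraction factor kappa < 1
delta = 10.0 * rng.normal(size=n)        # Delta_0, an arbitrary starting point

for t in range(1, 100001):
    alpha = t ** -0.6                    # sum(alpha_t) = inf, sum(alpha_t^2) < inf
    c_t = 1.0 / np.sqrt(t)               # the vanishing term c_t -> 0
    noise = rng.normal(size=n)           # zero-mean noise with bounded variance
    F = kappa * delta + c_t + noise      # so ||E[F_t | past]|| <= kappa ||Delta_t|| + c_t
    delta = (1 - alpha) * delta + alpha * F

print(np.max(np.abs(delta)))             # small for large t, illustrating Delta_t -> 0
```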

The extension provided by our formulation of the lemma is the fact that the contraction property (the third condition) need not be strict; a strict contraction is now required to hold only asymptotically. This relaxation makes the theorem more widely applicable.

Proof: While we have stated that the lemma extends previous results such as Theorem 1 of Jaakkola et al. (1994) and Lemma 12 of Szepesvári & Littman (1996), the proof of our lemma is, however, already almost fully contained in the proofs of these results (requiring only minor, largely notational changes). Moreover, the lemma also follows from Proposition 4.5 of Bertsekas (1995), and in Appendix A we present a proof based on this proposition.

We can now use Lemma 1 to show the convergence of SARSA(0).

Theorem 1. Consider a finite state-action MDP and fix a GLIE learning policy π given as a set of probabilities Pr(a | s, t, n_t(s), Q). Assume that a_t is chosen according to π and that at time step t, π uses Q = Q_t, where the Q_t values are computed by the SARSA(0) rule (see Eq. (3)). Then Q_t converges to Q* and the learning policy π_t converges to an optimal policy π*, provided that the conditions on the immediate rewards, state transitions and learning rates listed in Section 2 hold and the following additional conditions are satisfied:

1. The Q values are stored in a lookup table.
2. The learning rates satisfy 0 ≤ α_t(s, a) ≤ 1, Σ_t α_t(s, a) = ∞, Σ_t α_t²(s, a) < ∞, and α_t(s, a) = 0 unless (s, a) = (s_t, a_t).
3. Var{r(s, a)} < ∞.

Proof: The correspondence to Lemma 1 follows from associating X with the set of state-action pairs (s, a), α_t(x) with α_t(s, a), and Δ_t(s, a) with Q_t(s, a) − Q*(s, a). It follows that

Δ_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Δ_t(s_t, a_t) + α_t(s_t, a_t) F_t(s_t, a_t),

where

F_t(s_t, a_t) = r_t + γ max_{b∈A} Q_t(s_{t+1}, b) − Q*(s_t, a_t) + γ [ Q_t(s_{t+1}, a_{t+1}) − max_{b∈A} Q_t(s_{t+1}, b) ]
             ≝ r_t + γ max_{b∈A} Q_t(s_{t+1}, b) − Q*(s_t, a_t) + C_t(Q)
             ≝ F_t^Q(s_t, a_t) + C_t(s_t, a_t),

where F_t^Q would be the corresponding F_t in Lemma 1 if the algorithm under consideration were Q-learning. We define F_t(s, a) = F_t^Q(s, a) = C_t(s, a) = 0 if (s, a) ≠ (s_t, a_t) (so F_t(s, a) = F_t^Q(s, a) + C_t(s, a) for all (s, a)) and denote the σ-field generated by the
random variables {s_t, α_t, a_t, r_{t−1}, ..., s_1, α_1, a_1, Q_0} by P_t. Note that Q_t, Q_{t−1}, ..., Q_0 are P_t-measurable and, thus, both Δ_t and F_{t−1} are P_t-measurable, satisfying the measurability conditions of Lemma 1.

It is well known that for Q-learning ‖E{F_t^Q(·, ·) | P_t}‖ ≤ γ ‖Δ_t‖ for all t, where ‖·‖ is the maximum norm. In other words, the expected update operator is a contraction mapping. The only difference between the current F_t and the F_t^Q for Q-learning is the presence of C_t. Therefore,

‖E{F_t(·, ·) | P_t}‖ ≤ ‖E{F_t^Q(·, ·) | P_t}‖ + ‖E{C_t(·, ·) | P_t}‖   (4)
                     ≤ γ ‖Δ_t‖ + ‖E{C_t(·, ·) | P_t}‖.   (5)

Identifying c_t = ‖E{C_t(·, ·) | P_t}‖ in Lemma 1, we are left with showing that c_t converges to zero w.p.1. This, however, follows (a) from our assumption of a GLIE policy (i.e., that non-greedy actions are chosen with vanishing probabilities), (b) the assumption of finiteness of the MDP, and (c) the fact that Q_t(s, a) stays bounded during learning. To verify the boundedness property, we note that the SARSA(0) Q values can be upper bounded by the Q values of a Q-learning process that updates exactly the same state-action pairs in the same order as the SARSA(0) process. Similarly, the SARSA(0) Q values are lower bounded by the Q values of a Q-learning process that uses min instead of max in the update rule (c.f. Eq. (2)) and updates exactly the same state-action pairs in the same order as the SARSA(0) process. Both the lower-bounding and the upper-bounding Q-learning processes are convergent and have bounded Q values. The condition on the variance of F_t follows from the similar property of F_t^Q.

Note that if a GLIE learning policy is used with the Q-learning update rule, one gets convergence to both the optimal Q-value function and an optimal policy. This begins to address a significant outstanding question in the theory of reinforcement learning: How do you learn a policy that achieves high reward in the limit and during learning? Previous convergence results for Q-learning guarantee that the optimal Q-value function is reached in the limit; this is important because the longer the learning process goes on, the closer to optimal the greedy policy with respect to the learned Q-value function will be. However, this provides no useful guidance for selecting actions during learning. Our results, in contrast, show that it is possible to follow a policy during learning that approaches optimality over time.

The properties of GLIE policies imply that for any RL algorithm that converges to the optimal value function and whose estimates stay bounded (e.g., Q-learning, and the ARTDP of Barto et al. (1995)), using GLIE learning policies will ensure concurrent convergence to an optimal policy. However, to get an implementable RL algorithm, one still has to specify a suitable learning policy that guarantees that every action is attempted in every state infinitely often (i.e., Σ_t α_t(s, a) = ∞). In Appendix B, we prove that, if the probability of choosing any particular action in any given state sums up to infinity, then the above condition is indeed satisfied. To illustrate this, in Appendix B we derive two learning strategies that are GLIE.
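To make the setting of Theorem 1 concrete, here is a minimal tabular sketch (ours, not the paper's) of SARSA(0) driven by the GLIE ε-greedy policy ε_t(s) = c/n_t(s) from Appendix B, with step sizes α_t(s, a) = 1/n_t(s, a), which satisfy the step-size conditions when every pair is visited infinitely often. The environment interface (reset/step) and all constants are assumptions for illustration.

```python
import numpy as np

def sarsa0_glie(env, n_states, n_actions, gamma=0.9, c=0.5, episodes=5000, seed=0):
    """Tabular SARSA(0) with the GLIE epsilon-greedy policy eps_t(s) = c / n_t(s), 0 < c < 1."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    n_s = np.zeros(n_states)                   # visit counts n_t(s)
    n_sa = np.zeros((n_states, n_actions))     # visit counts n_t(s, a)

    def policy(s):
        n_s[s] += 1
        if rng.random() < c / n_s[s]:          # decaying exploration probability
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))            # greedy in the limit

    for _ in range(episodes):
        s = env.reset()                        # assumed environment interface
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = policy(s2)
            n_sa[s, a] += 1
            alpha = 1.0 / n_sa[s, a]           # sum alpha = inf, sum alpha^2 < inf
            # SARSA(0) update, Eq. (3): bootstrap on the action actually chosen next.
            target = r + (0.0 if done else gamma * Q[s2, a2])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```

The same GLIE schedule combined with the Q-learning update (Eq. (2)) gives the concurrent convergence to an optimal policy noted above.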

3.2. Convergence of SARSA(0) under RRR learning policies

This section proves two separate results concerning a class of persistent exploration learning policies: (1) the SARSA(0) update rule combined with an RRR learning policy converges to a well-defined Q-value function and policy, and (2) the resulting policy is optimal, in a sense we will define. As mentioned earlier, an RRR learning policy chooses actions probabilistically by their ranking according to the current Q-value function; a specific learning policy is specified by the function T, a probability distribution over action ranks.

A restricted policy π̄: S → (A → {1, ..., m}) ranks actions in each state (recall that m denotes the number of actions), i.e., π̄(s) is a bijection between A and {1, ..., m}. For convenience, we use the notation π̄(s, a) to denote the assigned rank of action a in state s, i.e., to denote π̄(s)(a). The mapping π̄ represents a policy in the sense that an agent following restricted policy π̄ from state s chooses action a with probability T(π̄(s, a)), the probability assigned to the rank, π̄(s, a), of action a in state s.

Consider what happens when the SARSA(0) update rule is used to learn the value of a fixed restricted policy π̄. Standard convergence results for Q-learning can easily be used to show that the Q_t values will converge to the Q-value function of π̄. Specifically, Q_t will converge to Q^π̄, defined as the unique solution to

Q^π̄(s, a) = R(s, a) + γ Σ_{s'∈S} P^a_{ss'} Σ_{a'∈A} T(π̄(s', a')) Q^π̄(s', a'),   ∀(s, a) ∈ S × A.   (6)

When an RRR learning policy is followed, the situation becomes a bit more complex. Upon entering state s, the probability that the learning policy will choose, for example, the rank-1 action is fixed at T(1); however, the identity of that action changes according to the current Q-value function estimate Q_t(·, ·). The natural extension of Eq. (6) to an RRR learning policy would be for the target of convergence of Q_t in SARSA(0) to be

Q̃(s, a) = R(s, a) + γ Σ_{s'∈S} P^a_{ss'} Σ_{a'∈A} T(ρ(Q̃, s', a')) Q̃(s', a'),   ∀(s, a) ∈ S × A.   (7)

Recall that ρ(Q̃, s', a') represents the rank of action a' according to the Q values Q̃ of state s'. The only change between Eqs. (6) and (7) is that the latter uses an assignment of ranks that is based upon the recursively defined Q-value function Q̃, whereas the former uses a fixed assignment of ranks. Using the theory of generalized MDPs (Szepesvári & Littman, 1996), we can show that this difference is not important from the perspective of proving the existence and uniqueness of the solution to Eq. (7). Define

⊗Q(s, ·) = Σ_{a∈A} T(ρ(Q, s, a)) Q(s, a);   (8)

now Eq. (7) can be rewritten

Q̃(s, a) = R(s, a) + γ Σ_{s'∈S} P^a_{ss'} ⊗Q̃(s', ·),   ∀(s, a) ∈ S × A.   (9)

As long as ⊗ satisfies the non-expansion property that |⊗Q(s, ·) − ⊗Q'(s, ·)| ≤ max_a |Q(s, a) − Q'(s, a)| for all Q-value functions Q and Q' and all states s, then Eq. (9) has a solution and it is unique (Szepesvári & Littman, 1996); this is proven in Appendix C. The non-expansion property of ⊗ can be verified by the following argument. Consider the family of operators ⊗^i Q(s, ·) = ith largest value of Q(s, ·), for each 1 ≤ i ≤ m. These are all non-expansions (see Appendix C). Define the operator that maps Q to Σ_i T(i) ⊗^i Q(s, ·); it is a non-expansion as long as every ⊗^i is and T is a fixed probability distribution (see Appendix C). It is clear that Σ_i T(i) ⊗^i Q(s, ·) = ⊗Q(s, ·) as defined in Eq. (8), so ⊗ is a non-expansion also. Therefore, Q̃ exists and is unique.

We next show that Q̃ is, in fact, the target of convergence for SARSA(0).

Theorem 2. In finite state-action MDPs, the Q_t values computed by the SARSA(0) rule (see Eq. (3)) converge to Q̃ if the learning policy is RRR, the conditions on the immediate rewards and state transitions listed in Section 2 hold, and if the following additional conditions are satisfied:

1. Pr(a_{t+1} = a | Q_t, s_{t+1}) = T(ρ(Q_t, s_{t+1}, a)).
2. The Q values are stored in a lookup table.
3. The learning rates satisfy 0 ≤ α_t(s, a) ≤ 1, Σ_t α_t(s, a) = ∞, Σ_t α_t²(s, a) < ∞, and α_t(s, a) = 0 unless (s, a) = (s_t, a_t).
4. Var{r(s, a)} < ∞.

Proof: The result readily follows from Lemma 1 (or Theorem 1 of Jaakkola et al. (1994)) and the proof follows nearly identical lines as that of Theorem 1. First, we associate X (of Lemma 1) with the set of state-action pairs (s, a) and α_t(x) with α_t(s, a), but here we set Δ_t(s, a) = Q_t(s, a) − Q̃(s, a). Again, it follows that

Δ_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Δ_t(s_t, a_t) + α_t(s_t, a_t) F_t(s_t, a_t),

where now

F_t(s_t, a_t) = r_t + γ Q_t(s_{t+1}, a_{t+1}) − Q̃(s_t, a_t).

Further, we define F_t(s, a) = C_t(s, a) = 0 if (s, a) ≠ (s_t, a_t) and denote the σ-field generated by the random variables {s_t, α_t, a_t, r_{t−1}, ..., s_1, α_1, a_1, Q_0} by P_t. Note that Q_t, Q_{t−1}, ..., Q_0 are P_t-measurable and, thus, both Δ_t and F_{t−1} are P_t-measurable, satisfying the measurability conditions of Lemma 1.

Substituting the right-hand side of Eq. (7) for Q̃(s_t, a_t) in the definition of F_t, together with the properties of sampling r_t, s_{t+1} and a_{t+1}, yields that

E{F_t(s_t, a_t) | P_t}
  = γ ( E{Q_t(s_{t+1}, a_{t+1}) | P_t} − Σ_{s'∈S} P^{a_t}_{s_t s'} Σ_{a∈A} T(ρ(Q̃, s', a)) Q̃(s', a) )
  = γ ( Σ_{s'∈S} P^{a_t}_{s_t s'} Σ_{a∈A} T(ρ(Q_t, s', a)) Q_t(s', a) − Σ_{s'∈S} P^{a_t}_{s_t s'} Σ_{a∈A} T(ρ(Q̃, s', a)) Q̃(s', a) )
  ≤ γ ‖Q_t − Q̃‖ = γ ‖Δ_t‖,

where in the first equation we have exploited the fact that E{r_t | s_t, a_t} = R(s_t, a_t), in the second equation that Pr(s_{t+1} = s' | s_t, a_t) = P^{a_t}_{s_t s'} and that Pr(a_{t+1} = a | Q_t, s_{t+1}) = T(ρ(Q_t, s_{t+1}, a)) (Condition 1), whereas the inequality comes from the properties of rank-based averaging (see Lemma 7 and Theorems 9 and 10 of Szepesvári & Littman (1996), also Appendix C). Finally, it is not hard to prove that the variance of F_t given the past P_t satisfies Condition 4 and, therefore, we do not include it here.

We have shown that SARSA(0) with an RRR learning policy converges to Q̃. Next, we show that Q̃ is, in a sense, an optimal Q-value function. An optimal restricted policy is one that has the highest expected total discounted reward of all restricted policies. The greedy restricted policy for a Q-value function Q is π̄(s, a) = ρ(Q, s, a); it assigns each action the rank of its corresponding Q value. Note that this is the policy dictated by the RRR learning policy for a fixed Q-value function Q. The greedy restricted policy for Q* (the optimal Q-value function of the MDP) is not an optimal restricted policy in general, so the Q-learning rule in Eq. (2) does not find an optimal restricted policy. However, the next theorem shows that the greedy restricted policy for Q̃ (Eq. (7)) is an optimal restricted policy. This Q̃ function is very similar to Q*, except that actions are weighted according to the greedy restricted policy instead of the standard greedy policy.
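When the model (R, P) is known, Q̃ of Eq. (7) can also be computed directly by fixed-point iteration, since the operator on the right-hand side of Eq. (9) is a γ-contraction (Appendix C). The sketch below is ours; the array shapes and tolerance are assumptions made for illustration.

```python
import numpy as np

def q_tilde(R, P, T, gamma=0.9, tol=1e-10):
    """Fixed point of Eq. (7).  R[s, a]: expected reward; P[s, a, s2]: transition
    probabilities; T[i]: probability of choosing the rank-(i+1) action."""
    Q = np.zeros_like(R)
    while True:
        # Rank-based weighting of Eq. (8): the i-th largest Q(s, .) gets weight T(i).
        sorted_desc = -np.sort(-Q, axis=1)                   # Q values of each state in rank order
        otimes = (sorted_desc * T).sum(axis=1)               # (⊗Q)(s) for every state s
        Q_new = R + gamma * np.einsum('sat,t->sa', P, otimes)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```

With T = (1, 0, ..., 0) this reduces to ordinary value iteration for Q*; with the ε-greedy T it produces the target to which SARSA(0) converges under that persistent-exploration policy (Theorem 2).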

Theorem 3. The greedy restricted policy with respect to Q̃ is an optimal restricted policy.

Proof: We construct an alternate MDP so that every restricted policy in the original MDP is in one-to-one correspondence with (and has the same value as) a deterministic stationary policy in the alternate MDP. Note that, as a result of the equality of value functions, the optimal policy of the alternate MDP will correspond to an optimal restricted policy of the original MDP (the restricted policy that achieves the best values for each of the states) and, thus, the theorem will follow if we show that the optimal policy in the alternate MDP corresponds to the greedy restricted policy with respect to Q̃.

The alternate MDP is defined by ⟨S, Ā, R̄, P̄, γ⟩. Its action space, Ā, is the set of all bijections from A to {1, ..., m}, i.e., Ā = (A → {1, ..., m}). The rewards are R̄(s, µ) = Σ_{a∈A} T(µ(a)) R(s, a), and the transition probabilities are given by P̄^µ_{ss'} = Σ_{a∈A} T(µ(a)) P^a_{ss'}. Here, µ is an element of Ā. One can readily check that the value of a restricted policy π̄ is just the value of π̄ in the alternate MDP.

The value of the greedy restricted policy with respect to Q̃ in the original MDP is

Ṽ(s) = Σ_{a∈A} T(ρ(Q̃, s, a)) Q̃(s, a).   (10)

Substituting the definition of Q̃ from Eq. (7) into Eq. (10) results in

Ṽ(s) = Σ_{a∈A} T(ρ(Q̃, s, a)) ( R(s, a) + γ Σ_{s'∈S} P^a_{ss'} Σ_{a'∈A} T(ρ(Q̃, s', a')) Q̃(s', a') ).

Using Eq. (10) once again, we find that Ṽ satisfies the recurrence equation

Ṽ(s) = Σ_{a∈A} T(ρ(Q̃, s, a)) ( R(s, a) + γ Σ_{s'∈S} P^a_{ss'} Ṽ(s') ).   (11)

Meanwhile, the optimum value of the alternate MDP satisfies

V̄*(s) = max_{µ∈Ā} ( R̄(s, µ) + γ Σ_{s'∈S} P̄^µ_{ss'} V̄*(s') )
      = max_{µ∈Ā} ( Σ_{a∈A} T(µ(a)) R(s, a) + γ Σ_{s'∈S} Σ_{a∈A} T(µ(a)) P^a_{ss'} V̄*(s') )
      = max_{µ∈Ā} Σ_{a∈A} T(µ(a)) ( R(s, a) + γ Σ_{s'∈S} P^a_{ss'} V̄*(s') ).   (12)

The highest-value permutation is the one that assigns the highest probabilities to the actions with the highest Q values and the lowest probabilities to the actions with the lowest Q values. Therefore, the recurrence in Eq. (12) is the same as that in Eq. (11), so, by uniqueness, Ṽ = V̄*. This means the greedy restricted policy with respect to Q̃ is the optimal restricted policy.

As a corollary of Theorem 2, given a communicating MDP and an RL algorithm that follows an RRR learning policy specified by a T where T(i) > 0 for all 1 ≤ i ≤ m, SARSA(0) converges to an optimal restricted policy.³

The results of this section show that RRR learning policies with the SARSA(0) update rule converge to optimal restricted policies. In contrast to Q-learning, this means that the learner can adopt its asymptotic policy at any time during learning and still converge to optimality in this modified sense. However, the fact that convergence depends on decaying the learning rate to zero means that this approach is somewhat self-contradictory; in the limit, the learner is still exploring, but it is not able to learn anything new from its discoveries.

4. Conclusion

In this paper, we have provided convergence results for SARSA(0) under two different learning policy classes; one ensures optimal behavior in the limit and the other ensures behavior optimal with respect to constraints imposed by the exploration strategy. To the best of our knowledge, these constitute the first convergence results for any on-policy algorithm. However, these are very basic results because they apply only to the lookup-table case, and more importantly because they do not seem to extend naturally to general multi-step on-policy algorithms.

Appendix A: proof of Lemma 1

For completeness we present Proposition 4.5 of Bertsekas (1995).

Lemma 2. Let

r_{t+1}(i) = (1 − γ_t(i)) r_t(i) + γ_t(i) { (H_t r_t)(i) + w_t(i) + u_t(i) },

where i = 1, 2, ..., n and t = 0, 1, 2, ..., let F_t be an increasing sequence of σ-fields, and assume the following:

1. γ_t ≥ 0, Σ_{t=0}^∞ γ_t(i) = ∞, Σ_{t=0}^∞ γ_t²(i) < ∞ (a.s.);
2. for all i, t: E[w_t(i) | F_t] = 0;
3. there exist A, B ∈ ℝ such that for all i, t: E[w_t²(i) | F_t] ≤ A + B ‖r_t‖²;
4. there exist an r* ∈ ℝⁿ, a positive vector ξ, and a scalar β ∈ [0, 1) such that for all t ≥ 0, ‖H_t r_t − r*‖_ξ ≤ β ‖r_t − r*‖_ξ;
5. there exists θ_t ≥ 0, θ_t → 0 w.p.1, such that for all i, t: |u_t(i)| ≤ θ_t (‖r_t‖_ξ + 1).

Then r_t → r* w.p.1.

For convenience, we repeat Lemma 1.

Lemma 1. Consider a stochastic process (α_t, Δ_t, F_t), t ≥ 0, where α_t, Δ_t, F_t: X → ℝ, which satisfies the equations

Δ_{t+1}(x) = (1 − α_t(x)) Δ_t(x) + α_t(x) F_t(x),   x ∈ X, t = 0, 1, 2, ...

Let P_t be a sequence of increasing σ-fields such that α_0, Δ_0 are P_0-measurable and α_t, Δ_t and F_{t−1} are P_t-measurable, t = 1, 2, ... Assume that the following hold:

1. the set of possible states X is finite;
2. 0 ≤ α_t(x) ≤ 1, Σ_t α_t(x) = ∞, Σ_t α_t²(x) < ∞ w.p.1;
3. ‖E{F_t(·) | P_t}‖_W ≤ κ ‖Δ_t‖_W + c_t, where κ ∈ [0, 1) and c_t converges to zero w.p.1;
4. Var{F_t(x) | P_t} ≤ K(1 + ‖Δ_t‖_W)², where K is some constant.

Then, Δ_t converges to zero with probability one (w.p.1).

Proof: We apply Lemma 2. For simplicity, we present the proof for the case when W = (1, 1, ..., 1). Let

F̄_t = { F_t,                        if |E[F_t | P_t]| ≤ κ ‖Δ_t‖;
      { sign(E[F_t | P_t]) κ ‖Δ_t‖,  otherwise.

Further, let b_t = F_t − F̄_t. Then, by the construction of F̄_t, |E[F̄_t | P_t]| ≤ κ ‖Δ_t‖ and |E[b_t | P_t]| ≤ c_t. Now, if we identify {1, 2, ..., n} with X, take the σ-fields F_t of Lemma 2 to be P_t, and define γ_t = α_t, r_t = Δ_t, H_t r_t = E[F̄_t | P_t], w_t = F̄_t − E[F̄_t | P_t] + b_t − E[b_t | P_t], u_t = E[b_t | P_t], and r* = 0, then we see that the conditions of Lemma 2 are satisfied and thus r_t = Δ_t converges to r* = 0 w.p.1.

Appendix B: GLIE learning policies

Here, we present conditions on the exploration parameter in the commonly used Boltzmann exploration and ε-greedy exploration strategies to ensure that both the infinite exploration and greedy in the limit conditions are satisfied.

In a communicating MDP, every state gets visited infinitely often as long as each action is chosen infinitely often in each state (this is a consequence of the Borel-Cantelli Lemma (Breiman, 1992)); all we have to ensure is that in each state each action gets chosen infinitely often in the limit. Consider some state s. Let t_s(i) represent the timestep at which the ith visit to state s occurs. Consider some action a. The probability with which action a is executed at the ith visit to state s is denoted Pr(a | s, t_s(i)) (i.e., Pr(a_t = a | s_t = s, t_s(i) = t)). We would like to show that if the sum of the probabilities with which action a is chosen is infinite, i.e., Σ_{i=1}^∞ Pr(a | s, t_s(i)) = ∞, then the number of times action a gets executed in state s is infinite w.p.1. This would follow directly from the Borel-Cantelli Lemma if the probabilities of selecting action a at the different i were independent. However, in our case the random choice of action at the ith visit to state s affects the probabilities at the (i + 1)st visit to state s (through the evolution of the Q-value function), so we need an extension of the Borel-Cantelli Lemma (c.f. Corollary 5.29 of Breiman (1992)):

Lemma 3 (Extended Borel-Cantelli Lemma). Let F_i be an increasing sequence of σ-fields and let A_i be F_i-measurable. Then

{ ω : Σ_{i=0}^∞ Pr(A_i | F_{i−1}) = ∞ } = { ω : ω ∈ A_i i.o. }

holds w.p.1.

We have the following:

Lemma 4. Consider a communicating MDP and the reinforced decision process (x_0, a_0, r_0, ..., x_t, a_t, r_t, ...). Let n_t(s) denote the number of visits to state s up to time t, n_t(s, a) denote the number of times action a has been chosen in state s during the first t timesteps (n_t(s, a) ≤ n_t(s)), and t_s(i) denote the time when state s was visited the ith time. Assume that the action at time step t, a_t, is selected purely on the basis of the statistics D_t:

Pr(a_t = a | D_t, a_{t−1}, D_{t−1}, ..., a_0, D_0) = Pr(a_t = a | D_t),   (B.1)

where D_t is computed from the full t-step history (x_0, a_0, r_0, ..., x_t). Further, assume that the action selection policy π is such that

{ ω : lim_t n_t(s)(ω) = ∞ } ⊆ { ω : Σ_{i=0}^∞ Pr( a_{t_s(i)} = a | D_{t_s(i)} )(ω) = ∞ }   a.s.   (B.2)

Then, for all (s, a) pairs, n_t(s) → ∞ a.s. and n_t(s, a) → ∞ a.s.

The statistics D_t could be, for example, (s_t, t, n_t(s), Q_t), where Q_t is computed by the SARSA(0) update rule (3).

Proof: Fix an arbitrary pair (s, a) and let F_i be the sigma field generated by the random variables {D_{t_s(i+1)}, a_{t_s(i)}, D_{t_s(i)}, ..., a_{t_s(0)}, D_{t_s(0)}}. Let A_i = {a_{t_s(i)} = a}. Then A_i is F_i-measurable. Further, by Eq. (B.1),

Pr(A_i | F_{i−1}) = Pr( a_{t_s(i)} = a | D_{t_s(i)}, a_{t_s(i−1)}, D_{t_s(i−1)}, ..., a_{t_s(0)}, D_{t_s(0)} ) = Pr( a_{t_s(i)} = a | D_{t_s(i)} ),
and thus, by Eq. (B.2) and Lemma 3, almost surely

{ ω : lim_t n_t(s)(ω) = ∞ } ⊆ { ω : Σ_{i=0}^∞ Pr( a_{t_s(i)} = a | D_{t_s(i)} )(ω) = ∞ }
                           = { ω : ω ∈ A_i for infinitely many i }
                           = { ω : lim_t n_t(s, a) = ∞ }.

This proves that if state s is visited infinitely often then action a is also chosen infinitely often in that state.

Now let S* be the set of states visited i.o. by s_t, i.e., if S*(ω) = S_0 then S_0 is the set of states which occur i.o. in the sequence {s_0(ω), s_1(ω), ..., s_t(ω), ...}. Clearly, the events {S* = S_0}, S_0 ⊆ S, form a complete event system. Thus, Σ_{S_0⊆S} P(S* = S_0) = 1. Now let S_0 be a nontrivial subset of S. Then, since the MDP is communicating, there exists a pair of states s, s' and an action a such that s ∈ S_0, s' ∉ S_0 and P^a_{ss'} > 0. Then, Pr(S* = S_0) = Pr(S* = S_0, s' ∈ S*) + Pr(S* = S_0, s' ∉ S*). Here, both events are impossible, so Pr(S* = S_0) = 0. Since the MDP is finite, also Pr(S* = ∅) = 0 and so Pr(S* = S) = 1. This yields that Pr(lim_t n_t(s) = ∞) = 1 for all s, thus finishing the proof.

B.1. Boltzmann exploration

In Boltzmann exploration,

Pr(a | s, t, Q, n_t(s)) = e^{β_t(s) Q(s,a)} / Σ_{b∈A} e^{β_t(s) Q(s,b)},

where β_t(s) is the state-specific exploration coefficient for time t. Let the number of visits to state s in t timesteps be denoted as n_t(s) and assume that r(s, a) has a finite range. We know that Σ_{i=1}^∞ c/i = ∞; therefore, to meet the conditions of Lemma 4, we will ensure that for all actions a ∈ A, Pr(a | s, t_s(i)) ≥ c/i (with c ≤ 1). To do that we need, for all a:

e^{β_t(s) Q_t(s,a)} / Σ_{b∈A} e^{β_t(s) Q_t(s,b)} ≥ c / n_t(s)
⇔ Σ_{b∈A} e^{β_t(s) Q_t(s,b)} ≤ (n_t(s)/c) e^{β_t(s) Q_t(s,a)}
⇐ n_t(s) e^{β_t(s) Q_t(s,a)} ≥ c m e^{β_t(s) Q_t(s,b_max)}   (since Σ_{b∈A} e^{β_t(s) Q_t(s,b)} ≤ m e^{β_t(s) Q_t(s,b_max)})
⇔ n_t(s) ≥ c m e^{β_t(s) (Q_t(s,b_max) − Q_t(s,a))}
⇔ ln n_t(s) − ln(cm) ≥ β_t(s) (Q_t(s, b_max) − Q_t(s, a)),

where b_max = argmax_{b∈A} Q_t(s, b) above and m is the number of actions. Further, let c = 1/m, so that ln(cm) = 0. Taken together, this means that we want β_t(s) ≤ ln n_t(s) / C_t(s), where C_t(s) =
max_a (Q_t(s, b_max) − Q_t(s, a)). Note that C_t(s) is bounded because the Q values remain bounded (since r(s, a) has a bounded range). Since for every s, lim_t n_t(s) = ∞, also

lim_t β_t(s) = lim_t ln n_t(s) / C_t(s) = ∞;

this means that Boltzmann exploration with β_t(s) = ln n_t(s) / C_t(s) will be greedy in the limit.

B.2. ε-greedy exploration

In ε-greedy exploration we pick a random exploration action with probability ε_t(s) and the greedy action with probability 1 − ε_t(s). Let ε_t(s) = c/n_t(s) with 0 < c < 1. Then, Pr(a | s, t_s(i)) ≥ ε_t(s)/m, where m is the number of actions. Therefore, Lemma 4 combined with the fact that Σ_{i=1}^∞ c/i = ∞ implies that for all s and a, Σ_{i=1}^∞ Pr(a | s, t_s(i)) = ∞. Also, by Lemma 4, for all s, lim_t n_t(s) = ∞ and, therefore, lim_t ε_t(s) = 0, ensuring that the learning policy is greedy in the limit. Therefore, if ε_t(s) = c/n_t(s), then ε-greedy exploration is GLIE for 0 < c < 1.

Appendix C: generalized Markov decision processes

In this section, we give proofs of several properties associated with generalized MDPs, which are described in more detail by Szepesvári & Littman (1996). Define the Q-value function

Q(s, a) = R(s, a) + γ Σ_{s'∈S} P^a_{ss'} ⊗Q(s', ·),   ∀(s, a) ∈ S × A.   (C.1)

Here, we assume 0 ≤ γ < 1. The important property for ⊗ to satisfy is the non-expansion property:

|⊗Q(s, ·) − ⊗Q'(s, ·)| ≤ max_a |Q(s, a) − Q'(s, a)|

for all Q-value functions Q and Q' and all states s. We begin by showing that an average over actions with a fixed set of weights satisfies the non-expansion property.

Lemma 5. The function ⊗Q(s, ·) = Σ_a p_a Q(s, a) satisfies the non-expansion property, where 0 ≤ p_a ≤ 1 and Σ_a p_a = 1.

Proof: This follows directly from definitions. If Q and Q' are Q-value functions, we have

|⊗Q(s, ·) − ⊗Q'(s, ·)| = |Σ_a p_a (Q(s, a) − Q'(s, a))| ≤ Σ_a p_a |Q(s, a) − Q'(s, a)| ≤ max_a |Q(s, a) − Q'(s, a)|.

A corollary is that a fixed-weight average of functions that satisfy the non-expansion property also satisfies the non-expansion property. We can use Lemma 5 to prove the existence and uniqueness of the Q-value function.

Lemma 6. As long as ⊗ satisfies the non-expansion property, Eq. (C.1) has a solution and it is unique.

Proof: Define the operator L on Q-value functions as

(LQ)(s, a) = R(s, a) + γ Σ_{s'∈S} P^a_{ss'} ⊗Q(s', ·),   for all (s, a) ∈ S × A.

We can rewrite Eq. (C.1) as Q(s, a) = (LQ)(s, a), which has a unique solution if L is a contraction with respect to the max norm. To see that L is a contraction, consider two Q-value functions Q and Q'. We have

‖LQ − LQ'‖ ≤ γ max_{s'} |⊗Q(s', ·) − ⊗Q'(s', ·)| < ‖Q − Q'‖,

where we have used Lemma 5, the fact that γ < 1, and the non-expansion property of ⊗.

Finally, define a family of rank-based operators: ⊗^i Q(s, ·) = ith largest value of Q(s, ·), for each 1 ≤ i ≤ m. We show that these operators satisfy the non-expansion property.

Lemma 7. The ⊗^i Q(s, ·) operators satisfy the non-expansion property.

Proof: Let Q and Q' be Q-value functions and fix s ∈ S. Without loss of generality, assume ⊗^i Q(s, ·) ≥ ⊗^i Q'(s, ·). Let a be the ith largest value of Q(s, ·): Q(s, a) = ⊗^i Q(s, ·). We examine two cases separately and show that the non-expansion property is satisfied either way. If Q'(s, a) ≤ ⊗^i Q'(s, ·), then
⊗^i Q(s, ·) − ⊗^i Q'(s, ·) = Q(s, a) − ⊗^i Q'(s, ·) ≤ Q(s, a) − Q'(s, a) ≤ max_a |Q(s, a) − Q'(s, a)|.

On the other hand, if Q'(s, a) > ⊗^i Q'(s, ·), that means that the rank of a in Q', ρ(Q', s, a), is smaller than i. This implies that there is some a' such that ρ(Q, s, a') < i and ρ(Q', s, a') ≥ i (otherwise there would be i actions with ranks less than i in Q'). For this a',

⊗^i Q(s, ·) − ⊗^i Q'(s, ·) ≤ Q(s, a') − Q'(s, a') ≤ max_a |Q(s, a) − Q'(s, a)|.

Acknowledgments

We thank Richard S. Sutton for help and encouragement. We also thank Nicolas Meuleau and the anonymous reviewers for comments and suggestions. This research was partially supported by NSF grant IIS (Satinder Singh), OTKA Grant No. F20132 (Csaba Szepesvári), Hungarian Ministry of Education Grant No. FKFP 1354/1997 (Csaba Szepesvári), and NSF CAREER grant IRI (Michael Littman).

Notes

1. The name is a reference to the fact that it is a single-step algorithm that makes updates on the basis of a State, Action, Reward, State, Action 5-tuple.
2. Here ‖·‖_W denotes a weighted maximum norm with weight W = (w_1, ..., w_n), w_i > 0: if x ∈ ℝⁿ then ‖x‖_W = max_i (|x_i| / w_i).
3. We conjecture that the same result does not hold for persistent Boltzmann exploration because related asynchronous algorithms do not have a unique target of convergence (Littman, 1996).

References

Barto, A. G., Bradtke, S. J., & Singh, S. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1).
Bellman, R. (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.
Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control (Vols. 1 and 2). Belmont, Massachusetts: Athena Scientific.

Boyan, J. A. & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems (Vol. 7). Cambridge, MA: The MIT Press.
Breiman, L. (1992). Probability. Philadelphia, Pennsylvania: Society for Industrial and Applied Mathematics.
Crites, R. H. & Barto, A. G. (1996). Improving elevator performance using reinforcement learning. Advances in neural information processing systems (Vol. 8). MIT Press.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8(3).
Dayan, P. & Sejnowski, T. J. (1994). TD(λ) converges with probability 1. Machine Learning, 14(3).
Dayan, P. & Sejnowski, T. J. (1996). Exploration bonuses and dual control. Machine Learning, 25.
Gullapalli, V. & Barto, A. G. (1994). Convergence of indirect adaptive asynchronous value iteration algorithms. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems (Vol. 6). San Mateo, CA: Morgan Kaufmann.
Jaakkola, T., Jordan, M. I., & Singh, S. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6).
John, G. H. (1994). When the best move isn't optimal: Q-learning with exploration. In Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle, WA.
John, G. H. (1995). When the best move isn't optimal: Q-learning with exploration. Unpublished manuscript, available at URL ftp://starry.stanford.edu/pub/gjohn/papers/rein-nips.ps.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4.
Kumar, P. R. & Varaiya, P. P. (1986). Stochastic Systems: Estimation, Identification, and Adaptive Control. Englewood Cliffs, NJ: Prentice Hall.
Littman, M. L. & Szepesvári, C. (1996). A generalized reinforcement-learning model: Convergence and applications. In L. Saitta (Ed.), Proceedings of the Thirteenth International Conference on Machine Learning.
Littman, M. L. (1996). Algorithms for sequential decision making. Ph.D. Thesis, Department of Computer Science, Brown University, February. Also Technical Report CS.
Puterman, M. L. (1994). Markov Decision Processes—Discrete Stochastic Dynamic Programming. New York, NY: John Wiley & Sons, Inc.
Rummery, G. A. (1994). Problem solving with reinforcement learning. Ph.D. Thesis, Cambridge University Engineering Department.
Rummery, G. A. & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.
Singh, S. & Bertsekas, D. P. (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. Advances in neural information processing systems (Vol. 9). MIT Press.
Singh, S. & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1–3).
Singh, S. & Yee, R. C. (1994). An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16.
Sutton, R. S. & Barto, A. G. (1998). An Introduction to Reinforcement Learning. The MIT Press.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1).
Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems (Vol. 8). Cambridge, MA: The MIT Press.
Szepesvári, C. & Littman, M. L. (1996). Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. Technical Report CS-96-11, Brown University, Providence, RI.
Tesauro, G. J. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3).
Thrun, S. B. (1992). The role of exploration in learning control. In D. A. White & D. A. Sofge (Eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. New York, NY: Van Nostrand Reinhold.
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3), September.
Tsitsiklis, J. N. & Van Roy, B. (1996). An analysis of temporal-difference learning with function approximation. Technical Report LIDS-P-2322, Massachusetts Institute of Technology, March. Available through URL. To appear in IEEE Transactions on Automatic Control.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. Thesis, King's College, Cambridge, UK.
Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. Machine Learning, 8(3).
Williams, R. J. & Baird, L. C., III (1993). Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-14, Northeastern University, College of Computer Science, Boston, MA.
Zhang, W. & Dietterich, T. G. (1995). High-performance job-shop scheduling with a time-delay TD(λ) network. Advances in neural information processing systems (Vol. 8). MIT Press.

Received July 9, 1997
Accepted August 12, 1999
Final manuscript August 12, 1999


Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1 Exm, Mthemtics 471, Section ETY6 6:5 pm 7:4 pm, Mrch 1, 16, IH-115 Instructor: Attil Máté 1 17 copies 1. ) Stte the usul sufficient condition for the fixed-point itertion to converge when solving the eqution

More information

Properties of Integrals, Indefinite Integrals. Goals: Definition of the Definite Integral Integral Calculations using Antiderivatives

Properties of Integrals, Indefinite Integrals. Goals: Definition of the Definite Integral Integral Calculations using Antiderivatives Block #6: Properties of Integrls, Indefinite Integrls Gols: Definition of the Definite Integrl Integrl Clcultions using Antiderivtives Properties of Integrls The Indefinite Integrl 1 Riemnn Sums - 1 Riemnn

More information

ODE: Existence and Uniqueness of a Solution

ODE: Existence and Uniqueness of a Solution Mth 22 Fll 213 Jerry Kzdn ODE: Existence nd Uniqueness of Solution The Fundmentl Theorem of Clculus tells us how to solve the ordinry differentil eqution (ODE) du = f(t) dt with initil condition u() =

More information

Chapter 0. What is the Lebesgue integral about?

Chapter 0. What is the Lebesgue integral about? Chpter 0. Wht is the Lebesgue integrl bout? The pln is to hve tutoril sheet ech week, most often on Fridy, (to be done during the clss) where you will try to get used to the ides introduced in the previous

More information

Entropy and Ergodic Theory Notes 10: Large Deviations I

Entropy and Ergodic Theory Notes 10: Large Deviations I Entropy nd Ergodic Theory Notes 10: Lrge Devitions I 1 A chnge of convention This is our first lecture on pplictions of entropy in probbility theory. In probbility theory, the convention is tht ll logrithms

More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 203 Outline Riemnn Sums Riemnn Integrls Properties Abstrct

More information

Online Supplements to Performance-Based Contracts for Outpatient Medical Services

Online Supplements to Performance-Based Contracts for Outpatient Medical Services Jing, Png nd Svin: Performnce-bsed Contrcts Article submitted to Mnufcturing & Service Opertions Mngement; mnuscript no. MSOM-11-270.R2 1 Online Supplements to Performnce-Bsed Contrcts for Outptient Medicl

More information

Lecture 14: Quadrature

Lecture 14: Quadrature Lecture 14: Qudrture This lecture is concerned with the evlution of integrls fx)dx 1) over finite intervl [, b] The integrnd fx) is ssumed to be rel-vlues nd smooth The pproximtion of n integrl by numericl

More information

An approximation to the arithmetic-geometric mean. G.J.O. Jameson, Math. Gazette 98 (2014), 85 95

An approximation to the arithmetic-geometric mean. G.J.O. Jameson, Math. Gazette 98 (2014), 85 95 An pproximtion to the rithmetic-geometric men G.J.O. Jmeson, Mth. Gzette 98 (4), 85 95 Given positive numbers > b, consider the itertion given by =, b = b nd n+ = ( n + b n ), b n+ = ( n b n ) /. At ech

More information

Recitation 3: More Applications of the Derivative

Recitation 3: More Applications of the Derivative Mth 1c TA: Pdric Brtlett Recittion 3: More Applictions of the Derivtive Week 3 Cltech 2012 1 Rndom Question Question 1 A grph consists of the following: A set V of vertices. A set E of edges where ech

More information

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior Reversls of Signl-Posterior Monotonicity for Any Bounded Prior Christopher P. Chmbers Pul J. Hely Abstrct Pul Milgrom (The Bell Journl of Economics, 12(2): 380 391) showed tht if the strict monotone likelihood

More information

20 MATHEMATICS POLYNOMIALS

20 MATHEMATICS POLYNOMIALS 0 MATHEMATICS POLYNOMIALS.1 Introduction In Clss IX, you hve studied polynomils in one vrible nd their degrees. Recll tht if p(x) is polynomil in x, the highest power of x in p(x) is clled the degree of

More information

CS 188: Artificial Intelligence Spring 2007

CS 188: Artificial Intelligence Spring 2007 CS 188: Artificil Intelligence Spring 2007 Lecture 3: Queue-Bsed Serch 1/23/2007 Srini Nrynn UC Berkeley Mny slides over the course dpted from Dn Klein, Sturt Russell or Andrew Moore Announcements Assignment

More information

8 Laplace s Method and Local Limit Theorems

8 Laplace s Method and Local Limit Theorems 8 Lplce s Method nd Locl Limit Theorems 8. Fourier Anlysis in Higher DImensions Most of the theorems of Fourier nlysis tht we hve proved hve nturl generliztions to higher dimensions, nd these cn be proved

More information

Best Approximation. Chapter The General Case

Best Approximation. Chapter The General Case Chpter 4 Best Approximtion 4.1 The Generl Cse In the previous chpter, we hve seen how n interpolting polynomil cn be used s n pproximtion to given function. We now wnt to find the best pproximtion to given

More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 2013 Outline 1 Riemnn Sums 2 Riemnn Integrls 3 Properties

More information

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams Chpter 4 Contrvrince, Covrince, nd Spcetime Digrms 4. The Components of Vector in Skewed Coordintes We hve seen in Chpter 3; figure 3.9, tht in order to show inertil motion tht is consistent with the Lorentz

More information

Jim Lambers MAT 169 Fall Semester Lecture 4 Notes

Jim Lambers MAT 169 Fall Semester Lecture 4 Notes Jim Lmbers MAT 169 Fll Semester 2009-10 Lecture 4 Notes These notes correspond to Section 8.2 in the text. Series Wht is Series? An infinte series, usully referred to simply s series, is n sum of ll of

More information

Numerical Integration

Numerical Integration Chpter 5 Numericl Integrtion Numericl integrtion is the study of how the numericl vlue of n integrl cn be found. Methods of function pproximtion discussed in Chpter??, i.e., function pproximtion vi the

More information

Near-Bayesian Exploration in Polynomial Time

Near-Bayesian Exploration in Polynomial Time J. Zico Kolter kolter@cs.stnford.edu Andrew Y. Ng ng@cs.stnford.edu Computer Science Deprtment, Stnford University, CA 94305 Abstrct We consider the explortion/exploittion problem in reinforcement lerning

More information

Sufficient condition on noise correlations for scalable quantum computing

Sufficient condition on noise correlations for scalable quantum computing Sufficient condition on noise correltions for sclble quntum computing John Presill, 2 Februry 202 Is quntum computing sclble? The ccurcy threshold theorem for quntum computtion estblishes tht sclbility

More information

UNIFORM CONVERGENCE. Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3

UNIFORM CONVERGENCE. Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3 UNIFORM CONVERGENCE Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3 Suppose f n : Ω R or f n : Ω C is sequence of rel or complex functions, nd f n f s n in some sense. Furthermore,

More information

Strong Bisimulation. Overview. References. Actions Labeled transition system Transition semantics Simulation Bisimulation

Strong Bisimulation. Overview. References. Actions Labeled transition system Transition semantics Simulation Bisimulation Strong Bisimultion Overview Actions Lbeled trnsition system Trnsition semntics Simultion Bisimultion References Robin Milner, Communiction nd Concurrency Robin Milner, Communicting nd Mobil Systems 32

More information

Acceptance Sampling by Attributes

Acceptance Sampling by Attributes Introduction Acceptnce Smpling by Attributes Acceptnce smpling is concerned with inspection nd decision mking regrding products. Three spects of smpling re importnt: o Involves rndom smpling of n entire

More information

Math 270A: Numerical Linear Algebra

Math 270A: Numerical Linear Algebra Mth 70A: Numericl Liner Algebr Instructor: Michel Holst Fll Qurter 014 Homework Assignment #3 Due Give to TA t lest few dys before finl if you wnt feedbck. Exercise 3.1. (The Bsic Liner Method for Liner

More information

Research Article Moment Inequalities and Complete Moment Convergence

Research Article Moment Inequalities and Complete Moment Convergence Hindwi Publishing Corportion Journl of Inequlities nd Applictions Volume 2009, Article ID 271265, 14 pges doi:10.1155/2009/271265 Reserch Article Moment Inequlities nd Complete Moment Convergence Soo Hk

More information

Non-Linear & Logistic Regression

Non-Linear & Logistic Regression Non-Liner & Logistic Regression If the sttistics re boring, then you've got the wrong numbers. Edwrd R. Tufte (Sttistics Professor, Yle University) Regression Anlyses When do we use these? PART 1: find

More information

Applicable Analysis and Discrete Mathematics available online at

Applicable Analysis and Discrete Mathematics available online at Applicble Anlysis nd Discrete Mthemtics vilble online t http://pefmth.etf.rs Appl. Anl. Discrete Mth. 4 (2010), 23 31. doi:10.2298/aadm100201012k NUMERICAL ANALYSIS MEETS NUMBER THEORY: USING ROOTFINDING

More information

UNIT 1 FUNCTIONS AND THEIR INVERSES Lesson 1.4: Logarithmic Functions as Inverses Instruction

UNIT 1 FUNCTIONS AND THEIR INVERSES Lesson 1.4: Logarithmic Functions as Inverses Instruction Lesson : Logrithmic Functions s Inverses Prerequisite Skills This lesson requires the use of the following skills: determining the dependent nd independent vribles in n exponentil function bsed on dt from

More information

Applying Q-Learning to Flappy Bird

Applying Q-Learning to Flappy Bird Applying Q-Lerning to Flppy Bird Moritz Ebeling-Rump, Mnfred Ko, Zchry Hervieux-Moore Abstrct The field of mchine lerning is n interesting nd reltively new re of reserch in rtificil intelligence. In this

More information

Reinforcement learning

Reinforcement learning Reinforcement lerning Regulr MDP Given: Trnition model P Rewrd function R Find: Policy π Reinforcement lerning Trnition model nd rewrd function initilly unknown Still need to find the right policy Lern

More information

LECTURE NOTE #12 PROF. ALAN YUILLE

LECTURE NOTE #12 PROF. ALAN YUILLE LECTURE NOTE #12 PROF. ALAN YUILLE 1. Clustering, K-mens, nd EM Tsk: set of unlbeled dt D = {x 1,..., x n } Decompose into clsses w 1,..., w M where M is unknown. Lern clss models p(x w)) Discovery of

More information

Euler, Ioachimescu and the trapezium rule. G.J.O. Jameson (Math. Gazette 96 (2012), )

Euler, Ioachimescu and the trapezium rule. G.J.O. Jameson (Math. Gazette 96 (2012), ) Euler, Iochimescu nd the trpezium rule G.J.O. Jmeson (Mth. Gzette 96 (0), 36 4) The following results were estblished in recent Gzette rticle [, Theorems, 3, 4]. Given > 0 nd 0 < s

More information

Multi-Armed Bandits: Non-adaptive and Adaptive Sampling

Multi-Armed Bandits: Non-adaptive and Adaptive Sampling CSE 547/Stt 548: Mchine Lerning for Big Dt Lecture Multi-Armed Bndits: Non-dptive nd Adptive Smpling Instructor: Shm Kkde 1 The (stochstic) multi-rmed bndit problem The bsic prdigm is s follows: K Independent

More information

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as Improper Integrls Two different types of integrls cn qulify s improper. The first type of improper integrl (which we will refer to s Type I) involves evluting n integrl over n infinite region. In the grph

More information

A Fast and Reliable Policy Improvement Algorithm

A Fast and Reliable Policy Improvement Algorithm A Fst nd Relible Policy Improvement Algorithm Ysin Abbsi-Ydkori Peter L. Brtlett Stephen J. Wright Queenslnd University of Technology UC Berkeley nd QUT University of Wisconsin-Mdison Abstrct We introduce

More information

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a).

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a). The Fundmentl Theorems of Clculus Mth 4, Section 0, Spring 009 We now know enough bout definite integrls to give precise formultions of the Fundmentl Theorems of Clculus. We will lso look t some bsic emples

More information

Numerical integration

Numerical integration 2 Numericl integrtion This is pge i Printer: Opque this 2. Introduction Numericl integrtion is problem tht is prt of mny problems in the economics nd econometrics literture. The orgniztion of this chpter

More information

CS667 Lecture 6: Monte Carlo Integration 02/10/05

CS667 Lecture 6: Monte Carlo Integration 02/10/05 CS667 Lecture 6: Monte Crlo Integrtion 02/10/05 Venkt Krishnrj Lecturer: Steve Mrschner 1 Ide The min ide of Monte Crlo Integrtion is tht we cn estimte the vlue of n integrl by looking t lrge number of

More information

Decomposition of terms in Lucas sequences

Decomposition of terms in Lucas sequences Journl of Logic & Anlysis 1:4 009 1 3 ISSN 1759-9008 1 Decomposition of terms in Lucs sequences ABDELMADJID BOUDAOUD Let P, Q be non-zero integers such tht D = P 4Q is different from zero. The sequences

More information

( dg. ) 2 dt. + dt. dt j + dh. + dt. r(t) dt. Comparing this equation with the one listed above for the length of see that

( dg. ) 2 dt. + dt. dt j + dh. + dt. r(t) dt. Comparing this equation with the one listed above for the length of see that Arc Length of Curves in Three Dimensionl Spce If the vector function r(t) f(t) i + g(t) j + h(t) k trces out the curve C s t vries, we cn mesure distnces long C using formul nerly identicl to one tht we

More information

and that at t = 0 the object is at position 5. Find the position of the object at t = 2.

and that at t = 0 the object is at position 5. Find the position of the object at t = 2. 7.2 The Fundmentl Theorem of Clculus 49 re mny, mny problems tht pper much different on the surfce but tht turn out to be the sme s these problems, in the sense tht when we try to pproimte solutions we

More information

13: Diffusion in 2 Energy Groups

13: Diffusion in 2 Energy Groups 3: Diffusion in Energy Groups B. Rouben McMster University Course EP 4D3/6D3 Nucler Rector Anlysis (Rector Physics) 5 Sept.-Dec. 5 September Contents We study the diffusion eqution in two energy groups

More information

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007 A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H Thoms Shores Deprtment of Mthemtics University of Nebrsk Spring 2007 Contents Rtes of Chnge nd Derivtives 1 Dierentils 4 Are nd Integrls 5 Multivrite Clculus

More information

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite Unit #8 : The Integrl Gols: Determine how to clculte the re described by function. Define the definite integrl. Eplore the reltionship between the definite integrl nd re. Eplore wys to estimte the definite

More information

Solution for Assignment 1 : Intro to Probability and Statistics, PAC learning

Solution for Assignment 1 : Intro to Probability and Statistics, PAC learning Solution for Assignment 1 : Intro to Probbility nd Sttistics, PAC lerning 10-701/15-781: Mchine Lerning (Fll 004) Due: Sept. 30th 004, Thursdy, Strt of clss Question 1. Bsic Probbility ( 18 pts) 1.1 (

More information

Intuitionistic Fuzzy Lattices and Intuitionistic Fuzzy Boolean Algebras

Intuitionistic Fuzzy Lattices and Intuitionistic Fuzzy Boolean Algebras Intuitionistic Fuzzy Lttices nd Intuitionistic Fuzzy oolen Algebrs.K. Tripthy #1, M.K. Stpthy *2 nd P.K.Choudhury ##3 # School of Computing Science nd Engineering VIT University Vellore-632014, TN, Indi

More information

f(x) dx, If one of these two conditions is not met, we call the integral improper. Our usual definition for the value for the definite integral

f(x) dx, If one of these two conditions is not met, we call the integral improper. Our usual definition for the value for the definite integral Improper Integrls Every time tht we hve evluted definite integrl such s f(x) dx, we hve mde two implicit ssumptions bout the integrl:. The intervl [, b] is finite, nd. f(x) is continuous on [, b]. If one

More information

Riemann is the Mann! (But Lebesgue may besgue to differ.)

Riemann is the Mann! (But Lebesgue may besgue to differ.) Riemnn is the Mnn! (But Lebesgue my besgue to differ.) Leo Livshits My 2, 2008 1 For finite intervls in R We hve seen in clss tht every continuous function f : [, b] R hs the property tht for every ɛ >

More information

Chapter 14. Matrix Representations of Linear Transformations

Chapter 14. Matrix Representations of Linear Transformations Chpter 4 Mtrix Representtions of Liner Trnsformtions When considering the Het Stte Evolution, we found tht we could describe this process using multipliction by mtrix. This ws nice becuse computers cn

More information

SUMMER KNOWHOW STUDY AND LEARNING CENTRE

SUMMER KNOWHOW STUDY AND LEARNING CENTRE SUMMER KNOWHOW STUDY AND LEARNING CENTRE Indices & Logrithms 2 Contents Indices.2 Frctionl Indices.4 Logrithms 6 Exponentil equtions. Simplifying Surds 13 Opertions on Surds..16 Scientific Nottion..18

More information

Bernoulli Numbers Jeff Morton

Bernoulli Numbers Jeff Morton Bernoulli Numbers Jeff Morton. We re interested in the opertor e t k d k t k, which is to sy k tk. Applying this to some function f E to get e t f d k k tk d k f f + d k k tk dk f, we note tht since f

More information

Convergence of Fourier Series and Fejer s Theorem. Lee Ricketson

Convergence of Fourier Series and Fejer s Theorem. Lee Ricketson Convergence of Fourier Series nd Fejer s Theorem Lee Ricketson My, 006 Abstrct This pper will ddress the Fourier Series of functions with rbitrry period. We will derive forms of the Dirichlet nd Fejer

More information

Tests for the Ratio of Two Poisson Rates

Tests for the Ratio of Two Poisson Rates Chpter 437 Tests for the Rtio of Two Poisson Rtes Introduction The Poisson probbility lw gives the probbility distribution of the number of events occurring in specified intervl of time or spce. The Poisson

More information

7.2 The Definite Integral

7.2 The Definite Integral 7.2 The Definite Integrl the definite integrl In the previous section, it ws found tht if function f is continuous nd nonnegtive, then the re under the grph of f on [, b] is given by F (b) F (), where

More information

Natural examples of rings are the ring of integers, a ring of polynomials in one variable, the ring

Natural examples of rings are the ring of integers, a ring of polynomials in one variable, the ring More generlly, we define ring to be non-empty set R hving two binry opertions (we ll think of these s ddition nd multipliction) which is n Abelin group under + (we ll denote the dditive identity by 0),

More information

Czechoslovak Mathematical Journal, 55 (130) (2005), , Abbotsford. 1. Introduction

Czechoslovak Mathematical Journal, 55 (130) (2005), , Abbotsford. 1. Introduction Czechoslovk Mthemticl Journl, 55 (130) (2005), 933 940 ESTIMATES OF THE REMAINDER IN TAYLOR S THEOREM USING THE HENSTOCK-KURZWEIL INTEGRAL, Abbotsford (Received Jnury 22, 2003) Abstrct. When rel-vlued

More information

MAA 4212 Improper Integrals

MAA 4212 Improper Integrals Notes by Dvid Groisser, Copyright c 1995; revised 2002, 2009, 2014 MAA 4212 Improper Integrls The Riemnn integrl, while perfectly well-defined, is too restrictive for mny purposes; there re functions which

More information

1.9 C 2 inner variations

1.9 C 2 inner variations 46 CHAPTER 1. INDIRECT METHODS 1.9 C 2 inner vritions So fr, we hve restricted ttention to liner vritions. These re vritions of the form vx; ǫ = ux + ǫφx where φ is in some liner perturbtion clss P, for

More information

On the degree of regularity of generalized van der Waerden triples

On the degree of regularity of generalized van der Waerden triples On the degree of regulrity of generlized vn der Werden triples Jcob Fox Msschusetts Institute of Technology, Cmbridge, MA 02139, USA Rdoš Rdoičić Deprtment of Mthemtics, Rutgers, The Stte University of

More information

REPRESENTATION THEORY OF PSL 2 (q)

REPRESENTATION THEORY OF PSL 2 (q) REPRESENTATION THEORY OF PSL (q) YAQIAO LI Following re notes from book [1]. The im is to show the qusirndomness of PSL (q), i.e., the group hs no low dimensionl representtion. 1. Representtion Theory

More information

New data structures to reduce data size and search time

New data structures to reduce data size and search time New dt structures to reduce dt size nd serch time Tsuneo Kuwbr Deprtment of Informtion Sciences, Fculty of Science, Kngw University, Hirtsuk-shi, Jpn FIT2018 1D-1, No2, pp1-4 Copyright (c)2018 by The Institute

More information

Cf. Linn Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Series in Probability & Statistics, 1999.

Cf. Linn Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Series in Probability & Statistics, 1999. Cf. Linn Sennott, Stochstic Dynmic Progrmming nd the Control of Queueing Systems, Wiley Series in Probbility & Sttistics, 1999. D.L.Bricker, 2001 Dept of Industril Engineering The University of Iow MDP

More information

Math 61CM - Solutions to homework 9

Math 61CM - Solutions to homework 9 Mth 61CM - Solutions to homework 9 Cédric De Groote November 30 th, 2018 Problem 1: Recll tht the left limit of function f t point c is defined s follows: lim f(x) = l x c if for ny > 0 there exists δ

More information

Coalgebra, Lecture 15: Equations for Deterministic Automata

Coalgebra, Lecture 15: Equations for Deterministic Automata Colger, Lecture 15: Equtions for Deterministic Automt Julin Slmnc (nd Jurrin Rot) Decemer 19, 2016 In this lecture, we will study the concept of equtions for deterministic utomt. The notes re self contined

More information

arxiv:math/ v2 [math.ho] 16 Dec 2003

arxiv:math/ v2 [math.ho] 16 Dec 2003 rxiv:mth/0312293v2 [mth.ho] 16 Dec 2003 Clssicl Lebesgue Integrtion Theorems for the Riemnn Integrl Josh Isrlowitz 244 Ridge Rd. Rutherford, NJ 07070 jbi2@njit.edu Februry 1, 2008 Abstrct In this pper,

More information

Student Activity 3: Single Factor ANOVA

Student Activity 3: Single Factor ANOVA MATH 40 Student Activity 3: Single Fctor ANOVA Some Bsic Concepts In designed experiment, two or more tretments, or combintions of tretments, is pplied to experimentl units The number of tretments, whether

More information

The steps of the hypothesis test

The steps of the hypothesis test ttisticl Methods I (EXT 7005) Pge 78 Mosquito species Time of dy A B C Mid morning 0.0088 5.4900 5.5000 Mid Afternoon.3400 0.0300 0.8700 Dusk 0.600 5.400 3.000 The Chi squre test sttistic is the sum of

More information

Estimation of Binomial Distribution in the Light of Future Data

Estimation of Binomial Distribution in the Light of Future Data British Journl of Mthemtics & Computer Science 102: 1-7, 2015, Article no.bjmcs.19191 ISSN: 2231-0851 SCIENCEDOMAIN interntionl www.sciencedomin.org Estimtion of Binomil Distribution in the Light of Future

More information

Lecture Note 9: Orthogonal Reduction

Lecture Note 9: Orthogonal Reduction MATH : Computtionl Methods of Liner Algebr 1 The Row Echelon Form Lecture Note 9: Orthogonl Reduction Our trget is to solve the norml eution: Xinyi Zeng Deprtment of Mthemticl Sciences, UTEP A t Ax = A

More information