A Generalized Reinforcement-Learning Model: Convergence and Applications

URL: ftp://iserv.iki.kfki.hu/pub/papers/icml96/szepes.greinf.ps.Z

Michael L. Littman
Department of Computer Science
Brown University
Providence, RI, USA
mlittman@cs.brown.edu

Csaba Szepesvári*
Research Group of Artificial Intelligence
"Jozsef Attila" University, Szeged
Szeged 6720, Aradi vértanúk tere 1., HUNGARY
szepes@math.u-szeged.hu

* Also: Department of Adaptive Systems, Joint Department of the "Jozsef Attila" University, Szeged, and the Institute of Isotopes of the Hungarian Academy of Sciences, Budapest 1525, P.O. Box 77, HUNGARY.

Abstract

Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (MDP) model is a popular way of formalizing the reinforcement-learning problem, but it is by no means the only way. In this paper, we show how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games and MDPs under a worst-case optimality criterion as special cases. The basis of this extension is a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.

1 INTRODUCTION

Reinforcement learning is the process by which an agent improves its behavior in an environment via experience. A reinforcement-learning scenario is defined by the experience presented to the agent at each step, and by the criterion for evaluating the agent's behavior. One particularly well-studied reinforcement-learning scenario is that of a single agent maximizing expected discounted total reward in a finite-state environment; in this scenario experiences are of the form $\langle x, a, y, r \rangle$, with $x$ a state, $a$ an action, $y$ the resulting state, and $r$ the agent's scalar immediate reward. A discount parameter $0 \le \gamma < 1$ controls the degree to which future rewards are significant compared to immediate rewards.

The theory of Markov decision processes has been used as a theoretical foundation for important results concerning this reinforcement-learning scenario. A (finite) Markov decision process (MDP) is defined by the tuple $\langle S, A, P, R \rangle$, where $S$ is a finite set of states, $A$ a finite set of actions, $P$ a transition function, and $R$ a reward function. The optimal behavior for an agent in an MDP depends on the optimality criterion; for the infinite-horizon expected discounted total-reward criterion, the optimal behavior can be found by identifying the optimal value function, defined recursively by

$$V^*(x) = \max_{a \in A} \Big( R(x,a) + \gamma \sum_{y \in S} P(x,a,y)\, V^*(y) \Big)$$

for all states $x \in S$, where $R(x,a)$ is the immediate reward for taking action $a$ from state $x$, $\gamma$ the discount factor, and $P(x,a,y)$ the probability that state $y$ is reached from state $x$ when action $a \in A$ is chosen. These simultaneous equations, known as the Bellman equations, can be solved using a variety of techniques ranging from successive approximation to linear programming (Puterman, 1994).

In the absence of complete information regarding the transition and reward functions, reinforcement-learning methods can be used to find optimal value functions. Researchers have explored model-free (direct) methods, such as Q-learning (Watkins and Dayan, 1992), and model-based (indirect) methods, such as prioritized sweeping (Moore and Atkeson, 1993), and many converge to optimal value functions under the proper conditions (Tsitsiklis, 1994; Jaakkola et al., 1994; Gullapalli and Barto, 1994).
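To make the successive-approximation approach concrete, here is a minimal sketch (an illustration, not from the original paper) of value iteration on a small tabular MDP; the array layout and stopping tolerance are assumptions made for this example.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Successive approximation to the optimal value function V*.

    P: transition probabilities, shape (S, A, S), P[x, a, y].
    R: immediate rewards, shape (S, A), R[x, a].
    gamma: discount factor with 0 <= gamma < 1.
    """
    V = np.zeros(R.shape[0])
    while True:
        # Bellman backup: Q(x, a) = R(x, a) + gamma * sum_y P(x, a, y) V(y)
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)          # maximize over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Because the backup is a contraction with index $\gamma$, the iterates converge geometrically to the unique fixed point $V^*$; this is the sense in which "successive approximation" solves the Bellman equations.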

Not all reinforcement-learning scenarios of interest can be modeled as MDPs. For example, a great deal of reinforcement-learning research has been directed to the problem of solving two-player games (e.g., Tesauro, 1995), and the reinforcement-learning algorithms for solving MDPs and their convergence proofs do not apply directly to games. In one form of two-player game, experiences are of the form $\langle x, a, y, r \rangle$, where states $x$ and $y$ contain additional information concerning which player (maximizer or minimizer) gets to choose the action in that state. There are deep similarities between MDPs and this type of game; for example, it is possible to define a set of Bellman equations for the optimal minimax value of a two-player zero-sum game,

$$V^*(x) = \begin{cases} \max_{a \in A} \big( R(x,a) + \gamma \sum_y P(x,a,y)\, V^*(y) \big), & \text{if the maximizer moves in } x \\ \min_{a \in A} \big( R(x,a) + \gamma \sum_y P(x,a,y)\, V^*(y) \big), & \text{if the minimizer moves in } x, \end{cases}$$

where $R(x,a)$ is the reward to the maximizing player. When $0 \le \gamma < 1$, these equations have a unique solution and can be solved by successive-approximation methods. In addition, we show that simple extensions of several reinforcement-learning algorithms for MDPs converge to optimal value functions in these games.
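As an illustration of the minimax Bellman equations above, the following sketch (not part of the paper; the data layout is an assumption) performs one successive-approximation backup for an alternating Markov game, maximizing in the maximizer's states and minimizing in the minimizer's states.

```python
import numpy as np

def alternating_game_backup(P, R, gamma, V, is_max_state):
    """One application of the Bellman operator for an alternating Markov game.

    P: shape (S, A, S), transition probabilities P[x, a, y].
    R: shape (S, A), reward to the maximizing player for action a in state x.
    V: shape (S,), current value estimate.
    is_max_state: boolean array of shape (S,); True where the maximizer moves.
    """
    Q = R + gamma * (P @ V)                      # shape (S, A)
    # Maximizer states take the max over actions, minimizer states the min.
    return np.where(is_max_state, Q.max(axis=1), Q.min(axis=1))
```

Iterating this backup plays the same role as value iteration does for MDPs: the operator is a contraction for $\gamma < 1$, so repeated application converges to the unique minimax value function.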
In this paper, we introduce a generalized Markov decision process model with applications to reinforcement learning, and list some important results concerning the model. Generalized MDPs provide a foundation for the use of reinforcement learning in MDPs and games, as well as in risk-sensitive reinforcement learning (Heger, 1994), exploration-sensitive reinforcement learning (John, 1995), and reinforcement learning in simultaneous-action games (Littman, 1994). Our main theorem addresses conditions for the convergence of asynchronous stochastic processes and shows how these conditions relate to conditions for convergence of a corresponding synchronous process; it can be used to prove the convergence of model-free and model-based reinforcement-learning algorithms under a variety of reinforcement-learning scenarios.

In Section 2, we present generalized MDPs and motivate their form via two detailed examples. In Section 3, we describe a stochastic-approximation theorem, and in Section 4 we show several applications of the theorem that prove the convergence of learning processes in generalized MDPs.

2 THE GENERALIZED MODEL

In this section, we introduce our generalized MDP model. We begin by summarizing some of the more significant results regarding the standard MDP model and some important results for two-player games.

2.1 MARKOV DECISION PROCESSES

To provide a point of departure for our generalization of Markov decision processes, we first describe the use of reinforcement learning in MDPs; proofs of the unattributed claims can be found in Puterman's (1994) MDP book.

The ultimate target of learning is an optimal policy. A policy is some function that tells the agent which actions should be chosen under which circumstances. A policy is optimal under the expected discounted total reward criterion if, with respect to the space of all possible policies, it maximizes the expected discounted total reward from all states. Directly maximizing over the space of all possible policies is impractical. However, MDPs have an important property that makes it unnecessary to consider such a broad space of possibilities. We say a policy is stationary and deterministic if it maps directly from states to actions, ignoring everything else, and we write $\pi(x)$ as the action chosen by $\pi$ when the current state is $x$. In expected discounted total reward MDP environments, there is always a stationary deterministic policy that is optimal; we will use the word "policy" to mean stationary deterministic policy, unless otherwise stated.

The value function for a policy $\pi$, $V^\pi$, maps states to their expected discounted total reward under policy $\pi$. It can be defined by the simultaneous equations

$$V^\pi(x) = R(x, \pi(x)) + \gamma \sum_y P(x, \pi(x), y)\, V^\pi(y)$$

for all $x \in S$. The optimal value function $V^*$ is the value function of an optimal policy; it is unique for $0 \le \gamma < 1$. The myopic policy with respect to a value function $V$ is the policy $\pi_V$ such that

$$\pi_V(x) = \arg\max_a \Big( R(x,a) + \gamma \sum_y P(x,a,y)\, V(y) \Big).$$

A myopic policy with respect to the optimal value function is optimal.
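The myopic (greedy) policy can be read off directly from a one-step backup; a minimal sketch under the same assumed array layout as above:

```python
import numpy as np

def myopic_policy(P, R, gamma, V):
    """Return the myopic (greedy) policy pi_V with respect to V.

    pi_V(x) = argmax_a [ R(x, a) + gamma * sum_y P(x, a, y) V(y) ].
    """
    Q = R + gamma * (P @ V)
    return Q.argmax(axis=1)        # one action index per state
```

Applied to the optimal value function $V^*$, this recovers an optimal policy, which is what makes computing $V^*$ sufficient for control.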

The Bellman equations can be operationalized in the form of the dynamic-programming operator $T$, which maps value functions to value functions:

$$[TV](x) = \max_a \Big( R(x,a) + \gamma \sum_y P(x,a,y)\, V(y) \Big).$$

For $0 \le \gamma < 1$, successive applications of $T$ to a value function bring it closer and closer to the optimal value function $V^*$, which is the unique fixed point of $T$: $V^* = TV^*$.

In reinforcement learning, $R$ and $P$ are not known in advance. In model-based reinforcement learning, $R$ and $P$ are estimated on-line, and the value function is updated according to the approximate dynamic-programming operator derived from these estimates; this algorithm converges to the optimal value function under a wide variety of choices of the order in which states are updated (Gullapalli and Barto, 1994).

The method of Q-learning (Watkins and Dayan, 1992) uses experience to estimate the optimal value function without ever explicitly approximating $R$ and $P$. The algorithm estimates the optimal Q function

$$Q^*(x,a) = R(x,a) + \gamma \sum_y P(x,a,y)\, V^*(y),$$

from which the optimal value function can be computed via $V^*(x) = \max_a Q^*(x,a)$. Given the experience at step $t$, $\langle x_t, a_t, y_t, r_t \rangle$, and the current estimate $Q_t(x,a)$ of the optimal Q function, Q-learning updates

$$Q_{t+1}(x_t, a_t) := (1 - \alpha_t(x_t, a_t))\, Q_t(x_t, a_t) + \alpha_t(x_t, a_t) \big( r_t + \gamma \max_a Q_t(y_t, a) \big),$$

where $0 \le \alpha_t(x,a) \le 1$ is a time-dependent learning rate controlling the blending of new estimates with old estimates for each state-action pair. The estimated Q function converges to $Q^*$ under the proper conditions (Watkins and Dayan, 1992).
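A minimal sketch of the tabular Q-learning update just described (an illustration only, not the paper's code; the table layout and learning-rate handling are assumptions):

```python
import numpy as np

def q_learning_update(Q, x, a, y, r, gamma, alpha):
    """One Q-learning step for the experience <x, a, y, r>.

    Q: array of shape (S, A) holding the current estimate of Q*.
    alpha: learning rate alpha_t(x, a) in [0, 1] for this state-action pair.
    """
    target = r + gamma * Q[y].max()            # r_t + gamma * max_a' Q_t(y_t, a')
    Q[x, a] = (1.0 - alpha) * Q[x, a] + alpha * target
    return Q
```

Under the usual Robbins-Monro conditions on the learning rates (every state-action pair updated infinitely often, rates summing to infinity but square-summable), the table converges to $Q^*$, as discussed in Section 4.1.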
2.2 ALTERNATING MARKOV GAMES

In alternating Markov games, two players take turns issuing actions to maximize their expected discounted total reward. The model is defined by the tuple $\langle S_1, S_2, A, B, P, R \rangle$, where $S_1$ is the set of states in which player 1 issues actions from the set $A$, $S_2$ is the set of states in which player 2 issues actions from the set $B$, $P$ is the transition function, and $R$ is the reward function for player 1. In the zero-sum games we consider, the rewards to player 2 (the minimizer) are simply the additive inverse of the rewards to player 1 (the maximizer). Markov decision processes are a special case of alternating Markov games in which $S_2 = \emptyset$. Condon (1992) proves this and the other unattributed results in this section.

A popular optimality criterion for alternating Markov games is discounted minimax optimality. Under this criterion, the maximizer chooses actions to maximize its reward against the minimizer's best possible counter-policy. A pair of policies is in equilibrium if neither player has an incentive to change policies if the other player's policy remains fixed. The value function for a pair of equilibrium policies is the optimal value function for the game; it is unique when $0 \le \gamma < 1$, and can be found by successive approximation. For both players, there is always a deterministic stationary optimal policy. A myopic policy with respect to the optimal value function is optimal.

Dynamic-programming operators, Bellman equations, and reinforcement-learning algorithms can be defined for alternating Markov games by starting with the definitions used in MDPs and changing the maximum operators to either maximums or minimums conditioned on the state. We show below that the resulting algorithms share their convergence properties with the analogous algorithms for MDPs.

2.3 GENERALIZED MDPS

In alternating Markov games and MDPs, optimal behavior can be specified by the Bellman equations; a myopic policy with respect to the optimal value function is optimal. In this section, we generalize the Bellman equations to define optimal behavior for a broad class of reinforcement-learning models. The objective criterion used in these models is additive in that the value of a policy is some measure of the total reward received.

The generalized Bellman equations can be written

$$V^*(x) = \bigotimes \Big( R(x,a) + \gamma \bigoplus_{x,a} V^* \Big). \qquad (1)$$

Here "$\bigotimes$" is an operator that summarizes values over actions as a function of the state, and "$\bigoplus_{x,a}$" is an operator that summarizes values over next states as a function of the state and action. For Markov decision processes, $\bigotimes f(x, \cdot) = \max_a f(x,a)$ and $\bigoplus_{x,a} g = \sum_y P(x,a,y)\, g(y)$. For alternating Markov games, $\bigoplus_{x,a}$ is the same and $\bigotimes f(x, \cdot) = \max_a f(x,a)$ or $\min_a f(x,a)$ depending on whether $x$ is in $S_1$ or $S_2$. Many models can be represented in this framework; see Section 4.
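To make the role of the two summary operators concrete, here is a small sketch (an illustration only; the function signatures are assumptions) of one generalized backup in which both $\bigotimes$ and $\bigoplus$ are supplied as arguments:

```python
import numpy as np

def generalized_backup(R, gamma, V, next_state_summary, action_summary):
    """One application of the generalized Bellman operator (Equation 1).

    next_state_summary(x, a, V) implements the next-state summary, e.g. an
    expectation over next states for MDPs or a worst case for risk-sensitive
    models. action_summary(x, values) implements the summary over actions,
    e.g. max for MDPs or min in the minimizer's states of a game.
    """
    S, A = R.shape
    V_new = np.empty(S)
    for x in range(S):
        values = np.array([R[x, a] + gamma * next_state_summary(x, a, V)
                           for a in range(A)])
        V_new[x] = action_summary(x, values)
    return V_new

# Example instantiations (assuming a transition array P of shape (S, A, S)):
# MDP:            next_state_summary = lambda x, a, V: P[x, a] @ V
#                 action_summary     = lambda x, vals: vals.max()
# Risk-sensitive: next_state_summary = lambda x, a, V: V[P[x, a] > 0].min()
```

Any choice of non-expansion operators plugged in here yields a contraction for $\gamma < 1$, which is what makes the generalized model tractable.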

From a reinforcement-learning perspective, the value functions defined by the generalized MDP model can be interpreted as the total value of the rewards received by an agent selecting actions in a stochastic environment. The agent begins in state $x$, takes action $a$, and ends up in state $y$. The $\bigoplus_{x,a}$ operator defines how the value of the next state should be used in assigning value to the current state. The $\bigotimes$ operator defines how an optimal agent should choose actions.

When $0 \le \gamma < 1$ and $\bigotimes$ and $\bigoplus_{x,a}$ are non-expansions, the generalized Bellman equations have a unique optimal solution, and therefore, the optimal value function is well defined. The $\bigotimes$ operator is a non-expansion if

$$\Big| \bigotimes f_1(x, \cdot) - \bigotimes f_2(x, \cdot) \Big| \le \max_a \big| f_1(x,a) - f_2(x,a) \big|$$

for all $f_1$, $f_2$, and $x$. An analogous condition defines when $\bigoplus_{x,a}$ is a non-expansion. Many natural operators are non-expansions, such as max, min, midpoint, median, mean, and fixed weighted averages of these operations. Several previously described reinforcement-learning scenarios are special cases of this generalized MDP model, including computing the expected return of a fixed policy (Sutton, 1988), finding the optimal risk-averse policy (Heger, 1994), and finding the optimal exploration-sensitive policy (John, 1995).

As with MDPs, we can define a dynamic-programming operator

$$[TV](x) = \bigotimes \Big( R(x,a) + \gamma \bigoplus_{x,a} V \Big). \qquad (2)$$

The operator $T$ is a contraction mapping for $0 \le \gamma < 1$. This means

$$\sup_x \big| [TV_1](x) - [TV_2](x) \big| \le \gamma \sup_x \big| V_1(x) - V_2(x) \big|,$$

where $V_1$ and $V_2$ are arbitrary value functions and $0 \le \gamma < 1$ is the index of contraction.

We can define a notion of stationary myopic policies with respect to a value function $V$; it is a (stochastic) policy $\pi_V$ for which $T^{\pi_V} V = TV$, where

$$[T^\pi V](x) = \sum_a \pi(x,a) \Big( R(x,a) + \gamma \bigoplus_{x,a} V \Big).$$

Here $\pi(x,a)$ represents the probability that an agent following $\pi$ would choose action $a$ in state $x$. To be certain that every value function possesses a myopic policy, we require that the operator $\bigotimes$ satisfy the following property: for all functions $f$ and states $x$, $\min_a f(x,a) \le \bigotimes f(x, \cdot) \le \max_a f(x,a)$.

The value function with respect to a policy $\pi$, $V^\pi$, can be defined by the simultaneous equations $V^\pi = T^\pi V^\pi$; it is unique when $T^\pi$ is a contraction mapping. A policy is optimal if it is myopic with respect to its own value function. If $\pi$ is an optimal policy, then $V^\pi = V^*$ because it solves the Bellman equation: $V^\pi = T^\pi V^\pi = T V^\pi$.

The next section describes a general theorem that can be used to prove the convergence of several reinforcement-learning algorithms for these and other models.

3 CONVERGENCE THEOREM

The process of finding an optimal value function can be viewed in the following general way. At any moment in time, there is a set of values representing the current approximation of the optimal value function. On each iteration, we apply some dynamic-programming operator, perhaps modified by experience, to the current approximation to generate a new approximation. Over time, we would like the approximation to tend toward the optimal value function.

In this process, there are two types of approximation going on simultaneously. The first is an approximation of the dynamic-programming operator for the underlying model, and the second is the use of the approximate dynamic-programming operator to find the optimal value function. This section presents a theorem that gives a set of conditions under which this type of simultaneous stochastic approximation converges to an optimal value function.

First, we need to define the general stochastic process. Let the set $X$ be the states of the model, and the set $B(X)$ of bounded, real-valued functions over $X$ be the set of value functions. Let $T : B(X) \to B(X)$ be an arbitrary contraction mapping with fixed point $V^*$. If we had direct access to the contraction mapping $T$, we could use it to successively approximate $V^*$. In most reinforcement-learning scenarios, $T$ is not available and we must use experience to construct approximations of $T$.
Consider a sequence of random operators $T_t : B(X) \to (B(X) \to B(X))$ and define $U_{t+1} = [T_t U_t]V$, where $V$ and $U_0 \in B(X)$ are arbitrary value functions. We say $T_t$ approximates $T$ at $V$ if $U_t$ converges to $TV$ with probability 1 uniformly over $X$ (a sequence of functions $f_n$ converges to $f$ with probability 1 uniformly over $X$ if, for the events $w$ for which $f_n(w, \cdot) \to f$, the convergence is uniform in $x$).

The idea is that $T_t$ is a randomized version of $T$ that uses $U_t$ as "memory" to converge to $TV$. The following theorem shows that, under the proper conditions, we can use the sequence $T_t$ to estimate the fixed point $V^*$ of $T$.

Theorem 1. Let $T$ be an arbitrary mapping with fixed point $V^*$, and let $T_t$ approximate $T$ at $V^*$. Let $V_0$ be an arbitrary value function, and define $V_{t+1} = [T_t V_t]V_t$. If there exist functions $0 \le F_t(x) \le 1$ and $0 \le G_t(x) \le 1$ satisfying the conditions below with probability one, then $V_t$ converges to $V^*$ with probability 1 uniformly over $X$:

1. for all $U_1$ and $U_2 \in B(X)$, and all $x \in X$,
   $|([T_t U_1]V^*)(x) - ([T_t U_2]V^*)(x)| \le G_t(x)\, |U_1(x) - U_2(x)|$;
2. for all $U$ and $V \in B(X)$, and all $x \in X$,
   $|([T_t U]V^*)(x) - ([T_t U]V)(x)| \le F_t(x)\, \sup_{x'} |V^*(x') - V(x')|$;
3. for all $k > 0$, $\prod_{t=k}^{n} G_t(x)$ converges to zero uniformly in $x$ as $n$ increases; and,
4. there exists $0 \le \gamma < 1$ such that for all $x \in X$ and large enough $t$, $F_t(x) \le \gamma\, (1 - G_t(x))$.

Note that from the conditions of the theorem, it follows that $T$ is a contraction operator at $V^*$ with index of contraction $\gamma$. The theorem is proven in a more detailed version of this paper (Szepesvári and Littman, 1996).

We next describe some of the intuition behind the statement of the theorem and its conditions. The iterative approximation of $V^*$ is performed by computing $V_{t+1} = [T_t V_t]V_t$, where $T_t$ approximates $T$ with the help of the "memory" present in $V_t$. Because of Conditions 1 and 2, $G_t(x)$ is the extent to which the estimated value function depends on its present value and $F_t(x) \le 1 - G_t(x)$ is the extent to which the estimated value function is based on "new" information (this reasoning becomes clearer in the context of the applications in Section 4).

In some applications, such as Q-learning, the contribution of new information needs to decay over time to ensure that the process converges. In this case, $G_t(x)$ needs to converge to one; Condition 3 allows this as long as the convergence is slow enough to incorporate sufficient information for the process to converge. Condition 4 links the values of $G_t(x)$ and $F_t(x)$ through some quantity $\gamma < 1$. If it were somehow possible to update the values synchronously over the entire state space, the process would converge to $V^*$ even when $\gamma = 1$. In the more interesting asynchronous case, when $\gamma = 1$ the long-term behavior of $V_t$ is not immediately clear; it may even be that $V_t$ converges to something other than $V^*$. The requirement that $\gamma < 1$ ensures that the use of outdated information in the asynchronous updates does not cause a problem for convergence.

One of the most noteworthy aspects of this theorem is that it shows how to reduce the problem of approximating $V^*$ to the problem of approximating $T$ at a particular point (in particular, it is enough if $T$ can be approximated at $V^*$); in many cases, the latter is much easier to achieve and also to prove. For example, the theorem makes the convergence of Q-learning a consequence of the classical Robbins-Monro theorem (Robbins and Monro, 1951).

4 APPLICATIONS

This section makes use of Theorem 1 to prove the convergence of various reinforcement-learning algorithms.

4.1 GENERALIZED Q-LEARNING FOR EXPECTED VALUE MODELS

Consider the family of finite state and action generalized MDPs defined by the Bellman equations

$$V^*(x) = \bigotimes \Big( R(x,a) + \gamma \sum_y P(x,a,y)\, V^*(y) \Big),$$

where the definition of $\bigotimes$ does not depend on $R$ or $P$. A Q-learning algorithm for this class of models can be defined as follows. Given experience $\langle x_t, a_t, y_t, r_t \rangle$ at time $t$ and an estimate $Q_t(x,a)$ of the optimal Q function, let

$$Q_{t+1}(x_t, a_t) := (1 - \alpha_t(x_t, a_t))\, Q_t(x_t, a_t) + \alpha_t(x_t, a_t) \Big( r_t + \gamma \bigotimes Q_t(y_t, \cdot) \Big).$$

We can derive the assumptions necessary for this learning algorithm to satisfy the conditions of Theorem 1 and therefore converge to the optimal Q values. The dynamic-programming operator defining the optimal Q function is

$$[TQ](x,a) = R(x,a) + \gamma \sum_y P(x,a,y) \bigotimes_{a'} Q(y, a').$$

The randomized approximate dynamic-programming operator that gives rise to the Q-learning rule is

$$([T_t Q']Q)(x,a) = \begin{cases} (1 - \alpha_t(x,a))\, Q'(x,a) + \alpha_t(x,a) \big( r_t + \gamma \bigotimes_{a'} Q(y_t, a') \big), & \text{if } x = x_t \text{ and } a = a_t \\ Q'(x,a), & \text{otherwise.} \end{cases}$$

If $y_t$ is randomly selected according to the probability distribution defined by $P(x_t, a_t, \cdot)$; $\bigotimes$ is a non-expansion, and both the expected value and the variance of $\bigotimes Q(y_t, \cdot)$ exist given the way $y_t$ is sampled; $r_t$ has finite variance and expected value, given $x_t$ and $a_t$, equal to $R(x_t, a_t)$; and the learning rates are decayed so that $\sum_t \chi(x_t = x, a_t = a)\, \alpha_t(x,a) = \infty$ and $\sum_t \chi(x_t = x, a_t = a)\, \alpha_t(x,a)^2 < \infty$ with probability 1 uniformly over $X \times A$ (this condition implies, among other things, that every state-action pair is updated infinitely often; here, $\chi$ denotes the characteristic function), then a standard result from the theory of stochastic approximation (Robbins and Monro, 1951) states that $T_t$ approximates $T$ everywhere. That is, this method of using a decayed, exponentially weighted average correctly computes the average one-step reward.

Let

$$G_t(x,a) = \begin{cases} 1 - \alpha_t(x,a), & \text{if } x = x_t \text{ and } a = a_t \\ 1, & \text{otherwise,} \end{cases} \qquad F_t(x,a) = \begin{cases} \gamma\, \alpha_t(x,a), & \text{if } x = x_t \text{ and } a = a_t \\ 0, & \text{otherwise.} \end{cases}$$

These functions satisfy the conditions of Theorem 1 (Condition 3 is implied by the restrictions placed on the sequence of learning rates $\alpha_t$). Theorem 1 therefore implies that this generalized Q-learning algorithm converges to the optimal Q function with probability 1 uniformly over $X \times A$. The convergence of Q-learning for discounted MDPs and alternating Markov games follows trivially from this. Extensions of this result for undiscounted "all-policies-proper" MDPs (Bertsekas and Tsitsiklis, 1989), a soft state aggregation learning rule (Singh et al., 1995), and a "spreading" learning rule are given in a more detailed version of this paper (Szepesvári and Littman, 1996).

4.2 Q-LEARNING FOR MARKOV GAMES

Markov games are a generalization of MDPs and alternating Markov games in which both players simultaneously choose actions at each step. The basic model is defined by the tuple $\langle S, A, B, P, R \rangle$ and a discount factor $\gamma$. As in alternating Markov games, the optimality criterion is one of discounted minimax optimality, but because the players move simultaneously, the Bellman equations take on a more complex form:

$$V^*(x) = \max_{\pi \in \Pi(A)} \min_{b \in B} \sum_{a \in A} \pi(a) \Big( R(x,a,b) + \gamma \sum_{y \in S} P(x,a,b,y)\, V^*(y) \Big).$$

In these equations, $R(x,a,b)$ is the immediate reward for the maximizer for taking action $a$ in state $x$ at the same time the minimizer takes action $b$, $P(x,a,b,y)$ is the probability that state $y$ is reached from state $x$ when the maximizer takes action $a$ and the minimizer takes action $b$, and $\Pi(A)$ represents the set of discrete probability distributions over the set $A$. The sets $S$, $A$, and $B$ are finite.

Once again, optimal policies are policies that are in equilibrium, and there is always a pair of optimal policies that are stationary. Unlike MDPs and alternating Markov games, the optimal policies are sometimes stochastic; there are Markov games in which no deterministic policy is optimal. The stochastic nature of optimal policies explains the need for the optimization over probability distributions in the Bellman equations, and stems from the fact that players must avoid being "second guessed" during action selection. An equivalent set of equations can be written with stochastic choice for the minimizer, and also with the roles of the maximizer and minimizer reversed.
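The maximizing player's summary in these equations requires solving a small matrix game at each state. The sketch below (an illustration only; it relies on scipy's general-purpose linear-programming routine rather than anything specific to the paper) computes $\max_{\pi} \min_{b} \sum_a \pi(a)\, M[a,b]$ for a payoff matrix $M$, such as $M[a,b] = Q(x,a,b)$:

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Minimax value of the zero-sum matrix game M (rows: maximizer's actions).

    Solves: maximize v subject to sum_a pi(a) * M[a, b] >= v for every column b,
    with pi a probability distribution over the rows.
    """
    n_a, n_b = M.shape
    # Variables: [pi_1, ..., pi_{n_a}, v]; linprog minimizes, so minimize -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For each opponent action b: v - sum_a pi(a) M[a, b] <= 0.
    A_ub = np.hstack([-M.T, np.ones((n_b, 1))])
    b_ub = np.zeros(n_b)
    # pi must sum to one.
    A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    pi, v = res.x[:-1], res.x[-1]
    return v, pi
```

Using $M[a,b] = Q_t(y_t, a, b)$ as the payoff matrix gives the summary operator applied in the minimax Q-learning update described next.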

The Q-learning update rule for Markov games (Littman, 1994), given the step-$t$ experience $\langle x_t, a_t, b_t, y_t, r_t \rangle$, has the form

$$Q_{t+1}(x_t, a_t, b_t) := (1 - \alpha_t(x_t, a_t, b_t))\, Q_t(x_t, a_t, b_t) + \alpha_t(x_t, a_t, b_t) \Big( r_t + \gamma \bigotimes_{a,b} Q_t(y_t, \cdot, \cdot) \Big),$$

where

$$\bigotimes_{a,b} f(x, \cdot, \cdot) = \max_{\pi \in \Pi(A)} \min_{b \in B} \sum_{a \in A} \pi(a)\, f(x,a,b).$$

The results of the previous section prove that this rule converges to the optimal Q function under the proper conditions.

4.3 RISK-SENSITIVE MODELS

Heger (1994) described an optimality criterion for MDPs in which only the worst possible value of the next state makes a contribution to the value of a state. An optimal policy under this criterion is one that avoids states for which a bad outcome is possible, even if it is not probable; for this reason, the criterion has a risk-averse quality to it. The generalized Bellman equations for this criterion are

$$V^*(x) = \bigotimes \Big( R(x,a) + \gamma \min_{y : P(x,a,y) > 0} V^*(y) \Big).$$

The argument in Section 4.5 shows that model-based reinforcement learning can be used to find optimal policies in risk-sensitive models, as long as $\bigotimes$ does not depend on $R$ or $P$, and $P$ is estimated in a way that preserves its zero vs. non-zero nature in the limit. For the model in which $\bigotimes f(x, \cdot) = \max_a f(x,a)$, Heger defined a Q-learning-like algorithm that converges to optimal policies without estimating $R$ and $P$ online. In essence, the learning algorithm uses an update rule analogous to the rule in Q-learning with the additional requirement that the initial Q function be set optimistically; that is, $Q_0(x,a)$ must be larger than $Q^*(x,a)$ for all $x$ and $a$.

Using Theorem 1 it is possible to prove the convergence of a generalization of Heger's algorithm to models where $\bigotimes f(x, \cdot) = f(x, \sigma(f,x))$ for some function $\sigma(\cdot)$; that is, as long as the summary value of $f(x, \cdot)$ is equal to $f(x,a)$ for some $a$. The proof is based on estimating the Q-learning algorithm from above by an appropriate process where the Q function is updated only if the received experience tuple is an extremity according to the optimality equation; details are given elsewhere (Szepesvári and Littman, 1996).

4.4 EXPLORATION-SENSITIVE MODELS

John (1995) considered the implications of insisting that reinforcement-learning agents keep exploring forever; he found that better learning performance can be achieved if the Q-learning rule is changed to incorporate the condition of persistent exploration. In John's formulation, the agent is forced to adopt a policy from a restricted set; in one example, the agent must choose a stochastic stationary policy that selects actions at random 5% of the time. This approach requires that the definition of optimality be changed to reflect the restriction on policies. The optimal value function is given by $V^*(x) = \sup_{\pi \in P_0} V^\pi(x)$, where $P_0$ is the set of permitted (stationary) policies, and the associated Bellman equations are

$$V^*(x) = \sup_{\pi \in P_0} \sum_a \pi(x,a) \Big( R(x,a) + \gamma \sum_y P(x,a,y)\, V^*(y) \Big),$$

which corresponds to a generalized MDP model with $\bigoplus_{x,a} g = \sum_y P(x,a,y)\, g(y)$ and $\bigotimes f(x, \cdot) = \sup_{\pi \in P_0} \sum_a \pi(x,a)\, f(x,a)$. Because $\pi(x, \cdot)$ is a probability distribution over actions for any given state $x$, $\bigotimes$ is a non-expansion and, thus, the convergence of the associated Q-learning algorithm follows from the arguments in Section 4.1. As a result, John's learning rule gives the optimal policy under the revised optimality criterion.

4.5 MODEL-BASED METHODS

The defining assumption in reinforcement learning is that the reward and transition functions, $R$ and $P$, are not known in advance. Although Q-learning shows that optimal value functions can be estimated without ever explicitly learning $R$ and $P$, learning $R$ and $P$ makes more efficient use of experience at the expense of additional storage and computation (Moore and Atkeson, 1993). The parameters of $R$ and $P$ can be gleaned from experience by keeping statistics for each state-action pair on the expected reward and the proportion of transitions to each next state.
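A minimal sketch (an illustration under assumed data structures) of keeping those statistics, producing the empirical estimates $R_t$ and $P_t$ used by the model-based methods discussed below:

```python
import numpy as np

class EmpiricalModel:
    """Maintains empirical estimates of R and P from experience tuples."""

    def __init__(self, num_states, num_actions):
        self.counts = np.zeros((num_states, num_actions, num_states))
        self.reward_sums = np.zeros((num_states, num_actions))

    def update(self, x, a, y, r):
        """Record one experience <x, a, y, r>."""
        self.counts[x, a, y] += 1
        self.reward_sums[x, a] += r

    def estimates(self):
        """Return (R_t, P_t) as empirical means; unvisited pairs stay at zero."""
        n = np.maximum(self.counts.sum(axis=2), 1)   # visits to (x, a), floored at 1
        R_t = self.reward_sums / n
        P_t = self.counts / n[..., None]
        return R_t, P_t
```

Note that these empirical frequencies place zero probability on transitions that have never been observed, which is the property needed for estimating the risk-sensitive operator discussed in Section 4.5.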
In model-based reinforcement learning, $R$ and $P$ are estimated on-line, and the value function is updated according to the approximate dynamic-programming operator derived from these estimates. Theorem 1 implies the convergence of a wide variety of model-based reinforcement-learning methods. The dynamic-programming operator defining the optimal value function for generalized MDPs is given in Equation 2.

Here we assume that $\bigoplus_{x,a}$ may depend on $P$ and/or $R$, but $\bigotimes$ may not. It is possible to extend the following argument to allow $\bigotimes$ to depend on $P$ and $R$ as well. In model-based reinforcement learning, $R$ and $P$ are estimated by the quantities $R_t$ and $P_t$, and $\bigoplus_{x,a,t}$ is an estimate of the $\bigoplus_{x,a}$ operator defined using $R_t$ and $P_t$. As long as every state-action pair is visited infinitely often, there are a number of simple methods for computing $R_t$ and $P_t$ that converge to $R$ and $P$. A bit more care is needed to ensure that $\bigoplus_{x,a,t}$ converges to $\bigoplus_{x,a}$, however. For example, in expected-reward models, $\bigoplus_{x,a} g = \sum_y P(x,a,y)\, g(y)$ and the convergence of $P_t$ to $P$ guarantees the convergence of $\bigoplus_{x,a,t}$ to $\bigoplus_{x,a}$. On the other hand, in a risk-sensitive model, $\bigoplus_{x,a} g = \min_{y : P(x,a,y) > 0} g(y)$ and it is necessary to approximate $P$ in a way that ensures that the set of $y$ such that $P_t(x,a,y) > 0$ converges to the set of $y$ such that $P(x,a,y) > 0$. This can be accomplished easily, for example, by setting $P_t(x,a,y) = 0$ if no transition from $x$ to $y$ under $a$ has been observed.

Assuming $P$ and $R$ are estimated in a way that results in the convergence of $\bigoplus_{x,a,t}$ to $\bigoplus_{x,a}$, the sequence of dynamic-programming operators $T_t$ defined by

$$([T_t U]V)(x) = \begin{cases} \bigotimes \big( R_t(x,a) + \gamma \bigoplus_{x,a,t} V \big), & \text{if } x \in X_t \\ U(x), & \text{otherwise,} \end{cases}$$

approximates $T$ for all value functions. The set $X_t \subseteq S$ represents the set of states whose values are updated on step $t$; one popular choice is to set $X_t = \{x_t\}$. The functions

$$G_t(x) = \begin{cases} 0, & \text{if } x \in X_t \\ 1, & \text{otherwise,} \end{cases} \qquad F_t(x) = \begin{cases} \gamma, & \text{if } x \in X_t \\ 0, & \text{otherwise,} \end{cases}$$

satisfy the conditions of Theorem 1 as long as each $x$ is in infinitely many $X_t$ sets (Condition 3) and the discount factor $\gamma$ is less than 1 (Condition 4). As a consequence of this argument and Theorem 1, model-based methods can be used to find optimal policies in MDPs, alternating Markov games, Markov games, risk-sensitive MDPs, and exploration-sensitive MDPs. Also, letting $R_t = R$ and $P_t = P$ for all $t$, this result implies that real-time dynamic programming (Barto et al., 1995) converges to the optimal value function.

5 CONCLUSIONS

In this paper, we presented a generalized model of Markov decision processes, and proved the convergence of several reinforcement-learning algorithms in the generalized model.

Other Results. We have derived a collection of results (Szepesvári and Littman, 1996) for the generalized MDP model that demonstrate its general applicability: the Bellman equations can be solved by value iteration; a myopic policy with respect to an approximately optimal value function gives an approximately optimal policy; when $\bigotimes$ has a particular "maximization" property, policy iteration converges to the optimal value function; and, for models with the maximization property and finite state and action spaces, both value iteration and policy iteration identify optimal policies in pseudopolynomial time.

Related Work. The work presented here is closely related to several previous research efforts. Szepesvári (1995) described a related generalized reinforcement-learning model and presented conditions under which there is an optimal (stationary) policy that is myopic with respect to the optimal value function. Tsitsiklis (1994) developed the connection between stochastic-approximation theory and reinforcement learning in MDPs. Our work is similar in spirit to that of Jaakkola, Jordan, and Singh (1994). We believe the form of Theorem 1 makes it particularly convenient for proving the convergence of reinforcement-learning algorithms; our theorem reduces the proof of the convergence of an asynchronous process to a simpler proof of convergence of a corresponding synchronized one. This idea enables us to prove the convergence of asynchronous stochastic processes whose underlying synchronous process is not of the Robbins-Monro type (e.g., risk-sensitive MDPs, model-based algorithms, etc.).
Future Work. There are many areas of interest in the theory of reinforcement learning that we would like to address in future work. The results in this paper primarily concern reinforcement learning in contractive models ($\gamma < 1$ or all-policies-proper), and there are important non-contractive reinforcement-learning scenarios, for example, reinforcement learning under an average-reward criterion (Mahadevan, 1996). It would be interesting to develop a TD($\lambda$) algorithm (Sutton, 1988) for generalized MDPs.

Theorem 1 is not restricted to finite state spaces, and it might be valuable to prove the convergence of a reinforcement-learning algorithm for an infinite state-space model.

Conclusion. By identifying common elements among several reinforcement-learning scenarios, we created a new class of models that generalizes existing models in an interesting way. In the generalized framework, we replicated the established convergence proofs for reinforcement learning in Markov decision processes, and proved new results concerning the convergence of reinforcement-learning algorithms in game environments, under a risk-sensitive assumption, and under an exploration-sensitive assumption. At the heart of our results is a new stochastic-approximation theorem that is easy to apply to new situations.

Acknowledgements

Research supported by PHARE H /1022 and OTKA Grant no. F , and by Bellcore's Support for Doctoral Education Program.

References

Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81-138.

Bertsekas, D. P. and Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.

Condon, A. (1992). The complexity of stochastic games. Information and Computation, 96(2):203-224.

Gullapalli, V. and Barto, A. G. (1994). Convergence of indirect adaptive asynchronous value iteration algorithms. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 695-702, San Mateo, CA. Morgan Kaufmann.

Heger, M. (1994). Consideration of risk in reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 105-111, San Francisco, CA. Morgan Kaufmann.

Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6).

John, G. H. (1995). When the best move isn't optimal: Q-learning with exploration. Unpublished manuscript.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157-163, San Francisco, CA. Morgan Kaufmann.

Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1/2/3):159-196.

Moore, A. W. and Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407.

Singh, S., Jaakkola, T., and Jordan, M. (1995). Reinforcement learning with soft state aggregation. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, Cambridge, MA. The MIT Press.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3(1):9-44.

Szepesvári, C. (1995). General framework for reinforcement learning. In Proceedings of ICANN'95, Paris.

Szepesvári, C. and Littman, M. L. (1996). Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. Technical Report CS-96-11, Brown University, Providence, RI.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, pages 58-67.

Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3).

Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279-292.


More information

3.4 Numerical integration

3.4 Numerical integration 3.4. Numericl integrtion 63 3.4 Numericl integrtion In mny economic pplictions it is necessry to compute the definite integrl of relvlued function f with respect to "weight" function w over n intervl [,

More information

Theoretical foundations of Gaussian quadrature

Theoretical foundations of Gaussian quadrature Theoreticl foundtions of Gussin qudrture 1 Inner product vector spce Definition 1. A vector spce (or liner spce) is set V = {u, v, w,...} in which the following two opertions re defined: (A) Addition of

More information

ODE: Existence and Uniqueness of a Solution

ODE: Existence and Uniqueness of a Solution Mth 22 Fll 213 Jerry Kzdn ODE: Existence nd Uniqueness of Solution The Fundmentl Theorem of Clculus tells us how to solve the ordinry dierentil eqution (ODE) du f(t) dt with initil condition u() : Just

More information

Mathematics Number: Logarithms

Mathematics Number: Logarithms plce of mind F A C U L T Y O F E D U C A T I O N Deprtment of Curriculum nd Pedgogy Mthemtics Numer: Logrithms Science nd Mthemtics Eduction Reserch Group Supported y UBC Teching nd Lerning Enhncement

More information

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature CMDA 4604: Intermedite Topics in Mthemticl Modeling Lecture 19: Interpoltion nd Qudrture In this lecture we mke brief diversion into the res of interpoltion nd qudrture. Given function f C[, b], we sy

More information

Chapter 5 : Continuous Random Variables

Chapter 5 : Continuous Random Variables STAT/MATH 395 A - PROBABILITY II UW Winter Qurter 216 Néhémy Lim Chpter 5 : Continuous Rndom Vribles Nottions. N {, 1, 2,...}, set of nturl numbers (i.e. ll nonnegtive integers); N {1, 2,...}, set of ll

More information

FUZZY HOMOTOPY CONTINUATION METHOD FOR SOLVING FUZZY NONLINEAR EQUATIONS

FUZZY HOMOTOPY CONTINUATION METHOD FOR SOLVING FUZZY NONLINEAR EQUATIONS VOL NO 6 AUGUST 6 ISSN 89-668 6-6 Asin Reserch Publishing Networ (ARPN) All rights reserved wwwrpnjournlscom FUZZY HOMOTOPY CONTINUATION METHOD FOR SOLVING FUZZY NONLINEAR EQUATIONS Muhmmd Zini Ahmd Nor

More information

A ROLLOUT CONTROL ALGORITHM FOR DISCRETE-TIME STOCHASTIC SYSTEMS

A ROLLOUT CONTROL ALGORITHM FOR DISCRETE-TIME STOCHASTIC SYSTEMS Proceedings of the ASE 2 Dynmic Systems nd Control Conference DSCC2 September 2-5, 2, Cmbridge, sschusetts, USA DSCC2- A ROLLOUT CONTROL ALGORITH FOR DISCRETE-TIE STOCHASTIC SYSTES Andres A. liopoulos

More information

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior Reversls of Signl-Posterior Monotonicity for Any Bounded Prior Christopher P. Chmbers Pul J. Hely Abstrct Pul Milgrom (The Bell Journl of Economics, 12(2): 380 391) showed tht if the strict monotone likelihood

More information

Bernoulli Numbers Jeff Morton

Bernoulli Numbers Jeff Morton Bernoulli Numbers Jeff Morton. We re interested in the opertor e t k d k t k, which is to sy k tk. Applying this to some function f E to get e t f d k k tk d k f f + d k k tk dk f, we note tht since f

More information

Math 113 Exam 2 Practice

Math 113 Exam 2 Practice Mth Em Prctice Februry, 8 Em will cover sections 6.5, 7.-7.5 nd 7.8. This sheet hs three sections. The first section will remind you bout techniques nd formuls tht you should know. The second gives number

More information

APPROXIMATE INTEGRATION

APPROXIMATE INTEGRATION APPROXIMATE INTEGRATION. Introduction We hve seen tht there re functions whose nti-derivtives cnnot be expressed in closed form. For these resons ny definite integrl involving these integrnds cnnot be

More information

SCHOOL OF ENGINEERING & BUILT ENVIRONMENT. Mathematics. Basic Algebra

SCHOOL OF ENGINEERING & BUILT ENVIRONMENT. Mathematics. Basic Algebra SCHOOL OF ENGINEERING & BUILT ENVIRONMENT Mthemtics Bsic Algebr Opertions nd Epressions Common Mistkes Division of Algebric Epressions Eponentil Functions nd Logrithms Opertions nd their Inverses Mnipulting

More information

SUMMER KNOWHOW STUDY AND LEARNING CENTRE

SUMMER KNOWHOW STUDY AND LEARNING CENTRE SUMMER KNOWHOW STUDY AND LEARNING CENTRE Indices & Logrithms 2 Contents Indices.2 Frctionl Indices.4 Logrithms 6 Exponentil equtions. Simplifying Surds 13 Opertions on Surds..16 Scientific Nottion..18

More information

and that at t = 0 the object is at position 5. Find the position of the object at t = 2.

and that at t = 0 the object is at position 5. Find the position of the object at t = 2. 7.2 The Fundmentl Theorem of Clculus 49 re mny, mny problems tht pper much different on the surfce but tht turn out to be the sme s these problems, in the sense tht when we try to pproimte solutions we

More information

Multiple Integrals. Review of Single Integrals. Planar Area. Volume of Solid of Revolution

Multiple Integrals. Review of Single Integrals. Planar Area. Volume of Solid of Revolution Multiple Integrls eview of Single Integrls eding Trim 7.1 eview Appliction of Integrls: Are 7. eview Appliction of Integrls: Volumes 7.3 eview Appliction of Integrls: Lengths of Curves Assignment web pge

More information

A Signal-Level Fusion Model for Image-Based Change Detection in DARPA's Dynamic Database System

A Signal-Level Fusion Model for Image-Based Change Detection in DARPA's Dynamic Database System SPIE Aerosense 001 Conference on Signl Processing, Sensor Fusion, nd Trget Recognition X, April 16-0, Orlndo FL. (Minor errors in published version corrected.) A Signl-Level Fusion Model for Imge-Bsed

More information

Chapters 4 & 5 Integrals & Applications

Chapters 4 & 5 Integrals & Applications Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions

More information

An approximation to the arithmetic-geometric mean. G.J.O. Jameson, Math. Gazette 98 (2014), 85 95

An approximation to the arithmetic-geometric mean. G.J.O. Jameson, Math. Gazette 98 (2014), 85 95 An pproximtion to the rithmetic-geometric men G.J.O. Jmeson, Mth. Gzette 98 (4), 85 95 Given positive numbers > b, consider the itertion given by =, b = b nd n+ = ( n + b n ), b n+ = ( n b n ) /. At ech

More information

Research Article On Existence and Uniqueness of Solutions of a Nonlinear Integral Equation

Research Article On Existence and Uniqueness of Solutions of a Nonlinear Integral Equation Journl of Applied Mthemtics Volume 2011, Article ID 743923, 7 pges doi:10.1155/2011/743923 Reserch Article On Existence nd Uniqueness of Solutions of Nonliner Integrl Eqution M. Eshghi Gordji, 1 H. Bghni,

More information

Lecture 09: Myhill-Nerode Theorem

Lecture 09: Myhill-Nerode Theorem CS 373: Theory of Computtion Mdhusudn Prthsrthy Lecture 09: Myhill-Nerode Theorem 16 Ferury 2010 In this lecture, we will see tht every lnguge hs unique miniml DFA We will see this fct from two perspectives

More information

Learning Moore Machines from Input-Output Traces

Learning Moore Machines from Input-Output Traces Lerning Moore Mchines from Input-Output Trces Georgios Gintmidis 1 nd Stvros Tripkis 1,2 1 Alto University, Finlnd 2 UC Berkeley, USA Motivtion: lerning models from blck boxes Inputs? Lerner Forml Model

More information