A ROLLOUT CONTROL ALGORITHM FOR DISCRETE-TIME STOCHASTIC SYSTEMS


Proceedings of the ASME 2010 Dynamic Systems and Control Conference
DSCC2010
September 12-15, 2010, Cambridge, Massachusetts, USA

DSCC2010-

Andreas A. Malikopoulos
Propulsion System Research Lab
General Motors Global Research & Development
Warren, MI 48090
andreas.malikopoulos@gm.com

ABSTRACT

The growing demand for making autonomous intelligent systems that can learn how to improve their performance while interacting with their environment has induced significant research on computational cognitive models. Computational intelligence, or rationality, can be achieved by modeling a system and the interaction with its environment through actions, perceptions, and associated costs. A widely adopted paradigm for modeling this interaction is the controlled Markov chain. In this context, the problem is formulated as a sequential decision-making process in which an intelligent system has to select those control actions in several time steps to achieve long-term goals. This paper presents a rollout control algorithm that aims to build an online decision-making mechanism for a controlled Markov chain. The algorithm yields a lookahead suboptimal control policy. Under certain conditions, a theoretical bound on its performance can be established.

1. INTRODUCTION

Sequential decision models [1, 2] are mathematical abstractions of situations in which decisions must be made in several decision epochs while incurring a certain cost (or reward) at each epoch. Each decision may influence the circumstances under which future decisions will be made, and thus, the decision maker must balance his/her desire to minimize (maximize) the cost (reward) of the present decision against his/her desire to avoid future situations where high cost is inevitable. A large class of sequential decision-making problems under uncertainty can be solved using dynamic programming (DP) [3]. However, the computational cost of DP in some instances may be prohibitive and can grow intractably as the size of the problem increases.
As an alternative approach to address this issue, Approximate Dynamic Programming (ADP) [4] is employed, providing suboptimal control methods for deterministic and stochastic problems. Rollout algorithms and model predictive control are two major methods within ADP with properties founded on policy iteration. The main idea of rollout algorithms [5-10] is to obtain an improved policy starting from some other suboptimal policy using a one-time policy improvement. It has been proposed by Abramson [11] and by Tesauro and Galperin [12] in the context of game-playing computer programs. In the latter, a backgammon position is evaluated by simulating many games starting from that position and the results are averaged. Model predictive control [13-17] is a popular approach in a variety of control system design contexts, and in particular, in chemical process control. It was motivated by the desire to introduce nonlinearities and constraints into the linear-quadratic control framework, while obtaining a suboptimal but stable closed-loop system.

Other alternatives for approaching these problems have been primarily developed in the field of Reinforcement Learning (RL) [4, 18, 19]. RL has aimed to provide algorithms, founded on DP, for learning suboptimal control policies when analytical methods cannot be used effectively, or the system's state transition probabilities are not known [20]. Although many of these algorithms are eventually guaranteed to find sub-optimal policies in sequential decision-making problems under uncertainty, their use of the accumulated data acquired over the learning process is inefficient, and they require a significant amount of experience to achieve acceptable performance [21]. This requirement arises due to the formation of these algorithms in deriving control policies without learning the system dynamics en route, that is, they do not solve the system identification problem simultaneously.
In addition, RL algorithms are suited to problems in which the system needs to achieve particular goal states, which imposes limitations in employing these algorithms efficiently to solve particular problems.

Copyright 2010 by General Motors. Redistribution subject to ASME license or copyright.

The Predictive Optimal Decision-making (POD) learning model [22, 23] has aimed to address the system identification problem for a completely unknown system by learning in real time the system's evolution over a varying and unknown finite time horizon. The POD model has been employed in various applications towards making autonomous intelligent systems that can learn to improve their performance over time in stochastic environments. In the cart-pole balancing problem [23], an inverted pendulum was made capable of realizing the balancing control policy and turning into a stable system. In a vehicle cruise control implementation [23], an autonomous cruise controller was developed to learn to maintain the desired vehicle's speed at different road grades. POD has also taken steps toward the development of autonomous intelligent propulsion systems realizing their optimal operation with respect to the driver's driving style [22, 24].

In this paper, a rollout control algorithm that aims to build an online decision-making mechanism for controlled Markov chains is presented. The algorithm can be combined with the POD model to yield a lookahead suboptimal control policy that assesses the system output with respect to alternative control actions and selects those that optimize specified performance criteria. A theoretical bound on its performance is proven in Theorem 4.1, thus establishing that, under certain conditions, the lookahead control policy exists.

The remainder of the paper proceeds as follows: Section 2 establishes the mathematical framework of the controlled Markov chain. Section 3 reviews briefly the Predictive Optimal Decision-making (POD) computational model that aims to learn the transition probabilities and associated costs. Section 4 introduces the rollout control algorithm and formulates the theoretical bound on its performance.
Concluding remarks are presented in Section 5.

2. PROBLEM FORMULATION

The stochastic system model establishes the mathematical framework for the representation of dynamic systems that evolve stochastically over time [2, 25, 26], that is, when incurring a stochastic disturbance or noise at time $k$, $w_k$, in their portrayal. The one-dimensional model is given by an equation of the form

$$ s_{k+1} = f(s_k, a_k, w_k), \quad k = 0, 1, \ldots \tag{1} $$

where $s_k$ is the system's state that belongs to some state space $S = \{1, 2, \ldots, N\}$, $N \in \mathbb{N}$, $f$ is a function that describes how the system's state is updated, $a_k$ is the control action, and $w_k$ is the disturbance at time $k$. The sequence $\{w_k, k \ge 0\}$ is treated as a stochastic process, and the joint probability distribution of the random variables $w_0, w_1, \ldots, w_k$ is unknown for each $k$. The system output is represented by

$$ y_k = h(s_k, v_k), \quad k = 0, 1, \ldots \tag{2} $$

where $y_k$ is the observation or system's output, $h$ is a function that describes how the system output is updated, and $v_k$ is the measurement error or noise. The sequence $\{v_k, k \ge 0\}$ is also considered a stochastic process with unknown probability distribution.

We are interested in deriving a control policy so that a given performance criterion is optimized over all admissible policies $\Pi$. An admissible policy consists of a sequence of functions $\pi = \{\mu_0, \mu_1, \ldots\}$, where $\mu_k$ maps states $s_k$ into actions $a_k = \mu_k(s_k)$. The system's state $s_k$ depends upon the input sequence $a_0, a_1, \ldots$ as well as the random variables $w_0, w_1, \ldots$, Eq. (1). Consequently, $s_k$ is a random variable; the system output $y_k = h(s_k, v_k)$ is a function of the random variables $s_0, s_1, \ldots, v_0, v_1, \ldots$, and thus is also a random variable. Similarly, the sequence of control actions $a_k = \mu_k(s_k)$, $\{a_k, k \ge 0\}$, constitutes a stochastic process.

Suppose that the previous values of the random variables $s_m$ and $a_m$, $m \le k$, are known. Then the conditional distribution of $s_{k+1}$ given these values will be

$$ P^{\pi}(s_{k+1} \mid s_k, \ldots, s_0, a_k, \ldots, a_0) = P^{\pi}(w_k \mid s_k, \ldots, s_0, a_k, \ldots, a_0). \tag{3} $$

The conditional probability distribution of $s_{k+1}$ given $s_k$ and $a_k$ can be independent of the previous values of states and control actions, if it is guaranteed that for every control policy $\pi$, $w_k$ is independent of the random variables $s_m$ and $a_m$, $m < k$.
Kumar and Varaiya [26] proved that this property is imposed under the assumption that the following random variables $s_0, w_0, w_1, \ldots, v_0, v_1, \ldots$ are all independent. The latter imposes a condition directly on the basic random variables which eventually yields that the state $s_{k+1}$ depends only on $s_k$ and $a_k$. Moreover, the conditional probability distributions do not depend on the control policy $\pi$, and thus the superscript can be dropped:

$$ P(s_{k+1} \mid s_k, \ldots, s_0, a_k, \ldots, a_0) = P(s_{k+1} \mid s_k, a_k). \tag{4} $$
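As a small illustration of Eq. (4), the sketch below derives a transition law P(j | i, a) from a state-update function and an independent disturbance. The dynamics `f`, the disturbance distribution `W_PMF`, the action set, and the state-space size are all hypothetical choices made only for this example; the paper treats the disturbance distribution as unknown.

```python
from fractions import Fraction

# Hypothetical example: state space S = {1, ..., N} and a three-point disturbance.
N = 4
STATES = range(1, N + 1)
ACTIONS = (-1, 0, 1)
# Assumed pmf of the independent disturbance w_k (illustration only).
W_PMF = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}


def f(s, a, w):
    """Assumed dynamics s_{k+1} = f(s_k, a_k, w_k): shift by a + w, clipped to S."""
    return min(N, max(1, s + a + w))


def transition_kernel():
    """Return P[(i, a)][j] = P(s_{k+1} = j | s_k = i, a_k = a)."""
    kernel = {}
    for i in STATES:
        for a in ACTIONS:
            row = {j: Fraction(0) for j in STATES}
            # Each disturbance value contributes its probability to the state it leads to.
            for w, p_w in W_PMF.items():
                row[f(i, a, w)] += p_w
            kernel[(i, a)] = row
    return kernel


P = transition_kernel()
```

Because the disturbance is drawn independently of the past, each row `P[(i, a)]` depends only on the current state and action, which is the property Eq. (4) expresses.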

A stochastic process $\{s_k, k \ge 0\}$ satisfying Eq. (4) is called a Markov Process. If the state space is discrete, then the process is defined as a controlled Markov chain. The discrete-time, stationary controlled Markov chain is a stochastic dynamic system specified by the five-tuple $\{S, A, \mathcal{A}, P, R\}$, where

(a) $S = \{1, 2, \ldots, N\}$ is the finite state space;
(b) $A$ is the compact action space;
(c) $\mathcal{A}$ is a family $\{A(i) \mid i \in S\}$ of nonempty measurable subsets of $A$, where $A(i)$ denotes the set of feasible actions when the system is at state $i \in S$, with the property that the set $K := \{(i, a) : i \in S, a \in A(i)\}$ of feasible pairs is a measurable subset of $S \times A$ and contains the graph of a measurable function from $S$ to $A$;
(d) $P$ is the stochastic kernel on $S$ given $K$, that is, the transition probability of the system from state $i \in S$ to $j \in S$; and
(e) $R$ is the measurable one-stage cost function, $R : K \to \mathbb{R}$.

The evolution of the system occurs at each of a sequence of stages $k = 0, 1, \ldots$, and is portrayed by the sequence of the random variables $s_k$ and $a_k$ corresponding to the system's state and control action. At each stage, the controller observes the system's state $s_k = i \in S$, and executes an action, $a_k$, from the feasible set of actions $A(i) \subseteq A$ at this state. At the next stage, the system transits to the state $s_{k+1} = j \in S$ imposed by the conditional probability $P(j \mid i, a)$, and a cost $R(j \mid i, a)$ is incurred. After the transition to the next state has occurred, a new action is selected, and the process is repeated. The completed period of time over which the system is observed is called the decision-making horizon and is denoted by $M$. The horizon can be either finite or infinite; in this paper, we consider finite-horizon decision-making problems.

A control policy determines the probability distribution of the state process $\{s_k, k \ge 0\}$ and the control process $\{a_k, k \ge 0\}$. Different policies will lead to different probability distributions. In optimal control problems, the objective is to derive the optimal control policy that minimizes (maximizes) the accumulated cost (reward) incurred at each state transition per decision epoch.
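The stage-by-stage evolution described above can be sketched as a short simulation. This is a minimal sketch under stated assumptions: the three-state, two-action chain, its transition probabilities `P[i][a][j]`, its costs `R[i][a][j]`, and the 0-based state labels are invented for illustration and do not come from the paper.

```python
import random

# Hypothetical chain with states 0..2 and actions 0..1 (all numbers are made up).
# P[i][a][j] = P(j | i, a); every row sums to 1.
P = [
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.3, 0.4, 0.3], [0.5, 0.25, 0.25]],
    [[0.2, 0.2, 0.6], [0.6, 0.3, 0.1]],
]
# R[i][a][j] = R(j | i, a), the one-stage cost of the transition i -> j under a.
R = [
    [[1.0, 4.0, 6.0], [2.0, 3.0, 5.0]],
    [[2.0, 1.0, 3.0], [4.0, 2.0, 1.0]],
    [[5.0, 3.0, 1.0], [1.0, 2.0, 6.0]],
]


def step(i, a, rng):
    """Sample the next state j from P(. | i, a) and return (j, R(j | i, a))."""
    j = rng.choices(range(len(P[i][a])), weights=P[i][a])[0]
    return j, R[i][a][j]


def simulate(policy, s0, horizon, rng):
    """Run the chain for `horizon` stages; policy[k][s] is the action at stage k."""
    states, total_cost = [s0], 0.0
    s = s0
    for k in range(horizon):
        a = policy[k][s]
        s, cost = step(s, a, rng)
        states.append(s)
        total_cost += cost
    return states, total_cost
```

For example, `simulate([[0, 1, 0]] * 4, 0, 4, random.Random(0))` returns the visited states and the cost accumulated along that single sample path; different seeds give different paths, which is why the accumulated cost is itself a random variable.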
If a policy $\pi$ is fixed, the cost incurred by $\pi$ when the process starts from an initial state $s_0$ and up to the time horizon $M$ is

$$ J^{\pi}(s_0) = \sum_{k=0}^{M-1} R(s_{k+1} = j \mid s_k = i, a_k), \quad i, j \in S, \; a_k \in A(i). \tag{5} $$

The accumulated cost $J^{\pi}(s_0)$ is a random variable since $s_k$ and $a_k$ are random variables. Hence the expected accumulated cost of a control policy is given by

$$ J^{\pi}(s_0) = E_{s_k \in S,\, a_k \in A(s_k)} \Big\{ \sum_{k=0}^{M-1} R(s_{k+1} = j \mid s_k = i, a_k(s_k)) \Big\} = E_{s_k \in S,\, \mu_k \in A(s_k)} \Big\{ \sum_{k=0}^{M-1} R(s_{k+1} = j \mid s_k = i, \mu_k(s_k)) \Big\} $$
$$ = \sum_{k=0}^{M-1} \sum_{j \in S} p(s_{k+1} = j \mid s_k = i, \mu_k(s_k)) \, R(s_{k+1} = j \mid s_k = i, \mu_k(s_k)), \tag{6} $$

where the expectation is taken with respect to the probability distribution of $\{s_k, k \ge 0\}$ and $\{a_k, k \ge 0\}$ determined by the control policy $\pi$. The optimal policy $\pi^* = \{\mu_0^*, \ldots, \mu_{M-1}^*\}$ can be derived by

$$ \pi^* = \arg\min_{\pi \in \Pi} J^{\pi}(s_0). \tag{7} $$

3. ONLINE SELF-LEARNING IDENTIFICATION

The problem of making autonomous intelligent systems is formulated as sequential decision-making under uncertainty. In this context, an intelligent system (decision maker), e.g., an advanced propulsion system, a robot, an automated manufacturing system, etc., has to select those actions in several time steps (decision epochs) to achieve long-term goals efficiently. This problem involves two major sub-problems: (a) the system identification problem, and (b) the stochastic control problem. The first is exploitation of the information acquired from the system output to identify its behavior, that is, how a state representation can be built by observing the system's state transitions. The second is assessment of the system output with respect to alternative control policies, and selecting those that optimize specified performance criteria.

The Predictive Optimal Decision-making (POD) learning model [23] is intended to address the system identification problem for a completely unknown system by learning in real time the system dynamics over a varying and unknown finite time horizon. The model embedded in the self-learning controller is constituted by a state representation which attempts to provide an efficient process in realizing the state transitions that occurred in the Markov domain.
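As a rough illustration of the identification sub-problem, the sketch below keeps running counts of observed transitions and turns them into empirical transition probabilities and average costs. It is a plain frequency-count estimator assumed here for illustration, not the POD model of [23]; the class name `TransitionEstimator` and its methods are hypothetical.

```python
from collections import defaultdict


class TransitionEstimator:
    """Frequency-count estimates of P(j | i, a) and R(j | i, a) (illustrative only)."""

    def __init__(self):
        # counts[(i, a)][j] = number of observed transitions i -> j under action a
        self.counts = defaultdict(lambda: defaultdict(int))
        # cost_sums[(i, a)][j] = sum of the costs observed on those transitions
        self.cost_sums = defaultdict(lambda: defaultdict(float))

    def observe(self, i, a, j, cost):
        """Record one observed transition and the cost it incurred."""
        self.counts[(i, a)][j] += 1
        self.cost_sums[(i, a)][j] += cost

    def prob(self, j, i, a):
        """Empirical estimate of P(j | i, a); 0.0 if (i, a) was never visited."""
        total = sum(self.counts[(i, a)].values())
        return self.counts[(i, a)][j] / total if total else 0.0

    def mean_cost(self, j, i, a):
        """Average observed cost of the transition i -> j under a; 0.0 if unseen."""
        n = self.counts[(i, a)][j]
        return self.cost_sums[(i, a)][j] / n if n else 0.0
```

Feeding the estimator one `(i, a, j, cost)` tuple per decision epoch updates both maps online, so the estimates can be queried by a controller at any time while the system keeps interacting with its environment.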
The model considers systems whose evolution can be modeled as a controlled Markov chain under the assumptions that the Markov chain is homogeneous, ergodic, and irreducible. The learning process of the POD model transpires while the system interacts with its environment. Taken in conjunction with assigning values of the control actions from the feasible action space, $A$, this interaction portrays the progressive enhancement of the controller's knowledge of the system's evolution with respect to the control actions. More precisely, at each of a sequence of decision epochs $k = 0, 1, 2, \ldots, M$, a state $s_k$ is introduced to the controller, and on that basis the controller selects an action, $a_k = \mu(s_k)$. This state arises as a result of the system's evolution. One epoch later, as a consequence of this action, the system transits to a new state $s_{k+1} = j$, and receives a numerical cost, $R(s_{k+1} = j \mid s_k = i, a_k)$. At each epoch, the controller implements a mapping from the Cartesian product of the state space and action space to the set of real numbers, $S \times A \to \mathbb{R}$, by means of the costs that it receives. Similarly, another mapping from the Cartesian product of the state space and action space to the closed set $[0, 1]$ is executed, $S \times A \to [0, 1]$, i.e., the transition probability matrix, $P(\cdot, \cdot)$. The latter essentially perceives the incidence in which particular states or particular sequences of states arise.

The POD model possesses a structure that enables convergent behavior of the conditional probabilities infused by the POD state-space representation to the stationary distribution. This behavior is desirable in the effort towards making autonomous intelligent systems that can learn to improve their performance over time in stochastic environments. The convergence of POD to the stationary distribution of the Markov state transitions has been proven in [27], hence establishing POD as a robust model. As the process is stochastic, however, it is still necessary for the controller to build a decision-making mechanism to derive the control policy. This policy is expressed by means of a mapping from states to probabilities of selecting the actions, resulting in the minimum expected accumulated cost.

4. ROLLOUT CONTROL ALGORITHM

The objective of the control algorithm is to evaluate in real time the optimal action at each epoch not only for the current state, but also for the next two subsequent states over the following epochs.
The requirement of real-time implementation imposes a computational burden in allowing the algorithm to look further ahead in time, thus evaluating an action over additional succeeding states. Suppose that the current state is $s_k$ and the following state given an action $a_k \in A(s_k)$ is $s_{k+1}$. The immediate cost incurred by this transition is $R(s_{k+1} \mid s_k, a_k)$. The minimum expected cost for the next two subsequent states is perceived in terms of the magnitude, $V(s_{k+1})$, and is equal to

$$ V(s_{k+1}) = \min_{a_{k+1} \in A(s_{k+1})} E_{s_{k+2} \in S} \{ R(s_{k+2} \mid s_{k+1}, a_{k+1}) \}. \tag{8} $$

For the problem of optimal control of uncertain systems, which is treated in a stochastic framework, all uncertain quantities are described by probability distributions and the expected value of the overall cost is minimized. In this context, the control policy realized by the algorithm is based on the minimax control approach, whereby the worst possible values of the uncertain quantities within the given set are assumed to occur. This essentially assures that the control policy will result in at most a maximum overall cost. Consequently, being at state $s_k$ the control algorithm provides the policy $\bar{\pi} = \{\bar{\mu}_0, \ldots, \bar{\mu}_{M-1}\}$, in terms of the values of the controllable variables, as

$$ \bar{\pi}(s_k) = \arg\min_{\bar{\mu}_k(s_k) \in A(s_k)} \max_{s_{k+1} \in S} \big[ R(s_{k+1} \mid s_k, a_k) + V(s_{k+1}) \big]. \tag{9} $$

To evaluate the efficiency of the algorithm, the establishment of a performance bound in terms of the accumulated cost over the decision epochs is necessary. The following Lemma (see, e.g., [1]) aims to provide a useful step toward presenting the main result (Theorem 4.1).

Lemma 4.1: Let $f : S \to [-\infty, \infty]$ and $g : S \times A \to [-\infty, \infty]$ be two functions. If

$$ \min_{a \in A} g(i, a) > -\infty, \quad \forall i \in S, \tag{10} $$

then we have

$$ \min_{\mu(i) \in A} \max_{i \in S} \big[ f(i) + g(i, \mu(i)) \big] = \max_{i \in S} \big[ f(i) + \min_{a \in A} g(i, a) \big], \tag{11} $$

where the function $\mu : S \to A$ maps the state into an action, that is, $\mu(i) = a$, and $S$, $A$ are the state and action space respectively.

Assumption 4.1: The minimum expected cost $V(s_{k+1})$, Eq. (8), incurred at the decision epoch $k+1$ is bounded, that is, $V(s_{k+1}) > -\infty$, $\forall s_{k+1} \in S$.

Theorem 4.1: The accumulated cost $\bar{J}(s_k)$ incurred by the lookahead control policy $\bar{\pi} = \{\bar{\mu}_0, \ldots, \bar{\mu}_{M-1}\}$, namely,

$$ \bar{\pi}(s_k) = \arg\min_{\bar{\mu}_k(s_k) \in A(s_k)} \max_{s_{k+1} \in S} \Big[ R(s_{k+1} \mid s_k, a_k) + \min_{a_{k+1} \in A(s_{k+1})} E_{s_{k+2} \in S} \{ R(s_{k+2} \mid s_{k+1}, a_{k+1}) \} \Big], \tag{12} $$

is bounded by the accumulated cost $J(s_k)$ incurred by the minimax control policy $\pi = \{\mu_0, \ldots, \mu_{M-1}\}$, namely,

$$ \pi(s_k) = \arg\min_{\mu_k(s_k) \in A(s_k)} \max_{s_{k+1} \in S} \big[ R(s_{k+1} \mid s_k, a_k) + J_{k+1}(s_{k+1}) \big], \tag{13} $$

with probability 1.
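The lookahead rule of Eqs. (8) and (9) can be sketched in a few lines. This is a minimal sketch, assuming the transition probabilities and costs are already available as nested lists `P[i][a][j]` and `R[i][a][j]` with 0-based states, that every action is feasible in every state, and that the maximum in Eq. (9) runs over the whole state space; the two-state numbers are invented for illustration.

```python
# Hypothetical two-state, two-action chain (all numbers are made up).
P = [
    [[0.5, 0.5], [0.25, 0.75]],   # P[0][a][j]
    [[0.5, 0.5], [0.75, 0.25]],   # P[1][a][j]
]
R = [
    [[1.0, 4.0], [2.0, 2.0]],     # R[0][a][j]
    [[3.0, 1.0], [4.0, 0.0]],     # R[1][a][j]
]
ACTIONS = [0, 1]


def lookahead_value(P, R, s_next, actions):
    """Eq. (8): V(s') = min over a' of the expected one-stage cost from s'."""
    return min(
        sum(p * r for p, r in zip(P[s_next][a], R[s_next][a]))
        for a in actions
    )


def rollout_action(P, R, s, actions):
    """Eq. (9): argmin over a of max over s' of [R(s' | s, a) + V(s')]."""
    n = len(P)
    V = [lookahead_value(P, R, j, actions) for j in range(n)]

    def worst_case(a):
        # Worst successor state for action a, including the lookahead term.
        return max(R[s][a][j] + V[j] for j in range(n))

    best = min(actions, key=worst_case)
    return best, worst_case(best)
```

With these numbers V(0) = V(1) = 2.0, so `rollout_action(P, R, 0, ACTIONS)` selects action 1 with worst-case value 4.0, and `rollout_action(P, R, 1, ACTIONS)` selects action 0 with worst-case value 5.0. In an online setting the lists `P` and `R` would be replaced by whatever estimates the identification step currently provides.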

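Proof: Suppose that the chain starts at state $s_0 = i$, $i \in S$, at time $k = 0$ and ends up at $k = M$. We consider the problem of finding a policy $\pi = \{\mu_0, \ldots, \mu_{M-1}\}$ with $\mu_k(s_k) \in A$ for all $s_k \in S$ and $k$ that minimizes the cost function

$$ J^{\pi}(s_0) = \max_{s_1, \ldots, s_M \in S} \Big[ R_M(s_M) + \sum_{k=0}^{M-1} R(s_{k+1} \mid s_k, a_k) \Big]. \tag{14} $$

The DP algorithm for this problem takes the following form starting from the tail sub-problem

$$ J_{M-1}(s_{M-1}) = \min_{\mu_{M-1}(s_{M-1}) \in A(s_{M-1})} \max_{s_M \in S} \big[ R(s_M \mid s_{M-1}, a_{M-1}) + R_M(s_M) \big], \tag{15} $$

and

$$ J_k(s_k) = \min_{\mu_k(s_k) \in A(s_k)} \max_{s_{k+1} \in S} \big[ R(s_{k+1} \mid s_k, a_k) + J_{k+1}(s_{k+1}) \big], \tag{16} $$

where $R_M(s_M) = J_M(s_M)$ is the cost of the terminal decision epoch.

Following the steps of the DP algorithm proposed by Bertsekas [1], the optimal accumulated cost $J^*(s_0)$ starting from the last decision epoch and moving backwards is

$$ J^*(s_0) = \min_{\mu_0(s_0) \in A(s_0)} \cdots \min_{\mu_{M-1}(s_{M-1}) \in A(s_{M-1})} \max_{s_1 \in S} \cdots \max_{s_M \in S} \Big[ R_M(s_M) + \sum_{k=0}^{M-1} R(s_{k+1} \mid s_k, a_k) \Big]. \tag{17} $$

By applying Lemma 4.1, we can interchange the min over $\mu_{M-1}$ and the max over $s_1, \ldots, s_{M-1}$. The necessary condition in Lemma 4.1 is implied by Assumption 4.1. Equation (17) yields

$$ J^*(s_0) = \min_{\mu_0(s_0) \in A(s_0)} \cdots \min_{\mu_{M-2}(s_{M-2}) \in A(s_{M-2})} \max_{s_1 \in S} \cdots \max_{s_{M-1} \in S} \Big[ \sum_{k=0}^{M-2} R(s_{k+1} \mid s_k, a_k) + \min_{\mu_{M-1}(s_{M-1}) \in A(s_{M-1})} \max_{s_M \in S} \big[ R(s_M \mid s_{M-1}, a_{M-1}) + R_M(s_M) \big] \Big] \tag{18} $$
$$ = \min_{\mu_0(s_0) \in A(s_0)} \cdots \min_{\mu_{M-2}(s_{M-2}) \in A(s_{M-2})} \max_{s_1 \in S} \cdots \max_{s_{M-1} \in S} \Big[ \sum_{k=0}^{M-2} R(s_{k+1} \mid s_k, a_k) + J_{M-1}(s_{M-1}) \Big]. \tag{19} $$

By continuing backwards in a similar way we obtain

$$ J^*(s_0) = J_0(s_0). \tag{20} $$

Consequently, an optimal policy for the minimax problem can be constructed by minimizing the right hand side of Eq. (14).

Performing the same task as we did with the DP algorithm by starting from the last epoch of the decision-making process and moving backwards, the accumulated cost, $\bar{J}(s_k)$, incurred by the control policy $\bar{\pi} = \{\bar{\mu}_0, \ldots, \bar{\mu}_{M-1}\}$ is

$$ \bar{J}_M(s_M) = \min_{\bar{\mu}_M(s_M) \in A(s_M)} \max_{s_{M+1} \in S} \Big[ R_M(s_M) + \min_{a_{M+1} \in A(s_{M+1})} E_{s_{M+2} \in S} \{ R(s_{M+2} \mid s_{M+1}, a_{M+1}) \} \Big] = R_M(s_M) = J_M(s_M), \tag{21} $$

since $R(s_{M+2} \mid s_{M+1}, a_{M+1}) = 0$, as the terminal epoch is at $k = M$. Similarly,

$$ \bar{J}_{M-1}(s_{M-1}) = \min_{\bar{\mu}_{M-1}(s_{M-1}) \in A(s_{M-1})} \max_{s_M \in S} \Big[ R(s_M \mid s_{M-1}, a_{M-1}) + \min_{a_M \in A(s_M)} E_{s_{M+1} \in S} \{ R(s_{M+1} \mid s_M, a_M) \} \Big] + \bar{J}_M(s_M) \tag{22} $$
$$ = \min_{\bar{\mu}_{M-1}(s_{M-1}) \in A(s_{M-1})} \max_{s_M \in S} \big[ R(s_M \mid s_{M-1}, a_{M-1}) \big] + \bar{J}_M(s_M), \tag{23} $$

where $R(s_{M+1} \mid s_M, a_M) = 0$ since the terminal epoch is at $k = M$. Consequently,

$$ \bar{J}_{M-1}(s_{M-1}) = \min_{\bar{\mu}_{M-1}(s_{M-1}) \in A(s_{M-1})} \max_{s_M \in S} \big[ R(s_M \mid s_{M-1}, a_{M-1}) + \bar{J}_M(s_M) \big] = J_{M-1}(s_{M-1}), \tag{24} $$

since $\bar{J}_M(s_M)$ is a constant quantity. At the next epoch backwards,

$$ \bar{J}_{M-2}(s_{M-2}) = \min_{\bar{\mu}_{M-2}(s_{M-2}) \in A(s_{M-2})} \max_{s_{M-1} \in S} \Big[ R(s_{M-1} \mid s_{M-2}, a_{M-2}) + \min_{a_{M-1} \in A(s_{M-1})} E_{s_M \in S} \{ R(s_M \mid s_{M-1}, a_{M-1}) \} \Big] + \bar{J}_{M-1}(s_{M-1}). \tag{25} $$

However,

$$ \min_{\bar{\mu}_{M-2}(s_{M-2}) \in A(s_{M-2})} \max_{s_{M-1} \in S} \Big[ R(s_{M-1} \mid s_{M-2}, a_{M-2}) + \min_{a_{M-1} \in A(s_{M-1})} E_{s_M \in S} \{ R(s_M \mid s_{M-1}, a_{M-1}) \} \Big] \le \min_{\mu_{M-2}(s_{M-2}) \in A(s_{M-2})} \max_{s_{M-1} \in S} \big[ R(s_{M-1} \mid s_{M-2}, a_{M-2}) \big], \tag{26} $$

since the LHS of the inequality will minimize a cost which is not only maximum over the cost incurred when the chain transits from $s_{M-2}$ to $s_{M-1}$ but also minimum over the cost incurred when the chain transits from $s_{M-1}$ to $s_M$. So, the LHS can be at most equal to the cost which is maximum over the transition from $s_{M-2}$ to $s_{M-1}$. Consequently, comparing the accumulated cost of the control policy $\bar{\pi}$ in Eq. (25) with the one resulting from the DP at the same decision epoch, namely,

$$ J_{M-2}(s_{M-2}) = \min_{\mu_{M-2}(s_{M-2}) \in A(s_{M-2})} \max_{s_{M-1} \in S} \big[ R(s_{M-1} \mid s_{M-2}, a_{M-2}) + J_{M-1}(s_{M-1}) \big], \tag{27} $$

we conclude that

$$ \bar{J}_{M-2}(s_{M-2}) \le J_{M-2}(s_{M-2}). \tag{28} $$

By continuing backward with similar arguments, we have

$$ \bar{J}_0(s_0) \le J_0(s_0) = J^*(s_0). \tag{29} $$

Consequently, the accumulated cost resulting from the control policy $\bar{\pi} = \{\bar{\mu}_0, \ldots, \bar{\mu}_{M-1}\}$ is bounded by the accumulated cost of the optimal minimax control policy with probability 1.

5. CONCLUDING REMARKS

We presented the theoretical framework and a rollout control algorithm toward making autonomous intelligent systems that can learn their optimal operation in real time. The evolution of the system was modeled as a controlled Markov chain, and the task of deriving a control policy was formulated as a sequential decision-making problem under uncertainty. The algorithm comprises the decision-making mechanism that solves the stochastic control problem by utilizing accumulated data acquired as the system interacts with its environment. The solution of the algorithm has a theoretical performance bound that is superior to that of the solution provided by the one-step minimax control algorithm (Theorem 4.1).

The research presented here considered the approximate solution of a discrete optimization problem using procedures capable of magnifying the effectiveness of any given heuristic algorithm through sequential application. In particular, the problem was embedded within a dynamic programming framework, and a two-step rollout algorithm was introduced related to notions of policy iteration.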
Future research should explore the impact of the number of time steps that the algorithm can look forward in time on its performance bound.

REFERENCES

[1] Bertsekas, D. P., Dynamic Programming and Optimal Control (Volumes 1 and 2), Athena Scientific, September 2001.
[2] Bertsekas, D. P. and Shreve, S. E., Stochastic Optimal Control: The Discrete-Time Case, 1st edition, Athena Scientific, February 2007.
[3] Bellman, R., Dynamic Programming. Princeton, NJ, Princeton University Press, 1957.
[4] Bertsekas, D. P. and Tsitsiklis, J. N., Neuro-Dynamic Programming (Optimization and Neural Computation Series, 3), 1st edition, Athena Scientific, May 1996.
[5] Bertsekas, D. P., Tsitsiklis, J. N., and Wu, C., "Rollout Algorithms for Combinatorial Optimization," Heuristics, vol. 3, 1997.
[6] Bertsekas, D. P. and Castanon, D. A., "Rollout Algorithms for Stochastic Scheduling Problems," Heuristics, vol. 5, pp. 89-108, 1999.
[7] Secomandi, N., "Comparing Neuro-Dynamic Programming Algorithms for the Vehicle Routing Problem with Stochastic Demands," Computers and Operations Research, vol. 27, 2000.
[8] McGovern, A., Moss, E., and Barto, A., "Building a Basic Building Block Scheduler Using Reinforcement Learning and Rollouts," Machine Learning, vol. 49, pp. 141-160, 2002.
[9] Bertsimas, D. and Popescu, I., "Revenue Management in a Dynamic Network Environment," Transportation Science, vol. 37, 2003.
[10] Tu, F. and Pattipati, K. R., "Rollout Strategies for Sequential Fault Diagnosis," IEEE Trans on Systems, Man and Cybernetics, Part A, 2003.
[11] Abramson, B., "Expected-Outcome: A General Model of Static Evaluation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, 1990.
[12] Tesauro, G. and Galperin, G., "On-line Policy Improvement Using Monte-Carlo Search," Advances in Neural Information Processing, vol. 9, 1996.
[13] Morari, M. and Lee, J. H., "Model Predictive Control: Past, Present, and Future," Computers and Chemical Engineering, vol. 23, 1999.
[14] Mayne, D. Q., Rawlings, J. B., Rao, C. V., and Scokaert, P. O. M., "Constrained Model Predictive Control: Stability and Optimality," Automatica, vol. 36, 2000.
[15] Rawlings, J. B., "Tutorial Overview of Model Predictive Control," Control Systems Magazine, vol. 20, 2000.
[16] Findeisen, R., Imsland, L., Allgower, F., and Foss, B. A., "State and Output Feedback Nonlinear Model Predictive Control: An Overview," European Journal of Control, vol. 9, 2003.
[17] Qin, S. J. and Badgwell, T. A., "A Survey of Industrial Model Predictive Control Technology," Control Engineering Practice, vol. 11, 2003.
[18] Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning), The MIT Press, March 1998.
[19] Gosavi, A., Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, 1st edition, Springer, June 30, 2003.
[20] Borkar, V. S., "A Learning Algorithm for Discrete-Time Stochastic Control," Probability in the Engineering and Information Sciences, vol. 14, 2000.
[21] Kaelbling, L. P., Littman, M. L., and Moore, A. W., "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, vol. 4, 1996.
[22] Malikopoulos, A. A., Real-Time, Self-Learning Identification and Stochastic Optimal Control of Advanced Powertrain Systems, Ph.D. Dissertation, Department of Mechanical Engineering, University of Michigan, Ann Arbor, USA, 2008.
[23] Malikopoulos, A. A., Papalambros, P. Y., and Assanis, D. N., "A Real-Time Computational Learning Model for Sequential Decision-Making Problems Under Uncertainty," ASME J. Dyn. Sys., Meas., Control, vol. 131, 4, pp. 041010-(8), 2009.
[24] Malikopoulos, A. A., Assanis, D. N., and Papalambros, P. Y., "Real-Time Self-Learning Optimization of Diesel Engine Calibration," ASME J. Eng. Gas Turbines Power, vol. 131, 2, pp. 022803(7), 2009.
[25] Kushner, H. J., Introduction to Stochastic Control, Holt, Rinehart and Winston, 1971.
[26] Kumar, P. R. and Varaiya, P., Stochastic Systems, Prentice Hall, June 1986.
[27] Malikopoulos, A. A., "Convergence Properties of a Computational Learning Model for Unknown Markov Chains," ASME J. Dyn. Sys., Meas., Control, vol. 131, No. 4, pp. 041011(7), 2009.
