Reinforcement Learning: A Tutorial

Mance E. Harmon
WL/AACF
2241 Avionics Circle
Wright Laboratory
Wright-Patterson AFB, OH
mharmon@acm.org

Stephanie S. Harmon
Wright State University
56-8 Mallard Glen Drive
Centerville, OH

Scope of Tutorial

The purpose of this tutorial is to provide an introduction to reinforcement learning (RL) at a level easily understood by students and researchers in a wide range of disciplines. The intent is not to present a rigorous mathematical discussion that requires a great deal of effort on the part of the reader, but rather to present a conceptual framework that might serve as an introduction to a more rigorous study of RL. The fundamental principles and techniques used to solve RL problems are presented, as are the most popular RL algorithms. Section 1 presents an overview of RL and provides a simple example to develop intuition of the underlying dynamic programming mechanism. In Section 2 the parts of a reinforcement learning problem are discussed. These include the environment, reinforcement function, and value function. Section 3 gives a description of the most widely used reinforcement learning algorithms. These include TD(λ) and both the residual and direct forms of value iteration, Q-learning, and advantage learning. In Section 4 some of the ancillary issues in RL are briefly discussed, such as choosing an exploration strategy and an appropriate discount factor. The conclusion is given in Section 5. Finally, Section 6 is a glossary of commonly used terms, followed by references in Section 7 and a bibliography of RL applications in Section 8. The tutorial structure is such that each section builds on the information provided in previous sections. It is assumed that the reader has some knowledge of learning algorithms that rely on gradient descent (such as the backpropagation of errors algorithm).

1 Introduction

There are many unsolved problems that computers could solve if the appropriate software existed. Flight control systems for aircraft, automated manufacturing systems, and sophisticated avionics systems all present difficult, nonlinear control problems. Many of these problems are currently unsolvable, not because current computers are too slow or have too little memory, but simply because it is too difficult to determine what the program should do. If a computer could learn to solve the problems through trial and error, that would be of great practical value.

Reinforcement learning is an approach to machine intelligence that combines two disciplines to successfully solve problems that neither discipline can address individually. Dynamic programming is a field of mathematics that has traditionally been used to solve problems of optimization and control. However, traditional dynamic programming is limited in the size and complexity of the problems it can address. Supervised learning is a general method for training a parameterized function approximator, such as a neural network, to represent functions. However, supervised learning requires sample input-output pairs from the function to be learned. In other words, supervised learning requires a set of questions with the right answers. For example, we might not know the best way to program a computer to recognize an infrared picture of a tank, but we do have a large collection of infrared pictures, and we do know whether each picture contains a tank or not.

Supervised learning could look at all the examples with answers, and learn how to recognize tanks in general. Unfortunately, there are many situations where we don't know the correct answers that supervised learning requires. For example, in a flight control system, the question would be the set of all sensor readings at a given time, and the answer would be how the flight control surfaces should move during the next millisecond. Simple neural networks can't learn to fly the plane unless there is a set of known answers, so if we don't know how to build a controller in the first place, simple supervised learning won't help.

For these reasons there has been much interest recently in a different approach known as reinforcement learning (RL). Reinforcement learning is not a type of neural network, nor is it an alternative to neural networks. Rather, it is an orthogonal approach that addresses a different, more difficult question. Reinforcement learning combines the fields of dynamic programming and supervised learning to yield powerful machine-learning systems. Reinforcement learning appeals to many researchers because of its generality. In RL, the computer is simply given a goal to achieve. The computer then learns how to achieve that goal by trial-and-error interactions with its environment. Thus, many researchers are pursuing this form of machine intelligence and are excited about the possibility of solving problems that have been previously unsolvable.

To provide the intuition behind reinforcement learning, consider the problem of learning to ride a bicycle. The goal given to the RL system is simply to ride the bicycle without falling over. In the first trial, the RL system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right. At this point there are two actions possible: turn the handlebars left or turn them right. The RL system turns the handlebars to the left and immediately crashes to the ground, thus receiving a negative reinforcement. The RL system has just learned not to turn the handlebars left when tilted 45 degrees to the right. In the next trial the RL system performs a series of actions that again result in the bicycle being tilted 45 degrees to the right. The RL system knows not to turn the handlebars to the left, so it performs the only other possible action: turn right. It immediately crashes to the ground, again receiving a strong negative reinforcement. At this point the RL system has not only learned that turning the handlebars right or left when tilted 45 degrees to the right is bad, but that the "state" of being tilted 45 degrees to the right is bad. Again, the RL system begins another trial and performs a series of actions that result in the bicycle being tilted 40 degrees to the right. Two actions are possible: turn right or turn left. The RL system turns the handlebars left, which results in the bicycle being tilted 45 degrees to the right, and ultimately results in a strong negative reinforcement. The RL system has just learned not to turn the handlebars to the left when tilted 40 degrees to the right. By performing enough of these trial-and-error interactions with the environment, the RL system will ultimately learn how to prevent the bicycle from ever falling over.

2 The Parts Of A Reinforcement Learning Problem

In the standard reinforcement learning model an agent interacts with its environment. This interaction takes the form of the agent sensing the environment, and based on this sensory input choosing an action to perform in the environment. The action changes the environment in some manner and this change is communicated to the agent through a scalar reinforcement signal.
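The interaction just described can be captured in a few lines of code. Below is a minimal Python sketch of the sense-act-learn loop; the Environment and Agent interfaces are assumptions introduced here for illustration and are not part of the tutorial itself.

    # Minimal sketch of the agent-environment interaction loop (assumed interfaces).
    class Environment:
        def observe(self):
            """Return the current state (e.g., a vector of sensor readings)."""
            raise NotImplementedError

        def step(self, action):
            """Apply an action; return (next_state, reinforcement, done)."""
            raise NotImplementedError

    class Agent:
        def choose_action(self, state):
            """Select an action to perform in the given state."""
            raise NotImplementedError

        def learn(self, state, action, reinforcement, next_state):
            """Update the agent from one observed transition."""
            raise NotImplementedError

    def run_trial(agent, env, max_steps=1000):
        """One trial of trial-and-error interaction with the environment."""
        state = env.observe()
        for _ in range(max_steps):
            action = agent.choose_action(state)
            next_state, reinforcement, done = env.step(action)
            agent.learn(state, action, reinforcement, next_state)
            state = next_state
            if done:
                break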
There are three fundamental parts of a reinforcement learning problem: the environment, the reinforcement function, and the value function.

The Environment

Every RL system learns a mapping from situations to actions by trial-and-error interactions with a dynamic environment. This environment must at least be partially observable by the reinforcement learning system, and the observations may come in the form of sensor readings, symbolic descriptions, or possibly mental situations (e.g., the situation of being lost). The actions may be low level (e.g., voltage to motors), high level (e.g., accept job offer), or even mental (e.g., shift in focus of attention). If the RL system can observe perfectly all the information in the environment that might influence the choice of action to perform, then the RL system chooses actions based on true states of the environment. This ideal case is the best possible basis for reinforcement learning and, in fact, is a necessary condition for much of the associated theory.

The Reinforcement Function

As stated previously, RL systems learn a mapping from situations to actions by trial-and-error interactions with a dynamic environment. The goal of the RL system is defined using the concept of a reinforcement function, which is the exact function of future reinforcements the agent seeks to maximize. In other words, there exists a mapping from state/action pairs to reinforcements; after performing an action in a given state the RL agent will receive some reinforcement (reward) in the form of a scalar value. The RL agent learns to perform actions that will maximize the sum of the reinforcements received when starting from some initial state and proceeding to a terminal state. It is the job of the RL system designer to define a reinforcement function that properly defines the goals of the RL agent. Although complex reinforcement functions can be defined, there are at least three noteworthy classes often used to construct reinforcement functions that properly define the desired goals.

Pure Delayed Reward and Avoidance Problems

In the Pure Delayed Reward class of functions the reinforcements are all zero except at the terminal state. The sign of the scalar reinforcement at the terminal state indicates whether the terminal state is a goal state (a reward) or a state that should be avoided (a penalty). For example, if one wanted an RL agent to learn to play the game of backgammon, the system could be defined as follows. The situation (state) would be the configuration of the playing board (the location of each player's pieces). In this case there are approximately 10^20 different possible states. The actions available to the agent are the set of legal moves. The reinforcement function is defined to be zero after every turn except when an action results in a win or a loss, in which case the agent receives a +1 reinforcement for a win, and a -1 reinforcement for a loss. Because the agent is trying to maximize the reinforcement, it will learn that the states corresponding to a win are goal states and states resulting in a loss are to be avoided.

Another example of a pure delayed reward reinforcement function can be found in the standard cart-pole or inverted pendulum problem. A cart supporting a hinged, inverted pendulum is placed on a finite track. The goal of the RL agent is to learn to balance the pendulum in an upright position without hitting the end of the track. The situation (state) is the dynamic state of the cart-pole system. Two actions are available to the agent in each state: move the cart left, or move the cart right. The reinforcement function is zero everywhere except for the states in which the pole falls or the cart hits the end of the track, in which case the agent receives a -1 reinforcement. Again, because the agent is trying to maximize total reinforcement, the agent will learn the sequence of actions necessary to balance the pole and avoid the -1 reinforcement.
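A pure delayed reward function such as the cart-pole one can be written down directly. The following Python sketch is illustrative only; the state variables (pole angle, cart position) and the failure thresholds are assumptions, not values given in the tutorial.

    # Pure delayed reward for the cart-pole example: 0 everywhere except failure.
    # The thresholds below are illustrative assumptions.
    POLE_LIMIT_RADIANS = 0.21    # pole considered fallen beyond roughly 12 degrees
    TRACK_LIMIT_METERS = 2.4     # distance at which the cart hits the end of the track

    def reinforcement(pole_angle, cart_position):
        """Return -1 when the pole falls or the cart hits the end of the track,
        and 0 on every other transition."""
        failed = (abs(pole_angle) > POLE_LIMIT_RADIANS
                  or abs(cart_position) > TRACK_LIMIT_METERS)
        return -1.0 if failed else 0.0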

Minimum Time to Goal

Reinforcement functions in this class cause an agent to perform actions that generate the shortest path or trajectory to a goal state. An example is an experiment commonly known as the "car on the hill" problem. The problem is defined as that of a stationary car being positioned between two steep inclines. The goal of the driver (RL agent) is to successfully drive up the incline on the right to reach a goal state at the top of the hill. The state of the environment is the car's position and velocity. Three actions are available to the agent in each state: forward thrust, backward thrust, or no thrust at all. The dynamics of the system are such that the car does not have enough thrust to simply drive up the hill. Rather, the driver must learn to use momentum to its advantage to gain enough velocity to successfully climb the hill. The reinforcement function is -1 for ALL state transitions except the transition to the goal state, in which case a zero reinforcement is returned. Because the agent wishes to maximize reinforcement, it learns to choose actions that minimize the time it takes to reach the goal state, and in so doing learns the optimal strategy for driving the car up the hill.

Games

Thus far it has been assumed that the learning agent always attempts to maximize the reinforcement function. This need not be the case. The learning agent could just as easily learn to minimize the reinforcement function. This might be the case when the reinforcement is a function of limited resources and the agent must learn to conserve these resources while achieving a goal (e.g., an airplane executing a maneuver while conserving as much fuel as possible). An alternative reinforcement function would be used in the context of a game environment, when there are two or more players with opposing goals. In a game scenario, the RL system can learn to generate optimal behavior for the players involved by finding the maximin, minimax, or saddlepoint of the reinforcement function. For example, a missile might be given the goal of minimizing the distance to a given target (in this case an airplane). The airplane would be given the opposing goal of maximizing the distance to the missile. The agent would evaluate the state for each player and would choose an action independent of the other player's action. These actions would then be executed in parallel. Because the actions are chosen independently and executed simultaneously, the RL agent learns to choose actions for each player that would generate the best outcome for the given player in a worst-case scenario. The agent will perform actions for the missile that will minimize the maximum distance to the airplane, assuming the airplane will choose the action that maximizes the same distance. The agent will perform actions for the airplane that will maximize the minimum distance to the missile, assuming the missile will perform the action that will minimize the same distance. A more detailed discussion of this alternative can be found in Harmon, Baird, and Klopf (1994), and Littman (1994).

The Value Function

In previous sections the environment and the reinforcement function were discussed. However, the issue of how the agent learns to choose good actions, or even how we might measure the utility of an action, has not been explained. First, two terms are defined. A policy determines which action should be performed in each state; a policy is a mapping from states to actions. The value of a state is defined as the sum of the reinforcements received when starting in that state and following some fixed policy to a terminal state. The optimal policy would therefore be the mapping from states to actions that maximizes the sum of the reinforcements when starting in an arbitrary state and performing actions until a terminal state is reached. Under this definition the value of a state is dependent upon the policy. The value function is a mapping from states to state values and can be approximated using any type of function approximator (e.g., multilayered perceptron, memory-based system, radial basis functions, look-up table, etc.).

An example of a value function can be seen using a simple Markov decision process with 16 states. The state space can be visualized using a 4x4 grid. Each square represents a state.
The reinforcement function is -1 everywhere (i.e., the agent receives a reinforcement of -1 on each transition). There are 4 actions possible in each state: north, south, east, west. The goal states are the upper left corner and the lower right corner. The value function for the random policy is shown in Figure 1. For each state the random policy randomly chooses one of the four possible actions. The numbers in the states represent the expected values of the states. For example, when starting in the lower left corner and following a random policy, on average there will be 22 transitions to other states before the terminal state is reached.

[Figure 1: state values of the 4x4 grid under the random policy.]
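The random-policy values in Figure 1 can be reproduced by repeatedly applying the expected update for the equiprobable random policy. The Python sketch below is one possible implementation under the grid conventions described above; the boundary behavior (moves off the grid leave the state unchanged) is an assumption about a detail the text does not state.

    # Iterative evaluation of the equiprobable random policy on the 4x4 grid.
    import numpy as np

    N = 4
    GOALS = {(0, 0), (N - 1, N - 1)}                 # upper-left and lower-right corners
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # north, south, west, east

    def step(state, action):
        """Deterministic grid transition; moves off the grid leave the state unchanged."""
        r, c = state
        nr, nc = r + action[0], c + action[1]
        if not (0 <= nr < N and 0 <= nc < N):
            nr, nc = r, c
        return (nr, nc)

    V = np.zeros((N, N))
    while True:
        delta = 0.0
        for r in range(N):
            for c in range(N):
                if (r, c) in GOALS:
                    continue
                # expected value under the random policy: -1 plus the successor value
                new_v = np.mean([-1.0 + V[step((r, c), a)] for a in ACTIONS])
                delta = max(delta, abs(new_v - V[r, c]))
                V[r, c] = new_v
        if delta < 1e-6:
            break

    print(np.round(V))   # the lower-left corner converges to about -22, as in Figure 1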

The optimal value function is shown in Figure 2. Again, starting in the lower left corner, calculating the sum of the reinforcements when performing the optimal policy (the policy that will maximize the sum of the reinforcements), the value of that state is -3 because it takes only 3 transitions to reach a terminal state. If we are given the optimal value function, then it becomes a trivial task to extract the optimal policy. For example, one can start in any state in Figure 2 and simply choose the action that maximizes the sum of the immediate reinforcement received and the value of the successor state. In other words, one can perform a one-level-deep breadth-first search over actions to find the action that will maximize the immediate reward. The optimal policy for the value function shown in Figure 2 is given in Figure 3. This leads us to the fundamental question of almost all of reinforcement learning research: how do we devise an algorithm that will efficiently find the optimal value function?

[Figure 2: the optimal value function for the 4x4 grid. Figure 3: the corresponding optimal policy.]
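The one-level-deep search described above can be sketched as follows; this reuses N, GOALS, ACTIONS, step(), and the value table V from the previous sketch, which is purely an assumption made for illustration.

    # Extract a policy from a value function by one-step lookahead over actions.
    def greedy_policy(V):
        policy = {}
        for r in range(N):
            for c in range(N):
                if (r, c) in GOALS:
                    continue
                # one-level-deep breadth-first search over the four actions
                policy[(r, c)] = max(ACTIONS,
                                     key=lambda a: -1.0 + V[step((r, c), a)])
        return policy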

3 Approximating the Value Function

Reinforcement learning is a difficult problem because the learning system may perform an action and not be told whether that action was good or bad. For example, a learning auto-pilot program might be given control of a simulator and told not to crash. It will have to make many decisions each second and then, after acting on thousands of decisions, the aircraft might crash. What should the system learn from this experience? Which of its many actions were responsible for the crash? Assigning blame to individual actions is the problem that makes reinforcement learning difficult. Surprisingly, there is a solution to this problem. It is based on a field of mathematics called dynamic programming, and it involves just two basic principles. First, if an action causes something bad to happen immediately, such as crashing the plane, then the system learns not to do that action in that situation again. So whatever action the system performed one millisecond before the crash, it will avoid doing in the future. But that principle doesn't help for all the earlier actions which didn't lead to immediate disaster. The second principle is that if all the actions in a certain situation lead to bad results, then that situation should be avoided. So if the system has experienced a certain combination of altitude and airspeed many different times, trying a different action each time, and all actions led to something bad, then it will learn that the situation itself is bad. This is a powerful principle, because the learning system can now learn without crashing. In the future, any time it chooses an action that leads to this particular situation, it will immediately learn that the particular action is bad, without having to wait for the crash. By using these two principles, a learning system can learn to fly a plane, control a robot, or do any number of tasks. It can first learn on a simulator, then fine-tune on the actual system. This technique is generally referred to as dynamic programming, and a slightly closer analysis will reveal how dynamic programming can generate the optimal value function.

The Essence of Dynamic Programming

Initially, the approximation of the optimal value function is poor. In other words, the mapping from states to state values is not valid. The primary objective of learning is to find the correct mapping. Once this is completed, the optimal policy can easily be extracted. At this point some notation needs to be introduced: V*(x_t) is the optimal value function, where x_t is the state vector at time t; V(x_t) is the approximation of the value function; γ is a discount factor in the range [0, 1] that causes immediate reinforcement to have more importance (to be weighted more heavily) than future reinforcement. (A more complete discussion of γ is presented in Section 4.) In general, V(x_t) will be initialized to random values and will contain no information about the optimal value function V*(x_t). This means that the approximation of the optimal value function in a given state is equal to the true value of that state V*(x_t) plus some error in the approximation, as expressed in equation (1),

    V(x_t) = e(x_t) + V*(x_t)    (1)

where e(x_t) is the error in the approximation of the value of the state occupied at time t. Likewise, the approximation of the value of the state reached after performing some action at time t is the true value of the state occupied at time t+1 plus some error in the approximation, as expressed in equation (2).

    V(x_{t+1}) = e(x_{t+1}) + V*(x_{t+1})    (2)

As stated previously, the value of state x_t for the optimal policy is the sum of the reinforcements when starting from state x_t and performing optimal actions until a terminal state is reached. By this definition, a simple relationship exists between the values of successive states, x_t and x_{t+1}.
This relationship is defined by the Bellman equation and is expressed in equation (3). The discount factor γ is used to exponentially decrease the weight of reinforcements received in the future. (A discussion of the function of γ in this equation can be found in Section 4.)

    V*(x_t) = r(x_t) + γ V*(x_{t+1})    (3)

The approximation V(x_t) also has the same relationship, as shown in equation (4). By substituting the right-hand side of equations (1) and (2) into equation (4) we get equation (5), and expanding yields equation (6).

    V(x_t) = r(x_t) + γ V(x_{t+1})    (4)

    e(x_t) + V*(x_t) = r(x_t) + γ ( e(x_{t+1}) + V*(x_{t+1}) )    (5)

    e(x_t) + V*(x_t) = r(x_t) + γ e(x_{t+1}) + γ V*(x_{t+1})    (6)

Using equation (3), V*(x_t) is subtracted from both sides of equation (6) to reveal the relationship in the errors of successive states. This relationship is expressed in equation (7).

    e(x_t) = γ e(x_{t+1})    (7)

The significance of this relationship can be seen by using the simple Markov chain shown in Figure 4. In Figure 4 the state labeled T is the terminal state. The true value of this state is known a priori. In other words, the error in the approximation of the state labeled T, e(T), is 0 by definition. An analogy might be the state in which a missile ultimately hits or misses its target. The true value of this state is known: +1 for a hit, -1 for a miss. The process of learning is the process of finding an approximation V(x_t) that makes equations (3) and (7) true for all states x. If the approximation error in state 3 is a factor of γ smaller than the error in state T, which is by definition 0, then the approximation error in state 3 must also be 0. If equation (7) is true for all x, then the approximation error in each state x is necessarily 0, ergo V(x_t) = V*(x_t) for all x.

[Figure 4: a Markov chain of states 1, 2, 3 leading to a terminal state T, with a reinforcement r on each transition.]

The importance of the discount factor can be seen by using another simple Markov chain that has no terminal state (Figure 5). Using a similar argument as that used for Figure 4, one can see that when equation (7) is satisfied the approximation error must be 0 for all states. The only difference in the prior example and in this case is that the error in any given state must be a factor of γ^6 smaller than itself (because there are 6 states in the cycle). Therefore, equation (7) can only be satisfied if the approximation error is 0 in every state. Therefore, as stated earlier, the process of learning is the process of finding a solution to equation (4) for all states x (which is also a solution to equation (7)). Several learning algorithms have been developed for precisely this task.

[Figure 5: a cycle of six states with no terminal state; around the cycle, e(1) = γe(2), e(2) = γe(3), ..., e(6) = γe(1).]

Value Iteration

If it is assumed that the function approximator used to represent V* is a lookup table (each state has a corresponding element in the table whose entry is the approximated state value), then one can find the optimal value function by performing sweeps through state space, updating the value of each state according to equation (8) until a sweep through state space is performed in which there are no changes to state values (the state values have converged).

    Δw = max_u [ r(x_t, u) + γ V(x_{t+1}) ] − V(x_t)    (8)

In equation (8), u is the action performed in state x_t that causes a transition to state x_{t+1}, and r(x_t, u) is the reinforcement received when performing action u in state x_t. Figure 6 illustrates the update.

[Figure 6: the scope of a single value iteration update: state x_t, its two candidate actions, and their successor states.]

Figure 6 depicts the scope of a single update to the approximation of the value of x_t. Specific to this example, there are two actions possible in state x_t, and each of these actions leads to a different successor state x_{t+1}. In a value iteration update, one must first find the action that returns the maximum value. The only way to accomplish this is to actually perform an action and calculate the sum of the reinforcement received and the (possibly discounted) approximated value of the successor state V(x_{t+1}). This must be done for all actions u in a given state x_t, and is not possible without a model of the dynamics of the system. For example, in the case of a robot deciding to choose between paths to follow, it is not possible to choose one path, observe the successor state, and then return to the starting state to explore the results of the next available action. Instead, the robot must in simulation perform these actions and observe the results. Then, based on the simulation results, the robot may choose the action that results in the maximum value.

One should note that the right side of equation (8) is simply the difference in the two sides of the Bellman equation defined in equation (4), with the exception that we have generalized the equation to allow for Markov decision processes (multiple actions possible in a given state) rather than Markov chains (a single action possible in every state). This expression is the Bellman residual, and is formally defined by equation (9).

    e(x_t) = max_u [ r(x_t, u) + γ V(x_{t+1}) ] − V(x_t)    (9)

E(x_t) is the error function defined by the Bellman residual over all of state space. Each update (equation (8)) reduces the value of E(x_t), and in the limit as the number of updates goes to infinity, E(x_t) = 0. When E(x_t) = 0, equation (4) is satisfied and V(x_t) = V*(x_t). Learning is accomplished.
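For a finite MDP represented by a lookup table, the sweep of equation (8) can be sketched as below. The model interface (states(), actions(x), successor(x, u), reinforcement(x, u), is_terminal(x)) is an assumption introduced here; as the text notes, such a model (or a simulator) is required in order to evaluate every action in a state.

    # Tabular value iteration: sweep the state space with the update of equation (8).
    GAMMA = 0.9   # discount factor (assumed value)

    def value_iteration(model, tolerance=1e-6):
        V = {x: 0.0 for x in model.states()}
        while True:
            max_change = 0.0
            for x in model.states():
                if model.is_terminal(x):
                    continue
                # max over actions of the reinforcement plus discounted successor value
                best = max(model.reinforcement(x, u) + GAMMA * V[model.successor(x, u)]
                           for u in model.actions(x))
                max_change = max(max_change, abs(best - V[x]))
                V[x] = best
            if max_change < tolerance:   # no (significant) change in any state value
                return V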

Residual Gradient Algorithms

Thus far it has been assumed that our function approximator is a lookup table. This is normally the case in classical dynamic programming. However, this assumption severely limits the size and complexity of the problems solvable. Many real-world problems have extremely large or even continuous state spaces. In practice it is not possible to represent the value function for such problems using a lookup table. Hence, an extension to classical value iteration is to use a function approximator that can generalize and interpolate values of states never before seen. For example, one might use a neural network for the approximation V(x_t, w_t) of V*(x_t), where w_t is the parameter vector. The resulting network parameter update is given in equation (10).

    Δw = α [ max_u ( r(x_t, u) + γ V(x_{t+1}, w_t) ) − V(x_t, w_t) ] ∂V(x_t, w_t)/∂w    (10)

It is useful to draw an analogy to the update equation used in supervised learning algorithms when first examining equation (10). In this context, α is the learning rate, max_u ( r(x_t, u) + γ V(x_{t+1}, w_t) ) is the desired output of the network, V(x_t, w_t) is the actual output of the network, and ∂V(x_t, w_t)/∂w is the gradient of the output of the network with respect to the parameter vector. It appears that we are performing updates that will minimize the Bellman residual, but this is not necessarily the case. The target value max_u ( r(x_t, u) + γ V(x_{t+1}, w_t) ) is a function of the parameter vector w at time t. Once the update to w is performed, the target has changed because it is now a function of a different parameter vector (the vector at time t+1). It is possible that the Bellman residual has actually been increased rather than decreased. The error function on which gradient descent is being performed changes with every update to the parameter vector. This can result in the values of the network parameter vector oscillating or even growing to infinity. One solution to this problem is to perform gradient descent on the mean squared Bellman residual. Because this defines an unchanging error function, convergence to a local minimum is guaranteed. This means that we can get the benefit of the generality of neural networks while still guaranteeing convergence. The resulting parameter update is given in equation (11).

    Δw = −α [ r(x_t) + γ V(x_{t+1}, w_t) − V(x_t, w_t) ] [ γ ∂V(x_{t+1}, w_t)/∂w − ∂V(x_t, w_t)/∂w ]    (11)

The resulting method is referred to as a residual gradient algorithm because gradient descent is performed on the mean squared Bellman residual. Therefore, equation (11) is the update equation for residual value iteration, and equation (10) is the update equation for direct value iteration. It is important to note that if the MDP is non-deterministic then it becomes necessary to generate independent successor states to guarantee convergence to the correct answer. For a more detailed discussion see Baird (1995); Harmon, Baird, and Klopf (1995); and Harmon and Baird (1996).
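The contrast between the direct update (equation (10)) and the residual gradient update (equation (11)) is easiest to see for a linear approximator V(x, w) = w·φ(x), for which the gradient with respect to w is simply φ(x). The sketch below is an illustration under assumptions: the feature function φ is a placeholder, and the successor of the maximizing action is assumed to have been selected already.

    # Direct (equation (10)) versus residual gradient (equation (11)) updates
    # for a linear approximator V(x, w) = w . phi(x).
    import numpy as np

    GAMMA = 0.9
    ALPHA = 0.01

    def v(w, phi_x):
        return float(np.dot(w, phi_x))

    def direct_update(w, phi_x, phi_x_next, r):
        # the target is treated as a constant, so only the gradient of V(x_t, w) appears
        error = r + GAMMA * v(w, phi_x_next) - v(w, phi_x)
        return w + ALPHA * error * phi_x

    def residual_gradient_update(w, phi_x, phi_x_next, r):
        # gradient descent on the squared Bellman residual: both gradients appear
        error = r + GAMMA * v(w, phi_x_next) - v(w, phi_x)
        return w - ALPHA * error * (GAMMA * phi_x_next - phi_x)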

Q-Learning

Q-learning (Watkins, 1989; Watkins and Dayan, 1992) is another extension to traditional dynamic programming (value iteration) that solves the following problem. A deterministic Markov decision process is one in which the state transitions are deterministic (an action performed in state x_t always transitions to the same successor state x_{t+1}). Alternatively, in a nondeterministic Markov decision process, a probability distribution function defines a set of potential successor states for a given action in a given state. If the MDP is non-deterministic, then value iteration requires that we find the action that returns the maximum expected value (the sum of the reinforcement and the integral over all possible successor states for the given action). For example, to find the expected value of the successor state associated with a given action, one must perform that action an infinite number of times, taking the integral over the values of all possible successor states for that action. The reason this is necessary is demonstrated in Figure 7.

[Figure 7: a state x_t with two actions; each action leads with equal probability to one of two successor states.]

In Figure 7 there are two possible actions in state x_t. Each action returns a reinforcement of 0. Action u_1 causes a transition to one of two possible successor states with equal probability. The same is true for action u_2. The values of the successor states are 0 and 1 for both actions. Value iteration requires that the value of state x_t be equal to the maximum over actions of the sum of reinforcement and the expected value of the successor state. By taking an infinite number of samples of successor states for action u_1, one would be able to calculate that the actual expected value is 0.5. The same is true for action u_2. Therefore, the value of state x_t is 0.5. However, if one were to naively perform value iteration on this MDP by taking a single sample of the successor state associated with each action instead of the integral, then x_t would converge to a value of 0.75 (the expected value of the maximum over the two sampled successor values). Clearly the wrong answer. Theoretically, value iteration is possible in the context of non-deterministic MDPs. However, in practice it is computationally impossible to calculate the necessary integrals without added knowledge or some degree of modification.

Q-learning solves the problem of having to take the max over a set of integrals. Rather than finding a mapping from states to state values (as in value iteration), Q-learning finds a mapping from state/action pairs to values (called Q-values). Instead of having an associated value function, Q-learning makes use of the Q-function. In each state, there is a Q-value associated with each action. The definition of a Q-value is the sum of the (possibly discounted) reinforcements received when performing the associated action and then following the given policy thereafter. Likewise, the definition of an optimal Q-value is the sum of the reinforcements received when performing the associated action and then following the optimal policy thereafter. In the context of Q-learning, the value of a state is defined to be the maximum Q-value in the given state. Given this definition it is easy to derive the equivalent of the Bellman equation (equation (4)) for Q-learning.

    Q(x_t, u_t) = r(x_t, u_t) + γ max_{u_{t+1}} Q(x_{t+1}, u_{t+1})    (12)

Q-learning differs from value iteration in that it doesn't require that in a given state each action be performed and the expected values of the successor states be calculated. While value iteration performs an update that is analogous to a one-level breadth-first search, Q-learning takes a single-step sample of a Monte-Carlo roll-out. This process is demonstrated in Figure 8.

[Figure 8: the sampled one-step Q-learning update for a lookup table.]

The update equation in Figure 8 is valid when using a lookup table to represent the Q-function. The Q-value is a prediction of the sum of the reinforcements one will receive when performing the associated action and then following the given policy. To update that prediction Q(x_t, u_t), one must perform the associated action u_t, causing a transition to the next state x_{t+1} and returning a scalar reinforcement r(x_t, u_t). Then one need only find the maximum Q-value in the new state to have all the necessary information for revising the prediction (Q-value) associated with the action just performed. Q-learning does not require one to calculate the integral over all possible successor states in the case that the state transitions are nondeterministic. The reason is that a single sample of a successor state for a given action is an unbiased estimate of the expected value of the successor state. In other words, after many updates the Q-value associated with a particular action will converge to the expected sum of all reinforcements received when performing that action and following the optimal policy thereafter.
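In tabular form, the sampled update described above (and depicted in Figure 8) looks like the following sketch. The dictionary-based table and the calling convention are assumptions made for illustration.

    # Tabular Q-learning from a single observed transition (x_t, u_t, r, x_t+1).
    from collections import defaultdict

    GAMMA = 0.9
    ALPHA = 0.1

    Q = defaultdict(lambda: defaultdict(float))    # Q[state][action] -> Q-value

    def q_update(state, action, reinforcement, next_state, next_actions):
        """Move Q(x_t, u_t) toward the sampled target r + gamma * max_u Q(x_t+1, u)."""
        best_next = max((Q[next_state][u] for u in next_actions), default=0.0)
        target = reinforcement + GAMMA * best_next
        Q[state][action] += ALPHA * (target - Q[state][action])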

Residual Gradient and Direct Q-learning

As it is possible to represent the value function with a neural network in the context of value iteration, so it is possible to represent the Q-function with a neural network in the context of Q-learning. The information presented in the discussion of value iteration concerning convergence to a stable value function is also applicable to guaranteeing convergence to a stable Q-function. Equation (13) is the update equation for direct Q-learning, where α is the learning rate, and equation (14) is the update equation for residual gradient Q-learning.

    Δw = α [ r(x_t, u_t) + γ max_{u_{t+1}} Q(x_{t+1}, u_{t+1}, w_t) − Q(x_t, u_t, w_t) ] ∂Q(x_t, u_t, w_t)/∂w    (13)

    Δw = −α [ r(x_t, u_t) + γ max_{u_{t+1}} Q(x_{t+1}, u_{t+1}, w_t) − Q(x_t, u_t, w_t) ] [ γ ∂Q(x_{t+1}, u_{t+1}, w_t)/∂w − ∂Q(x_t, u_t, w_t)/∂w ]    (14)

Advantage Learning

Although Q-learning is a significant improvement over value iteration, it is still limited in scope in at least one important way. The number of training iterations necessary to sufficiently represent the optimal Q-function when using function approximators that generalize scales poorly with the size of the time interval between states. The greater the number of actions per unit time (the smaller the increment in time between actions), the greater the number of training iterations required to adequately represent the optimal Q-function. The explanation for this is demonstrated with a simple example. Figure 9 depicts a Markov decision process with 1000 states. State 0 is the initial state and has a single action available: transition to state 1. State 999 is an absorbing state.

[Figure 9: a chain of 1000 states, 0 through 999, with the optimal Q-values written next to each state.]

In states 1..998 there are two actions available: transition to either the state immediately to the right or immediately to the left. For example, in state 1, the action of going left will transition to state 0, and the action of going right will transition to state 2. Each transition incurs a cost (reinforcement) of 1. The objective is to minimize the total cost accumulated in transitioning from state to state until the absorbing state is reached. The optimal Q-value for each action is represented by the numbers next to each state. For example, in state 2 the optimal Q-value for the action of going left is 1000, and the optimal Q-value for the action of going right is 998. The optimal policy can easily be found in each state by choosing to perform the action with the minimum Q-value.

When using a function approximator that generalizes over state/action pairs (any function approximator other than a lookup table or equivalent), it is possible to encounter practical limitations in the number of training iterations required to accurately approximate the optimal Q-function. As the time interval between states decreases in size, the required precision in the approximation of the optimal Q-function increases exponentially. For example, the optimal Q-function associated with the MDP in Figure 9 is linear and can be represented by a simple linear function approximator. However, it requires an unreasonably large number of training iterations to achieve the level of precision necessary to generate the optimal policy. The reason for the large number of training iterations is simple. The difference in the Q-values in a given state is small relative to the difference in the Q-values across states (a ratio of approximately 1:1000). For example, the difference in the Q-values in state 1 is 2 (1001 − 999 = 2).

The difference in the minimum Q-values in states 1 and 998 is 998 (999 − 1 = 998). The approximation of the optimal Q-function must achieve a degree of precision such that the tiny differences in Q-values in a single state are represented. Because the differences in Q-values across states have a greater impact on the mean squared error, during training the network learns to represent these differences first. The differences in the Q-values in a given state have only a tiny effect on the mean squared error and therefore get lost in the noise. To represent the differences in Q-values in a given state requires much greater precision than to represent the Q-values across states. As the ratio of the time interval to the number of states decreases it becomes necessary to approximate the optimal Q-function with increasing precision. In the limit, infinite precision is necessary.

Advantage learning does not share the scaling problem of Q-learning. Similar to Q-learning, advantage learning learns a function of state/action pairs. However, in advantage learning the value associated with each action is called an advantage. Therefore, advantage learning finds an advantage function rather than a Q-function or value function. The value of a state is defined to be the value of the maximum advantage in that state. For the state/action pair (x, u), an advantage is defined as the sum of the value of the state and the utility (advantage) of performing action u rather than the action currently considered best. For optimal actions this utility is zero, meaning the value of the action is also the value of the state; for sub-optimal actions the utility is negative, representing the degree of sub-optimality relative to the optimal action. The equivalent of the Bellman equation for advantage learning is given in equation (15),

    A(x_t, u_t) = max_u A(x_t, u) + (1/K) < r(x_t, u_t) + γ max_{u_{t+1}} A(x_{t+1}, u_{t+1}) − max_u A(x_t, u) >    (15)

where γ is the discount factor per time step, K is a time unit scaling factor, and <·> represents the expected value over all possible results of performing action u_t in state x_t to receive immediate reinforcement r and to transition to a new state x_{t+1}.

Residual Gradient and Direct Advantage Learning

The number of training iterations required in Q-learning scales poorly as the ratio of the time interval between states to the number of states grows small. Advantage learning can find a sufficiently accurate approximation to the advantage function in a number of training iterations that is independent of this ratio. The update equations for direct advantage learning and residual advantage learning are given in equations (16) and (17) respectively. Again, the reader is referred to the subsection in the discussion of value iteration devoted to residual gradient algorithms. For a further discussion of advantage learning see Harmon and Baird (1996).

    Δw = α [ (1/K)( r(x_t, u_t) + γ max_{u_{t+1}} A(x_{t+1}, u_{t+1}, w_t) ) + (1 − 1/K) max_u A(x_t, u, w_t) − A(x_t, u_t, w_t) ] ∂A(x_t, u_t, w_t)/∂w    (16)

    Δw = −α [ (1/K)( r(x_t, u_t) + γ max_{u_{t+1}} A(x_{t+1}, u_{t+1}, w_t) ) + (1 − 1/K) max_u A(x_t, u, w_t) − A(x_t, u_t, w_t) ] [ (γ/K) ∂ max_{u_{t+1}} A(x_{t+1}, u_{t+1}, w_t)/∂w + (1 − 1/K) ∂ max_u A(x_t, u, w_t)/∂w − ∂A(x_t, u_t, w_t)/∂w ]    (17)
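A tabular sketch of the advantage-learning target of equation (15) is given below, with the expectation replaced by a single sampled transition. The storage layout, the calling convention, and the choice of K are assumptions made here for illustration; see Harmon and Baird (1996) for the full algorithm with function approximation.

    # Tabular advantage-learning update based on the target of equation (15),
    # with the expectation replaced by one sampled transition.
    from collections import defaultdict

    GAMMA = 0.99
    ALPHA = 0.1
    K = 0.1       # time-unit scaling factor (assumed value)

    A = defaultdict(lambda: defaultdict(float))    # A[state][action] -> advantage

    def advantage_update(state, action, r, next_state, actions, next_actions):
        value = max(A[state][u] for u in actions)                        # max_u A(x_t, u)
        next_value = max((A[next_state][u] for u in next_actions), default=0.0)
        target = value + (r + GAMMA * next_value - value) / K
        A[state][action] += ALPHA * (target - A[state][action])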

TD(λ)

Consider the Markov chain in Figure 10. The initial state is 0 and the terminal state is 999. Each state transition returns a cost (reinforcement) of 1, and the value of state 999 is defined to be 0. Because this is a Markov chain, it is not sensible to suggest that the RL system learn to minimize or maximize reinforcement. Instead, we are concerned exclusively with predicting the total reinforcement received when starting from state n, where n is a state in the range [1..998].

[Figure 10: a Markov chain of 1000 states, 0 through 999, with state 999 terminal.]

Value iteration, Q-learning, and advantage learning can all solve this problem. However, TD(λ) can solve it faster. In the context of Markov chains, TD(λ) is identical to value iteration with the exception that TD(λ) updates the value of the current state based on a weighted combination of the values of future states, as opposed to using only the value of the immediate successor state. Recall that in value iteration the target value of the current state is the sum of the reinforcement and the value of the successor state, in other words, the right side of the Bellman equation (equation (18)).

    V(x_t, w_t) = r(x_t) + γ V(x_{t+1}, w_t)    (18)

Notice that the target is also based on an estimate, V(x_{t+1}, w_t), and this estimate can be based on zero information. Indeed, this is the case much of the time and can be demonstrated using Figure 10. Assume that the value function for this Markov chain is represented using a lookup table. In this case, our lookup table has 1000 elements, each corresponding to a state, and the entry in each element is the value of the corresponding state. Before learning begins, entries are initialized to random values. The process of learning starts by updating the value of state 0 to be the sum of the reinforcement received on transition from state 0 to state 1 and the value of state 1. Remember, at this point the value of state 1 is arbitrary. This is true for all states except the terminal state (999) which, by definition, has a value of 0. Because the initial values of states are arbitrary (with the exception of the terminal state), the entire first sweep through the Markov chain (epoch) of training results in the improvement of the approximation of the value function only in state 998. In the first epoch, only in state 998 is the update to the approximation based on something other than an arbitrary value. This is terribly inefficient. In fact, not until 999 epochs of training have been performed will the approximation of the value of state 0 contain any degree of truth (the approximation is based on something other than an arbitrary value). In epoch 2 of training, the approximation of the value of state 997 is updated based on an approximation of the value of state 998 that has as its basis the true value of state 999, rather than an arbitrary value. In epoch 3, the approximation of the value of state 996 will be updated based on truth rather than an arbitrary value. Each epoch moves truth back one step in the chain. The approximation of the value of state x_t is updated based on the approximation of the value of the state one step into the future, x_{t+1}. If the value of a state were based on a weighted average of the values of future states, then truth would be propagated back in time much more efficiently.

In our example above, if instead of updating the value of a state based exclusively on the value of the immediate successor state one used the next 2 successor states as the basis of the update, then the number of epochs performed before the value of state 0 is no longer based on an arbitrary value is reduced from 1000 to 500. If the value approximation of state 0 is based on a weighted combination of values of the succeeding 500 states, then only 2 epochs are required before the value approximation of state 0 is based on something other than an arbitrary value. This is precisely the function of TD(λ) (Sutton, 1988) for 0 < λ < 1. Instead of updating a value approximation based solely on the approximated value of the immediate successor state, TD(λ) bases the update on an exponential weighting of values of future states; λ is the weighting factor. TD(0), the case of λ = 0, is identical to value iteration for the example problem stated above. TD(1) updates the value approximation of state n based solely on the value of the terminal state. The parameter update for TD(λ) is given in equation (19).

    Δw = α [ r(x_t) + γ V(x_{t+1}, w_t) − V(x_t, w_t) ] Σ_{k=1}^{t} λ^{t−k} ∂V(x_k, w_t)/∂w    (19)

An incremental form of this equation can be derived as follows. Given that g_t is the value of the sum in (19) at time t, we can compute g_{t+1}, using only current information, as

    g_{t+1} = Σ_{k=1}^{t+1} λ^{t+1−k} ∂V(x_k, w)/∂w = ∂V(x_{t+1}, w)/∂w + λ Σ_{k=1}^{t} λ^{t−k} ∂V(x_k, w)/∂w = ∂V(x_{t+1}, w)/∂w + λ g_t    (20)

Notice that equation (19) does not have a max or min term. This suggests that TD(λ) is used exclusively in the context of prediction (Markov chains). One way to extend the use of TD(λ) to the domain of Markov decision processes is to perform updates according to equation (19), while calculating the sum according to equation (20) when following the current policy. When a step of exploration is performed (choosing an action that is not currently considered "best"), the sum of past gradients g in equation (20) should be set to 0. The intuition behind this method follows. The value of a state x is defined as the sum of the reinforcements received when starting in x and following the current policy until a terminal state is reached. During training, the current policy is the best approximation to the optimal policy generated thus far. On occasion one must perform actions that don't agree with the current policy so that better approximations to the optimal policy can be realized. However, one might not want the value of the resulting state propagated through the chain of past states. This would corrupt the value approximations for these states by introducing information that is not consistent with the definition of a state value.

One further note: TD(λ) for λ = 0 is equivalent to value iteration. Likewise, the discussion of residual gradient algorithms is applicable to TD(λ) when λ = 0. However, this is not the case for 0 < λ < 1. No algorithms exist that guarantee convergence for TD(λ) for 0 < λ < 1 when using a general function approximator.
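Equations (19) and (20) can be combined into a simple per-step procedure for a linear approximator V(x, w) = w·φ(x), since the accumulated gradient g_t is then just a λ-weighted sum of past feature vectors. The chain interface and feature function φ in the sketch below are assumptions; φ of the terminal state is assumed to be the zero vector so that its value is fixed at 0.

    # TD(lambda) for a linear approximator V(x, w) = w . phi(x) on a Markov chain,
    # using the incremental accumulated gradient of equations (19) and (20).
    import numpy as np

    GAMMA = 1.0      # no discounting on the finite chain
    ALPHA = 0.1
    LAMBDA = 0.9

    def td_lambda_epoch(chain, phi, w):
        """One sweep from the initial state to the terminal state, updating w in place."""
        g = np.zeros_like(w)                   # g_t: lambda-weighted sum of past gradients
        x = chain.initial_state()
        while not chain.is_terminal(x):
            x_next, r = chain.step(x)          # a chain, so there is no action choice
            g = phi(x) + LAMBDA * g            # equation (20)
            delta = r + GAMMA * np.dot(w, phi(x_next)) - np.dot(w, phi(x))
            w += ALPHA * delta * g             # equation (19)
            x = x_next
        return w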

4 Miscellaneous Issues

Exploration

As stated earlier, the fundamental question in reinforcement learning research is: how do we devise an algorithm that will efficiently find the optimal value function? It was shown that the optimal value function is a solution to the set of equations defined by the Bellman equation (equation (4)). The process of learning was subsequently described as the process of improving an approximation of the optimal value function by incrementally finding a solution to this set of equations. One should notice that the Bellman equation is defined over all of state space. The optimal value function satisfies this equation for ALL x in state space. This requirement introduces the need for exploration. Exploration is defined as intentionally choosing to perform an action that is not considered best for the express purpose of acquiring knowledge of unseen (or little seen) states. In order to identify a (sub-)optimal approximation, state space must be sufficiently explored. For example, a robot facing an unknown environment has to spend some time acquiring knowledge of its environment. Alternatively, experience acquired during exploration must also be considered during action selection to minimize the costs (negative reinforcements) associated with learning. Although the robot must explore its environment, it should avoid collisions with obstacles. However, the robot does not know which actions will result in collision until all of state space has been explored. On the other hand, it is possible that a policy that is sufficiently good will be recognized without having to explore all of state space. There is a fundamental trade-off between exploration and exploitation (using previously acquired knowledge to direct the choice of action). Therefore, it is important to use exploration techniques that will maximize the knowledge gained during learning while minimizing the costs of exploration and learning time. For a good introduction to the issues of efficient exploration see Thrun (1992).
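One common, simple way to implement the exploration-exploitation trade-off discussed above is ε-greedy action selection: with a small probability take a random action, otherwise take the action currently considered best. This is only one of many schemes (see Thrun, 1992); the Q-table layout in the sketch below is an assumption made for illustration.

    # Epsilon-greedy action selection over a Q-table.
    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """With probability epsilon explore (random action); otherwise exploit the
        action currently considered best according to Q."""
        if random.random() < epsilon:
            return random.choice(list(actions))
        return max(actions, key=lambda u: Q[state][u])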

Discounted vs. Non-Discounted

The discount factor γ is a number in the range [0, 1] and is used to weight near-term reinforcement more heavily than distant future reinforcement. For the purpose of discussion, the update equation for value iteration is shown again as equation (21). The closer γ is to 1, the greater the weight of future reinforcements. The weighting of future reinforcements has a half-life of σ = log 0.5 / log γ. For γ = 0, the value of a state is based exclusively on the immediate reinforcement received for performing the associated action. For finite horizon Markov decision processes (an MDP that terminates) it is not strictly necessary to use a discount factor. In this case (γ = 1), the value of state x_t is based on the total reinforcement received when starting in state x_t and following the given policy.

    Δw = max_u [ r(x_t, u) + γ V(x_{t+1}, w_t) ] − V(x_t, w_t)    (21)

In the case of infinite horizon Markov decision processes (an MDP that never terminates), a discount factor is required. Without the use of a discount factor, the sum of the reinforcements received would be infinite for every state. The use of a discount factor limits the maximum value of a state to be on the order of R/(1 − γ), where R is the magnitude of the largest reinforcement.

5 Conclusion

Reinforcement learning appeals to many researchers because of its generality. Any problem domain that can be cast as a Markov decision process can potentially benefit from this technique. In fact, many researchers view reinforcement learning not as a technique, but rather as a particular type of problem that is amenable to solution by the algorithms described above. Reinforcement learning is an extension of classical dynamic programming in that it greatly enlarges the set of problems that can practically be solved. Unlike supervised learning, reinforcement learning systems do not require explicit input-output pairs for training. By combining dynamic programming with neural networks, many are optimistic that classes of problems previously unsolvable will finally be solved.

Acknowledgments

The development of this tutorial was supported under Task 2312R1 by the United States Air Force Office of Scientific Research. We would also like to thank Leemon Baird, Harry Klopf, Eric Blasche, Jim Morgan, and Scott Weaver for useful comments.

6 Glossary

policy - a mapping from states to actions.

reinforcement - a scalar variable that communicates the change in the environment to the reinforcement learning system. For example, if an RL system is a controller for a missile, the reinforcement signal might be the distance between the missile and the target (in which case, the RL system would learn to minimize reinforcement).

Markov decision process - An MDP consists of a set of states X; a set of start states S that is a subset of X; a set of actions A; a reinforcement function R, where R(x,a) is the expected immediate reinforcement for taking action a in state x; and an action model P, where P(x'|x,a) gives the probability that executing action a in state x will lead to state x'. Note: it is a requirement that the choice of action be dependent solely on the current state observation x. If knowledge of prior actions or states affects the current choice of action, then the decision process is not Markov.

deterministic - In the context of states, there is a one-to-one mapping from state/action pairs to successor states. In other words, with probability one, the transition from state x after performing action a will always result in state x'.

non-deterministic - In the context of states, there exists a probability distribution function P(x'|x,a) that gives the probability that executing action a in state x will lead to state x'.

state - The condition of a physical system as specified by a set of appropriate variables.

unbiased estimate - The expected (mean) error in the estimate is zero.

7 References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Armand Prieditis & Stuart Russell, eds., Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July, Morgan Kaufmann Publishers, San Francisco, CA.

Baird, L. C. (1993). Advantage Updating. (Technical Report WL-TR-93-1146). Wright-Patterson Air Force Base, Ohio: Wright Laboratory. (Available from the Defense Technical Information Center, Cameron Station, Alexandria, VA.)

Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA.

Harmon, M. E., Baird, L. C., and Klopf, A. H. (1995). Reinforcement learning applied to a differential game. Adaptive Behavior, MIT Press, (4).

Harmon, M. E., Baird, L. C., and Klopf, A. H. (1994). Advantage updating applied to a differential game. In Tesauro, Touretzky & Leen, eds., Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, MIT Press, Cambridge, Massachusetts.

Harmon, M. E., and Baird, L. C. (1996). Multi-player residual advantage learning with general function approximation. (Technical Report WL-TR). Wright-Patterson Air Force Base, Ohio: Wright Laboratory. (Available from the Defense Technical Information Center, Cameron Station, Alexandria, VA.)

Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, Volume 4.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, San Francisco, CA: Morgan Kaufmann.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3.

Thrun, S. (1992). The role of exploration in learning control. In Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches, Van Nostrand Reinhold, Florence, Kentucky.

Watkins, C. J. C. H. (1989). Learning from delayed rewards. Doctoral thesis, Cambridge University, Cambridge, England.

Watkins, C. J. C. H., and Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8.

8 Bibliography: Applications of Reinforcement Learning

Tesauro, G. J. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38.

Boyan, J. A., and Littman, M. L. (1994). Packet routing in dynamically changing networks: A reinforcement learning approach. In Cowan, J. D., Tesauro, G., and Alspector, J., eds., Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, San Francisco, CA: Morgan Kaufmann.

Crites, R. H., and Barto, A. G. (1996). Improving elevator performance using reinforcement learning. In Touretzky, Mozer & Hasselmo, eds., Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, MIT Press, Cambridge, Massachusetts.

Singh, S., and Bertsekas, D. P. (1996). Reinforcement learning for dynamic channel allocation in cellular telephone systems. To appear in Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference, MIT Press, Cambridge, Massachusetts.

Zhang, W., and Dietterich, T. G. (1996). High-performance job-shop scheduling with a time-delay TD(λ) network. In Touretzky, Mozer & Hasselmo, eds., Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, MIT Press, Cambridge, Massachusetts.


More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

Chapter 7: Solving Trig Equations

Chapter 7: Solving Trig Equations Haberman MTH Secion I: The Trigonomeric Funcions Chaper 7: Solving Trig Equaions Le s sar by solving a couple of equaions ha involve he sine funcion EXAMPLE a: Solve he equaion sin( ) The inverse funcions

More information

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes Half-Range Series 2.5 Inroducion In his Secion we address he following problem: Can we find a Fourier series expansion of a funcion defined over a finie inerval? Of course we recognise ha such a funcion

More information

Lecture Notes 2. The Hilbert Space Approach to Time Series

Lecture Notes 2. The Hilbert Space Approach to Time Series Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship

More information

Random Walk with Anti-Correlated Steps

Random Walk with Anti-Correlated Steps Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and

More information

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes Represening Periodic Funcions by Fourier Series 3. Inroducion In his Secion we show how a periodic funcion can be expressed as a series of sines and cosines. We begin by obaining some sandard inegrals

More information

Ensamble methods: Boosting

Ensamble methods: Boosting Lecure 21 Ensamble mehods: Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Schedule Final exam: April 18: 1:00-2:15pm, in-class Term projecs April 23 & April 25: a 1:00-2:30pm in CS seminar room

More information

Ensamble methods: Bagging and Boosting

Ensamble methods: Bagging and Boosting Lecure 21 Ensamble mehods: Bagging and Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Ensemble mehods Mixure of expers Muliple base models (classifiers, regressors), each covers a differen par

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information

Principle of Least Action

Principle of Least Action The Based on par of Chaper 19, Volume II of The Feynman Lecures on Physics Addison-Wesley, 1964: pages 19-1 hru 19-3 & 19-8 hru 19-9. Edwin F. Taylor July. The Acion Sofware The se of exercises on Acion

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

) were both constant and we brought them from under the integral.

) were both constant and we brought them from under the integral. YIELD-PER-RECRUIT (coninued The yield-per-recrui model applies o a cohor, bu we saw in he Age Disribuions lecure ha he properies of a cohor do no apply in general o a collecion of cohors, which is wha

More information

STATE-SPACE MODELLING. A mass balance across the tank gives:

STATE-SPACE MODELLING. A mass balance across the tank gives: B. Lennox and N.F. Thornhill, 9, Sae Space Modelling, IChemE Process Managemen and Conrol Subjec Group Newsleer STE-SPACE MODELLING Inroducion: Over he pas decade or so here has been an ever increasing

More information

Lecture 20: Riccati Equations and Least Squares Feedback Control

Lecture 20: Riccati Equations and Least Squares Feedback Control 34-5 LINEAR SYSTEMS Lecure : Riccai Equaions and Leas Squares Feedback Conrol 5.6.4 Sae Feedback via Riccai Equaions A recursive approach in generaing he marix-valued funcion W ( ) equaion for i for he

More information

( ) ( ) if t = t. It must satisfy the identity. So, bulkiness of the unit impulse (hyper)function is equal to 1. The defining characteristic is

( ) ( ) if t = t. It must satisfy the identity. So, bulkiness of the unit impulse (hyper)function is equal to 1. The defining characteristic is UNIT IMPULSE RESPONSE, UNIT STEP RESPONSE, STABILITY. Uni impulse funcion (Dirac dela funcion, dela funcion) rigorously defined is no sricly a funcion, bu disribuion (or measure), precise reamen requires

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

Online Convex Optimization Example And Follow-The-Leader

Online Convex Optimization Example And Follow-The-Leader CSE599s, Spring 2014, Online Learning Lecure 2-04/03/2014 Online Convex Opimizaion Example And Follow-The-Leader Lecurer: Brendan McMahan Scribe: Sephen Joe Jonany 1 Review of Online Convex Opimizaion

More information

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles Diebold, Chaper 7 Francis X. Diebold, Elemens of Forecasing, 4h Ediion (Mason, Ohio: Cengage Learning, 006). Chaper 7. Characerizing Cycles Afer compleing his reading you should be able o: Define covariance

More information

Two Coupled Oscillators / Normal Modes

Two Coupled Oscillators / Normal Modes Lecure 3 Phys 3750 Two Coupled Oscillaors / Normal Modes Overview and Moivaion: Today we ake a small, bu significan, sep owards wave moion. We will no ye observe waves, bu his sep is imporan in is own

More information

Lab #2: Kinematics in 1-Dimension

Lab #2: Kinematics in 1-Dimension Reading Assignmen: Chaper 2, Secions 2-1 hrough 2-8 Lab #2: Kinemaics in 1-Dimension Inroducion: The sudy of moion is broken ino wo main areas of sudy kinemaics and dynamics. Kinemaics is he descripion

More information

Math 333 Problem Set #2 Solution 14 February 2003

Math 333 Problem Set #2 Solution 14 February 2003 Mah 333 Problem Se #2 Soluion 14 February 2003 A1. Solve he iniial value problem dy dx = x2 + e 3x ; 2y 4 y(0) = 1. Soluion: This is separable; we wrie 2y 4 dy = x 2 + e x dx and inegrae o ge The iniial

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes Some common engineering funcions 2.7 Inroducion This secion provides a caalogue of some common funcions ofen used in Science and Engineering. These include polynomials, raional funcions, he modulus funcion

More information

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing Applicaion of a Sochasic-Fuzzy Approach o Modeling Opimal Discree Time Dynamical Sysems by Using Large Scale Daa Processing AA WALASZE-BABISZEWSA Deparmen of Compuer Engineering Opole Universiy of Technology

More information

Matlab and Python programming: how to get started

Matlab and Python programming: how to get started Malab and Pyhon programming: how o ge sared Equipping readers he skills o wrie programs o explore complex sysems and discover ineresing paerns from big daa is one of he main goals of his book. In his chaper,

More information

Predator - Prey Model Trajectories and the nonlinear conservation law

Predator - Prey Model Trajectories and the nonlinear conservation law Predaor - Prey Model Trajecories and he nonlinear conservaion law James K. Peerson Deparmen of Biological Sciences and Deparmen of Mahemaical Sciences Clemson Universiy Ocober 28, 213 Ouline Drawing Trajecories

More information

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updaes Michael Kearns AT&T Labs mkearns@research.a.com Sainder Singh AT&T Labs baveja@research.a.com Absrac We give he firs rigorous upper bounds on he error

More information

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017 Two Popular Bayesian Esimaors: Paricle and Kalman Filers McGill COMP 765 Sep 14 h, 2017 1 1 1, dx x Bel x u x P x z P Recall: Bayes Filers,,,,,,, 1 1 1 1 u z u x P u z u x z P Bayes z = observaion u =

More information

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality Marix Versions of Some Refinemens of he Arihmeic-Geomeric Mean Inequaliy Bao Qi Feng and Andrew Tonge Absrac. We esablish marix versions of refinemens due o Alzer ], Carwrigh and Field 4], and Mercer 5]

More information

A Reinforcement Learning Approach for Collaborative Filtering

A Reinforcement Learning Approach for Collaborative Filtering A Reinforcemen Learning Approach for Collaboraive Filering Jungkyu Lee, Byonghwa Oh 2, Jihoon Yang 2, and Sungyong Park 2 Cyram Inc, Seoul, Korea jklee@cyram.com 2 Sogang Universiy, Seoul, Korea {mrfive,yangjh,parksy}@sogang.ac.kr

More information

SOLUTIONS TO ECE 3084

SOLUTIONS TO ECE 3084 SOLUTIONS TO ECE 384 PROBLEM 2.. For each sysem below, specify wheher or no i is: (i) memoryless; (ii) causal; (iii) inverible; (iv) linear; (v) ime invarian; Explain your reasoning. If he propery is no

More information

Echocardiography Project and Finite Fourier Series

Echocardiography Project and Finite Fourier Series Echocardiography Projec and Finie Fourier Series 1 U M An echocardiagram is a plo of how a porion of he hear moves as he funcion of ime over he one or more hearbea cycles If he hearbea repeas iself every

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux Gues Lecures for Dr. MacFarlane s EE3350 Par Deux Michael Plane Mon., 08-30-2010 Wrie name in corner. Poin ou his is a review, so I will go faser. Remind hem o go lisen o online lecure abou geing an A

More information

Planning in POMDPs. Dominik Schoenberger Abstract

Planning in POMDPs. Dominik Schoenberger Abstract Planning in POMDPs Dominik Schoenberger d.schoenberger@sud.u-darmsad.de Absrac This documen briefly explains wha a Parially Observable Markov Decision Process is. Furhermore i inroduces he differen approaches

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Announcements. Recap: Filtering. Recap: Reasoning Over Time. Example: State Representations for Robot Localization. Particle Filtering

Announcements. Recap: Filtering. Recap: Reasoning Over Time. Example: State Representations for Robot Localization. Particle Filtering Inroducion o Arificial Inelligence V22.0472-001 Fall 2009 Lecure 18: aricle & Kalman Filering Announcemens Final exam will be a 7pm on Wednesday December 14 h Dae of las class 1.5 hrs long I won ask anyhing

More information

This document was generated at 1:04 PM, 09/10/13 Copyright 2013 Richard T. Woodward. 4. End points and transversality conditions AGEC

This document was generated at 1:04 PM, 09/10/13 Copyright 2013 Richard T. Woodward. 4. End points and transversality conditions AGEC his documen was generaed a 1:4 PM, 9/1/13 Copyrigh 213 Richard. Woodward 4. End poins and ransversaliy condiions AGEC 637-213 F z d Recall from Lecure 3 ha a ypical opimal conrol problem is o maimize (,,

More information

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should Cambridge Universiy Press 978--36-60033-7 Cambridge Inernaional AS and A Level Mahemaics: Mechanics Coursebook Excerp More Informaion Chaper The moion of projeciles In his chaper he model of free moion

More information

INDEX. Transient analysis 1 Initial Conditions 1

INDEX. Transient analysis 1 Initial Conditions 1 INDEX Secion Page Transien analysis 1 Iniial Condiions 1 Please inform me of your opinion of he relaive emphasis of he review maerial by simply making commens on his page and sending i o me a: Frank Mera

More information

Chapter 21. Reinforcement Learning. The Reinforcement Learning Agent

Chapter 21. Reinforcement Learning. The Reinforcement Learning Agent CSE 47 Chaper Reinforcemen Learning The Reinforcemen Learning Agen Agen Sae u Reward r Acion a Enironmen CSE AI Faculy Why reinforcemen learning Programming an agen o drie a car or fly a helicoper is ery

More information

Solutions Problem Set 3 Macro II (14.452)

Solutions Problem Set 3 Macro II (14.452) Soluions Problem Se 3 Macro II (14.452) Francisco A. Gallego 04/27/2005 1 Q heory of invesmen in coninuous ime and no uncerainy Consider he in nie horizon model of a rm facing adjusmen coss o invesmen.

More information

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC This documen was generaed a :45 PM 8/8/04 Copyrigh 04 Richard T. Woodward. An inroducion o dynamic opimizaion -- Opimal Conrol and Dynamic Programming AGEC 637-04 I. Overview of opimizaion Opimizaion is

More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Kriging Models Predicing Arazine Concenraions in Surface Waer Draining Agriculural Waersheds Paul L. Mosquin, Jeremy Aldworh, Wenlin Chen Supplemenal Maerial Number

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H.

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H. ACE 56 Fall 005 Lecure 5: he Simple Linear Regression Model: Sampling Properies of he Leas Squares Esimaors by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Inference in he Simple

More information

Linear Time-invariant systems, Convolution, and Cross-correlation

Linear Time-invariant systems, Convolution, and Cross-correlation Linear Time-invarian sysems, Convoluion, and Cross-correlaion (1) Linear Time-invarian (LTI) sysem A sysem akes in an inpu funcion and reurns an oupu funcion. x() T y() Inpu Sysem Oupu y() = T[x()] An

More information

Essential Microeconomics : OPTIMAL CONTROL 1. Consider the following class of optimization problems

Essential Microeconomics : OPTIMAL CONTROL 1. Consider the following class of optimization problems Essenial Microeconomics -- 6.5: OPIMAL CONROL Consider he following class of opimizaion problems Max{ U( k, x) + U+ ( k+ ) k+ k F( k, x)}. { x, k+ } = In he language of conrol heory, he vecor k is he vecor

More information

Single and Double Pendulum Models

Single and Double Pendulum Models Single and Double Pendulum Models Mah 596 Projec Summary Spring 2016 Jarod Har 1 Overview Differen ypes of pendulums are used o model many phenomena in various disciplines. In paricular, single and double

More information

A First Course on Kinetics and Reaction Engineering. Class 19 on Unit 18

A First Course on Kinetics and Reaction Engineering. Class 19 on Unit 18 A Firs ourse on Kineics and Reacion Engineering lass 19 on Uni 18 Par I - hemical Reacions Par II - hemical Reacion Kineics Where We re Going Par III - hemical Reacion Engineering A. Ideal Reacors B. Perfecly

More information

15. Vector Valued Functions

15. Vector Valued Functions 1. Vecor Valued Funcions Up o his poin, we have presened vecors wih consan componens, for example, 1, and,,4. However, we can allow he componens of a vecor o be funcions of a common variable. For example,

More information

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information

18 Biological models with discrete time

18 Biological models with discrete time 8 Biological models wih discree ime The mos imporan applicaions, however, may be pedagogical. The elegan body of mahemaical heory peraining o linear sysems (Fourier analysis, orhogonal funcions, and so

More information

Lecture 4 Notes (Little s Theorem)

Lecture 4 Notes (Little s Theorem) Lecure 4 Noes (Lile s Theorem) This lecure concerns one of he mos imporan (and simples) heorems in Queuing Theory, Lile s Theorem. More informaion can be found in he course book, Bersekas & Gallagher,

More information

RC, RL and RLC circuits

RC, RL and RLC circuits Name Dae Time o Complee h m Parner Course/ Secion / Grade RC, RL and RLC circuis Inroducion In his experimen we will invesigae he behavior of circuis conaining combinaions of resisors, capaciors, and inducors.

More information

Module 2 F c i k c s la l w a s o s f dif di fusi s o i n

Module 2 F c i k c s la l w a s o s f dif di fusi s o i n Module Fick s laws of diffusion Fick s laws of diffusion and hin film soluion Adolf Fick (1855) proposed: d J α d d d J (mole/m s) flu (m /s) diffusion coefficien and (mole/m 3 ) concenraion of ions, aoms

More information

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis Speaker Adapaion Techniques For Coninuous Speech Using Medium and Small Adapaion Daa Ses Consaninos Boulis Ouline of he Presenaion Inroducion o he speaker adapaion problem Maximum Likelihood Sochasic Transformaions

More information

Air Traffic Forecast Empirical Research Based on the MCMC Method

Air Traffic Forecast Empirical Research Based on the MCMC Method Compuer and Informaion Science; Vol. 5, No. 5; 0 ISSN 93-8989 E-ISSN 93-8997 Published by Canadian Cener of Science and Educaion Air Traffic Forecas Empirical Research Based on he MCMC Mehod Jian-bo Wang,

More information

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015 Explaining Toal Facor Produciviy Ulrich Kohli Universiy of Geneva December 2015 Needed: A Theory of Toal Facor Produciviy Edward C. Presco (1998) 2 1. Inroducion Toal Facor Produciviy (TFP) has become

More information

Solutions from Chapter 9.1 and 9.2

Solutions from Chapter 9.1 and 9.2 Soluions from Chaper 9 and 92 Secion 9 Problem # This basically boils down o an exercise in he chain rule from calculus We are looking for soluions of he form: u( x) = f( k x c) where k x R 3 and k is

More information

2016 Possible Examination Questions. Robotics CSCE 574

2016 Possible Examination Questions. Robotics CSCE 574 206 Possible Examinaion Quesions Roboics CSCE 574 ) Wha are he differences beween Hydraulic drive and Shape Memory Alloy drive? Name one applicaion in which each one of hem is appropriae. 2) Wha are he

More information

SPH3U: Projectiles. Recorder: Manager: Speaker:

SPH3U: Projectiles. Recorder: Manager: Speaker: SPH3U: Projeciles Now i s ime o use our new skills o analyze he moion of a golf ball ha was ossed hrough he air. Le s find ou wha is special abou he moion of a projecile. Recorder: Manager: Speaker: 0

More information

Inventory Control of Perishable Items in a Two-Echelon Supply Chain

Inventory Control of Perishable Items in a Two-Echelon Supply Chain Journal of Indusrial Engineering, Universiy of ehran, Special Issue,, PP. 69-77 69 Invenory Conrol of Perishable Iems in a wo-echelon Supply Chain Fariborz Jolai *, Elmira Gheisariha and Farnaz Nojavan

More information

Robust estimation based on the first- and third-moment restrictions of the power transformation model

Robust estimation based on the first- and third-moment restrictions of the power transformation model h Inernaional Congress on Modelling and Simulaion, Adelaide, Ausralia, 6 December 3 www.mssanz.org.au/modsim3 Robus esimaion based on he firs- and hird-momen resricions of he power ransformaion Nawaa,

More information

12: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME. Σ j =

12: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME. Σ j = 1: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME Moving Averages Recall ha a whie noise process is a series { } = having variance σ. The whie noise process has specral densiy f (λ) = of

More information

ACE 564 Spring Lecture 7. Extensions of The Multiple Regression Model: Dummy Independent Variables. by Professor Scott H.

ACE 564 Spring Lecture 7. Extensions of The Multiple Regression Model: Dummy Independent Variables. by Professor Scott H. ACE 564 Spring 2006 Lecure 7 Exensions of The Muliple Regression Model: Dumm Independen Variables b Professor Sco H. Irwin Readings: Griffihs, Hill and Judge. "Dumm Variables and Varing Coefficien Models

More information

Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Article from. Predictive Analytics and Futurism. July 2016 Issue 13 Aricle from Predicive Analyics and Fuurism July 6 Issue An Inroducion o Incremenal Learning By Qiang Wu and Dave Snell Machine learning provides useful ools for predicive analyics The ypical machine learning

More information

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1.

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1. Roboics I April 11, 017 Exercise 1 he kinemaics of a 3R spaial robo is specified by he Denavi-Harenberg parameers in ab 1 i α i d i a i θ i 1 π/ L 1 0 1 0 0 L 3 0 0 L 3 3 able 1: able of DH parameers of

More information

!!"#"$%&#'()!"#&'(*%)+,&',-)./0)1-*23)

!!#$%&#'()!#&'(*%)+,&',-)./0)1-*23) "#"$%&#'()"#&'(*%)+,&',-)./)1-*) #$%&'()*+,&',-.%,/)*+,-&1*#$)()5*6$+$%*,7&*-'-&1*(,-&*6&,7.$%$+*&%'(*8$&',-,%'-&1*(,-&*6&,79*(&,%: ;..,*&1$&$.$%&'()*1$$.,'&',-9*(&,%)?%*,('&5

More information

= ( ) ) or a system of differential equations with continuous parametrization (T = R

= ( ) ) or a system of differential equations with continuous parametrization (T = R XIII. DIFFERENCE AND DIFFERENTIAL EQUATIONS Ofen funcions, or a sysem of funcion, are paramerized in erms of some variable, usually denoed as and inerpreed as ime. The variable is wrien as a funcion of

More information

PHYSICS 220 Lecture 02 Motion, Forces, and Newton s Laws Textbook Sections

PHYSICS 220 Lecture 02 Motion, Forces, and Newton s Laws Textbook Sections PHYSICS 220 Lecure 02 Moion, Forces, and Newon s Laws Texbook Secions 2.2-2.4 Lecure 2 Purdue Universiy, Physics 220 1 Overview Las Lecure Unis Scienific Noaion Significan Figures Moion Displacemen: Δx

More information