Temporal Abstraction in Temporal-difference Networks

Richard S. Sutton, Eddie J. Rafols, Anna Koop
Department of Computing Science
University of Alberta
Edmonton, AB, Canada T6G 2E8

Abstract

We present a generalization of temporal-difference networks to include temporally abstract options on the links of the question network. Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment. These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that they are about what would happen if an action or sequence of actions were taken. In conventional TD networks, the inter-related predictions are at successive time steps and contingent on a single action; here we generalize them to accommodate extended time intervals and contingency on whole ways of behaving. Our generalization is based on the options framework for temporal abstraction. The primary contribution of this paper is to introduce a new algorithm for intra-option learning in TD networks with function approximation and eligibility traces. We present empirical examples of our algorithm's effectiveness and of the greater representational expressiveness of temporally-abstract TD networks.

The primary distinguishing feature of temporal-difference (TD) networks (Sutton & Tanner, 2005) is that they permit a general compositional specification of the goals of learning. The goals of learning are thought of as predictive questions being asked by the agent in the learning problem, such as "What will I see if I step forward and look right?" or "If I open the fridge, will I see a bottle of beer?" Seeing a bottle of beer is of course a complicated perceptual act. It might be thought of as obtaining a set of predictions about what would happen if certain reaching and grasping actions were taken, about what would happen if the bottle were opened and turned upside down, and of what the bottle would look like if viewed from various angles. To predict seeing a bottle of beer is thus to make a prediction about a set of other predictions. The target for the overall prediction is a composition, in the mathematical sense, of the first prediction with each of the other predictions.

TD networks are the first framework for representing the goals of predictive learning in a compositional, machine-accessible form. Each node of a TD network represents an individual question (something to be predicted) and has associated with it a value representing an answer to the question (a prediction of that something). The questions are represented by a set of directed links between nodes. If node 1 is linked to node 2, then node 1 represents a question incorporating node 2's question; its value is a prediction about node 2's prediction.

Higher-level predictions can be composed in several ways from lower ones, producing a powerful, structured representation language for the targets of learning. The compositional structure is not just in a human designer's head; it is expressed in the links and thus is accessible to the agent and its learning algorithm. The network of these links is referred to as the question network.

An entirely separate set of directed links between the nodes is used to compute the values (predictions, answers) associated with each node. These links collectively are referred to as the answer network. The computation in the answer network is compositional in a conventional way: node values are computed from other node values. The essential insight of TD networks is that the notion of compositionality should apply to questions as well as to answers.

A secondary distinguishing feature of TD networks is that the predictions (node values) at each moment in time can be used as a representation of the state of the world at that time. In this way they are an instance of the idea of predictive state representations (PSRs) introduced by Littman, Sutton and Singh (2002), Jaeger (2000), and Rivest and Schapire (1987). Representing a state by its predictions is a potentially powerful strategy for state abstraction (Rafols et al., 2005). We note that the questions used in all previous work with PSRs are defined in terms of concrete actions and observations, not other predictions. They are not compositional in the sense that TD-network questions are.

The questions we have discussed so far are subjunctive, meaning that they are conditional on a certain way of behaving. We predict what we would see if we were to step forward and look right, or if we were to open the fridge. The questions in conventional TD networks are subjunctive, but they are conditional only on primitive actions or open-loop sequences of primitive actions (as are conventional PSRs). It is natural to generalize this, as we have in the informal examples above, to questions that are conditional on closed-loop temporally extended ways of behaving. For example, opening the fridge is a complex, high-level action. The arm must be lifted to the door, the hand shaped for grasping the handle, etc. To ask questions like "If I were to go to the coffee room, would I see John?" would require substantial temporal abstraction in addition to state abstraction.

The options framework (Sutton, Precup & Singh, 1999) is a straightforward way of talking about temporally extended ways of behaving and about predictions of their outcomes. In this paper we extend the options framework so that it can be applied to TD networks. Significant extensions of the original options framework are needed. Novel features of our option-extended TD networks are that they 1) predict components of option outcomes rather than full outcome probability distributions, 2) learn according to the first intra-option method to use eligibility traces (see Sutton & Barto, 1998), and 3) include the possibility of options whose policies are indifferent to which of several actions are selected.

1 The options framework

In this section we present the essential elements of the options framework (Sutton, Precup & Singh, 1999) that we will need for our extension of TD networks. In this framework, an agent and an environment interact at discrete time steps t = 1, 2, 3,.... In each state s ∈ S, the agent selects an action a ∈ A, determining the next state s′. An action is a way of behaving for one time step; the options framework lets us talk about temporally extended ways of behaving. An individual option consists of three parts. The first is the initiation set, I ⊆ S, the subset of states in which the option can be started.
The second component of an option is its policy, π : S × A → [0, 1], specifying how the agent behaves when following the option.¹ Finally, a termination function, β : S → [0, 1], specifies how the option ends: β(s) denotes the probability of terminating when in state s. The option is thus completely and formally defined by the 3-tuple (I, π, β).

¹ Although the options framework includes rewards, we omit them here because we are concerned only with prediction, not control.
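To make the three-part definition concrete, the following is a minimal Python sketch of an option as the 3-tuple (I, π, β); the class and field names are illustrative assumptions, not code from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass
class Option:
    """An option (I, pi, beta) in the sense of Sutton, Precup & Singh (1999)."""
    initiation: Set[State]                        # I: subset of states where the option may start
    policy: Callable[[State, Action], float]      # pi(s, a): probability of taking action a in state s
    termination: Callable[[State], float]         # beta(s): probability the option terminates in state s

    def can_start(self, s: State) -> bool:
        return s in self.initiation
```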

2 Conventional TD networks

In this section we briefly present the details of the structure and the learning algorithm comprising TD networks as introduced by Sutton and Tanner (2005). TD networks address a prediction problem in which the agent may not have direct access to the state of the environment. Instead, at each time step the agent receives an observation o_t ∈ O dependent on the state. The experience stream thus consists of a sequence of alternating actions and observations, o_1, a_1, o_2, a_2, o_3,....

The TD network consists of a set of nodes, each representing a single scalar prediction, interlinked by the question and answer networks as suggested previously. For a network of n nodes, the vector of all predictions at time step t is denoted y_t = (y_t^1, ..., y_t^n)^T. The predictions are estimates of the expected value of some scalar quantity, typically of a bit, in which case they can be interpreted as estimates of probabilities. The predictions are updated at each time step according to a vector-valued function u with modifiable parameter W, which is often taken to be of a linear form:

    y_t = u(y_{t-1}, a_{t-1}, o_t, W_t) = σ(W_t x_t),                                  (1)

where x_t ∈ R^m is an m-vector of features created from (y_{t-1}, a_{t-1}, o_t), W_t is an n × m matrix (whose elements are sometimes referred to as weights), and σ is the n-vector form of either the identity function or the S-shaped logistic function σ(s) = 1/(1 + e^{-s}). The feature vector is an arbitrary vector-valued function of y_{t-1}, a_{t-1}, and o_t. For example, in the simplest case the feature vector is a unit basis vector with the location of the one communicating the current state. In a partially observable environment, the feature vector may be a combination of the agent's action, observations, and predictions from the previous time step. The overall update u defines the answer network.

The question network consists of a set of target functions, z^i : O × R^n → R, and condition functions, c^i : A × R^n → [0, 1]. We define z_t^i = z^i(o_{t+1}, ỹ_{t+1}) as the target for prediction y_t^i.² Similarly, we define c_t^i = c^i(a_t, y_t) as the condition at time t. The learning algorithm for each component w_t^{ij} of W_t can then be written

    w_{t+1}^{ij} = w_t^{ij} + α (z_t^i − y_t^i) c_t^i ∂y_t^i/∂w_t^{ij},                 (2)

where α is a positive step-size parameter. Note that the targets here are functions of the observation and predictions exactly one time step later, and that the conditions are functions of a single primitive action. This is what makes this algorithm suitable only for learning about one-step TD relationships. By chaining together multiple nodes, Sutton and Tanner (2005) used it to predict k steps ahead, for various particular values of k, and to predict the outcome of specific action sequences (as in PSRs, e.g., Littman et al., 2002; Singh et al., 2004). Now we consider the extension to temporally abstract actions.

² The quantity ỹ_t is almost the same as y_t, and we encourage the reader to think of them as identical here. The difference is that ỹ_t is calculated by weights that are one step out of date as compared to y_t, i.e., ỹ_t = u(y_{t-1}, a_{t-1}, o_t, W_{t-1}) (cf. equation 1).
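As a reading aid, here is a minimal Python sketch of the answer-network update (1) and the one-step learning rule (2), assuming the feature vector x_t is supplied externally; the function names and the `use_logistic` flag are illustrative assumptions, not code from the paper.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def answer_update(W, x, use_logistic=True):
    """Equation (1): y_t = sigma(W_t x_t), with sigma the logistic or the identity."""
    s = W @ x
    return sigmoid(s) if use_logistic else s

def td_network_step(W, x, y, z, c, alpha, use_logistic=True):
    """One-step TD-network learning rule, equation (2).

    W : (n, m) weight matrix           y : (n,) predictions y_t = sigma(W x)
    x : (m,) feature vector x_t        z : (n,) targets z_t (from o_{t+1} and y~_{t+1})
    c : (n,) conditions c_t (from a_t and y_t); alpha is the positive step size.
    """
    # dy_t^i/dw^{ij} is x^j for the identity sigma and y^i (1 - y^i) x^j for the logistic.
    grad_scale = y * (1.0 - y) if use_logistic else np.ones_like(y)
    # w_{t+1}^{ij} = w_t^{ij} + alpha * (z^i - y^i) * c^i * dy^i/dw^{ij}
    return W + alpha * ((z - y) * c * grad_scale)[:, None] * x[None, :]
```

With the identity σ the gradient factor drops out and the rule reduces to the linear case used later in the fully observable experiment.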

3 Option-extended TD networks

In this section we present our intra-option learning algorithm for TD networks with options and eligibility traces. As suggested earlier, each node's outgoing link in the question network will now correspond to an option applying over possibly many steps. The policy of the ith node's option corresponds to the condition function c^i, which we think of as a recognizer for the option. It inspects each action taken to assess whether the option is being followed: c_t^i = 1 if the agent is acting consistently with the option policy and c_t^i = 0 otherwise (intermediate values are also possible). When an agent ceases to act consistently with the option policy, we say that the option has diverged. The possibility of recognizing more than one action as consistent with the option is a significant generalization of the original idea of options. If no actions are recognized as acceptable in a state, then the option cannot be followed and thus cannot be initiated. Here we take the set of states with at least one recognized action to be the initiation set of the option.

The option-termination function β generalizes naturally to TD networks. Each node i is given a corresponding termination function, β^i : O × R^n → [0, 1], where β_t^i = β^i(o_{t+1}, y_t) is the probability of terminating at time t.³ β_t^i = 1 indicates that the option has terminated at time t; β_t^i = 0 indicates that it has not, and intermediate values of β correspond to soft or stochastic termination conditions. If an option terminates, then z_t^i acts as the target, but if the option is ongoing without termination, then the node's own next value, ỹ_{t+1}^i, should be the target. The termination function specifies which of the two targets (or mixture of the two targets) is used to produce a form of TD error for each node i:

    δ_t^i = β_t^i z_t^i + (1 − β_t^i) ỹ_{t+1}^i − y_t^i.                               (3)

Our option-extended algorithm incorporates eligibility traces (see Sutton & Barto, 1998) as short-term memory variables organized in an n × m matrix E, paralleling the weight matrix. The traces are a record of the effect that each weight could have had on each node's prediction during the time the agent has been acting consistently with the node's option. The components e_t^{ij} of the eligibility matrix are updated by

    e_t^{ij} = c_t^i [ λ e_{t-1}^{ij} (1 − β_t^i) + ∂y_t^i/∂w_t^{ij} ],                 (4)

where 0 ≤ λ ≤ 1 is the trace-decay parameter familiar from the TD(λ) learning algorithm. Because of the c_t^i factor, all of a node's traces will be immediately reset to zero whenever the agent deviates from the node's option's policy. If the agent follows the policy and the option does not terminate, then the trace decays by λ and increments by the gradient in the way typical of eligibility traces. If the policy is followed and the option does terminate, then the trace will be reset to zero on the immediately following time step, and a new trace will start building. Finally, our algorithm updates the weights on each time step by

    w_{t+1}^{ij} = w_t^{ij} + α δ_t^i e_t^{ij}.                                         (5)

³ The fact that the option depends only on the current predictions, action, and observation means that we are considering only Markov options.
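The per-step update implied by equations (3)-(5) can be sketched as follows, assuming a linear answer network (identity σ) so that ∂y_t^i/∂w_t^{ij} = x_t^j. Variable names are illustrative; this is our reading of the algorithm, not the authors' code.

```python
import numpy as np

def option_td_step(W, E, x, y, y_tilde_next, z, c, beta, alpha, lam):
    """One step of intra-option TD-network learning with eligibility traces.

    W, E         : (n, m) weight and eligibility-trace matrices
    x            : (m,) feature vector x_t (linear answer net, so dy_t^i/dw^{ij} = x^j)
    y            : (n,) predictions y_t
    y_tilde_next : (n,) next predictions y~_{t+1}, computed with the previous weights
    z            : (n,) targets z_t, used to the extent the options terminate
    c            : (n,) conditions c_t (1 when the action taken is consistent with node i's option)
    beta         : (n,) termination probabilities beta_t^i
    """
    # Equation (3): TD error mixing the termination target z with the bootstrapped target y~_{t+1}.
    delta = beta * z + (1.0 - beta) * y_tilde_next - y
    # Equation (4): traces decay while the option continues and reset when it diverges or terminates.
    E = c[:, None] * (lam * E * (1.0 - beta)[:, None] + x[None, :])
    # Equation (5): weight update.
    W = W + alpha * delta[:, None] * E
    return W, E
```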

4 Fully observable experiment

This experiment was designed to test the correctness of the algorithm in a simple gridworld where the environmental state is observable. We applied an option-extended TD network to the problem of learning to predict observations from interaction with the gridworld environment shown on the left in Figure 1. Empty squares indicate spaces where the agent can move freely, and colored squares (shown shaded in the figure) indicate walls. The agent is egocentric. At each time step the agent receives from the environment six bits representing the color it is facing (red, green, blue, orange, yellow, or white). In this first experiment we also provided 6 × 6 × 4 = 144 other bits directly indicating the complete state of the environment (square and orientation).

Figure 1: The test world (left) and the question network (right) used in the experiments. The triangle in the world indicates the location and orientation of the agent. The walls are labeled R, O, Y, G, and B, representing the colors red, orange, yellow, green and blue. Note that the left wall is mostly blue but partly green. The right diagram shows in full the portion of the question network corresponding to the red bit. This structure is repeated, but not shown, for the other four (non-white) colors. L, R, and F are primitive actions, and Forward and Wander are options.

There are three possible actions: A = {F, R, L}. Actions were selected according to a fixed stochastic policy independent of the state. The probabilities of the F, L, and R actions were 0.5, 0.25, and 0.25 respectively. L and R cause the agent to rotate 90 degrees to the left or right. F causes the agent to move ahead one square with probability 1 − p and to stay in the same square with probability p. The probability p is called the slipping probability. If the forward movement would cause the agent to move into a wall, then the agent does not move. In this experiment, we used p = 0, p = 0.1, and p = 0.5.

In addition to these primitive actions, we provided two temporally abstract options, Forward and Wander. The Forward option takes the action F in every state and terminates when the agent senses a wall (color) in front of it. The policy of the Wander option is the same as that actually followed by the agent. Wander terminates with probability 1 when a wall is sensed, and spontaneously with probability 0.5 otherwise.

We used the question network shown on the right in Figure 1. The predictions of nodes 1, 2, and 3 are estimates of the probability that the red bit would be observed if the corresponding primitive action were taken. Node 4 is a prediction of whether the agent will see the red bit upon termination of the Wander option if it were taken. Node 5 predicts the probability of observing the red bit given that the Forward option is followed until termination. Nodes 6 and 7 represent predictions of the outcome of a primitive action followed by the Forward option. Nodes 8 and 9 take this one step further: they represent predictions of the red bit if the Forward option were followed to termination, then a primitive action were taken, and then the Forward option were followed again to termination.

We applied our algorithm to learn the parameter W of the answer network for this question network. The step-size parameter α was 1.0, and the trace-decay parameter λ was 0.9. The initial W_0, E_0, and y_0 were all 0. Each run began with the agent in the state indicated in Figure 1 (left). In this experiment σ(·) was the identity function.

For each value of p, we ran 50 runs of 20,000 time steps. On each time step, the root-mean-squared (RMS) error in each node's prediction was computed and then averaged over all the nodes. The nodes corresponding to the Wander option were not included in the average because of the difficulty of calculating their correct predictions. This average was then itself averaged over the 50 runs and bins of 1,000 time steps to produce the learning curves shown on the left in Figure 2.
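Before turning to the results, the condition and termination functions of the two options just described might be encoded as below. Treating "a wall (color) is sensed in front" as a single boolean input is an assumption made for illustration; the function names are not from the paper.

```python
def forward_condition(action):
    """c^Forward: only the F action is consistent with the Forward option."""
    return 1.0 if action == 'F' else 0.0

def forward_termination(wall_in_front):
    """beta^Forward: terminate exactly when a wall color is sensed ahead."""
    return 1.0 if wall_in_front else 0.0

def wander_condition(action):
    """c^Wander: Wander's policy is the behavior policy itself, so every action is recognized."""
    return 1.0

def wander_termination(wall_in_front):
    """beta^Wander: terminate for sure at a wall, otherwise spontaneously with probability 0.5."""
    return 1.0 if wall_in_front else 0.5
```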

Figure 2: Learning curves in the fully-observable experiment for each slippage probability (left) and in the partially-observable experiment (right).

For all slippage probabilities, the error in all predictions fell almost to zero. After approximately 12,000 trials, the agent made almost perfect predictions in all cases. Not surprisingly, learning was slower at the higher slippage probabilities. These results show that our augmented TD network is able to make a complete temporally-abstract model of this world.

5 Partially observable experiment

In our second experiment, only the six color observation bits were available to the agent. This experiment provides a more challenging test of our algorithm. To model the environment well, the TD network must construct a representation of state from very sparse information. In fact, completely accurate prediction is not possible in this problem with our question network.

In this experiment the input vector consisted of three groups of 46 components each, 138 in total. If the action was R, the first 46 components were set to the 40 node values and the six observation bits, and the other components were 0. If the action was L, the next group of 46 components was filled in in the same way, and the first and third groups were zero. If the action was F, the third group was filled. This technique enables the answer network, as a function approximator, to represent a wider class of functions in a linear form than would otherwise be possible. In this experiment, σ(·) was the S-shaped logistic function. The slippage probability was p = 0.1.

As our performance measure we used the RMS error, as in the first experiment, except that the predictions for the primitive actions (nodes 1-3) were not included. These predictions can never become completely accurate because the agent cannot tell in detail where it is located in the open space. As before, we averaged the RMS error over 50 runs and 1,000 time step bins, to produce the learning curve shown on the right in Figure 2. As before, the RMS error approached zero.
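A small sketch of this action-conditioned feature construction, assuming the 40 node values and six observation bits arrive as NumPy arrays; the group ordering (R, L, F) follows the description above, and the function name is an illustrative assumption.

```python
import numpy as np

def build_features(action, y_prev, obs_bits):
    """Build the 138-component input from three 46-component groups, one group per action.

    y_prev   : (40,) node values from the previous time step
    obs_bits : (6,) current color-observation bits
    Only the group belonging to the action just taken is filled; the other two stay zero.
    """
    group = np.concatenate([y_prev, obs_bits])            # 46 components
    x = np.zeros(3 * group.size)                          # 138 components in total
    slot = {'R': 0, 'L': 1, 'F': 2}[action]               # R fills the first group, L the second, F the third
    x[slot * group.size:(slot + 1) * group.size] = group
    return x
```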

Node 5 in Figure 1 holds the prediction of red if the agent were to march forward to the wall ahead of it. Corresponding nodes in the other subnetworks hold the predictions of the other colors upon Forward. To make these predictions accurately, the agent must keep track of which wall it is facing, even if it is many steps away from it. It has to learn a sort of compass that it can keep updated as it turns in the middle of the space. Figure 3 is a demonstration of the compass learned after a representative run of 200,000 time steps. At the end of the run, the agent was driven manually to the state shown in the first row (relative time index t = 1). On steps 1-25 the agent was spun clockwise in place. The third column shows the prediction for node 5 in each portion of the question network. That is, the predictions shown are for each color-observation bit at termination of the Forward option. At t = 1, the agent is facing the orange wall and it predicts that the Forward option would result in seeing the orange bit and none other. Over steps 2-5 we see that the predictions are maintained accurately as the agent spins, despite the fact that its observation bits remain the same. Even after spinning for 25 steps the agent knows exactly which way it is facing. While spinning, the agent correctly never predicts seeing the green bit (after Forward), but if it is driven up and turned, as in the last row of the figure, the green bit is accurately predicted.

The fourth column shows the prediction for node 8 in each portion of the question network. Recall that these nodes correspond to the sequence Forward, L, Forward. At time t = 1, the agent accurately predicts that Forward will bring it to orange (third column) and also predicts that Forward, L, Forward will bring it to green. The predictions made for node 8 at each subsequent step of the sequence are also correct. These results show that the agent is able to accurately maintain its long-term predictions without directly encountering sensory verification.

How much larger would the TD network have to be to handle a 100x100 gridworld? The answer is not at all. The same question network applies to any size problem. If the layout of the colored walls remains the same, then even the answer network transfers across worlds of widely varying sizes. In other experiments, training on successively larger problems, we have shown that the same TD network as used here can learn to make all the long-term predictions correctly on a 100x100 version of the 6x6 gridworld used here.

Figure 3: An illustration of part of what the agent learns in the partially observable environment. The second column is a sequence of states with (relative) time index as given by the first column. The sequence was generated by controlling the agent manually. On steps 1-25 the agent was spun clockwise in place, and the trajectory after that is shown by the line in the last state diagram. The third and fourth columns show the values of the nodes corresponding to 5 and 8 in Figure 1, one for each color-observation bit.

6 Conclusion

Our experiments show that option-extended TD networks can learn effectively. They can learn facts about their environments that are not representable in conventional TD networks or in any other method for learning models of the world. One concern is that our intra-option learning algorithm is an off-policy learning method incorporating function approximation and bootstrapping (learning from predictions). The combination of these three is known to produce convergence problems for some methods (see Sutton & Barto, 1998), and they may arise here. A sound solution may require modifications to incorporate importance sampling (see Precup, Sutton & Dasgupta, 2001). In this paper we have considered only intra-option eligibility traces: traces extending over the time span within an option but not persisting across options. Tanner and Sutton (2005) have proposed a method for inter-option traces that could perhaps be combined with our intra-option traces.

The primary contribution of this paper is the introduction of a new learning algorithm for TD networks that incorporates options and eligibility traces. Our experiments are small and do little more than exercise the learning algorithm, showing that it does not break immediately. More significant is the greater representational power of option-extended TD networks. Options are a general framework for temporal abstraction, predictive state representations are a promising strategy for state abstraction, and TD networks are able to represent compositional questions. The combination of these three is potentially very powerful and worthy of further study.

Acknowledgments

The authors gratefully acknowledge the ideas and encouragement they have received in this work from Mark Ring, Brian Tanner, Satinder Singh, Doina Precup, and all the members of the rlai.net group.

References

Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12(6). MIT Press.

Littman, M., Sutton, R. S., & Singh, S. (2002). Predictive representations of state. In T. G. Dietterich, S. Becker and Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14. MIT Press.

Precup, D., Sutton, R. S., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In C. E. Brodley and A. P. Danyluk (eds.), Proceedings of the Eighteenth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann.

Rafols, E. J., Ring, M., Sutton, R. S., & Tanner, B. (2005). Using predictive representations to improve generalization in reinforcement learning. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence.

Rivest, R. L., & Schapire, R. E. (1987). Diversity-based inference of finite automata. In Proceedings of the Twenty-Eighth Annual Symposium on Foundations of Computer Science. IEEE Computer Society.

Singh, S., James, M. R., & Rudary, M. R. (2004). Predictive state representations: A new theory for modeling dynamical systems. In Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence. AUAI Press.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112.

Sutton, R. S., & Tanner, B. (2005). Temporal-difference networks. To appear in Advances in Neural Information Processing Systems 17.

Tanner, B., & Sutton, R. S. (2005). Temporal-difference networks with history. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence.
