Learning to Take Concurrent Actions


Khashayar Rohanimanesh
Department of Computer Science, University of Massachusetts, Amherst, MA 01003
khash@cs.umass.edu

Sridhar Mahadevan
Department of Computer Science, University of Massachusetts, Amherst, MA 01003
mahadeva@cs.umass.edu

Abstract

We investigate a general semi-Markov Decision Process (SMDP) framework for modeling concurrent decision making, where agents learn optimal plans over concurrent temporally extended actions. We introduce three types of parallel termination schemes (all, any and continue) and compare them theoretically and experimentally.

1 Introduction

We investigate a general framework for modeling concurrent actions. The notion of concurrent action is formalized in a general way, to capture both situations where a single agent can execute multiple parallel processes, as well as the multi-agent case where many agents act in parallel. Concurrency clearly allows agents to achieve goals more quickly: in making breakfast, we interleave making toast and coffee with other activities such as getting milk; in driving, we search for road signs while controlling the wheel, accelerator and brakes.

Most previous work on concurrency has focused on parallelizing primitive (unit step) actions. Reiter developed axioms for concurrent planning using the situation calculus framework [4]. Knoblock [3] and Boutilier [1] modify the STRIPS representation of actions to allow for concurrent actions. These approaches assume deterministic effects. Prior work in decision-theoretic planning includes work on multi-dimensional vector action spaces [2], and models based on dynamic merging of multiple MDPs [6]. There is also a massive literature on concurrent processes, dynamic logic, and temporal logic. Parts of these lines of research deal with the specification and synthesis of concurrent actions, including probabilistic ones [8]. In contrast, we focus on parallelizing temporally extended actions. The concurrency framework described below significantly extends our previous work [5].
We provide a detailed analysis of three termination schemes for composing parallel action structures. The three schemes (any, all, and continue) are illustrated in Figure 1. We characterize the class of policies under each scheme. We also theoretically compare the optimality of the concurrent policies under each scheme with that of the typical sequential case. The theoretical results are complemented by an experimental study, which illustrates the trade-offs between optimality and convergence speed, and the advantages of concurrency over sequentiality.

2 Concurrent Action Model

Building on SMDPs, we introduce the Concurrent Action Model (CAM) $(S, A, T, R)$, where $S$ is a set of states, $A$ is a set of primary actions, $T$ is a transition probability distribution $S \times \wp(A) \times S \times \mathbb{N} \to [0, 1]$, where $\wp(A)$ is the power set of the primary actions and $\mathbb{N}$ is the set of natural numbers, and $R$ is the reward function mapping $S \to \mathbb{R}$. Here, a concurrent action is simply represented as a set of primary actions (hereafter called a multi-action), where each primary action is either a single-step action or a temporally extended action (e.g., modeled as a closed-loop policy over single-step actions [7]). We denote the set of multi-actions that can be executed in a state $s$ by $A(s)$. In practice, this function can capture resource constraints that limit how many actions an agent can execute in parallel. Thus, the transition probability distribution in practice may be defined over a much smaller subset than the power set of primary actions (e.g., in the grid world example in Figure 3, the power set is > 100, but the set of concurrent actions is only 10).

[Figure 1: Left: $T_{any}$ termination scheme. Middle: $T_{all}$ termination scheme. Right: $T_{continue}$ termination scheme. Each panel shows the current multi-action $a_t = \{a_1, a_2, a_3, a_4\}$ initiated in state $s_t$ at decision epoch $d_n$, which primary actions have terminated, been interrupted, or continue to run at state $s_{t+k}$, and the next multi-action chosen at decision epoch $d_{n+1}$.]

A principal goal of this paper is to understand how to define decision epochs for concurrent processes, since the primary actions in a multi-action may not terminate at the same time. The event of termination of a multi-action can be defined in many ways. Three termination schemes are illustrated in Figure 1.
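As a rough illustration of how $A(s)$ can be much smaller than the power set, the sketch below enumerates multi-actions under one assumed kind of resource constraint: primary actions are partitioned into groups, and at most one action per group may run at a time. The group names, action names and the function `admissible_multi_actions` are hypothetical and are not taken from the paper.

```python
from itertools import product


def admissible_multi_actions(groups):
    """Enumerate non-empty multi-actions with at most one primary action per group.

    `groups` maps a (hypothetical) resource name to the primary actions that
    compete for that resource. Each multi-action is returned as a frozenset.
    """
    # For every group, either pick none of its actions (None) or exactly one.
    options = [[None] + list(actions) for actions in groups.values()]
    multi_actions = []
    for combo in product(*options):
        chosen = frozenset(a for a in combo if a is not None)
        if chosen:  # the empty set is not a multi-action
            multi_actions.append(chosen)
    return multi_actions


# Illustrative (assumed) groups: two navigation actions and one key action.
groups = {"navigation": ["up", "down"], "key1": ["get-key1"]}
multi_actions = admissible_multi_actions(groups)
```

With three primary actions the power set has seven non-empty subsets, while this constraint admits only five of them, because "up" and "down" cannot run together.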
In the $T_{any}$ termination scheme (Figure 1, left), the next decision epoch is when the first primary action within the multi-action currently being executed terminates, where the rest of the primary actions that did not terminate naturally are interrupted (the notion of interruption is similar to [7]). In the $T_{all}$ termination scheme (Figure 1, middle), the next decision epoch is the earliest time at which all the primary actions within the multi-action currently being executed have terminated. We can design other termination schemes by combining $T_{any}$ and $T_{all}$: for example, another termination scheme called continue is one that always terminates based on the $T_{any}$ termination scheme, but lets those primary actions that did not terminate naturally continue running, while initiating new primary actions if they are going to be useful (Figure 1, right).

A deterministic Markovian (memoryless) policy in CAMs is defined as the mapping $\pi : S \to \wp(A)$. Note that even though the mapping is defined independent of the termination scheme, the behavior of a multi-action policy depends on the termination scheme that is used in the model. To illustrate this, let $\langle \pi, \tau \rangle$ (called a policy-termination construct) denote the process of executing the multi-action policy $\pi$ using the termination scheme $\tau \in \{T_{any}, T_{all}\}$. To simplify notation, we only use this form whenever we want to explicitly point out what termination scheme is being used for executing the policy $\pi$. For a given Markovian policy, we can write the value of that policy in an arbitrary state given the termination mechanism used in the model. Let $\Theta(\pi, s_t, \tau)$ denote the event of initiating the multi-action $\pi(s_t)$ at time $t$ and terminating it according to the $\tau \in \{T_{any}, T_{all}\}$ termination scheme. Also let $\pi^{*\tau}$ denote the optimal multi-action policy within the space of policies over multi-actions that terminate according to the $\tau \in \{T_{any}, T_{all}\}$ termination scheme. To simplify notation, we may alternatively use $*\tau$ to denote optimality with respect to the $\tau$ termination scheme. Then the optimal value function can be written as:

$$V^{*\tau}(s_t) = E\left\{ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k} \max_{a \in A(s_{t+k})} Q^{*\tau}(s_{t+k}, a) \;\middle|\; \Theta(\pi^{*\tau}, s_t, \tau) \right\}$$

where $Q^{*\tau}(s_{t+k}, a)$ denotes the multi-action value of executing $a$ in state $s_{t+k}$ (terminated using $\tau$) and following the optimal policy $\pi^{*\tau}$ thereafter.

The policy associated with the continue termination scheme is a history-dependent policy, since for a given state $s_t$, the continue policy will select a multi-action such that it includes the set of all the primary actions of the multi-action executed in the previous decision epoch that did not terminate naturally in the current state $s_t$ (we refer to this set as the continue-set, represented by $h_t$). The continue policy is defined as the mapping $\pi_{cont} : S \times H \to \wp(A)$, in which $H$ is a set of continue-sets $h_t$. Note that the value function for the continue policy should be defined over both the state $s_t$ and the continue-set $h_t$ (represented by $\langle s_t, h_t \rangle$), i.e., $V^{\pi_{cont}}(\langle s_t, h_t \rangle)$. Let the function $A(s_t, h_t)$ return the set of multi-actions that can be executed in state $s_t$ that include the continuing primary actions in $h_t$.
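The three termination schemes and the continue-set can be sketched in a few lines. This is a minimal illustration, assuming the number of steps each primary action needs before it terminates naturally is already known (in the model these durations are stochastic). The names `next_decision_epoch` and `multi_actions_including` are hypothetical helpers, the latter standing in for $A(s_t, h_t)$.

```python
def next_decision_epoch(durations, scheme):
    """Return (k, continue_set) for a multi-action under a termination scheme.

    durations: dict mapping each primary action to the number of steps until
               it terminates naturally (assumed known here for illustration).
    scheme:    "any", "all", or "continue".
    """
    if scheme == "all":
        # Wait until every primary action has terminated.
        return max(durations.values()), frozenset()
    k = min(durations.values())  # the first natural termination
    if scheme == "any":
        # Unfinished primary actions are interrupted, so nothing carries over.
        return k, frozenset()
    if scheme == "continue":
        # Unfinished primary actions keep running and form the continue-set.
        return k, frozenset(a for a, d in durations.items() if d > k)
    raise ValueError(f"unknown termination scheme: {scheme}")


def multi_actions_including(multi_actions, continue_set):
    """Keep only the multi-actions that contain every continuing primary action."""
    return [a for a in multi_actions if continue_set <= a]


durations = {"a1": 2, "a2": 5, "a3": 3, "a4": 7}
k_any, h_any = next_decision_epoch(durations, "any")
k_all, h_all = next_decision_epoch(durations, "all")
k_cont, h_cont = next_decision_epoch(durations, "continue")
```

Under these assumed durations, the any and continue schemes both reach the next decision epoch after 2 steps, while the all scheme waits 7 steps; only the continue scheme carries the unfinished actions forward.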
The continue policy is then formally defined as:

$$\pi_{cont}(\langle s_t, h_t \rangle) = \arg\max_{a \in A(s_t, h_t)} Q^{\pi_{cont}}(\langle s_t, h_t \rangle, a).$$

To illustrate this, assume that the current state is $s_t$ and the multi-action $a_t = \{a_1, a_2, a_3, a_4\}$ is executed in state $s_t$. Also, assume that the primary action $a_1$ is the first action that terminates after $k$ steps in state $s_{t+k}$. According to the definition of the continue termination scheme (that terminates based on $T_{any}$), the multi-action $a_t$ is terminated at time $t + k$ and we need to select a new multi-action to execute in state $s_{t+k}$ (with the continue-set $h_{t+k} = \{a_2, a_3, a_4\}$). The continue policy will select the best multi-action $a_{t+k}$ that includes the primary actions $\{a_2, a_3, a_4\}$, since they did not terminate in state $s_{t+k}$ (see Figure 1, right).

3 Theoretical Results

In this section we present some of our theoretical results comparing the optimality of various policies under the different termination schemes introduced in the previous section. In all of these theorems we use the partial ordering relation $V^{\pi_1} \le V^{\pi_2} \Rightarrow \pi_1 \le \pi_2$ in order to compare different policies. For lack of space, we abbreviate the proofs. Note that in Theorems 1 and 3, which compare the continue policy with the $\pi^{*any}$ and $\pi^{*all}$ policies, the value function is written over the pair $\langle s_t, h_t \rangle$ to be consistent with the definition of the continue policy. This does not influence the original definition of the value function for the optimal policies in the $T_{any}$ and $T_{all}$ termination schemes, since they are independent of the continue-set $h_t$.

First, we compare the optimal multi-action policies based on the $T_{any}$ termination scheme and the continue policy.

Theorem 1: For every state $s_t \in S$ and every continue-set $h_t \in H$, $V^{\pi_{cont}}(\langle s_t, h_t \rangle) \le V^{*any}(\langle s_t, h_t \rangle)$.

Proof: By writing the value function definition for each case we have:

$$V^{\pi_{cont}}(\langle s_t, h_t \rangle) = \max_{a \in A(s_t, h_t)} Q^{\pi_{cont}}(\langle s_t, h_t \rangle, a) \le \max_{a \in A(s_t)} Q^{\pi_{cont}}(\langle s_t, h_t \rangle, a) \le \max_{a \in A(s_t)} Q^{*any}(\langle s_t, h_t \rangle, a) = V^{*any}(\langle s_t, h_t \rangle)$$

The inequality holds since the maximization in $\pi_{cont}$ is over a smaller set (i.e., $A(s_t, h_t)$), which is a subset of the larger set $A(s_t)$ that is maximized over in the $\pi^{*any}$ case.

Next, we show that the optimal plans with multi-actions that terminate according to the $T_{any}$ termination scheme are better than the optimal plans with multi-actions that terminate according to the $T_{all}$ termination scheme:

Theorem 2: For every state $s \in S$, $V^{*all}(s) \le V^{*any}(s)$.

Proof: The proof is based on the following lemma, which states that if we alter the execution of the optimal multi-action policy based on $T_{all}$ (i.e., $\pi^{*all}$) in such a way that at every decision epoch the next multi-action is still selected from $\pi^{*all}$, but we terminate it based on $T_{any}$, then the new policy-termination construct, represented by $\langle *all, any \rangle$, is better than the $\pi^{*all}$ policy. Intuitively this makes sense, since if we interrupt $\pi^{*all}(s)$ when the first primary action $a_i \in a = \pi^{*all}(s)$ terminates in some future state $s'$, then due to the optimality of $\pi^{*all}$, executing $\pi^{*all}(s')$ is always better than or equal to continuing some other policy such as the one in progress (i.e., $\pi^{*all}(s)$). Note that the proof is not as simple as in the first theorem, since the two different policies discussed in this theorem (i.e., $\pi^{*any}$ and $\pi^{*all}$) are not being executed using the same termination method.

Lemma 1: For every state $s \in S$, $V^{*all}(s) \le V^{\langle *all, any \rangle}(s)$.

Proof: Let $V^{*all}_{n,any}(s)$ denote the value of following the optimal $\pi^{*all}$ policy in state $s$, where for the first $n$ decision epochs we use the $T_{any}$ termination scheme and for the rest we use the $T_{all}$ termination scheme.
By induction on $n$, we can show that $V^{*all}(s) \le V^{*all}_{n,any}(s)$ for all $s \in S$ and for all $n$. This suggests that if we always terminate a multi-action $\pi^{*all}(s_t)$ according to the $T_{any}$ termination scheme, we achieve a better return; or mathematically, $V^{*all}(s) \le \lim_{n \to \infty} V^{*all}_{n,any}(s) = V^{\langle *all, any \rangle}(s)$.

Using Lemma 1 and the optimality of $\pi^{*any}$ in the space of policies whose termination scheme is $T_{any}$, it follows that $V^{*all}(s) \le V^{\langle *all, any \rangle}(s) \le V^{*any}(s)$.

Next, we show that if we execute the continue policy, in which at any decision epoch we always execute the best set of primary actions along with those that were executed in the previous decision epoch and have not terminated yet, we achieve a better return compared to the case in which we execute the best set of primary actions but always wait until all of the primary actions terminate before making a new decision:

Theorem 3: For every state $s_t \in S$ and every continue-set $h_t \in H$, $V^{*all}(\langle s_t, h_t \rangle) \le V^{\pi_{cont}}(\langle s_t, h_t \rangle)$.

Proof: In $\pi^{*all}$ policies, multi-actions are executed until all of the primary actions of that multi-action terminate. The continue policy, however, may also initiate new useful primary actions in addition to those already running, which may achieve a better return. Let $V^{*all}_{n,cont}(\langle s_t, h_t \rangle)$ denote the value of the altered policy $\pi^{*all}$ that works as follows: for a given state and continue-set $\langle s_t, h_t \rangle$, the policy $\pi^{*all}(\langle s_t, h_t \rangle)$ is executed, while for the first $n$ decision epochs we use the continue termination scheme (which means terminating according to $T_{any}$ and selecting the next multi-action according to the continue policy) and for the rest we use the $T_{all}$ termination scheme. By induction on $n$, it can be shown that $V^{*all}(\langle s_t, h_t \rangle) \le V^{*all}_{n,cont}(\langle s_t, h_t \rangle)$ for all $n$. This suggests that as we increase $n$, the altered policy behaves more like the continue policy, and thus in the limit we have $V^{*all}(\langle s_t, h_t \rangle) \le \lim_{n \to \infty} V^{*all}_{n,cont}(\langle s_t, h_t \rangle) = V^{\pi_{cont}}(\langle s_t, h_t \rangle)$, which proves the theorem.

Finally, we show that the optimal multi-action policies based on the $T_{all}$ termination scheme are at least as good as the case where the agent always executes a single primary action at a time, as is the case in standard SMDPs. Note that this theorem does not state that concurrent plans are always better than sequential ones; it simply says that if, in a problem, the sequential execution of the primary actions is the best policy, CAM is able to represent and find that policy. Let $\pi^{*seq}$ represent the optimal policy in the sequential case, where only one primary action can be executed at a time:

Theorem 4: For every state $s \in S$, $V^{*seq}(s) \le V^{*all}(s)$, in which $V^{*seq}(s)$ is the value of the optimal policy when the primary actions are executed one at a time sequentially.

Proof: It suffices to show that sequential policies are within the space of concurrent policies. This holds since a single primary action can be considered as a multi-action containing only one primary action, whose termination is consistent with either of the multi-action termination schemes (i.e., in the sequential case the $T_{any}$ and $T_{all}$ termination schemes are the same).

Corollary 1 summarizes our theoretical results.
It shows how different policies in a concurrent action model using different termination schemes compare to each other in terms of optimality.

Corollary 1: In a concurrent action model with the set of termination schemes $\{T_{any}, T_{all}, continue\}$, the following partial ordering holds among the optimal policy based on $T_{any}$, the optimal policy based on $T_{all}$, the continue policy and the optimal sequential policy: $\pi^{*seq} \le \pi^{*all} \le \pi_{cont} \le \pi^{*any}$.

Proof: This follows immediately from the above theorems.

Figure 2 visually describes the summary of results that we presented in Corollary 1. According to this figure, the optimal multi-action policies based on $T_{any}$ and $T_{all}$, and also the continue multi-action policies, dominate (with respect to the partial ordering relation defined over policies) the optimal policies over the sequential case. Furthermore, policies based on continue multi-actions dominate the optimal multi-action policies based on the $T_{all}$ termination scheme, while themselves being dominated by the optimal multi-action policies based on the $T_{any}$ termination scheme.
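One way to read the one-line proof of Corollary 1 is as a chain of the value inequalities established earlier. The block below is only a restatement in LaTeX, labelling each link with the theorem that supplies it.

```latex
\begin{align*}
V^{*seq}(s_t) &\le V^{*all}(s_t)
  && \text{(Theorem 4)} \\
V^{*all}(\langle s_t, h_t \rangle) &\le V^{\pi_{cont}}(\langle s_t, h_t \rangle)
  && \text{(Theorem 3)} \\
V^{\pi_{cont}}(\langle s_t, h_t \rangle) &\le V^{*any}(\langle s_t, h_t \rangle)
  && \text{(Theorem 1)}
\end{align*}
```

Applying the partial ordering relation over policies to each line gives the ordering stated in the corollary. Theorem 2 relates the two ends of the last two links directly, and is also implied by them through transitivity.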

[Figure 2: Comparison of policies over multi-actions and sequential primary actions using different termination schemes. Region labels: multi-action policies using $T_{any}$; continue multi-action policies; multi-action policies using $T_{all}$; policies over sequential actions.]

4 Experimental Results

In this section we present experimental results using a grid world task comparing various termination schemes (see Figure 3). Each hallway connects two rooms, and has a door with two locks. An agent has to retrieve two keys and hold both keys at the same time in order to open both locks. The process of picking up keys is modeled as a temporally extended action that takes a different amount of time for each key. Moreover, keys cannot be held indefinitely, since the agent may drop a key occasionally. Therefore the agent needs to find an efficient solution for picking up the keys in parallel with navigation to act optimally. This is an episodic task, in which at the beginning of each episode the agent is placed in a fixed position (upper left corner) and the goal of the agent is to navigate to a fixed goal position (hallway H3).

[Figure 3: A navigation problem that requires concurrent plans. There are two locks on each door, which need to be opened simultaneously. Retrieving each key takes different amounts of time. Labels in the figure: the agent; hallways H0, H1, H2 and H3 (Goal). Annotations in the figure:
- 4 stochastic primitive actions (Up, Down, Left and Right); they fail 10% of the time, and on failure the agent moves randomly to one of the neighbors
- 8 multi-step navigation actions (to each room's 2 hallways)
- One primitive no-op action
- 3 stochastic primitive actions for keys (get-key, key-nop and putback-key)
- 2 multi-step key actions (pickup-key), one for each key
- Each key is dropped 30% of the time when holding it]

The agent can execute two types of action concurrently: (1) navigation actions, and (2) key actions. Navigation actions include a set of one-step stochastic navigation actions (Up, Left, Down and Right) that move the agent in the corresponding direction with probability 0.9 and fail with probability 0.1. Upon failure the agent moves instead in one of the other three directions, each with probability 1/30.
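The outcome distribution of a one-step navigation action can be written down directly from the probabilities above. The sketch below is illustrative only: it models the direction actually taken, not the grid itself, and the function names are hypothetical.

```python
import random

DIRECTIONS = ("Up", "Down", "Left", "Right")


def navigation_outcome_distribution(intended):
    """Probability of each direction actually taken for an intended direction."""
    if intended not in DIRECTIONS:
        raise ValueError(f"unknown direction: {intended}")
    # Failure probability 0.1 is split evenly over the other three directions.
    dist = {d: 0.1 / 3 for d in DIRECTIONS if d != intended}
    dist[intended] = 0.9
    return dist


def sample_navigation_outcome(intended, rng=random):
    """Sample the direction actually taken; `rng` may be a seeded random.Random."""
    dist = navigation_outcome_distribution(intended)
    directions = list(dist)
    weights = [dist[d] for d in directions]
    return rng.choices(directions, weights=weights, k=1)[0]


dist_up = navigation_outcome_distribution("Up")
outcome = sample_navigation_outcome("Up", random.Random(0))
```

Each unintended direction receives 0.1 / 3 = 1/30, so the four probabilities sum to one.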
There is also a set of temporally extended actions defined over the one-step navigation actions that transport the agent from within the room to one of the two hallway cells leading out of the room (Figure 4, left). Key actions are defined to manipulate each key (get-key, putback-key, pickup-key, etc.). Among them, pickup-key is a temporally extended action (Figure 4, right). Note that each key has its own set of actions.

[Figure 4: Left: the policy associated with one of the hallway temporally extended actions. Right: representation of the key pickup actions for each key process. The left panel marks where the multi-step hallway action can and cannot be taken (inside and outside the room) and the target hallway. The right panel shows, for each key, a chain of states (S0 to S10 for key 1, S0 to S6 for key 2) with "Key Ready" and "Key Dropped" transitions, driven by the primitive actions get-key, key-nop and putback-key and the multi-step action pickup-key.]

In this example, navigation actions can be executed concurrently with key actions. Actions that manipulate different keys can also be executed concurrently. However, the agent is not allowed to execute more than one navigation action, or more than one key action (from the same key action set) concurrently. In order to properly handle concurrent execution of actions, we have used a factored state space defined by the state variables position (104 positions), key1-state (11 states) and key2-state (7 states). In our previous work we showed that concurrent actions formed an SMDP over primitive actions [5], which turns out to hold for all the termination schemes described above. Thus, we can use SMDP Q-learning to compare concurrent policies over different termination schemes with the use of this method for purely sequential policy learning [7]. After each decision epoch where the multi-action $a$ is taken in some state $s$ and terminates in state $s'$, the following update rule is used:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma^{k} \max_{a' \in A(s')} Q(s', a') - Q(s, a) \right],$$

where $k$ denotes the number of time steps between the initiation of the multi-action $a$ at state $s$ and its termination at state $s'$, and $r$ denotes the cumulative discounted reward over this period. The agent is punished by $-1$ for each primitive action.
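The update rule can be sketched as a small tabular routine. This is a minimal illustration rather than the authors' implementation: the table layout, the function name `smdp_q_update` and the values of `alpha` and `gamma` are assumptions, and the per-step reward of -1 follows the text above.

```python
from collections import defaultdict


def smdp_q_update(Q, s, a, rewards, s_next, next_actions, alpha=0.1, gamma=0.9):
    """Apply one SMDP Q-learning update for a multi-action that ran k steps.

    Q:            table mapping (state, multi-action) to a value
    rewards:      the k one-step rewards collected while `a` was running
    next_actions: the multi-actions admissible in s_next, i.e. A(s')
    """
    k = len(rewards)
    # Cumulative discounted reward over the k steps of the multi-action.
    r = sum(gamma ** i * reward for i, reward in enumerate(rewards))
    best_next = max((Q[(s_next, b)] for b in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma ** k * best_next - Q[(s, a)])
    return Q[(s, a)]


# Illustrative (assumed) states and multi-actions.
Q = defaultdict(float)
a = frozenset({"to-hallway", "pickup-key1"})
b = frozenset({"up"})
Q[("s1", b)] = 2.0
new_value = smdp_q_update(Q, "s0", a, [-1, -1, -1], "s1", [b], alpha=0.5, gamma=0.9)
```

Representing a multi-action as a `frozenset` keeps it hashable, so the same table works for sequential policies (singleton sets) and for every termination scheme; only the way `k`, `rewards` and `next_actions` are produced changes.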
Figure 5 (left) compares the number of primitive actions taken until success, and Figure 5 (right) shows the median number of decision epochs per trial, where for trial $n$, it is the median of all trials from 1 to $n$. These data are averaged over 10 episodes, each consisting of 500,000 trials. As shown in Figure 5 (left), concurrent actions under any termination scheme yield a faster plan than sequential execution. Moreover, the policies learned based on $T_{any}$ (i.e., both $\pi^{*any}$ and $\pi_{cont}$) are also faster than $T_{all}$. Also, $\pi^{*any}$ achieves higher optimality than $\pi_{cont}$; however, the difference is small. We conjecture that sequential execution and $T_{all}$ converge faster compared to $T_{any}$, due to the frequency with which multi-actions are terminated. As shown in Figure 5 (right), $T_{all}$ makes fewer decisions compared to $T_{any}$. This is intuitive since $T_{all}$ terminates only when all of the primary actions in a multi-action are completed, and hence it involves less interruption compared to learning based on $T_{any}$. Note that $\pi_{cont}$ converges faster than $\pi^{*any}$ and is nearly as good as $T_{any}$. We can think of $\pi_{cont}$ as a blend of $T_{all}$ and $T_{any}$. Even though it uses the $T_{any}$ termination scheme, it continues executing primary actions that did not terminate naturally when the first primary action terminates, making it similar to $T_{all}$.

[Figure 5: Left: moving median of number of steps to the goal. Right: moving median of number of multi-action level decision epochs taken to the goal. Both panels plot the moving median against trial for four curves: Sequential Actions; Concurrent Actions: optimal, T-all; Concurrent Actions: optimal, T-any; Concurrent Actions: continue.]

5 Future Work

Even though specifying the set $A(s)$ of applicable multi-actions might significantly reduce the set of choices, we still may need additional mechanisms for efficiently searching the space of multi-actions that can run in parallel. Also, we can additionally exploit the hierarchical structure of multi-actions to compile them into an effective policy over primary actions. These are some of the practical issues that we will investigate in future work.

References

[1] Craig Boutilier and Ronen Brafman. Planning with concurrent interacting actions. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI 97), 1997.
[2] P. Cichosz. Learning multidimensional control actions from delayed reinforcements. In Eighth International Symposium on System-Modelling-Control (SMC-8), Zakopane, Poland, 1995.
[3] C. A. Knoblock. Generating parallel execution plans with a partial-order planner. In Proceedings of the Second International Conference on Artificial Intelligence Planning Systems, Chicago, IL, 1994.
[4] Ray Reiter. Natural actions, concurrency and continuous time in the situation calculus. In Principles of Knowledge Representation and Reasoning: Proceedings of the Fifth International Conference (KR 96), Cambridge, MA, November 5-8, 1996.
[5] Khashayar Rohanimanesh and Sridhar Mahadevan. Decision-theoretic planning with concurrent temporally extended actions. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, 2001.
[6] S. Singh and David Cohn. How to dynamically merge Markov decision processes. Proceedings of NIPS, 1998.
[7] R. Sutton, D. Precup, and S. Singh. Between MDPs and Semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, pages 181-211, 1999.
[8] Glynn Winskel. Topics in concurrency: Part II comp. sci. lecture notes. Computer Science course at the University of Cambridge, 2002.
