Exploring Continuous Action Spaces with Diffusion Trees for Reinforcement Learning


Christian Vollmer, Erik Schaffernicht, and Horst-Michael Gross
Neuroinformatics and Cognitive Robotics Lab, Ilmenau University of Technology, Ilmenau, Germany

Abstract. We propose a new approach for reinforcement learning in problems with continuous actions. Actions are sampled by means of a diffusion tree, which generates samples in the continuous action space and organizes them in a hierarchical tree structure. In this tree, each subtree holds a subset of the action samples and thus holds knowledge about a subregion of the action space. Additionally, we store the expected long-term return of the samples of a subtree in the subtree's root. Thus, the diffusion tree integrates both a sampling technique and a means for representing acquired knowledge in a hierarchical fashion. Sampling of new action samples is done by recursively walking down the tree. Thus, information about subregions stored in the roots of all subtrees of a branching point can be used to direct the search and to generate new samples in promising regions. This facilitates control of the sample distribution, which allows for informed sampling based on the acquired knowledge, e.g. the expected return of a region in the action space. In simulation experiments, we show how this can be used conceptually for exploring the state-action space efficiently.

Keywords: reinforcement learning, continuous action space, action sampling, diffusion tree, hierarchical representation.

1 Introduction

Reinforcement learning in continuous domains is an area of active research. Conventional algorithms are only proven to work well in environments where the action space and the state space are both discrete [1]. To extend those algorithms to continuous domains, a common approach is to discretize the state space and the action space and apply discrete algorithms [2]. This, however, usually reduces the performance of the approaches [3].
One major issue when applying reinforcement learning to continuous domains is the lack of techniques to represent and update knowledge over continuous domains efficiently. Several successful approaches have been proposed that represent knowledge by means of parametric function approximators [3] or sample-based density estimation.

K. Diamantaras, W. Duch, L.S. Iliadis (Eds.): ICANN 2010, Part II, LNCS 6353. © Springer-Verlag Berlin Heidelberg 2010

In this work, we present a novel approach to reinforcement learning in continuous action spaces, based on action sampling. In action-sampling-based approaches, the agent stores knowledge by means of a set of discrete samples, which are generated successively by a certain technique, one per learning step, and executed and evaluated thereafter by the agent. To store knowledge efficiently, those samples have to be concentrated on regions of high interest. Therefore, the sampling technique has to use the knowledge acquired so far to make the sampling process as informed as possible. In our approach, actions are sampled by means of a diffusion tree, which organizes samples from a continuous space and knowledge about the underlying domain in a hierarchical structure. Higher levels in the hierarchy represent knowledge about bigger regions in the action space. Evaluation of knowledge is done by recursively walking the tree from its root to its leaves. In a balanced tree, evaluation therefore is efficient. While walking down the tree, the stored knowledge is used to control the sample distribution. In this paper, we only outline the theoretical concept and validate it in a proof-of-concept manner. Further research has to be done to prove the full validity of the approach for real-world applications.

This paper is organized as follows. Section 2 briefly introduces the state of the art in sampling-based approaches to reinforcement learning. As a basis for our approach, the Dirichlet Diffusion Tree is introduced in Section 3. Our proposed algorithm is described in Section 4. Section 5 shows results of two simple experiments conducted to conceptually validate our approach. Conclusions and an outlook on future work are stated in Section 6.

2 State of the Art

Much research has been done in the field of reinforcement learning in continuous domains. In this section, we outline a few techniques strongly related to our proposed approach. Our algorithm belongs to the group of sampling-based approaches.
Algorithms of that group typically represent knowledge by means of samples drawn from the underlying domain. In [4] an approach is presented that extends traditional dynamic programming to continuous action domains. However, the state space remains discrete. Values for states are stored in a table, one value per state. The policy is also represented as a table, where for every state an action is stored. Multilinear interpolation is used to compute values in the continuous state domain. In every iteration of the presented algorithm, a sweep through the whole state space is done, where for every state a new action and a new value are computed. Therefore, an action is sampled uniformly for every state. If the action is better than the previously stored one w.r.t. the expected return, the old action is discarded and the new one is stored instead. Unfortunately, this approach is not suited for real-time exploration and learning, due to the computational cost of the sweeps. Also, sampling actions uniformly does not incorporate any knowledge about promising actions for a state seen so far and thus is inefficient for fast exploration. In [5,6] the idea of sampling actions is extended to a so-called tree-based sampling approach. For a state, a set of action samples is drawn. For every

action the resulting successor state is simulated. In that simulated state, again a set of action samples is drawn and again the next state is evaluated. That way a look-ahead tree is built. Based on that tree, the expected long-term return of an action in the current state can be estimated. For this approach a generative process model is required, which narrows the applicability in practice. In [7] a sampling-based actor-critic approach is presented which operates on a discrete state space. For every state a set of action samples is maintained. With every action sample an importance weight is associated. Together, all samples for a state approximate a probability density function (PDF) over the continuous action space for that state. New action samples are drawn from that distribution by means of importance sampling. The weight of a sample is set proportional to the expected return of that action. Therefore, the approximated PDF has high values where actions are promising w.r.t. the expected return and thus are sampled and executed more often.

3 Mathematical Foundations

In this section, the necessary mathematical foundations will be introduced. We start with a brief definition of our notation for reinforcement learning and then introduce the formalism of the Dirichlet Diffusion Tree, which serves as a basis for our approach.

Reinforcement Learning: Our proposed approach is based on the idea of Q-Learning [1], a well-known approach to reinforcement learning. The reader is assumed to be fairly familiar with this topic. We refer to [8] for a good and comprehensive introduction. In the following, our notation of Q-Learning is defined. The state of the agent will be denoted by s ∈ S; actions will be assumed to be equal for all states and will be denoted by a ∈ A. The reward function is given by r = r(s, a) : S × A → R. Estimated action-values are defined by Q̂(s, a) = r(s, a) + γV̂(s'), where V̂ is the estimated state value and γ is the discount factor.
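As a minimal illustration of the notation above, the one-step estimate Q̂(s, a) = r(s, a) + γV̂(s') can be written out directly; the numerical values below are arbitrary, not from the paper:

```python
def q_estimate(reward, v_next, gamma=0.9):
    """One-step estimated action-value: the immediate reward plus the
    discounted estimated value of the successor state, Q = r + gamma * V."""
    return reward + gamma * v_next

# Illustrative values only: reward 1.0, successor state valued at 2.0
print(q_estimate(1.0, 2.0, gamma=0.5))  # 1.0 + 0.5 * 2.0 = 2.0
```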
Dirichlet Diffusion Tree: Our approach is based on the idea of the Dirichlet Diffusion Tree (DDT) introduced in [9], in particular on the construction of such a tree, which will be outlined in the following (see Fig. 1). In a DDT, samples are generated sequentially, each one by a stochastic diffusion process of duration t = D. The time evolution of sample i is represented by a random variable X_i(t) with t ∈ [0, D]. The start location of the first sample is set to X_1(0) = 0. The location of the sample an infinitesimal time step dt later is determined by X_1(t + dt) = X_1(t) + N(t), where N(t) is a multivariate Gaussian with zero mean and covariance σ²I dt. The values N(t) for distinct values of t are i.i.d.; thus the time evolution of X_1(t) is a Gaussian process. Let us call the path so generated X_1 (see Fig. 1(a)). For the second sample, the start point of the new diffusion process, the path X_2, is set to the start point of the first one, hence X_2(0) = X_1(0). The second sample then shares the path of the first sample up to a randomly sampled

divergence time T_d, where it diverges from the first path and goes its own way, which is again determined by a Gaussian process (see Fig. 1(b)). Thus for t ≤ T_d the paths are the same, and for t > T_d they are different. T_d is a random variable and is determined by a divergence function a(t). The probability of diverging in the next infinitesimal interval dt is given by p(T_d ∈ [t, t + dt]) = a(t) dt, where a(t) is an arbitrary monotonically increasing divergence function (see [9] for details). As a result, the probability of divergence increases monotonically in time during the diffusion process. Let us assume X_2 diverged from X_1 at time T_d = t_0 = 3. Now the third path X_3 is being sampled. Let us assume the point of divergence of the third path is t_1 > t_0, i.e. X_3 diverges later than X_2 did and X_1(t) = X_2(t) = X_3(t) for t ∈ [0, t_0]. Thus, when the process reaches t_0 = 3, a decision has to be made whether it should follow X_1 or X_2 until it diverges at t = t_1 = 5 (see Fig. 1(c)). This decision is made by randomly choosing one of the branching paths with probability proportional to the number of previous times the respective path was chosen. Thus paths that have often been chosen before are more likely to be chosen again. The concept of preferring what has been chosen before is called reinforcement of past events by [9] and is one of the main reasons that motivate the use of the DDT in our work.

[Figure 1: three panels showing sample paths over time t. (a) First path X_1, from X_1(0) to X_1(7). (b) Second path X_2, with X_2(0) = X_1(0) and X_2(3) = X_1(3), ending at X_2(7). (c) Third path X_3, with X_3(0) = X_1(0), X_3(3) = X_1(3), and X_3(5) = X_2(5), ending at X_3(7).]

Fig. 1. Evolution of a Dirichlet Diffusion Tree for three successively sampled paths with a length of D = 7. The first path (left) is sampled by accumulation of Gaussian increments. The second path (middle) diverges from the first at time t = 3. The third path (right) shares its first part with the first path, then goes along the second path and diverges at time t = 5.
[9] further introduces an additional way to implement this concept, by reducing the probability of divergence from a path X_i proportionally to the number of times the path has been travelled before. Thus it is less likely to diverge from a path that has been used by many samples before. After generating N paths X_1, ..., X_N, the values X_1(D), ..., X_N(D) represent the set of samples generated, if the DDT is viewed as a black-box sampling technique. We call those values final samples, as they are the final outcome of each diffusion process.
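Viewed as a black-box sampler, the construction above can be sketched in a few lines. This is a toy, discrete-time simplification, not Neal's exact formulation: the per-step divergence probability is held constant instead of using a monotonically increasing a(t), and the branch choice reuses a whole earlier path instead of counting traversals at each branching point, so it only approximates the reinforcement-of-past-events behaviour.

```python
import random

def sample_paths(n_paths, duration=7, sigma=1.0, p_div=0.15):
    """Toy discrete-time DDT-style sampler: the first path accumulates
    Gaussian increments; each later path copies a previous path until a
    divergence step is sampled, then continues with its own increments."""
    paths = []
    for _ in range(n_paths):
        if not paths:
            path = [0.0]
            for _ in range(duration):
                path.append(path[-1] + random.gauss(0.0, sigma))
        else:
            template = random.choice(paths)  # crude stand-in for count-weighted choice
            path = [template[0]]
            diverged = False
            for t in range(1, duration + 1):
                if not diverged and random.random() < p_div:
                    diverged = True
                if diverged:
                    path.append(path[-1] + random.gauss(0.0, sigma))
                else:
                    path.append(template[t])
            # if divergence never happens, the final sample duplicates the template
        paths.append(path)
    return [p[-1] for p in paths]  # the "final samples" X_i(D)

random.seed(0)
print(sample_paths(3))
```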

4 Our Algorithm

The algorithm proposed here borrows heavily from the idea of the diffusion tree and thus is called DT-Learning, where DT stands for diffusion tree. Like most other sampling-based approaches, it operates on a discrete state space S = {s_i}, i = 1, ..., N_s. To represent values and actions, we maintain a diffusion tree for every state, where the domain of the samples is the action space of the agent. The following paragraph introduces the structural elements that make up a diffusion tree as used in our approach.

Structural Elements of Our Diffusion Tree: Unlike the continuous notion of the diffusion tree as presented in [9], the paths of our diffusion tree are discrete in time and consist of a sequence of concrete samples of the diffusion process, which we further call nodes. Further, we extend the notion of the diffusion tree by a structural element called segment, which comprises the set of nodes from one divergence point to another (see Fig. 2). Let c be a segment and let c[i] be the i-th node of c. That way, the segments themselves comprise a tree structure, where a segment has one parent segment and arbitrarily many child segments. One particular segment has no parent segment and is called the root segment. Segments without child segments are called leaf segments. The last node of a leaf segment is also a leaf node of the entire tree. In order to ease notation, we will use functional notation for attributes of an entity (a tree, segment, or node) in the following. Let rt(s) be the root segment of the tree of state s. Let p(c) be the parent segment of segment c and let ch(c) be the set of child segments of segment c. In case c is a leaf segment, ch(c) = ∅. Let further leaf(c) denote the last node of segment c. If c is a leaf segment, leaf(c) is also a leaf node of the tree. A leaf node of the tree represents a final sample from the underlying domain. All intermediate nodes of all segments in the tree are just a byproduct of the sampling and have no particular use. Put differently, if we interpret the diffusion tree as a black-box sampling mechanism which just generates samples in the action space, we would only see the final samples represented by the leaf nodes. The remaining tree structure would be hidden in the box.

Fig. 2. Abstraction of a diffusion tree (left) to a tree of segments (right). Nodes in the diffusion tree make up a segment (ellipses). The segments themselves form a tree (right). Segment 1 is the root segment; segments 3, 4, and 5 are leaf segments. The rectangular leaf nodes (left) are the final action samples, placed continuously in the action space.
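A segment as defined above can be sketched as a small data structure; the field names (parent for p(c), children for ch(c), nodes[-1] for leaf(c)) are our own choice, not from the paper, and counter and val are the attributes the paper introduces in the following paragraph:

```python
class Segment:
    """Sketch of one segment of the DT-Learning diffusion tree."""

    def __init__(self, nodes, parent=None):
        self.nodes = nodes        # diffusion samples between two divergence points
        self.parent = parent      # p(c); None for the root segment rt(s)
        self.children = []        # ch(c); empty for leaf segments
        self.counter = 1          # number of paths sharing this segment
        self.val = 0.0            # q-value of the segment

    def leaf(self):
        """Last node; a final action sample if this is a leaf segment."""
        return self.nodes[-1]

    def is_leaf_segment(self):
        return not self.children

# Example: a root segment with one child segment branching off
root = Segment(nodes=[0.0, 0.3, 0.1])
child = Segment(nodes=[0.4, 0.9], parent=root)
root.children.append(child)
print(child.leaf())  # 0.9
```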

Hierarchical Representation of Knowledge: Besides the structural relations, several elements carry further information as attributes. The attribute counter(c) counts the number of paths that share the segment c, i.e. the number of paths that went along c before they diverged and went their own way. The attribute val(c) carries the q-value of a segment. The q-value of a segment is our way of representing the estimated long-term return of a state or state-action pair and is defined recursively as follows. The value of a leaf segment c of the tree in state s is val(c) = Q̂(s, a), where a = leaf(c) is the final action sample of the segment. The quantity Q̂(s, a) is the estimated long-term return when executing action a in state s; it is obtained in the real-time run when the agent enters the resulting successor state s' and is given by Q̂(s, a) = r(s, a) + γV̂(s'). The value of a non-leaf segment c is defined as the maximum value over all its children. By applying this rule recursively bottom-up, the value of the root segment of state s becomes the maximum value of all action samples generated by the diffusion tree in that state, and thus val(rt(s)) = V̂(s) is the expected long-term return for state s when acting greedily, i.e. always executing the action that maximizes the expected long-term return.

Controlled Exploration by Informed Sampling: In order to direct our search for good action samples, we need to control our action sampling process. We do this by controlling the divergence time and by controlling the choice of path to take at a divergence point. For the former, we use the approach from the original DDT, which decreases the probability of divergence from a segment c with increasing counter counter(c). This way we implement the principle of reinforcement of past events. For the latter, we describe our approach in the following.
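The recursive definition of val(·) above can be sketched as follows. For clarity this recomputes the value on demand, whereas the algorithm itself stores val(c) on each segment and updates it incrementally; the SimpleNamespace objects are hypothetical stand-ins for segments:

```python
from types import SimpleNamespace

def value(segment):
    """val(c) as defined in the text: a leaf segment carries the estimated
    Q-value of its final action sample; a non-leaf segment's value is the
    maximum over its children's values."""
    if not segment.children:
        return segment.val
    return max(value(child) for child in segment.children)

# Example tree: val(rt(s)) becomes V_hat(s), the best sampled return in state s
leaf_a = SimpleNamespace(children=[], val=1.5)
leaf_b = SimpleNamespace(children=[], val=0.2)
root = SimpleNamespace(children=[leaf_a, leaf_b], val=0.0)
print(value(root))  # 1.5
```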
The information available at a branching point leaf(c) is the set of children c' of the segment c and all information those children are attributed with, in particular each one's val(c'), which represents the expected action-value of the region covered by the subtree of c'. Based on that information, we can make a decision about which path to choose in numerous ways, each with different effects on the resulting sample distribution. The original heuristic of [9] is to randomly choose a child with probability proportional to the child's counter. This heuristic results in an accumulation of samples in regions where there already are many samples, because the counters of segments leading to those regions are high. However, to facilitate efficient exploration, we wish to accumulate samples in regions with high expected long-term return instead. A straightforward approach to implement this idea is to deterministically choose the child with the maximum value. This will ultimately lead to an accumulation of samples in regions with high expected long-term return. However, this statement is only valid if the tree has seen values in all promising regions of the underlying domain, i.e. it has some samples evenly distributed over the underlying domain. If we choose this heuristic right from the start of the learning process, the tree will concentrate its samples on local optima it encounters in the first few sampling steps. A common way to circumvent this issue in conventional approaches is to choose actions randomly at

the beginning of the learning process, which accounts for the uncertainty of the knowledge about the utility of the actions, and to increase the trust in the obtained knowledge by decreasing the random proportion of the decision making over time. To implement this idea we use Boltzmann selection, where the probability of choosing a child c is given by

p_c = exp(val(c)/τ) / Σ_{c' ∈ ch(p(c))} exp(val(c')/τ).

Thus, at the beginning of the learning process we set τ to a high value to account for the uncertainty of knowledge. Choices will be made purely randomly, and final samples will be evenly spread over the action space. Over time we decrease τ, and thus the choice becomes increasingly deterministic, to account for the increasing certainty of the acquired knowledge about high expected return.

Algorithmic Description: Algorithm 1 shows the pseudocode of our approach. Knowledge is acquired by incrementally building diffusion trees in the states. Every time the agent visits a state, it generates a new path (line 2) in the diffusion tree and thereby samples an action to be executed.

Algorithm 1. DT_LEARNING(s)
 1: repeat
 2:   c ← SAMPLE_PATH(s)
 3:   a ← leaf(c)
 4:   execute a, observe the resulting state s' and reward r
 5:   PROPAGATE_UP(c, r, val(rt(s')))
 6:   s ← s'
 7: until s is a goal state

procedure SAMPLE_PATH(s)
 8: if rt(s) = ∅ then
 9:   rt(s) ← sample a new segment starting at t = 0 and a = 0
10:   return rt(s)
11: else
12:   c ← rt(s)
13:   loop
14:     d ← sample a divergence time in [start(c), D]   // with start(·) the start time of c
15:     if d ≤ end(c) then                              // with end(·) the end time of c
16:       c' ← sample a new segment starting at t = d and a = c[d]
17:       p(c') ← c and ch(c) ← ch(c) ∪ {c'}
18:       return c'
19:     else if d > end(c) then
20:       c ← choose a child c' ∈ ch(c) by Boltzmann selection

procedure PROPAGATE_UP(c, r, v)
21: val(c) ← r + γv
22: repeat
23:   c ← p(c)
24:   e ← r + γv − val(c)
25:   if e > 0 then
26:     val(c) ← val(c) + αe   // with α the learning rate
27: until c has no parent

In the beginning

of a run, the diffusion trees in all states are empty, i.e. they have no path. On the first visit of a state s, the agent generates the first path, which will be the first segment c of the tree in s, and thus rt(s) = c (line 9). The leaf node of c represents the final action sample a, and thus leaf(rt(s)) = a (line 3). The agent will now execute a, leading into state s', observe the reward r(s, a) (line 4), and update the value of the tree in s (line 5) by first setting the value of c according to the value update equation (line 21) and then recursively updating the values of the parents (line 22). When entering a state with a tree that has at least one segment, we walk down the tree by sampling a divergence time (line 14) and choosing between children (line 20) until divergence (line 15). Figure 3 shows a run of an agent in a world with two states and two actions.

[Figure 3: three panels, one per step, each showing the two-state transition graph over states A and B, with the diffusion trees of both states below.]

Fig. 3. Successive sampling of paths. The upper part of each figure shows the state transition graph of a simple abstract world with two discrete states and two discrete actions, where the current state is drawn with a thick line. Below the states A and B, the diffusion trees of those states are shown. The interval lines below the trees illustrate the mapping from continuous action samples to the two discrete actions utilized in the selected exemplary application.

5 Experiments

In order to validate our approach, we conducted two experiments in simulation. The experiments serve to validate the value of informed sampling against uninformed sampling. Therefore we compare two algorithms: DT-Learning (DTL) and a simple random scheme we call Random Sampling Q-Learning (RSQL). In RSQL, with probability ν an action sample is drawn uniformly in every state and kept in case its resulting estimated return is greater than the return of the best action sample kept so far for that state. With probability 1 − ν the best action obtained so far is executed.
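The RSQL baseline just described can be sketched as two small functions; the action bounds and names are illustrative, not from the paper:

```python
import random

def rsql_action(best_action, nu, action_low=0.0, action_high=5.0, rng=random):
    """With probability nu, draw a uniform action sample (exploration);
    otherwise execute the best action kept so far (exploitation)."""
    if best_action is None or rng.random() < nu:
        return rng.uniform(action_low, action_high), True   # fresh sample to evaluate
    return best_action, False                               # exploit stored best action

def rsql_update(best_action, best_value, action, q_est):
    """Keep a sampled action only if its estimated return beats the best so far."""
    if best_value is None or q_est > best_value:
        return action, q_est
    return best_action, best_value

# Example: the first evaluated sample is always kept, worse ones are discarded
print(rsql_update(None, None, 2.0, 1.0))   # (2.0, 1.0)
print(rsql_update(2.0, 1.0, 3.5, 0.4))     # (2.0, 1.0)
```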
The parameter ν is set to a value near one at the beginning and is decreased over time to account for the initial uncertainty of knowledge. Thus, RSQL is the simplest sampling scheme possible, as it is as uninformed as possible while still fulfilling all requirements of the Q-learning framework.

The task in the first experiment is to find the shortest path from a start location to a goal location in a grid world. The state space consists of the two-dimensional locations in the grid. The actions in the gridworld consist of the five

[Figure 4: two plots of steps per episode over episodes, comparing QL, RSQL, and DTL (left) and RSQL and DTL (right).]

Fig. 4. Performance of the algorithms DT-Learning (DTL), RSQ-Learning (RSQL), and Q-Learning (QL) on the two test tasks: reaching a goal cell (left) and stabilizing a pendulum (right).

choices to go up, down, west, east, and to stay, i.e. a ∈ {0, ..., 4}. To apply the action-continuous approaches, their continuous outputs ā ∈ [0, 5] are mapped to those five actions by a = ⌊ā⌋. The agent receives a positive reward when it enters the goal cell and a negative one when it bumps into a wall. We chose this discrete world because it is simple and facilitates easy analysis of the key properties of our algorithm. We evaluated the average number of steps until the agent reaches the goal point over a number of successive learning episodes, where the agent keeps its knowledge across episodes. Figure 4 (left) shows the results, averaged over 10 trials each. We applied Q-learning (QL) in its original action-discrete fashion to serve as a baseline for comparison. As can be seen, the convergence of both sampling-based algorithms is worse than that of Q-learning. This is because Q-Learning, working with the five discrete actions, is naturally the best fit for this task. The convergence of DTL is better than that of RSQL, because DTL samples more actions in regions with high expected return, whereas RSQL ignores the knowledge obtained earlier and thus generates samples that lead into walls with relatively high probability.

In a second experiment we tested our algorithm on the task of stabilizing a pendulum in an upright position. To ease the task, the starting position for every episode is the upright position. During an episode, the number of steps is counted until the pendulum crosses the horizontal position. The two-dimensional state space, consisting of the angle φ ∈ [0, 2π] and the angular velocity ω ∈ [−10 rad/s, 10 rad/s], was discretized into 41 equally sized intervals per dimension. The action space was the angular acceleration A = [−10 Nm, 10 Nm].
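The mapping from a continuous output in [0, 5] to the five discrete gridworld actions can be sketched as a floor with a clamped upper boundary; this is one plausible reading of the floor mapping used in the experiment, since the exact handling of the boundary value is not spelled out:

```python
import math

def to_discrete_action(a_cont, n_actions=5):
    """Map a continuous action in [0, n_actions] to a discrete index by
    flooring; the boundary value n_actions is clamped to the last index."""
    return min(math.floor(a_cont), n_actions - 1)

print(to_discrete_action(3.7))  # 3
print(to_discrete_action(5.0))  # 4 (boundary clamped)
```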
Figure 4 (right) shows the results of the two algorithms RSQL and DTL. As can be seen, DT-Learning converges slightly faster. Again, this is due to the more efficient exploration resulting from the controlled sampling of actions in regions with higher expected return. We omitted Q-Learning here, because the necessary discretization of the action space would render the results incomparable.

6 Conclusion

In this work we presented an approach for reinforcement learning with continuous actions. We were able to show the benefits of informed sampling of actions by efficiently using hierarchically structured knowledge about the values of the action space. The computational cost of sampling an action is of logarithmic order in the number of action samples, as is typical for tree-based approaches. In comparison to a very simple, uninformed sampling scheme, our approach showed better convergence rates. However, some open issues remain. Due to the discretization of the state space, there is a discontinuity in the value of a particular action between two states. This could be handled by an interpolation between two trees. Another issue concerns the aging of information in unused parts of the trees. Because the memory requirements of our approach are relatively high, a technique must be found to prune subtrees based on the utility of their contained information. These issues will be subject to further research.

References

1. Watkins, C.J., Dayan, P.: Q-learning. Machine Learning 8 (1992)
2. Gross, H.-M., Stephan, V., Boehme, H.-J.: Sensory-based robot navigation using self-organizing networks and Q-learning. In: Proceedings of the 1996 World Congress on Neural Networks. Psychology Press, San Diego (1996)
3. Gaskett, C., Wettergreen, D., Zelinsky, A.: Q-learning in continuous state and action spaces. In: Australian Joint Conference on Artificial Intelligence. Springer, Heidelberg (1999)
4. Atkeson, C.G.: Randomly sampling actions in dynamic programming. In: 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007) (2007)
5. Kearns, M., Mansour, Y., Ng, A.Y.: A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning 49 (2002)
6. Ross, S., Chaib-draa, B., Pineau, J.: Bayesian reinforcement learning in continuous POMDPs with application to robot navigation. In: 2008 IEEE International Conference on Robotics and Automation (ICRA 2008). IEEE, Los Alamitos (May 2008)
7. Lazaric, A., Restelli, M., Bonarini, A.: Reinforcement learning in continuous action spaces through sequential Monte Carlo methods. In: Platt, J., Koller, D., Singer, Y., Roweis, S. (eds.) Advances in Neural Information Processing Systems, vol. 20. MIT Press, Cambridge (2008)
8. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. The MIT Press, Cambridge (March 1998)
9. Neal, R.M.: Density modeling and clustering using Dirichlet diffusion trees. In: Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting (2003)


More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 203 Outline Riemnn Sums Riemnn Integrls Properties Abstrct

More information

CS667 Lecture 6: Monte Carlo Integration 02/10/05

CS667 Lecture 6: Monte Carlo Integration 02/10/05 CS667 Lecture 6: Monte Crlo Integrtion 02/10/05 Venkt Krishnrj Lecturer: Steve Mrschner 1 Ide The min ide of Monte Crlo Integrtion is tht we cn estimte the vlue of n integrl by looking t lrge number of

More information

Uninformed Search Lecture 4

Uninformed Search Lecture 4 Lecture 4 Wht re common serch strtegies tht operte given only serch problem? How do they compre? 1 Agend A quick refresher DFS, BFS, ID-DFS, UCS Unifiction! 2 Serch Problem Formlism Defined vi the following

More information

7.2 The Definite Integral

7.2 The Definite Integral 7.2 The Definite Integrl the definite integrl In the previous section, it ws found tht if function f is continuous nd nonnegtive, then the re under the grph of f on [, b] is given by F (b) F (), where

More information

p-adic Egyptian Fractions

p-adic Egyptian Fractions p-adic Egyptin Frctions Contents 1 Introduction 1 2 Trditionl Egyptin Frctions nd Greedy Algorithm 2 3 Set-up 3 4 p-greedy Algorithm 5 5 p-egyptin Trditionl 10 6 Conclusion 1 Introduction An Egyptin frction

More information

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature CMDA 4604: Intermedite Topics in Mthemticl Modeling Lecture 19: Interpoltion nd Qudrture In this lecture we mke brief diversion into the res of interpoltion nd qudrture. Given function f C[, b], we sy

More information

Advanced Calculus: MATH 410 Notes on Integrals and Integrability Professor David Levermore 17 October 2004

Advanced Calculus: MATH 410 Notes on Integrals and Integrability Professor David Levermore 17 October 2004 Advnced Clculus: MATH 410 Notes on Integrls nd Integrbility Professor Dvid Levermore 17 October 2004 1. Definite Integrls In this section we revisit the definite integrl tht you were introduced to when

More information

DATA Search I 魏忠钰. 复旦大学大数据学院 School of Data Science, Fudan University. March 7 th, 2018

DATA Search I 魏忠钰. 复旦大学大数据学院 School of Data Science, Fudan University. March 7 th, 2018 DATA620006 魏忠钰 Serch I Mrch 7 th, 2018 Outline Serch Problems Uninformed Serch Depth-First Serch Bredth-First Serch Uniform-Cost Serch Rel world tsk - Pc-mn Serch problems A serch problem consists of:

More information

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below.

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below. Dulity #. Second itertion for HW problem Recll our LP emple problem we hve been working on, in equlity form, is given below.,,,, 8 m F which, when written in slightly different form, is 8 F Recll tht we

More information

Numerical integration

Numerical integration 2 Numericl integrtion This is pge i Printer: Opque this 2. Introduction Numericl integrtion is problem tht is prt of mny problems in the economics nd econometrics literture. The orgniztion of this chpter

More information

Monte Carlo method in solving numerical integration and differential equation

Monte Carlo method in solving numerical integration and differential equation Monte Crlo method in solving numericl integrtion nd differentil eqution Ye Jin Chemistry Deprtment Duke University yj66@duke.edu Abstrct: Monte Crlo method is commonly used in rel physics problem. The

More information

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams Chpter 4 Contrvrince, Covrince, nd Spcetime Digrms 4. The Components of Vector in Skewed Coordintes We hve seen in Chpter 3; figure 3.9, tht in order to show inertil motion tht is consistent with the Lorentz

More information

The Regulated and Riemann Integrals

The Regulated and Riemann Integrals Chpter 1 The Regulted nd Riemnn Integrls 1.1 Introduction We will consider severl different pproches to defining the definite integrl f(x) dx of function f(x). These definitions will ll ssign the sme vlue

More information

Unit #9 : Definite Integral Properties; Fundamental Theorem of Calculus

Unit #9 : Definite Integral Properties; Fundamental Theorem of Calculus Unit #9 : Definite Integrl Properties; Fundmentl Theorem of Clculus Gols: Identify properties of definite integrls Define odd nd even functions, nd reltionship to integrl vlues Introduce the Fundmentl

More information

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1 Exm, Mthemtics 471, Section ETY6 6:5 pm 7:4 pm, Mrch 1, 16, IH-115 Instructor: Attil Máté 1 17 copies 1. ) Stte the usul sufficient condition for the fixed-point itertion to converge when solving the eqution

More information

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s).

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s). Mth 1A with Professor Stnkov Worksheet, Discussion #41; Wednesdy, 12/6/217 GSI nme: Roy Zho Problems 1. Write the integrl 3 dx s limit of Riemnn sums. Write it using 2 intervls using the 1 x different

More information

Intermediate Math Circles Wednesday, November 14, 2018 Finite Automata II. Nickolas Rollick a b b. a b 4

Intermediate Math Circles Wednesday, November 14, 2018 Finite Automata II. Nickolas Rollick a b b. a b 4 Intermedite Mth Circles Wednesdy, Novemer 14, 2018 Finite Automt II Nickols Rollick nrollick@uwterloo.c Regulr Lnguges Lst time, we were introduced to the ide of DFA (deterministic finite utomton), one

More information

KNOWLEDGE-BASED AGENTS INFERENCE

KNOWLEDGE-BASED AGENTS INFERENCE AGENTS THAT REASON LOGICALLY KNOWLEDGE-BASED AGENTS Two components: knowledge bse, nd n inference engine. Declrtive pproch to building n gent. We tell it wht it needs to know, nd It cn sk itself wht to

More information

New data structures to reduce data size and search time

New data structures to reduce data size and search time New dt structures to reduce dt size nd serch time Tsuneo Kuwbr Deprtment of Informtion Sciences, Fculty of Science, Kngw University, Hirtsuk-shi, Jpn FIT2018 1D-1, No2, pp1-4 Copyright (c)2018 by The Institute

More information

A Fast and Reliable Policy Improvement Algorithm

A Fast and Reliable Policy Improvement Algorithm A Fst nd Relible Policy Improvement Algorithm Ysin Abbsi-Ydkori Peter L. Brtlett Stephen J. Wright Queenslnd University of Technology UC Berkeley nd QUT University of Wisconsin-Mdison Abstrct We introduce

More information

Module 6 Value Iteration. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

Module 6 Value Iteration. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo Module 6 Vlue Itertion CS 886 Sequentil Decision Mking nd Reinforcement Lerning University of Wterloo Mrkov Decision Process Definition Set of sttes: S Set of ctions (i.e., decisions): A Trnsition model:

More information

f(x) dx, If one of these two conditions is not met, we call the integral improper. Our usual definition for the value for the definite integral

f(x) dx, If one of these two conditions is not met, we call the integral improper. Our usual definition for the value for the definite integral Improper Integrls Every time tht we hve evluted definite integrl such s f(x) dx, we hve mde two implicit ssumptions bout the integrl:. The intervl [, b] is finite, nd. f(x) is continuous on [, b]. If one

More information

Solution for Assignment 1 : Intro to Probability and Statistics, PAC learning

Solution for Assignment 1 : Intro to Probability and Statistics, PAC learning Solution for Assignment 1 : Intro to Probbility nd Sttistics, PAC lerning 10-701/15-781: Mchine Lerning (Fll 004) Due: Sept. 30th 004, Thursdy, Strt of clss Question 1. Bsic Probbility ( 18 pts) 1.1 (

More information

Reinforcement Learning and Policy Reuse

Reinforcement Learning and Policy Reuse Reinforcement Lerning nd Policy Reue Mnuel M. Veloo PEL Fll 206 Reding: Reinforcement Lerning: An Introduction R. Sutton nd A. Brto Probbilitic policy reue in reinforcement lerning gent Fernndo Fernndez

More information

Driving Cycle Construction of City Road for Hybrid Bus Based on Markov Process Deng Pan1, a, Fengchun Sun1,b*, Hongwen He1, c, Jiankun Peng1, d

Driving Cycle Construction of City Road for Hybrid Bus Based on Markov Process Deng Pan1, a, Fengchun Sun1,b*, Hongwen He1, c, Jiankun Peng1, d Interntionl Industril Informtics nd Computer Engineering Conference (IIICEC 15) Driving Cycle Construction of City Rod for Hybrid Bus Bsed on Mrkov Process Deng Pn1,, Fengchun Sun1,b*, Hongwen He1, c,

More information

and that at t = 0 the object is at position 5. Find the position of the object at t = 2.

and that at t = 0 the object is at position 5. Find the position of the object at t = 2. 7.2 The Fundmentl Theorem of Clculus 49 re mny, mny problems tht pper much different on the surfce but tht turn out to be the sme s these problems, in the sense tht when we try to pproimte solutions we

More information

Math 426: Probability Final Exam Practice

Math 426: Probability Final Exam Practice Mth 46: Probbility Finl Exm Prctice. Computtionl problems 4. Let T k (n) denote the number of prtitions of the set {,..., n} into k nonempty subsets, where k n. Argue tht T k (n) kt k (n ) + T k (n ) by

More information

Data Assimilation. Alan O Neill Data Assimilation Research Centre University of Reading

Data Assimilation. Alan O Neill Data Assimilation Research Centre University of Reading Dt Assimiltion Aln O Neill Dt Assimiltion Reserch Centre University of Reding Contents Motivtion Univrite sclr dt ssimiltion Multivrite vector dt ssimiltion Optiml Interpoltion BLUE 3d-Vritionl Method

More information

1 Probability Density Functions

1 Probability Density Functions Lis Yn CS 9 Continuous Distributions Lecture Notes #9 July 6, 28 Bsed on chpter by Chris Piech So fr, ll rndom vribles we hve seen hve been discrete. In ll the cses we hve seen in CS 9, this ment tht our

More information

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as Improper Integrls Two different types of integrls cn qulify s improper. The first type of improper integrl (which we will refer to s Type I) involves evluting n integrl over n infinite region. In the grph

More information

Numerical Analysis: Trapezoidal and Simpson s Rule

Numerical Analysis: Trapezoidal and Simpson s Rule nd Simpson s Mthemticl question we re interested in numericlly nswering How to we evlute I = f (x) dx? Clculus tells us tht if F(x) is the ntiderivtive of function f (x) on the intervl [, b], then I =

More information

Best Approximation. Chapter The General Case

Best Approximation. Chapter The General Case Chpter 4 Best Approximtion 4.1 The Generl Cse In the previous chpter, we hve seen how n interpolting polynomil cn be used s n pproximtion to given function. We now wnt to find the best pproximtion to given

More information

Jonathan Mugan. July 15, 2013

Jonathan Mugan. July 15, 2013 Jonthn Mugn July 15, 2013 Imgine rt in Skinner box. The rt cn see screen of imges, nd dot in the lower-right corner determines if there will be shock. Bottom-up methods my not find this dot, but top-down

More information

I1 = I2 I1 = I2 + I3 I1 + I2 = I3 + I4 I 3

I1 = I2 I1 = I2 + I3 I1 + I2 = I3 + I4 I 3 2 The Prllel Circuit Electric Circuits: Figure 2- elow show ttery nd multiple resistors rrnged in prllel. Ech resistor receives portion of the current from the ttery sed on its resistnce. The split is

More information

Math 1B, lecture 4: Error bounds for numerical methods

Math 1B, lecture 4: Error bounds for numerical methods Mth B, lecture 4: Error bounds for numericl methods Nthn Pflueger 4 September 0 Introduction The five numericl methods descried in the previous lecture ll operte by the sme principle: they pproximte the

More information

5.7 Improper Integrals

5.7 Improper Integrals 458 pplictions of definite integrls 5.7 Improper Integrls In Section 5.4, we computed the work required to lift pylod of mss m from the surfce of moon of mss nd rdius R to height H bove the surfce of the

More information

Lecture 14: Quadrature

Lecture 14: Quadrature Lecture 14: Qudrture This lecture is concerned with the evlution of integrls fx)dx 1) over finite intervl [, b] The integrnd fx) is ssumed to be rel-vlues nd smooth The pproximtion of n integrl by numericl

More information

The steps of the hypothesis test

The steps of the hypothesis test ttisticl Methods I (EXT 7005) Pge 78 Mosquito species Time of dy A B C Mid morning 0.0088 5.4900 5.5000 Mid Afternoon.3400 0.0300 0.8700 Dusk 0.600 5.400 3.000 The Chi squre test sttistic is the sum of

More information

Tests for the Ratio of Two Poisson Rates

Tests for the Ratio of Two Poisson Rates Chpter 437 Tests for the Rtio of Two Poisson Rtes Introduction The Poisson probbility lw gives the probbility distribution of the number of events occurring in specified intervl of time or spce. The Poisson

More information

Introduction to Group Theory

Introduction to Group Theory Introduction to Group Theory Let G be n rbitrry set of elements, typiclly denoted s, b, c,, tht is, let G = {, b, c, }. A binry opertion in G is rule tht ssocites with ech ordered pir (,b) of elements

More information

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying Vitli covers 1 Definition. A Vitli cover of set E R is set V of closed intervls with positive length so tht, for every δ > 0 nd every x E, there is some I V with λ(i ) < δ nd x I. 2 Lemm (Vitli covering)

More information

Multi-Armed Bandits: Non-adaptive and Adaptive Sampling

Multi-Armed Bandits: Non-adaptive and Adaptive Sampling CSE 547/Stt 548: Mchine Lerning for Big Dt Lecture Multi-Armed Bndits: Non-dptive nd Adptive Smpling Instructor: Shm Kkde 1 The (stochstic) multi-rmed bndit problem The bsic prdigm is s follows: K Independent

More information

Math 8 Winter 2015 Applications of Integration

Math 8 Winter 2015 Applications of Integration Mth 8 Winter 205 Applictions of Integrtion Here re few importnt pplictions of integrtion. The pplictions you my see on n exm in this course include only the Net Chnge Theorem (which is relly just the Fundmentl

More information

Session 13

Session 13 780.20 Session 3 (lst revised: Februry 25, 202) 3 3. 780.20 Session 3. Follow-ups to Session 2 Histogrms of Uniform Rndom Number Distributions. Here is typicl figure you might get when histogrmming uniform

More information

Review of Gaussian Quadrature method

Review of Gaussian Quadrature method Review of Gussin Qudrture method Nsser M. Asi Spring 006 compiled on Sundy Decemer 1, 017 t 09:1 PM 1 The prolem To find numericl vlue for the integrl of rel vlued function of rel vrile over specific rnge

More information

Cf. Linn Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Series in Probability & Statistics, 1999.

Cf. Linn Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Series in Probability & Statistics, 1999. Cf. Linn Sennott, Stochstic Dynmic Progrmming nd the Control of Queueing Systems, Wiley Series in Probbility & Sttistics, 1999. D.L.Bricker, 2001 Dept of Industril Engineering The University of Iow MDP

More information

3.4 Numerical integration

3.4 Numerical integration 3.4. Numericl integrtion 63 3.4 Numericl integrtion In mny economic pplictions it is necessry to compute the definite integrl of relvlued function f with respect to "weight" function w over n intervl [,

More information

Learning to Serve and Bounce a Ball

Learning to Serve and Bounce a Ball Sndr Amend Gregor Gebhrdt Technische Universität Drmstdt Abstrct In this pper we investigte lerning the tsks of bll serving nd bll bouncing. These tsks disply chrcteristics which re common in vriety of

More information

Numerical Integration

Numerical Integration Chpter 5 Numericl Integrtion Numericl integrtion is the study of how the numericl vlue of n integrl cn be found. Methods of function pproximtion discussed in Chpter??, i.e., function pproximtion vi the

More information

Lecture 1. Functional series. Pointwise and uniform convergence.

Lecture 1. Functional series. Pointwise and uniform convergence. 1 Introduction. Lecture 1. Functionl series. Pointwise nd uniform convergence. In this course we study mongst other things Fourier series. The Fourier series for periodic function f(x) with period 2π is

More information

Math 360: A primitive integral and elementary functions

Math 360: A primitive integral and elementary functions Mth 360: A primitive integrl nd elementry functions D. DeTurck University of Pennsylvni October 16, 2017 D. DeTurck Mth 360 001 2017C: Integrl/functions 1 / 32 Setup for the integrl prtitions Definition:

More information

Chapter 3 Solving Nonlinear Equations

Chapter 3 Solving Nonlinear Equations Chpter 3 Solving Nonliner Equtions 3.1 Introduction The nonliner function of unknown vrible x is in the form of where n could be non-integer. Root is the numericl vlue of x tht stisfies f ( x) 0. Grphiclly,

More information

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007 A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H Thoms Shores Deprtment of Mthemtics University of Nebrsk Spring 2007 Contents Rtes of Chnge nd Derivtives 1 Dierentils 4 Are nd Integrls 5 Multivrite Clculus

More information

Vyacheslav Telnin. Search for New Numbers.

Vyacheslav Telnin. Search for New Numbers. Vycheslv Telnin Serch for New Numbers. 1 CHAPTER I 2 I.1 Introduction. In 1984, in the first issue for tht yer of the Science nd Life mgzine, I red the rticle "Non-Stndrd Anlysis" by V. Uspensky, in which

More information

UNIT 1 FUNCTIONS AND THEIR INVERSES Lesson 1.4: Logarithmic Functions as Inverses Instruction

UNIT 1 FUNCTIONS AND THEIR INVERSES Lesson 1.4: Logarithmic Functions as Inverses Instruction Lesson : Logrithmic Functions s Inverses Prerequisite Skills This lesson requires the use of the following skills: determining the dependent nd independent vribles in n exponentil function bsed on dt from

More information

Lecture 21: Order statistics

Lecture 21: Order statistics Lecture : Order sttistics Suppose we hve N mesurements of sclr, x i =, N Tke ll mesurements nd sort them into scending order x x x 3 x N Define the mesured running integrl S N (x) = 0 for x < x = i/n for

More information

Lecture Note 9: Orthogonal Reduction

Lecture Note 9: Orthogonal Reduction MATH : Computtionl Methods of Liner Algebr 1 The Row Echelon Form Lecture Note 9: Orthogonl Reduction Our trget is to solve the norml eution: Xinyi Zeng Deprtment of Mthemticl Sciences, UTEP A t Ax = A

More information

AP Calculus Multiple Choice: BC Edition Solutions

AP Calculus Multiple Choice: BC Edition Solutions AP Clculus Multiple Choice: BC Edition Solutions J. Slon Mrch 8, 04 ) 0 dx ( x) is A) B) C) D) E) Divergent This function inside the integrl hs verticl symptotes t x =, nd the integrl bounds contin this

More information

13.4 Work done by Constant Forces

13.4 Work done by Constant Forces 13.4 Work done by Constnt Forces We will begin our discussion of the concept of work by nlyzing the motion of n object in one dimension cted on by constnt forces. Let s consider the following exmple: push

More information

Riemann Integrals and the Fundamental Theorem of Calculus

Riemann Integrals and the Fundamental Theorem of Calculus Riemnn Integrls nd the Fundmentl Theorem of Clculus Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University September 16, 2013 Outline Grphing Riemnn Sums

More information

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a).

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a). The Fundmentl Theorems of Clculus Mth 4, Section 0, Spring 009 We now know enough bout definite integrls to give precise formultions of the Fundmentl Theorems of Clculus. We will lso look t some bsic emples

More information

Section 6.1 INTRO to LAPLACE TRANSFORMS

Section 6.1 INTRO to LAPLACE TRANSFORMS Section 6. INTRO to LAPLACE TRANSFORMS Key terms: Improper Integrl; diverge, converge A A f(t)dt lim f(t)dt Piecewise Continuous Function; jump discontinuity Function of Exponentil Order Lplce Trnsform

More information

approaches as n becomes larger and larger. Since e > 1, the graph of the natural exponential function is as below

approaches as n becomes larger and larger. Since e > 1, the graph of the natural exponential function is as below . Eponentil nd rithmic functions.1 Eponentil Functions A function of the form f() =, > 0, 1 is clled n eponentil function. Its domin is the set of ll rel f ( 1) numbers. For n eponentil function f we hve.

More information

1 The Lagrange interpolation formula

1 The Lagrange interpolation formula Notes on Qudrture 1 The Lgrnge interpoltion formul We briefly recll the Lgrnge interpoltion formul. The strting point is collection of N + 1 rel points (x 0, y 0 ), (x 1, y 1 ),..., (x N, y N ), with x

More information

Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments

Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments Plnning to Be Surprised: Optiml Byesin Explortion in Dynmic Environments Yi Sun, Fustino Gomez, nd Jürgen Schmidhuber IDSIA, Glleri 2, Mnno, CH-6928, Switzerlnd {yi,tino,juergen}@idsi.ch Abstrct. To mximize

More information

Convert the NFA into DFA

Convert the NFA into DFA Convert the NF into F For ech NF we cn find F ccepting the sme lnguge. The numer of sttes of the F could e exponentil in the numer of sttes of the NF, ut in prctice this worst cse occurs rrely. lgorithm:

More information

Near-Bayesian Exploration in Polynomial Time

Near-Bayesian Exploration in Polynomial Time J. Zico Kolter kolter@cs.stnford.edu Andrew Y. Ng ng@cs.stnford.edu Computer Science Deprtment, Stnford University, CA 94305 Abstrct We consider the explortion/exploittion problem in reinforcement lerning

More information

CS 275 Automata and Formal Language Theory

CS 275 Automata and Formal Language Theory CS 275 Automt nd Forml Lnguge Theory Course Notes Prt II: The Recognition Problem (II) Chpter II.5.: Properties of Context Free Grmmrs (14) Anton Setzer (Bsed on book drft by J. V. Tucker nd K. Stephenson)

More information

1B40 Practical Skills

1B40 Practical Skills B40 Prcticl Skills Comining uncertinties from severl quntities error propgtion We usully encounter situtions where the result of n experiment is given in terms of two (or more) quntities. We then need

More information

MAA 4212 Improper Integrals

MAA 4212 Improper Integrals Notes by Dvid Groisser, Copyright c 1995; revised 2002, 2009, 2014 MAA 4212 Improper Integrls The Riemnn integrl, while perfectly well-defined, is too restrictive for mny purposes; there re functions which

More information

A. Limits - L Hopital s Rule ( ) How to find it: Try and find limits by traditional methods (plugging in). If you get 0 0 or!!, apply C.! 1 6 C.

A. Limits - L Hopital s Rule ( ) How to find it: Try and find limits by traditional methods (plugging in). If you get 0 0 or!!, apply C.! 1 6 C. A. Limits - L Hopitl s Rule Wht you re finding: L Hopitl s Rule is used to find limits of the form f ( x) lim where lim f x x! c g x ( ) = or lim f ( x) = limg( x) = ". ( ) x! c limg( x) = 0 x! c x! c

More information

1 The Riemann Integral

1 The Riemann Integral The Riemnn Integrl. An exmple leding to the notion of integrl (res) We know how to find (i.e. define) the re of rectngle (bse height), tringle ( (sum of res of tringles). But how do we find/define n re

More information

Connected-components. Summary of lecture 9. Algorithms and Data Structures Disjoint sets. Example: connected components in graphs

Connected-components. Summary of lecture 9. Algorithms and Data Structures Disjoint sets. Example: connected components in graphs Prm University, Mth. Deprtment Summry of lecture 9 Algorithms nd Dt Structures Disjoint sets Summry of this lecture: (CLR.1-3) Dt Structures for Disjoint sets: Union opertion Find opertion Mrco Pellegrini

More information

1 Nondeterministic Finite Automata

1 Nondeterministic Finite Automata 1 Nondeterministic Finite Automt Suppose in life, whenever you hd choice, you could try oth possiilities nd live your life. At the end, you would go ck nd choose the one tht worked out the est. Then you

More information

4.4 Areas, Integrals and Antiderivatives

4.4 Areas, Integrals and Antiderivatives . res, integrls nd ntiderivtives 333. Ares, Integrls nd Antiderivtives This section explores properties of functions defined s res nd exmines some connections mong res, integrls nd ntiderivtives. In order

More information

Finite Automata. Informatics 2A: Lecture 3. John Longley. 22 September School of Informatics University of Edinburgh

Finite Automata. Informatics 2A: Lecture 3. John Longley. 22 September School of Informatics University of Edinburgh Lnguges nd Automt Finite Automt Informtics 2A: Lecture 3 John Longley School of Informtics University of Edinburgh jrl@inf.ed.c.uk 22 September 2017 1 / 30 Lnguges nd Automt 1 Lnguges nd Automt Wht is

More information