Experience Selection in Deep Reinforcement Learning for Control


Journal of Machine Learning Research 19 (2018) 1-56. Submitted 3/17; Revised 07/18; Published 08/18

Experience Selection in Deep Reinforcement Learning for Control

Tim de Bruin, Jens Kober
Cognitive Robotics Department, Delft University of Technology, Mekelweg 2, 2628 CD Delft, The Netherlands

Karl Tuyls
Deepmind, 14 Rue de Londres, Paris, France
Department of Computer Science, University of Liverpool, Ashton Street, Liverpool L69 3BX, United Kingdom

Robert Babuška
Cognitive Robotics Department, Delft University of Technology, Mekelweg 2, 2628 CD Delft, The Netherlands

Editor: George Konidaris

Abstract

Experience replay is a technique that allows off-policy reinforcement-learning methods to reuse past experiences. The stability and speed of convergence of reinforcement learning, as well as the eventual performance of the learned policy, are strongly dependent on the experiences being replayed. Which experiences are replayed depends on two important choices. The first is which and how many experiences to retain in the experience replay buffer. The second choice is how to sample the experiences that are to be replayed from that buffer. We propose new methods for the combined problem of experience retention and experience sampling. We refer to the combination as experience selection. We focus our investigation specifically on the control of physical systems, such as robots, where exploration is costly. To determine which experiences to keep and which to replay, we investigate different proxies for their immediate and long-term utility. These proxies include age, temporal difference error and the strength of the applied exploration noise. Since no currently available method works in all situations, we propose guidelines for using prior knowledge about the characteristics of the control problem at hand to choose the appropriate experience replay strategy.

Keywords: reinforcement learning, deep learning, experience replay, control, robotics

(c) 2018 Tim de Bruin, Jens Kober, Karl Tuyls, Robert Babuška. License: CC-BY 4.0; attribution requirements are provided at

1. Introduction

Reinforcement learning is a powerful framework that makes it possible to learn complex nonlinear policies for sequential decision making processes while requiring very little prior knowledge. Especially the subfield of deep reinforcement learning, where neural networks

are used as function approximators, has recently yielded some impressive results. Among these results are learning to play Atari games (Mnih et al., 2015) and to control robots (Levine et al., 2016) straight from raw images, as well as beating the top human player in the game of Go (Silver et al., 2016).

Reinforcement learning methods can be divided into on-policy and off-policy methods. On-policy methods directly optimize the policy that is used to make decisions, while off-policy methods can learn about an optimal policy from data generated by another policy. Neither approach is without its problems, which has motivated work on methods that combine on- and off-policy updates (Wang et al., 2017; Gu et al., 2017; O'Donoghue et al., 2017).

When a reinforcement-learning method is either partially or entirely off-policy, past experiences can be stored in a buffer and reused for learning. Doing so not only reduces the sample complexity of the learning algorithm, but can also be crucial for the stability of reinforcement-learning algorithms that use deep neural networks as function approximators (Mnih et al., 2015; Lillicrap et al., 2016; Schaul et al., 2016; Wang et al., 2017).

If we have access to a buffer with past experiences, an interesting question arises: how should we sample the experiences to be replayed from this buffer? It has been shown by Schaul et al. (2016) that a good answer to this question can significantly improve the performance of the reinforcement-learning algorithm. However, even if we know how to sample from the experience buffer, two additional questions arise: what should the buffer capacity be and, once it is full, how do we decide which experiences should be retained in the buffer and which ones can be overwritten with new experiences? These questions are especially relevant when learning on systems with a limited storage capacity, for instance when dealing with high-dimensional inputs such as images.

Finding a good answer to the question of which experiences to retain in the buffer becomes even more important when exploration is costly. This can be the case for physical systems such as robots, where exploratory actions cause wear or damage and risks need to be minimized (Kober et al., 2013; García and Fernández, 2015; Tamar et al., 2016; Koryakovskiy et al., 2017). It is also the case for tasks where a minimum level of performance needs to be achieved at all times (Banerjee and Peng, 2004) or when the policy that generates the experiences is out of our control (Seo and Zhang, 2000; Schaal, 1999).

We will refer to the combined problem of experience retention and experience sampling as experience selection. The questions of which experiences to sample and which experiences to retain in the buffer are related, since they both require a judgment on the utility of the experiences. The difference between them is that determining which experiences to sample requires a judgment on the instantaneous utility: from which experiences can the agent learn the most at the moment of sampling? In contrast, a decision on experience retention should be based on the expected long-term utility of experiences. Experiences need to be retained in a way that prevents insufficient coverage of the state-action space in the future, as experiences cannot be recovered once they have been discarded.

To know the true utility of an experience, it would be necessary to foresee the effects of having the reinforcement-learning agent learn from the experience at any given time. Since this is not possible, we instead investigate proxies for the experience utility that are cheap to obtain.

In this work, we investigate age, surprise (in the form of the temporal difference error), and the amplitude of the exploration noise as proxies for the utility of experiences. To motivate the need for multiple proxies, we will start by showing the performance of different experience selection methods on control benchmarks that, at first sight, seem very closely related. As a motivating example we show how the current state-of-the-art experience selection method of Schaul et al. (2016), based on retaining a large number of experiences and sampling them according to their temporal difference error, compares on these benchmarks to sampling uniformly at random from the experiences of the most recent episodes. We show that the state-of-the-art method significantly outperforms the standard method on one benchmark while significantly under-performing on the other, seemingly similar benchmark.

The focus of this paper is on the control of physical systems such as robots. The hardware limitations of these systems can impose constraints on the exploration policy and the number of experiences that can be stored in the buffer. These factors make the correct choice of experience sampling strategy especially important. As we show on additional, more complex benchmarks, even when sustained exploration is possible, it can be beneficial to be selective about which and how many experiences to retain in the buffer.

The costs involved in operating a robot mean that it is generally infeasible to rely on an extensive hyperparameter search to determine which experience selection strategy to use. We therefore want to understand how this choice can be made based on prior knowledge of the control task. With this in mind, the contributions of this work are twofold:

1. We investigate how the utility of different experiences is influenced by the aspects of the control problem. These aspects include properties of the system dynamics such as the sampling frequency and noise, as well as constraints on the exploration.

2. We describe how to perform experience retention and experience sampling based on experience utility proxies. We show how these two parts of experience selection work together under a range of conditions. Based on this we provide guidelines on how to use prior knowledge about the control problem at hand to choose an experience selection strategy.

Note that for many of the experiments in this work most of the hyper-parameters of the deep reinforcement-learning algorithms are kept fixed. While it would be possible to improve the performance through a more extensive hyper-parameter search, our focus is on showing the relationships between the performance of the different methods and the properties of the control problems. While we do introduce new methods to address specific problems, the intended outcome of this work is to be able to make more informed choices regarding experience selection, rather than to promote any single method.

The rest of this paper is organized as follows. Section 2 gives an overview of related work. In Section 3, the basics of reinforcement learning, as well as the deep reinforcement learning and experience replay methods used as a starting point, are discussed. Section 4 gives a high-level overview of the simple benchmarks used in most of this work, with the mathematical details presented in Appendix 9.3. The notation we use to distinguish between different methods, as well as the performance criteria that we use, are discussed in Section 5. In

Section 6, we investigate what spread over the state-action space the experiences ideally should have, based on the characteristics of the control problem to be solved. The proposed methods to select experiences are detailed in Section 7, with the results of applying these methods to the different scenarios in simple and more complex benchmarks presented in Section 8. The conclusions, as well as our recommended guidelines for choosing the buffer size, retention proxy and sampling strategy, are given in Section 9.

2. Related Work

When a learning system needs to learn a task from a set of examples, the order in which the examples are presented to the learner can be very important. One method to improve the learning performance on complex tasks is to gradually increase the difficulty of the examples that are presented. This concept is known as shaping (Skinner, 1958) in animal training and curriculum learning (Bengio et al., 2009) in machine learning. Sometimes it is possible to generate training examples of just the right difficulty on-line. Recent machine learning examples of this include generative adversarial networks (Goodfellow et al., 2014) and self-play in reinforcement learning (see for example the work by Silver et al. 2017). When the training examples are fixed, learning can be sped up by repeating those examples that the learning system is struggling with more often than those that it finds easy, as was shown for supervised learning by, among others, Hinton (2007) and Loshchilov and Hutter (2015). Additionally, the eventual performance of supervised-learning methods can be improved by re-sampling the training data proportionally to the difficulty of the examples, as done in the boosting technique (Valiant, 1984; Freund et al., 1999).

In on-line reinforcement learning, a set of examples is generally not available to start with. Instead, an agent interacts with its environment and observes a stream of experiences as a result. The experience replay technique was introduced to save those experiences in a buffer and replay them from that buffer to the learning system (Lin, 1992). The introduction of an experience buffer makes it possible to choose which examples should be presented to the learning system again. As in supervised learning, we can replay those experiences that induced the largest error (Schaul et al., 2016). Another option that has been investigated in the literature is to replay more often those experiences that are associated with large immediate rewards (Narasimhan et al., 2015).

In off-policy reinforcement learning the question of which experiences to learn from extends beyond choosing how to sample from a buffer. It begins with determining which experiences should be in the buffer. Lipton et al. (2016) fill the buffer with successful experiences from a pre-existing policy before learning starts. Other authors have investigated criteria to determine which experiences should be retained in a buffer of limited capacity when new experiences are observed. In this context, Pieters and Wiering (2016) have investigated keeping only experiences with the highest immediate rewards in the buffer, while our previous work has focused on ensuring sufficient diversity in the state-action space (de Bruin et al., 2016a,b).

Experience replay techniques, including those in this work, often take the stream of experiences that the agent observes as given and attempt to learn from this stream in an optimal way. Other authors have investigated ways to instill the desire to seek out information that is useful for the learning process directly into the agent's behavior (Schmidhuber,

1991; Chentanez et al., 2004; Houthooft et al., 2016; Bellemare et al., 2016; Osband et al., 2016). Due to the classical exploration-exploitation dilemma, changing the agent's behavior to obtain more informative experiences comes at the price of the agent acting less optimally according to the original reward function. A safe alternative to actively seeking out real informative but potentially dangerous experiences is to learn, at least in part, from synthetic experiences. This can be done by using an a priori available environment model such as a physics simulator (Barrett et al., 2010; Rusu et al., 2016), or by learning a model from the stream of experiences itself and using that model to generate experiences (Sutton, 1991; Kuvayev and Sutton, 1996; Gu et al., 2016; Caarls and Schuitema, 2016). The availability of a generative model still leaves the question of which experiences to generate. Prioritized sweeping again bases updates on surprise, as measured by the size of the change to the learned functions (Moore and Atkeson, 1993; Andre et al., 1997). Ciosek and Whiteson (2017) dynamically adjusted the distribution of experiences generated by a simulator to reduce the variance of learning updates.

Learning a model can reduce the sample complexity of a learning algorithm when learning the dynamics and reward functions is easy compared to learning the value function or policy. However, it is not straightforward to get improved performance in general. In contrast, the introduction of an experience replay buffer has been shown to be both simple and very beneficial for many deep reinforcement learning techniques (Mnih et al., 2015; Lillicrap et al., 2016; Wang et al., 2017; Gu et al., 2017). When a buffer is used, we can decide which experiences to have in the buffer and which experiences to sample from the buffer. In contrast to previous work on this topic we investigate the combined problem of experience retention and sampling. We also look at several different proxies for the usefulness of experiences and how prior knowledge about the specific reinforcement learning problem at hand can be used to choose between them, rather than attempting to find a single universal experience-utility proxy.

3. Preliminaries

We consider a standard reinforcement learning setting (Section 3.1) in which an agent learns to act optimally in an environment, using the implementation by Lillicrap et al. (2016) of the off-policy actor-critic algorithm by Silver et al. (2014) (Section 3.2). Actor-critic algorithms make it possible to deal with the continuous action spaces that are often found in control applications. The off-policy nature of the algorithm enables the use of experience replay (Section 3.3), which helps to reduce the number of environment steps needed by the algorithm to learn a successful policy and improves the algorithm's stability. Here, we summarize the deep reinforcement learning (Lillicrap et al., 2016) and experience replay (Schaul et al., 2016) methods that we use as a starting point.

3.1 Reinforcement Learning

In reinforcement learning, an agent interacts with an environment E with (normalized) state s_E by choosing (normalized) actions a according to its policy π: a = π(s), where s is the agent's perception of the environment state.

To simplify the analysis in Sections 6 and 7, and to aid learning, we normalize the state and action spaces in our benchmarks such that s_E ∈ [−1, 1]^n and a_E ∈ [−1, 1]^m, where n

and m are the dimensions of the state and action spaces. We perform the (de)normalization on the connections between the agent and the environment, so the agent only deals with normalized states and actions.

Figure 1: Reinforcement learning scheme and symbols used.

We consider the dynamics of the environment to be deterministic: s'_E = f(s_E, a_E). Here, s'_E is the state of the environment at the next time step after applying action a_E in state s_E. Although the environment dynamics are deterministic, in some of our experiments we do consider sensor and actuator noise. In these cases, the state s that the agent perceives is perturbed from the actual environment state s_E by additive Gaussian noise:

s = s_E + N(0, σ_s). (1)

Similarly, actuator noise changes the actions sent to the environment according to:

a_E = a + N(0, σ_a). (2)

A reward function ρ describes the desirability of being in an unnormalized state s_unnorm and taking an unnormalized action a_unnorm: r_k = ρ(s_unnorm^k, a_unnorm^k, s_unnorm^{k+1}), where k indicates the time step. An overview of the different reinforcement learning signals and symbols used is given in Figure 1. The goal of the agent is to choose the actions that maximize the expected return from the current state, where the return is the discounted sum of future rewards: Σ_{k=0}^∞ γ^k r_k. The discount factor 0 ≤ γ < 1 keeps this sum finite and allows trading off short-term and long-term rewards.

Although we will come back to the effect of the sensor and actuator noise later on, in the remainder of this section we will look at the reinforcement learning problem from the perspective of the agent and consider the noise to be part of the environment. This makes the transition dynamics and reward functions stochastic: F(s' | s, a) and P(r | s, a, s').

3.2 Off-Policy Deep Actor-Critic Learning

In this paper we use the Deep Deterministic Policy Gradient (DDPG) reinforcement-learning method of Lillicrap et al.
(2016), with the exception of Section 6.3, where we compare it to DQN (Mnih et al., 2015). In the DDPG method, based on the work of Silver et al. (2014), a neural network with parameters θ_π implements the policy: a = π(s; θ_π). A second neural network with parameters θ_Q, the critic, is used to approximate the Q function. The Q^π(s, a) function gives the expected return when taking action a in state s and following

the policy π from the next time-step onwards:

Q^π(s, a) = E_π [ Σ_{k=0}^∞ γ^k r_k | s_0 = s, a_0 = a ]. (3)

The critic function Q(s, a; θ_Q) is trained to approximate the true Q^π(s, a) function by minimizing the squared temporal difference error δ for experience ⟨s_i, a_i, s'_i, r_i⟩:

δ_i = r_i + γ Q(s'_i, π(s'_i; θ'_π); θ'_Q) − Q(s_i, a_i; θ_Q), (4)

L_i(θ_Q) = δ_i², with gradient ∇_{θ_Q} L_i(θ_Q). (5)

The index i is a generic index for experiences that we will, in the following, use to indicate the index of an experience in a buffer. The parameter vectors θ'_π and θ'_Q are copies of θ_π and θ_Q that are updated with a low-pass filter to slowly track θ_π and θ_Q:

θ'_π ← (1 − τ)θ'_π + τ θ_π,
θ'_Q ← (1 − τ)θ'_Q + τ θ_Q,

with τ ∈ (0, 1), τ ≪ 1. This was found to be important for ensuring stability when using deep neural networks as function approximators in reinforcement learning (Mnih et al., 2015; Lillicrap et al., 2016). The parameters θ_π of the policy neural network π(s; θ_π) are updated in the direction that changes the action a = π(s; θ_π) in the direction for which the critic predicts the steepest ascent in the expected sum of discounted rewards:

∇_a Q(s_i, π(s_i; θ_π); θ_Q) ∇_{θ_π} π(s_i; θ_π). (6)

3.3 Experience Replay

The actor and critic neural networks are trained by using sample-based estimates of the gradients in a stochastic gradient optimization algorithm such as ADAM (Kingma and Ba, 2015). These algorithms are based on the assumption of independent and identically distributed (i.i.d.) data. This assumption is violated when the experiences ⟨s_i, a_i, s'_i, r_i⟩ in (5) and (6) are used in the same order during the optimization of the networks as they were observed by the agent. This is because subsequent samples are strongly correlated, since the world only changes slowly over time.

To solve this problem, an experience replay (Lin, 1992) buffer B with some finite capacity C can be introduced. Most commonly, experiences are written to this buffer in a First In First Out (FIFO) manner. When experiences are needed to train the neural networks, they are sampled uniformly at random from the buffer. This breaks the temporal correlations of the updates and restores the i.i.d. assumption of the optimization algorithms, which improves their performance (Mnih et al., 2015; Montavon et al., 2012). The increased stability comes in addition to the main advantage of experience replay, which is that experiences can be used multiple times for updates, increasing the sample efficiency of the algorithm.
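As a concrete illustration, the FIFO buffer with uniform sampling and the low-pass-filtered target update θ' ← (1 − τ)θ' + τθ can be sketched as follows. This is a minimal sketch: the class and function names are ours, not from the paper, and the parameters are represented as plain lists rather than network weights.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO experience buffer: once the capacity C is reached,
    each new experience overwrites the oldest one."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest element automatically.
        self.experiences = deque(maxlen=capacity)

    def add(self, s, a, s_next, r):
        self.experiences.append((s, a, s_next, r))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive experiences, restoring the approximate i.i.d.
        # assumption of the stochastic gradient optimizer.
        return random.sample(list(self.experiences), batch_size)

    def __len__(self):
        return len(self.experiences)

def soft_update(theta_target, theta, tau):
    """Low-pass filtered target update: theta' <- (1 - tau) theta' + tau theta."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(theta_target, theta)]
```

With τ ≪ 1 the target parameters change only slowly between updates, which is the stabilizing property discussed above.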

3.3.1 Prioritized Experience Replay

Although sampling experiences uniformly at random from the experience buffer is an easy default, the performance of reinforcement-learning algorithms can be improved by choosing the experience samples used for training in a smarter way. Here, we summarize one of the variants of Prioritized Experience Replay (PER) that was introduced by Schaul et al. (2016). Our enhancements to experience replay are given in Section 7.

The PER technique is based on the idea that the temporal difference error (4) provides a good proxy for the instantaneous utility of an experience. Schaul et al. (2016) argue that, when the critic made a large error on an experience the last time it was used in an update, there is more to be learned from the experience. Therefore, its probability of being sampled again should be higher than that of an experience associated with a low temporal difference error. In this work we consider the rank-based stochastic PER variant. In this method, the probability of sampling an experience i from the buffer is approximately given by:

P(i) ≈ (1 / rank(i))^α / Σ_j (1 / rank(j))^α. (7)

Here, rank(i) is the rank of sample i according to the absolute value of the temporal difference error δ in (4), calculated when the experience was last used to train the critic. All experiences that have not yet been used for training have δ = ∞, resulting in a large probability of being sampled. The parameter α determines how strongly the probability of sampling an experience depends on δ. We use α = 0.7 as proposed by Schaul et al. (2016) and have included a sensitivity analysis for different buffer sizes in Appendix 9.3. Note that the relation is only approximate, as sampling from this probability distribution directly is inefficient. For efficient sampling, (7) is used to divide the buffer B into S segments of equal cumulative probability, where S is taken as the number of experiences per training mini-batch. During training, one experience is sampled uniformly at random from each of the segments.

3.3.2 Importance Sampling

The estimation of an expected value with stochastic updates relies on those updates corresponding to the same distribution as its expectation. Schaul et al. (2016) proposed to compensate for the fact that the changed sampling procedure can affect the value of the expectation in (3) by multiplying the gradients (5) with an Importance Sampling (IS) weight:

ω_i = ( (1/C) · (1/P(i)) )^β. (8)

Here, β allows scaling between not compensating at all (β = 0) and fully compensating for the changes in the sample distribution caused by the sampling strategy (β = 1). In our experiments, when IS is used, we follow Schaul et al. (2016) in scaling β linearly per episode from β = 0.5 at the start of a learning run to β = 1 at the end of the learning run. C indicates the capacity of the buffer.
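Equations (7) and (8) can be illustrated directly. The sketch below uses our own function names and simply re-sorts the errors; a practical implementation would maintain the rank ordering incrementally instead.

```python
import numpy as np

def rank_based_probabilities(td_errors, alpha=0.7):
    """Rank-based sampling probabilities, eq. (7):
    P(i) proportional to (1 / rank(i))^alpha, rank 1 for the largest |delta|."""
    td = np.abs(np.asarray(td_errors, dtype=float))
    order = np.argsort(-td)                    # indices sorted by decreasing |delta|
    ranks = np.empty(len(td), dtype=int)
    ranks[order] = np.arange(1, len(td) + 1)   # rank 1 = most surprising experience
    p = (1.0 / ranks) ** alpha
    return p / p.sum()

def importance_weights(probabilities, capacity, beta):
    """Importance sampling weights, eq. (8):
    w_i = ((1/C) * (1/P(i)))^beta; beta = 0 applies no correction."""
    return (1.0 / (capacity * np.asarray(probabilities))) ** beta
```

For efficient sampling, the method described above additionally divides the buffer into S segments of equal cumulative probability and draws one experience per segment; that bookkeeping is omitted here.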

Not all changes to the sampling distribution need to be compensated for. Since we use a deterministic policy-gradient algorithm with a Q-learning critic, we do not need to compensate for the fact that the samples are obtained by a different policy than the one we are optimizing for (Silver et al., 2014). We can change the sampling distribution from the buffer without compensating for the change, so long as these samples accurately represent the transition and reward functions. Sampling based on the TD error can cause issues here, as infrequently occurring transitions or rewards will tend to be surprising. Replaying these samples more often will introduce a bias, which should be corrected through importance sampling.

However, the temporal difference error will also be partly caused by the function approximation error. These errors will be present even for a stationary sample distribution after learning has converged. The errors will vary over the state-action space and their magnitude will be related to the sample density. Sampling based on this part of the temporal difference error will make the function approximation accuracy more consistent over the state space. This effect might be unwanted when the learned controller will be tested on the same initial state distribution as it was trained on. In that case, it is preferable to have the function approximation accuracy be highest where the sample density is highest. However, when the aim is to train a controller that generalizes to a larger part of the state space, we might not want to use importance sampling to correct this effect.

Note that importance sampling based on the sample distribution over the state space is heuristically motivated and based on function approximation considerations. The motivation does not stem from reinforcement learning theory, where most methods assume that the Markov decision process is ergodic and that the initial state distribution does not factor into the optimal policy (Aslanides et al., 2017). In practice however, deep reinforcement-learning methods can be rather sensitive to the initial state distribution (Rajeswaran et al., 2017). Unfortunately, we do not know to what extent the temporal difference error is caused by the stochasticity of the environment dynamics and to what extent it is caused by function approximation errors. We will empirically investigate the use of importance sampling in Section 6.

4. Experimental Benchmarks

In this section, we discuss the two relatively simple control tasks that are considered in this paper, so that an understanding of their properties can be used in the following sections. The relative simplicity of these tasks enables a thorough analysis. We test our findings on more challenging benchmarks in Section 8.5. We perform our tests on two simulated control benchmarks: a pendulum swing-up task and a magnetic manipulation problem. Both were previously discussed by Alibekov et al. (2018). Although both represent dynamical systems with a two-dimensional state space, it will be shown in Section 6 that they are quite different when it comes to the optimal experience selection strategy. Here, a high-level description of these benchmarks is presented, with the full mathematical description given in Appendix 9.3.

The first task is the classic under-actuated pendulum swing-up problem, shown in Figure 2a. The pendulum starts out hanging down under gravity. The goal is to balance the pendulum in the upright position. The motor is torque limited such that a swing to one

Figure 2: The two benchmark problems considered in this paper. In the pendulum task (a), an underactuated pendulum needs to be swung up and balanced in the upright position by controlling the torque applied by a motor. In the magnetic manipulation (magman) task (b), a steel ball needs to be positioned by controlling the currents through four electromagnets. The magnetic forces exerted on the ball are a nonlinear function of the position and scale linearly with the actions a_1, ..., a_4, which represent the squared currents through the magnets.

side is needed to build up momentum before swinging towards the upright position in the opposite direction. Once the pendulum is upright, it needs to stabilize around this unstable equilibrium point. The state of the problem s_E consists of normalized versions of the angle θ and angular velocity θ̇ of the pendulum. The action space is a normalized version of the voltage applied to the motor that applies a torque to the pendulum. A reward is given at every time-step, based on the absolute distance of the state from the reference state of being upright with no rotational velocity.

The second benchmark is a magnetic manipulation (magman) task, in which the goal is to accurately position a steel ball on a 1-D track by dynamically changing a magnetic field. The relative magnitude and direction of the force that each magnet exerts on the ball is shown in Figure 2b. This force is linearly dependent on the actions, which represent the squared currents through the electromagnet coils. Normalized versions of the position x and velocity ẋ form the state space of the problem. A reward is given at every time-step, based on the absolute distance of the state from the reference state of having the ball at the fixed desired position.

In experiments where the buffer capacity C is limited, we take C = 10^4 experiences, unless stated otherwise.
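Both benchmarks use a per-step reward based on the absolute distance between the state and a reference state. As an illustration only (the actual reward functions and weights are specified in the paper's appendix, not here), such a reward can be written as:

```python
import numpy as np

def distance_reward(s, s_ref, weights):
    """Illustrative per-step reward: weighted negative absolute distance
    between the normalized state s and the reference state s_ref.
    The weights are hypothetical, not the values used in the paper."""
    s = np.asarray(s, dtype=float)
    s_ref = np.asarray(s_ref, dtype=float)
    return -float(np.dot(weights, np.abs(s - s_ref)))
```

For the pendulum, for instance, s_ref would be the upright position with zero angular velocity; for the magman, the fixed desired ball position with zero velocity.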
All our experiments have episodes which last four seconds. Unless stated otherwise, a sampling frequency of 50 Hz is used, which means the buffer can store 50 episodes of experience tuples ⟨s_i, a_i, s'_i, r_i⟩.

Since we are especially interested in physical control problems where sustained exhaustive exploration is infeasible, the amount of exploration is reduced over time from its maximum at episode 1 to a minimum level from episode 500 onwards in all our experiments. At the minimum level, the amplitude of the exploration noise we add to the neural network policy is 10% of the amplitude at episode 1. Details of the exploration strategies used are given in the appendix.

5. Performance Measures and Experience Selection Notation

This section introduces the performance measures used and the notation used to distinguish between the experience selection strategies.

5.1 Performance Measures

When we investigate the performance of the learning methods in Sections 6 and 8, we are interested in the effect that these methods might have on three aspects of the learning performance: the learning stability, the maximum controller performance and the learning speed. We define performance metrics for these aspects, related to the normalized mean reward per episode µ. The normalization is performed such that µ = 0 is the performance achieved by a random controller, while µ = 1 is the performance of the off-line dynamic programming method described in Appendix 9.3. This baseline method is, at least for the noise-free tests, proven to be close to optimal.

The first learning performance aspect we consider is the stability of the learning process. As we have discussed in previous work (de Bruin et al., 2015, 2016a), even when a good policy has already been learned, the learning process can become unstable and the performance can drop significantly when the properties of the training data change. We investigate to what extent different experience replay methods can help prevent this instability. We use the mean of µ over the last 100 episodes of each learning run, where the learning runs should have converged to good behavior already, as a measure of learning stability. We denote this measure by µ_final.

Although changing the data distribution might help stability, it could at the same time prevent us from accurately approximating the true optimal policy. Therefore we also report the maximum performance achieved per learning trial, µ_max.
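The µ_final and µ_max statistics can be computed directly from the normalized per-episode rewards of a single learning run; a minimal sketch, with a function name of our own choosing:

```python
import numpy as np

def summarize_run(mu_per_episode):
    """Summarize one learning run from its normalized per-episode rewards mu
    (mu = 0: random controller, mu = 1: dynamic-programming baseline).
    Returns mu_final, the mean over the last 100 episodes, and mu_max,
    the maximum performance reached during the run."""
    mu = np.asarray(mu_per_episode, dtype=float)
    mu_final = float(mu[-100:].mean())
    mu_max = float(mu.max())
    return mu_final, mu_max
```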
Finally, we want to know the effects of the experience selection methods on the learning speed. We therefore report the number of episodes before the learning method achieves a normalized mean reward per episode of µ = 0.8 and denote this by Rise-time 0.8.

For these performance metrics we report the means and the 95% confidence bounds of those means over 50 trials for each experiment. The confidence bounds are based on bootstrapping (Efron, 1992).

5.2 Experience Selection Strategy Notation

We consider the problem of experience selection, which we have defined as the combination of experience retention and experience sampling. The experience retention strategy determines which experiences are discarded when new experiences are available to a full buffer. The sampling strategy determines which experiences are used in the updates of the reinforcement-learning algorithm. We use the following notation for the complete experience selection strategy: retention strategy[sampling strategy]. Our abbreviations for the retention

de Bruin, Kober, Tuyls and Babuška

Notation   Proxy   Explanation
FIFO       age     The oldest experiences are overwritten with new ones.
FULL DB    -       The buffer capacity C is chosen to be large enough to retain all experiences.

Table 1: Commonly used experience retention strategies for deep reinforcement learning.

Notation   Proxy      Explanation
Uniform    -          Experiences are sampled uniformly at random.
PER        surprise   Experiences are sampled using rank-based stochastic prioritized experience replay based on the temporal difference error. See Section 3.3.
PER+IS     surprise   Sampling as above, but with weighted importance sampling to compensate for the distribution changes caused by the sampling procedure. See Section 3.3.

Table 2: Experience sampling strategies from the literature.

and sampling strategies commonly used in deep RL that were introduced in Section 3.3 are given in Tables 1 and 2 respectively. The abbreviations used for the new or uncommonly used methods introduced in Section 7 are given there, in Tables 4 and 5.

6. Analysis of Experience Utility

As previously noted by Schaul et al. (2016); Narasimhan et al. (2015); Pieters and Wiering (2016) and de Bruin et al. (2016a, 2015), when using experience replay, the criterion that determines which experiences are used to train the reinforcement learning agent can have a large impact on the performance of the method. The aim of this section is to investigate what makes an experience useful and how this usefulness depends on several identifiable characteristics of the control problem at hand.
In the following sections, we mention only some relevant aspects of our implementation of the deep reinforcement-learning methods, with more details given in the appendix.

6.1 The Limitations of a Single Proxy

To motivate the need for understanding how the properties of a control problem influence the applicability of different experience selection strategies, and the need for multiple proxies for the utility of experiences rather than one universal proxy, we compare the performance of the two strategies from the literature that were presented in Section 3.3 on the benchmarks described in Section 4.
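As a point of reference for the comparison that follows, the default FIFO[Uniform] combination from Tables 1 and 2 can be sketched as below (a minimal illustration; the class and method names are ours, not the paper's implementation):

```python
import random
from collections import deque

class FIFOUniformBuffer:
    """FIFO retention with uniform sampling: a full buffer overwrites its
    oldest experience, and training batches are drawn uniformly at random."""

    def __init__(self, capacity, rng=random):
        self.data = deque(maxlen=capacity)  # deque with maxlen drops the oldest
        self.rng = rng

    def add(self, experience):
        self.data.append(experience)  # FIFO retention

    def sample(self, batch_size):
        # Uniform sampling with replacement for a minibatch update.
        return [self.data[self.rng.randrange(len(self.data))]
                for _ in range(batch_size)]
```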

Figure 3: Comparison of the state-of-the-art (FULL DB[PER]) and the default method (FIFO[Uniform]) for experience selection on our two benchmark problems (pendulum swing-up and magman).

The first experience selection strategy tested is FIFO[Uniform]: overwriting the oldest experiences when the buffer is full and sampling uniformly at random from the buffer. We compare this strategy to the state-of-the-art prioritized experience replay method FULL DB[PER] by Schaul et al. (2016). Here, the buffer capacity C is chosen such that all experiences are retained during the entire learning run (C = N for this test).[1] The sampling strategy is the rank-based stochastic prioritized experience replay strategy as described in Section 3.3. The results of the experiments are shown in Figure 3.

Figure 3 shows that the FULL DB[PER] method, which samples training batches based on the temporal difference error from a buffer that is large enough to contain all previous experiences, works well for the pendulum swing-up task. The method very reliably finds a near optimal policy. The FIFO[Uniform] method, which keeps only the experiences from the last 50 episodes in memory, performs much worse. As we reported previously (de Bruin et al., 2016a), the performance degrades over time as the amount of exploration is reduced and the experiences in the buffer fail to cover the state-action space sufficiently.

If we look at the result on the magman benchmark in Figure 3, the situation is reversed. Compared to simply sampling uniformly from the most recent experiences, sampling from all previous experiences according to their temporal difference error limits the final performance significantly. As shown in Appendix 9.3, this is not simply a matter of the function approximator capacity, as even much larger networks trained on all available data are outperformed by small networks trained on only recent data.
When choosing an experience selection strategy for a reinforcement learning task, it therefore seems important to have some insights into how the characteristics of the task determine the need for specific kinds of experiences during training. We will investigate some of these characteristics below.

[1] Schaul et al. (2016) use a FIFO database with a capacity of 10^6 experiences. We here denote this as FULL DB since all our experiments use a smaller number of time-steps in total.

6.2 Generalizability and Sample Diversity

One important aspect of the problem, which at least partly explains the differences in performance for the two methods on the two benchmarks in Figure 3, is the complexity of generalizing the value function and policy across the state and action spaces.

For the pendulum task, learning actor and critic functions that generalize across the entire state and action spaces will be relatively simple, as a sufficiently deep neural network can efficiently exploit the symmetry in the value and policy functions (Montufar et al., 2014). Figure 4b shows the learned policy after 100 episodes for a learning run with FIFO[Uniform] experience selection. Due to the thorough initial exploration, the experiences in the buffer cover much of the state-action space. As a result, a policy has been learned that is capable of swinging the pendulum up and stabilizing it in both the clockwise and anticlockwise directions, although the current policy favors one direction over the other. For the next 300 episodes this favored direction does not change and, as the amount of exploration is decayed, the experiences in the buffer become less diverse and more centered around this favored trajectory through the state-action space. Even though the information on how to further improve the policy becomes increasingly local, the updates to the network parameters can cause the policy to be changed over the whole state space, as neural networks are global function approximators. This can be seen in Figure 4d, where the updates that further refine the policy for swinging up in the currently preferred direction have removed the previously obtained skill of swinging up in the opposite direction. The policy has suffered from catastrophic forgetting (Goodfellow et al., 2013) and has over-fitted to the currently preferred swing-up direction.
For the pendulum swing-up task, this over-fitting is particularly risky since the preferred swing-up direction can and does change during learning, as both directions are equivalent with respect to the reward function. When this happens, the FIFO experience retention method can cause the data distribution in the buffer to change rapidly, which by itself can cause instability. In addition, the updates (4) and (6) now use the critic Q(s, a; θ^Q) in regions of the state-action space that it has not been trained on in a while, resulting in potentially bad gradients. Both of these factors might destabilize the learning process. This can be seen in Figure 4f where, after the preferred swing-up direction has rapidly changed a few times, the learning process is destabilized and the policy has deteriorated to the point that it no longer accomplishes the balancing task. By keeping all experiences in memory and ensuring the critic error δ stays low over the entire state-action space, the FULL DB[PER] method largely avoids these learning stability issues. We believe that this accounts for the much better performance on this benchmark shown in Figure 3.

For the magman task, a policy that generalizes over the whole state-space might be harder to find. This is because the effects of the actions, shown as the colored lines in Figure 2b, are strongly nonlinear functions of the (position) state. The actor and critic functions must therefore be very accurate for the states that are visited under the policy. Requiring the critic to explain all of the experiences that have been collected so far might limit the ability of the function approximators to achieve sufficient accuracy for the relevant states.

Figure 4: The critic Q(s, π(s; θ^π); θ^Q) and actor π(s; θ^π) functions trained on the pendulum swing-up task using FIFO[Uniform] experience selection, shown over the position-velocity state space: (a) critic and (b) actor at episode 100, (c) critic and (d) actor at a later episode, (e) critic and (f) actor at episode 507. The surfaces represent the functions. The black dots show the trajectories through the state-action space resulting from deterministically following the current policy. The red and blue lines show respectively the positive and negative forces that shape the surfaces, caused by the experiences in the buffer: for the critic these are δ(s, a) (note a ≠ π(s; θ^π)); for the actor these forces represent ∂Q(s, π(s; θ^π); θ^Q)/∂a. Animations of these graphs for different experience selection strategies are available at https://youtu.be/hli1ky0bgt4. The episodes are chosen to illustrate the effect of reduced sample diversity described in Section 6.2.

Figure 5: The effect on the mean performance during the last 100 episodes of the learning runs, µ_final, of the FIFO[Uniform] method when replacing a fraction of the observed experiences with synthetic experiences (none / state / action), for different buffer sizes.

6.2.1 Buffer Size and Synthetic Sample Fraction

To test the hypothesis that the differences in performance observed in Figure 3 revolve around sample diversity, we will artificially alter the sample diversity and investigate how this affects the reinforcement learning performance. We will do so by performing the following experiment. We use the plain FIFO[Uniform] method as a baseline. However, with a certain probability we make a change to an experience (s_i, a_i, s'_i, r_i) before it is written to the buffer. We change either the state s_i or the action a_i. The changed states and actions are sampled uniformly at random from the state and action spaces. When the state is re-sampled, the action is recalculated as the policy action for the new state, including exploration. In both cases, the next state and reward are recalculated to complete the altered experience. To calculate the next state and reward, we use the real system model. This is not possible for most practical problems; it serves here merely to gain a better understanding of the need for sample diversity.

The results of performing this experiment for different probabilities and buffer sizes are given in Figures 5 and 6. Interestingly, for the pendulum swing-up task, changing some fraction of the experiences to be more diverse improves the stability of the learning method dramatically, regardless of whether the diversity is in the states or in the actions. The effect is especially noticeable for smaller experience buffers.
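The alteration procedure described above can be sketched as follows. The `model` and `policy` callables and the normalized [-1, 1] ranges are assumptions for illustration; as noted in the text, the true model used here is unavailable for most practical problems.

```python
import random

def maybe_synthesize(experience, policy, model, mode, p, rng=random):
    """With probability p, replace the state or the action of an experience
    with a uniform random sample before it is written to the buffer.
    model(s, a) -> (s_next, reward) is the true system model (assumed);
    policy(s) returns the policy action including exploration (assumed)."""
    s, a, s_next, r = experience
    if mode == 'none' or rng.random() >= p:
        return s, a, s_next, r
    if mode == 'state':
        s = rng.uniform(-1.0, 1.0)  # re-sample the state
        a = policy(s)               # recompute the (exploratory) policy action
    elif mode == 'action':
        a = rng.uniform(-1.0, 1.0)  # re-sample the action, keep the state
    s_next, r = model(s, a)         # complete the altered experience
    return s, a, s_next, r
```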

Figure 6: The effects on the learning performance of the FIFO[Uniform] method when replacing a fraction of the observed experiences with synthetic experiences, for different buffer sizes. (a) Effect on the number of episodes needed to reach µ = 0.8 (Rise-time 0.8). (b) Effect on the maximum controller performance obtained per learning run, µ_max.

For the magman benchmark, as expected, having more diverse states reduces the performance significantly. Having a carefully chosen fraction of more diverse actions in the original states can however improve the stability and learning speed slightly. This can be explained by the fact that even though the effects of the actions are strongly nonlinear in the state-space, they are linear in the action space. Generalizing across the action space might thus be more straightforward, and it is helped by having the training data spread out over this domain.

6.3 Reinforcement-Learning Algorithm

The need for experience diversity also depends on the algorithm that is used to learn from those experiences. In the rest of this work we exclusively consider the DDPG actor-critic algorithm, as the explicitly parameterized policy enables continuous actions, which makes it especially suitable for control. An alternative to using continuous actions is to discretize the action space. In this subsection, we compare the need for diverse data of the actor-critic DDPG algorithm (Lillicrap et al., 2016; Silver et al., 2014) to that of the closely related critic-only DQN algorithm (Mnih et al., 2015). The experiments are performed on the pendulum benchmark, where the one-dimensional action is divided uniformly into 15 discrete actions. Results for the magman benchmark are omitted as the four-dimensional action space makes discretization impractical.

For the actor-critic scheme to work, the critic needs to learn a general dependency of the Q-values on the states and actions. For the DQN critic, this is not the case, as the Q-values for different actions are separate. Although the processing of the inputs is shared, the algorithm can learn at least partially independent value predictions for the different

(a) Effect on the mean performance during the last 100 episodes of the learning runs, µ_final.
(b) Effect on the number of episodes needed to reach µ = 0.8.

Figure 7: RL-algorithm-dependent effect of adding synthetic experiences to the FIFO[Uniform] method. Experiments on the pendulum benchmark. The effect on µ_max is given in Figure 22 in the appendix.

actions. These functions additionally do not need to be correct, as long as the optimal action in a state has a higher value than the sub-optimal actions.

These effects can be seen in Figure 7. The DDPG algorithm can make more efficient use of the state-action space samples by learning a single value prediction, resulting in significantly faster learning than the DQN algorithm. The DDPG algorithm additionally benefits from more diverse samples, with the performance improving for higher fractions of randomly sampled states or actions. The DQN algorithm conversely seems to suffer from a more uniform sampling of the state-action space. This could be because it is now tasked with learning accurate mappings from the states to the state-action values for all actions. While doing so might not help to improve the predictions in the relevant parts of the state-action space, it could increase the time required to learn the function and limit the function approximation capacity available for those parts of the state-space where the values need to be accurate. Note again that learning precise Q-values for all actions over the whole state-space is not needed, as long as the optimal action has the largest Q-value. Due to the better scalability of policy-gradient methods in continuous control settings, we exclusively consider the DDPG algorithm in the remainder of this work.

6.3.1 Sample Age

In the model-free setting it is not possible to add synthetic experiences to the buffer. Instead, in Section 7 we will introduce ways to select real experiences that have desirable properties and should be remembered for a longer time and replayed more often. This will inevitably mean that some experiences are used more often than others, which could have detrimental effects, such as that the learning agent could over-fit to those particular experiences.
Figure 8: The effects on µ_final of the FIFO[Uniform] method when replacing a fraction of the observed experiences with synthetic experiences, when the synthetic experiences are updated only with a certain probability each time they are overwritten. The effects on µ_max and the rise-time are given in Figure 24 in the appendix.

To investigate the effects of adding older experiences for diversity, we perform the following experiment. As before, a FIFO buffer is used with a certain fraction of synthetic experiences. However, when a synthetic experience is about to be overwritten, we only sample a new synthetic experience with a certain probability. Otherwise, the experience is left unchanged. The result of this experiment is shown in Figure 8. For the pendulum benchmark, old experiences only hurt when they were added to provide diversity in the action space in states that were visited by an older policy. For the magman benchmark, the age of the synthetic experiences is not seen to affect the learning performance.

6.4 Sampling Frequency

An important property of control problems that can influence the need for experience diversity is the frequency at which the agent needs to produce control decisions. The sampling frequency of a task is something that is often considered a given property of the environment in reinforcement learning. For control tasks however, a sufficiently high sampling frequency can be crucial for the performance of the controller and for disturbance rejection (Franklin et al., 1998). At the same time, higher sampling frequencies can make reinforcement learning more difficult, as the effect of taking an action for a single time-step diminishes for increasing sampling frequencies (Baird, 1994). Since the sampling rate can be an important hyperparameter to choose, we investigate whether changing it changes the diversity demands for the experiences to be replayed.

In Figure 9, the performance of the FIFO[Uniform] method is shown for different sampling frequencies, with and without synthetic samples. The first thing to note is that, as expected, low sampling frequencies limit the controller performance.
Figure 9: Sampling-frequency-dependent effect of adding synthetic experiences to the FIFO[Uniform] method. (a) Effect on the mean performance during the last 100 episodes of the learning runs, µ_final. (b) Effect on maximum controller performance per episode, µ_max. The effect on the rise time is given in Figure 23 in the appendix.

Figure 10: The effect of synthetic actions and of stochastically preventing experiences from being written to the buffer [DE] for the FIFO[Uniform] method on the benchmarks with increased sampling frequencies (pendulum swing-up at 100 Hz, magman at 200 Hz).

Interestingly, much of the performance loss on the pendulum at low frequencies can be prevented through increased sample diversity. This indicates that on this benchmark most of the performance loss at the tested control frequencies stems from the learning process rather than from fundamental control limitations. When increasing the sampling frequencies beyond our baseline frequency of 50 Hz, sample diversity becomes more important for both stability and performance. For the pendulum swing-up it can be seen that as the sampling frequency increases further, increased diversity in the state-space becomes more important. For the magman, adding synthetic action samples has clear benefits. This is very likely related to the idea that the effects of actions become harder to distinguish for higher sampling frequencies (Baird, 1994; de Bruin et al., 2016b).

There are several possible additional causes for the performance decrease at higher frequencies. The first is that by increasing the sampling frequency, we have increased the number of data points that are obtained and learned from per episode. Yet the amount of information that the data contains has not increased by the same amount. Since the buffer capacity is kept equal, the amount of information that the buffer contains has decreased and the learning rate has effectively increased. To compensate for these specific effects, experiments are performed in which samples are stochastically prevented from being written to the buffer with a probability proportional to the increase in sampling frequency. The results of these experiments are indicated with [DE] (dropped experiences) in Figure 10 and are indeed better, but still worse than the performance for lower sampling frequencies.
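The [DE] correction can be sketched as a write probability inversely proportional to the increase in sampling frequency (the function names are ours; the 50 Hz baseline is the one used in the experiments above):

```python
import random

BASELINE_HZ = 50.0  # baseline sampling frequency of the experiments

def write_probability(f_new, f_base=BASELINE_HZ):
    """[DE]: write a new experience to the buffer with probability
    f_base / f_new, so the buffer fills at roughly the baseline rate."""
    return min(1.0, f_base / f_new)

def should_write(f_new, f_base=BASELINE_HZ, rng=random):
    """Stochastically decide whether the current sample enters the buffer."""
    return rng.random() < write_probability(f_new, f_base)
```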
The second potential reason for the drop in performance is that we have changed the problem definition by changing the sampling frequency. This is because the forgetting factor γ determines how far into the future we consider the effects of our actions according to:

    γ = e^(−T_s / τ_γ),

where T_s is the sampling period in seconds and τ_γ is the lookahead horizon in seconds. To keep the same lookahead horizon, we recalculate γ, which is 0.95 in our other experiments (T_s = 0.02), to be γ_pendulum = 0.95^(1/2) ≈ 0.975 (T_s = 0.01) and γ_magman = 0.95^(1/4) ≈ 0.987 (T_s = 0.005). To keep the scale of the Q functions the same, which prevents large gradients, the rewards are scaled down. Correcting the lookahead horizon was found to hurt performance on both benchmarks. The likely cause of this is that higher values of γ increase the dependence on the biased estimation of Q over the unbiased immediate reward signal (see Equation (4)). This can cause instability (François-Lavet et al., 2015).

Figure 11: Experiments with altered experiences and sensor and actuator noise. Results are from the last 100 episodes of 50 learning runs. A description of the performance measures is given in Section 5.1.

6.5 Noise

The final environment property that we consider is the presence of sensor and actuator noise. So far, the agent has perceived the (normalized) environment state exactly and its (de-normalized) chosen actions have been implemented without change. Now we consider Equations (1) and (2) with σ_s = σ_a ∈ {0, 0.01, 0.02, 0.05}. The results of performing these experiments are shown in Figure 11. The results indicate that the need for data diversity is not dependent on the presence of noise. However, in Section 8.3 it will be shown that the methods used to determine which experiences are useful can be affected by noise.

6.6 Summary

This section has presented an investigation into how different aspects of the reinforcement learning problem at hand influence the need for experience diversity. In Table 3 a summary is given of the investigated aspects and the strength of their effect on the need for experience diversity. While this section has used the true environment model to examine the potential

Property             Effect      Explanation
Benchmark            Very high   The need for diverse states and actions largely depends on the ease and importance of generalizing across the state-action space, which is benchmark dependent.
RL algorithm         Very high   Generalizing across the action space is fundamental to actor-critic algorithms, but not to critic-only algorithms with discrete action spaces.
Sampling frequency   High        The stability of RL algorithms depends heavily on the sampling frequency. Experience diversity can help learning stability. Having diverse actions at higher frequencies might be crucial as the size of their effect on the observed returns diminishes.
Buffer size          Medium      Small buffers can lead to rapidly changing data distributions, which causes unstable learning. Large buffers have more inherent diversity.
Sample age           Low         Although retaining old samples could theoretically be problematic, these problems were not clearly observable in practice.
Noise                None        The presence of noise was not observed to influence the need for experience diversity, although it can influence experience selection strategies, as will be shown in Section 8.3.

Table 3: The dependence of the need for diverse experiences on the investigated environment and reinforcement learning properties.

benefits of diversity, the next section will propose strategies to obtain diverse experiences in ways that are feasible on real problems.

7. New Experience-Selection Strategies

For the reasons discussed in Section 2, we do not consider changing the stream of experiences that an agent observes by either changing the exploration or by generating synthetic experiences. Instead, to be able to replay experiences with desired properties, valuable experiences need to be identified, so that they can be retained in the buffer and replayed from it. In this section we look at how several proxies for the utility of experiences can be used in experience selection methods.

7.1 Experience Retention

Although we showed in Section 6.4 that high sampling rates might warrant dropping experiences, in general we assume that each new experience has at least some utility. Therefore, unless stated otherwise, we will always write newly obtained experiences to the buffer. When the buffer is full, this means that we need some metric that can be used to decide which experiences should be overwritten.

7.1.1 Experience Utility Proxies

A criterion used to manage the contents of an experience replay buffer should be cheap enough to calculate,[2] should be a good proxy for the usefulness of the experiences, and should not depend on the learning process in a way that would cause a feedback loop and possibly might destabilize that learning process. We consider three criteria for overwriting experiences.

Age: The default and simplest criterion is age. Since the policy is constantly changing and we are trying to learn its current effects, recent experiences might be more relevant than older ones. This (FIFO) criterion is computationally as cheap as it gets, since determining which experience to overwrite involves simply incrementing a buffer index. For smaller buffers, this does however make the buffer contents quite sensitive to the learning process, as a changing policy can quickly change the distribution of the experiences in the buffer. As seen in Figure 4, this can lead to instability. Besides FIFO, we also consider reservoir sampling (Vitter, 1985). When the buffer is full, new experiences are added to it with a probability C/i, where i is the index of the current experience. If the experience is written to the buffer, the experience it replaces is chosen uniformly at random. Note that this is the only retention strategy we consider that does not write all new experiences to the buffer. Reservoir sampling ensures that at every stage of learning, each experience observed so far has an equal probability of being in the buffer. As such, initial exploratory samples are kept in memory and the data distribution converges over time.
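The reservoir retention rule just described can be sketched as follows (a minimal illustration after Vitter (1985); the class is ours, not the paper's implementation):

```python
import random

class ReservoirBuffer:
    """Reservoir sampling: after i observed experiences, each one has the
    same probability C/i of currently being in a buffer of capacity C."""

    def __init__(self, capacity, rng=random):
        self.capacity = capacity
        self.data = []
        self.count = 0  # experiences observed so far
        self.rng = rng

    def add(self, experience):
        self.count += 1
        if len(self.data) < self.capacity:
            self.data.append(experience)
        elif self.rng.random() < self.capacity / self.count:
            # Write with probability C/i, overwriting a uniformly chosen slot.
            self.data[self.rng.randrange(self.capacity)] = experience
```

Early exploratory samples thus survive with the same probability as recent ones, and the buffer distribution stabilizes as learning progresses.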
These properties are shared with the FULL DB strategy, without needing the same amount of memory. The method might in some cases even improve the learning stability compared to using a full buffer, as the data distribution converges faster. However, when the buffer is too small this convergence can be premature, resulting in a buffer that does not adequately reflect the policy distribution. This can seriously compromise the learning performance.

Surprise: Another possible criterion is the unexpectedness of the experience, as measured by the temporal difference error δ from (4). The success of the Prioritized Experience Replay (PER) method of Schaul et al. (2016) shows that this can be a good proxy for the utility of experiences. Since the values have to be calculated to update the critic, the computational cost is very small if we accept that the utility values might not be

[2] We have discussed the need for experience diversity in Section 6 and we have previously proposed overwriting a buffer in a way that directly optimized for diversity (de Bruin et al., 2016a). However, calculating the experience density in the state-action space is very expensive and therefore prohibits using the method on anything but small-scale problems.

current, since they are only updated for experiences that are sampled. The criterion is however strongly linked with the learning process, as we are actively trying to minimize δ. This means that, when the critic is able to accurately predict the long-term rewards of the policy in a certain region of the state-action space, these samples can be overwritten. If the predictions of the critic later become worse in this region, there is no way of getting these samples back. An additional problem might be that the error according to (4) will be caused partially by state and actuator noise. Keeping experiences for which the temporal difference error is high might therefore cause the samples saved in the buffer to be more noisy than necessary.

Exploration: We introduce a new criterion based on the observation that problems can occur when the amount of exploration is reduced. On physical systems that are susceptible to damage or wear, or for tasks where adequate performance is required even during training, exploration can be costly. This means that preventing the problems caused by insufficiently diverse experiences observed in Section 6 simply by sustained thorough exploration might not be an option. We therefore view the amount of exploration performed during an experience as a proxy for its usefulness. We take the 1-norm of the deviation from the policy action to be the usefulness metric. In our experiments on the small-scale benchmarks we follow the original DDPG paper (Lillicrap et al., 2016) in using an Ornstein-Uhlenbeck noise process added to the output of the policy network. The details of the implementation are given in Appendix 9.3. In the experiments in Section 8.5, a copy of the policy network with noise added to the parameters is used to calculate the exploratory actions (Plappert et al., 2018). For discrete actions, the cost of taking exploratory actions could be used as a measure of experience utility as well. The inverse of the probability of taking an action could be seen as a measure of the cost of the action.
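The exploration-based utility proxy can be sketched as below. The Ornstein-Uhlenbeck parameters are illustrative defaults, not the paper's settings; only the 1-norm deviation metric is taken from the text.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise added to the policy output,
    as in DDPG (Lillicrap et al., 2016). theta/sigma are placeholders."""

    def __init__(self, dim, theta=0.15, sigma=0.2, rng=None):
        self.theta, self.sigma = theta, sigma
        self.x = np.zeros(dim)
        self.rng = np.random.default_rng(rng)

    def sample(self):
        # Mean-reverting random walk: correlated noise across time-steps.
        self.x += -self.theta * self.x + self.sigma * self.rng.normal(size=self.x.shape)
        return self.x.copy()

def exploration_utility(action_taken, policy_action):
    """Utility proxy: 1-norm of the deviation from the policy action."""
    return float(np.abs(np.asarray(action_taken) - np.asarray(policy_action)).sum())
```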
It could also be worth investigating the use of a low-pass filter, as a series of (semi-)consecutive exploratory actions would be more likely to result in states that differ from the policy distribution in a meaningful way. These ideas are not tested here, as we only consider continuous actions in the remainder of this work. Note that the size of the exploration signal is the deviation of the chosen action in a certain state from the policy action for that state. Since the policy evolves over time, we could recalculate this measure of deviation from the policy actions per experience at a later time. Although we have investigated using this policy-deviation proxy previously (de Bruin et al., 2016b), we found empirically that using the strength of the initial exploration yields better results. This can partly be explained by the fact that recalculating the policy deviation makes the proxy dependent on the learning process, and partly by the fact that sequences with more exploration also result in different states being visited.

7.1.2 Stochastic Experience Retention Implementation

For the temporal difference error and exploration-based experience retention methods, keeping some experiences in the buffer indefinitely might lead to over-fitting to these samples.

Notation   Proxy         Explanation
Expl(α)    Exploration   Experiences with the least exploration are stochastically overwritten with new ones.
TDE(α)     Surprise      Experiences with the smallest temporal difference error are stochastically overwritten with new ones.
Resv       Age           The buffer is overwritten such that each experience observed so far has an equal probability of being in the buffer.

Table 4: New and uncommon experience retention strategies considered in this work.

Notation      Proxy      Explanation
Uniform+FIS   -          Experiences are sampled uniformly at random; FIS (Section 7.2) is used to account for the distribution changes caused by the retention policy.
PER+FIS       Surprise   Experiences are sampled using rank-based stochastic prioritized experience replay based on the temporal difference error. Full importance sampling is used to account for the distribution changes caused by both the retention and sampling policies.

Table 5: New experience sampling strategies considered in this work.

Additionally, although the overwrite metric we choose might provide a decent proxy for the usefulness of experiences, we might still want to be able to scale the extent to which we base the contents of the buffer on this proxy. We therefore use the same stochastic rank-based selection criterion of (7) suggested by Schaul et al. (2016), but now to determine which experience in the buffer is overwritten by a new experience. We denote this as TDE(α) for the temporal-difference-based retention strategy and Expl(α) for the exploration-based policy. Here, α is the parameter in (7) which determines how strongly the buffer contents will be based on the chosen utility proxy. A sensitivity analysis of α for both Expl and PER is given in Appendix 9.3.
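A minimal sketch of such a stochastic rank-based overwrite, assuming the rank-based form p_i ∝ (1/rank_i)^α of Schaul et al. (2016) for the selection probabilities in (7), with rank 1 given to the least useful experience:

```python
import numpy as np

def stochastic_rank_overwrite(utilities, alpha, rng=None):
    """Return the index of the buffer slot to overwrite. Low-utility
    experiences (little exploration for Expl(α), small TD error for TDE(α))
    get rank 1 and are the most likely to be replaced; alpha scales how
    strongly the choice follows the utility proxy (alpha = 0 is uniform)."""
    utilities = np.asarray(utilities, dtype=float)
    ranks = np.empty(utilities.size)
    ranks[np.argsort(utilities)] = np.arange(1, utilities.size + 1)
    p = (1.0 / ranks) ** alpha  # rank-based selection probabilities
    p /= p.sum()
    rng = np.random.default_rng(rng)
    return int(rng.choice(utilities.size, p=p))
```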
The notation used for the new experience retention strategies is given in Table 4.

7.2 Experience Sampling

For the choice of proxy when sampling experiences from the buffer, we consider the available methods from the literature: sampling either uniformly at random [Uniform], using stochastic rank-based prioritized experience replay [PER], and combining this with weighted importance sampling [PER+IS]. Given a buffer that contains useful experiences, these methods have been shown to work well. We therefore focus on investigating how the experience retention and experience sampling strategies interact. In this context we introduce a weighted importance sampling method that accounts for the full experience selection strategy.

Importance sampling according to (8) can be used when performing prioritized experience replay from a buffer that contains samples with a distribution that is unbiased with respect to the environment dynamics. When this is not the case, we might need to compensate for the effects of changing the contents of the buffer, potentially in addition to the current change in the sampling probability. The contents of the buffer might be the result of many subsequent retention probability distributions. Instead of keeping track of all of these, we compensate for both the retention and sampling probabilities by using the number of times an experience in the buffer has actually been replayed. When replaying an experience i for the K-th time, we relate the importance weight to the probability, under uniform sampling from a FIFO buffer, of sampling an experience X times, where X is at least K: P(X ≥ K | FIFO[Uniform]). We refer to this method as Full Importance Sampling (FIS) and calculate the weights according to:

    \omega_i^{FIS} = \left( \frac{P(X \geq K \mid \text{FIFO[Uniform]})}{\frac{1}{np} \sum_{j=1}^{np} P(X \geq j \mid \text{FIFO[Uniform]})} \right)^{\beta}

Here, n is the lifetime of an experience for a FIFO retention strategy in the number of batch updates, which is the number of batch updates performed so far when the buffer is not yet full. The probability of sampling an experience during a batch update when sampling uniformly at random is denoted by p. Note that np is the expected number of replays per experience, which following Schaul et al. (2016) we take as 8 by choosing the number of batch updates per episode accordingly. As with standard prioritized experience replay, we use β to scale between not correcting for the changes (β = 0) and correcting fully (β = 1).
Since the probability of being sampled at least K times is always smaller than one for K > 0, we scale the weights such that the sum of the importance weights for the expected np replays under FIFO[Uniform] sampling is the same as when not using the importance weights (np · 1). The probability of sampling an experience at least K times under FIFO[Uniform] sampling is calculated using the binomial distribution:

    P(X \geq K \mid \text{FIFO[Uniform]}) = 1 - \sum_{k=0}^{K-1} \binom{n}{k} p^k (1-p)^{n-k}.

Correcting fully (β = 1) for the changed distributions would make the updates as unbiased as those from the unbiased FIFO[Uniform] distribution (Needell et al., 2016). However, since the importance weights of experiences that are repeatedly sampled for stability will quickly go to zero, it might also undo the stabilizing effects that were the intended outcome of changing the distribution in the first place. Additionally, as discussed in Section 3.3.2, the FIFO[Uniform] distribution is not the only valid distribution. As demonstrated in Section 8.4, it is therefore important to determine whether compensating for the retention strategy is necessary before doing so. The notation for the selection strategies with this form of importance sampling is given in Table 5.
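Under our reading of the two equations above (exact binomial tail, β as the outer exponent), the FIS weights can be computed as in the following sketch; the helper names are our own and this is not the authors' implementation:

```python
import math

def p_at_least(K, n, p):
    """P(X >= K) for X ~ Binomial(n, p): the probability that an experience is
    replayed at least K times during its n-update lifetime under FIFO[Uniform]
    sampling with per-update sampling probability p."""
    return 1.0 - sum(math.comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(K))

def fis_weight(K, n, p, beta=1.0):
    """Full Importance Sampling weight for the K-th replay of an experience.

    The denominator normalizes so that the np expected replays under
    FIFO[Uniform] keep an average weight of one; beta scales between no
    correction (beta = 0) and full correction (beta = 1).
    """
    expected_replays = round(n * p)  # np, taken as 8 in the experiments
    norm = sum(p_at_least(j, n, p) for j in range(1, expected_replays + 1)) / expected_replays
    return (p_at_least(K, n, p) / norm) ** beta
```

As K grows past the expected number of replays, the weight shrinks, down-weighting experiences that have been replayed far more often than FIFO[Uniform] sampling would predict.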

Figure 12: Performance of the experience selection methods under the default conditions of moderate sampling frequencies and no state or actuator noise, for (a) the swing-up and (b) the magman benchmark. The measures shown are µ_final, µ_max and the 0.8 rise-time in episodes; a description of the performance measures is given in Section 5.1.

8. Experience Selection Results

Using the experience retention and sampling methods discussed in Section 7, we revisit the scenarios discussed in Section 6. We first focus on the methods without importance sampling, which we discuss separately in Section 8.4. Besides the tests on the benchmarks of Section 4, we also show results on six additional benchmarks in Section 8.5. There we also discuss how to choose the size of the experience buffer.

8.1 Basic Configuration

We start by investigating how these methods perform on the benchmarks in their basic configuration, with a sampling rate of 50 Hz and no sensor or actuator noise. The results are given in Figure 12 and show that it is primarily the combination of retention method and buffer size that determines the performance. It is again clear that the best choice here depends on the benchmark. On the pendulum benchmark, where storing all experiences works well, the Resv method works equally well while storing only 10^4 experiences, which equals 50 of the 3000 episodes. On the magman benchmark, using a small buffer with only recent experiences works better than any other method. Sampling according to the temporal difference error can be seen to benefit primarily the learning speed on the pendulum. On the magman, PER only speeds up the learning process when sampling from recent experiences. When sampling from diverse experiences, PER will attempt to make the function approximation errors more even across the state-action space, which, as discussed before, hurts performance on this benchmark.

8.2 Effect of the Sampling Frequency

For higher sampling frequencies, the performance of the different experience selection methods is shown in Figure 13. We again see that higher sampling frequencies place different demands on the training data distribution. With the decreasing exploration, retaining the right experiences becomes important. This is most visible on the magman benchmark, where FIFO retention, which resulted in the best performance at the end of training for the base sampling frequency, now performs worst. Retaining all experiences works well on both benchmarks. When not all experiences can be retained, the reservoir retention method is still a good option here, with the exploration-based method a close second.

8.3 Sensor and Actuator Noise

We also test the performance of the methods in the presence of noise, similarly to Section 6.5. The main question here is how the noise might affect the methods that use the temporal difference error δ as the usefulness proxy. The concern is that these methods might favor noisy samples, since these samples might cause bigger errors.
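The Resv retention strategy that holds up well in these comparisons keeps each experience observed so far in the buffer with equal probability, which matches the textbook reservoir-sampling update. The following is a minimal sketch assuming a plain list-backed buffer (not the authors' implementation):

```python
import random

def reservoir_insert(buffer, capacity, experience, n_seen, rng=random):
    """Reservoir retention update (Algorithm R).

    After this call, each of the n_seen + 1 experiences observed so far has
    probability min(1, capacity / (n_seen + 1)) of being in the buffer.
    Returns the updated count of observed experiences.
    """
    if len(buffer) < capacity:
        buffer.append(experience)           # buffer not yet full: always keep
    else:
        j = rng.randrange(n_seen + 1)       # uniform over all experiences seen
        if j < capacity:
            buffer[j] = experience          # overwrite a uniformly chosen slot
    return n_seen + 1
```

Used in a learning loop as `n_seen = reservoir_insert(buf, 10_000, e, n_seen)`, this maintains an unbiased sample of the whole learning history with a fixed memory budget.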
To test this, we perform learning runs on the pendulum task while collecting statistics on all of the experiences in the mini-batches that are sampled for training. The mean absolute values of the noise in the experiences that are sampled are given in Table 6. It can be seen that the temporal difference error-based methods indeed promote noisy samples. The noise is highest for those dimensions that have the largest influence on the value of Q. In Figure 14, the performance of the different methods on the two benchmarks with noise is given. The tendency to seek out noisy samples in the buffer is now clearly hurting the performance of PER sampling, as the performance with PER is consistently worse than with uniform sampling. For our chosen buffer size, the retention strategy is still more influential and, interestingly, the TDE-based retention method does not seem to suffer as much here. The relative rankings of the retention strategies are similar to those without noise.

Figure 13: Performance of the experience selection methods with increased sampling frequencies, for (a) the swing-up benchmark at 100 Hz and (b) the magman benchmark at 200 Hz. Results are from 50 learning runs. A description of the performance measures is given in Section 5.1.

Table 6: Mean absolute magnitude of the noise per state-action dimension (position, velocity, action) in the mini-batches as a function of the experience selection procedure, for the Expl(1.0) and TDE(1.0) retention strategies with Uniform and PER sampling.

Figure 14: Performance of the experience selection methods with sensor and actuator noise (σ_s = σ_a = 0.02), for (a) the swing-up and (b) the magman benchmark. Results are from 50 learning runs. A description of the performance measures is given in Section 5.1.

8.4 Importance Sampling

Finally, we investigate the different importance sampling strategies that were discussed in Sections 3.3.2 and 7.2. We do this by using the FIFO, TDE and Resv retention strategies as representative examples. We consider the benchmarks with noise since, as we discussed in Section 3.3.2, the stochasticity in the environment can make importance sampling more relevant. The results are shown in Figure 15. We discuss per retention strategy how the sample distribution is changed and whether the change introduces a bias that should be compensated for through importance sampling.

FIFO: This retention method results in an unbiased sample distribution. When combined with uniform sampling, there is no reason to compensate for the selection method. Doing so anyway (FIFO[Uniform+FIS]) results in downscaling the updates from experiences that happen to have been sampled more often than expected, effectively reducing the batch size while not improving the distribution. The variance of the updates is therefore increased without reducing bias. This can be seen to hurt performance in Figure 15, especially on the swing-up task, where sample diversity is most important. Using PER also hurts performance in the noisy setting, as this sampling procedure does bias the sample distribution. Using importance sampling to compensate for just the sampling procedure (FIFO[PER+IS]) helps, but the resulting method is not clearly better than uniform sampling.

TDE: When the retention strategy is based on the temporal difference error, there is a reason to compensate for the bias in the sample distribution. It can be seen from Figure 15, however, that the full importance sampling scheme improves performance on the magman benchmark, but not on the swing-up task. The likely reason is again that importance sampling indiscriminately compensates both for the unwanted re-sampling of the environment dynamics and reward distributions and for the beneficial re-sampling of the state-action space distribution.
The detrimental effects of compensating for the latter seem to outweigh the beneficial effects of compensating for the former on this benchmark, where state-action space diversity has been shown to be so crucial.

Resv: The reservoir retention method is not biased with respect to the reward function or the environment dynamics. Although the resulting distribution is strongly off-policy (assuming the policy has changed during learning), this does not present a problem for a deterministic policy gradient algorithm with Q-learning updates, other than that it might be harder to learn a function that generalizes to a larger part of the state space. When sampling uniformly, we do sample certain experiences, from early in the learning process, far more often than would be expected under a FIFO[Uniform] selection strategy. The FIS method compensates for this by weighing these experiences down, effectively reducing the size of both the buffer and the mini-batches. In Figure 15, this can be seen to severely hurt the performance on the swing-up problem, as well as the learning stability on the magman benchmark.

Interesting to note is that on these two benchmarks, for all three considered retention strategies, using importance sampling to compensate for the changes introduced by PER only improved the performance significantly when using PER resulted in poorer performance than not using PER. Similarly, using FIS to compensate for the changes introduced in the buffer distribution only improved the performance when those changes should not have been introduced to begin with.

Figure 15: Performance of representative experience selection methods with and without importance sampling on the benchmarks with sensor and actuator noise (σ_s = σ_a = 0.02), for (a) the swing-up and (b) the magman benchmark. A description of the performance measures is given in Section 5.1.

8.5 Additional Benchmarks

The computational and conceptual simplicity of the two benchmarks used so far allowed for comprehensive tests and a good understanding of the characteristics of the benchmarks. However, we also saw that the right experience selection strategy is benchmark dependent. Furthermore, deep reinforcement learning yields most of its advantages over reinforcement learning with simpler function approximators on problems with higher dimensional state and action spaces. To obtain a more complete picture, we therefore perform additional tests on six benchmarks of varying complexity.

8.5.1 Benchmarks

In the interest of reproducibility, we use the open source RoboSchool (Klimov, 2017) benchmarks together with the OpenAI Baselines (Dhariwal et al., 2017) implementation of DDPG. We have adapted the baselines code to include the experience selection methods considered in this section; our adapted code is available online. The baselines version of DDPG uses Gaussian noise added to the parameters of the policy network for exploration (Plappert et al., 2018). In contrast to the other experiments in this work, the strength of the exploration is kept constant during the entire learning run. For the Expl method we still consider the 1-norm of the distance between the exploration policy action and the unperturbed policy action as the utility of the sample. For the benchmarks listed in Table 7, we compare the default FULL DB[Uniform] selection strategy in the baselines code to the alternative retention strategies considered in this work with uniform sampling. We show the maximum performance for these different retention strategies as a function of the buffer size in Figure 16.

8.5.2 Results

As shown in Figure 16, on these noise-free benchmarks with constant exploration and moderate sampling frequencies, the gains obtained by using the considered non-standard experience selection strategies are limited. However, in spite of the limited number of trials performed due to the computational complexity, trends do emerge on most of the benchmarks.

Table 7: The RoboSchool benchmarks considered in this section (InvDoublePnd, Reacher, Hopper, Walker2d, HalfCheetah, Ant) with the dimensionalities of their state and action spaces.
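The Expl utility used here is simply the 1-norm of the difference between the action actually executed under the perturbed policy and the unperturbed policy action for the same state. A minimal sketch (names are illustrative, not from the adapted baselines code):

```python
import numpy as np

def expl_utility(a_taken, a_policy):
    """Expl retention proxy: 1-norm distance between the exploratory action
    actually executed and the unperturbed policy action for the same state."""
    return float(np.sum(np.abs(np.asarray(a_taken) - np.asarray(a_policy))))
```

With parameter-space exploration, the perturbed and unperturbed networks are both evaluated on the visited state, and this scalar is stored alongside the transition so that low-exploration experiences can later be overwritten first.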


More information

Absorption Rate into a Small Sphere for a Diffusing Particle Confined in a Large Sphere

Absorption Rate into a Small Sphere for a Diffusing Particle Confined in a Large Sphere Applied Mathematics, 06, 7, 709-70 Published Online Apil 06 in SciRes. http://www.scip.og/jounal/am http://dx.doi.og/0.46/am.06.77065 Absoption Rate into a Small Sphee fo a Diffusing Paticle Confined in

More information

QIP Course 10: Quantum Factorization Algorithm (Part 3)

QIP Course 10: Quantum Factorization Algorithm (Part 3) QIP Couse 10: Quantum Factoization Algoithm (Pat 3 Ryutaoh Matsumoto Nagoya Univesity, Japan Send you comments to yutaoh.matsumoto@nagoya-u.jp Septembe 2018 @ Tokyo Tech. Matsumoto (Nagoya U. QIP Couse

More information

Graphs of Sine and Cosine Functions

Graphs of Sine and Cosine Functions Gaphs of Sine and Cosine Functions In pevious sections, we defined the tigonometic o cicula functions in tems of the movement of a point aound the cicumfeence of a unit cicle, o the angle fomed by the

More information

Experiment I Voltage Variation and Control

Experiment I Voltage Variation and Control ELE303 Electicity Netwoks Expeiment I oltage aiation and ontol Objective To demonstate that the voltage diffeence between the sending end of a tansmission line and the load o eceiving end depends mainly

More information

Nuclear and Particle Physics - Lecture 20 The shell model

Nuclear and Particle Physics - Lecture 20 The shell model 1 Intoduction Nuclea and Paticle Physics - Lectue 0 The shell model It is appaent that the semi-empiical mass fomula does a good job of descibing tends but not the non-smooth behaviou of the binding enegy.

More information

ac p Answers to questions for The New Introduction to Geographical Economics, 2 nd edition Chapter 3 The core model of geographical economics

ac p Answers to questions for The New Introduction to Geographical Economics, 2 nd edition Chapter 3 The core model of geographical economics Answes to questions fo The New ntoduction to Geogaphical Economics, nd edition Chapte 3 The coe model of geogaphical economics Question 3. Fom intoductoy mico-economics we know that the condition fo pofit

More information

1 Explicit Explore or Exploit (E 3 ) Algorithm

1 Explicit Explore or Exploit (E 3 ) Algorithm 2.997 Decision-Making in Lage-Scale Systems Mach 3 MIT, Sping 2004 Handout #2 Lectue Note 9 Explicit Exploe o Exploit (E 3 ) Algoithm Last lectue, we studied the Q-leaning algoithm: [ ] Q t+ (x t, a t

More information

Nuclear Medicine Physics 02 Oct. 2007

Nuclear Medicine Physics 02 Oct. 2007 Nuclea Medicine Physics Oct. 7 Counting Statistics and Eo Popagation Nuclea Medicine Physics Lectues Imaging Reseach Laboatoy, Radiology Dept. Lay MacDonald 1//7 Statistics (Summaized in One Slide) Type

More information

Identification of the degradation of railway ballast under a concrete sleeper

Identification of the degradation of railway ballast under a concrete sleeper Identification of the degadation of ailway ballast unde a concete sleepe Qin Hu 1) and Heung Fai Lam ) 1), ) Depatment of Civil and Achitectual Engineeing, City Univesity of Hong Kong, Hong Kong SAR, China.

More information

Scattering in Three Dimensions

Scattering in Three Dimensions Scatteing in Thee Dimensions Scatteing expeiments ae an impotant souce of infomation about quantum systems, anging in enegy fom vey low enegy chemical eactions to the highest possible enegies at the LHC.

More information

Alternative Tests for the Poisson Distribution

Alternative Tests for the Poisson Distribution Chiang Mai J Sci 015; 4() : 774-78 http://epgsciencecmuacth/ejounal/ Contibuted Pape Altenative Tests fo the Poisson Distibution Manad Khamkong*[a] and Pachitjianut Siipanich [b] [a] Depatment of Statistics,

More information

A Multivariate Normal Law for Turing s Formulae

A Multivariate Normal Law for Turing s Formulae A Multivaiate Nomal Law fo Tuing s Fomulae Zhiyi Zhang Depatment of Mathematics and Statistics Univesity of Noth Caolina at Chalotte Chalotte, NC 28223 Abstact This pape establishes a sufficient condition

More information

A Bijective Approach to the Permutational Power of a Priority Queue

A Bijective Approach to the Permutational Power of a Priority Queue A Bijective Appoach to the Pemutational Powe of a Pioity Queue Ia M. Gessel Kuang-Yeh Wang Depatment of Mathematics Bandeis Univesity Waltham, MA 02254-9110 Abstact A pioity queue tansfoms an input pemutation

More information

( ) [ ] [ ] [ ] δf φ = F φ+δφ F. xdx.

( ) [ ] [ ] [ ] δf φ = F φ+δφ F. xdx. 9. LAGRANGIAN OF THE ELECTROMAGNETIC FIELD In the pevious section the Lagangian and Hamiltonian of an ensemble of point paticles was developed. This appoach is based on a qt. This discete fomulation can

More information

you of a spring. The potential energy for a spring is given by the parabola U( x)

you of a spring. The potential energy for a spring is given by the parabola U( x) Small oscillations The theoy of small oscillations is an extemely impotant topic in mechanics. Conside a system that has a potential enegy diagam as below: U B C A x Thee ae thee points of stable equilibium,

More information

Information Retrieval Advanced IR models. Luca Bondi

Information Retrieval Advanced IR models. Luca Bondi Advanced IR models Luca Bondi Advanced IR models 2 (LSI) Pobabilistic Latent Semantic Analysis (plsa) Vecto Space Model 3 Stating point: Vecto Space Model Documents and queies epesented as vectos in the

More information

Bayesian Analysis of Topp-Leone Distribution under Different Loss Functions and Different Priors

Bayesian Analysis of Topp-Leone Distribution under Different Loss Functions and Different Priors J. tat. Appl. Po. Lett. 3, No. 3, 9-8 (6) 9 http://dx.doi.og/.8576/jsapl/33 Bayesian Analysis of Topp-Leone Distibution unde Diffeent Loss Functions and Diffeent Pios Hummaa ultan * and. P. Ahmad Depatment

More information

Chapter 7-8 Rotational Motion

Chapter 7-8 Rotational Motion Chapte 7-8 Rotational Motion What is a Rigid Body? Rotational Kinematics Angula Velocity ω and Acceleation α Unifom Rotational Motion: Kinematics Unifom Cicula Motion: Kinematics and Dynamics The Toque,

More information

THE INFLUENCE OF THE MAGNETIC NON-LINEARITY ON THE MAGNETOSTATIC SHIELDS DESIGN

THE INFLUENCE OF THE MAGNETIC NON-LINEARITY ON THE MAGNETOSTATIC SHIELDS DESIGN THE INFLUENCE OF THE MAGNETIC NON-LINEARITY ON THE MAGNETOSTATIC SHIELDS DESIGN LIVIU NEAMŢ 1, ALINA NEAMŢ, MIRCEA HORGOŞ 1 Key wods: Magnetostatic shields, Magnetic non-lineaity, Finite element method.

More information

A New Approach to General Relativity

A New Approach to General Relativity Apeion, Vol. 14, No. 3, July 7 7 A New Appoach to Geneal Relativity Ali Rıza Şahin Gaziosmanpaşa, Istanbul Tukey E-mail: aizasahin@gmail.com Hee we pesent a new point of view fo geneal elativity and/o

More information

Absolute Specifications: A typical absolute specification of a lowpass filter is shown in figure 1 where:

Absolute Specifications: A typical absolute specification of a lowpass filter is shown in figure 1 where: FIR FILTER DESIGN The design of an digital filte is caied out in thee steps: ) Specification: Befoe we can design a filte we must have some specifications. These ae detemined by the application. ) Appoximations

More information

16.1 Permanent magnets

16.1 Permanent magnets Unit 16 Magnetism 161 Pemanent magnets 16 The magnetic foce on moving chage 163 The motion of chaged paticles in a magnetic field 164 The magnetic foce exeted on a cuent-caying wie 165 Cuent loops and

More information

Duality between Statical and Kinematical Engineering Systems

Duality between Statical and Kinematical Engineering Systems Pape 00, Civil-Comp Ltd., Stiling, Scotland Poceedings of the Sixth Intenational Confeence on Computational Stuctues Technology, B.H.V. Topping and Z. Bittna (Editos), Civil-Comp Pess, Stiling, Scotland.

More information

Classical Worm algorithms (WA)

Classical Worm algorithms (WA) Classical Wom algoithms (WA) WA was oiginally intoduced fo quantum statistical models by Pokof ev, Svistunov and Tupitsyn (997), and late genealized to classical models by Pokof ev and Svistunov (200).

More information

arxiv: v2 [astro-ph] 16 May 2008

arxiv: v2 [astro-ph] 16 May 2008 New Anomalies in Cosmic Micowave Backgound Anisotopy: Violation of the Isotopic Gaussian Hypothesis in Low-l Modes Shi Chun, Su and M.-C., Chu Depatment of Physics and Institute of Theoetical Physics,

More information

Computers and Mathematics with Applications

Computers and Mathematics with Applications Computes and Mathematics with Applications 58 (009) 9 7 Contents lists available at ScienceDiect Computes and Mathematics with Applications jounal homepage: www.elsevie.com/locate/camwa Bi-citeia single

More information

Physics 221 Lecture 41 Nonlinear Absorption and Refraction

Physics 221 Lecture 41 Nonlinear Absorption and Refraction Physics 221 Lectue 41 Nonlinea Absoption and Refaction Refeences Meye-Aendt, pp. 97-98. Boyd, Nonlinea Optics, 1.4 Yaiv, Optical Waves in Cystals, p. 22 (Table of cystal symmeties) 1. Intoductoy Remaks.

More information

State tracking control for Takagi-Sugeno models

State tracking control for Takagi-Sugeno models State tacing contol fo Taagi-Sugeno models Souad Bezzaoucha, Benoît Max,3,DidieMaquin,3 and José Ragot,3 Abstact This wo addesses the model efeence tacing contol poblem It aims to highlight the encouteed

More information

Electrostatics (Electric Charges and Field) #2 2010

Electrostatics (Electric Charges and Field) #2 2010 Electic Field: The concept of electic field explains the action at a distance foce between two chaged paticles. Evey chage poduces a field aound it so that any othe chaged paticle expeiences a foce when

More information

Analysis of high speed machining center spindle dynamic unit structure performance Yuan guowei

Analysis of high speed machining center spindle dynamic unit structure performance Yuan guowei Intenational Confeence on Intelligent Systems Reseach and Mechatonics Engineeing (ISRME 0) Analysis of high speed machining cente spindle dynamic unit stuctue pefomance Yuan guowei Liaoning jidian polytechnic,dan

More information

Prediction of Motion Trajectories Based on Markov Chains

Prediction of Motion Trajectories Based on Markov Chains 2011 Intenational Confeence on Compute Science and Infomation Technology (ICCSIT 2011) IPCSIT vol. 51 (2012) (2012) IACSIT Pess, Singapoe DOI: 10.7763/IPCSIT.2012.V51.50 Pediction of Motion Tajectoies

More information

Appendix B The Relativistic Transformation of Forces

Appendix B The Relativistic Transformation of Forces Appendix B The Relativistic Tansfomation of oces B. The ou-foce We intoduced the idea of foces in Chapte 3 whee we saw that the change in the fou-momentum pe unit time is given by the expession d d w x

More information

Chapter 13 Gravitation

Chapter 13 Gravitation Chapte 13 Gavitation In this chapte we will exploe the following topics: -Newton s law of gavitation, which descibes the attactive foce between two point masses and its application to extended objects

More information

Physics 2020, Spring 2005 Lab 5 page 1 of 8. Lab 5. Magnetism

Physics 2020, Spring 2005 Lab 5 page 1 of 8. Lab 5. Magnetism Physics 2020, Sping 2005 Lab 5 page 1 of 8 Lab 5. Magnetism PART I: INTRODUCTION TO MAGNETS This week we will begin wok with magnets and the foces that they poduce. By now you ae an expet on setting up

More information

A Machine Learned Model of a Hybrid Aircraft

A Machine Learned Model of a Hybrid Aircraft 1 A Machine Leaned Model of a Hybid Aicaft Bandon Jones, Kevin Jenkins CS229 Machine Leaning, Fall 2016, Stanfod Univesity I. INTRODUCTION Aicaft development pogams ely on aicaft dynamic models fo flight

More information

Obtaining the size distribution of fault gouges with polydisperse bearings

Obtaining the size distribution of fault gouges with polydisperse bearings Obtaining the size distibution of fault gouges with polydispese beaings Pedo G. Lind, Reza M. Baam, 2 and Hans J. Hemann 2, 3 Institute fo Computational Physics, Univesität Stuttgat, Pfaffenwalding 27,

More information

Goodness-of-fit for composite hypotheses.

Goodness-of-fit for composite hypotheses. Section 11 Goodness-of-fit fo composite hypotheses. Example. Let us conside a Matlab example. Let us geneate 50 obsevations fom N(1, 2): X=nomnd(1,2,50,1); Then, unning a chi-squaed goodness-of-fit test

More information

Topic 5. Mean separation: Multiple comparisons [ST&D Ch.8, except 8.3]

Topic 5. Mean separation: Multiple comparisons [ST&D Ch.8, except 8.3] 5.1 Topic 5. Mean sepaation: Multiple compaisons [ST&D Ch.8, except 8.3] 5. 1. Basic concepts In the analysis of vaiance, the null hypothesis that is tested is always that all means ae equal. If the F

More information

Directed Regression. Benjamin Van Roy Stanford University Stanford, CA Abstract

Directed Regression. Benjamin Van Roy Stanford University Stanford, CA Abstract Diected Regession Yi-hao Kao Stanfod Univesity Stanfod, CA 94305 yihaoao@stanfod.edu Benjamin Van Roy Stanfod Univesity Stanfod, CA 94305 bv@stanfod.edu Xiang Yan Stanfod Univesity Stanfod, CA 94305 xyan@stanfod.edu

More information

Estimation of the Correlation Coefficient for a Bivariate Normal Distribution with Missing Data

Estimation of the Correlation Coefficient for a Bivariate Normal Distribution with Missing Data Kasetsat J. (Nat. Sci. 45 : 736-74 ( Estimation of the Coelation Coefficient fo a Bivaiate Nomal Distibution with Missing Data Juthaphon Sinsomboonthong* ABSTRACT This study poposes an estimato of the

More information

Analytical time-optimal trajectories for an omni-directional vehicle

Analytical time-optimal trajectories for an omni-directional vehicle Analytical time-optimal tajectoies fo an omni-diectional vehicle Weifu Wang and Devin J. Balkcom Abstact We pesent the fist analytical solution method fo finding a time-optimal tajectoy between any given

More information

Energy Savings Achievable in Connection Preserving Energy Saving Algorithms

Energy Savings Achievable in Connection Preserving Energy Saving Algorithms Enegy Savings Achievable in Connection Peseving Enegy Saving Algoithms Seh Chun Ng School of Electical and Infomation Engineeing Univesity of Sydney National ICT Austalia Limited Sydney, Austalia Email:

More information

AQI: Advanced Quantum Information Lecture 2 (Module 4): Order finding and factoring algorithms February 20, 2013

AQI: Advanced Quantum Information Lecture 2 (Module 4): Order finding and factoring algorithms February 20, 2013 AQI: Advanced Quantum Infomation Lectue 2 (Module 4): Ode finding and factoing algoithms Febuay 20, 203 Lectue: D. Mak Tame (email: m.tame@impeial.ac.uk) Intoduction In the last lectue we looked at the

More information

2 Governing Equations

2 Governing Equations 2 Govening Equations This chapte develops the govening equations of motion fo a homogeneous isotopic elastic solid, using the linea thee-dimensional theoy of elasticity in cylindical coodinates. At fist,

More information

TESTING THE VALIDITY OF THE EXPONENTIAL MODEL BASED ON TYPE II CENSORED DATA USING TRANSFORMED SAMPLE DATA

TESTING THE VALIDITY OF THE EXPONENTIAL MODEL BASED ON TYPE II CENSORED DATA USING TRANSFORMED SAMPLE DATA STATISTICA, anno LXXVI, n. 3, 2016 TESTING THE VALIDITY OF THE EXPONENTIAL MODEL BASED ON TYPE II CENSORED DATA USING TRANSFORMED SAMPLE DATA Hadi Alizadeh Noughabi 1 Depatment of Statistics, Univesity

More information