Reinforcement Learning using Continuous Actions

Hado van Hasselt

2005

Concluding thesis for Cognitive Artificial Intelligence, University of Utrecht

First supervisor: Dr. Marco A. Wiering, University of Utrecht
Second supervisor: Dr. Lev D. Beklemishev, University of Utrecht
Third supervisor: Dr. Vincent van Oostrom, University of Utrecht

Contents

1 Introduction
2 Reinforcement Learning
  2.1 General Framework
  2.2 Values and Q-Functions
  2.3 Learning the Q-Function
3 Reinforcement Learning in Continuous Spaces
  3.1 Continuous State Space
  3.2 Continuous Action Space
  3.3 Exploration
  3.4 The Algorithms
    3.4.1 Wire Fitting
    3.4.2 Gradient Ascent on the Value
    3.4.3 Interpolating Actors
    3.4.4 Critic Actor Improvement
4 Experiments
  4.1 Experiment 1: Tracking
    4.1.1 Different Settings
  4.2 Experiment 2: Tracking with an Obstacle
    4.2.1 Different Settings
  4.3 Experiment 3: Cart Pole
    4.3.1 Different Settings
  4.4 Implementation Specifics
5 Results
  5.1 Results on the Tracking Experiments
    5.1.1 Final Results
    5.1.2 Speed of Learning
  5.2 Results on the Cart Pole Experiments
    5.2.1 Final Results
    5.2.2 Speed of Learning
  5.3 Overview of Results
6 Conclusion
  6.1 Summary
  6.2 Further Research
A Convergence of Improve when Improving
B Storing the Variance

Chapter 1
Introduction

There are many real world settings that require an agent to behave in a certain way to perform some task. As an example, consider a traffic light as an agent whose behaviour performs the task of regulating traffic. In some cases, the required behaviour is clear and we can simply describe what the actor should do in all situations. In other cases, the solution for a task might not be so obvious. Reinforcement Learning supplies us with a framework in which we can allow an agent to learn to solve a task by giving reinforcement signals to the agent. For these signals, we do not have to know the correct behaviour beforehand; instead we can tell the agent how good we think it is doing and it will learn from this. In short, Reinforcement Learning tells an agent what to do, not how to do it. This, and the fact that the agent interacts with its environment during learning, contrasts Reinforcement Learning with supervised learning algorithms, since we do not need examples of good behaviour in order to achieve it. This allows for the possibility of an agent finding ways to solve a task that we might not have thought of in advance. Although Reinforcement Learning algorithms cannot be classified as supervised learning algorithms, they are not unsupervised either. We do need some kind of reinforcement signal from the environment that tells the agent how it is doing. However, for a lot of problems these signals are much easier to find than actual examples of correct behaviour. Reinforcement Learning has proven itself useful in finding solutions for problems in various fields. Some examples are game playing, such as an agent capable of playing backgammon [21], and robotics, such as an elevator dispatching system [7] and robots playing soccer [24]. Basically, Reinforcement Learning can be applied to any problem that involves actions and reinforcements

in some environment. In this thesis, we are looking at problems that involve continuous spaces. A fair amount of research has been done on Reinforcement Learning in continuous environments, but the research on problems where the actions can also be chosen from a continuous space is more limited. We will describe some important algorithms that have already been developed for these problems, as well as two new algorithms. Then we will compare these algorithms experimentally to find the advantages and disadvantages of each of them. Of course, it can very well be that one algorithm is more suited for some tasks, while another performs better on other tasks. We will try to reason why the algorithms find certain solutions rather than others.

CKI

This thesis was written as a conclusion to the Cognitieve Kunstmatige Intelligentie (CKI, Cognitive Artificial Intelligence) programme at the University of Utrecht. In CKI, philosophical, psychological and computational views on artificial intelligence are discussed. The subject of this thesis falls under the computational view by discussing ways to make artificial agents learn by experience. There are some similarities between the computational and the biological Reinforcement Learning frameworks, although they differ in some important ways. There is also some evidence that similar processes are active in the human brain. None of these similarities will be discussed further in this thesis, as we will only concentrate on the computational version of Reinforcement Learning.

Outline

In chapter 2, a relatively short introduction to classic Reinforcement Learning will be given, as well as some of the notation used in this thesis. In chapter 3, we will extend the framework to continuous spaces. We will also describe the algorithms we used and give some theoretical argumentation on their workings. Then, in chapter 4 we will describe the experiments we have performed with the algorithms, after which the experimental results are presented in chapter 5. Finally, a conclusion is given in chapter 6.

Chapter 2
Reinforcement Learning

In problems that can be handled using Reinforcement Learning, there is always an agent that interacts with the environment. The goal is to optimise the behaviour of the agent in terms of some reinforcement signal. The reinforcement signal is usually referred to as reward, even if it is in fact a punishing, negative signal. This reward is provided by the environment. The actions of the agent can also affect the environment, complicating the search for the optimal behaviour. For a detailed introduction to the field of Reinforcement Learning, see the book by Sutton and Barto [20]. Below follows a short introduction.

2.1 General Framework

When an agent performs an action, the environment returns a new state and a reward. The transitions between states may be stochastic, meaning that it is possible that a certain action in a certain state does not always result in reaching the same state or receiving the same reward. Usually it is assumed that the probabilities of the state transitions stay the same, although when the environment is dynamic - for instance when multiple interacting agents are learning at the same time - this may not be the case. The whole can be seen as a dynamical system, where the goal is to find the sequence of inputs to the system - the actions - so that the cumulative reward is optimised. The difficulty lies in the fact that the dynamics of the system are usually unknown to the agent. Reinforcement Learning has the advantage over the analytical problem solving algorithms supplied by standard control theory that it can solve problems without a model of the environment, simply by interacting with it. Also, in a lot of

real cases, even if a model is present, the complexity of finding an analytical solution may be very high. In these cases Reinforcement Learning can be used to find an optimal or near optimal solution. A Reinforcement Learning problem can be seen as a tuple $(S, A, R, T)$ where:

- $s_t \in S$ is the state the agent is in at time $t$.
- $a_t \in A$ is the action the agent takes at time $t$.
- $R : S \times A \times S \to \mathbb{R}$ is the reward function that maps a state $s_t$, an action $a_t$ and a next state $s_{t+1}$ into a reward $R(s_t, a_t, s_{t+1})$. This reward is known to the agent when reaching the state $s_{t+1}$. We will use the shorthand notation $r_t$ for $R(s_t, a_t, s_{t+1})$.
- $T : S \times A \times S \to [0, 1]$ is the transition function, where $T(s, a, s')$ gives the probability of arriving in state $s'$ when taking action $a$ in state $s$.

Note that since $t$ enumerates time and not states or actions, it can hold that $s_t = s_{t'}$ or $a_t = a_{t'}$ while $t \neq t'$. Also note that it is possible that the reward is not directly dependent on the action, so that it is in fact a function $S \times S \to \mathbb{R}$ or even $S \to \mathbb{R}$. In these cases the reward will still be indirectly affected by the action, because the transitions between states depend on the actions. The goal of the agent can be formulated as the task of learning an action selection policy $\pi : S \to A$ mapping states to actions, maximising the cumulative reward. Note that in some cases it may be better to have a stochastic policy, meaning that it might be good to allow the possibility of different actions being selected in a certain state. Then a policy would be a mapping of state-action pairs to probabilities: $\pi : S \times A \to [0, 1]$. We will use this second, more general definition of policies. When we are not using actual stochastic policies, the probability will be 1 for exactly one action in each state and 0 for all other actions. All algorithms described below can be extended to problems with truly stochastic policies. For instance, each algorithm can be made to output a probability distribution over actions instead of a single action, when given a state.

2.2 Values and Q-Functions

In the field of Reinforcement Learning the agent learns by storing values for each state or for each state-action pair. State values represent the cumulative

reward that the agent expects to receive in the future when reaching that state. State-action values represent the cumulative reward it expects to receive when it performs that specific action in that state. Formally, when at some state $s_t$, we want the agent to optimise the total return:

$$r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

Here $0 \le \gamma \le 1$ is a discount factor. When it is set lower than 1, it can be used to increase the importance of quick rewards compared to more distant ones. It also ensures that the above sum is finite, even if the sequence is infinite. In settings where some end state is always reached within finite time, $\gamma = 1$ might be used, though care should be taken that the value of the sum does not become too large. Also, when for instance a maze problem is considered, a seemingly logical reward would be 1 when exiting the maze and 0 on all other time steps. When this is the case, any policy that will eventually lead to exiting the maze will result in a return of 1 when $\gamma = 1$. This allows no distinction between policies, resulting in the possibility that it takes a very long time before the exit is found. When $\gamma < 1$ there is a difference in values, since actions that allow a faster escape will be valued higher due to the discount factor. Another option would be to give a reward of $-1$ on each time step that the agent is still in the maze. In all problems, care should be taken that the reward that the agent receives is practical and meaningful for the task we want it to fulfil, since this is what will be optimised. This also means that care should be taken regarding what kind of information about how to perform a certain task is included in the reward function. For instance, Touzet [22] performed an experiment with a robot whose task was to drive around in an environment with obstacles. He used a reward function that gave positive rewards when performing the act of avoidance and negative rewards when bumping into something. This means that in effect, the agent is encouraged to find objects it can avoid. This is exactly what the robot did, finding a corner with an obstacle nearby and moving around there. Probably, using another reward function, such as positive rewards corresponding to the speed of the robot and negative rewards for bumping into things, more natural behaviour would have occurred. The value of a state $s$ is denoted $V(s)$. The value of a state-action pair $(s, a)$ is denoted $Q(s, a)$. Let $Q^\pi$ and $V^\pi$ denote the Q-function and the value function corresponding to some policy $\pi$. Then, by definition:

$$V^\pi(s) = E_\pi\left\{ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \,\Big|\, s_t = s \right\}$$
$$= E_\pi\left\{ r_t + \gamma \sum_{i=0}^{\infty} \gamma^i r_{t+i+1} \,\Big|\, s_t = s \right\}$$
$$= \sum_{a' \in A} \pi(s, a') \sum_{s' \in S} T(s, a', s') \left( R(s, a', s') + \gamma E_\pi\left\{ \sum_{i=0}^{\infty} \gamma^i r_{t+i+1} \,\Big|\, s_{t+1} = s' \right\} \right)$$
$$= \sum_{a' \in A} \pi(s, a') \sum_{s' \in S} T(s, a', s') \left( R(s, a', s') + \gamma V^\pi(s') \right)$$

Here $E$ is the expectation operator. Likewise, we can view the Q-function in terms of the value and, more importantly, in terms of itself:

$$Q^\pi(s, a) = E_\pi\left\{ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \,\Big|\, s_t = s, a_t = a \right\}$$
$$= \sum_{s' \in S} T(s, a, s') \left( R(s, a, s') + \gamma V^\pi(s') \right)$$
$$= \sum_{s' \in S} T(s, a, s') \left( R(s, a, s') + \gamma \sum_{a' \in A} \pi(s', a') Q^\pi(s', a') \right)$$

This definition states that the Q value is the expected immediate reward plus the expected future (discounted) rewards when following policy $\pi$. A policy is defined to be better than another policy when its values are higher for each state or state-action pair. We denote the optimal policy by $\pi^*$ and its corresponding state and state-action values by $V^*$ and $Q^*$. There is always at least one optimal policy. We then have:

$$V^*(s) = \max_\pi V^\pi(s)$$
$$\pi^* = \arg\max_\pi V^\pi(s) \quad \forall s \in S$$
$$V^*(s) = \max_a Q^*(s, a)$$
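To make these definitions concrete, the following is a minimal sketch (not from the thesis) of iterative policy evaluation on a tiny, made-up MDP: the Bellman equation for $V^\pi$ is applied repeatedly until the values stop changing. The three-state MDP, the uniform random policy and all constants are illustrative assumptions.

import numpy as np

# A tiny illustrative MDP (not from the thesis): 3 states, 2 actions.
# T[s, a, s'] is the transition probability, R[s, a, s'] the reward.
n_states, n_actions = 3, 2
T = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))
T[0, 0] = [0.8, 0.2, 0.0]; T[0, 1] = [0.0, 0.9, 0.1]
T[1, 0] = [0.1, 0.8, 0.1]; T[1, 1] = [0.0, 0.2, 0.8]
T[2, 0] = [0.0, 0.0, 1.0]; T[2, 1] = [0.0, 0.0, 1.0]
R[:, :, 2] = 1.0                              # arriving in state 2 gives reward 1

pi = np.full((n_states, n_actions), 0.5)      # uniform random policy pi(s, a)
gamma = 0.9

# Iterative policy evaluation: apply the Bellman equation for V^pi until convergence.
V = np.zeros(n_states)
for _ in range(1000):
    V_new = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            V_new[s] += pi[s, a] * np.sum(T[s, a] * (R[s, a] + gamma * V))
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

print("V^pi =", V)

The iteration converges to the fixed point of the Bellman equation above; the sample-based updates in the next section approach such values without access to $T$ and $R$.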

When the optimal Q-function is known or learnt, the agent can simply select the action with the highest Q value in a given state to follow the optimal policy.

2.3 Learning the Q-Function

Of course, the Q-function is not known from the beginning. This means that in general it will not hold that $\pi^*(s) = \arg\max_a Q(s, a)$. To resolve this, actions are tried and the rewards are noted and used to update the Q value corresponding to the state and action, to make it closer to the actual value of the state-action pair. We know that the Q-function corresponding to the optimal policy has the following property:

$$Q^*(s_t, a_t) = \sum_{s' \in S} T(s_t, a_t, s') \left( R(s_t, a_t, s') + \gamma \max_{a'} Q^*(s', a') \right) \quad (2.1)$$

which is called the Bellman optimality equation for $Q^*$ [3, 20]. These values can then be updated via for instance Sarsa [15, 19]:

$$Q_{t+1}(s_t, a_t) = (1 - \alpha) Q_t(s_t, a_t) + \alpha \left( r_t + \gamma Q_t(s_{t+1}, a_{t+1}) \right) \quad (2.2)$$

or Q-learning [23]:

$$Q_{t+1}(s_t, a_t) = (1 - \alpha) Q_t(s_t, a_t) + \alpha \left( r_t + \gamma \max_a Q_t(s_{t+1}, a) \right) \quad (2.3)$$

where $0 \le \alpha \le 1$ is a learning rate. For now we consider the learning rate to be fixed, but sometimes it is taken to be a function of time, states and/or actions and then it is usually denoted as $\alpha_t$. As can be seen, the Q-learning update rule is very similar to the definition of the optimal Q-function $Q^*$. Essentially, the idea is that if this update is performed an infinite number of times on all state-action pairs, eventually equation (2.1) will hold and the optimal policy will be found. There are convergence proofs for Q-learning and Sarsa under the assumption that every state-action pair is experienced an infinite number of times. Because Q-learning can learn about the optimal policy without actually following it, it is called off-policy. Sarsa is an on-policy algorithm. This means that for Sarsa to converge to the optimal policy, exploration must decay during learning. When we are not choosing actions at random, but instead follow the policy that corresponds to the present values, this is called exploitation. More information on the methods we used for exploration is included in the chapter on the experiments.
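As an illustration, here is a minimal sketch (not from the thesis) of the tabular Sarsa update (2.2) and Q-learning update (2.3); the table size, learning rate and discount factor are arbitrary assumptions.

import numpy as np

n_states, n_actions = 10, 4     # illustrative sizes
alpha, gamma = 0.1, 0.95        # assumed learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def sarsa_update(s, a, r, s_next, a_next):
    # Sarsa (2.2), on-policy: bootstrap on the action that will actually be taken.
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next, a_next])

def q_learning_update(s, a, r, s_next):
    # Q-learning (2.3), off-policy: bootstrap on the greedy action in the next state.
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))

The only difference between the two updates is the bootstrap term, which is exactly what makes Q-learning off-policy and Sarsa on-policy.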

Of course, it is also possible to learn the values of states instead of the values of state-action pairs. The update, known as Temporal Difference learning [18], then becomes:

$$V_{t+1}(s_t) = (1 - \alpha) V_t(s_t) + \alpha \left( r_t + \gamma V_t(s_{t+1}) \right) \quad (2.4)$$

It should be noted that this equation learns the values of states given a certain policy and does not necessarily learn what the values of the states would be when following the optimal policy. It has been proven that when these values are stored in a table, using this update will allow the values to converge to the actual expected returns [18, 8].
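The corresponding tabular sketch for the Temporal Difference update (2.4), again with assumed sizes and constants:

import numpy as np

n_states = 10
alpha, gamma = 0.1, 0.95
V = np.zeros(n_states)

def td_update(s, r, s_next):
    # TD learning (2.4): estimates the values of states under the policy being followed.
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])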

Chapter 3
Reinforcement Learning in Continuous Spaces

The algorithms presented in the previous chapter assume finite and discrete state and action spaces. There are of course settings in which one or both spaces are continuous. This presents problems for the conventional algorithms, especially when considering the storage of and access to the Q-values. Therefore, extensions to these algorithms are presented that allow Reinforcement Learning in continuous state and action spaces. First a short introduction to handling continuous state spaces will be given. This is a relatively well-studied field, where a lot of work has already been done. Then we will continue with the harder problem of continuous action spaces. Here less work has been done, and this is the subject on which we shall concentrate. Methods that can only handle continuous action spaces are not discussed, since in this thesis we are only interested in algorithms that can handle continuity in both spaces.

3.1 Continuous State Space

When the state space becomes large or even continuous and therefore infinite, parametrised function approximators can be used to store observed state-action pairs and generalise to unseen states. For instance, a neural network can be used with the weights of the network as parameters. (The workings of neural networks will not be covered in this thesis; there are a lot of introductory texts available on the subject, see for instance the book by Bishop [5].) The update is then performed on the parameters of this function approximator. Let $\theta^Q$ denote these parameters. The update rule corresponding to

Q-learning is derived from (2.3) and then becomes:

$$\theta^Q_i = \theta^Q_i + \alpha \left( r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right) \frac{\partial Q_t(s_t, a_t)}{\partial \theta^Q_i} \quad (3.1)$$

Here $\theta^Q_i$ is the $i$th component of the parameter vector $\theta^Q$ and $Q_t(s, a)$ is the output of the function approximator at time $t$ when given state $s$ and action $a$ as inputs. The update above can also be seen as error backpropagation. For this view, consider $r_t + \gamma \max_a Q_t(s_{t+1}, a)$ as target output $T_t$ and $Q_t(s_t, a_t)$ as present output $Y_t$, and use the squared difference as error function: $E(t) = \frac{1}{2}(T_t - Y_t)^2$. The above update rule then comes down to gradient descent on the error:

$$\theta^Q_i = \theta^Q_i - \alpha \frac{\partial E(t)}{\partial \theta^Q_i} \quad (3.2)$$

Similarly, the update rules corresponding to Sarsa and TD learning are:

$$\theta^Q_i = \theta^Q_i + \alpha \left( r_t + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \right) \frac{\partial Q_t(s_t, a_t)}{\partial \theta^Q_i} \quad (3.3)$$

$$\theta^V_i = \theta^V_i + \alpha \left( r_t + \gamma V_t(s_{t+1}) - V_t(s_t) \right) \frac{\partial V_t(s_t)}{\partial \theta^V_i} \quad (3.4)$$

These methods have been extensively studied. See for instance the book by Bertsekas and Tsitsiklis [4].
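A minimal sketch of the semi-gradient Q-learning update (3.1) with a linear function approximator, where the gradient $\partial Q_t / \partial \theta^Q_i$ is simply the $i$th feature; the block feature map and all constants are illustrative assumptions, and for a neural network the same gradient would come from backpropagation instead.

import numpy as np

state_dim, n_actions = 4, 3                  # illustrative sizes
alpha, gamma = 0.01, 0.95
n_features = (state_dim + 1) * n_actions
theta = np.zeros(n_features)                 # the parameter vector theta^Q

def features(s, a):
    # Assumed feature map: the state (plus a bias) copied into the block of action a.
    phi = np.zeros(n_features)
    phi[a * (state_dim + 1):(a + 1) * (state_dim + 1)] = np.append(s, 1.0)
    return phi

def q_value(s, a):
    return float(theta @ features(s, a))     # Q_t(s, a) = theta . phi(s, a)

def q_learning_fa_update(s, a, r, s_next):
    # Update (3.1): for a linear approximator, dQ/dtheta is just the feature vector.
    global theta
    best_next = max(q_value(s_next, b) for b in range(n_actions))
    td_error = r + gamma * best_next - q_value(s, a)
    theta = theta + alpha * td_error * features(s, a)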

3.2 Continuous Action Space

A harder problem is to extend RL to continuous action spaces. If we learn a good approximation of the Q-function using some form of Q-learning or Sarsa, we still have the problem that we cannot trivially find the action that gives the highest value, given a state. So, we would like the algorithm to output an approximation of the best action, given a certain state. For this, again a function approximator could be used. The question then becomes how to improve this approximation. If the optimal action for all states were known, it could be used as the target action. Of course, the problem is that this optimal action is not known. However, this does not mean that we cannot handle continuous action spaces. Below we will present some algorithms for handling continuous state and action spaces, but first we will describe some desired criteria for these algorithms. Please note that we do not consider time to be continuous as well, though all algorithms can be extended towards continuous time.

Real Continuous Solutions

A good algorithm for continuous spaces should in principle be able to find a policy arbitrarily close to the optimal policy for a given problem. This immediately excludes algorithms where the action space is discretised, since the probability is approximately 0 that the precise optimal action for all states is part of the finite set of discretised actions. These algorithms might perform well in many settings, but we want to put the bar high. Examples of algorithms that discretise the action space or the combined state-action space include Cerebellar Model Articulation Controllers (CMACs) [16] and variable resolution discretisation [12].

Good Generalisation

An algorithm that stores knowledge about continuous - and therefore infinite - spaces in a finite memory can of course hardly ever store all information about the entire space. (The only exception being when it is possible to map the Q-function perfectly with a function with few parameters, which in real-life settings will virtually never be the case.) Also, we would like an algorithm to reach good solutions relatively fast, and thus do not want the entire space to be searched. Both these restrictions require a good generalisation property of the algorithm. We would like to see good generalisation in the state space, meaning that a certain action in two similar states usually receives two similar predicted values. We would also like to see good generalisation in the action dimensions, meaning that a slight change in action in a given state should usually result in a slight change in predicted value. Of course, since there will be discontinuities in most actual value functions, the algorithm must also allow fast changes of value over a small range of actions or states, or even discontinuities.

Fast Action Selection

Given a state, it is useful to be able to quickly find the optimal action according to the current prediction of the value function. Even when you have found a value function that fits the true value function perfectly, it is still of little use when, each time you want to select the best action in a state, you have to conduct a full search in the continuous action

space to find it. Therefore a good algorithm allows the optimal action to be found quickly.

Model Free

This is not really a criterion specific to algorithms in continuous spaces, but we would like our algorithms to be model free. By this we mean that the agent does not have an internal model of the environment or of the reward function. In many real world cases, such models are hard to establish. Also, constructing a model brings extra complexity and uncertainty to the problem. There are a number of algorithms that even depend on a derivative of the model to determine the effect an action has on the environment. This not only assumes a model, but also assumes the model is differentiable. Other algorithms depend on the derivative of the reward function, which requires similar conditions. We do not want to limit ourselves to settings where this information is obtainable by the agent. Therefore, we only consider model free algorithms that do not depend on the differentiability of a model of the environment or of the reward function.

3.3 Exploration

There are various possible ways to explore unknown territory. Of course, an agent looking for a solution needs to try different policies to find a good one. For this, exploration is required. In our experiments we compare two different methods of exploration. The first method of exploration we used is $\epsilon$-greedy exploration. We select an exploratory random action with probability $\epsilon$ and select the greedy action, the current approximation of the optimal action, with probability $(1 - \epsilon)$. It is then possible to decrease exploration by simply decreasing the factor $\epsilon$. All experiments lasted a fixed number of time steps, during which the exploration rate $\epsilon$ was exponentially decreased from 1 - a random action every time step - to a probability of 0.01 of performing a random action on each time step. Setting the decay of exploration higher or lower did not result in better performance. The second method of exploration is Gaussian exploration around the current approximation of the optimal action. With this exploration, the action that is in fact performed is sampled from a Gaussian distribution with its mean at the action output of the algorithm we are using. When $A_t(s)$ denotes the action that the algorithm outputs at time $t$, the policy for that time step will be:

$$\forall a \in A : \pi_t(s_t, a) = \frac{1}{\sqrt{2\pi}\sigma} e^{-(a - A_t(s_t))^2 / (2\sigma^2)}$$

Note that $\pi_t(s, a)$ denotes the policy function, while $\pi$ in the density denotes the mathematical constant. We chose a standard deviation of $\sigma = 0.1$ for the distribution. In contrast with the $\epsilon$-greedy exploration described above, this exploration did not decay. Also, the action that was actually performed was sampled from the distribution on every time step, so essentially there was some exploration at every time step.
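To illustrate the two exploration schemes for a one-dimensional action, here is a minimal sketch; the action bounds of [-1, 1], the clipping of the Gaussian sample and the exact exponential schedule for $\epsilon$ are assumptions, while $\sigma = 0.1$ and the decay from 1 to 0.01 follow the description above.

import numpy as np

rng = np.random.default_rng(0)
a_min, a_max = -1.0, 1.0                     # assumed action bounds

def epsilon_greedy(greedy_action, epsilon):
    # With probability epsilon take a uniformly random action, otherwise the greedy one.
    if rng.random() < epsilon:
        return rng.uniform(a_min, a_max)
    return greedy_action

def gaussian_exploration(greedy_action, sigma=0.1):
    # Sample around the current approximation of the optimal action (no decay).
    return float(np.clip(rng.normal(greedy_action, sigma), a_min, a_max))

def epsilon_schedule(t, n_steps):
    # Assumed exponential decay from 1.0 down to 0.01 over the course of learning.
    return 0.01 ** (t / n_steps)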

3.4 The Algorithms

All the algorithms presented below comply with the criteria in the previous section. Also, all of them make use of some sort of function approximator to give the optimal action for a given state. Learning the mapping from state to action involves a value function, which is used to determine which action is better than another and should eventually lead to a (locally) optimal action. The algorithms differ in how they train the function approximators used and how they find the optimal action. The first two algorithms have been described before, the last two are new. As far as we know, no comparison between such Reinforcement Learning algorithms that can handle continuous action spaces has been made before. Function approximators can be used to map a continuous space to another continuous space. Here our goal is to use them to map the continuous state space to a continuous action space. The main problems we face are the following:

- We do not know the optimal actions, so we cannot use these as targets.
- We do not want to search the entire action space to find the current approximation of the best action, without losing the ability to find a true optimal policy.

When the action space is finite and small, selecting the action with the highest value can be done simply by looking at the Q-value of all possible actions in the given state and then selecting the best one. When the action space becomes large or continuous, this is no longer feasible. Below some ways are suggested to deal with this increased complexity.

3.4.1 Wire Fitting

Baird and Klopf propose an algorithm that efficiently stores an approximation of the complete Q-function [2]. They propose using a function approximator to output multiple actions and corresponding values, given a certain state. Each action and corresponding value are output independently and concurrently. A fixed number of these action-value pairs are output when a state is input to the function approximator. These outputs can be interpolated if the value of an intermediate action is required. To find the action with the highest expected value given a certain state this is not necessary though, since the interpolation function is such that the highest value will always lie at one of the output actions. This allows fast action selection. The setup of the algorithm is shown in figure 3.1.

Figure 3.1: Wire Fitting Setup. The state s is input into the function approximator. The outputs of the function approximator, which represent actions and corresponding values, are then interpolated to give the value of the action a in state s.

The interpolation function as proposed by Baird and Klopf is:

$$f(s, a) = \lim_{\varepsilon \to 0} \frac{\displaystyle\sum_{i=0}^{n} \frac{q_i(s)}{\|a - a_i(s)\|^2 + c_i (q_{\max}(s) - q_i(s)) + \varepsilon}}{\displaystyle\sum_{i=0}^{n} \frac{1}{\|a - a_i(s)\|^2 + c_i (q_{\max}(s) - q_i(s)) + \varepsilon}} = \lim_{\varepsilon \to 0} \frac{\displaystyle\sum_{i=0}^{n} \frac{q_i(s)}{\mathrm{distance}_i(s, a)}}{\displaystyle\sum_{i=0}^{n} \frac{1}{\mathrm{distance}_i(s, a)}} = \lim_{\varepsilon \to 0} \frac{\mathrm{wsum}(s, a)}{\mathrm{norm}(s, a)}$$

Here $(s, a)$ is the state-action pair of which the value is wanted. $a_i(s)$ and $q_i(s)$ are the outputs corresponding to the $i$th action and value for this state, respectively. Time indicating subscripts are left out for increased legibility. Each action can be a vector, because an action might have multiple real valued components. $q_{\max}(s) := \max_j q_j(s)$ is defined as the maximum of all $q_i$. $c_i$ is a small smoothing factor and $\varepsilon$ is there to prevent division by zero. Basically, the interpolation gives a weighted average of the different values, depending on the distance of the given action to the actions that are output for this particular state. These actions are denoted as functions $a_i(s)$ because they are dependent on the state. The necessity of the smoothing factor $c_i (\max_j q_j(s) - q_i(s))$ may seem unclear, especially since without it we could just define $f(s, a) = q_i(s)$ when $a = a_i(s)$ for some $i$, and then we would also not need $\varepsilon$. Baird and Klopf also define $f(s, a_i(s)) = q_i(s)$ when finding the highest action, but they do note that this is not the same value that would be found when using the interpolation function with the smoothing factor. A result of including $c$ is that the values of the actions that are output are stressed, since the interpolated values will always lie more towards the mean of all values. Gaskett has shown experimentally that the precise value of the smoothing factor is not important, as long as it is a small positive value [10]. Typical values for $c$ would be on the order of $10^{-2}$.

Learning

Consider that an agent using Wire Fitting has an experience, consisting of the state it was in ($s_t$), an action ($a_t$) it performed, a reward ($r_t$) it received and the next state ($s_{t+1}$) it reached. We will now explain how this experience is used to update the system. As mentioned, in this algorithm the output of the interpolation is to be interpreted as the value of a given state-action pair.

We can then use the update rule (3.1) to update the parameters of the system to make the output more closely resemble the target output of $r_t + \gamma \max_a Q(s_{t+1}, a)$. Regardless of whether the action performed was exploratory or not, gradient descent can be performed on the squared difference between the value the interpolator gives for that action and the target output. This can be done according to equation (3.1). This results in updating the function approximators that output the actions by the following update, where $\theta^{a_j}_i$ are the parameters of the function approximator that outputs action $a_j$. These can in fact be the same parameters as for other actions, but a single function approximator can also be used per action, so we consider the general case. Again, the usual time indicating subscripts are left out to avoid confusion. Instead, $s'$ is used to denote the state that is reached by performing action $a$ in state $s$.

$$\theta^{a_j}_i = \theta^{a_j}_i + \alpha \left( r + \gamma \max_b Q(s', b) - Q(s, a) \right) \frac{\partial Q(s, a)}{\partial \theta^{a_j}_i} = \theta^{a_j}_i + \alpha \left( r + \gamma \max_j q_j(s') - Q(s, a) \right) \frac{\partial Q(s, a)}{\partial a_j} \frac{\partial a_j}{\partial \theta^{a_j}_i}$$

Where:

$$\frac{\partial Q(s, a)}{\partial a_j} = \lim_{\varepsilon \to 0} \frac{2 \left( \mathrm{wsum}(s, a) - \mathrm{norm}(s, a)\, q_j \right) (a_j - a)}{\left( \mathrm{norm}(s, a)\, \mathrm{distance}_j(s, a) \right)^2}$$

A similar update can be done for the parameters of the function approximators that output the values. Then we use:

$$\frac{\partial Q(s, a)}{\partial q_j} = \lim_{\varepsilon \to 0} \frac{\mathrm{norm}(s, a) \left( \mathrm{distance}_j(s, a) + q_j c \right) - \mathrm{wsum}(s, a)\, c}{\left( \mathrm{norm}(s, a)\, \mathrm{distance}_j(s, a) \right)^2}$$

But what does it mean to update the parameters by gradient descent on the prediction error of the interpolator? Basically, all actions and values are changed slightly in such a way that the resulting interpolation more closely resembles the given action-value pair. Because the gradient is dependent on the distance between the action that was performed and the actions that are output, the action outputs that are closest, and their values, are updated the most.
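A minimal sketch of the interpolation and of greedy action selection for a one-dimensional action; the arrays of wire actions and values stand in for the outputs of the function approximator for one state, and the constants c and epsilon are small assumed values in line with the text.

import numpy as np

c = 1e-2          # smoothing factor c_i (one small value for all wires)
eps = 1e-9        # small epsilon preventing division by zero

def wire_fit_value(a, wire_actions, wire_values):
    # Interpolated Q(s, a) from the wires (a_i(s), q_i(s)) output for state s.
    wire_actions = np.asarray(wire_actions, dtype=float)
    wire_values = np.asarray(wire_values, dtype=float)
    distance = (a - wire_actions) ** 2 + c * (np.max(wire_values) - wire_values) + eps
    wsum = np.sum(wire_values / distance)
    norm = np.sum(1.0 / distance)
    return wsum / norm

def greedy_action(wire_actions, wire_values):
    # The highest value always lies on one of the wires, so no search is needed.
    return wire_actions[int(np.argmax(wire_values))]

# Example: three wires output for some state.
actions, values = [-0.5, 0.0, 0.7], [0.2, 1.0, 0.4]
print(wire_fit_value(0.3, actions, values), greedy_action(actions, values))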

For a simple example, consider figure 3.2.

Figure 3.2: Wire Fitting Example. The red crosses are the initial outputs of the function approximator for a given state. The red line is the output of the interpolation given those actions and values. The x-axis contains the (one-dimensional) actions, while the y-axis contains the values of the actions. The blue + is an experienced action-value pair in the given state. The purple stars are the updated outputs and the purple line the updated interpolation. As can be seen, not all outputs necessarily move towards the +; instead they move so that the interpolation is closer to it.

Finding the optimal action

In a fully trained system with a low error between experienced action-value pairs and the interpolation output, we can assume that the optimal action is the one corresponding to the highest value. Because of the nature of the interpolator, this action is always amongst the ones output by the function approximator. This means that to select the highest valued action we do not have to use the interpolation function. We just select the action with the highest corresponding value. This results in very fast action selection.

Further Notes

The approach is called Wire Fitting (WF), because essentially the interpolation function describes a surface in $S \times A \times \mathbb{R}$ space, which is draped, so to say, over wires defined by the outputs of the function approximator. Because the whole value function is approximated, the system can, with enough outputs, reach any real valued optimal policy and generalise well, while allowing fast action selection.

In effect, the value outputs and the whole interpolation are only used to generate the error with which the actions can be updated, in order to reach a situation in which one of the outputs of the function approximator approximates the optimal action for each state. Which output is the optimal one can change for different states. In theory, this even allows the algorithm to find optimal policies with a limited number of discontinuities. This approach was also implemented by Gaskett et al. [11]. The results look promising, though no comparisons with other algorithms for continuous action spaces are made.

3.4.2 Gradient Ascent on the Value

Prokhorov and Wunsch described Adaptive Critic Designs, of which we implemented a version of their Action Dependent Heuristic Dynamic Programming (ADHDP) algorithm [14]. This algorithm uses a single actor function approximator to output just the optimal action, given a state. We shall denote the output of this actor at time $t$ as $A_t(s)$. When this action is selected, the Q-value can then be determined using a critic function approximator that outputs the Q-value given a state and action. This setup is shown in figure 3.3.

Figure 3.3: Adaptive Critic Designs Setup. The state s is input into the actor function approximator. The output $a = A_t(s)$ of the actor represents the current approximation of the optimal action. This action is then input into the critic together with the state s to give the current approximation of Q(s, a). Of course, when the value of another (exploratory) action is needed, only this last function approximator is used.

Learning

Equation (3.1) handles the updates to the parameters of the critic function approximator. This is relatively straightforward and an update can be made after each experience of a state, action, reward and new state. As stated above by equation (3.2), we can view this as gradient descent on the squared error $E(t)$:

$$E(t) = \left( r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right)^2$$

Training the actor function approximator is only slightly more complex. We need to find a target towards which we can train the output of the actor, given a certain state. For this we can use gradient information to determine how the value function would change if the action were changed locally. This can then be used to find a local maximum. Then, the newly found action (with a higher value, given the current state) can be used as a target to update the parameters of the action selection function approximator. Equivalently, the gradient information of the value can be propagated back immediately to the parameters of the actor. Calling the parameters of the actor $\theta^A$, this results in the following update:

$$\theta^A_i \leftarrow \theta^A_i + \alpha \frac{\partial Q_t(s, a)}{\partial \theta^A_i} = \theta^A_i + \alpha \frac{\partial Q_t(s, a)}{\partial A_t(s)} \frac{\partial A_t(s)}{\partial \theta^A_i} \quad (3.5)$$

So in summary, an update consists of a gradient descent update on the error of the output value for the critic function approximator and a gradient ascent update on the Q-value for the actor function approximator. A potential problem lies in the backpropagation of the value through the critic. Clearly, when the critic is not yet fully trained, the gradient information on the value will not always be accurate. In the beginning of training the information might even be completely incorrect. This can lead to incorrect updates to the parameters of the actor function approximator. These updates might hinder later learning for the actor when the critic is trained more. Note that in essence there are no real boundaries for the outputs of the actor. This can prove to be a problem when the actor outputs values that lie outside the range of possible actions. Of course, the action that is in fact performed can be clipped to fall within the range of possible actions, but then a problem still exists. All inputs and outputs are scaled to fall within a $[-1, 1]$ interval. For the actor this means that a value of $-1$ in one of the dimensions of an action corresponds to the lowest possible action in that dimension. However, update (3.5) can result in outputs that lie outside the $[-1, 1]$ interval. Especially, this can happen when the critic is not yet

properly trained and the gradient information is thus still unreliable. This can lead to divergence of the actor when the outputs come to lie ever further from the $[-1, 1]$ interval. We introduce a way to avoid this potential divergence. To avoid divergence of the actor, the probability of an update to the actor's parameters slowly increases during learning. This means that when the critic is not yet trained, not many updates to the actor will be performed. In the beginning of learning there will therefore be only updates to the critic and not to the actor on most time steps. This prevents early divergence of the actor function approximator due to the incorrect early gradient information provided by the critic. In our experiments, the probability of an update being performed was $(1 - p)$, where $p$ decreases exponentially from 1 to 0.01 over the course of learning. We implemented both the original algorithm as proposed by Prokhorov and Wunsch and the version we just described, which should avoid early divergence of the algorithm. The performance of both these algorithms is compared in chapter 5. Prokhorov and Wunsch did not mention any ways to prevent divergence of the actor for this algorithm. They do note that they did not get this algorithm to work on one of their problems. However, they propose using similar algorithms that do solve that problem and should have better performance on other problems. For these algorithms, they propose not actually learning the value function, but instead learning the derivative of the value function with respect to the inputs of the critic. They state that this can result in better performance, since in effect we are only interested in the value function because we need its derivative to train the actor. Directly training towards this derivative would result in one less approximation in the process. A problem with trying to learn the derivative of the value function is that the derivative of the reward function is required. In model-free settings, this derivative is usually not available. Also, in some cases the reward function may not be differentiable. For these reasons, we only implement the algorithm with a value function and not a value-derivative function. Also, though the algorithms may perform better when using the derivative of the value, the actor may still diverge when the approximation of the derivative is not yet accurate.

Finding the optimal action

Finding the current approximation of the optimal action in a given state is quite easy when using this method. We only have to propagate the state information through the actor function approximator; the output will be the action we want. The optimal action is therefore found very quickly.
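A minimal sketch of this scheme with a linear actor and a critic that is linear in hand-crafted state-action features, so that the value gradient with respect to the action can be written out directly; with neural networks, as used in the thesis, this gradient would be obtained by backpropagation. The feature choice, the use of the actor's action in place of the max in the critic target, and all constants are assumptions.

import numpy as np

state_dim = 3
rng = np.random.default_rng(0)
alpha_actor, alpha_critic, gamma = 0.001, 0.01, 0.95

theta_A = np.zeros(state_dim + 1)            # actor parameters: A(s) = theta_A . [s, 1]
w = np.zeros(2 * state_dim + 3)              # critic parameters for the features below

def actor(s):
    return float(theta_A @ np.append(s, 1.0))

def critic_features(s, a):
    # Assumed features [s, a*s, a, a^2, 1], chosen so that dQ/da is easy to write down.
    return np.concatenate([s, a * s, [a, a * a, 1.0]])

def q_value(s, a):
    return float(w @ critic_features(s, a))

def dq_da(s, a):
    # Gradient of the critic's output with respect to the action input.
    return float(w[state_dim:2 * state_dim] @ s + w[-3] + 2.0 * w[-2] * a)

def adhdp_update(s, a, r, s_next, p):
    # Critic: gradient descent on the squared TD error, with the actor's next action
    # standing in for the max over actions. Actor: gradient ascent on the value (3.5),
    # performed only with probability (1 - p) to avoid early divergence.
    global theta_A, w
    td_error = r + gamma * q_value(s_next, actor(s_next)) - q_value(s, a)
    w = w + alpha_critic * td_error * critic_features(s, a)
    if rng.random() < 1.0 - p:
        theta_A = theta_A + alpha_actor * dq_da(s, actor(s)) * np.append(s, 1.0)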

Similar Algorithms

The algorithms we implemented were the ADHDP algorithm and our version of it with fewer updates to the actor at the beginning of training. Prokhorov and Wunsch also describe other algorithms, but all of those use a derivative of the world model to determine the effect an action has on the state values, or a derivative of the reward function, or even both. These other algorithms are therefore not interesting to us, since we only want to examine model-free algorithms. A similar algorithm was proposed by Kimura and Kobayashi [13], though they extended the approach to handle eligibility traces and a stochastic policy. For this stochastic policy they use a normal distribution on the action space, after which they can perform gradient ascent with respect to the value on the mean and standard deviation of this distribution. Of course, in practice an optimal stochastic policy will not always follow a normal distribution on the action space. Kimura and Kobayashi only compare their approach to a fully discretised (state and action space) actor critic system. Other variations on this algorithm that also take continuous time into account are given by Doya [9] and Coulom [6]. Basically, their algorithms also perform gradient ascent on the value, though these last two use a state value function $V(s)$ instead of a Q-function $Q(s, a)$. Unfortunately, in these cases, because there is no direct line from the value to the action through which the error can be propagated, it is assumed that the state update is linear with respect to the action performed. This means that we would have to assume that the transition function $T(s, a, s')$ is linear with respect to $a$. We do not wish to make such an assumption, which is why we do not implement this version of gradient ascent on the value.

3.4.3 Interpolating Actors

The Interpolating Actors (IA) algorithm is the first of two new algorithms that are proposed in this thesis. For this approach we first consider actions consisting of only one real valued element. Extending the approach to action vectors of multiple dimensions is possible by simply running the approach in parallel for all action dimensions separately. This algorithm lets an actor function approximator produce $n$ outputs, where $n \ge 2$. For these outputs, the following properties should hold:

$$\sum_{i=1}^{n} o_i = 1 \quad \text{and} \quad \forall i : o_i \ge 0$$

where $o_i$ is the $i$th output. Subscripts indicating time are left out for clarity.

The input is the current state $s$. These outputs can then be used as weights for predetermined values $v$ that we want to interpolate between. For example, if the action space ranges from 0 to 100, we might use 3 outputs with the corresponding action value vector $v = (0, 50, 100)$. An action of for instance 70 can then be reached by letting the function approximator output the vector $(0.2, 0.2, 0.6)$ or $(0.0, 0.6, 0.4)$. This can for instance be accomplished by allowing the actor function approximator to output arbitrary positive values, which we shall denote as $y_i$, and then normalising them as follows:

$$o_i = \frac{y_i}{\sum_{j=1}^{n} y_j}$$

Alternatively, if we cannot or do not want to restrict all $y_i$ to be positive, the following normalisation can be used:

$$o_i = \frac{f(y_i)}{\sum_{j=1}^{n} f(y_j)}$$

where $f : \mathbb{R} \to [0, \infty)$ is a continuous differentiable function. A common choice in similar systems is the strictly monotonically increasing $f(x) = e^x$. It is probably an advantage to have a relatively smooth function, such as $f(x) = e^x$ or $f(x) = x^2$, since the learning algorithm will use only the first derivative to try to predict how the output changes when the actions change. If the function is too complex, the first derivative becomes a less reliable approximation of the true local behaviour of the function. Each output will have a fixed corresponding action value. These values are typically distributed evenly across the action space, which needs to be bounded to make this possible. The output of the complete actor system, denoted $A(s)$, then is:

$$A(s) = \langle o, v \rangle = \sum_{i=1}^{n} o_i v_i$$

Here $o$ is the output vector of the function approximator (after normalisation) and $v$ is its corresponding action value vector.
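A minimal sketch of the action computation, with $f(x) = e^x$ as the normalisation function; the raw outputs $y_i$ here are a plain array standing in for the actor function approximator's outputs for one state, and the fixed action values follow the 0-to-100 example above.

import numpy as np

v = np.array([0.0, 50.0, 100.0])       # fixed action values, evenly spread over [0, 100]

def interpolated_action(y):
    # Normalise the raw outputs with f(x) = e^x and interpolate: A(s) = sum_i o_i * v_i.
    # Subtracting the maximum only rescales every f(y_i) by the same factor, which,
    # as shown below for g(y) = lambda * f(y), leaves the weights o_i unchanged.
    f = np.exp(y - np.max(y))
    o = f / np.sum(f)
    return float(o @ v)

print(interpolated_action(np.array([0.1, -0.3, 0.5])))   # some action between 0 and 100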

Learning

The goal now is to learn how to interpolate the fixed action values so that the resulting action is optimal. To change the outputs, gradient ascent on the value can be used. Denoting the parameters of the actor as $\theta^A$, the update to these parameters then becomes:

$$\theta^A_i = \theta^A_i + \alpha \frac{\partial Q(s, a)}{\partial \theta_i} = \theta^A_i + \alpha \sum_{j=1}^{n} \frac{\partial Q(s, a)}{\partial y_j} \frac{\partial y_j}{\partial \theta_i} \quad (3.6)$$

Where:

$$\frac{\partial Q(s, a)}{\partial y_i} = \frac{\partial Q(s, a)}{\partial A(s)} \frac{\partial A(s)}{\partial y_i} = \frac{\partial Q(s, a)}{\partial A(s)} \sum_{k=1}^{n} \frac{\partial A(s)}{\partial o_k} \frac{\partial o_k}{\partial y_i}$$
$$= \frac{\partial Q(s, a)}{\partial A(s)} \sum_{k=1}^{n} v_k \frac{\delta_{ik} f'(y_i) \sum_{j=1}^{n} f(y_j) - f(y_k) f'(y_i)}{\left( \sum_{j=1}^{n} f(y_j) \right)^2}$$
$$= \frac{\partial Q(s, a)}{\partial A(s)} \frac{f'(y_i)}{\sum_{j=1}^{n} f(y_j)} \sum_{k=1}^{n} \left( \delta_{ik} v_k - \frac{v_k f(y_k)}{\sum_{j=1}^{n} f(y_j)} \right)$$
$$= \frac{\partial Q(s, a)}{\partial A(s)} \frac{f'(y_i)}{\sum_{j=1}^{n} f(y_j)} \left( v_i - \sum_{k=1}^{n} v_k o_k \right)$$
$$= \frac{\partial Q(s, a)}{\partial A(s)} \frac{f'(y_i) \left( v_i - A(s) \right)}{\sum_{j=1}^{n} f(y_j)}$$

$\partial Q(s, a) / \partial A(s)$ can be determined by calculating the gradient of the value (which is, of course, the output of the critic) with respect to the action input. $\delta_{ik}$ is the Kronecker delta, defined by:

$$\delta_{ij} = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$$

The update above is inversely dependent on the sum of all $f(y_i)$. This might become a problem if we allow the $f(y_i)$ to take any value. In particular, if the sum is very small, the update might be far too big, resulting in oscillations or divergence. If the sum is too big, the update might become too small, resulting in no change in the parameters. This might be remedied by updating the $y_i$ towards values where $\sum_i f(y_i) \approx 1$. For this, consider the fact that updating the parameters by (3.6) results in the same update as updating the parameters as follows:

$$\theta^A_i \leftarrow \theta^A_i - \alpha \sum_{j=1}^{n} E_j(s, a) \frac{\partial f(y_j)}{\partial \theta_i}$$

where $E_j$ is an error term, defined as:

$$E_j(s, a) = f(y_j) - f(y_j)^{\text{target}}, \qquad f(y_j)^{\text{target}} = f(y_j) + \frac{\partial Q(s, a)}{\partial f(y_j)}$$

We can then scale the targets such that the sum of all targets is 1. This is allowed, since the interpolation is linear and relative scaling of the inputs to the interpolation does not change the output of the interpolation and thus does not change the resulting action. To show this, we define $g(y) = \lambda f(y)$, where $\lambda$ is some real valued number. Then we get:

$$\frac{g(y_i)}{\sum_{j=1}^{n} g(y_j)} = \frac{\lambda f(y_i)}{\sum_{j=1}^{n} \lambda f(y_j)} = \frac{\lambda f(y_i)}{\lambda \sum_{j=1}^{n} f(y_j)} = \frac{f(y_i)}{\sum_{j=1}^{n} f(y_j)} = o_i$$

We can choose $\lambda$ so that $\sum_i g(y_i) = 1$. Now, we can scale the targets such that $\sum_i f(y_i)^{\text{target}} = 1$, resulting in $\sum_i f(y_i) \approx 1$ at the next time step. For this, we redefine the targets as follows:

$$f(y_j)^{\text{target scaled}} = \frac{f(y_j)^{\text{target}}}{\sum_{k=1}^{n} f(y_k)^{\text{target}}}$$

This results in the following update:

$$\theta^A_i \leftarrow \theta^A_i + \alpha \sum_{j=1}^{n} \left( f(y_j)^{\text{target scaled}} - f(y_j) \right) f'(y_j) \frac{\partial y_j}{\partial \theta_i} \quad (3.7)$$

A critic such as described for the former algorithm can then be used to predict the value of such an action in the given state. Then, once again, backpropagation of the output of the critic, which is the predicted Q value, to the actor output vector can be used to update the parameters of the actor function approximator to perform gradient ascent on the value. Because the update is once again dependent on backpropagation of the value through the critic, the possibility again exists that the earlier updates are less accurate than later updates, when the critic is trained better. Therefore, we

implemented two versions of this algorithm: one with an update to the actor system on every time step, and one where the probability of an update to the actor is $(1 - p)$, with $p$ decreasing exponentially from 1 to 0.01, similarly to the approach described in section 3.4.2.

Finding the optimal action

Finding the optimal action is relatively straightforward. We only have to forward propagate the state description through the actor function approximator, after which the outputs can be interpolated to form the action that is to be performed.

Further Notes

When the action space has more than one dimension, the different dimensions can be split over different interpolators. This is because, in general, it is not possible given the constraint $\sum_{i=1}^{n} o_i = 1$ to interpolate a fixed set of action value vectors such that all values in the set of possible actions can be reached and none outside this set can be reached. If there are linear constraints between different dimensions in the action space, these can sometimes be handled by one interpolator, though in general this will not often be the case. These constraints might also be fulfilled by handling the different dimensions asynchronously, as can be seen in the example in figure 3.4. This algorithm also allows for nonlinear constraints. For a numeric example, see figure 3.4.

Figure 3.4: Example for Interpolating Actors. As a multidimensional example, consider a robot that can drive at a certain speed and turn at a certain rate. The speed bounds might be [1, 10] and the turn rate bounds [-180, 180]. Two interpolators with fixed values v_speed = (1, 10) and v_turn = (-180, 180) could be used. If the possible turning rate is linearly dependent on the speed, with turn rate bounds that shrink linearly as the speed increases, then a single interpolator might be used with fixed vectors v = ((-180, 1), (180, 1), (0, 1), (0, 10)). Alternatively, the constraint can be enforced simply by defining the turn bounds as a function of the speed, first selecting the speed and then determining the turning rate, with the speed given as an input. Of course, the constraint can also be expressed as the speed being dependent on the turning rate, with the order of action selection changed accordingly, if this is convenient.

We shall examine experimentally whether the number of fixed outputs has an impact on performance.

Consider one action dimension that ranges from 0 to 100. Note that with 3 or more outputs, the number of output configurations that result in the output of more extreme numbers, such as 10 or 90, is significantly lower than the number of output configurations that result in the output of 50. This is not the case when using only 2 outputs, as then the number of configurations is equal for all outputs. A possible drawback of this algorithm is that the action space should not only be bounded, but the bounds must also be known beforehand. This is in order to be able to choose the fixed actions that will be interpolated. An extension that we did not implement would be to also update the values between which is interpolated. This can be done in a similar way to adapting the parameters of the function approximators.

3.4.4 Critic Actor Improvement

The Critic Actor Improvement (CAI) algorithm is the second new algorithm that is proposed in this thesis. To outline the idea, we first consider the tabular case, with discrete states and actions and without function approximators. The algorithm then works as follows. One table stores the values of the states. These can be updated with the TD-learning update rule (2.4). Another table stores the actions that should be performed in each state. Of course, these actions are performed with some exploration. For instance, $\epsilon$-greedy exploration can be used, where a random action is chosen with a small probability $\epsilon$. Another possibility would be to associate a certain probability with each action for every state, so that every possible action has a non-zero probability of being chosen in every state. Now, whenever the value of a state is to be increased, we conclude that the action that was just performed was probably a good action for that state. This is backed up by the following reasoning. The values of the states will converge to the actual discounted future rewards, given the current policy. If performing a certain exploratory action now results in a positive change of the value of a state, then this action will in principle lead to a higher discounted future reward and thus a better policy. Therefore, we reinforce this action for this state. In pseudo-code we get:

if $r_t + \gamma V_t(s_{t+1}) > V_t(s_t)$
then increase($\pi(s_t, a_t)$)
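A minimal sketch of this tabular scheme, with $\epsilon$-greedy exploration; how increase($\pi(s_t, a_t)$) is realised is not fixed above, so the preference table used here (with the stored action taken to be the one with the highest preference) is one possible, assumed, concrete choice, as are all constants.

import numpy as np

n_states, n_actions = 10, 4
alpha, gamma, epsilon = 0.1, 0.95, 0.1       # assumed constants
rng = np.random.default_rng(0)

V = np.zeros(n_states)                       # critic table, updated with TD rule (2.4)
pref = np.zeros((n_states, n_actions))       # actor table: argmax gives the stored action

def select_action(s):
    # Epsilon-greedy exploration around the currently stored action.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(pref[s]))

def cai_update(s, a, r, s_next):
    # If the TD target exceeds the current value of the state, the action just taken
    # apparently improved matters, so it is reinforced: increase(pi(s_t, a_t)).
    target = r + gamma * V[s_next]
    if target > V[s]:
        pref[s, a] += 1.0
    V[s] = (1 - alpha) * V[s] + alpha * target   # TD-learning update (2.4)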


Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

Review: TD-Learning. TD (SARSA) Learning for Q-values. Bellman Equations for Q-values. P (s, a, s )[R(s, a, s )+ Q (s, (s ))]

Review: TD-Learning. TD (SARSA) Learning for Q-values. Bellman Equations for Q-values. P (s, a, s )[R(s, a, s )+ Q (s, (s ))] Review: TD-Learning function TD-Learning(mdp) returns a policy Class #: Reinforcement Learning, II 8s S, U(s) =0 set start-state s s 0 choose action a, using -greedy policy based on U(s) U(s) U(s)+ [r

More information

CSE 190: Reinforcement Learning: An Introduction. Chapter 8: Generalization and Function Approximation. Pop Quiz: What Function Are We Approximating?

CSE 190: Reinforcement Learning: An Introduction. Chapter 8: Generalization and Function Approximation. Pop Quiz: What Function Are We Approximating? CSE 190: Reinforcement Learning: An Introduction Chapter 8: Generalization and Function Approximation Objectives of this chapter: Look at how experience with a limited part of the state set be used to

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo Marc Toussaint University of

More information

Notes on Reinforcement Learning

Notes on Reinforcement Learning 1 Introduction Notes on Reinforcement Learning Paulo Eduardo Rauber 2014 Reinforcement learning is the study of agents that act in an environment with the goal of maximizing cumulative reward signals.

More information

Policy Gradient Reinforcement Learning for Robotics

Policy Gradient Reinforcement Learning for Robotics Policy Gradient Reinforcement Learning for Robotics Michael C. Koval mkoval@cs.rutgers.edu Michael L. Littman mlittman@cs.rutgers.edu May 9, 211 1 Introduction Learning in an environment with a continuous

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

Reinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN

Reinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN Reinforcement Learning for Continuous Action using Stochastic Gradient Ascent Hajime KIMURA, Shigenobu KOBAYASHI Tokyo Institute of Technology, 4259 Nagatsuda, Midori-ku Yokohama 226-852 JAPAN Abstract:

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning RL in continuous MDPs March April, 2015 Large/Continuous MDPs Large/Continuous state space Tabular representation cannot be used Large/Continuous action space Maximization over action

More information

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016 Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the

More information

CSE 190: Reinforcement Learning: An Introduction. Chapter 8: Generalization and Function Approximation. Pop Quiz: What Function Are We Approximating?

CSE 190: Reinforcement Learning: An Introduction. Chapter 8: Generalization and Function Approximation. Pop Quiz: What Function Are We Approximating? CSE 190: Reinforcement Learning: An Introduction Chapter 8: Generalization and Function Approximation Objectives of this chapter: Look at how experience with a limited part of the state set be used to

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

Reinforcement Learning. Spring 2018 Defining MDPs, Planning

Reinforcement Learning. Spring 2018 Defining MDPs, Planning Reinforcement Learning Spring 2018 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state

More information

Olivier Sigaud. September 21, 2012

Olivier Sigaud. September 21, 2012 Supervised and Reinforcement Learning Tools for Motor Learning Models Olivier Sigaud Université Pierre et Marie Curie - Paris 6 September 21, 2012 1 / 64 Introduction Who is speaking? 2 / 64 Introduction

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

CSC321 Lecture 22: Q-Learning

CSC321 Lecture 22: Q-Learning CSC321 Lecture 22: Q-Learning Roger Grosse Roger Grosse CSC321 Lecture 22: Q-Learning 1 / 21 Overview Second of 3 lectures on reinforcement learning Last time: policy gradient (e.g. REINFORCE) Optimize

More information

Q-learning. Tambet Matiisen

Q-learning. Tambet Matiisen Q-learning Tambet Matiisen (based on chapter 11.3 of online book Artificial Intelligence, foundations of computational agents by David Poole and Alan Mackworth) Stochastic gradient descent Experience

More information

Reinforcement Learning. George Konidaris

Reinforcement Learning. George Konidaris Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom

More information

15-780: ReinforcementLearning

15-780: ReinforcementLearning 15-780: ReinforcementLearning J. Zico Kolter March 2, 2016 1 Outline Challenge of RL Model-based methods Model-free methods Exploration and exploitation 2 Outline Challenge of RL Model-based methods Model-free

More information

MS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction

MS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction MS&E338 Reinforcement Learning Lecture 1 - April 2, 2018 Introduction Lecturer: Ben Van Roy Scribe: Gabriel Maher 1 Reinforcement Learning Introduction In reinforcement learning (RL) we consider an agent

More information

Temporal difference learning

Temporal difference learning Temporal difference learning AI & Agents for IET Lecturer: S Luz http://www.scss.tcd.ie/~luzs/t/cs7032/ February 4, 2014 Recall background & assumptions Environment is a finite MDP (i.e. A and S are finite).

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Daniel Hennes 19.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Eligibility traces n-step TD returns Forward and backward view Function

More information

CS 570: Machine Learning Seminar. Fall 2016

CS 570: Machine Learning Seminar. Fall 2016 CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or

More information

Reinforcement Learning Part 2

Reinforcement Learning Part 2 Reinforcement Learning Part 2 Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ From previous tutorial Reinforcement Learning Exploration No supervision Agent-Reward-Environment

More information

(Deep) Reinforcement Learning

(Deep) Reinforcement Learning Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 17 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015

More information

Reinforcement Learning In Continuous Time and Space

Reinforcement Learning In Continuous Time and Space Reinforcement Learning In Continuous Time and Space presentation of paper by Kenji Doya Leszek Rybicki lrybicki@mat.umk.pl 18.07.2008 Leszek Rybicki lrybicki@mat.umk.pl Reinforcement Learning In Continuous

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Stuart Russell, UC Berkeley Stuart Russell, UC Berkeley 1 Outline Sequential decision making Dynamic programming algorithms Reinforcement learning algorithms temporal difference

More information

Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016)

Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016) Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016) Yoonho Lee Department of Computer Science and Engineering Pohang University of Science and Technology October 11, 2016 Outline

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396 Machine Learning Reinforcement learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1396 1 / 32 Table of contents 1 Introduction

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Cyber Rodent Project Some slides from: David Silver, Radford Neal CSC411: Machine Learning and Data Mining, Winter 2017 Michael Guerzhoy 1 Reinforcement Learning Supervised learning:

More information

CO6: Introduction to Computational Neuroscience

CO6: Introduction to Computational Neuroscience CO6: Introduction to Computational Neuroscience Lecturer: J Lussange Ecole Normale Supérieure 29 rue d Ulm e-mail: johann.lussange@ens.fr Solutions to the 2nd exercise sheet If you have any questions regarding

More information

Reinforcement Learning. Machine Learning, Fall 2010

Reinforcement Learning. Machine Learning, Fall 2010 Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30

More information

State Space Abstractions for Reinforcement Learning

State Space Abstractions for Reinforcement Learning State Space Abstractions for Reinforcement Learning Rowan McAllister and Thang Bui MLG RCC 6 November 24 / 24 Outline Introduction Markov Decision Process Reinforcement Learning State Abstraction 2 Abstraction

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks

Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks Kary Främling Helsinki University of Technology, PL 55, FI-25 TKK, Finland Kary.Framling@hut.fi Abstract. Reinforcement

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

Lecture 10 - Planning under Uncertainty (III)

Lecture 10 - Planning under Uncertainty (III) Lecture 10 - Planning under Uncertainty (III) Jesse Hoey School of Computer Science University of Waterloo March 27, 2018 Readings: Poole & Mackworth (2nd ed.)chapter 12.1,12.3-12.9 1/ 34 Reinforcement

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

The Book: Where we are and where we re going. CSE 190: Reinforcement Learning: An Introduction. Chapter 7: Eligibility Traces. Simple Monte Carlo

The Book: Where we are and where we re going. CSE 190: Reinforcement Learning: An Introduction. Chapter 7: Eligibility Traces. Simple Monte Carlo CSE 190: Reinforcement Learning: An Introduction Chapter 7: Eligibility races Acknowledgment: A good number of these slides are cribbed from Rich Sutton he Book: Where we are and where we re going Part

More information

Linear Least-squares Dyna-style Planning

Linear Least-squares Dyna-style Planning Linear Least-squares Dyna-style Planning Hengshuai Yao Department of Computing Science University of Alberta Edmonton, AB, Canada T6G2E8 hengshua@cs.ualberta.ca Abstract World model is very important for

More information

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS 188 Spring 2017 Introduction to Artificial Intelligence Midterm V2 You have approximately 80 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. Mark

More information

Lecture 3: The Reinforcement Learning Problem

Lecture 3: The Reinforcement Learning Problem Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

arxiv: v1 [cs.ai] 5 Nov 2017

arxiv: v1 [cs.ai] 5 Nov 2017 arxiv:1711.01569v1 [cs.ai] 5 Nov 2017 Markus Dumke Department of Statistics Ludwig-Maximilians-Universität München markus.dumke@campus.lmu.de Abstract Temporal-difference (TD) learning is an important

More information

1 Introduction 2. 4 Q-Learning The Q-value The Temporal Difference The whole Q-Learning process... 5

1 Introduction 2. 4 Q-Learning The Q-value The Temporal Difference The whole Q-Learning process... 5 Table of contents 1 Introduction 2 2 Markov Decision Processes 2 3 Future Cumulative Reward 3 4 Q-Learning 4 4.1 The Q-value.............................................. 4 4.2 The Temporal Difference.......................................

More information

Reinforcement Learning Active Learning

Reinforcement Learning Active Learning Reinforcement Learning Active Learning Alan Fern * Based in part on slides by Daniel Weld 1 Active Reinforcement Learning So far, we ve assumed agent has a policy We just learned how good it is Now, suppose

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Off-Policy Actor-Critic

Off-Policy Actor-Critic Off-Policy Actor-Critic Ludovic Trottier Laval University July 25 2012 Ludovic Trottier (DAMAS Laboratory) Off-Policy Actor-Critic July 25 2012 1 / 34 Table of Contents 1 Reinforcement Learning Theory

More information

Monte Carlo is important in practice. CSE 190: Reinforcement Learning: An Introduction. Chapter 6: Temporal Difference Learning.

Monte Carlo is important in practice. CSE 190: Reinforcement Learning: An Introduction. Chapter 6: Temporal Difference Learning. Monte Carlo is important in practice CSE 190: Reinforcement Learning: An Introduction Chapter 6: emporal Difference Learning When there are just a few possibilitieo value, out of a large state space, Monte

More information

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon. Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,

More information

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010 Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

Chapter 8: Generalization and Function Approximation

Chapter 8: Generalization and Function Approximation Chapter 8: Generalization and Function Approximation Objectives of this chapter: Look at how experience with a limited part of the state set be used to produce good behavior over a much larger part. Overview

More information

Reinforcement Learning: An Introduction

Reinforcement Learning: An Introduction Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is

More information

REINFORCEMENT LEARNING

REINFORCEMENT LEARNING REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Branes with Brains. Reinforcement learning in the landscape of intersecting brane worlds. String_Data 2017, Boston 11/30/2017

Branes with Brains. Reinforcement learning in the landscape of intersecting brane worlds. String_Data 2017, Boston 11/30/2017 Branes with Brains Reinforcement learning in the landscape of intersecting brane worlds FABIAN RUEHLE (UNIVERSITY OF OXFORD) String_Data 2017, Boston 11/30/2017 Based on [work in progress] with Brent Nelson

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Machine Learning I Continuous Reinforcement Learning

Machine Learning I Continuous Reinforcement Learning Machine Learning I Continuous Reinforcement Learning Thomas Rückstieß Technische Universität München January 7/8, 2010 RL Problem Statement (reminder) state s t+1 ENVIRONMENT reward r t+1 new step r t

More information