Reinforcement Learning using Continuous Actions

Hado van Hasselt

2005

Concluding thesis for Cognitive Artificial Intelligence, University of Utrecht

First supervisor: Dr. Marco A. Wiering, University of Utrecht
Second supervisor: Dr. Lev D. Beklemishev, University of Utrecht
Third supervisor: Dr. Vincent van Oostrom, University of Utrecht

Contents

1 Introduction
2 Reinforcement Learning
  2.1 General Framework
  2.2 Values and Q-Functions
  2.3 Learning the Q-Function
3 Reinforcement Learning in Continuous Spaces
  3.1 Continuous State Space
  3.2 Continuous Action Space
  3.3 Exploration
  3.4 The Algorithms
    3.4.1 Wire Fitting
    3.4.2 Gradient Ascent on the Value
    3.4.3 Interpolating Actors
    3.4.4 Critic Actor Improvement
4 Experiments
  4.1 Experiment 1: Tracking
    4.1.1 Different Settings
  4.2 Experiment 2: Tracking with an Obstacle
    4.2.1 Different Settings
  4.3 Experiment 3: Cart Pole
    4.3.1 Different Settings
  4.4 Implementation Specifics
5 Results
  5.1 Results on the Tracking Experiments
    5.1.1 Final Results
    5.1.2 Speed of Learning
  5.2 Results on the Cart Pole Experiments
    5.2.1 Final Results
    5.2.2 Speed of Learning
  5.3 Overview of Results
6 Conclusion
  6.1 Summary
  6.2 Further Research
A Convergence of Improve when Improving
B Storing the Variance

Chapter 1
Introduction

There are many real world settings that require an agent to behave in a certain way to perform some task. As an example, consider a traffic light as an agent whose behaviour performs the task of regulating traffic. In some cases, the required behaviour is clear and we can simply describe what the actor should do in all situations. In other cases, the solution for a task might not be so obvious. Reinforcement Learning supplies us with a framework in which we can allow an agent to learn to solve a task by giving reinforcement signals to the agent. For these signals, we do not have to know the correct behaviour beforehand; instead we can tell the agent how good we think it is doing and it will learn from this. In short, Reinforcement Learning tells an agent what to do, not how to do it. This, and the fact that the agent interacts with its environment during learning, contrasts Reinforcement Learning with supervised learning algorithms, since we do not need examples of good behaviour in order to achieve it. This allows for the possibility of an agent finding ways to solve a task that we might not have thought of in advance. Although Reinforcement Learning algorithms cannot be classified as supervised learning algorithms, they are not unsupervised either. We do need some kind of reinforcement signal from the environment that tells the agent how it is doing. However, for a lot of problems these signals are much easier to find than actual examples of correct behaviour. Reinforcement Learning has proven itself useful in finding solutions for problems in various fields. Some examples are game playing, such as an agent capable of playing backgammon [21], and robotics, such as an elevator dispatching system [7] and robots playing soccer [24]. Basically, Reinforcement Learning can be applied to any problem that involves actions and reinforcements

in some environment. In this thesis, we are looking at problems that involve continuous spaces. A fair amount of research has been done on Reinforcement Learning in continuous environments, but the research on problems where the actions can also be chosen from a continuous space is more limited. We will describe some important algorithms that have already been developed for these problems, as well as two new algorithms. Then we will compare these algorithms experimentally to find the advantages and disadvantages of each of them. Of course, it can very well be that one algorithm is more suited for some tasks, while another performs better on other tasks. We will try to reason why the algorithms find certain solutions rather than others.

CKI

This thesis was written as a conclusion to the Cognitieve Kunstmatige Intelligentie (CKI, Cognitive Artificial Intelligence) programme at the University of Utrecht. In CKI, philosophical, psychological and computational views on artificial intelligence are discussed. The subject of this thesis falls under the computational view by discussing ways to make artificial agents learn by experience. There are some similarities between the computational and the biological Reinforcement Learning frameworks, although they differ in some important ways. There is also some evidence that similar processes are active in the human brain. None of these similarities will be discussed further in this thesis, as we will only concentrate on the computational version of Reinforcement Learning.

Outline

In chapter 2, a relatively short introduction to classic Reinforcement Learning will be given, as well as some of the notation used in this thesis. In chapter 3, we will extend the framework to continuous spaces. We will also describe the algorithms we used and give some theoretical argumentation on their workings. Then, in chapter 4 we will describe the experiments we have performed with the algorithms, after which the experimental results are presented in chapter 5. Finally, a conclusion is given in chapter 6.

Chapter 2
Reinforcement Learning

In problems that can be handled using Reinforcement Learning, there is always an agent that interacts with the environment. The goal is to optimise the behaviour of the agent in terms of some reinforcement signal. The reinforcement signal is usually referred to as reward, even if it is in fact a punishing, negative signal. This reward is provided by the environment. The actions of the agent can also affect the environment, complicating the search for the optimal behaviour. For a detailed introduction to the field of Reinforcement Learning, see the book by Sutton and Barto [20]. Below follows a short introduction.

2.1 General Framework

When an agent performs an action, the environment returns a new state and a reward. The transitions between states may be stochastic, meaning that it is possible that a certain action in a certain state does not always result in reaching the same state or receiving the same reward. Usually it is assumed that the probabilities of the state transitions stay the same, although when the environment is dynamic - for instance when multiple interacting agents are learning at the same time - this may not be the case. The whole can be seen as a dynamical system, where the goal is to find the sequence of inputs to the system - the actions - so that the cumulative reward is optimised. The difficulty lies in the fact that the dynamics of the system are usually unknown to the agent. Reinforcement Learning has the advantage over the analytical problem solving algorithms supplied by standard control theory that it can solve problems without a model of the environment, simply by interacting with it. Also, in a lot of

real cases, even if a model is present, the complexity of finding an analytical solution may be very high. In these cases Reinforcement Learning can be used to find an optimal or near optimal solution. A Reinforcement Learning problem can be seen as a tuple $(S, A, R, T)$ where:

- $s_t \in S$ is the state the agent is in at time $t$.
- $a_t \in A$ is the action the agent takes at time $t$.
- $R : S \times A \times S \to \mathbb{R}$ is the reward function that maps a state $s_t$, an action $a_t$ and a next state $s_{t+1}$ into a reward $R(s_t, a_t, s_{t+1})$. This reward is known to the agent when reaching the state $s_{t+1}$. We will use the shorthand notation $r_t$ for $R(s_t, a_t, s_{t+1})$.
- $T : S \times A \times S \to [0, 1]$ is the transition function, where $T(s, a, s')$ gives the probability of arriving in state $s'$ when taking action $a$ in state $s$.

Note that since $t$ enumerates time and not states or actions, it can hold that $s_t = s_{t'}$ or $a_t = a_{t'}$ while $t \neq t'$. Also note that it is possible that the reward is not directly dependent on the action, so that it is in fact a function $S \times S \to \mathbb{R}$ or even $S \to \mathbb{R}$. In these cases the reward will still be indirectly affected by the action, because the transitions between states depend on the actions. The goal of the agent can be formulated as the task of learning an action selection policy $\pi : S \to A$ mapping states to actions, maximising the cumulative reward. Note that in some cases it may be better to have a stochastic policy, meaning that it might be good to allow the possibility of different actions being selected in a certain state. Then a policy would be a mapping of state-action pairs to probabilities: $\pi : S \times A \to [0, 1]$. We will use this second, more general definition of policies. When we are not using actual stochastic policies, the probability will be 1 for exactly one action in each state and 0 for all other actions. All algorithms described below can be extended to problems with truly stochastic policies. For instance, each algorithm can be made to output a probability distribution over actions instead of a single action, when given a state.

2.2 Values and Q-Functions

In the field of Reinforcement Learning the agent learns by storing values for each state or for each state-action pair. State values represent the cumulative

reward that the agent expects to receive in the future when reaching that state. State-action values represent the cumulative reward it expects to receive when it performs that specific action in that state. Formally, when at some state $s_t$, we want the agent to optimise the total return:

$$r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

Here $0 \le \gamma \le 1$ is a discount factor. When it is set lower than 1, it can be used to increase the importance of quick rewards compared to more distant ones. It also ensures that the above sum is finite, even if the sequence is infinite. In settings where some end state is always reached within finite time, $\gamma = 1$ might be used, though care should be taken that the value of the sum does not become too large. Also, when for instance a maze problem is considered, a seemingly logical reward would be 1 when exiting the maze and 0 on all other time steps. When this is the case, any policy that will eventually lead to exiting the maze will result in a return of 1 when $\gamma = 1$. This allows no distinction between policies, resulting in the possibility that it takes a very long time before the exit is found. When $\gamma < 1$ there is a difference in values, since actions that allow a faster escape will be valued higher due to the discount factor. Another option would be to give a reward of $-1$ on each time step that the agent is still in the maze. In all problems, care should be taken that the reward that the agent receives is practical and meaningful for the task we want it to fulfil, since this is what will be optimised. This also means that care should be taken regarding what kind of information about how to perform a certain task is included in the reward function. For instance, Touzet [22] performed an experiment with a robot whose task was to drive around in an environment with obstacles. He used a reward function that gave positive rewards when performing the act of avoidance and negative rewards when bumping into something. This means that in effect, the agent is encouraged to find objects it can avoid. This is exactly what the robot did, finding a corner with an obstacle nearby and moving around there. Probably, using another reward function, such as positive rewards corresponding to the speed of the robot and negative rewards for bumping into things, more natural behaviour would have occurred. The value of a state $s$ is denoted $V(s)$. The value of a state-action pair $(s, a)$ is denoted $Q(s, a)$. Let $Q^\pi$ and $V^\pi$ denote the Q-function and the value function corresponding to some policy $\pi$. Then, by definition:

$$V^\pi(s) = E_\pi\left\{ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \,\Big|\, s_t = s \right\}$$
$$= E_\pi\left\{ r_t + \gamma \sum_{i=0}^{\infty} \gamma^i r_{t+i+1} \,\Big|\, s_t = s \right\}$$
$$= \sum_{a' \in A} \pi(s, a') \sum_{s' \in S} T(s, a', s') \left( R(s, a', s') + \gamma E_\pi\left\{ \sum_{i=0}^{\infty} \gamma^i r_{t+i+1} \,\Big|\, s_{t+1} = s' \right\} \right)$$
$$= \sum_{a' \in A} \pi(s, a') \sum_{s' \in S} T(s, a', s') \left( R(s, a', s') + \gamma V^\pi(s') \right)$$

Here $E$ is the expectation operator. Likewise, we can view the Q-function in terms of the value and, more importantly, in terms of itself:

$$Q^\pi(s, a) = E_\pi\left\{ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \,\Big|\, s_t = s, a_t = a \right\}$$
$$= \sum_{s' \in S} T(s, a, s') \left( R(s, a, s') + \gamma V^\pi(s') \right)$$
$$= \sum_{s' \in S} T(s, a, s') \left( R(s, a, s') + \gamma \sum_{a' \in A} \pi(s', a') Q^\pi(s', a') \right)$$

This definition states that the Q value is the expected immediate reward plus the expected future (discounted) rewards when following policy $\pi$. A policy is defined to be better than another policy when its values are higher for each state or state-action pair. We denote the optimal policy by $\pi^*$ and its corresponding state and state-action values by $V^*$ and $Q^*$. There is always at least one optimal policy. We then have:

$$V^*(s) = \max_\pi V^\pi(s)$$
$$\pi^* = \arg\max_\pi V^\pi(s) \quad \forall s \in S$$
$$V^*(s) = \max_a Q^*(s, a)$$
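To make these definitions concrete, the following is a minimal sketch (not from the thesis) of iterative policy evaluation on a tiny, made-up MDP: the Bellman equation for $V^\pi$ is applied repeatedly until the values stop changing. The three-state MDP, the uniform random policy and all constants are illustrative assumptions.

import numpy as np

# A tiny illustrative MDP (not from the thesis): 3 states, 2 actions.
# T[s, a, s'] is the transition probability, R[s, a, s'] the reward.
n_states, n_actions = 3, 2
T = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))
T[0, 0] = [0.8, 0.2, 0.0]; T[0, 1] = [0.0, 0.9, 0.1]
T[1, 0] = [0.1, 0.8, 0.1]; T[1, 1] = [0.0, 0.2, 0.8]
T[2, 0] = [0.0, 0.0, 1.0]; T[2, 1] = [0.0, 0.0, 1.0]
R[:, :, 2] = 1.0                              # arriving in state 2 gives reward 1

pi = np.full((n_states, n_actions), 0.5)      # uniform random policy pi(s, a)
gamma = 0.9

# Iterative policy evaluation: apply the Bellman equation for V^pi until convergence.
V = np.zeros(n_states)
for _ in range(1000):
    V_new = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            V_new[s] += pi[s, a] * np.sum(T[s, a] * (R[s, a] + gamma * V))
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

print("V^pi =", V)

The iteration converges to the fixed point of the Bellman equation above; the sample-based updates in the next section approach such values without access to $T$ and $R$.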

When the optimal Q-function is known or learnt, the agent can simply select the action with the highest Q value in a given state to follow the optimal policy.

2.3 Learning the Q-Function

Of course, the Q-function is not known from the beginning. This means that in general it will not hold that $\pi^*(s) = \arg\max_a Q(s, a)$. To resolve this, actions are tried and the rewards are noted and used to update the Q value corresponding to the state and action, to make it closer to the actual value of the state-action pair. We know that the Q-function corresponding to the optimal policy has the following property:

$$Q^*(s_t, a_t) = \sum_{s' \in S} T(s_t, a_t, s') \left( R(s_t, a_t, s') + \gamma \max_{a'} Q^*(s', a') \right) \quad (2.1)$$

which is called the Bellman optimality equation for $Q^*$ [3, 20]. These values can then be updated via for instance Sarsa [15, 19]:

$$Q_{t+1}(s_t, a_t) = (1 - \alpha) Q_t(s_t, a_t) + \alpha \left( r_t + \gamma Q_t(s_{t+1}, a_{t+1}) \right) \quad (2.2)$$

or Q-learning [23]:

$$Q_{t+1}(s_t, a_t) = (1 - \alpha) Q_t(s_t, a_t) + \alpha \left( r_t + \gamma \max_a Q_t(s_{t+1}, a) \right) \quad (2.3)$$

where $0 \le \alpha \le 1$ is a learning rate. For now we consider the learning rate to be fixed, but sometimes it is taken to be a function of time, states and/or actions and then it is usually denoted as $\alpha_t$. As can be seen, the Q-learning update rule is very similar to the definition of the optimal Q-function $Q^*$. Essentially, the idea is that if this update is performed an infinite number of times on all state-action pairs, eventually equation (2.1) will hold and the optimal policy will be found. There are convergence proofs for Q-learning and Sarsa under the assumption that every state-action pair is experienced an infinite number of times. Because Q-learning can learn about the optimal policy without actually following it, it is called off-policy. Sarsa is an on-policy algorithm. This means that for Sarsa to converge to the optimal policy, exploration must decay during learning. When we are not choosing actions at random, but instead follow the policy that corresponds to the present values, this is called exploitation. More information on the methods we used for exploration is included in the chapter on the experiments.
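As an illustration, here is a minimal sketch (not from the thesis) of the tabular Sarsa update (2.2) and Q-learning update (2.3); the table size, learning rate and discount factor are arbitrary assumptions.

import numpy as np

n_states, n_actions = 10, 4     # illustrative sizes
alpha, gamma = 0.1, 0.95        # assumed learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def sarsa_update(s, a, r, s_next, a_next):
    # Sarsa (2.2), on-policy: bootstrap on the action that will actually be taken.
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next, a_next])

def q_learning_update(s, a, r, s_next):
    # Q-learning (2.3), off-policy: bootstrap on the greedy action in the next state.
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))

The only difference between the two updates is the bootstrap term, which is exactly what makes Q-learning off-policy and Sarsa on-policy.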

Of course, it is also possible to learn the values of states instead of the values of state-action pairs. The update, known as Temporal Difference learning [18], then becomes:

$$V_{t+1}(s_t) = (1 - \alpha) V_t(s_t) + \alpha \left( r_t + \gamma V_t(s_{t+1}) \right) \quad (2.4)$$

It should be noted that this equation learns the values of states given a certain policy and does not necessarily learn what the values of the states would be when following the optimal policy. It has been proven that when these values are stored in a table, using this update will allow the values to converge to the actual expected returns [18, 8].
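The corresponding tabular sketch for the Temporal Difference update (2.4), again with assumed sizes and constants:

import numpy as np

n_states = 10
alpha, gamma = 0.1, 0.95
V = np.zeros(n_states)

def td_update(s, r, s_next):
    # TD learning (2.4): estimates the values of states under the policy being followed.
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])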

Chapter 3
Reinforcement Learning in Continuous Spaces

The algorithms presented in the previous chapter assume finite and discrete state and action spaces. There are of course settings in which one or both spaces are continuous. This presents problems for the conventional algorithms, especially when considering the storage of and access to the Q-values. Therefore, extensions to these algorithms are presented that allow Reinforcement Learning in continuous state and action spaces. First a short introduction to handling continuous state spaces will be given. This is a relatively well-studied field, where a lot of work has already been done. Then we will continue with the harder problem of continuous action spaces. Here less work has been done, and this is the subject on which we shall concentrate. Methods that can only handle continuous action spaces are not discussed, since in this thesis we are only interested in algorithms that can handle continuity in both spaces.

3.1 Continuous State Space

When the state space becomes large or even continuous and therefore infinite, parametrised function approximators can be used to store observed state-action pairs and generalise to unseen states. For instance, a neural network can be used with the weights of the network as parameters. (The workings of neural networks will not be covered in this thesis; there are a lot of introductory texts available on the subject, see for instance the book by Bishop [5].) The update is then performed on the parameters of this function approximator. Let $\theta^Q$ denote these parameters. The update rule corresponding to

Q-learning is derived from (2.3) and then becomes:

$$\theta^Q_i = \theta^Q_i + \alpha \left( r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right) \frac{\partial Q_t(s_t, a_t)}{\partial \theta^Q_i} \quad (3.1)$$

Here $\theta^Q_i$ is the $i$th component of the parameter vector $\theta^Q$ and $Q_t(s, a)$ is the output of the function approximator at time $t$ when given state $s$ and action $a$ as inputs. The update above can also be seen as error backpropagation. For this view, consider $r_t + \gamma \max_a Q_t(s_{t+1}, a)$ as target output $T_t$ and $Q_t(s_t, a_t)$ as present output $Y_t$, and use the squared difference as error function: $E(t) = \frac{1}{2}(T_t - Y_t)^2$. The above update rule then comes down to gradient descent on the error:

$$\theta^Q_i = \theta^Q_i - \alpha \frac{\partial E(t)}{\partial \theta^Q_i} \quad (3.2)$$

Similarly, the update rules corresponding to Sarsa and TD learning are:

$$\theta^Q_i = \theta^Q_i + \alpha \left( r_t + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \right) \frac{\partial Q_t(s_t, a_t)}{\partial \theta^Q_i} \quad (3.3)$$

$$\theta^V_i = \theta^V_i + \alpha \left( r_t + \gamma V_t(s_{t+1}) - V_t(s_t) \right) \frac{\partial V_t(s_t)}{\partial \theta^V_i} \quad (3.4)$$

These methods have been extensively studied. See for instance the book by Bertsekas and Tsitsiklis [4].
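A minimal sketch of the semi-gradient Q-learning update (3.1) with a linear function approximator, where the gradient $\partial Q_t / \partial \theta^Q_i$ is simply the $i$th feature; the block feature map and all constants are illustrative assumptions, and for a neural network the same gradient would come from backpropagation instead.

import numpy as np

state_dim, n_actions = 4, 3                  # illustrative sizes
alpha, gamma = 0.01, 0.95
n_features = (state_dim + 1) * n_actions
theta = np.zeros(n_features)                 # the parameter vector theta^Q

def features(s, a):
    # Assumed feature map: the state (plus a bias) copied into the block of action a.
    phi = np.zeros(n_features)
    phi[a * (state_dim + 1):(a + 1) * (state_dim + 1)] = np.append(s, 1.0)
    return phi

def q_value(s, a):
    return float(theta @ features(s, a))     # Q_t(s, a) = theta . phi(s, a)

def q_learning_fa_update(s, a, r, s_next):
    # Update (3.1): for a linear approximator, dQ/dtheta is just the feature vector.
    global theta
    best_next = max(q_value(s_next, b) for b in range(n_actions))
    td_error = r + gamma * best_next - q_value(s, a)
    theta = theta + alpha * td_error * features(s, a)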

3.2 Continuous Action Space

A harder problem is to extend RL to continuous action spaces. If we learn a good approximation of the Q-function using some form of Q-learning or Sarsa, we still have the problem that we cannot trivially find the action that gives the highest value, given a state. So, we would like the algorithm to output an approximation of the best action, given a certain state. For this, again a function approximator could be used. The question then becomes how to improve this approximation. If the optimal action for all states were known, it could be used as the target action. Of course, the problem is that this optimal action is not known. However, this does not mean that we cannot handle continuous action spaces. Below we will present some algorithms for handling continuous state and action spaces, but first we will describe some desired criteria for these algorithms. Please note that we do not consider time to be continuous as well, though all algorithms can be extended towards continuous time.

Real Continuous Solutions

A good algorithm for continuous spaces should in principle be able to find a policy arbitrarily close to the optimal policy for a given problem. This immediately excludes algorithms where the action space is discretised, since the probability is approximately 0 that the precise optimal action for all states is part of the finite set of discretised actions. These algorithms might perform well in many settings, but we want to put the bar high. Examples of algorithms that discretise the action space or the combined state-action space include Cerebellar Model Articulation Controllers (CMACs) [16] and variable resolution discretisation [12].

Good Generalisation

An algorithm that stores knowledge about continuous - and therefore infinite - spaces in a finite memory can of course hardly ever store all information about the entire space. (The only exception being when it is possible to map the Q-function perfectly with a function with few parameters, which in real-life settings will virtually never be the case.) Also, we would like an algorithm to reach good solutions relatively fast, and thus do not want the entire space to be searched. Both these restrictions require a good generalisation property of the algorithm. We would like to see good generalisation in the state space, meaning that a certain action in two similar states usually receives two similar predicted values. We would also like to see good generalisation in the action dimensions, meaning that a slight change in action in a given state should usually result in a slight change in predicted value. Of course, since there will be discontinuities in most actual value functions, the algorithm must also allow fast changes of value over a small range of actions or states, or even discontinuities.

Fast Action Selection

Given a state, it is useful to be able to quickly find the optimal action according to the current prediction of the value function. Even when you have found a value function that fits the true value function perfectly, it is still of little use when, each time you want to select the best action in a state, you have to conduct a full search in the continuous action

space to find it. Therefore a good algorithm allows the optimal action to be found quickly.

Model Free

This is not really a criterion specific to algorithms in continuous spaces, but we would like our algorithms to be model free. By this we mean that the agent does not have an internal model of the environment or of the reward function. In many real world cases, such models are hard to establish. Also, constructing a model brings extra complexity and uncertainty to the problem. There are a number of algorithms that even depend on a derivative of the model to determine the effect an action has on the environment. This not only assumes a model, but also assumes the model is differentiable. Other algorithms depend on the derivative of the reward function, which requires similar conditions. We do not want to limit ourselves to settings where this information is obtainable by the agent. Therefore, we only consider model free algorithms that do not depend on the differentiability of a model of the environment or of the reward function.

3.3 Exploration

There are various possible ways to explore unknown territory. Of course, an agent looking for a solution needs to try different policies to find a good one. For this, exploration is required. In our experiments we compare two different methods of exploration. The first method of exploration we used is $\epsilon$-greedy exploration. We select an exploratory random action with probability $\epsilon$ and select the greedy action, the current approximation of the optimal action, with probability $(1 - \epsilon)$. It is then possible to decrease exploration by simply decreasing the factor $\epsilon$. All experiments lasted a fixed number of time steps, during which the exploration rate $\epsilon$ was exponentially decreased from 1 - a random action every time step - to a probability of 0.01 of performing a random action on each time step. Setting the decay of exploration higher or lower did not result in better performance. The second method of exploration is Gaussian exploration around the current approximation of the optimal action. With this exploration, the action that is in fact performed is sampled from a Gaussian distribution with its mean at the action output of the algorithm we are using. When $A_t(s)$ denotes the action that the algorithm outputs at time $t$, the policy for that time step will be:

$$\forall a \in A : \pi_t(s_t, a) = \frac{1}{\sqrt{2\pi}\sigma} e^{-(a - A_t(s_t))^2 / (2\sigma^2)}$$

Note that $\pi_t(s, a)$ denotes the policy function, while $\pi$ in the density denotes the mathematical constant. We chose a standard deviation of $\sigma = 0.1$ for the distribution. In contrast with the $\epsilon$-greedy exploration described above, this exploration did not decay. Also, the action that was actually performed was sampled from the distribution on every time step, so essentially there was some exploration at every time step.
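To illustrate the two exploration schemes for a one-dimensional action, here is a minimal sketch; the action bounds of [-1, 1], the clipping of the Gaussian sample and the exact exponential schedule for $\epsilon$ are assumptions, while $\sigma = 0.1$ and the decay from 1 to 0.01 follow the description above.

import numpy as np

rng = np.random.default_rng(0)
a_min, a_max = -1.0, 1.0                     # assumed action bounds

def epsilon_greedy(greedy_action, epsilon):
    # With probability epsilon take a uniformly random action, otherwise the greedy one.
    if rng.random() < epsilon:
        return rng.uniform(a_min, a_max)
    return greedy_action

def gaussian_exploration(greedy_action, sigma=0.1):
    # Sample around the current approximation of the optimal action (no decay).
    return float(np.clip(rng.normal(greedy_action, sigma), a_min, a_max))

def epsilon_schedule(t, n_steps):
    # Assumed exponential decay from 1.0 down to 0.01 over the course of learning.
    return 0.01 ** (t / n_steps)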

3.4 The Algorithms

All the algorithms presented below comply with the criteria in the previous section. Also, all of them make use of some sort of function approximator to give the optimal action for a given state. Learning the mapping from state to action involves a value function, which is used to determine which action is better than another and should eventually lead to a (locally) optimal action. The algorithms differ in how they train the function approximators used and how they find the optimal action. The first two algorithms have been described before, the last two are new. As far as we know, no comparison between such Reinforcement Learning algorithms that can handle continuous action spaces has been made before. Function approximators can be used to map a continuous space to another continuous space. Here our goal is to use them to map the continuous state space to a continuous action space. The main problems we face are the following:

- We do not know the optimal actions, so we cannot use these as targets.
- We do not want to search the entire action space to find the current approximation of the best action, without losing the ability to find a true optimal policy.

When the action space is finite and small, selecting the action with the highest value can be done simply by looking at the Q-value of all possible actions in the given state and then selecting the best one. When the action space becomes large or continuous, this is no longer feasible. Below some ways are suggested to deal with this increased complexity.

3.4.1 Wire Fitting

Baird and Klopf propose an algorithm that efficiently stores an approximation of the complete Q-function [2]. They propose using a function approximator to output multiple actions and corresponding values, given a certain state. Each action and corresponding value are output independently and concurrently. A fixed number of these action-value pairs are output when a state is input to the function approximator. These outputs can be interpolated if the value of an intermediate action is required. To find the action with the highest expected value given a certain state this is not necessary though, since the interpolation function is such that the highest value will always lie at one of the output actions. This allows fast action selection. The setup of the algorithm is shown in figure 3.1.

Figure 3.1: Wire Fitting Setup. The state s is input into the function approximator. The outputs of the function approximator, which represent actions and corresponding values, are then interpolated to give the value of the action a in state s.

The interpolation function as proposed by Baird and Klopf is:

$$f(s, a) = \lim_{\varepsilon \to 0} \frac{\displaystyle\sum_{i=0}^{n} \frac{q_i(s)}{\|a - a_i(s)\|^2 + c_i (q_{\max}(s) - q_i(s)) + \varepsilon}}{\displaystyle\sum_{i=0}^{n} \frac{1}{\|a - a_i(s)\|^2 + c_i (q_{\max}(s) - q_i(s)) + \varepsilon}} = \lim_{\varepsilon \to 0} \frac{\displaystyle\sum_{i=0}^{n} \frac{q_i(s)}{\mathrm{distance}_i(s, a)}}{\displaystyle\sum_{i=0}^{n} \frac{1}{\mathrm{distance}_i(s, a)}} = \lim_{\varepsilon \to 0} \frac{\mathrm{wsum}(s, a)}{\mathrm{norm}(s, a)}$$

Here $(s, a)$ is the state-action pair of which the value is wanted. $a_i(s)$ and $q_i(s)$ are the outputs corresponding to the $i$th action and value for this state, respectively. Time indicating subscripts are left out for increased legibility. Each action can be a vector, because an action might have multiple real valued components. $q_{\max}(s) := \max_j q_j(s)$ is defined as the maximum of all $q_i$. $c_i$ is a small smoothing factor and $\varepsilon$ is there to prevent division by zero. Basically, the interpolation gives a weighted average of the different values, depending on the distance of the given action to the actions that are output for this particular state. These actions are denoted as functions $a_i(s)$ because they are dependent on the state. The necessity of the smoothing factor $c_i (\max_j q_j(s) - q_i(s))$ may seem unclear, especially since without it we could just define $f(s, a) = q_i(s)$ when $a = a_i(s)$ for some $i$, and then we would also not need $\varepsilon$. Baird and Klopf also define $f(s, a_i(s)) = q_i(s)$ when finding the highest action, but they do note that this is not the same value that would be found when using the interpolation function with the smoothing factor. A result of including $c$ is that the values of the actions that are output are stressed, since the interpolated values will always lie more towards the mean of all values. Gaskett has shown experimentally that the precise value of the smoothing factor is not important, as long as it is a small positive value [10]. Typical values for $c$ would be on the order of $10^{-2}$.

Learning

Consider that an agent using Wire Fitting has an experience, consisting of the state it was in ($s_t$), an action ($a_t$) it performed, a reward ($r_t$) it received and the next state ($s_{t+1}$) it reached. We will now explain how this experience is used to update the system. As mentioned, in this algorithm the output of the interpolation is to be interpreted as the value of a given state-action pair.

We can then use the update rule (3.1) to update the parameters of the system to make the output more closely resemble the target output of $r_t + \gamma \max_a Q(s_{t+1}, a)$. Regardless of whether the action performed was exploratory or not, gradient descent can be performed on the squared difference between the value the interpolator gives for that action and the target output. This can be done according to equation (3.1). This results in updating the function approximators that output the actions by the following update, where $\theta^{a_j}_i$ are the parameters of the function approximator that outputs action $a_j$. These can in fact be the same parameters as for other actions, but a single function approximator can also be used per action, so we consider the general case. Again, the usual time indicating subscripts are left out to avoid confusion. Instead, $s'$ is used to denote the state that is reached by performing action $a$ in state $s$.

$$\theta^{a_j}_i = \theta^{a_j}_i + \alpha \left( r + \gamma \max_b Q(s', b) - Q(s, a) \right) \frac{\partial Q(s, a)}{\partial \theta^{a_j}_i} = \theta^{a_j}_i + \alpha \left( r + \gamma \max_j q_j(s') - Q(s, a) \right) \frac{\partial Q(s, a)}{\partial a_j} \frac{\partial a_j}{\partial \theta^{a_j}_i}$$

Where:

$$\frac{\partial Q(s, a)}{\partial a_j} = \lim_{\varepsilon \to 0} \frac{2 \left( \mathrm{wsum}(s, a) - \mathrm{norm}(s, a)\, q_j \right) (a_j - a)}{\left( \mathrm{norm}(s, a)\, \mathrm{distance}_j(s, a) \right)^2}$$

A similar update can be done for the parameters of the function approximators that output the values. Then we use:

$$\frac{\partial Q(s, a)}{\partial q_j} = \lim_{\varepsilon \to 0} \frac{\mathrm{norm}(s, a) \left( \mathrm{distance}_j(s, a) + q_j c \right) - \mathrm{wsum}(s, a)\, c}{\left( \mathrm{norm}(s, a)\, \mathrm{distance}_j(s, a) \right)^2}$$

But what does it mean to update the parameters by gradient descent on the prediction error of the interpolator? Basically, all actions and values are changed slightly in such a way that the resulting interpolation more closely resembles the given action-value pair. Because the gradient is dependent on the distance between the action that was performed and the actions that are output, the action outputs that are closest, and their values, are updated the most.
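A minimal sketch of the interpolation and of greedy action selection for a one-dimensional action; the arrays of wire actions and values stand in for the outputs of the function approximator for one state, and the constants c and epsilon are small assumed values in line with the text.

import numpy as np

c = 1e-2          # smoothing factor c_i (one small value for all wires)
eps = 1e-9        # small epsilon preventing division by zero

def wire_fit_value(a, wire_actions, wire_values):
    # Interpolated Q(s, a) from the wires (a_i(s), q_i(s)) output for state s.
    wire_actions = np.asarray(wire_actions, dtype=float)
    wire_values = np.asarray(wire_values, dtype=float)
    distance = (a - wire_actions) ** 2 + c * (np.max(wire_values) - wire_values) + eps
    wsum = np.sum(wire_values / distance)
    norm = np.sum(1.0 / distance)
    return wsum / norm

def greedy_action(wire_actions, wire_values):
    # The highest value always lies on one of the wires, so no search is needed.
    return wire_actions[int(np.argmax(wire_values))]

# Example: three wires output for some state.
actions, values = [-0.5, 0.0, 0.7], [0.2, 1.0, 0.4]
print(wire_fit_value(0.3, actions, values), greedy_action(actions, values))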

For a simple example, consider figure 3.2.

Figure 3.2: Wire Fitting Example. The red crosses are the initial outputs of the function approximator for a given state. The red line is the output of the interpolation given those actions and values. The x-axis contains the (one-dimensional) actions, while the y-axis contains the values of the actions. The blue + is an experienced action-value pair in the given state. The purple stars are the updated outputs and the purple line the updated interpolation. As can be seen, not all outputs necessarily move towards the +; instead they move so that the interpolation is closer to it.

Finding the optimal action

In a fully trained system with a low error between experienced action-value pairs and the interpolation output, we can assume that the optimal action is the one corresponding to the highest value. Because of the nature of the interpolator, this action is always amongst the ones output by the function approximator. This means that to select the highest valued action we do not have to use the interpolation function. We just select the action with the highest corresponding value. This results in very fast action selection.

Further Notes

The approach is called Wire Fitting (WF), because essentially the interpolation function describes a surface in $S \times A \times \mathbb{R}$ space, which is draped, so to say, over wires defined by the outputs of the function approximator. Because the whole value function is approximated, the system can, with enough outputs, reach any real valued optimal policy and generalise well, while allowing fast action selection.

In effect, the value outputs and the whole interpolation are only used to generate the error with which the actions can be updated, in order to reach a situation in which one of the outputs of the function approximator approximates the optimal action for each state. Which output is the optimal one can change for different states. In theory, this even allows the algorithm to find optimal policies with a limited number of discontinuities. This approach was also implemented by Gaskett et al. [11]. The results look promising, though no comparisons with other algorithms for continuous action spaces are made.

3.4.2 Gradient Ascent on the Value

Prokhorov and Wunsch described Adaptive Critic Designs, of which we implemented a version of their Action Dependent Heuristic Dynamic Programming (ADHDP) algorithm [14]. This algorithm uses a single actor function approximator to output just the optimal action, given a state. We shall denote the output of this actor at time $t$ as $A_t(s)$. When this action is selected, the Q-value can then be determined using a critic function approximator that outputs the Q-value given a state and action. This setup is shown in figure 3.3.

Figure 3.3: Adaptive Critic Designs Setup. The state s is input into the actor function approximator. The output $a = A_t(s)$ of the actor represents the current approximation of the optimal action. This action is then input into the critic together with the state s to give the current approximation of Q(s, a). Of course, when the value of another (exploratory) action is needed, only this last function approximator is used.

Learning

Equation (3.1) handles the updates to the parameters of the critic function approximator. This is relatively straightforward and an update can be made after each experience of a state, action, reward and new state. As stated above by equation (3.2), we can view this as gradient descent on the squared error $E(t)$:

$$E(t) = \left( r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right)^2$$

Training the actor function approximator is only slightly more complex. We need to find a target towards which we can train the output of the actor, given a certain state. For this we can use gradient information to determine how the value function would change if the action were changed locally. This can then be used to find a local maximum. Then, the newly found action (with a higher value, given the current state) can be used as a target to update the parameters of the action selection function approximator. Equivalently, the gradient information of the value can be propagated back immediately to the parameters of the actor. Calling the parameters of the actor $\theta^A$, this results in the following update:

$$\theta^A_i \leftarrow \theta^A_i + \alpha \frac{\partial Q_t(s, a)}{\partial \theta^A_i} = \theta^A_i + \alpha \frac{\partial Q_t(s, a)}{\partial A_t(s)} \frac{\partial A_t(s)}{\partial \theta^A_i} \quad (3.5)$$

So in summary, an update consists of a gradient descent update on the error of the output value for the critic function approximator and a gradient ascent update on the Q-value for the actor function approximator. A potential problem lies in the backpropagation of the value through the critic. Clearly, when the critic is not yet fully trained, the gradient information on the value will not always be accurate. In the beginning of training the information might even be completely incorrect. This can lead to incorrect updates to the parameters of the actor function approximator. These updates might hinder later learning for the actor when the critic is trained more. Note that in essence there are no real boundaries for the outputs of the actor. This can prove to be a problem when the actor outputs values that lie outside the range of possible actions. Of course, the action that is in fact performed can be clipped to fall within the range of possible actions, but then a problem still exists. All inputs and outputs are scaled to fall within a $[-1, 1]$ interval. For the actor this means that a value of $-1$ in one of the dimensions of an action corresponds to the lowest possible action in that dimension. However, update (3.5) can result in outputs that lie outside the $[-1, 1]$ interval. Especially, this can happen when the critic is not yet

properly trained and the gradient information is thus still unreliable. This can lead to divergence of the actor when the outputs come to lie ever further from the $[-1, 1]$ interval. We introduce a way to avoid this potential divergence. To avoid divergence of the actor, the probability of an update to the actor's parameters slowly increases during learning. This means that when the critic is not yet trained, not many updates to the actor will be performed. In the beginning of learning there will therefore be only updates to the critic and not to the actor on most time steps. This prevents early divergence of the actor function approximator due to the incorrect early gradient information provided by the critic. In our experiments, the probability of an update being performed was $(1 - p)$, where $p$ decreases exponentially from 1 to 0.01 over the course of learning. We implemented both the original algorithm as proposed by Prokhorov and Wunsch and the version we just described, which should avoid early divergence of the algorithm. The performance of both these algorithms is compared in chapter 5. Prokhorov and Wunsch did not mention any ways to prevent divergence of the actor for this algorithm. They do note that they did not get this algorithm to work on one of their problems. However, they propose using similar algorithms that do solve that problem and should have better performance on other problems. For these algorithms, they propose not actually learning the value function, but instead learning the derivative of the value function with respect to the inputs of the critic. They state that this can result in better performance, since in effect we are only interested in the value function because we need its derivative to train the actor. Directly training towards this derivative would result in one less approximation in the process. A problem with trying to learn the derivative of the value function is that the derivative of the reward function is required. In model-free settings, this derivative is usually not available. Also, in some cases the reward function may not be differentiable. For these reasons, we only implement the algorithm with a value function and not a value-derivative function. Also, though the algorithms may perform better when using the derivative of the value, the actor may still diverge when the approximation of the derivative is not yet accurate.

Finding the optimal action

Finding the current approximation of the optimal action in a given state is quite easy when using this method. We only have to propagate the state information through the actor function approximator; the output will be the action we want. The optimal action is therefore found very quickly.
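A minimal sketch of this scheme with a linear actor and a critic that is linear in hand-crafted state-action features, so that the value gradient with respect to the action can be written out directly; with neural networks, as used in the thesis, this gradient would be obtained by backpropagation. The feature choice, the use of the actor's action in place of the max in the critic target, and all constants are assumptions.

import numpy as np

state_dim = 3
rng = np.random.default_rng(0)
alpha_actor, alpha_critic, gamma = 0.001, 0.01, 0.95

theta_A = np.zeros(state_dim + 1)            # actor parameters: A(s) = theta_A . [s, 1]
w = np.zeros(2 * state_dim + 3)              # critic parameters for the features below

def actor(s):
    return float(theta_A @ np.append(s, 1.0))

def critic_features(s, a):
    # Assumed features [s, a*s, a, a^2, 1], chosen so that dQ/da is easy to write down.
    return np.concatenate([s, a * s, [a, a * a, 1.0]])

def q_value(s, a):
    return float(w @ critic_features(s, a))

def dq_da(s, a):
    # Gradient of the critic's output with respect to the action input.
    return float(w[state_dim:2 * state_dim] @ s + w[-3] + 2.0 * w[-2] * a)

def adhdp_update(s, a, r, s_next, p):
    # Critic: gradient descent on the squared TD error, with the actor's next action
    # standing in for the max over actions. Actor: gradient ascent on the value (3.5),
    # performed only with probability (1 - p) to avoid early divergence.
    global theta_A, w
    td_error = r + gamma * q_value(s_next, actor(s_next)) - q_value(s, a)
    w = w + alpha_critic * td_error * critic_features(s, a)
    if rng.random() < 1.0 - p:
        theta_A = theta_A + alpha_actor * dq_da(s, actor(s)) * np.append(s, 1.0)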

Similar Algorithms

The algorithms we implemented were the ADHDP algorithm and our version of it with fewer updates to the actor at the beginning of training. Prokhorov and Wunsch also describe other algorithms, but all of those use a derivative of the world model to determine the effect an action has on the state values, or a derivative of the reward function, or even both. These other algorithms are therefore not interesting to us, since we only want to examine model-free algorithms. A similar algorithm was proposed by Kimura and Kobayashi [13], though they extended the approach to handle eligibility traces and a stochastic policy. For this stochastic policy they use a normal distribution on the action space, after which they can perform gradient ascent with respect to the value on the mean and standard deviation of this distribution. Of course, in practice an optimal stochastic policy will not always follow a normal distribution on the action space. Kimura and Kobayashi only compare their approach to a fully discretised (state and action space) actor critic system. Other variations on this algorithm that also take continuous time into account are given by Doya [9] and Coulom [6]. Basically, their algorithms also perform gradient ascent on the value, though these last two use a state value function $V(s)$ instead of a Q-function $Q(s, a)$. Unfortunately, in these cases, because there is no direct line from the value to the action through which the error can be propagated, it is assumed that the state update is linear with respect to the action performed. This means that we would have to assume that the transition function $T(s, a, s')$ is linear with respect to $a$. We do not wish to make such an assumption, which is why we do not implement this version of gradient ascent on the value.

3.4.3 Interpolating Actors

The Interpolating Actors (IA) algorithm is the first of two new algorithms that are proposed in this thesis. For this approach we first consider actions consisting of only one real valued element. Extending the approach to action vectors of multiple dimensions is possible by simply running the approach in parallel for all action dimensions separately. This algorithm lets an actor function approximator produce $n$ outputs, where $n \ge 2$. For these outputs, the following properties should hold:

$$\sum_{i=1}^{n} o_i = 1 \quad \text{and} \quad \forall i : o_i \ge 0$$

where $o_i$ is the $i$th output. Subscripts indicating time are left out for clarity.

The input is the current state $s$. These outputs can then be used as weights for predetermined values $v$ that we want to interpolate between. For example, if the action space ranges from 0 to 100, we might use 3 outputs with the corresponding action value vector $v = (0, 50, 100)$. An action of for instance 70 can then be reached by letting the function approximator output the vector $(0.2, 0.2, 0.6)$ or $(0.0, 0.6, 0.4)$. This can for instance be accomplished by allowing the actor function approximator to output arbitrary positive values, which we shall denote as $y_i$, and then normalising them as follows:

$$o_i = \frac{y_i}{\sum_{j=1}^{n} y_j}$$

Alternatively, if we cannot or do not want to restrict all $y_i$ to be positive, the following normalisation can be used:

$$o_i = \frac{f(y_i)}{\sum_{j=1}^{n} f(y_j)}$$

where $f : \mathbb{R} \to [0, \infty)$ is a continuous differentiable function. A common choice in similar systems is the strictly monotonically increasing $f(x) = e^x$. It is probably an advantage to have a relatively smooth function, such as $f(x) = e^x$ or $f(x) = x^2$, since the learning algorithm will use only the first derivative to try to predict how the output changes when the actions change. If the function is too complex, the first derivative becomes a less reliable approximation of the true local behaviour of the function. Each output will have a fixed corresponding action value. These values are typically distributed evenly across the action space, which needs to be bounded to make this possible. The output of the complete actor system, denoted $A(s)$, then is:

$$A(s) = \langle o, v \rangle = \sum_{i=1}^{n} o_i v_i$$

Here $o$ is the output vector of the function approximator (after normalisation) and $v$ is its corresponding action value vector.
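A minimal sketch of the action computation, with $f(x) = e^x$ as the normalisation function; the raw outputs $y_i$ here are a plain array standing in for the actor function approximator's outputs for one state, and the fixed action values follow the 0-to-100 example above.

import numpy as np

v = np.array([0.0, 50.0, 100.0])       # fixed action values, evenly spread over [0, 100]

def interpolated_action(y):
    # Normalise the raw outputs with f(x) = e^x and interpolate: A(s) = sum_i o_i * v_i.
    # Subtracting the maximum only rescales every f(y_i) by the same factor, which,
    # as shown below for g(y) = lambda * f(y), leaves the weights o_i unchanged.
    f = np.exp(y - np.max(y))
    o = f / np.sum(f)
    return float(o @ v)

print(interpolated_action(np.array([0.1, -0.3, 0.5])))   # some action between 0 and 100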

Learning

The goal now is to learn how to interpolate the fixed action values so that the resulting action is optimal. To change the outputs, gradient ascent on the value can be used. Denoting the parameters of the actor as $\theta^A$, the update to these parameters then becomes:

$$\theta^A_i = \theta^A_i + \alpha \frac{\partial Q(s, a)}{\partial \theta_i} = \theta^A_i + \alpha \sum_{j=1}^{n} \frac{\partial Q(s, a)}{\partial y_j} \frac{\partial y_j}{\partial \theta_i} \quad (3.6)$$

Where:

$$\frac{\partial Q(s, a)}{\partial y_i} = \frac{\partial Q(s, a)}{\partial A(s)} \frac{\partial A(s)}{\partial y_i} = \frac{\partial Q(s, a)}{\partial A(s)} \sum_{k=1}^{n} \frac{\partial A(s)}{\partial o_k} \frac{\partial o_k}{\partial y_i}$$
$$= \frac{\partial Q(s, a)}{\partial A(s)} \sum_{k=1}^{n} v_k \frac{\delta_{ik} f'(y_i) \sum_{j=1}^{n} f(y_j) - f(y_k) f'(y_i)}{\left( \sum_{j=1}^{n} f(y_j) \right)^2}$$
$$= \frac{\partial Q(s, a)}{\partial A(s)} \frac{f'(y_i)}{\sum_{j=1}^{n} f(y_j)} \sum_{k=1}^{n} \left( \delta_{ik} v_k - \frac{v_k f(y_k)}{\sum_{j=1}^{n} f(y_j)} \right)$$
$$= \frac{\partial Q(s, a)}{\partial A(s)} \frac{f'(y_i)}{\sum_{j=1}^{n} f(y_j)} \left( v_i - \sum_{k=1}^{n} v_k o_k \right)$$
$$= \frac{\partial Q(s, a)}{\partial A(s)} \frac{f'(y_i) \left( v_i - A(s) \right)}{\sum_{j=1}^{n} f(y_j)}$$

$\partial Q(s, a) / \partial A(s)$ can be determined by calculating the gradient of the value (which is, of course, the output of the critic) with respect to the action input. $\delta_{ik}$ is the Kronecker delta, defined by:

$$\delta_{ij} = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$$

The update above is inversely dependent on the sum of all $f(y_i)$. This might become a problem if we allow the $f(y_i)$ to take any value. In particular, if the sum is very small, the update might be far too big, resulting in oscillations or divergence. If the sum is too big, the update might become too small, resulting in no change in the parameters. This might be remedied by updating the $y_i$ towards values where $\sum_i f(y_i) \approx 1$. For this, consider the fact that updating the parameters by (3.6) results in the same update as updating the parameters as follows:

$$\theta^A_i \leftarrow \theta^A_i - \alpha \sum_{j=1}^{n} E_j(s, a) \frac{\partial f(y_j)}{\partial \theta_i}$$

where $E_j$ is an error term, defined as:

$$E_j(s, a) = f(y_j) - f(y_j)^{\text{target}}, \qquad f(y_j)^{\text{target}} = f(y_j) + \frac{\partial Q(s, a)}{\partial f(y_j)}$$

We can then scale the targets such that the sum of all targets is 1. This is allowed, since the interpolation is linear and relative scaling of the inputs to the interpolation does not change the output of the interpolation and thus does not change the resulting action. To show this, we define $g(y) = \lambda f(y)$, where $\lambda$ is some real valued number. Then we get:

$$\frac{g(y_i)}{\sum_{j=1}^{n} g(y_j)} = \frac{\lambda f(y_i)}{\sum_{j=1}^{n} \lambda f(y_j)} = \frac{\lambda f(y_i)}{\lambda \sum_{j=1}^{n} f(y_j)} = \frac{f(y_i)}{\sum_{j=1}^{n} f(y_j)} = o_i$$

We can choose $\lambda$ so that $\sum_i g(y_i) = 1$. Now, we can scale the targets such that $\sum_i f(y_i)^{\text{target}} = 1$, resulting in $\sum_i f(y_i) \approx 1$ at the next time step. For this, we redefine the targets as follows:

$$f(y_j)^{\text{target scaled}} = \frac{f(y_j)^{\text{target}}}{\sum_{k=1}^{n} f(y_k)^{\text{target}}}$$

This results in the following update:

$$\theta^A_i \leftarrow \theta^A_i + \alpha \sum_{j=1}^{n} \left( f(y_j)^{\text{target scaled}} - f(y_j) \right) f'(y_j) \frac{\partial y_j}{\partial \theta_i} \quad (3.7)$$

A critic such as described for the former algorithm can then be used to predict the value of such an action in the given state. Then, once again, backpropagation of the output of the critic, which is the predicted Q value, to the actor output vector can be used to update the parameters of the actor function approximator to perform gradient ascent on the value. Because the update is once again dependent on backpropagation of the value through the critic, the possibility again exists that the earlier updates are less accurate than later updates, when the critic is trained better. Therefore, we

implemented two versions of this algorithm: one with an update to the actor system on every time step, and one where the probability of an update to the actor is $(1 - p)$, with $p$ decreasing exponentially from 1 to 0.01, similarly to the approach described in section 3.4.2.

Finding the optimal action

Finding the optimal action is relatively straightforward. We only have to forward propagate the state description through the actor function approximator, after which the outputs can be interpolated to form the action that is to be performed.

Further Notes

When the action space has more than one dimension, the different dimensions can be split over different interpolators. This is because, in general, it is not possible given the constraint $\sum_{i=1}^{n} o_i = 1$ to interpolate a fixed set of action value vectors such that all values in the set of possible actions can be reached and none outside this set can be reached. If there are linear constraints between different dimensions in the action space, these can sometimes be handled by one interpolator, though in general this will not often be the case. These constraints might also be fulfilled by handling the different dimensions asynchronously, as can be seen in the example in figure 3.4. This algorithm also allows for nonlinear constraints. For a numeric example, see figure 3.4.

Figure 3.4: Example for Interpolating Actors. As a multidimensional example, consider a robot that can drive at a certain speed and turn at a certain rate. The speed bounds might be [1, 10] and the turn rate bounds [-180, 180]. Two interpolators with fixed values v_speed = (1, 10) and v_turn = (-180, 180) could be used. If the possible turning rate is linearly dependent on the speed, with turn rate bounds that shrink linearly as the speed increases, then a single interpolator might be used with fixed vectors v = ((-180, 1), (180, 1), (0, 1), (0, 10)). Alternatively, the constraint can be enforced simply by defining the turn bounds as a function of the speed, first selecting the speed and then determining the turning rate, with the speed given as an input. Of course, the constraint can also be expressed as the speed being dependent on the turning rate, with the order of action selection changed accordingly, if this is convenient.

We shall examine experimentally whether the number of fixed outputs has an impact on performance.

Consider one action dimension that ranges from 0 to 100. Note that with 3 or more outputs, the number of output configurations that result in the output of more extreme numbers, such as 10 or 90, is significantly lower than the number of output configurations that result in the output of 50. This is not the case when using only 2 outputs, as then the number of configurations is equal for all outputs. A possible drawback of this algorithm is that the action space should not only be bounded, but the bounds must also be known beforehand. This is in order to be able to choose the fixed actions that will be interpolated. An extension that we did not implement would be to also update the values between which is interpolated. This can be done in a similar way to adapting the parameters of the function approximators.

3.4.4 Critic Actor Improvement

The Critic Actor Improvement (CAI) algorithm is the second new algorithm that is proposed in this thesis. To outline the idea, we first consider the tabular case, with discrete states and actions and without function approximators. The algorithm then works as follows. One table stores the values of the states. These can be updated with the TD-learning update rule (2.4). Another table stores the actions that should be performed in each state. Of course, these actions are performed with some exploration. For instance, $\epsilon$-greedy exploration can be used, where a random action is chosen with a small probability $\epsilon$. Another possibility would be to associate a certain probability with each action for every state, so that every possible action has a non-zero probability of being chosen in every state. Now, whenever the value of a state is to be increased, we conclude that the action that was just performed was probably a good action for that state. This is backed up by the following reasoning. The values of the states will converge to the actual discounted future rewards, given the current policy. If performing a certain exploratory action now results in a positive change of the value of a state, then this action will in principle lead to a higher discounted future reward and thus a better policy. Therefore, we reinforce this action for this state. In pseudo-code we get:

if $r_t + \gamma V_t(s_{t+1}) > V_t(s_t)$
then increase($\pi(s_t, a_t)$)
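A minimal sketch of this tabular scheme, with $\epsilon$-greedy exploration; how increase($\pi(s_t, a_t)$) is realised is not fixed above, so the preference table used here (with the stored action taken to be the one with the highest preference) is one possible, assumed, concrete choice, as are all constants.

import numpy as np

n_states, n_actions = 10, 4
alpha, gamma, epsilon = 0.1, 0.95, 0.1       # assumed constants
rng = np.random.default_rng(0)

V = np.zeros(n_states)                       # critic table, updated with TD rule (2.4)
pref = np.zeros((n_states, n_actions))       # actor table: argmax gives the stored action

def select_action(s):
    # Epsilon-greedy exploration around the currently stored action.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(pref[s]))

def cai_update(s, a, r, s_next):
    # If the TD target exceeds the current value of the state, the action just taken
    # apparently improved matters, so it is reinforced: increase(pi(s_t, a_t)).
    target = r + gamma * V[s_next]
    if target > V[s]:
        pref[s, a] += 1.0
    V[s] = (1 - alpha) * V[s] + alpha * target   # TD-learning update (2.4)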


Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

Review: TD-Learning. TD (SARSA) Learning for Q-values. Bellman Equations for Q-values. P (s, a, s )[R(s, a, s )+ Q (s, (s ))]

Review: TD-Learning. TD (SARSA) Learning for Q-values. Bellman Equations for Q-values. P (s, a, s )[R(s, a, s )+ Q (s, (s ))] Review: TD-Learning function TD-Learning(mdp) returns a policy Class #: Reinforcement Learning, II 8s S, U(s) =0 set start-state s s 0 choose action a, using -greedy policy based on U(s) U(s) U(s)+ [r

More information

CSE 190: Reinforcement Learning: An Introduction. Chapter 8: Generalization and Function Approximation. Pop Quiz: What Function Are We Approximating?

CSE 190: Reinforcement Learning: An Introduction. Chapter 8: Generalization and Function Approximation. Pop Quiz: What Function Are We Approximating? CSE 190: Reinforcement Learning: An Introduction Chapter 8: Generalization and Function Approximation Objectives of this chapter: Look at how experience with a limited part of the state set be used to

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo Marc Toussaint University of

More information

Notes on Reinforcement Learning

Notes on Reinforcement Learning 1 Introduction Notes on Reinforcement Learning Paulo Eduardo Rauber 2014 Reinforcement learning is the study of agents that act in an environment with the goal of maximizing cumulative reward signals.

More information

Policy Gradient Reinforcement Learning for Robotics

Policy Gradient Reinforcement Learning for Robotics Policy Gradient Reinforcement Learning for Robotics Michael C. Koval mkoval@cs.rutgers.edu Michael L. Littman mlittman@cs.rutgers.edu May 9, 211 1 Introduction Learning in an environment with a continuous

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

Reinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN

Reinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN Reinforcement Learning for Continuous Action using Stochastic Gradient Ascent Hajime KIMURA, Shigenobu KOBAYASHI Tokyo Institute of Technology, 4259 Nagatsuda, Midori-ku Yokohama 226-852 JAPAN Abstract:

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning RL in continuous MDPs March April, 2015 Large/Continuous MDPs Large/Continuous state space Tabular representation cannot be used Large/Continuous action space Maximization over action

More information

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016 Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the

More information

CSE 190: Reinforcement Learning: An Introduction. Chapter 8: Generalization and Function Approximation. Pop Quiz: What Function Are We Approximating?

CSE 190: Reinforcement Learning: An Introduction. Chapter 8: Generalization and Function Approximation. Pop Quiz: What Function Are We Approximating? CSE 190: Reinforcement Learning: An Introduction Chapter 8: Generalization and Function Approximation Objectives of this chapter: Look at how experience with a limited part of the state set be used to

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

Reinforcement Learning. Spring 2018 Defining MDPs, Planning

Reinforcement Learning. Spring 2018 Defining MDPs, Planning Reinforcement Learning Spring 2018 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state

More information

Olivier Sigaud. September 21, 2012

Olivier Sigaud. September 21, 2012 Supervised and Reinforcement Learning Tools for Motor Learning Models Olivier Sigaud Université Pierre et Marie Curie - Paris 6 September 21, 2012 1 / 64 Introduction Who is speaking? 2 / 64 Introduction

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

CSC321 Lecture 22: Q-Learning

CSC321 Lecture 22: Q-Learning CSC321 Lecture 22: Q-Learning Roger Grosse Roger Grosse CSC321 Lecture 22: Q-Learning 1 / 21 Overview Second of 3 lectures on reinforcement learning Last time: policy gradient (e.g. REINFORCE) Optimize

More information

Q-learning. Tambet Matiisen

Q-learning. Tambet Matiisen Q-learning Tambet Matiisen (based on chapter 11.3 of online book Artificial Intelligence, foundations of computational agents by David Poole and Alan Mackworth) Stochastic gradient descent Experience

More information

Reinforcement Learning. George Konidaris

Reinforcement Learning. George Konidaris Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom

More information

15-780: ReinforcementLearning

15-780: ReinforcementLearning 15-780: ReinforcementLearning J. Zico Kolter March 2, 2016 1 Outline Challenge of RL Model-based methods Model-free methods Exploration and exploitation 2 Outline Challenge of RL Model-based methods Model-free

More information

MS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction

MS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction MS&E338 Reinforcement Learning Lecture 1 - April 2, 2018 Introduction Lecturer: Ben Van Roy Scribe: Gabriel Maher 1 Reinforcement Learning Introduction In reinforcement learning (RL) we consider an agent

More information

Temporal difference learning

Temporal difference learning Temporal difference learning AI & Agents for IET Lecturer: S Luz http://www.scss.tcd.ie/~luzs/t/cs7032/ February 4, 2014 Recall background & assumptions Environment is a finite MDP (i.e. A and S are finite).

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Daniel Hennes 19.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Eligibility traces n-step TD returns Forward and backward view Function

More information

CS 570: Machine Learning Seminar. Fall 2016

CS 570: Machine Learning Seminar. Fall 2016 CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or

More information

Reinforcement Learning Part 2

Reinforcement Learning Part 2 Reinforcement Learning Part 2 Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ From previous tutorial Reinforcement Learning Exploration No supervision Agent-Reward-Environment

More information

(Deep) Reinforcement Learning

(Deep) Reinforcement Learning Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 17 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015

More information

Reinforcement Learning In Continuous Time and Space

Reinforcement Learning In Continuous Time and Space Reinforcement Learning In Continuous Time and Space presentation of paper by Kenji Doya Leszek Rybicki lrybicki@mat.umk.pl 18.07.2008 Leszek Rybicki lrybicki@mat.umk.pl Reinforcement Learning In Continuous

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Stuart Russell, UC Berkeley Stuart Russell, UC Berkeley 1 Outline Sequential decision making Dynamic programming algorithms Reinforcement learning algorithms temporal difference

More information

Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016)

Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016) Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016) Yoonho Lee Department of Computer Science and Engineering Pohang University of Science and Technology October 11, 2016 Outline

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396 Machine Learning Reinforcement learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1396 1 / 32 Table of contents 1 Introduction

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Cyber Rodent Project Some slides from: David Silver, Radford Neal CSC411: Machine Learning and Data Mining, Winter 2017 Michael Guerzhoy 1 Reinforcement Learning Supervised learning:

More information

CO6: Introduction to Computational Neuroscience

CO6: Introduction to Computational Neuroscience CO6: Introduction to Computational Neuroscience Lecturer: J Lussange Ecole Normale Supérieure 29 rue d Ulm e-mail: johann.lussange@ens.fr Solutions to the 2nd exercise sheet If you have any questions regarding

More information

Reinforcement Learning. Machine Learning, Fall 2010

Reinforcement Learning. Machine Learning, Fall 2010 Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30

More information

State Space Abstractions for Reinforcement Learning

State Space Abstractions for Reinforcement Learning State Space Abstractions for Reinforcement Learning Rowan McAllister and Thang Bui MLG RCC 6 November 24 / 24 Outline Introduction Markov Decision Process Reinforcement Learning State Abstraction 2 Abstraction

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks

Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks Dual Memory Model for Using Pre-Existing Knowledge in Reinforcement Learning Tasks Kary Främling Helsinki University of Technology, PL 55, FI-25 TKK, Finland Kary.Framling@hut.fi Abstract. Reinforcement

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

Lecture 10 - Planning under Uncertainty (III)

Lecture 10 - Planning under Uncertainty (III) Lecture 10 - Planning under Uncertainty (III) Jesse Hoey School of Computer Science University of Waterloo March 27, 2018 Readings: Poole & Mackworth (2nd ed.)chapter 12.1,12.3-12.9 1/ 34 Reinforcement

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

The Book: Where we are and where we re going. CSE 190: Reinforcement Learning: An Introduction. Chapter 7: Eligibility Traces. Simple Monte Carlo

The Book: Where we are and where we re going. CSE 190: Reinforcement Learning: An Introduction. Chapter 7: Eligibility Traces. Simple Monte Carlo CSE 190: Reinforcement Learning: An Introduction Chapter 7: Eligibility races Acknowledgment: A good number of these slides are cribbed from Rich Sutton he Book: Where we are and where we re going Part

More information

Linear Least-squares Dyna-style Planning

Linear Least-squares Dyna-style Planning Linear Least-squares Dyna-style Planning Hengshuai Yao Department of Computing Science University of Alberta Edmonton, AB, Canada T6G2E8 hengshua@cs.ualberta.ca Abstract World model is very important for

More information

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS 188 Spring 2017 Introduction to Artificial Intelligence Midterm V2 You have approximately 80 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. Mark

More information

Lecture 3: The Reinforcement Learning Problem

Lecture 3: The Reinforcement Learning Problem Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

arxiv: v1 [cs.ai] 5 Nov 2017

arxiv: v1 [cs.ai] 5 Nov 2017 arxiv:1711.01569v1 [cs.ai] 5 Nov 2017 Markus Dumke Department of Statistics Ludwig-Maximilians-Universität München markus.dumke@campus.lmu.de Abstract Temporal-difference (TD) learning is an important

More information

1 Introduction 2. 4 Q-Learning The Q-value The Temporal Difference The whole Q-Learning process... 5

1 Introduction 2. 4 Q-Learning The Q-value The Temporal Difference The whole Q-Learning process... 5 Table of contents 1 Introduction 2 2 Markov Decision Processes 2 3 Future Cumulative Reward 3 4 Q-Learning 4 4.1 The Q-value.............................................. 4 4.2 The Temporal Difference.......................................

More information

Reinforcement Learning Active Learning

Reinforcement Learning Active Learning Reinforcement Learning Active Learning Alan Fern * Based in part on slides by Daniel Weld 1 Active Reinforcement Learning So far, we ve assumed agent has a policy We just learned how good it is Now, suppose

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Off-Policy Actor-Critic

Off-Policy Actor-Critic Off-Policy Actor-Critic Ludovic Trottier Laval University July 25 2012 Ludovic Trottier (DAMAS Laboratory) Off-Policy Actor-Critic July 25 2012 1 / 34 Table of Contents 1 Reinforcement Learning Theory

More information

Monte Carlo is important in practice. CSE 190: Reinforcement Learning: An Introduction. Chapter 6: Temporal Difference Learning.

Monte Carlo is important in practice. CSE 190: Reinforcement Learning: An Introduction. Chapter 6: Temporal Difference Learning. Monte Carlo is important in practice CSE 190: Reinforcement Learning: An Introduction Chapter 6: emporal Difference Learning When there are just a few possibilitieo value, out of a large state space, Monte

More information

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon. Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,

More information

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010 Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

Chapter 8: Generalization and Function Approximation

Chapter 8: Generalization and Function Approximation Chapter 8: Generalization and Function Approximation Objectives of this chapter: Look at how experience with a limited part of the state set be used to produce good behavior over a much larger part. Overview

More information

Reinforcement Learning: An Introduction

Reinforcement Learning: An Introduction Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is

More information

REINFORCEMENT LEARNING

REINFORCEMENT LEARNING REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Branes with Brains. Reinforcement learning in the landscape of intersecting brane worlds. String_Data 2017, Boston 11/30/2017

Branes with Brains. Reinforcement learning in the landscape of intersecting brane worlds. String_Data 2017, Boston 11/30/2017 Branes with Brains Reinforcement learning in the landscape of intersecting brane worlds FABIAN RUEHLE (UNIVERSITY OF OXFORD) String_Data 2017, Boston 11/30/2017 Based on [work in progress] with Brent Nelson

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Machine Learning I Continuous Reinforcement Learning

Machine Learning I Continuous Reinforcement Learning Machine Learning I Continuous Reinforcement Learning Thomas Rückstieß Technische Universität München January 7/8, 2010 RL Problem Statement (reminder) state s t+1 ENVIRONMENT reward r t+1 new step r t

More information