Reinforcement Learning with Function Approximation

Joseph Christian G. Noel

November 2011


Abstract

Reinforcement learning (RL) is a key problem in the field of artificial intelligence. The main goal is for an agent to learn to act in an unknown environment in a way that maximizes a reward it receives from the environment. As the number of states in an environment grows larger, it becomes more important for the agent to generalize what it has learned from some states to other, similar states. An agent is able to do this with a number of function approximation techniques. This report presents a general overview of reinforcement learning when combined with function approximation. There are two main sources for this report: Neuro-Dynamic Programming [Bertsekas and Tsitsiklis 1996] and Reinforcement Learning: An Introduction [Sutton and Barto 1998]. [Bertsekas and Tsitsiklis 1996] discusses RL from a mathematician's perspective and is very hard to read. [Sutton and Barto 1998] is intuitive, but does not discuss theoretical issues such as convergence properties. The goal of this report is to create an overview that encompasses these two sources, one that is both easy to read and contains sufficient theoretical background. We restrict ourselves to online gradient-based methods in this report. After discussing the mathematical theory, we implement some of the techniques on the cart pole and mountain car domains and report on the results.


Contents

Abstract

1 Introduction
    1.1 Artificial Intelligence
        1.1.1 Machine Learning
    1.2 Reinforcement Learning
        1.2.1 RL with Function Approximation

2 Background
    2.1 Model
    2.2 Rewards
    2.3 Markov Decision Processes
    2.4 Value Functions
    2.5 Bellman Equations for the Value Functions
    2.6 Exploration versus Exploitation

3 Reinforcement Learning Algorithms
    3.1 Dynamic Programming
        3.1.1 Policy Iteration
        3.1.2 Value Iteration
    3.2 Monte-Carlo Methods
    3.3 Temporal-Difference Learning
        3.3.1 TD(0)
        3.3.2 Eligibility Traces for TD(λ)
        3.3.3 Convergence of TD
        3.3.4 Sarsa(λ)
        3.3.5 Q-Learning
        3.3.6 Convergence of Q-Learning

4 Reinforcement Learning with Function Approximation
    4.1 Function Approximation (Regression)
    4.2 Gradient Descent Methods
        4.2.1 Stochastic Gradient Descent
        4.2.2 Convergence of SGD for Markov Processes
    4.3 TD Learning
        4.3.1 Control with Function Approximation
        4.3.2 Convergence of TD with Function Approximation

    4.4 Residual Gradients
        4.4.1 Convergence of Residual Gradients
        4.4.2 Control with Residual Gradients

5 Experimental Results
    5.1 Domains
        5.1.1 Mountain Car
        5.1.2 Cart Pole
    5.2 Results
        5.2.1 Optimal Parameter Values
        5.2.2 Results

6 Final Remarks

Bibliography

Chapter 1
Introduction

1.1 Artificial Intelligence

Artificial intelligence (AI) has been a common fixture in science fiction stories and in people's imaginations for centuries. However, the field of formal AI research only started in the summer of 1956 at a conference at Dartmouth College. Since then, the goal of the field of artificial intelligence has been to create intelligent agents that can mimic or go beyond human-level intelligence. Within the field of AI there are many subfields which each study a specific aspect of what we humans usually define as intelligence. Examples of these subfields are computer vision, logic, planning, robotics, and machine learning. This paper deals with a particular branch of machine learning called reinforcement learning.

1.1.1 Machine Learning

Machine learning is concerned with the design and development of algorithms that allow computers to improve their performance over time based on data. There are three main types of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the algorithm is given a set of labelled examples for training, and uses these to infer a function mapping that will enable it to generalize to unseen test data. In unsupervised learning, the algorithm is given only unlabelled data and has to infer some inherent structure from it. We discuss reinforcement learning in this report.

1.2 Reinforcement Learning

Reinforcement learning (RL) is a fundamental problem in artificial intelligence. In RL an agent learns to act in an unknown environment. The agent interacts with the environment by performing actions which change the state of the environment. After each action, the agent also receives a reward signal from the environment. The goal of the agent is to maximize the cumulative reward it receives from the environment. This interaction can be seen in Figure 1.1.

Figure 1.1: The agent-environment architecture in RL. Retrieved from [Sutton and Barto 1998].

As an example, imagine a reinforcement learning agent trying to learn how to play the game of blackjack. The environment signals it receives can be the cards it has in its hand as well as the one card shown by the dealer. The set of actions it can take can be {hit, hold}. The reward signals the agent receives can be a reward of -1 when it busts or is beaten by the dealer's hand, a reward of +1 when it beats the dealer's hand or when the dealer busts, and a reward of 0 the remaining times. Given enough episodes/trials, the RL agent will eventually learn the optimal action to take given the cards in its hand so as to maximize the rewards it gets. In effect, it will learn how to play blackjack optimally.

1.2.1 RL with Function Approximation

In the blackjack example above, the number of possible states is bounded by the number of permutations of the 52 cards in a normal card deck. However, as the number of states in an environment gets larger and larger, it becomes infeasible for an agent to visit all possible states enough times to find the optimal actions for those states. Thus, it becomes important to be able to generalize the learning experience in a particular state to the other states in the environment.

A common way to do this is through function approximation. The agent extracts the relevant information it needs from the state through feature extraction, and uses the resulting feature vector to calculate the approximate value of being in that state. This is done by taking the dot product between the feature vector and a parameter vector. In this way, similar state features will also have similar values. The goal of RL with function approximation is then to learn the best values for this parameter vector. Combining reinforcement learning with function approximation techniques allows the agent to generalize and hence handle a large (even infinite) number of states.
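To make this idea concrete, the sketch below computes an approximate state value as the dot product of a feature vector and a parameter vector. It is a minimal Python illustration only; the feature extractor and the particular features are hypothetical placeholders, not the ones used in the experiments later in this report.

```python
import numpy as np

def extract_features(state):
    """Hypothetical feature extractor: maps a raw state to a feature vector.
    Here the state is a pair (position, velocity) and the features are
    simply the raw values plus a bias term."""
    position, velocity = state
    return np.array([1.0, position, velocity])

def approximate_value(weights, state):
    """Approximate value of a state: dot product of features and weights."""
    return np.dot(weights, extract_features(state))

# Two similar states get similar values because their features are similar.
w = np.array([0.5, -1.0, 2.0])   # parameter vector learned by the RL algorithm
print(approximate_value(w, (0.20, 0.01)))
print(approximate_value(w, (0.21, 0.01)))
```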

Chapter 2
Background

2.1 Model

The reinforcement learning model consists of a set of states $S$, a set of actions $A$, and transition rules between states depending on the action. At a given state $s \in S$ at time $t$, an agent chooses an action $a \in A$, transitions to a new state $s'$, and receives a reward $r_t$. A series of actions will eventually lead the agent to the terminal state or goal state, which ends an episode. At this point the environment is reset, the agent starts again from an initial state, and the process repeats itself. Examples of episodes are a single game of chess or a single run through a maze.

An agent follows a policy $\pi$, which describes how the agent should act at a given time. Formally, $\pi(s, a)$ is a mapping that gives the probability of taking action $a$ when in state $s$. A policy $\pi$ is a proper policy if, when following this policy, there is a positive probability that the agent will eventually reach the goal state, i.e., it will not cycle through some states forever and never terminate an episode. There is always at least one policy that is better than or equal to all other policies, and we denote these optimal policies as $\pi^*$. Optimal policies will be explained in more detail in the section on value functions.

2.2 Rewards

The goal of reinforcement learning is to maximize the expected reward $R_t$,

$$R_t = r_t + r_{t+1} + r_{t+2} + \cdots + r_T$$

where $T$ is the final time step. This is the finite horizon model, wherein the agent tries to maximize the reward for the next $T$ steps without regard for the steps after that. There are drawbacks with this model, however. First, in most domains the agent does not know how many time steps it will take to reach the end state, and will need to properly handle an infinite number of time steps. Second, it is possible for an agent to be lazy in a finite horizon model. For example, if the horizon is ten steps, the agent could forever be trying to maximize the reward that happens ten steps from the current time step, without ever doing the action that actually receives that reward.

These drawbacks are fixed by adding a discount factor $\gamma$ to the sum,

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

where $0 \le \gamma < 1$. This is the infinite discounted horizon model. The agent considers all rewards into the future, but $\gamma$ acts as a discount factor so that the infinite sum is bounded to a finite value. The value of $\gamma$ defines how much the agent takes future rewards into account. When $\gamma = 0$, the agent takes the short-term view and tries to maximize the reward only for the next time step. As $\gamma \to 1$, the agent considers future rewards more strongly and takes the longer-term view.

2.3 Markov Decision Processes

An environment $\langle S, A, T, R, \gamma \rangle$ defines the set of possible states $S$, actions $A$, and transition probabilities $T$. In general, the probability at time $t$ of an agent moving from state $s_t$ to the next state $s'$ upon performing action $a_t$, and receiving reward $r$, depends on the history of all states and actions before $t$. That is, what happens next is given by a probability

$$\Pr(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, r_1, s_0, a_0).$$

However, there is a class of environments for which the relevant information from the history of all states, actions, and rewards before time $t$ is encapsulated in the state at time $t$. This is called the Markov property, and tasks which exhibit this property are called Markov decision processes. The Markov property states that

$$\Pr(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t) = \Pr(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, r_1, s_0, a_0).$$

2.4 Value Functions

As said previously, a reinforcement learning agent strives to maximize the expected reward from a given state at time $t$. It does this by estimating value functions for a given state or state-action pair. The state value function $V$ is an estimate of the expected reward the agent expects to receive on being at state $s$ and then following a policy $\pi$ thereafter,

$$V^\pi(s) = E_\pi(R_t \mid s_t = s).$$

The state-action value function $Q$ is an estimate of the expected reward the agent expects to receive on being at state $s$, taking action $a$, and then following policy $\pi$ thereafter,

$$Q^\pi(s, a) = E_\pi(R_t \mid s_t = s, a_t = a).$$

We can now properly define the optimal policy $\pi^*$. First, we define a policy $\pi$ to be better than or equal to a policy $\pi'$ if and only if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$. There is always at least one policy that is better than or equal to all other policies, and we denote these optimal policies as $\pi^*$. The state value function for these optimal policies is given as

$$V^*(s) = \max_\pi V^\pi(s).$$

The state-action value function for the optimal policies is given as

$$Q^*(s, a) = \max_\pi Q^\pi(s, a).$$

The optimal policy $\pi^*$ is therefore the policy that chooses the action $a$ that maximizes $Q^*(s, a)$ for all $s \in S$,

$$\pi^*(s) = \arg\max_a Q^*(s, a).$$

Reinforcement learning uses the value functions to approximate the optimal policy. Simply choosing at each state $s$ the action that maximizes $V^*(s)$ or $Q^*(s, a)$ leads to an optimal policy.

2.5 Bellman Equations for the Value Functions

Reinforcement learning uses the Bellman equations to reformulate the value functions. The Bellman equations express the relationship between the value of a state and the value of its successor state in the case of the state value function, and between the value of a state-action pair and the value of the succeeding state and action in the case of the state-action value function. This allows $V^\pi$ and $Q^\pi$ to be defined recursively. The Bellman equation for $V^\pi$ is

$$V^\pi(s) = \max_a E[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a].$$

The Bellman equation for $Q^\pi$ is

$$Q^\pi(s, a) = E[r_{t+1} + \gamma \max_{a'} Q^\pi(s_{t+1}, a') \mid s_t = s, a_t = a].$$

2.6 Exploration versus Exploitation

One of the fundamental problems in reinforcement learning is balancing exploration and exploitation. In exploitation, the agent exploits what it already knows and performs a greedy selection when choosing an action at a particular time step,

$$a_t = \arg\max_a Q(s_t, a).$$

In this way the agent maximizes the rewards it receives given what it already knows. However, what if there is another action that actually gives a better reward than the one returned by greedy selection? The agent's current value function estimates may simply not reflect this yet. If the agent always chooses the greedy action, it will never find out about the actually better action. This is where exploration comes in.

One method of exploration is $\epsilon$-greedy. With probability $\epsilon$, the agent chooses an action randomly from the set of available actions instead of doing a greedy selection. This allows the agent to discover new actions that are actually better than what it currently perceives to be best, and eventually find the optimal policy.

Another way of doing exploration is through optimistic initialization. The model parameters for the value functions are initialized such that the expected rewards of the states and actions appear higher than they actually are. The agent will then seek out all these states and actions through greedy selection until the estimated rewards for these states and actions drop to their actual values. We use optimistic initialization to enforce exploration in the RL domains we implement for this paper.

As the time step $t \to \infty$, continued exploration allows all states to eventually be visited infinitely often, and this is a key requirement for the convergence of RL algorithms to the optimal policy $\pi^*$.
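As an illustration, the following sketch shows one way to implement $\epsilon$-greedy action selection over a table of Q-values. It is a minimal example only; the state and action encoding is a hypothetical assumption, not the setup used in the experiments later in this report.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise pick the
    action with the highest estimated value Q[(state, action)]."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Example usage with a small hypothetical Q-table.
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.4}
print(epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.1))
```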

Figure 2.1: Average performance for different values of $\epsilon$ on the 10-armed bandit. Retrieved from [Sutton and Barto 1998].


Chapter 3
Reinforcement Learning Algorithms

A variety of reinforcement learning algorithms have already been developed for finite state space environments. In most of these cases the state and state-action value functions are stored in a tabular format corresponding to the state space. We discuss these algorithms before moving on to their function approximation extensions in the next chapter.

3.1 Dynamic Programming

Dynamic programming (DP) is a set of algorithms that can be used to compute optimal policies. A complete and perfect model of the environment as a Markov decision process is required. We show two popular DP methods, policy iteration and value iteration.

3.1.1 Policy Iteration

In policy iteration, DP first evaluates $V^\pi$ for an initial policy $\pi$, and then uses this to find a better policy $\pi'$. It then repeats the process until $\pi$ converges to the optimal policy. Let $P(s, a, s')$ be the transition probability of moving from state $s$ to $s'$ upon doing action $a$, and let $R(s, a, s')$ be the reward returned after moving from state $s$ to state $s'$ upon doing action $a$. As before, $\pi(s, a)$ is the probability of doing action $a$ at state $s$ under policy $\pi$. For a given policy $\pi$, $V^\pi(s)$ is calculated as

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P(s, a, s')[R(s, a, s') + \gamma V^\pi(s')].$$

$Q^\pi$ can then be easily calculated from $V^\pi$ using the following equation:

$$Q^\pi(s, a) = \sum_{s'} P(s, a, s')[R(s, a, s') + \gamma V^\pi(s')].$$

After computing the value functions, we can easily get the improved policy $\pi'$ by letting

$$\pi'(s) = \arg\max_a Q^\pi(s, a).$$

DP then repeats the process with $\pi'$. When $\pi'$ is as good as, but not better than, $\pi$, so that $V^{\pi'} = V^\pi$, then both $\pi$ and $\pi'$ are already optimal policies.

3.1.2 Value Iteration

In policy iteration each iteration does a policy evaluation and calculates $V^\pi$, and this step can take a long time because calculating $V^\pi$ is itself an iterative computation that loops through the entire state space repeatedly. The value iteration algorithm improves on this by combining policy improvement and a truncated policy evaluation into a single update step,

$$V_{k+1}(s) = \max_a \sum_{s'} P(s, a, s')[R(s, a, s') + \gamma V_k(s')].$$

For any initial $V_0$, the sequence $\{V_k\}$ will eventually converge to $V^*$ as long as $0 \le \gamma < 1$.

3.2 Monte-Carlo Methods

Monte-Carlo (MC) methods take the average of the returns of a state to estimate its value function. Unlike dynamic programming, Monte-Carlo does not require any knowledge of the environment. All it needs is some function that generates episodes by following the policy $\pi$. Each episode contains the set of states passed through by the policy, the actions done at each visited state, and the returns following the first occurrence of each state or state-action pair. Let $Return_i(s)$ and $Return_i(s, a)$ be the return $R_t$ following the first occurrence of the state or state-action pair in episode $i$, and let $N$ be the total number of episodes that has been generated from $\pi$. MC estimates $V^\pi$ and $Q^\pi$ as

$$V^\pi(s) = \frac{\sum_{i}^{N} Return_i(s)}{N}, \qquad Q^\pi(s, a) = \frac{\sum_{i}^{N} Return_i(s, a)}{N}.$$

As in DP, policy improvement again follows as

$$\pi'(s) = \arg\max_a Q^\pi(s, a),$$

after which the process starts again with a new set of episodes generated from $\pi'$. The algorithm terminates when $V^{\pi'}(s) \approx V^\pi(s)$ or $Q^{\pi'}(s, a) \approx Q^\pi(s, a)$.
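Before moving on to temporal-difference methods, the sketch below makes the dynamic programming update of Section 3.1.2 concrete for a small tabular MDP. The transition and reward arrays are hypothetical placeholders; in practice they would come from the environment model.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[a, s, s'] are transition probabilities, R[a, s, s'] are rewards.
    Repeatedly applies V(s) <- max_a sum_s' P(s,a,s') [R(s,a,s') + gamma V(s')]."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = expected one-step return of taking action a in state s
        Q = np.einsum("asn,asn->as", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)   # optimal values and a greedy policy
        V = V_new

# Tiny two-state, two-action example with hypothetical numbers.
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.2, 0.8], [0.3, 0.7]]])
R = np.array([[[0.0, 1.0], [0.0, 0.0]],
              [[0.0, 1.0], [0.0, 2.0]]])
V, policy = value_iteration(P, R)
print(V, policy)
```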

3.3 Temporal-Difference Learning

Temporal-difference (TD) learning is a combination of Monte-Carlo methods and dynamic programming. Like MC methods, TD learns directly from experience and does not need a model of the environment. Like DP, TD learning bootstraps by updating value estimates from earlier estimates. This allows TD algorithms to update the value functions before the end of an episode; they only need to wait for the next time step. This property makes TD an online learning method.

The simplest TD algorithms focus on policy evaluation, or the prediction problem. Algorithms like TD(0) do this by estimating the value function $V^\pi$ for a given policy $\pi$. More sophisticated algorithms like Sarsa and Q-Learning go further by solving the control problem, in which they find an optimal policy $\pi^*$ instead of just evaluating a given policy. The most common TD learning algorithms are TD(λ), Sarsa(λ), and Q-Learning.

3.3.1 TD(0)

TD(0) is the simplest TD algorithm for evaluating a policy. It works by treating the return $R_t$ as the sum of the reward immediately following $s_t$ and the expected return in the future. The state value function can then be updated at every time step by

$$V(s_t) = V(s_t) + \alpha[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$$

where $\alpha > 0$ is a learning rate parameter of the algorithm. TD(0) is thus easily implemented as an online, fully incremental algorithm and does not need to wait for the termination of the episode to begin updating the value estimates.

TD(0) has been proved to converge to $V^\pi$ for the states that are visited infinitely often. For convergence to be guaranteed, $\pi$ should be a proper policy and the learning rate $\alpha_t$ should satisfy the following constraints:

Learning rate constraints: $\alpha_t > 0$ for all $t$, $\sum_{t=0}^{\infty} \alpha_t = \infty$, and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$.
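A minimal sketch of the tabular TD(0) update is shown below. The environment interface (reset and step methods) and the fixed policy are hypothetical assumptions made only for illustration; they are not part of this report's experiments.

```python
from collections import defaultdict

def td0_evaluation(env, policy, episodes=1000, alpha=0.1, gamma=0.95):
    """Tabular TD(0) policy evaluation.
    Assumes env.reset() -> state, env.step(action) -> (next_state, reward, done),
    and policy(state) -> action. These are illustrative assumptions."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # V(s) <- V(s) + alpha * TD error
            s = s_next
    return V
```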

3.3.2 Eligibility Traces for TD(λ)

The temporal-difference algorithms discussed so far have all been one-step methods, in that they consider only the one next reward. This is in contrast with Monte-Carlo methods, where all the rewards until the end of the episode are considered. Eligibility traces are a method of bridging the gap between these two kinds of learning algorithms. As we have seen, the return in Monte-Carlo methods is

$$R^{MC}_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t-1} r_T.$$

The one-step return in TD(0) is

$$R^{TD(0)}_t = r_t + \gamma V(s_{t+1}),$$

where $\gamma V(s_{t+1})$ replaces the $\gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t-1} r_T$ terms of the Monte-Carlo return. Eligibility traces interpolate between these two returns. The TD(λ) algorithm, where $\lambda$ is the eligibility trace parameter and $0 \le \lambda \le 1$, defines the $\lambda$-return as

$$R^\lambda_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R^{(n)}_t,$$

where $R^{(n)}_t$ is the $n$-step return. When $\lambda = 0$, the $\lambda$-return reduces to the one-step return $R^{TD(0)}_t$. When $\lambda = 1$, the $\lambda$-return becomes equal to $R^{MC}_t$. Setting $\lambda$ to values greater than 0 and less than 1 allows a TD algorithm to vary in the space between the two extremes.

Eligibility traces are very important for temporal-difference methods. In fact, the convergence guarantees of the algorithms rely on the eligibility trace values having a few specific properties. Let $e_t(s)$ be the eligibility trace value for any state $s$ at any time $t$. Then for TD to converge the following must hold:

Eligibility trace constraints:

- $e_t(s) \ge 0$.
- $e_0(s) = 0$. Eligibility traces are initially 0.
- $e_t(s) \le e_{t-1}(s)$ if $s_t \ne s$. Eligibility traces remain at 0 until the first time that state $s$ is visited.
- $e_t(s) \le e_{t-1}(s) + 1$ if $s_t = s$. Eligibility traces may increase by at most 1 with every visit to state $s$.
- $e_t(s)$ is completely determined by $s_0, s_1, \ldots, s_t$.
- $e_t(s)$ is bounded above by a deterministic constant $C$.

3.3.3 Convergence of TD

Assume that the learning rate constraints in Section 3.3.1 and the eligibility trace constraints in Section 3.3.2 hold. Then if the policy $\pi$ is a proper policy, TD converges to $V^\pi$ with probability 1.

3.3.4 Sarsa(λ)

Sarsa(λ) is an on-policy control TD learning algorithm. On-policy methods estimate the value of a policy while simultaneously using it for control. This is in contrast to off-policy methods, which use two different policies: a behavior policy for generating

behaviors from the agent, and an estimation policy, which is the policy to be evaluated and improved. Sarsa uses only one policy and changes this policy along the way. Sarsa gets its name from the state-action-reward-state-action cycle of the algorithm. Instead of learning the state value function $V^\pi$ as in TD(0), Sarsa learns the state-action value function $Q^\pi$ for policy $\pi$. At each time step Sarsa updates the state-action value through

$$Q(s_t, a_t) = Q(s_t, a_t) + \alpha[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)].$$

3.3.5 Q-Learning

Q-learning is an off-policy control TD algorithm. Off-policy methods use two different policies: a behavior policy for generating behaviors from the agent, and an estimation policy which is the policy to be evaluated and improved. In off-policy algorithms the two policies need not even be related. As in Sarsa, Q-learning uses the state-action value function $Q(s, a)$. The difference from Sarsa is that Q-learning directly approximates the optimal state-action value function $Q^*$, irrespective of the actual policy being followed by the agent. The simple one-step Q-learning algorithm is defined by its update function

$$Q(s_t, a_t) = Q(s_t, a_t) + \alpha[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)].$$

3.3.6 Convergence of Q-Learning

Assume that the learning rate constraints in Section 3.3.1 hold. Then Q-learning converges to the optimal state-action value function $Q^*$ with probability 1.
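Below is a minimal sketch of one-step tabular Q-learning with $\epsilon$-greedy exploration. The environment interface is again a hypothetical assumption made for illustration; it is not the exact code used for the experiments in Chapter 5.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """One-step tabular Q-learning. Assumes env.reset() -> state and
    env.step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # off-policy target uses the greedy (max) action in the next state
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```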


Chapter 4
Reinforcement Learning with Function Approximation

The reinforcement learning algorithms discussed in the previous chapter assume that the value functions can be represented as a table with one entry for each state or state-action pair. However, this is only practical for very few tasks with a limited number of states and actions. In environments with large numbers of states and actions, using a table to store the value functions becomes impractical and may even make computing them intractable. Moreover, many environments have continuous state and action spaces, making the size of the table infinite. Another problem with tabular methods is that they do not generalize. Given two states $s$ and $s'$, the value of $V(s)$ does not say anything about the value of $V(s')$. Ideally, value functions should be able to generalize so that having a good estimate of $V(s)$ will help get a good estimate of $V(s')$. Combining the traditional reinforcement learning algorithms with function approximation techniques solves both these problems.

4.1 Function Approximation (Regression)

Function approximation takes example data generated by a function and generalizes from it to construct a function that approximates the original function. Because it needs sample data to learn the function, it is a type of supervised learning, which was discussed earlier. A general form of function approximation used in reinforcement learning is

$$f_w(x) = \langle w, \phi(x) \rangle$$

where $w$ and $\phi(x)$ are $n$-element vectors, $w, \phi(x) \in \mathbb{R}^n$. Here $w$ is a vector of weight values and $\phi(x)$ is a feature mapping, a column vector of features of the input values,

$$\phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_n(x))^T.$$

We define a matrix $\Phi$ for $m$ states,

$$\Phi = \begin{pmatrix} \phi_1(s_1) & \phi_1(s_2) & \cdots & \phi_1(s_m) \\ \phi_2(s_1) & \phi_2(s_2) & \cdots & \phi_2(s_m) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_n(s_1) & \phi_n(s_2) & \cdots & \phi_n(s_m) \end{pmatrix}.$$

In the case of reinforcement learning, these input values are the states or state-action pairs. Translating this into the state and state-action value functions is simply

$$V_w(s) = \langle w, \phi(s) \rangle, \qquad Q_w(s, a) = \langle w, \phi(s, a) \rangle.$$

Finding the optimal policy means finding the values of $w$ that best approximate the optimal value functions $V^*$ and $Q^*$, or just $V^\pi$ and $Q^\pi$ under policy evaluation.

4.2 Gradient Descent Methods

Gradient-based methods are among the most widely used function optimization techniques. To find a local minimum of a differentiable function, gradient descent takes steps towards the negative of the gradient of the function at the current point. The gradient of a function points in the direction of its greatest rate of increase, hence the negative of the gradient points in the direction of its greatest rate of decrease. Gradients can easily be calculated from the first-order derivatives of a function, making gradient descent a first-order optimization algorithm.

One class of functions for which gradient descent works particularly well is the class of convex functions. A function $f : X \to \mathbb{R}$ is convex if for all $x_1, x_2 \in X$ and $\lambda \in [0, 1]$,

$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2),$$

meaning that over any interval of the domain, the function values lie on or below the line segment joining the function values at the endpoints of the interval. This means that in the case of convex functions, there is only one local minimum and it is also the global minimum.

In supervised learning, a common function to minimize is the squared error. If $f(x)$ is the unknown function that we are trying to learn and $g(w, \phi(x)) = \langle w, \phi(x) \rangle$ is our estimator of $f$, the total squared error $Err$ over all inputs $x$ is

$$Err = \frac{1}{2} E_x[f(x) - g(w, \phi(x))]^2,$$

which is the objective we want to minimize. However, it is impossible to calculate the expectation because we do not know the values that $f$ will return for all possible inputs. Usually, we only have a sample of $n$ input-output pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

We can therefore only reduce the error over these empirical observations,

$$Err = \frac{1}{2} \sum_{i=1}^{n} [y_i - g(w, \phi(x_i))]^2.$$

To optimize $g$ to be a more accurate estimate of $f$, we take the gradient of $Err$ with respect to $w$ and use this to update $w$,

$$w = w - \alpha \nabla_w Err$$

where $\alpha$ is the step size. What this update does is move $w$ a small gradient step in the direction that minimizes $Err$. $w$ is updated until it converges to a local optimum. To calculate $\nabla_w Err$,

$$\nabla_w Err = -\sum_{i=1}^{n} [y_i - g(w, \phi(x_i))] \nabla_w g(w, \phi(x_i)) = -\sum_{i=1}^{n} [y_i - g(w, \phi(x_i))] \phi(x_i).$$

This method of doing gradient descent over all samples $(x, y)$ is called batch gradient descent.

4.2.1 Stochastic Gradient Descent

However, there will be situations where it is not possible to compute the gradient over all samples. This may be because the samples come one at a time, or because there are too many of them (even infinitely many) for it to be tractable to calculate their entire sum for the gradient. To be able to do function approximation in these situations, the method we use is stochastic gradient descent (SGD). As we shall see later, this is the method we will use to incorporate function approximation into our TD learning algorithms. Let $err(x) = y - g(w, \phi(x))$ be the error of a single sample input and output $(x, y)$; $Err$ is therefore the sum of the squares of $err$,

$$Err = \frac{1}{2} \sum_x err(x)^2.$$

Instead of taking the gradient of $Err$ and using that to update $w$, we take the gradient of the squared error at only a single sample $(x, y)$, $\nabla_w err(x)$, and use that as an estimate of $\nabla_w Err$,

$$\nabla_w err(x) = -[y - g(w, \phi(x))] \nabla_w g(w, \phi(x)) = -[y - g(w, \phi(x))] \phi(x).$$

We then use $\nabla_w err(x)$ to update $w$,

$$w = w - \alpha \nabla_w err(x). \qquad (4.1)$$

4.2.2 Convergence of SGD for Markov Processes

Stochastic gradient descent for RL is a special case because RL data is generated by a Markov process. Hence, the convergence guarantee we show is specific to Markov chains only. Let $\{X_i\}_{i=1,\ldots}$ be a time-homogeneous Markov process, $A(\cdot)$ be a mapping which maps every $X \in \chi$ to a $d \times d$ matrix, and $b(\cdot)$ map each $X$ to a vector $b(X)$. Under the following assumptions:

1. The learning rates $\alpha_t$ are deterministic, non-negative, and satisfy the learning rate constraints in Section 3.3.1.
2. The Markov process $\{X_i\}$ has a steady-state distribution $\pi$ such that $\lim_{t \to \infty} P(x_t \mid X_0) = \pi(x_t)$. $E_0[\cdot]$ is the expectation with respect to this invariant distribution.
3. The matrix $A = E_0[A(X_t)]$ is negative definite.
4. There exists a constant $K$ such that $\|A(x)\| \le K$ and $\|b(x)\| \le K$ for all $X \in \chi$.
5. For any initial state $X_0$, the expectation of $A(X_t)$ and $b(X_t)$ converges exponentially fast to the steady-state expectations $A$ and $b$.

the stochastic algorithm

$$w_{t+1} = w_t + \alpha_t (A(X_t) w_t + b(X_t)) \qquad (4.2)$$

converges with probability 1 to the unique solution $w^*$ of the system $Aw^* + b = 0$. This means that, given the above assumptions, SGD will eventually converge to a local optimum for Markov processes. Note that Equation 4.1 takes almost the same form as Equation 4.2: one can choose $A$ and $b$ such that it matches the SGD update in Equation 4.1, and hence make it fall under the same convergence guarantee.
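As an illustration of the single-sample update in Equation 4.1, the sketch below fits a linear model by stochastic gradient descent on a stream of $(x, y)$ pairs. The data-generating function and the feature map are hypothetical; the point is only the form of the update.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Hypothetical feature map: a bias term plus the raw input."""
    return np.array([1.0, x])

w = np.zeros(2)        # parameter vector
alpha = 0.05           # step size

for _ in range(5000):
    x = rng.uniform(-1.0, 1.0)
    y = 3.0 * x - 0.5 + rng.normal(scale=0.1)   # unknown target function plus noise
    err = y - np.dot(w, phi(x))                 # single-sample error
    w = w + alpha * err * phi(x)                # w <- w - alpha * grad of (1/2) err^2
print(w)   # should approach roughly [-0.5, 3.0]
```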

4.3 TD Learning

We wish to optimize $V_t$, our estimate of $V^\pi$ at time $t$. Recall that $V^\pi$ is an estimate of the expected rewards the agent expects to receive on being at state $s$ and following policy $\pi$ thereafter, which we designate $R_t$. In RL, rewards come one sample at a time, and we need to be able to update our estimates of the value functions given this one sample. Hence, we do not have sample values of $R_t$ nor of $V^\pi$, and gradient descent on objectives with those values is not possible. Instead, the error we minimize is the Bellman error. The Bellman error at a single time step $t$ is defined as

$$e(s_t) = \frac{1}{2}[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)]^2.$$

Recall that $V(s)$ is approximated as a linear function $V_w(s) = \langle w, \phi(s) \rangle$. However, when we take the gradient of $e(s_t)$, we treat the $r_{t+1} + \gamma V_t(s_{t+1})$ term as just a sampled constant value of $R_t$ and not as a function of $w$. Therefore the gradient of $e(s_t)$ is just

$$\nabla_w e(s_t) = -[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)] \nabla_w V_t(s_t) = -[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)] \phi(s_t).$$

Using this gradient to update $w$ results in the following update operation,

$$w_{t+1} = w_t - \alpha \nabla_{w_t} e(s_t) = w_t + \alpha [r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)] \phi(s_t),$$

where $\alpha$ is again a step size value.

4.3.1 Control with Function Approximation

For state-action value functions, the Bellman error given state $s$ and action $a$ at time $t$ is

$$e(s_t, a_t) = \frac{1}{2}[r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)]^2.$$

Again, we treat the $r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1})$ term as just a sampled constant value of $R_t$ and not as a function of $w$. Therefore the gradient of $e(s_t, a_t)$ is just

$$\nabla_w e(s_t, a_t) = -[r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)] \phi(s_t, a_t),$$

and we update $w$ using this gradient to get

$$w_{t+1} = w_t + \alpha [r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)] \phi(s_t, a_t).$$

4.3.2 Convergence of TD with Function Approximation

We now provide convergence guarantees for policy evaluation under TD(0). Assume that the same step size constraints as in Section 3.3.1 hold. Additional constraints for convergence are:

1. The state space is an aperiodic Markov chain when we follow $\pi$, and all states are visited an infinite number of times during an infinitely long episode.
2. The policy $\pi$ is a proper policy.
3. The feature mapping $\phi(s)$ is a linearly independent function on the state space (the matrix $\Phi$ has full rank).

If all the constraints hold, then TD(0) converges to $V^\pi$. This is based on the convergence of SGD for Markov processes discussed earlier in Section 4.2.2.

4.4 Residual Gradients

Recall that when we took the gradient of $e(s_t)$ while applying gradient descent to TD learning, we treated the $r_{t+1} + \gamma V_t(s_{t+1})$ term as just a sampled constant value of $R_t$ and not as a function of $w$. Since $V_t(s_{t+1}) = \langle w, \phi(s_{t+1}) \rangle$ is in fact a function of $w$, this actually makes TD learning not a proper gradient descent method. Hence, TD(0) can diverge when the above constraints aren't met. There is another form of RL algorithm with function approximation that is exactly gradient descent, called residual gradients (RG). With residual gradients, we now treat $V_t(s_{t+1})$ as the function of $w$ that it is. Since RG is a proper gradient descent method, its convergence is much more robust than that of TD(0). The gradient of $e(s_t)$ with respect to $w$ now becomes

$$\nabla_w e(s_t) = [r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)][\gamma \nabla_w V_t(s_{t+1}) - \nabla_w V_t(s_t)] = [r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)][\gamma \phi(s_{t+1}) - \phi(s_t)].$$

The TD update for the weights is now

$$w_{t+1} = w_t - \alpha [r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)][\gamma \phi(s_{t+1}) - \phi(s_t)].$$

4.4.1 Convergence of Residual Gradients

Because residual gradients is directly a stochastic gradient method, convergence for policy evaluation with a fixed policy $\pi$ is guaranteed based on Section 4.2.2. No other constraints are necessary.
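The sketch below contrasts the two weight updates for linear policy evaluation: the TD(0) update from Section 4.3, which treats the bootstrap target as a constant, and the residual-gradient update above, which also differentiates through the successor value. The feature vectors and the transition sample are hypothetical; only the form of the updates mirrors the equations above.

```python
import numpy as np

def td0_update(w, phi_s, phi_s_next, r, alpha, gamma):
    """TD(0) update: the target r + gamma * V(s') is treated as a constant."""
    delta = r + gamma * np.dot(w, phi_s_next) - np.dot(w, phi_s)
    return w + alpha * delta * phi_s

def residual_gradient_update(w, phi_s, phi_s_next, r, alpha, gamma):
    """Residual-gradient update: V(s') is also differentiated with respect to w."""
    delta = r + gamma * np.dot(w, phi_s_next) - np.dot(w, phi_s)
    return w - alpha * delta * (gamma * phi_s_next - phi_s)

# One hypothetical transition (s -> s') with reward r.
w = np.zeros(3)
phi_s, phi_s_next = np.array([1.0, 0.2, 0.0]), np.array([1.0, 0.3, 0.1])
print(td0_update(w, phi_s, phi_s_next, r=1.0, alpha=0.1, gamma=0.9))
print(residual_gradient_update(w, phi_s, phi_s_next, r=1.0, alpha=0.1, gamma=0.9))
```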

4.4.2 Control with Residual Gradients

For the control problem, the gradient of $e(s_t, a_t)$ is now

$$\nabla_w e(s_t, a_t) = [r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)][\gamma \nabla_w Q_t(s_{t+1}, a_{t+1}) - \nabla_w Q_t(s_t, a_t)]$$
$$= [r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)][\gamma \phi(s_{t+1}, a_{t+1}) - \phi(s_t, a_t)],$$

and the update for $w$ becomes

$$w_{t+1} = w_t - \alpha [r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)][\gamma \phi(s_{t+1}, a_{t+1}) - \phi(s_t, a_t)].$$
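For the control case, the corresponding sketch below selects greedy actions under the linear approximation $Q_w(s, a) = \langle w, \phi(s, a) \rangle$ and applies the Sarsa-style update from Section 4.3.1. The one-hot state-action features are a hypothetical stand-in for the tile or RBF coding used in the experiments of the next chapter.

```python
import numpy as np

def greedy_action(w, phi, state, actions):
    """Greedy action under the linear approximation Q_w(s, a) = w . phi(s, a)."""
    return max(actions, key=lambda a: np.dot(w, phi(state, a)))

def sarsa_fa_update(w, phi, s, a, r, s_next, a_next, alpha, gamma, done):
    """Semi-gradient Sarsa(0) update on linear state-action features."""
    q_next = 0.0 if done else np.dot(w, phi(s_next, a_next))
    delta = r + gamma * q_next - np.dot(w, phi(s, a))
    return w + alpha * delta * phi(s, a)

# Hypothetical one-hot state-action features for a 2-state, 2-action problem.
def phi(state, action):
    f = np.zeros(4)
    f[2 * state + action] = 1.0
    return f

w = np.zeros(4)
w = sarsa_fa_update(w, phi, s=0, a=1, r=-1.0, s_next=1, a_next=0,
                    alpha=0.1, gamma=0.95, done=False)
print(greedy_action(w, phi, 0, actions=[0, 1]))
```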


Chapter 5
Experimental Results

We now implement some of the techniques discussed in this paper to show the results of RL with function approximation. The two techniques we use are Sarsa and residual gradients. We implement them on two domains that require function approximation, Cart Pole and Mountain Car. Additionally, we use two different feature mappings for the state features, radial basis function (RBF) coding and tile coding [Sutton and Barto 1998].

5.1 Domains

5.1.1 Mountain Car

In the mountain car domain, the agent tries to drive a car up a hill towards the goal position. It has a two-dimensional state space: the position of the car on the hill and the velocity of the car. Dimension $s_1$ is the position of the car and is a continuous value bounded between $[-1.2, 0.5]$. Dimension $s_2$ is the velocity, also a continuous value, bounded between $[-0.07, 0.07]$. The agent can choose among three actions, $a \in \{-1, 0, 1\}$, which correspond to accelerating left, not accelerating, and accelerating right. The goal of the agent is to get the car into the rightmost position, that is, to the state with $s_1 = 0.5$. At each time step the agent receives a reward of -1 until it reaches the goal state, at which point it receives a reward of 0.

Figure 5.1: The Mountain Car domain.

5.1.2 Cart Pole

In the cart pole domain, the agent tries to balance a pole hinged on top of a cart by moving the cart along a frictionless track. It has a four-dimensional state space: the position of the cart, the cart velocity, the angle of the pole, and the pole's angular velocity. Dimension $s_1$ is the position of the cart, bounded by $[-2.4, 2.4]$. Dimension $s_2$ is the cart velocity and is unbounded, $(-\infty, \infty)$. Dimension $s_3$ is the pole angle and is bounded by $[-12, 12]$ degrees; any angular position outside of these bounds results in failure. Dimension $s_4$ is the angular velocity of the pole and is also unbounded, $(-\infty, \infty)$. The agent receives zero reward at each time step until the angle of the pole exceeds the bounds, at which point the agent receives a reward of -1 and the episode ends.

Figure 5.2: The Cart Pole domain.

5.2 Results

5.2.1 Optimal Parameter Values

Optimal parameter values for each domain were found by repeatedly testing different values for each parameter and recording the best results. For tile coding we used 10 tilings; for RBF coding we used 10 radial basis functions with a variance of 0.05 each. The resulting values are shown in Tables 5.1 and 5.2.
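The sketch below shows one common way to build radial basis function features over a continuous state, using the 0.05 variance mentioned above. The placement of the centers is a hypothetical choice for illustration and does not reproduce the exact coding behind the results in Tables 5.1 and 5.2.

```python
import numpy as np

def rbf_features(state, centers, sigma_sq=0.05):
    """Gaussian radial basis features: one feature per center, each responding
    most strongly when the state is near that center."""
    diffs = state - centers                      # distance to each center
    return np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * sigma_sq))

# Ten centers spread over the mountain car state space (position, velocity).
positions = np.linspace(-1.2, 0.5, 10)
velocities = np.linspace(-0.07, 0.07, 10)
centers = np.stack([positions, velocities], axis=1)   # hypothetical placement
print(rbf_features(np.array([-0.5, 0.0]), centers))
```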

Table 5.1: Optimal parameter values ($\alpha$, $\gamma$, $\lambda$) for the Cart Pole domain, for Sarsa and residual gradients with tile coding and with RBF coding.

Table 5.2: Optimal parameter values ($\alpha$, $\gamma$, $\lambda$) for the Mountain Car domain, for Sarsa and residual gradients with tile coding and with RBF coding.

5.2.2 Results

As seen in Figures 5.3 and 5.4, Sarsa is able to converge faster than residual gradients. This matches the discussion in Section 4.4: while TD learning is not directly a stochastic gradient descent method and hence diverges more often, when it does converge its rate of convergence is faster than that of residual gradients, which is a true SGD method. In all cases except one, tile coding converges faster than RBF coding. Tuning the parameters has a big effect on the learning performance of the agent; for some sub-optimal parameter values, the agent never learns the optimal policy for the domain.

Figure 5.3: Average results for the Cart Pole domain. Higher is better.

Figure 5.4: Average results for the Mountain Car domain. Lower is better.

Chapter 6
Final Remarks

As we have discussed in this report, combining reinforcement learning with function approximation techniques allows an agent to learn to operate in environments with an infinitely large number of states. It does this by letting the agent generalize what it has learned in some states to other, similar states. We first discussed traditional reinforcement learning methods that store the values of the states in a tabular format, and then proceeded to discuss the function approximation extensions to these methods. We used linear function approximation, where we optimized the parameters with respect to a mean squared error using stochastic gradient descent.

The function approximation used in Sarsa isn't a direct SGD method, and hence there are times when Sarsa will diverge. Residual gradients is a direct SGD method and hence its convergence is more robust. However, when Sarsa does converge, its rate of convergence is usually faster than that of residual gradients. The faster convergence of Sarsa is mainly an experimental result, not a theoretical one; there may be situations in which residual gradients converges faster. Finally, we implemented the RL with function approximation techniques discussed on two domains and showed the results.

Although the RL theory for environments with state values stored in tabular formats is quite mature, the theory for RL with function approximation as discussed here is still very much being actively developed, with new techniques and methods still being discovered. More work in this particular aspect of reinforcement learning is needed and will provide better results in the future.


Bibliography

Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning.

Bertsekas, D. P. and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Athena Scientific.

Rummery, G. A. and Niranjan, M. 1994. On-line Q-learning using connectionist systems. Technical report.

Sutton, R. and Barto, A. 1998. Reinforcement Learning: An Introduction. The MIT Press.

Watkins, C. 1989. Learning from delayed rewards. PhD thesis.

Watkins, C. and Dayan, P. 1992. Q-learning. Machine Learning.


More information

Sequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague

Sequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague https://cw.fel.cvut.cz/wiki/courses/b4b36zui/prednasky pagenda Previous lecture:

More information

Temporal Difference Learning & Policy Iteration

Temporal Difference Learning & Policy Iteration Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

CSE 190: Reinforcement Learning: An Introduction. Chapter 8: Generalization and Function Approximation. Pop Quiz: What Function Are We Approximating?

CSE 190: Reinforcement Learning: An Introduction. Chapter 8: Generalization and Function Approximation. Pop Quiz: What Function Are We Approximating? CSE 190: Reinforcement Learning: An Introduction Chapter 8: Generalization and Function Approximation Objectives of this chapter: Look at how experience with a limited part of the state set be used to

More information

Linear Least-squares Dyna-style Planning

Linear Least-squares Dyna-style Planning Linear Least-squares Dyna-style Planning Hengshuai Yao Department of Computing Science University of Alberta Edmonton, AB, Canada T6G2E8 hengshua@cs.ualberta.ca Abstract World model is very important for

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Lecture 8: Policy Gradient

Lecture 8: Policy Gradient Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve

More information

Dual Temporal Difference Learning

Dual Temporal Difference Learning Dual Temporal Difference Learning Min Yang Dept. of Computing Science University of Alberta Edmonton, Alberta Canada T6G 2E8 Yuxi Li Dept. of Computing Science University of Alberta Edmonton, Alberta Canada

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010 Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

Reinforcement Learning and Deep Reinforcement Learning

Reinforcement Learning and Deep Reinforcement Learning Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q

More information

Reinforcement Learning using Continuous Actions. Hado van Hasselt

Reinforcement Learning using Continuous Actions. Hado van Hasselt Reinforcement Learning using Continuous Actions Hado van Hasselt 2005 Concluding thesis for Cognitive Artificial Intelligence University of Utrecht First supervisor: Dr. Marco A. Wiering, University of

More information

On and Off-Policy Relational Reinforcement Learning

On and Off-Policy Relational Reinforcement Learning On and Off-Policy Relational Reinforcement Learning Christophe Rodrigues, Pierre Gérard, and Céline Rouveirol LIPN, UMR CNRS 73, Institut Galilée - Université Paris-Nord first.last@lipn.univ-paris13.fr

More information

Reinforcement Learning: An Introduction. ****Draft****

Reinforcement Learning: An Introduction. ****Draft**** i Reinforcement Learning: An Introduction Second edition, in progress ****Draft**** Richard S. Sutton and Andrew G. Barto c 2014, 2015 A Bradford Book The MIT Press Cambridge, Massachusetts London, England

More information

On the Convergence of Optimistic Policy Iteration

On the Convergence of Optimistic Policy Iteration Journal of Machine Learning Research 3 (2002) 59 72 Submitted 10/01; Published 7/02 On the Convergence of Optimistic Policy Iteration John N. Tsitsiklis LIDS, Room 35-209 Massachusetts Institute of Technology

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

Olivier Sigaud. September 21, 2012

Olivier Sigaud. September 21, 2012 Supervised and Reinforcement Learning Tools for Motor Learning Models Olivier Sigaud Université Pierre et Marie Curie - Paris 6 September 21, 2012 1 / 64 Introduction Who is speaking? 2 / 64 Introduction

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

PART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B.

PART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B. Advanced Topics in Machine Learning, GI13, 2010/11 Advanced Topics in Machine Learning, GI13, 2010/11 Answer any THREE questions. Each question is worth 20 marks. Use separate answer books Answer any THREE

More information

Lecture 17: Reinforcement Learning, Finite Markov Decision Processes

Lecture 17: Reinforcement Learning, Finite Markov Decision Processes CSE599i: Online and Adaptive Machine Learning Winter 2018 Lecture 17: Reinforcement Learning, Finite Markov Decision Processes Lecturer: Kevin Jamieson Scribes: Aida Amini, Kousuke Ariga, James Ferguson,

More information

RL 3: Reinforcement Learning

RL 3: Reinforcement Learning RL 3: Reinforcement Learning Q-Learning Michael Herrmann University of Edinburgh, School of Informatics 20/01/2015 Last time: Multi-Armed Bandits (10 Points to remember) MAB applications do exist (e.g.

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course Value Iteration: the Idea 1. Let V 0 be any vector in R N A. LAZARIC Reinforcement

More information

ROB 537: Learning-Based Control. Announcements: Project background due Today. HW 3 Due on 10/30 Midterm Exam on 11/6.

ROB 537: Learning-Based Control. Announcements: Project background due Today. HW 3 Due on 10/30 Midterm Exam on 11/6. ROB 537: Learning-Based Control Week 5, Lecture 1 Policy Gradient, Eligibility Traces, Transfer Learning (MaC Taylor Announcements: Project background due Today HW 3 Due on 10/30 Midterm Exam on 11/6 Reading:

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.

More information

A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games

A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games International Journal of Fuzzy Systems manuscript (will be inserted by the editor) A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games Mostafa D Awheda Howard M Schwartz Received:

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Reinforcement learning II Daniel Hennes 11.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Eligibility traces n-step TD returns

More information

Off-Policy Actor-Critic

Off-Policy Actor-Critic Off-Policy Actor-Critic Ludovic Trottier Laval University July 25 2012 Ludovic Trottier (DAMAS Laboratory) Off-Policy Actor-Critic July 25 2012 1 / 34 Table of Contents 1 Reinforcement Learning Theory

More information

Introduction to Reinforcement Learning. Part 5: Temporal-Difference Learning

Introduction to Reinforcement Learning. Part 5: Temporal-Difference Learning Introduction to Reinforcement Learning Part 5: emporal-difference Learning What everybody should know about emporal-difference (D) learning Used to learn value functions without human input Learns a guess

More information

Reinforcement Learning (1)

Reinforcement Learning (1) Reinforcement Learning 1 Reinforcement Learning (1) Machine Learning 64-360, Part II Norman Hendrich University of Hamburg, Dept. of Informatics Vogt-Kölln-Str. 30, D-22527 Hamburg hendrich@informatik.uni-hamburg.de

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Variance Reduction for Policy Gradient Methods. March 13, 2017

Variance Reduction for Policy Gradient Methods. March 13, 2017 Variance Reduction for Policy Gradient Methods March 13, 2017 Reward Shaping Reward Shaping Reward Shaping Reward shaping: r(s, a, s ) = r(s, a, s ) + γφ(s ) Φ(s) for arbitrary potential Φ Theorem: r admits

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning 1 Reinforcement Learning Mainly based on Reinforcement Learning An Introduction by Richard Sutton and Andrew Barto Slides are mainly based on the course material provided by the

More information

Reinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN

Reinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN Reinforcement Learning for Continuous Action using Stochastic Gradient Ascent Hajime KIMURA, Shigenobu KOBAYASHI Tokyo Institute of Technology, 4259 Nagatsuda, Midori-ku Yokohama 226-852 JAPAN Abstract:

More information

Lecture 3: The Reinforcement Learning Problem

Lecture 3: The Reinforcement Learning Problem Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information