A Gentle Introduction to Reinforcement Learning


Alexander Jung

1 Introduction and Motivation

Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple, we model the office room as a plain rectangular area (see Figure 1) which is discretised using a regular grid of small squares (cells). Each cell is identified by its coordinates $m \in \{1,\dots,K\}$ and $y \in \{1,\dots,L\}$. We say that Rumba is in state $s_t = \langle m_t, y_t\rangle$ at time $t$ if it is currently located at the cell with coordinates $m_t$ and $y_t$. The set of all states (cells) constitutes the state space of Rumba, i.e.,

$$\mathcal{S} = \{\langle m, y\rangle : m \in \{1,\dots,K\},\ y \in \{1,\dots,L\}\}. \qquad (1)$$

Some cells $\langle m, y\rangle$ are occupied by obstacles or contain a charging station for Rumba. We denote the set of cells occupied by obstacles by $\mathcal{B} \subseteq \mathcal{S}$ and the set of cells with a charging station by $\mathcal{C} \subseteq \mathcal{S}$. In order to move around, Rumba can choose to take actions $a \in \mathcal{A} = \{\mathrm{N}, \mathrm{S}, \mathrm{E}, \mathrm{W}\}$, each action corresponding to a compass direction into which Rumba can move. For example, if Rumba chooses action $a = \mathrm{N}$ it will move in direction north (if there is no obstacle or room border in the way).

A simple, yet quite useful, model which formally describes the behaviour of Rumba as it moves around room B329 consists of the following building blocks:

- the state space $\mathcal{S}$ (e.g., the grid world $\{1,\dots,K\} \times \{1,\dots,L\}$);
- the set of actions $\mathcal{A}$ (e.g., the set of directions $\{\mathrm{N}, \mathrm{S}, \mathrm{E}, \mathrm{W}\}$);
- the transition model $T : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ (e.g., move into compass direction $a$, but never into an obstacle or out of the grid world);
- the reward function $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ (e.g., zero reward for reaching a charging station, otherwise negative reward);
- the discount factor $\gamma \in (0, 1)$ (e.g., $\gamma = 1/2$);

which together define a Markov decision process (MDP). Formally, an MDP model is a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, T, R, \gamma\rangle$ consisting of a state space $\mathcal{S}$, an action set $\mathcal{A}$, a transition map $T : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, a reward function $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and a discount factor $\gamma \in [0, 1]$.
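As a concrete illustration (not part of the original lecture notes), the following minimal Python sketch encodes these building blocks for a small grid world; the grid size, the obstacle set $\mathcal{B}$ and the charger set $\mathcal{C}$ are arbitrary placeholder choices.

```python
# Minimal sketch of the grid-world MDP building blocks (illustrative values only).
K, L = 6, 2                                  # grid dimensions
STATES = [(m, y) for m in range(1, K + 1) for y in range(1, L + 1)]   # state space S
ACTIONS = ["N", "S", "E", "W"]               # action set A
OBSTACLES = {(3, 1)}                         # set B of occupied cells (placeholder)
CHARGERS = {(6, 2)}                          # set C of charging stations (placeholder)
GAMMA = 0.5                                  # discount factor

def T(s, a):
    """Deterministic transition map T(s, a): move one cell into direction a,
    unless this would hit an obstacle or leave the grid world."""
    m, y = s
    dm, dy = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}[a]
    m_new, y_new = m + dm, y + dy
    if (m_new, y_new) in OBSTACLES or not (1 <= m_new <= K and 1 <= y_new <= L):
        return s                             # blocked: Rumba stays where it is
    return (m_new, y_new)

def R(s, a):
    """Reward function R(s, a): 0 if the action leads to a charger, -1 otherwise."""
    return 0.0 if T(s, a) in CHARGERS else -1.0
```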

Figure 1: The office room, which Rumba has to keep tidy, and a simple grid-world model of the room's floor space.

On a higher level, an MDP is nothing but an abstract mathematical model (much like propositional logic) which can be used to describe the interaction between an AI system (such as the cleaning robot Rumba) and its environment (such as the office room containing Rumba). Having an accurate MDP model allows us to derive efficient methods for computing (approximately) optimal actions, in the sense of maximizing a long-term average reward or return.

Note that the precise specification of an MDP model for Rumba requires perfect knowledge of all the occupied cells (obstacles) in room B329. In particular, we can specify the transition map $T$ of the MDP model only if we know the locations of all obstacles. In some applications this might be a reasonable assumption, e.g., if the obstacles are furniture which do not move often. However, if the obstacles are chairs which are moved often, then it is unreasonable to assume perfect knowledge of the obstacle locations and, in turn, we do not know the transition map $T$. Thus, in this case, the MDP model for the behaviour of Rumba must be adapted over the course of time.

In what follows, we will restrict ourselves to deterministic MDP models, which involve a deterministic transition map $T(s_t, a_t)$ mapping the current state $s_t$ and the action $a_t$ taken by Rumba to a well-defined successor state $s_{t+1} = T(s_t, a_t)$. Moreover, the reward received when taking action $a_t$ in state $s_t$ is determined exactly by a deterministic map $R(s_t, a_t)$. In many real-world applications it is more convenient to allow for stochastic (random) transition maps and rewards, e.g., to cope with random failures in the mechanical parts of Rumba which cause it to move as ordered by the action $a_t$ only with high probability. In this case, the successor state $s_{t+1}$ is a random variable with conditional distribution $p(s_{t+1} \mid s_t, a_t)$ which depends on the current state $s_t$ and action $a_t$. However, it turns out that the concepts and methods developed for deterministic MDP models can be extended quite easily to stochastic MDP models (which include deterministic MDP models as a special case).

1.1 Learning Outcomes

After completing this chapter, you should

- understand tabular-based Q-learning,
- understand the limitations of tabular-based methods,
- understand the basic idea of function approximation for action-value functions,

- be able to implement the Q-learning algorithm using linear function approximation.

2 The Problem

Consider the cleaning robot Rumba which has to keep the office room B329 tidy. At some time $t$, when Rumba is currently in state (at cell) $s_t = \langle m_t, y_t\rangle$, it finds itself running out of battery and needs to reach a charging station, which can be found at some cell $s \in \mathcal{C}$, as soon as possible. We want to program Rumba such that it reaches a charging station as quickly as possible. The programming of Rumba amounts to specifying a policy $\pi$ which maps its current state $s \in \mathcal{S}$ to a good (hopefully rational) action $a \in \mathcal{A}$.

Figure 2: (a) The control software of Rumba implements a policy $\pi : \mathcal{S} \to \mathcal{A}$ which maps the current state $s \in \mathcal{S}$ (e.g., $s_t = \langle 2, 3\rangle$) to an action $a_t \in \mathcal{A}$ (e.g., move east, $a_t = \mathrm{E}$). (b) We can also think of a policy in terms of a sub-routine which is executed by the operating system of Rumba.

Mathematically, we can represent a policy as a map (or function) $\pi : \mathcal{S} \to \mathcal{A}$ from the state space $\mathcal{S}$ to the action set $\mathcal{A}$ of an MDP. The policy $\pi$ maps a particular state $s \in \mathcal{S}$ to the action $a = \pi(s) \in \{\mathrm{N}, \mathrm{S}, \mathrm{E}, \mathrm{W}\}$ which the AI system takes next. For the simple grid-world MDP model used for the Rumba application, we can illustrate a particular policy by drawing arrows in the grid world. Note that once we specify a policy $\pi$ and the starting state $s_t$ at which Rumba starts to execute $\pi$ from time $t$ onwards, the future behaviour is completely determined by the transition model $T$ of the MDP, since we have

$$a_t = \pi(s_t),\quad s_{t+1} = T(s_t, a_t),\quad a_{t+1} = \pi(s_{t+1}),\quad s_{t+2} = T(s_{t+1}, a_{t+1}),\ \text{and so on.} \qquad (2)$$
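To make the rollout (2) concrete, here is a minimal sketch (again not part of the original notes) that simulates Rumba's trajectory under a given policy until a charging station is reached; it reuses the hypothetical grid-world helpers T and CHARGERS from the first code sketch, and the always-east policy in the usage example is just a placeholder.

```python
def rollout(s_t, policy, T, chargers, max_steps=1000):
    """Simulate the rollout (2): repeatedly apply the policy and the deterministic
    transition map until a charging station is reached. Returns the visited
    states and the number of time steps taken."""
    trajectory, s = [s_t], s_t
    for step in range(max_steps):
        if s in chargers:                # reached some cell of the charger set C
            return trajectory, step
        a = policy(s)                    # a_t     = pi(s_t)
        s = T(s, a)                      # s_{t+1} = T(s_t, a_t)
        trajectory.append(s)
    return trajectory, max_steps         # no charger reached within the budget

# Example: a placeholder policy that always moves east, started at cell (1, 1):
# trajectory, steps = rollout((1, 1), lambda s: "E", T, CHARGERS)
```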

We evaluate the quality of a particular policy $\pi$ implemented by Rumba using the time difference $t_c^{(\pi)} - t$ between the first time $t_c^{(\pi)}$ at which Rumba reaches a charging station and the starting time $t$ at which Rumba was in the starting state $s_t$. It can be shown that searching for a policy which minimises $t_c^{(\pi)} - t$ is fully equivalent to searching for a policy which has maximum value function

$$v_\pi(s_t) := \sum_{j=0}^{\infty} \gamma^j R(s_{t+j}, a_{t+j}) \qquad (3)$$

using the reward function

$$R(s, a) = \begin{cases} -1, & \text{when } T(s, a) \notin \mathcal{C} \quad \text{(action did not lead to a charger)}, \\ 0, & \text{else} \quad \text{(action led to a charger)}. \end{cases} \qquad (4)$$

The value function $v_\pi(s_t)$ is a global (long-term) measure of the quality of the policy $\pi$ followed by Rumba when starting in state $s_t$ at time $t$. In contrast, the reward $R(s, a)$ is a local (instantaneous) measure of the usefulness (or rationality) of Rumba taking the particular action $a \in \mathcal{A}$ when it is currently in state $s \in \mathcal{S}$. Carefully note that the definition (4) of the reward function also involves the transition map $T$ of the MDP model. Thus, the rewards received by the AI system depend also on the transition model of the MDP. This is intuitively reasonable, since we expect the rewards obtained by an AI system to depend on how the environment responds to the different actions taken by the AI system.

In what follows, we will focus on the problem of finding a policy which leads Rumba from its starting state $s_t$ as quickly as possible to a charging station, i.e., to some state $s \in \mathcal{C} \subseteq \mathcal{S}$. As we just discussed, this is equivalent to finding a policy which has maximum value function $v_\pi(s_t)$ (cf. (3)) for any possible start state $s_t \in \mathcal{S}$.

3 Computing Optimal Policies for Known MDPs

Given a particular MDP model $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, T, R, \gamma\rangle$, we are interested in finding an optimal policy $\pi^*$, i.e., one having maximum value function

$$v_{\pi^*}(s) = \max_{\pi} v_\pi(s) \quad \text{for all states } s \in \mathcal{S}. \qquad (5)$$

As discussed in a previous chapter, we can find optimal policies indirectly by first computing the optimal value function

$$v^*(s) = \max_{\pi} v_\pi(s) \qquad (6)$$

and then acting greedily according to $v^*(s)$ (see the chapter on MDPs for details). The computation of the optimal value function $v^*(s)$, in turn, can be accomplished rather easily using, e.g., the value iteration algorithm, which we repeat here as Algorithm 1 for convenience:

Algorithm 1: Value Iteration (deterministic MDP)
Input: MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, T, R, \gamma\rangle$ with discount factor $\gamma \in (0, 1)$; error tolerance $\eta$
Initialize: $v_0(s) = 0$ for every state $s \in \mathcal{S}$; iteration counter $k := 0$
Step 1: for each state $s \in \mathcal{S}$, update the value function, i.e.,
$$v_{k+1}(s) = \max_{a \in \mathcal{A}} \big[ R(s, a) + \gamma\, v_k(T(s, a)) \big] \qquad (7)$$
Step 2: increment the iteration counter $k := k + 1$
Step 3: if $\max_{s \in \mathcal{S}} |v_k(s) - v_{k-1}(s)| > \eta$, go to Step 1
Output: estimate (approximation) $v_k(s)$ of the optimal value function $v^*$

Given the output $v_k(s)$ of Algorithm 1, which is an approximation to the optimal value function $v^*$ of the MDP $\mathcal{M}$, we can find a corresponding (nearly) optimal policy $\hat{\pi}(s)$ by acting greedily, i.e.,

$$\hat{\pi}(s) := \operatorname{argmax}_{a \in \mathcal{A}} \big[ R(s, a) + \gamma\, v_k(T(s, a)) \big]. \qquad (8)$$

There are different options for deciding when to stop the iterations of the value iteration Algorithm 1. For example, we could stop the updates (7) after a fixed but sufficiently large number of iterations. Another option, which is used in the above formulation of Algorithm 1, is to monitor the difference $\max_{s \in \mathcal{S}} |v_k(s) - v_{k-1}(s)|$ between successive iterates $v_k$ and stop as soon as this difference falls below a pre-specified threshold $\eta$ (which is an input parameter of Algorithm 1).

One appealing property of using the stopping condition shown in Algorithm 1 is that it allows us to guarantee a bound on the sub-optimality of the policy $\hat{\pi}(s)$ obtained via (8) from the output $v_k(s)$ of Algorithm 1. In particular, the value function $v_{\hat{\pi}}$ of the policy $\hat{\pi}(s)$ obtained by Algorithm 1 and (8) deviates from the optimal value function $v^*$ by no more than $\frac{2\eta\gamma}{1-\gamma}$, i.e.,

$$\max_{s \in \mathcal{S}} |v_{\hat{\pi}}(s) - v^*(s)| \le \frac{2\eta\gamma}{1-\gamma}. \qquad (9)$$

We refer to [1] for a formal proof of this bound.
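As an illustration, here is a minimal sketch of Algorithm 1, including the stopping rule and the greedy read-off (8); it reuses the hypothetical grid-world names STATES, ACTIONS, T, R and GAMMA from the first code sketch.

```python
def value_iteration(states, actions, T, R, gamma, eta=1e-6):
    """Sketch of Algorithm 1: repeat the update (7) until the stopping
    condition (10) holds, then read off the greedy policy (8)."""
    v = {s: 0.0 for s in states}                              # v_0(s) = 0
    while True:
        v_new = {s: max(R(s, a) + gamma * v[T(s, a)] for a in actions)
                 for s in states}                             # update (7)
        diff = max(abs(v_new[s] - v[s]) for s in states)
        v = v_new
        if diff <= eta:                                       # stopping rule (10)
            break
    policy = {s: max(actions, key=lambda a: R(s, a) + gamma * v[T(s, a)])
              for s in states}                                # greedy policy (8)
    return v, policy

# Example usage with the grid-world sketch from the first code snippet:
# v_hat, pi_hat = value_iteration(STATES, ACTIONS, T, R, GAMMA, eta=1e-6)
```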

In order to implement Algorithm 1, we need to check whether the stopping condition

$$\max_{s \in \mathcal{S}} |v_k(s) - v_{k-1}(s)| \le \eta \qquad (10)$$

is satisfied after a reasonable number of iterations. This can be verified using a rather elegant argument which interprets the update (7) as a fixed-point iteration

$$v_{k+1} = \mathcal{P} v_k. \qquad (11)$$

Here, $\mathcal{P}$ denotes an operator which maps a value function $v_k : \mathcal{S} \to \mathbb{R}$ to another value function $v_{k+1} : \mathcal{S} \to \mathbb{R}$ according to the rule

$$v_{k+1}(s) = \max_{a \in \mathcal{A}} \big[ R(s, a) + \gamma\, v_k(T(s, a)) \big]. \qquad (12)$$

This operator $\mathcal{P}$ is a contraction with rate not larger than the discount factor $\gamma$ of the MDP, i.e.,

$$\|\mathcal{P} v_k - \mathcal{P} v_{k-1}\|_\infty \le \gamma\, \|v_k - v_{k-1}\|_\infty, \qquad (13)$$

and, in turn,

$$\|v_{k+1} - v^*\|_\infty = \|\mathcal{P} v_k - v^*\|_\infty \le \gamma\, \|v_k - v^*\|_\infty. \qquad (14)$$

Thus, the difference $\max_{s \in \mathcal{S}} |v_k(s) - v_{k-1}(s)|$, as well as the deviation $\max_{s \in \mathcal{S}} |v_k(s) - v^*(s)|$ from the optimal value function $v^*$, decays exponentially fast: each additional iteration of Algorithm 1 reduces the difference by a factor $\gamma < 1$.

Carefully note that in order to execute the value iteration algorithm we need to know the reward function $R(s, a)$ and the transition map $T(s, a)$ for all possible state-action pairs $\langle s, a\rangle \in \mathcal{S} \times \mathcal{A}$. If we have this information at our disposal, we can compute the optimal value function using the value iteration Algorithm 1 and, in turn, determine a (nearly) optimal policy using (8). However, what should we do if we do not know the rewards and the transition map before we actually implement a policy that lets the AI system (e.g., Rumba) interact with its environment? It turns out that some rather intuitive modifications of the value iteration Algorithm 1 will allow us to learn the optimal policy on-the-fly, i.e., while executing actions $a_t$ in the current state $s_t$ and observing the resulting rewards $R(s_t, a_t)$ and state transitions $\langle s_t, a_t\rangle \to s_{t+1}$ of the AI system. To this end, we first rewrite the value iteration Algorithm 1 in an equivalent way by using action-value functions $q(s, a)$ instead of value functions $v(s)$.

Algorithm 2: Value Iteration II (deterministic MDP)
Input: MDP model $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, T, R, \gamma\rangle$; error tolerance $\eta > 0$
Initialize: $q_0(s, a) = 0$ for every $\langle s, a\rangle \in \mathcal{S} \times \mathcal{A}$; iteration counter $k := 0$
Step 1: for each state-action pair $\langle s, a\rangle \in \mathcal{S} \times \mathcal{A}$, update
$$q_{k+1}(s, a) = R(s, a) + \gamma \max_{a' \in \mathcal{A}} q_k(T(s, a), a') \qquad (15)$$
Step 2: increment the iteration counter $k := k + 1$
Step 3: if $\max_{s \in \mathcal{S}, a \in \mathcal{A}} |q_k(s, a) - q_{k-1}(s, a)| > \eta$, go to Step 1
Output: estimate (approximation) $q_k(s, a)$ of the optimal action-value function $q^*$ (cf. (17))

Similarly to Algorithm 1 and (8), we can read off an (approximately) optimal policy $\hat{\pi}(s)$ from the output $q_k$ of Algorithm 2 by acting greedily, i.e.,

$$\hat{\pi}(s) := \operatorname{argmax}_{a \in \mathcal{A}} q_k(s, a). \qquad (16)$$
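The following short sketch (again reusing the hypothetical grid-world names from above) contrasts the two greedy read-offs (8) and (16); the practical difference between them is discussed next.

```python
def greedy_from_v(v, states, actions, T, R, gamma):
    """Greedy policy (8): needs the model, i.e., T and R, for a one-step look-ahead."""
    return {s: max(actions, key=lambda a: R(s, a) + gamma * v[T(s, a)])
            for s in states}

def greedy_from_q(q, states, actions):
    """Greedy policy (16): needs only the action-value table q(s, a) itself."""
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
```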

In contrast to Algorithm 1, which is based on estimating the optimal value function $v^*$ of an MDP, Algorithm 2 aims at estimating (approximating) the optimal action-value function

$$q^*(s, a) = \max_{\pi} q_\pi(s, a), \qquad (17)$$

where the maximum is taken over all possible policies $\pi : \mathcal{S} \to \mathcal{A}$. While the functions $v^*$ and $q^*$ provide essentially the same information about the optimal policies of an MDP, the optimal action-value function $q^*$ makes it somewhat easier to read off the corresponding optimal (greedy) policy via (16) (compared to (8)). We also highlight that, in contrast to (16), computing the greedy policy (8) for a given value function still involves the transition map $T$ of the underlying MDP. The fact that (16) does not involve this transition map anymore will be convenient when facing applications with unknown MDP models.

Note that Step 1 of Algorithm 2 (cf. (15)) requires looping over all states and all possible actions and evaluating the transition map $T(s, a)$ and the reward function $R(s, a)$. This can be a challenge when implementing Algorithm 2, since the state space and action set can be very large (consider, e.g., the state space of an autonomous ship obtained by discretising the earth's surface into squares with side length 1 km), which makes the execution of Algorithm 2 slow. Moreover, and even more serious, in an unknown environment (e.g., a new office room which has to be cleaned by Rumba) we do not know the transition map (which depends on the locations of obstacles and charging stations) beforehand. We have to learn about the environment by taking actions $a_t$ when being in state $s_t$ and by observing the resulting new state $s_{t+1}$ and reward $R(s_t, a_t)$ (see Figure 3).

Figure 3: Starting in some state $s_t$, Rumba takes action $a_t$ which (depending on the location of obstacles and charging stations) leads it into the new state $s_{t+1}$, where Rumba takes another action $a_{t+1}$, and so on. The usefulness of taking action $a_t$ when being in state $s_t$ is indicated by the obtained reward $R(s_t, a_t)$.

4 Computing Optimal Policies for Unknown MDPs

Up to now we have considered MDP models and methods for finite state spaces $\mathcal{S}$ and action sets $\mathcal{A}$. For such MDPs we can represent an action-value function $q(s, a)$ as a table with $|\mathcal{S}|$ rows indexed by the states $s \in \mathcal{S}$ and $|\mathcal{A}|$ columns indexed by the actions $a \in \mathcal{A}$:

          s = ⟨1,1⟩      s = ⟨2,1⟩      ...     s = ⟨6,1⟩
a = W     q(⟨1,1⟩, W)    q(⟨2,1⟩, W)    ...     q(⟨6,1⟩, W)
a = E     q(⟨1,1⟩, E)    q(⟨2,1⟩, E)    ...     q(⟨6,1⟩, E)

This tabular representation allows for a simple analysis of the resulting MDP methods and also helps to develop some intuition for the dynamic behaviour of those methods. However, many

if not most AI applications do not allow for a tabular representation, since the state space is too large or even infinite. We then have to use function approximation methods in order to efficiently represent and work with action-value functions.

4.1 Tabular Methods

We now introduce one of the core algorithms underlying many modern reinforcement learning methods: the Q-learning algorithm. This algorithm can be interpreted as a variation of Algorithm 2 (value iteration) which identifies the iteration counter $k$ in Algorithm 2 with a real-time index $t$. Thus, while Algorithm 2 can be executed ahead of any actual implementation of a policy in the AI system, i.e., we can do planning with Algorithm 2, the Q-learning algorithm allows us to learn the optimal behaviour while the AI system operates (takes actions, receives rewards and experiences state transitions to new states) in its environment.

Algorithm 3: Q-Learning (deterministic MDP)
Input: discount factor $\gamma \in (0, 1)$; start time $t$
Initialize: set $q_{t-1}(s, a) = 0$ for every $\langle s, a\rangle \in \mathcal{S} \times \mathcal{A}$; determine the current state $s_t$
Step 1: take some action $a_t \in \mathcal{A}$
Step 2: wait until the new state $s_{t+1}$ is reached and the reward $R(s_t, a_t)$ is received
Step 3: update the action-value function at the particular state-action pair $\langle s_t, a_t\rangle \in \mathcal{S} \times \mathcal{A}$,
$$q_t(s_t, a_t) = R(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} q_{t-1}(s_{t+1}, a'), \qquad (18)$$
and copy the action-value estimate for all other state-action pairs, i.e.,
$$q_t(s, a) = q_{t-1}(s, a) \quad \text{for all } \langle s, a\rangle \in \mathcal{S} \times \mathcal{A} \setminus \{\langle s_t, a_t\rangle\} \qquad (19)$$
Step 4: if not converged, update the time index $t := t + 1$ and go to Step 1
Output: estimate $Q(s, a) = q_t(s, a)$ of the optimal action-value function $q^*$ (cf. (17))

It can be shown that the iterates $q_t(s, a)$ generated by Algorithm 3 converge to the true optimal action-value function $q^*(s, a)$ of the underlying MDP whenever each possible action $a \in \mathcal{A}$ is taken in each possible state $s \in \mathcal{S}$ infinitely often. Consider a time interval $\mathcal{I} = \{t_1, \dots, t_2\}$, which we refer to as a full interval, such that each possible state-action pair $\langle s, a\rangle \in \mathcal{S} \times \mathcal{A}$ occurs at least once as the current state $s_t$ and action $a_t$ at some time $t \in \mathcal{I}$. As shown in [2], after each full interval the approximation error achieved by Algorithm 3 is reduced by a factor $\gamma$, i.e.,

$$\max_{(s,a) \in \mathcal{S}\times\mathcal{A}} |q_{t_2}(s, a) - q^*(s, a)| \le \gamma \max_{(s,a) \in \mathcal{S}\times\mathcal{A}} |q_{t_1}(s, a) - q^*(s, a)|. \qquad (20)$$

Thus, after a sufficient number of full intervals, the function $q_t(s, a)$ generated by Algorithm 3 is an accurate approximation of the optimal action-value function $q^*(s, a)$ of the unknown (!) underlying MDP. We can then read off an (approximately) optimal policy for the MDP by acting greedily, i.e.,

$$\hat{\pi}(s) := \operatorname{argmax}_{a \in \mathcal{A}} Q(s, a). \qquad (21)$$

It is important to note that the actions $a_t$ chosen in Step 1 of Algorithm 3 need not be related at all to the estimate $q_t(s, a)$ of the action-value function. Rather, these actions have to ensure that each possible pair $\langle s, a\rangle \in \mathcal{S} \times \mathcal{A}$ of state and action occurs infinitely often as the current state $s_t$ and action $a_t$ during the execution of Algorithm 3. In practice, we have to stop Algorithm 3 after a finite amount of time. An estimate of the minimum time required to ensure a small deviation of $q_t(s, a)$ from the optimal action-value function $q^*(s, a)$ can be obtained from (20).
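A minimal sketch of Algorithm 3 for the grid world is given below. The uniformly random action choice in Step 1 is just one simple way of (eventually) visiting every state-action pair; in the usage example, the environment is simulated with the hypothetical maps T and R from the first code sketch, which the learner itself never inspects.

```python
import random

def q_learning(states, actions, env_step, gamma, num_steps, s_start):
    """Sketch of Algorithm 3 (tabular Q-learning, deterministic MDP).
    env_step(s, a) returns the pair (reward, next state); the learner never
    looks at the transition map T or the reward function R directly."""
    q = {(s, a): 0.0 for s in states for a in actions}    # q_{t-1}(s, a) = 0
    s = s_start
    for _ in range(num_steps):
        a = random.choice(actions)                        # Step 1: take some action a_t
        r, s_next = env_step(s, a)                        # Step 2: observe reward and new state
        q[(s, a)] = r + gamma * max(q[(s_next, a2)] for a2 in actions)   # Step 3: update (18)
        s = s_next
    return q

# Simulating the environment with the grid-world maps T and R from the first sketch:
# q_hat = q_learning(STATES, ACTIONS, lambda s, a: (R(s, a), T(s, a)),
#                    GAMMA, num_steps=50_000, s_start=(1, 1))
```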

The execution of Algorithms 2 and 3 can be illustrated nicely by thinking of the action-value function $q(s, a)$ as a table, with the different rows representing the different states and the different columns representing the different actions which the AI system can take:

          s = ⟨1,1⟩   s = ⟨2,1⟩   s = ⟨3,1⟩   s = ⟨4,1⟩   s = ⟨5,1⟩   s = ⟨6,1⟩
a = W
a = E

4.2 Function Approximation Methods

The applicability of Algorithm 3 is limited to MDP models with a (small) finite state space $\mathcal{S} = \{1,\dots,|\mathcal{S}|\}$ and action set $\mathcal{A} = \{1,\dots,|\mathcal{A}|\}$. However, many AI applications involve extremely large state spaces, which might even be continuous. Consider, e.g., an MDP model for an autonomous ship, which can be interpreted as an AI system operating within its physical environment (including the sea water and near-ground atmosphere). The state of an autonomous ship is typically characterized by its current coordinates (longitude and latitude) and velocity [3]. We will now extend the tabular Algorithm 3 to cope with such applications, where the state space and action set are so large that we cannot represent them by a simple table anymore.

The key idea is to use function approximation, somewhat similar to machine learning methods for approximating predictors, in order to handle high-dimensional action-value functions. In particular, we approximate the optimal action-value function $q^*(s, a)$ of a given MDP model as

$$q^*(s, a) \approx \phi(s, a; w) \qquad (22)$$

with a function $\phi(s, a; w)$ which is parametrized by some weight vector $w \in \mathbb{R}^d$. We have seen two examples of such parametrized functions in the chapter "Elements of Machine Learning": the class of linear functions and the class of functions represented by an artificial neural network structure. Let us consider in what follows the special case of linear function approximation, i.e.,

$$\phi(s, a; w) := w^T x(s, a) = \sum_{j=1}^{d} w_j\, x_j(s, a) \qquad (23)$$

with some feature vector $x(s, a) = (x_1(s, a),\dots,x_d(s, a))^T \in \mathbb{R}^d$. The feature vector $x(s, a)$ depends on the current state $s$ of, and the action $a$ taken by, the AI system. However, we can evaluate the approximation $\phi(s, a; w)$ of the true action-value function $q^*(s, a)$ using only the weight vector $w$ and the features $x_j(s, a)$ directly, without determining the underlying state $s \in \mathcal{S}$ (which might be difficult). Thus, we only work on the space of weight vectors $w \in \mathbb{R}^d$, which is typically much smaller than the state space $\mathcal{S}$ of the MDP.

We highlight that, to a large extent, the particular definition of the features $x(s, a)$ is a design choice. In principle, we can use any quantity that can be determined or measured by the AI system. For the particular AI application of the cleaning robot Rumba, we might use as features certain properties of the current snapshot taken by its on-board cameras. These features clearly depend on the underlying state (location), but we do not need the state itself in order to implement (23), which can be done directly from the camera snapshot.
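As an illustration of this design freedom, one possible (entirely hypothetical) feature map for the grid world is sketched below; it uses only quantities Rumba could determine on-board, namely its (normalised) cell coordinates and the chosen direction. For the real robot, features derived from a camera snapshot, as described above, would be the more natural choice.

```python
import numpy as np

def x_features(s, a, actions=("N", "S", "E", "W"), K=6, L=2):
    """One possible (hypothetical) feature vector x(s, a) for the grid world:
    normalised cell coordinates, a constant, and a one-hot encoding of the action."""
    m, y = s
    action_onehot = [1.0 if a == b else 0.0 for b in actions]
    return np.array([m / K, y / L, 1.0] + action_onehot)      # here d = 7

def phi(s, a, w):
    """Linear approximation (23): phi(s, a; w) = w^T x(s, a)."""
    return float(w @ x_features(s, a))
```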

In order to implement the linear function approximation (23), we have to choose the weight vector $w$ suitably. To this end, we can use the information acquired by observing Rumba's behaviour, i.e., the state transitions $s_t \xrightarrow{a_t} s_{t+1}$ and the rewards $R(s_t, a_t)$ (see Figure 3). Let us assume that Rumba takes action $a_t$ when being in state $s_t$, which results in the reward $R(s_t, a_t)$ and the new state $s_{t+1} = T(s_t, a_t)$. How can we judge the quality of the approximation $\phi(s_t, a_t; w^{(t)})$, using our current choice $w^{(t)}$ of the weight vector, for the true optimal action-value function $q^*(s_t, a_t)$? It seems reasonable to measure the quality by the squared error

$$\varepsilon(w^{(t)}) := \big(q^*(s_t, a_t) - \phi(s_t, a_t; w^{(t)})\big)^2 \qquad (24)$$

and then adjust the weight vector $w$ in order to decrease this error. This reasoning leads us to the gradient descent update [4]

$$w^{(t+1)} = w^{(t)} - (\alpha/2)\, \nabla \varepsilon(w^{(t)}) \qquad (25)$$

with some suitably chosen step size (or learning rate) $\alpha$ (the factor $1/2$ is just for notational convenience). One of the appealing properties of using a linear function approximation (23) is that we obtain a simple expression for the gradient of the error:

$$\nabla \varepsilon(w^{(t)}) = 2\big(\phi(s_t, a_t; w^{(t)}) - q^*(s_t, a_t)\big)\, x(s_t, a_t). \qquad (26)$$

Note that we cannot directly evaluate (26) for computing the gradient, as it involves the unknown action-value function $q^*(s_t, a_t)$. After all, our goal is to find a good estimate (approximation) $\phi(s, a; w)$ of this unknown action-value function based on observing the behaviour (state transitions and rewards, see Figure 3) of the AI system. While we do not have access to the exact action-value function $q^*(s_t, a_t)$, we can construct an estimate of it by using the Bellman equation (cf. the chapter on MDPs):

$$q^*(s_t, a_t) = R(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} q^*(T(s_t, a_t), a') \approx R(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} \phi(\underbrace{T(s_t, a_t)}_{= s_{t+1}}, a'; w^{(t)}) = R(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} \phi(s_{t+1}, a'; w^{(t)}). \qquad (27)$$

By inserting (27) into (26), we obtain the following variant of Q-learning using linear function approximation [5].

Algorithm 4: Q-Learning with Linear Value Function Approximation (deterministic MDP)
Input: discount factor $\gamma \in (0, 1)$; learning rate $\alpha > 0$
Initialize: $w^{(t)} := 0$
Step 1: take some action $a_t \in \mathcal{A}$
Step 2: wait until the new state $s_{t+1}$ is reached and the reward $R(s_t, a_t)$ is received
Step 3: update the weight vector (approximate gradient descent step)
$$w^{(t+1)} = w^{(t)} + \alpha \Big( \underbrace{R(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} \big(w^{(t)}\big)^T x(s_{t+1}, a')}_{\approx\, q^*(s_t, a_t)} \;-\; \underbrace{\big(w^{(t)}\big)^T x(s_t, a_t)}_{\phi(s_t, a_t;\, w^{(t)})} \Big)\, x(s_t, a_t) \qquad (28)$$
Step 4: if not converged, update the time index $t := t + 1$ and go to Step 1
Output: estimate $Q(s, a) = \big(w^{(t)}\big)^T x(s, a)$ of the optimal action-value function $q^*$

Note that the execution of Algorithm 4 does not require determining the current state $s_t$ itself, but only the features $x(s_t, a)$, which can typically be determined more easily. In particular, for the cleaning robot Rumba, determining the state $s_t$ (which is defined as its current location) requires an indoor positioning system, while the features $x(s_t, a)$ might be simple characteristics of the current snapshot taken by an on-board camera.
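To make Algorithm 4 concrete, here is a minimal sketch of the weight update (28); it reuses the hypothetical feature map x_features from the previous sketch and, in the usage example, simulates the environment with the grid-world maps T and R (which, again, the learner never inspects directly).

```python
import random
import numpy as np

def q_learning_linear(actions, env_step, features, gamma, alpha, num_steps, s_start, d):
    """Sketch of Algorithm 4: Q-learning with linear value function approximation.
    features(s, a) returns the feature vector x(s, a) in R^d."""
    w = np.zeros(d)                                           # initialise w^(t) := 0
    s = s_start
    for _ in range(num_steps):
        a = random.choice(actions)                            # Step 1: take some action a_t
        r, s_next = env_step(s, a)                            # Step 2: observe reward and new state
        x = features(s, a)
        target = r + gamma * max(w @ features(s_next, a2) for a2 in actions)
        w = w + alpha * (target - w @ x) * x                  # Step 3: gradient step (28)
        s = s_next
    return w

# Example usage with the grid-world environment and the feature map sketched above:
# w_hat = q_learning_linear(ACTIONS, lambda s, a: (R(s, a), T(s, a)), x_features,
#                           GAMMA, alpha=0.05, num_steps=50_000, s_start=(1, 1), d=7)
```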

References

[1] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, 3rd ed. Athena Scientific.
[2] T. Mitchell, Lecture notes, Machine Learning.
[3] T. Perez, Ship Motion Control. London: Springer.
[4] A. Jung, "A fixed-point of view on gradient methods for big data," Frontiers in Applied Mathematics and Statistics, vol. 3.
[5] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674-690, May 1997.

5 Exercises

Problem 1. Consider an MDP model for an autonomous ship. The state $s$ of the ship is determined by the latitude and longitude of its current location. We encode these coordinates using real numbers and therefore use the state space $\mathcal{S} = \mathbb{R}^2$. The action set is $\mathcal{A} = [-\pi, \pi]$, which corresponds to all possible steering directions relative to some reference direction (e.g., north). For fixed $s \in \mathcal{S}$ and $a \in \mathcal{A}$, consider the approximate action-value $\phi(s, a; w) = w^T x(s, a)$, which is a linear function of the feature vector

$$x(s, a) = \big(\|s\|^2,\ a\,\|s\|,\ \exp(-\|s\|^2)\big)^T \in \mathbb{R}^3,$$

which depends on the current state $s$ and the action $a$ taken by the AI sailor. In order to adapt the weight vector $w$ such that $\phi(s, a; w)$ is a good approximation (or estimate) of the (unknown) optimal action-value function $q^*(s, a)$ of the MDP, we typically need to determine the gradient $\nabla_w \phi(s, a; w)$ of $\phi(s, a; w)$ w.r.t. the weight vector $w \in \mathbb{R}^3$. Which of the following statements is true?

- $\nabla_w \phi(s, a; w) = x(s, a)$
- $\nabla_w \phi(s, a; w) = w$
- $\nabla_w \phi(s, a; w) = w^T x(s, a)$
- $\nabla_w \phi(s, a; w) = (x(s, a))^T w$

Problem 2. Consider developing an AI system for autonomous ships. We are interested in learning a predictor $h(x)$ which maps a single feature $x \in \mathbb{R}$ of the current state (this feature might be determined from the on-board cameras as well as GPS sensors) to a predicted optimal steering direction $a \in [-\pi, \pi]$. We have collected training data $\mathcal{X} = \{(x^{(t)}, a_t)\}_{t=1}^{N}$ by observing the steering direction $a_t$ chosen by an experienced human sailor at time $t$ when the feature value $x^{(t)}$ was observed. This training data is stored in a file which contains in its $t$-th row the feature value $x^{(t)}$ and the corresponding steering direction $a_t$ chosen at time $t$. We restrict the search for good predictors $h(x)$ to functions of the form $h(x) = c\, x + d$ with parameters $c, d$ chosen in order to make the mean squared error

$$E(c, d) := (1/N) \sum_{t=1}^{N} \big(a_t - h(x^{(t)})\big)^2 = (1/N) \sum_{t=1}^{N} \big(a_t - (c\, x^{(t)} + d)\big)^2 \qquad (29)$$

as small as possible. Let us denote the optimal parameter values by $c_{\mathrm{opt}}$ and $d_{\mathrm{opt}}$, i.e., $E(c_{\mathrm{opt}}, d_{\mathrm{opt}}) = \min_{c, d \in \mathbb{R}} E(c, d)$. Which of the following statements are (or is ;-) true?

- $c_{\mathrm{opt}} \in [0, 2]$
- $c_{\mathrm{opt}} \in [100, 200]$
- $d_{\mathrm{opt}} \in [100, 200]$
- $d_{\mathrm{opt}} \in [0, 2]$

Problem 3. Reconsider the setting of Problem 2, i.e., we are interested in predicting the optimal steering direction $a \in [-\pi, \pi]$ using a predictor $h(x) = c\, x + d$ with parameters $c, d$ and the single feature $x$ which summarizes the current state of the ship. In contrast to Problem 2, we only have access to two training examples $(x^{(1)}, a_1)$ and $(x^{(2)}, a_2)$ with $x^{(1)} \neq x^{(2)}$. How small can we make the mean squared error

$$E(c, d) = (1/2) \sum_{t=1}^{2} \big(a_t - \underbrace{h(x^{(t)})}_{= c\, x^{(t)} + d}\big)^2 \qquad (30)$$

by tuning the parameters $c, d$?

- we can always find $c, d$ such that $E(c, d) = 0$
- $E(c, d)$ is always lower bounded by $x^{(1)}$, i.e., $E(c, d) \ge x^{(1)}$ for all $c, d \in \mathbb{R}$
- $E(c, d)$ is always lower bounded by $x^{(2)}$, i.e., $E(c, d) \ge x^{(2)}$ for all $c, d \in \mathbb{R}$
- $E(c, d)$ is always lower bounded by $a_2$, i.e., $E(c, d) \ge a_2$ for all $c, d \in \mathbb{R}$

Problem 4. Consider the problem of predicting a real-valued label $y$ (e.g., the local temperature) based on two features $x = (x_1, x_2)^T \in \mathbb{R}^2$ (e.g., the current GPS coordinates). We try to predict the label $y$ using a linear predictor $h(x) = w^T x$ with some weight vector $w \in \mathbb{R}^2$. We choose the weight vector $w$ by evaluating the mean squared error incurred on a training dataset $\mathcal{X}_{\mathrm{train}} = \{(x^{(t)}, y_t)\}$, which is available in a file whose $t$-th row begins with the two features $x_1^{(t)}, x_2^{(t)}$, followed by the corresponding label $y_t$. We then choose the optimal weight vector by minimizing the mean squared error (training error), i.e.,

$$w_{\mathrm{opt}} = \operatorname{argmin}_{w \in \mathbb{R}^2} f(w), \quad \text{with } f(w) := (1/|\mathcal{X}_{\mathrm{train}}|) \sum_{(x, y) \in \mathcal{X}_{\mathrm{train}}} (y - w^T x)^2. \qquad (31)$$

In order to compute the optimal weight vector (31), we can use gradient descent

$$w^{(k+1)} = w^{(k)} - \alpha\, \nabla f(w^{(k)}) \qquad (32)$$

with some learning rate $\alpha > 0$ and initial guess $w^{(0)} = (1, 1)^T \in \mathbb{R}^2$. Which of the following statements is true?

- the iterates $f(w^{(k)})$, $k = 0, 1, \dots$, converge for $\alpha = 1/2$
- the iterates $f(w^{(k)})$, $k = 0, 1, \dots$, converge for $\alpha = 10$
- the iterates $f(w^{(k)})$, $k = 0, 1, \dots$, converge for $\alpha = 1/4$
- the iterates $f(w^{(k)})$, $k = 0, 1, \dots$, converge for $\alpha = 1/8$

Problem 5. Consider the problem of predicting a real-valued label $y$ (e.g., the local temperature) based on two features $x = (x_1, x_2)^T \in \mathbb{R}^2$ (e.g., the current GPS coordinates). We try to predict the label $y$ using a linear predictor $h(x) = w^T x$ with some weight vector $w \in \mathbb{R}^2$. The choice of the weight vector $w$ is guided by the mean squared error incurred on a training dataset $\mathcal{X}_{\mathrm{train}} = \{(x^{(t)}, y_t)\}$, which is available in a file whose $t$-th row starts with the features $x_1^{(t)}, x_2^{(t)}$, followed by the corresponding label $y_t$. In some applications it is beneficial not only to aim at the smallest training error but also to enforce a small norm $\|w\|$ of the weight vector. This leads naturally to the regularized linear regression problem for finding the optimal weight vector:

$$w^{(\lambda)} = \operatorname{argmin}_{w \in \mathbb{R}^2} (1/|\mathcal{X}_{\mathrm{train}}|) \sum_{(x, y) \in \mathcal{X}_{\mathrm{train}}} (y - w^T x)^2 + \lambda \|w\|_2^2. \qquad (33)$$

For different choices of $\lambda \in [0, 1]$ (e.g., 10 evenly spaced values between 0 and 1), determine the optimal weight vector $w^{(\lambda)}$ by solving the optimization problem (33). You might use the gradient descent method as discussed in the chapter "Elements of Machine Learning" in order to compute $w^{(\lambda)}$ (approximately). Then, for each value of $\lambda$ which results in a different $w^{(\lambda)}$, determine the training error

$$\mathrm{TrainError}(\lambda) := (1/|\mathcal{X}_{\mathrm{train}}|) \sum_{(x, y) \in \mathcal{X}_{\mathrm{train}}} \big(y - (w^{(\lambda)})^T x\big)^2 \qquad (34)$$

and the validation (or test) error

$$\mathrm{ValidationError}(\lambda) := (1/|\mathcal{X}_{\mathrm{val}}|) \sum_{(x, y) \in \mathcal{X}_{\mathrm{val}}} \big(y - (w^{(\lambda)})^T x\big)^2, \qquad (35)$$

which is the mean squared error incurred on the validation dataset $\mathcal{X}_{\mathrm{val}}$ available in a separate file. Which of the following statements is true?

- the training error $\mathrm{TrainError}(\lambda)$ always decreases with increasing $\lambda$
- the training error $\mathrm{TrainError}(\lambda)$ never decreases with increasing $\lambda$
- the validation error $\mathrm{ValidationError}(\lambda)$ always decreases with increasing $\lambda$
- the validation error $\mathrm{ValidationError}(\lambda)$ always increases with increasing $\lambda$
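As a possible starting point for Problem 5 (a sketch, not part of the original notes), the snippet below computes $w^{(\lambda)}$ in closed form rather than with the gradient descent method suggested above; since the data file names are not reproduced here, the data loading is left as a placeholder.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Minimiser of the regularized training error (33). For this small problem
    we use the closed-form solution instead of gradient descent; both yield
    (approximately) the same w(lambda)."""
    N, d = X.shape
    return np.linalg.solve(X.T @ X / N + lam * np.eye(d), X.T @ y / N)

def mse(X, y, w):
    """Mean squared error, as in (34) and (35)."""
    return float(np.mean((y - X @ w) ** 2))

# The data file names are not given here; load the training and validation data
# yourself into arrays X_train, y_train, X_val, y_val (one row per example:
# two features followed by the label), then sweep over lambda:
# for lam in np.linspace(0.0, 1.0, 10):
#     w_lam = ridge_weights(X_train, y_train, lam)
#     print(lam, mse(X_train, y_train, w_lam), mse(X_val, y_val, w_lam))
```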


More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Markov Decision Processes

Markov Decision Processes Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning RL in continuous MDPs March April, 2015 Large/Continuous MDPs Large/Continuous state space Tabular representation cannot be used Large/Continuous action space Maximization over action

More information

15-780: Graduate Artificial Intelligence. Reinforcement learning (RL)

15-780: Graduate Artificial Intelligence. Reinforcement learning (RL) 15-780: Graduate Artificial Intelligence Reinforcement learning (RL) From MDPs to RL We still use the same Markov model with rewards and actions But there are a few differences: 1. We do not assume we

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course Value Iteration: the Idea 1. Let V 0 be any vector in R N A. LAZARIC Reinforcement

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Machine Learning. Machine Learning: Jordan Boyd-Graber University of Maryland REINFORCEMENT LEARNING. Slides adapted from Tom Mitchell and Peter Abeel

Machine Learning. Machine Learning: Jordan Boyd-Graber University of Maryland REINFORCEMENT LEARNING. Slides adapted from Tom Mitchell and Peter Abeel Machine Learning Machine Learning: Jordan Boyd-Graber University of Maryland REINFORCEMENT LEARNING Slides adapted from Tom Mitchell and Peter Abeel Machine Learning: Jordan Boyd-Graber UMD Machine Learning

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University

More information

Short Course: Multiagent Systems. Multiagent Systems. Lecture 1: Basics Agents Environments. Reinforcement Learning. This course is about:

Short Course: Multiagent Systems. Multiagent Systems. Lecture 1: Basics Agents Environments. Reinforcement Learning. This course is about: Short Course: Multiagent Systems Lecture 1: Basics Agents Environments Reinforcement Learning Multiagent Systems This course is about: Agents: Sensing, reasoning, acting Multiagent Systems: Distributed

More information

Optimism in the Face of Uncertainty Should be Refutable

Optimism in the Face of Uncertainty Should be Refutable Optimism in the Face of Uncertainty Should be Refutable Ronald ORTNER Montanuniversität Leoben Department Mathematik und Informationstechnolgie Franz-Josef-Strasse 18, 8700 Leoben, Austria, Phone number:

More information