
A Gentle Introduction to Reinforcement Learning

Alexander Jung, 2018

1 Introduction and Motivation

Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple, we model the office room as a plain rectangular area (see Figure 1 below) which is discretised using a regular grid of small squares (cells). Each cell is identified by its coordinates m ∈ {1,...,K} and y ∈ {1,...,L}. We say that Rumba is in state s_t = ⟨m_t, y_t⟩ at time t if it is currently located at the cell with coordinates m_t and y_t. The set of all states (cells) constitutes the state space of Rumba, i.e.,

    S = { ⟨m, y⟩ : m ∈ {1,...,K}, y ∈ {1,...,L} }.    (1)

Some cells ⟨m, y⟩ are occupied by obstacles or contain a charging station for Rumba. We denote the set of cells occupied by obstacles by B ⊆ S and the set of cells with a charging station by C ⊆ S.

In order to move around, Rumba can choose to take actions a ∈ A = {N, S, E, W}, each action corresponding to a compass direction in which Rumba can move. For example, if Rumba chooses action a = N, it will move in direction north (if there is no obstacle or room border in the way).

A simple, yet quite useful, model which formally describes the behaviour of Rumba as it moves around room B329 consists of the following building blocks:

- the state space S (e.g., the grid world {1,...,K} × {1,...,L});
- the set of actions A (e.g., the set of directions {N, S, E, W});
- the transition map T : S × A → S (e.g., move in compass direction a, but neither into an obstacle nor out of the grid world);
- the reward function R : S × A → ℝ (e.g., zero reward for reaching a charging station, otherwise a negative reward);
- the discount factor γ ∈ (0, 1) (e.g., γ = 1/2).

Together, these building blocks define a Markov decision process (MDP). Formally, an MDP model is a tuple M = ⟨S, A, T, R, γ⟩ consisting of a state space S, an action set A, a transition map T : S × A → S, a reward function R : S × A → ℝ and a discount factor γ ∈ [0, 1].
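To make this concrete, the following Python snippet sketches such a deterministic grid-world MDP. It is only an illustration: the grid size, the obstacle set B and the charger set C below are made-up example values, and the reward of −1 is one possible choice of "negative reward".

```python
# A minimal sketch of the deterministic grid-world MDP described above.
# The grid size, obstacle set and charger set are made-up example values.

K, L = 6, 4                          # grid dimensions (assumed)
OBSTACLES = {(3, 2)}                 # set B of occupied cells (assumed)
CHARGERS = {(6, 4)}                  # set C of cells with a charging station (assumed)
ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
GAMMA = 0.5                          # discount factor gamma

STATES = [(m, y) for m in range(1, K + 1) for y in range(1, L + 1)]  # state space S, eq. (1)

def T(s, a):
    """Deterministic transition map T(s, a): move one cell in direction a,
    unless this would leave the grid or hit an obstacle."""
    dm, dy = ACTIONS[a]
    m, y = s[0] + dm, s[1] + dy
    if not (1 <= m <= K and 1 <= y <= L) or (m, y) in OBSTACLES:
        return s                     # blocked: stay in the current cell
    return (m, y)

def R(s, a):
    """Reward function: 0 if the action leads to a charger, -1 otherwise."""
    return 0.0 if T(s, a) in CHARGERS else -1.0
```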

Figure 1: The office room, which Rumba has to keep tidy, and a simple grid-world model of the room's floor space.

On a higher level, an MDP is nothing but an abstract mathematical model (much like propositional logic) which can be used to describe the interaction between an AI system (such as the cleaning robot Rumba) and its environment (such as the office room containing Rumba). Having an accurate MDP model allows us to derive efficient methods for computing (approximately) optimal actions, in the sense of maximizing a long-term average reward or return.

Note that the precise specification of an MDP model for Rumba requires perfect knowledge of all the occupied cells (obstacles) in room B329. In particular, only if we know the locations of all obstacles can we specify the transition map T of the MDP model. In some applications this might be a reasonable assumption, e.g., if the obstacles are pieces of furniture which are rarely moved. However, if the obstacles are chairs which are moved often, then it is unreasonable to assume perfect knowledge of the obstacle locations and, in turn, we do not know the transition map T. Thus, in this case, the MDP model for the behaviour of Rumba must be adapted over the course of time.

In what follows, we restrict ourselves to deterministic MDP models, which involve a deterministic transition map T(s_t, a_t) mapping the current state s_t and the action a_t taken by Rumba to a well-defined successor state s_{t+1} = T(s_t, a_t). Moreover, the reward received when taking action a_t in state s_t is determined exactly by a deterministic map R(s_t, a_t). In many real-world applications it is more convenient to allow for stochastic (random) transition maps and rewards, e.g., to cope with random failures in the mechanical parts of Rumba, which cause it to move as ordered by the action a_t only with high probability. In this case, the successor state s_{t+1} is a random variable with conditional distribution p(s_{t+1} | s_t, a_t) which depends on the current state s_t and action a_t. However, it turns out that the concepts and methods developed for deterministic MDP models can be extended quite easily to stochastic MDP models (which include deterministic MDP models as a special case).

1.1 Learning Outcomes

After completing this chapter, you should

- understand tabular Q-learning;
- understand the limitations of tabular methods;
- understand the basic idea of function approximation for action-value functions;
- be able to implement the Q-learning algorithm using linear function approximation.

2 The Problem

Consider the cleaning robot Rumba which has to keep the office room B329 tidy. At some time t, when Rumba is currently in state (at cell) s_t = ⟨m_t, y_t⟩, it finds itself running out of battery and should reach a charging station, which can be found at some cell s ∈ C, as soon as possible. We want to program Rumba such that it reaches a charging station as quickly as possible. Programming Rumba amounts to specifying a policy π which maps its current state s ∈ S to a good (hopefully rational) action a ∈ A.

Figure 2: (a) The control software of Rumba implements a policy π : S → A which maps the current state s ∈ S (e.g., s_t = ⟨2, 3⟩) to an action a_t ∈ A (e.g., move east, a_t = E). (b) We can also think of a policy as a sub-routine which is executed by the operating system of Rumba.

Mathematically, we can represent a policy as a map (or function) π : S → A from the state space S to the action set A of an MDP. The policy π maps a particular state s ∈ S to the action a = π(s) ∈ {N, S, E, W} which the AI system takes next. For the simple grid-world MDP model used for the Rumba application, we can illustrate a particular policy by drawing arrows in the grid world (see the figure below). Note that once we specify a policy π and the starting state s_t at which Rumba starts to execute π from time t onwards, the future behaviour is completely determined by the transition map T of the MDP, since we have

    a_t = π(s_t), s_{t+1} = T(s_t, a_t), a_{t+1} = π(s_{t+1}), s_{t+2} = T(s_{t+1}, a_{t+1}), and so on.    (2)

We evaluate the quality of a particular policy π implemented by Rumba using the time difference t_c^(π) − t between the first time t_c^(π) at which Rumba reaches a charging station and the starting time t at which Rumba was in the starting state s_t. It can be shown that searching for a policy which minimises t_c^(π) − t is fully equivalent to searching for a policy which has maximum value function

    v_π(s_t) := Σ_{j=0}^∞ γ^j R(s_{t+j}, a_{t+j})    (3)

using the reward function

    R(s, a) = −1 if T(s, a) ∉ C (the action did not lead to a charger),
    R(s, a) =  0 if T(s, a) ∈ C (the action led to a charger).    (4)

The value function v_π(s_t) is a global (long-term) measure of the quality of the policy π followed by Rumba when starting in state s_t at time t. In contrast, the reward R(s, a) is a local (instantaneous) measure of the usefulness (or rationality) of Rumba taking the particular action a ∈ A when it is currently in state s ∈ S. Carefully note that the definition (4) of the reward function also involves the transition map T of the MDP model. Thus, the rewards received by the AI system depend also on the transition model of the MDP. This is intuitively reasonable, since we expect the rewards obtained by an AI system to depend on how the environment responds to the different actions taken by the AI system.

In what follows, we focus on the problem of finding a policy which leads Rumba from its starting state s_t as quickly as possible to a charging station, i.e., to some state s ∈ C ⊆ S. As we just discussed, this is equivalent to finding a policy which has maximum value function v_π(s_t) (cf. (3)) for any possible start state s_t ∈ S.
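To make (2)–(4) concrete, the following snippet (building on the illustrative grid-world code above) rolls out a policy and accumulates the discounted return of (3); the "always go east" policy and the truncation horizon are arbitrary choices for illustration.

```python
# Sketch: executing a policy (eq. (2)) and computing its discounted return (eq. (3)).
# Reuses the illustrative T, R and GAMMA defined earlier.

def rollout_return(policy, s_t, horizon=100):
    """Follow `policy` from start state s_t and sum the discounted rewards.
    The infinite sum in eq. (3) is truncated after `horizon` steps, which is
    accurate up to gamma**horizon because the rewards are bounded."""
    s, value = s_t, 0.0
    for j in range(horizon):
        a = policy(s)                      # a_t = pi(s_t)
        value += GAMMA ** j * R(s, a)      # gamma^j * R(s_{t+j}, a_{t+j})
        s = T(s, a)                        # s_{t+1} = T(s_t, a_t)
    return value

go_east = lambda s: "E"                    # an arbitrary example policy
print(rollout_return(go_east, s_t=(1, 1)))
```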

3 Computing Optimal Policies for Known MDPs

Given a particular MDP model M = ⟨S, A, T, R, γ⟩, we are interested in finding an optimal policy π*, i.e., one having maximum value function

    v_{π*}(s) = max_π v_π(s) for all states s ∈ S.    (5)

As discussed in a previous chapter, we can find optimal policies indirectly by first computing the optimal value function

    v*(s) = max_π v_π(s)    (6)

and then acting greedily according to v*(s) (see the chapter on MDPs for details). The computation of the optimal value function v*(s), in turn, can be accomplished using, e.g., the value iteration algorithm, which we repeat here as Algorithm 1 for convenience.

Algorithm 1 Value Iteration (deterministic MDP)
Input: MDP M = ⟨S, A, T, R, γ⟩ with discount factor γ ∈ (0, 1); error tolerance η
Initialize: v_0(s) := 0 for every state s ∈ S; iteration counter k := 0
Step 1: for each state s ∈ S, update the value function

    v_{k+1}(s) = max_{a ∈ A} [ R(s, a) + γ v_k(T(s, a)) ]    (7)

Step 2: increment the iteration counter k := k + 1
Step 3: if max_{s ∈ S} |v_k(s) − v_{k−1}(s)| > η, go to Step 1
Output: estimate (approximation) v_k(s) of the optimal value function v*

Given the output v_k(s) of Algorithm 1, which is an approximation of the optimal value function v* of the MDP M, we can find a corresponding (nearly) optimal policy π̂(s) by acting greedily, i.e.,

    π̂(s) := argmax_{a ∈ A} [ R(s, a) + γ v_k(T(s, a)) ].    (8)

There are different options for deciding when to stop the iterations of the value iteration Algorithm 1. For example, we could stop the updates (7) after a fixed but sufficiently large number of iterations. Another option, which is used in the above formulation of Algorithm 1, is to monitor the difference max_{s ∈ S} |v_k(s) − v_{k−1}(s)| between successive iterates v_k and v_{k−1} and stop as soon as this difference falls below the pre-specified threshold η (which is an input parameter of Algorithm 1). One appealing property of this stopping condition is that it allows us to guarantee a bound on the sub-optimality of the policy π̂(s) obtained via (8) from the output v_k(s) of Algorithm 1. In particular, the value function v_{π̂} of the policy π̂(s) obtained by Algorithm 1 and (8) deviates from the optimal value function v* by no more than 2ηγ/(1 − γ), i.e.,

    max_{s ∈ S} |v_{π̂}(s) − v*(s)| ≤ 2ηγ/(1 − γ).    (9)

We refer to [1] for a formal proof of this bound.
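For the illustrative grid-world MDP sketched earlier, Algorithm 1 and the greedy policy (8) might be implemented along the following lines (the tolerance value is an arbitrary example).

```python
# Sketch of Algorithm 1 (value iteration) and the greedy policy of eq. (8),
# for the illustrative grid-world MDP (STATES, ACTIONS, T, R, GAMMA) above.

def value_iteration(eta=1e-6):
    v = {s: 0.0 for s in STATES}                       # v_0(s) = 0
    while True:
        v_new = {s: max(R(s, a) + GAMMA * v[T(s, a)] for a in ACTIONS)
                 for s in STATES}                      # update (7)
        if max(abs(v_new[s] - v[s]) for s in STATES) <= eta:
            return v_new                               # stopping condition of Step 3
        v = v_new

def greedy_policy(v):
    """Eq. (8): act greedily with respect to an (approximate) optimal value function."""
    return {s: max(ACTIONS, key=lambda a: R(s, a) + GAMMA * v[T(s, a)])
            for s in STATES}

v_hat = value_iteration()
pi_hat = greedy_policy(v_hat)
```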

In order to implement Algorithm 1, we need to know that the stopping condition

    max_{s ∈ S} |v_k(s) − v_{k−1}(s)| ≤ η    (10)

is satisfied after a reasonable number of iterations. This can be verified using a rather elegant argument which interprets the update (7) as a fixed-point iteration

    v_{k+1} = P v_k.    (11)

Here, P denotes an operator which maps a value function v_k : S → ℝ to another value function v_{k+1} : S → ℝ according to the rule

    v_{k+1}(s) = max_{a ∈ A} [ R(s, a) + γ v_k(T(s, a)) ].    (12)

This operator P is a contraction with rate not larger than the discount factor γ of the MDP, i.e., with ‖v‖ := max_{s ∈ S} |v(s)|,

    ‖P v_k − P v_{k−1}‖ ≤ γ ‖v_k − v_{k−1}‖,    (13)

and, in turn, since v* is a fixed point of P,

    ‖v_{k+1} − v*‖ = ‖P v_k − P v*‖ ≤ γ ‖v_k − v*‖.    (14)

Thus, the difference max_{s ∈ S} |v_k(s) − v_{k−1}(s)|, as well as the deviation max_{s ∈ S} |v_k(s) − v*(s)| from the optimal value function v*, decays exponentially fast: each additional iteration of Algorithm 1 reduces the difference by a factor γ < 1.

Carefully note that in order to execute the value iteration algorithm we need to know the reward function R(s, a) and the transition map T(s, a) for all possible state-action pairs ⟨s, a⟩ ∈ S × A. If we have this information at our disposal, we can compute the optimal value function using the value iteration Algorithm 1 and, in turn, determine a (nearly) optimal policy using (8). However, what should we do if we do not know the rewards and the transition map before we actually implement a policy that lets the AI system (e.g., Rumba) interact with its environment? It turns out that some rather intuitive modifications of the value iteration Algorithm 1 allow us to learn the optimal policy on the fly, i.e., while executing actions a_t in the current state s_t and observing the resulting rewards R(s_t, a_t) and state transitions ⟨s_t, a_t⟩ → s_{t+1} of the AI system. To this end, we first rewrite the value iteration Algorithm 1 in an equivalent way by using action-value functions q(s, a) instead of value functions v(s).

Algorithm 2 Value Iteration II (deterministic MDP)
Input: MDP model M = ⟨S, A, T, R, γ⟩; error tolerance η > 0
Initialize: q_0(s, a) := 0 for every ⟨s, a⟩ ∈ S × A; iteration counter k := 0
Step 1: for each state-action pair ⟨s, a⟩ ∈ S × A, update

    q_{k+1}(s, a) = R(s, a) + γ max_{a' ∈ A} q_k(T(s, a), a')    (15)

Step 2: increment the iteration counter k := k + 1
Step 3: if max_{s ∈ S, a ∈ A} |q_k(s, a) − q_{k−1}(s, a)| > η, go to Step 1
Output: estimate (approximation) q_k(s, a) of the optimal action-value function q* (cf. (17))

Similarly to Algorithm 1 and (8), we can read off an (approximately) optimal policy π̂(s) from the output q_k of Algorithm 2 by acting greedily, i.e.,

    π̂(s) := argmax_{a ∈ A} q_k(s, a).    (16)
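A sketch of Algorithm 2 for the same illustrative grid-world MDP is given below; note that the greedy read-out (16) uses only the q-table itself and not the transition map T.

```python
# Sketch of Algorithm 2 (value iteration on action-value functions) for the
# illustrative grid-world MDP above.

def q_value_iteration(eta=1e-6):
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}        # q_0(s, a) = 0
    while True:
        q_new = {(s, a): R(s, a) + GAMMA * max(q[(T(s, a), b)] for b in ACTIONS)
                 for s in STATES for a in ACTIONS}            # update (15)
        if max(abs(q_new[k] - q[k]) for k in q) <= eta:
            return q_new
        q = q_new

q_hat = q_value_iteration()
pi_hat = {s: max(ACTIONS, key=lambda a: q_hat[(s, a)]) for s in STATES}   # eq. (16)
```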

In contrast to Algorithm 1, which is based on estimating the optimal value function v* of an MDP, Algorithm 2 aims at estimating (approximating) the optimal action-value function

    q*(s, a) = max_π q_π(s, a),    (17)

where the maximum is taken over all possible policies π : S → A. While the functions v* and q* provide essentially the same information about the optimal policies of an MDP, the optimal action-value function q* makes it somewhat easier to read off the corresponding optimal (greedy) policy in (16), compared to (8). We also highlight that, in contrast to (16), computing the greedy policy (8) for a given value function still involves the transition map T of the underlying MDP. The fact that (16) does not involve this transition map anymore will be convenient when facing applications with unknown MDP models.

Note that Step 1 of Algorithm 2 (cf. (15)) requires looping over all states and all possible actions and evaluating the transition map T(s, a) and the reward function R(s, a). This can be a challenge when implementing Algorithm 2, since the state space and action set can be very large (consider, e.g., the state space of an autonomous ship obtained by discretising the earth's surface into squares with side length 1 km), which makes the execution of Algorithm 2 slow. Moreover, and even more seriously, in an unknown environment (e.g., a new office room which has to be cleaned by Rumba), we do not know the transition map (which depends on the locations of obstacles and charging stations) beforehand. We have to learn about the environment by taking actions a_t when being in state s_t and observing the resulting new state s_{t+1} and reward R(s_t, a_t) (see Figure 3).

Figure 3: Starting in some state s_t, Rumba takes action a_t which (depending on the location of obstacles and charging stations) leads it into the new state s_{t+1}, where Rumba takes another action a_{t+1}, and so on. The usefulness of taking action a_t when being in state s_t is indicated by the obtained reward R(s_t, a_t).

4 Computing Optimal Policies for Unknown MDPs

Up to now we have considered MDP models and methods for finite state spaces S and action sets A. For such MDPs we can represent an action-value function q(s, a) as a table with |A| rows indexed by the actions a ∈ A and |S| columns indexed by the states s ∈ S:

           s=⟨1,1⟩        s=⟨2,1⟩   ...   s=⟨6,1⟩
  a = W    q(⟨1,1⟩, W)    ...       ...   q(⟨6,1⟩, W)
  a = E    q(⟨1,1⟩, E)    ...       ...   q(⟨6,1⟩, E)

This tabular representation allows for a simple analysis of the resulting MDP methods and also helps to develop some intuition for the dynamic behaviour of those methods.

However, many if not most AI applications do not allow for a tabular representation, since the state space is too large or even infinite. We then have to use function approximation methods in order to efficiently represent and work with action-value functions.

4.1 Tabular Methods

We now introduce one of the core algorithms underlying many modern reinforcement learning methods: the Q-learning algorithm. This algorithm can be interpreted as a variation of Algorithm 2 (value iteration) which identifies the iteration counter k in Algorithm 2 with a real-time index t. Thus, while Algorithm 2 can be executed ahead of any actual implementation of a policy in the AI system, i.e., we can do planning with Algorithm 2, the Q-learning algorithm learns the optimal behaviour while the AI system operates (takes actions, receives rewards and experiences state transitions to new states) in its environment.

Algorithm 3 Q-Learning (deterministic MDP)
Input: discount factor γ ∈ (0, 1); start time t
Initialize: set q_{t−1}(s, a) := 0 for every ⟨s, a⟩ ∈ S × A; determine the current state s_t
Step 1: take some action a_t ∈ A
Step 2: wait until the new state s_{t+1} is reached and the reward R(s_t, a_t) is received
Step 3: update the action-value function at the particular state-action pair ⟨s_t, a_t⟩ ∈ S × A,

    q_t(s_t, a_t) = R(s_t, a_t) + γ max_{a' ∈ A} q_{t−1}(s_{t+1}, a'),    (18)

and copy the action-value estimates for all other state-action pairs, i.e.,

    q_t(s, a) = q_{t−1}(s, a) for all ⟨s, a⟩ ∈ S × A \ {⟨s_t, a_t⟩}    (19)

Step 4: if not converged, update the time index t := t + 1 and go to Step 1
Output: estimate Q(s, a) = q_t(s, a) of the optimal action-value function q* (cf. (17))

It can be shown that the iterates q_t(s, a) generated by Algorithm 3 converge to the true optimal action-value function q*(s, a) of the underlying MDP whenever each possible action a ∈ A is taken in each possible state s ∈ S infinitely often. Consider a time interval I = {t_1, ..., t_2}, which we refer to as a full interval, such that each possible state-action pair ⟨s, a⟩ ∈ S × A occurs at least once as the current state s_t and action a_t at some time t ∈ I. As shown in [2], after each full interval the approximation error achieved by Algorithm 3 is reduced by a factor γ, i.e.,

    max_{⟨s,a⟩ ∈ S×A} |q_{t_2}(s, a) − q*(s, a)| ≤ γ max_{⟨s,a⟩ ∈ S×A} |q_{t_1}(s, a) − q*(s, a)|.    (20)

Thus, after a sufficient number of full intervals, the function q_t(s, a) generated by Algorithm 3 is an accurate approximation of the optimal action-value function q*(s, a) of the unknown (!) underlying MDP. We can then read off an (approximately) optimal policy for the MDP by acting greedily, i.e.,

    π̂(s) := argmax_{a ∈ A} Q(s, a).    (21)
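The following snippet sketches Algorithm 3 on the illustrative grid world from above, accessing the environment only through the observed transitions and rewards; the purely random action selection and the number of steps are arbitrary example choices that merely serve to visit every state-action pair over and over.

```python
import random

# Sketch of Algorithm 3 (tabular Q-learning) on the illustrative grid world above.
# The environment is accessed only through the observed transitions and rewards,
# mimicking an MDP whose transition map is unknown to the learner.

def q_learning(num_steps=20000, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}    # q_{t-1}(s, a) = 0
    s = STATES[0]                                         # current state s_t
    for _ in range(num_steps):
        a = rng.choice(sorted(ACTIONS))                   # Step 1: take some action a_t
        s_next, r = T(s, a), R(s, a)                      # Step 2: observe s_{t+1} and reward
        q[(s, a)] = r + GAMMA * max(q[(s_next, b)] for b in ACTIONS)  # Step 3: update (18)
        s = s_next                                        # Step 4: continue from s_{t+1}
    return q

Q = q_learning()
pi_hat = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}   # eq. (21)
```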

It is important to note that the actions a_t chosen in Step 1 of Algorithm 3 need not be related at all to the current estimate q_t(s, a) of the action-value function. Rather, these actions have to ensure that each possible pair ⟨s, a⟩ ∈ S × A of state and action occurs infinitely often as the current state s_t and action a_t during the execution of Algorithm 3. In practice, we have to stop Algorithm 3 after a finite amount of time. An estimate of the minimum time required to ensure a small deviation of q_t(s, a) from the optimal action-value function q*(s, a) can be obtained from (20).

The execution of Algorithms 2 and 3 can be illustrated nicely by thinking of the action-value function q(s, a) as a table whose columns represent the different states and whose rows represent the different actions which the AI system can take:

           s=⟨1,1⟩   s=⟨2,1⟩   s=⟨3,1⟩   s=⟨4,1⟩   s=⟨5,1⟩   s=⟨6,1⟩
  a = W    0         0         -1        0         0         0
  a = E    0         0         -1        0         0         0

4.2 Function Approximation Methods

The applicability of Algorithm 3 is limited to MDP models with a (small) finite state space S = {1, ..., |S|} and action set A = {1, ..., |A|}. However, many AI applications involve extremely large state spaces which might even be continuous. Consider, e.g., an MDP model for an autonomous ship, which can be interpreted as an AI system operating within its physical environment (including the sea water and the near-ground atmosphere). The state of an autonomous ship is typically characterized by its current coordinates (longitude and latitude) and velocity [3].

We will now extend the tabular Algorithm 3 to cope with such applications, where the state space and action set are so large that we cannot represent them by a simple table anymore. The key idea is to use function approximation, somewhat similar to machine learning methods for approximating predictors, in order to handle high-dimensional action-value functions. In particular, we approximate the optimal action-value function q*(s, a) of a given MDP model as

    q*(s, a) ≈ φ(s, a; w)    (22)

with a function φ(s, a; w) which is parametrized by some weight vector w ∈ ℝ^d. We have seen two examples of such parametrized functions in the chapter "Elements of Machine Learning": the class of linear functions and the class of functions represented by an artificial neural network. Let us consider in what follows the special case of linear function approximation, i.e.,

    φ(s, a; w) := w^T x(s, a) = Σ_{j=1}^d w_j x_j(s, a)    (23)

with some feature vector x(s, a) = (x_1(s, a), ..., x_d(s, a))^T ∈ ℝ^d.
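As a small illustration of (23), one might hand-craft a few features for the grid-world example and evaluate the linear approximation as follows; the particular features used here are made up purely for illustration.

```python
import numpy as np

# Sketch of the linear approximation (23): phi(s, a; w) = w^T x(s, a).
# The features below are made-up examples; in general the feature map is a
# design choice, as discussed in the text.

def x(s, a):
    """Example feature vector x(s, a) in R^4 for the grid world."""
    m, y = s
    return np.array([
        1.0,                              # constant (bias) feature
        m / K,                            # normalized horizontal coordinate
        y / L,                            # normalized vertical coordinate
        1.0 if a in ("N", "E") else 0.0,  # crude indicator of the chosen action
    ])

def phi(s, a, w):
    """Linear approximation of the action-value function, eq. (23)."""
    return float(w @ x(s, a))

w = np.zeros(4)                           # weight vector w in R^d, here d = 4
print(phi((1, 1), "E", w))
```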

The feature vector x(s, a) depends on the current state s of, and the action a taken by, the AI system. However, we can evaluate the approximation φ(s, a; w) of the true action-value function q*(s, a) using only the weight vector w and the features x_j(s, a) directly, without determining the underlying state s ∈ S (which might be difficult). Thus, we only work in the space of weight vectors w ∈ ℝ^d, which is typically much smaller than the state space S of the MDP.

We highlight that, to a large extent, the particular definition of the features x(s, a) is a design choice. In principle, we can use any quantity that can be determined or measured by the AI system. For the particular AI application of the cleaning robot Rumba, we might use as features certain properties of the current snapshot taken by its on-board cameras. These features clearly depend on the underlying state (location), but we do not need the state itself in order to implement (23); this can be done directly from the camera snapshot.

In order to implement the linear function approximation (23), we have to choose the weight vector w suitably. To this end, we can use the information acquired by observing Rumba's behaviour, i.e., the state transitions from s_t to s_{t+1} under action a_t and the rewards R(s_t, a_t) (see Figure 3). Let us assume that Rumba takes action a_t when being in state s_t, which results in the reward R(s_t, a_t) and the new state s_{t+1} = T(s_t, a_t). How can we judge the quality of the approximation φ(s_t, a_t; w^(t)), using our current choice w^(t) of the weight vector, of the true optimal action-value function q*(s_t, a_t)? It seems reasonable to measure this quality by the squared error

    ε(w^(t)) := (q*(s_t, a_t) − φ(s_t, a_t; w^(t)))²    (24)

and then to adjust the weight vector w in order to decrease this error. This reasoning leads us to the gradient descent update [4]

    w^(t+1) = w^(t) − (α/2) ∇ε(w^(t))    (25)

with some suitably chosen step size (or learning rate) α (the factor 1/2 is just for notational convenience). One of the appealing properties of using a linear function approximation (23) is that we obtain a simple expression for the gradient of the error:

    ∇ε(w^(t)) = 2 (φ(s_t, a_t; w^(t)) − q*(s_t, a_t)) x(s_t, a_t).    (26)

Note that we cannot directly evaluate (26) to compute the gradient, as it involves the unknown action-value function q*(s_t, a_t). After all, our goal is to find a good estimate (approximation) q(s, a; w) of this unknown action-value function based on observing the behaviour (state transitions and rewards, see Figure 3) of the AI system. While we do not have access to the exact action-value function q*(s_t, a_t), we can construct an estimate of it by using the Bellman equation (cf. the chapter on MDPs):

    q*(s_t, a_t) = R(s_t, a_t) + γ max_{a' ∈ A} q*(T(s_t, a_t), a')
                 ≈ R(s_t, a_t) + γ max_{a' ∈ A} φ(T(s_t, a_t), a'; w^(t))
                 = R(s_t, a_t) + γ max_{a' ∈ A} φ(s_{t+1}, a'; w^(t)).    (27)

By inserting (27) into (26), we obtain the following variant of Q-learning using linear function approximation [5].

Algorithm 4 Q-Learning with Linear Value Function Approximation (deterministic MDP)
Input: discount factor γ ∈ (0, 1); learning rate α > 0
Initialize: w^(t) := 0
Step 1: take some action a_t ∈ A
Step 2: wait until the new state s_{t+1} is reached and the reward R(s_t, a_t) is received
Step 3: update the weight vector (approximate gradient descent step)

    w^(t+1) = w^(t) + α ( R(s_t, a_t) + γ max_{a' ∈ A} (w^(t))^T x(s_{t+1}, a') − (w^(t))^T x(s_t, a_t) ) x(s_t, a_t),    (28)

where R(s_t, a_t) + γ max_{a' ∈ A} (w^(t))^T x(s_{t+1}, a') plays the role of the estimate (27) of q*(s_t, a_t), and (w^(t))^T x(s_t, a_t) = φ(s_t, a_t; w^(t)).
Step 4: if not converged, update the time index t := t + 1 and go to Step 1
Output: estimate Q(s, a) = (w^(t))^T x(s, a) of the optimal action-value function q*

Note that the execution of Algorithm 4 does not require determining the current state s_t itself, but only the features x(s_t, a), which can typically be determined more easily. In particular, for the cleaning robot Rumba, determining the state s_t (defined as its current location) requires an indoor positioning system, while the features x(s_t, a) might be simple characteristics of the current snapshot taken by an on-board camera.
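A minimal sketch of Algorithm 4 is shown below; it reuses the made-up feature map x(s, a) from above and the illustrative grid world as a stand-in for the environment, and the learning rate and number of steps are arbitrary example values.

```python
import random
import numpy as np

# Sketch of Algorithm 4: Q-learning with linear value function approximation.
# Reuses the made-up feature map x(s, a) and the illustrative grid-world
# environment (STATES, ACTIONS, T, R, GAMMA) introduced earlier.

def q_learning_linear(num_steps=20000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    w = np.zeros_like(x(STATES[0], "N"))                # initialize w^(t) := 0
    s = STATES[0]
    for _ in range(num_steps):
        a = rng.choice(sorted(ACTIONS))                 # Step 1: take some action a_t
        s_next, r = T(s, a), R(s, a)                    # Step 2: observe s_{t+1} and reward
        target = r + GAMMA * max(w @ x(s_next, b) for b in ACTIONS)   # estimate (27)
        w = w + alpha * (target - w @ x(s, a)) * x(s, a)              # update (28)
        s = s_next
    return w

w_hat = q_learning_linear()
Q = lambda s, a: float(w_hat @ x(s, a))                 # Q(s, a) = w^T x(s, a)
```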

References

[1] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, 3rd ed. Athena Scientific, 2007.
[2] T. Mitchell, Lecture notes - Machine Learning, February 2011.
[3] T. Perez, Ship Motion Control. London: Springer, 2005.
[4] A. Jung, "A fixed-point of view on gradient methods for big data," Frontiers in Applied Mathematics and Statistics, vol. 3, 2017. [Online]. Available: https://www.frontiersin.org/article/10.3389/fams.2017.00018
[5] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674-690, May 1997.

5 Exercises

Problem 1. Consider an MDP model for an autonomous ship. The state s of the ship is determined by the latitude and longitude of its current location. We encode these coordinates using real numbers and therefore use the state space S = ℝ². The action set is A = [−π, π], which corresponds to all possible steering directions relative to some reference direction (e.g., north). For a fixed s ∈ S and a ∈ A, consider the approximate action-value φ(s, a; w) = w^T x(s, a), which is a linear function of the feature vector x(s, a) = (‖s‖², a‖s‖, exp(−‖s‖²))^T ∈ ℝ³ depending on the current state s and the action a taken by the AI sailor. In order to adapt the weight vector w such that φ(s, a; w) is a good approximation (or estimate) of the (unknown) optimal action-value function q*(s, a) of the MDP, we typically need to determine the gradient ∇φ(s, a; w) of φ(s, a; w) w.r.t. the weight vector w ∈ ℝ³. Which of the following statements is true?

- ∇φ(s, a; w) = x(s, a)
- ∇φ(s, a; w) = w
- ∇φ(s, a; w) = w^T x(s, a)
- ∇φ(s, a; w) = (x(s, a))^T w

Problem 2. Consider developing an AI system for autonomous ships. We are interested in learning a predictor h(x) which maps a single feature x ∈ ℝ of the current state (this feature might be determined from the on-board cameras as well as GPS sensors) to a predicted optimal steering direction a ∈ [−π, π]. We have collected training data X = {(x^(t), a_t)}_{t=1}^N by observing the steering direction a_t chosen by an experienced human sailor at time t when the feature value x^(t) was observed. This training data is stored in the file https://version.aalto.fi/gitlab/junga1/mlbp2017public/blob/master/training.csv, which contains in its t-th row the feature value x^(t) and the corresponding steering direction a_t chosen at time t. We restrict the search for good predictors h(x) to functions of the form h(x) = c·x + d with parameters c, d chosen so as to make the mean squared error

    E(c, d) := (1/N) Σ_{t=1}^N (a_t − h(x^(t)))² = (1/N) Σ_{t=1}^N (a_t − (c·x^(t) + d))²    (29)

as small as possible. Let us denote the optimal parameter values by c_opt and d_opt, i.e., E(c_opt, d_opt) = min_{c,d ∈ ℝ} E(c, d). Which of the following statements are (or is ;-) true?

- c_opt ∈ [0, 2]
- c_opt ∈ [100, 200]
- d_opt ∈ [100, 200]
- d_opt ∈ [0, 2]

Problem 3. Reconsider the setting of Problem 2, i.e., we are interested in predicting the optimal steering direction a ∈ [−π, π] using a predictor h(x) = c·x + d with parameters c, d and the single feature x which summarizes the current state of the ship. In contrast to Problem 2, we only have access to two training examples (x^(1), a_1) and (x^(2), a_2) with x^(1) ≠ x^(2). How small can we make the mean squared error

    E(c, d) = (1/2) Σ_{t=1}^2 (a_t − h(x^(t)))², with h(x^(t)) = c·x^(t) + d,    (30)

by tuning the parameters c, d?

- we can always find c, d such that E(c, d) = 0
- E(c, d) is always lower bounded by x^(1), i.e., E(c, d) ≥ x^(1) for all c, d ∈ ℝ
- E(c, d) is always lower bounded by x^(2), i.e., E(c, d) ≥ x^(2) for all c, d ∈ ℝ
- E(c, d) is always lower bounded by a_2, i.e., E(c, d) ≥ a_2 for all c, d ∈ ℝ

Problem 4. Consider the problem of predicting a real-valued label y (e.g., the local temperature) based on two features x = (x_1, x_2)^T ∈ ℝ² (e.g., the current GPS coordinates). We try to predict the label y using a linear predictor h(x) = w^T x with some weight vector w ∈ ℝ². We choose the weight vector w by evaluating the mean squared error incurred on a training dataset X_train = {(x^(t), y_t)}, which is available in the file https://version.aalto.fi/gitlab/junga1/mlbp2017public/blob/master/training2.csv. The t-th row of this file begins with the two features x_1^(t), x_2^(t), which are followed by the corresponding label y_t. We then choose the optimal weight vector by minimizing the mean squared error (training error), i.e.,

    w_opt = argmin_{w ∈ ℝ²} f(w), where f(w) := (1/|X_train|) Σ_{(x,y) ∈ X_train} (y − w^T x)².    (31)

In order to compute the optimal weight vector (31), we can use gradient descent

    w^(k+1) = w^(k) − α ∇f(w^(k))    (32)

with some learning rate α > 0 and initial guess w^(0) = (1, 1)^T ∈ ℝ². Which of the following statements is true?

- the iterates f(w^(k)), k = 0, 1, ..., converge for α = 1/2
- the iterates f(w^(k)), k = 0, 1, ..., converge for α = 10
- the iterates f(w^(k)), k = 0, 1, ..., converge for α = 1/4
- the iterates f(w^(k)), k = 0, 1, ..., converge for α = 1/8
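As a starting point for Problem 4, the following snippet sketches one way to run the gradient descent iteration (32) on data with the layout described above; whether the iterates converge for a given α depends on the actual data, and the local file name, the comma delimiter and the number of iterations are assumptions made for the sketch.

```python
import numpy as np

# Sketch for Problem 4: gradient descent (32) on the training error f(w) of eq. (31).
# Assumes training2.csv has been downloaded locally with comma-separated rows "x1, x2, y".

data = np.loadtxt("training2.csv", delimiter=",")
X, y = data[:, :2], data[:, 2]
N = len(y)

def f(w):
    return np.mean((y - X @ w) ** 2)             # training error, eq. (31)

def grad_f(w):
    return (2.0 / N) * X.T @ (X @ w - y)         # gradient of the mean squared error

for alpha in (1/2, 10, 1/4, 1/8):
    w = np.array([1.0, 1.0])                     # initial guess w^(0) = (1, 1)^T
    for _ in range(1000):
        w = w - alpha * grad_f(w)                # update (32)
    # divergence shows up as huge or NaN values of f(w)
    print(f"alpha = {alpha}: f(w) after 1000 iterations = {f(w):.4g}")
```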

Problem 5. Consider the problem of predicting a real-valued label y (e.g., the local temperature) based on two features x = (x_1, x_2)^T ∈ ℝ² (e.g., the current GPS coordinates). We try to predict the label y using a linear predictor h(x) = w^T x with some weight vector w ∈ ℝ². The choice of the weight vector w is guided by the mean squared error incurred on a training dataset X_train = {(x^(t), y_t)}, which is available in the file https://version.aalto.fi/gitlab/junga1/mlbp2017public/blob/master/training2.csv. The t-th row of this file starts with the features x_1^(t), x_2^(t), followed by the corresponding label y_t. In some applications it is beneficial not only to aim at the smallest training error but also to enforce a small norm ‖w‖ of the weight vector. This leads naturally to the regularized linear regression problem for finding the optimal weight vector:

    w^(λ) = argmin_{w ∈ ℝ²} (1/|X_train|) Σ_{(x,y) ∈ X_train} (y − w^T x)² + λ ‖w‖²₂.    (33)

For different choices of λ ∈ [0, 1] (e.g., 10 evenly spaced values between 0 and 1), determine the optimal weight vector w^(λ) by solving the optimization problem (33). You might use the gradient descent method discussed in the chapter "Elements of Machine Learning" in order to compute w^(λ) (approximately). Then, for each value of λ which results in a different w^(λ), determine the training error

    TrainError(λ) := (1/|X_train|) Σ_{(x,y) ∈ X_train} (y − (w^(λ))^T x)²    (34)

and the validation (or test) error

    ValidationError(λ) := (1/|X_val|) Σ_{(x,y) ∈ X_val} (y − (w^(λ))^T x)²,    (35)

which is the mean squared error incurred on the validation dataset X_val available in the file https://version.aalto.fi/gitlab/junga1/mlbp2017public/blob/master/validate2.csv. Which of the following statements is true?

- the training error TrainError(λ) always decreases with increasing λ
- the training error TrainError(λ) never decreases with increasing λ
- the validation error ValidationError(λ) always decreases with increasing λ
- the validation error ValidationError(λ) always increases with increasing λ
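The following snippet sketches the λ-sweep of Problem 5; instead of gradient descent it uses the closed-form minimizer of the regularized problem (33), and it assumes that both CSV files have been downloaded locally with comma-separated rows "x1, x2, y".

```python
import numpy as np

# Sketch for Problem 5: sweep lambda and compare training and validation errors.
# Uses the closed-form solution of the regularized problem (33) instead of
# gradient descent; assumes both CSV files are available locally.

train = np.loadtxt("training2.csv", delimiter=",")
val = np.loadtxt("validate2.csv", delimiter=",")
X_tr, y_tr = train[:, :2], train[:, 2]
X_va, y_va = val[:, :2], val[:, 2]
N = len(y_tr)

for lam in np.linspace(0.0, 1.0, 10):
    # minimizer of (1/N) * ||y - X w||^2 + lam * ||w||^2, i.e., eq. (33)
    w_lam = np.linalg.solve(X_tr.T @ X_tr / N + lam * np.eye(2), X_tr.T @ y_tr / N)
    train_err = np.mean((y_tr - X_tr @ w_lam) ** 2)       # TrainError(lambda), eq. (34)
    val_err = np.mean((y_va - X_va @ w_lam) ** 2)         # ValidationError(lambda), eq. (35)
    print(f"lambda = {lam:.2f}: train = {train_err:.4g}, validation = {val_err:.4g}")
```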