CS599 Lecture 1 Introduction To RL

1 CS599 Lecture 1: Introduction to RL
Topics: Reinforcement Learning Introduction; Learning from Rewards; Policies; Value Functions; Rewards; Models of the Environment; Exploitation vs. Exploration; Dynamic Programming; Temporal Difference Learning; Q-Learning.
Handout: Class Notes.
Reading assignment for next class: Chapters 1-4.

2 Goals for the Course
Highlights: an introduction to reinforcement learning with a view towards solving real-world problems; the state of the art of learning control with reinforcement learning; projects possible with simulated or actual robots.
Course Description: This course will introduce and discuss machine learning methods for learning control, with a particular focus on robotics, but also applicable to models of learning in biology and any other control process. The course will cover the basics of reinforcement learning with value functions (dynamic programming, temporal difference learning, Q-learning). The emphasis, however, will be on learning methods that scale to complex, high-dimensional control problems. Thus, we will cover function approximation methods for reinforcement learning, policy gradients, probabilistic reinforcement learning, learning from trajectory trials, optimal control methods, stochastic optimal control methods, dynamic Bayesian networks for learning control, and Gaussian processes for reinforcement learning, among other topics.

3 Reminder: Supervised Learning
Supervised learning provides a signed error vector, i.e., the gradient.
Training info = desired (target) outputs.
Inputs -> Supervised Learning System -> Outputs.
Error = (target output - actual output).

4 Evaluative Feedback
Evaluating actions vs. instructing by giving correct actions. Pure evaluative feedback depends totally on the action taken. Pure instructive feedback depends not at all on the action taken. Supervised learning is instructive; optimization and reinforcement learning are evaluative.

5 What is Reinforcement Learning?
Reinforcement learning is learning what to do (how to map situations to actions) so as to maximize a numerical reward signal, i.e., evaluative feedback. There is no information about which actions to take; only the reward signal is given. Actions in the present may affect future rewards, so there is a temporal credit assignment problem. Reinforcement learning requires learning certain functional relationships, and thus builds on techniques of function approximation. Reinforcement learning considers an entire learning problem, i.e., an agent interacting with the environment, so it is a more complex problem than most learning tasks.

6 What is Reinforcement Learning?
An approach to Artificial Intelligence. Learning from interaction. Goal-oriented learning. Learning about, from, and while interacting with an external environment. Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal.

7 Key Features of RL
The learner is not told which actions to take. Trial-and-error search. Possibility of delayed reward: sacrifice short-term gains for greater long-term gains. The need to explore and exploit. Considers the whole problem of a goal-directed agent interacting with an uncertain environment.

8 RL in the Context of Other Research Areas
Reinforcement Learning (RL) is connected to: Artificial Intelligence; Psychology; Control Theory and Operations Research; Neuroscience; Artificial Neural Networks.

9 Elements of Reinforcement Learning
Policies: map the perceived state to an action (can be probabilistic).
Reward functions: map the perceived state-action pair into a single number, an immediate reward (can be stochastic).
Value functions: map a state into the accumulated expected reward that would be received when starting in that state.
Models: predict the next state given the current state and action (can be probabilistic).
Objective: optimize reward!

10 Elements of RL
Policy: what to do.
Reward: what is good.
Value: what is good because it predicts reward.
Model (of the environment): what follows what.

11 The Agent-Environment Interface
Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$
The agent observes the state at step $t$: $s_t \in S$,
produces an action at step $t$: $a_t \in A(s_t)$,
gets the resulting reward: $r_{t+1} \in \mathbb{R}$,
and the resulting next state: $s_{t+1}$.
The interaction unfolds as the sequence $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$

12 Examples of Reinforcement Learning
Robocup Soccer Teams (Stone & Veloso, Riedmiller et al.): world's best player of simulated soccer, 1999; runner-up 2000.
Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry standard methods.
Dynamic Channel Assignment (Singh & Bertsekas, Nie & Haykin): world's best assigner of radio channels to mobile telephone calls.
Elevator Control (Crites & Barto): (probably) world's best down-peak elevator controller.
Many Robots: navigation, bi-pedal walking, grasping, switching between skills...
TD-Gammon and Jellyfish (Tesauro, Dahl): world's best backgammon player.

13 The Markov Property
By the state at step $t$, the book means whatever information is available to the agent at step $t$ about its environment. The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations. Ideally, a state should summarize past sensations so as to retain all essential information, i.e., it should have the Markov Property:
$$\Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0 \} = \Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t \}$$
for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.

14 Markov Decision Processes
If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give:
the state and action sets;
the one-step dynamics, defined by the transition probabilities
$$P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \} \quad \text{for all } s, s' \in S,\ a \in A(s);$$
the reward probabilities, summarized by the expected immediate rewards
$$R^a_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \} \quad \text{for all } s, s' \in S,\ a \in A(s).$$

15 Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy. State-value function for policy $\pi$:
$$V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s \Big\}$$
The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$. Action-value function for policy $\pi$:
$$Q^\pi(s,a) = E_\pi\{ R_t \mid s_t = s, a_t = a \} = E_\pi\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a \Big\}$$

16 Bellman Equation for a Policy
The basic idea:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots = r_{t+1} + \gamma\,( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots ) = r_{t+1} + \gamma R_{t+1}$$
So:
$$V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \}$$
Or, without the expectation operator:
$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$$

17 Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:
$$V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s,a) = \max_{a \in A(s)} E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \} = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$$
$V^*$ is the unique solution of this system of nonlinear equations.

18 Bellman Optimality Equation for Q*
$$Q^*(s,a) = E\Big\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\Big|\, s_t = s, a_t = a \Big\} = \sum_{s'} P^a_{ss'} \Big[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \Big]$$
$Q^*$ is the unique solution of this system of nonlinear equations.

19 The Reward Hypothesis
That all of what we mean by goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (reward).

20 Returns
Suppose the sequence of rewards after step $t$ is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$ What do we want to maximize? In general, we want to maximize the expected return, $E\{R_t\}$, for each step $t$.
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T,$$
where $T$ is a final time step at which a terminal state is reached, ending an episode.
(R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction)

21 Returns for Continuing Tasks
Continuing tasks: interaction does not have natural episodes. Discounted return:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$
where $\gamma$, $0 \le \gamma \le 1$, is the discount rate. Values of $\gamma$ near 0 make the agent shortsighted; values near 1 make it farsighted.
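As a small illustration of the discounted return, the sketch below sums a finite list of rewards with discount rate gamma, working backwards with the recursion R_t = r_{t+1} + gamma R_{t+1}. The function name `discounted_return` and the sample rewards are assumptions made for this example, not part of the lecture.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k] for a finite reward sequence.

    rewards[0] plays the role of r_{t+1}; the loop applies
    R_t = r_{t+1} + gamma * R_{t+1} from the last reward backwards.
    """
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


# Hypothetical reward sequence: three steps with reward 1 each.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With gamma = 0 only the first reward counts (shortsighted); with gamma = 1 the result is the plain sum used for episodic tasks.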

22 An Example
Avoid failure: the pole falling beyond a critical angle or the cart hitting the end of the track.
As an episodic task where the episode ends upon failure: reward = +1 for each step before failure, so return = number of steps before failure.
As a continuing task with discounted return: reward = -1 upon failure and 0 otherwise, so return = $-\gamma^k$ for $k$ steps before failure.
In either case, return is maximized by avoiding failure for as long as possible.

23 Another Example
Get to the top of the hill as quickly as possible.
reward = -1 for each step where not at the top of the hill, so return = -(number of steps before reaching the top of the hill).
Return is maximized by minimizing the number of steps to reach the top of the hill.

24 Example: Tic-Tac-Toe
Goal: learn to play an optimal game against, e.g., a particular opponent or the optimal playing opponent.
What is the state of the system? All possible board configurations.
What is an action? Putting down a new X in an empty field.
What is the reward? Say, +10 for winning and -1 for every move that does not win.

25 Dynamic Programming for Tic-Tac-Toe
Key Idea: use the value function to organize and structure the search for good policies.
Key Ingredients:
a model of the state transitions (i.e., the opponent);
an algorithm to compute the value function from a given policy;
an algorithm to compute the policy from a given value function;
the proof that an iteration between policy computation and value function computation converges to the optimal policy and value function!

26 Computing the Model
Observe VERY many Tic-Tac-Toe games. Count the number of times $n_{x_{n+1} x_n}$ that the opponent makes a particular move (leading to state $x_{n+1}$) in a particular state $x_n$, and the total number of times $n_{x_n}$ that the state is visited. Thus, we obtain a model of the opponent, indicating the opponent's state-transition probability:
$$P(x_{n+1} \mid x_n) = \frac{n_{x_{n+1} x_n}}{n_{x_n}}$$
For the given discrete-state example, this results in the state-transition matrix. Note: we could also learn this model in an on-line fashion, simultaneously with the policy and value function.
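The counting procedure can be sketched in a few lines. This is only an illustration under assumptions: observed transitions are given as (state, next_state) pairs, states are arbitrary hashable labels, and the function name `estimate_transition_model` is hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transition_model(transitions):
    """Estimate P(next_state | state) as count(state -> next_state) / count(state)."""
    pair_counts = defaultdict(Counter)
    for state, next_state in transitions:
        pair_counts[state][next_state] += 1

    model = {}
    for state, counter in pair_counts.items():
        visits = sum(counter.values())  # total number of visits to this state
        model[state] = {nxt: c / visits for nxt, c in counter.items()}
    return model


# Hypothetical observations: state "A" was visited three times, "B" once.
observed = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "A")]
model = estimate_transition_model(observed)
print(model["A"])  # relative frequencies of the successors of "A"
```

Each row of the resulting dictionary sums to 1, so it can be read as one row of the state-transition matrix.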

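27 Policy Evaluation
Goal: compute the value function under a given policy $\pi$ and model (state-transition matrix).
Remember: the value function is the expected long-term reward from a given state:
$$V^\pi(x) = E\{ r_{n+1} + \gamma r_{n+2} + \gamma^2 r_{n+3} + \cdots \mid x \}, \quad \gamma \in [0,1]$$
$$= E\{ r_{n+1} + \gamma V^\pi(x_{n+1}) \mid x \} = \sum_a \pi(x,a) \sum_{x_{n+1}} P(x_{n+1} \mid x) \left( r_{n+1} + \gamma V^\pi(x_{n+1}) \right)$$
Here $P(x_{n+1} \mid x)$ is the transition probability from the model; in general it also depends on the chosen action $a$.
For any given policy (stochastic or deterministic), repeated application of this update formula will lead to the correct value function under the given policy $\pi$:
$$V_{n+1}(x) = \sum_a \pi(x,a) \sum_{x_{n+1}} P(x_{n+1} \mid x) \left( r_{n+1} + \gamma V_n(x_{n+1}) \right)$$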
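A minimal sketch of this repeated update, assuming a tabular model stored as `P[state][action] = [(probability, next_state, reward), ...]` and a policy stored as `policy[state][action] = probability`. The data layout, the function name, and the two-state toy problem are assumptions chosen for illustration.

```python
def policy_evaluation(P, policy, gamma=0.9, tol=1e-10, max_sweeps=10_000):
    """Apply the policy-evaluation update until the values stop changing."""
    V = {s: 0.0 for s in P}
    for _ in range(max_sweeps):
        new_V = {
            s: sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            for s in P
        }
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    return V


# Hypothetical toy problem: in "s0", "stay" earns +1 and "go" ends the task.
P = {
    "s0": {"stay": [(1.0, "s0", 1.0)], "go": [(1.0, "end", 0.0)]},
    "end": {"stay": [(1.0, "end", 0.0)]},
}
policy = {"s0": {"stay": 0.5, "go": 0.5}, "end": {"stay": 1.0}}
print(policy_evaluation(P, policy))
```

For this toy problem the fixed point satisfies V(s0) = 0.5 (1 + 0.9 V(s0)), i.e. V(s0) = 0.5 / 0.55, which gives a quick sanity check on the iteration.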

28 Example: Grid World

29 Policy Improvement
Goal: compute a better policy given the value function.
Bellman's Principle of Optimality: an optimal policy has to be locally optimal as well. Thus, the policy can be improved by local improvements:
$$\pi_{n+1}(x) = \arg\max_a \sum_{x_{n+1}} P(x_{n+1} \mid x) \left( r_{n+1}(a, x) + \gamma V_n(x_{n+1}) \right)$$

30 Example: Grid World (cont'd)

31 Computing the Optimal Policy
Policy Iteration: update the value function (policy evaluation); update the policy (policy improvement); iterate until convergence. Policy iteration usually converges fairly fast.
Value Iteration: policy evaluation is expensive, since it takes several iterations to converge to the correct value function. Value iteration corresponds to a single policy evaluation step followed by a policy improvement step; the explicit policy update can be omitted by folding the maximization into the value update:
$$V_{n+1}(x) = \max_a \sum_{x_{n+1}} P(x_{n+1} \mid x) \left( r_{n+1}(a, x) + \gamma V_n(x_{n+1}) \right)$$
Asynchronous DP: it is not necessary to update all the states simultaneously; every state just has to be updated sufficiently often.
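The value iteration update can be sketched as follows. The model layout `P[state][action] = [(probability, next_state, reward), ...]`, the helper names, and the three-state chain are assumptions for illustration only; the greedy policy is read off once at the end, which corresponds to the omitted policy update step.

```python
def q_value(P, V, s, a, gamma):
    """Expected one-step reward plus discounted value of the successor."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma=0.9, tol=1e-10, max_sweeps=10_000):
    V = {s: 0.0 for s in P}
    for _ in range(max_sweeps):
        new_V = {s: max(q_value(P, V, s, a, gamma) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, policy


# Hypothetical chain 0 - 1 - 2; state 2 is an absorbing goal, each move costs -1.
P = {
    0: {"left": [(1.0, 0, -1.0)], "right": [(1.0, 1, -1.0)]},
    1: {"left": [(1.0, 0, -1.0)], "right": [(1.0, 2, -1.0)]},
    2: {"left": [(1.0, 2, 0.0)], "right": [(1.0, 2, 0.0)]},
}
V, policy = value_iteration(P)
print(V, policy)
```

In this chain the optimal values are V(1) = -1 and V(0) = -1 + 0.9 (-1) = -1.9, and the greedy action in states 0 and 1 is "right".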

32 Monte Carlo Methods
Goal: learn the value function only from experience, without knowledge of the model (environment) (on-line learning).
Advantage: real data is often easily obtained, while building models of the environment (e.g., density estimation) can be very hard.
Monte Carlo methods are episode-based: this assumes a trial ends after a while (absorbing states, a finite number of steps, and discounted reward).

33 Monte Carlo Methods: Example: Grid World
Monte Carlo Policy Evaluation:
start at a random state;
follow the policy, keeping the entire trajectory and rewards in memory;
after the goal is reached, update the values of all states on the trajectory, starting from the last state of the trajectory: each value becomes the average, over visits, of the discounted rewards received after this state.
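The backward update over a stored trajectory can be sketched as below. This is an every-visit variant under assumptions: each episode is a list of (state, reward) pairs, where the reward is the one received after leaving that state; the function name and the sample episodes are hypothetical.

```python
from collections import defaultdict


def mc_policy_evaluation(episodes, gamma=1.0):
    """Average the discounted returns observed after each visit to a state."""
    returns_sum = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk the trajectory backwards, starting from the last state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            visits[state] += 1
    return {s: returns_sum[s] / visits[s] for s in returns_sum}


# Hypothetical episodes: every step costs -1 until the goal is reached.
episodes = [
    [("A", -1.0), ("B", -1.0)],
    [("B", -1.0)],
]
print(mc_policy_evaluation(episodes))  # {'B': -1.0, 'A': -2.0}
```

No model of the environment is used: the estimate depends only on the sampled trajectories generated by following the policy.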

34 Temporal Difference Learning
TD is a combination of Monte Carlo techniques and DP to compute a value function. TD allows a more natural (on-line) calculation of the value function, as it only needs to look at states that are neighbors in TIME, de-emphasizing the knowledge of the spatial layout of states. In its simplest form, TD (actually called TD(0)) updates the value function as:
$$V(x(t)) \leftarrow V(x(t)) + \alpha \big( r(t+1) + \gamma V(x(t+1)) - V(x(t)) \big), \quad \alpha \in [0,1]$$
In order to obtain data, we need to follow the current policy for a while until sufficient data have been experienced.
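A single TD(0) update is small enough to write out directly. In this sketch the value function is a plain dictionary and the function name `td0_update` is an assumption; the numbers in the usage lines are made up for illustration.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Move V[state] towards the one-step target r + gamma * V[next_state]."""
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return td_error


# Hypothetical transition A -> B with reward 1.
V = {"A": 0.0, "B": 1.0}
error = td0_update(V, "A", reward=1.0, next_state="B", alpha=0.5, gamma=0.9)
print(error, V["A"])  # TD error 1.9, new value 0.95
```

Only two temporally adjacent states are touched, which is what allows the update to run on-line after every step.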

35 N-Step TD and TD(λ)
N-step TD is an update method between TD(0) and Monte Carlo methods (TD(1)): instead of taking just two temporally adjacent states into account for updating, a longer sequence of states is used. Recall the TD(0) update:
$$V(x(t)) \leftarrow V(x(t)) + \alpha \big( r(t+1) + \gamma V(x(t+1)) - V(x(t)) \big), \quad \alpha \in [0,1]$$
The n-step returns are:
$$R_1(t) = r(t+1) + \gamma V(x(t+1))$$
$$R_2(t) = r(t+1) + \gamma r(t+2) + \gamma^2 V(x(t+2))$$
$$R_n(t) = \sum_{i=1}^{n} \gamma^{i-1} r(t+i) + \gamma^n V(x(t+n))$$
When averaging over the $R_n(t)$ values, one obtains the TD(λ) method:
$$R_\lambda(t) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_n(t)$$

36 TD(λ) Implemented With Eligibility Traces
Using the concept of decaying activation traces (eligibility traces), TD(λ) can be implemented in a simple on-line version: every state gets assigned an eligibility trace $e(x)$, and the value function is updated as:
$$\delta = r(t+1) + \gamma V(x(t+1)) - V(x(t))$$
$$e(x(t)) \leftarrow e(x(t)) + 1$$
For all states $x$:
$$V(x) \leftarrow V(x) + \alpha\, \delta\, e(x)$$
$$e(x) \leftarrow \gamma \lambda\, e(x)$$
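One step of this on-line scheme might look as follows. The sketch assumes accumulating traces, dictionaries `V` and `e` keyed by state, and a hypothetical function name `td_lambda_step`; the two transitions in the usage lines are invented for illustration.

```python
def td_lambda_step(V, e, state, reward, next_state, alpha=0.1, gamma=0.9, lam=0.8):
    """Apply one TD(lambda) step with accumulating eligibility traces."""
    delta = reward + gamma * V[next_state] - V[state]
    e[state] += 1.0  # mark the state just visited as eligible
    for x in V:
        V[x] += alpha * delta * e[x]  # credit every state by its trace
        e[x] *= gamma * lam           # let the trace decay
    return delta


# Hypothetical trajectory A -> B -> C with a reward of 1 on the second step.
V = {"A": 0.0, "B": 0.0, "C": 0.0}
e = {"A": 0.0, "B": 0.0, "C": 0.0}
td_lambda_step(V, e, "A", 0.0, "B")
td_lambda_step(V, e, "B", 1.0, "C")
print(V)
```

After the second step, state A also receives part of the reward (scaled by its decayed trace of 0.72), which is how the traces carry credit back in time.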

37 Q-Learning, A Special Case of TD(0)
By building a value function (Q-function) that is a function of both the states AND the actions, Q-learning avoids the need for a model of the environment. One-step Q-learning:
$$Q(x(t), a(t)) \leftarrow Q(x(t), a(t)) + \alpha \Big( r(t+1) + \gamma \max_{a'} Q(x(t+1), a') - Q(x(t), a(t)) \Big)$$
Note: it is not necessary to follow the policy for Q-learning; just visit every state-action pair sufficiently often! The policy is simply the action that has the maximal Q-value in a particular state.
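A tabular sketch of one-step Q-learning, using a purely random behavior policy to illustrate that the learned greedy policy does not depend on the policy being followed. The environment interface `step(state, action) -> (next_state, reward)`, the three-state chain, and all names and parameter values are assumptions made for this example.

```python
import random


def q_learning(step, n_states, actions, start, terminal,
               episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn Q-values from random exploration in a small episodic task."""
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in actions} for s in range(n_states)}
    for _ in range(episodes):
        s = start
        for _ in range(100):  # safety cap on the episode length
            if s == terminal:
                break
            a = rng.choice(actions)  # behavior policy: uniformly random
            s2, r = step(s, a)
            target = r + gamma * max(Q[s2].values())
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q


def chain_step(s, a):
    """Hypothetical chain 0 - 1 - 2: every move costs -1, state 2 is the goal."""
    s2 = min(s + 1, 2) if a == "right" else max(s - 1, 0)
    return s2, -1.0


Q = q_learning(chain_step, n_states=3, actions=["left", "right"], start=0, terminal=2)
greedy = {s: max(Q[s], key=Q[s].get) for s in (0, 1)}
print(greedy)
```

No transition model is consulted anywhere: the update uses only sampled (state, action, reward, next state) tuples.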

38 The Exploration/Exploitation Dilemma
Suppose you form action value estimates $Q_t(a) \approx Q^*(a)$.
The greedy action at $t$ is $a_t^* = \arg\max_a Q_t(a)$.
$a_t = a_t^*$: exploitation.
$a_t \ne a_t^*$: exploration.
You can't exploit all the time; you can't explore all the time. You can never stop exploring; but you should always reduce exploring. Maybe.

39 ε-greedy Action Selection
Greedy action selection: $a_t = a_t^* = \arg\max_a Q_t(a)$.
ε-greedy: $a_t = a_t^*$ with probability $1 - \varepsilon$, and a random action with probability $\varepsilon$.
This is the simplest way to balance exploration and exploitation.
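A minimal sketch of ε-greedy selection over a list of action-value estimates. The function name and the convention that actions are list indices are assumptions for this example; ties are broken in favor of the lowest index.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return q_values.index(max(q_values))     # exploit


# Hypothetical estimates for three actions.
estimates = [0.1, 0.9, 0.3]
print(epsilon_greedy(estimates, epsilon=0.0))  # always the greedy action: 1
```

Setting epsilon to 0 recovers pure greedy selection, and setting it to 1 gives uniformly random exploration.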

40 10-Armed Testbed
$n = 10$ possible actions. Each $Q^*(a)$ is chosen randomly from a normal distribution, and each reward $r_t$ is also normally distributed. Each run lasts 1000 plays; the whole thing is repeated 2000 times and the results are averaged. Action values are estimated with the sample average.
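The experimental setup can be sketched as a short simulation. The distribution parameters did not survive in the slide text, so this sketch assumes the settings of Sutton and Barto's textbook testbed (true values drawn from a normal distribution with mean 0 and variance 1, rewards with unit variance around the true value); the function names and the smaller default number of runs are also assumptions.

```python
import random


def run_bandit(epsilon, plays=1000, n_arms=10, rng=None):
    """One run: epsilon-greedy with sample-average estimates; returns the rewards."""
    rng = rng or random.Random()
    true_values = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]  # assumed N(0, 1)
    estimates = [0.0] * n_arms
    counts = [0] * n_arms
    rewards = []
    for _ in range(plays):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            arm = estimates.index(max(estimates))
        reward = rng.gauss(true_values[arm], 1.0)  # assumed unit variance
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # sample average
        rewards.append(reward)
    return rewards


def average_reward(epsilon, runs=200, plays=1000, seed=0):
    """Average the reward at each play over many independent runs."""
    rng = random.Random(seed)
    totals = [0.0] * plays
    for _ in range(runs):
        for t, r in enumerate(run_bandit(epsilon, plays, rng=rng)):
            totals[t] += r
    return [total / runs for total in totals]


if __name__ == "__main__":
    curve = average_reward(epsilon=0.1)
    print(sum(curve[-100:]) / 100)  # average reward over the last 100 plays
```

Running `average_reward` for several values of epsilon and plotting the curves gives a comparison of the kind the testbed is designed for.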

41 ε-greedy Methods on the 10-Armed Testbed

42 Softmax Action Selection
Softmax action selection methods grade action probabilities by estimated values. The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action $a$ on play $t$ with probability
$$\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}},$$
where $\tau$ is the computational temperature.
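The Boltzmann probabilities can be computed as in the sketch below. Subtracting the maximum estimate before exponentiating is a standard numerical-stability trick that leaves the probabilities unchanged; the function name and the sample values are assumptions for this example.

```python
import math


def softmax_probabilities(q_values, tau):
    """Return the Gibbs/Boltzmann selection probability of each action."""
    m = max(q_values)  # shift for numerical stability; cancels in the ratio
    exps = [math.exp((q - m) / tau) for q in q_values]
    total = sum(exps)
    return [x / total for x in exps]


# Hypothetical estimates for two actions.
print(softmax_probabilities([1.0, 2.0], tau=1.0))   # about [0.269, 0.731]
print(softmax_probabilities([1.0, 2.0], tau=0.05))  # nearly greedy
```

A high temperature makes the actions almost equally likely, while a temperature close to 0 approaches greedy selection.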

43 Incremental Implementation
Recall the sample average estimation method. The average of the first $k$ rewards is (dropping the dependence on $a$):
$$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$$
Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:
$$Q_{k+1} = Q_k + \frac{1}{k+1} \left[ r_{k+1} - Q_k \right]$$
This is a common form for update rules:
NewEstimate = OldEstimate + StepSize [Target - OldEstimate]
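The incremental rule can be checked against the ordinary average with a few lines of code. The function name and the sample rewards are assumptions made for this illustration.

```python
def incremental_mean(rewards):
    """Compute the sample average without storing a running sum."""
    q = 0.0
    for k, r in enumerate(rewards):  # k = number of rewards seen so far
        q += (r - q) / (k + 1)       # Q_{k+1} = Q_k + (r_{k+1} - Q_k) / (k + 1)
    return q


rewards = [1.0, 2.0, 3.0, 4.0]
print(incremental_mean(rewards), sum(rewards) / len(rewards))  # both 2.5
```

Only the current estimate and the count are needed, so the memory cost stays constant no matter how many rewards arrive.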

44 Tracking a Nonstationary Problem
Choosing $Q_k$ to be a sample average is appropriate in a stationary problem, i.e., when none of the $Q^*(a)$ change over time, but not in a nonstationary problem. Better in the nonstationary case is:
$$Q_{k+1} = Q_k + \alpha \left[ r_{k+1} - Q_k \right] \quad \text{for constant } \alpha,\ 0 < \alpha \le 1,$$
which unrolls to
$$Q_k = (1 - \alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1 - \alpha)^{k-i} r_i,$$
an exponential, recency-weighted average.
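To make the equivalence between the recursive update and the recency-weighted sum concrete, the sketch below computes both and compares them. The function names, the initial estimate, and the sample rewards are assumptions for this example.

```python
def recency_weighted_recursive(rewards, alpha, q0=0.0):
    """Apply Q <- Q + alpha * (r - Q) once per reward."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q


def recency_weighted_closed_form(rewards, alpha, q0=0.0):
    """Evaluate (1 - alpha)^k * Q_0 + sum_i alpha * (1 - alpha)^(k - i) * r_i."""
    k = len(rewards)
    weighted = sum(
        alpha * (1 - alpha) ** (k - i) * r
        for i, r in enumerate(rewards, start=1)
    )
    return (1 - alpha) ** k * q0 + weighted


rewards = [1.0, 0.0, 2.0, 5.0]
print(recency_weighted_recursive(rewards, alpha=0.1, q0=5.0))
print(recency_weighted_closed_form(rewards, alpha=0.1, q0=5.0))
```

The most recent reward gets weight alpha and older rewards are discounted by powers of (1 - alpha), which is why the estimate can follow a drifting target.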

45 Optimistic Initial Values
All methods so far depend on $Q_0(a)$, i.e., they are biased. Suppose instead we initialize the action values optimistically, i.e., on the 10-armed testbed, use $Q_0(a) = 5$ for all $a$.


More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques

More information

RL 3: Reinforcement Learning

RL 3: Reinforcement Learning RL 3: Reinforcement Learning Q-Learning Michael Herrmann University of Edinburgh, School of Informatics 20/01/2015 Last time: Multi-Armed Bandits (10 Points to remember) MAB applications do exist (e.g.

More information

15-889e Policy Search: Gradient Methods Emma Brunskill. All slides from David Silver (with EB adding minor modificafons), unless otherwise noted

15-889e Policy Search: Gradient Methods Emma Brunskill. All slides from David Silver (with EB adding minor modificafons), unless otherwise noted 15-889e Policy Search: Gradient Methods Emma Brunskill All slides from David Silver (with EB adding minor modificafons), unless otherwise noted Outline 1 Introduction 2 Finite Difference Policy Gradient

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Daniel Hennes 19.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Eligibility traces n-step TD returns Forward and backward view Function

More information

PART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B.

PART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B. Advanced Topics in Machine Learning, GI13, 2010/11 Advanced Topics in Machine Learning, GI13, 2010/11 Answer any THREE questions. Each question is worth 20 marks. Use separate answer books Answer any THREE

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

ROB 537: Learning-Based Control. Announcements: Project background due Today. HW 3 Due on 10/30 Midterm Exam on 11/6.

ROB 537: Learning-Based Control. Announcements: Project background due Today. HW 3 Due on 10/30 Midterm Exam on 11/6. ROB 537: Learning-Based Control Week 5, Lecture 1 Policy Gradient, Eligibility Traces, Transfer Learning (MaC Taylor Announcements: Project background due Today HW 3 Due on 10/30 Midterm Exam on 11/6 Reading:

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Reinforcement Learning

Reinforcement Learning CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act

More information

Monte Carlo is important in practice. CSE 190: Reinforcement Learning: An Introduction. Chapter 6: Temporal Difference Learning.

Monte Carlo is important in practice. CSE 190: Reinforcement Learning: An Introduction. Chapter 6: Temporal Difference Learning. Monte Carlo is important in practice CSE 190: Reinforcement Learning: An Introduction Chapter 6: emporal Difference Learning When there are just a few possibilitieo value, out of a large state space, Monte

More information

An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning 1 / 58 An Introduction to Reinforcement Learning Lecture 01: Introduction Dr. Johannes A. Stork School of Computer Science and Communication KTH Royal Institute of Technology January 19, 2017 2 / 58 ../fig/reward-00.jpg

More information

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Reinforcement Learning: An Introduction

Reinforcement Learning: An Introduction Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is

More information

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016 Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the

More information
