CS599 Lecture 1 Introduction To RL

CS599 Lecture 1: Introduction to RL

Outline: Reinforcement learning introduction; Learning from rewards; Policies; Value functions; Rewards; Models of the environment; Exploitation vs. exploration; Dynamic programming; Temporal difference learning; Q-learning.

Handout: Class notes.

Reading assignment for next class: Chapters 1-4, http://www.cs.ualberta.ca/~sutton/book/the-book.html

Goals for the Course

Highlights: an introduction to reinforcement learning with a view toward solving real-world problems; the state of the art of learning control with reinforcement learning; projects possible with simulated or actual robots.

Course description: This course will introduce and discuss machine learning methods for learning control, with a particular focus on robotics, but also applicable to models of learning in biology and any other control process. The course will cover the basics of reinforcement learning with value functions (dynamic programming, temporal difference learning, Q-learning). The emphasis, however, will be on learning methods that scale to complex, high-dimensional control problems. Thus, we will cover function approximation methods for reinforcement learning, policy gradients, probabilistic reinforcement learning, learning from trajectory trials, optimal control methods, stochastic optimal control methods, dynamic Bayesian networks for learning control, Gaussian processes for reinforcement learning, etc.

Reminder: Supervised Learning

Supervised learning provides a signed error vector, i.e., the gradient. [Block diagram: training info = desired (target) outputs; inputs -> supervised learning system -> outputs; error = (target output - actual output).]

Evaluative Feedback

Evaluating actions vs. instructing by giving correct actions: pure evaluative feedback depends entirely on the action taken; pure instructive feedback does not depend on the action taken at all. Supervised learning is instructive; optimization and reinforcement learning are evaluative.

What is Reinforcement Learning?

Reinforcement learning is learning what to do (how to map situations to actions) so as to maximize a numerical reward signal, i.e., evaluative feedback. There is no information about which actions to take; only the reward signal is given. Actions in the present may affect future rewards, so there is a temporal credit assignment problem. Reinforcement learning requires learning certain functional relationships and thus builds on techniques of function approximation. Reinforcement learning considers an entire learning problem, i.e., an agent interacting with the environment, which makes it a more complex problem than most learning tasks.

What is Reinforcement Learning?

An approach to artificial intelligence: learning from interaction; goal-oriented learning; learning about, from, and while interacting with an external environment; learning what to do (how to map situations to actions) so as to maximize a numerical reward signal.

Key Features of RL

The learner is not told which actions to take; trial-and-error search; the possibility of delayed reward (sacrificing short-term gains for greater long-term gains); the need to explore and exploit; RL considers the whole problem of a goal-directed agent interacting with an uncertain environment.

RL in the Context of Other Research Areas

Reinforcement learning (RL) sits at the intersection of artificial intelligence, psychology, control theory and operations research, neuroscience, and artificial neural networks. [Diagram: RL at the center, linked to each of these fields.]

Elements of Reinforcement Learning

Policy: a mapping from perceived states to actions (can be probabilistic). Reward function: maps a perceived state-action pair to a single number, the immediate reward (possibly stochastic). Value function: maps a state to the accumulated expected reward that would be received when starting in that state. Model: predicts the next state given the current state and action (can be probabilistic). Objective: optimize reward!

Elements of RL

Policy: what to do. Reward: what is good. Value: what is good because it predicts reward. Model of the environment: what follows what.

The Agent-Environment Interface

Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$ At each step the agent observes the state $s_t \in S$, produces an action $a_t \in A(s_t)$, and receives the resulting reward $r_{t+1} \in \mathbb{R}$ and resulting next state $s_{t+1}$. [Diagram: the interaction sequence $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$]
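As an illustration of this interface, here is a minimal Python sketch of the interaction loop (my own example, not from the lecture); the env.reset()/env.step() methods and the policy callable are assumed placeholders for whatever environment and policy are at hand.

```python
def run_episode(env, policy, max_steps=1000):
    """Run one episode of the agent-environment loop.

    Assumes env.reset() returns an initial state and env.step(action)
    returns (next_state, reward, done); policy(state) returns an action.
    """
    state = env.reset()
    trajectory = []
    for t in range(max_steps):
        action = policy(state)                       # a_t
        next_state, reward, done = env.step(action)  # r_{t+1}, s_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```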

Examples of Reinforcement Learning

Robocup soccer teams (Stone & Veloso, Riedmiller et al.): world's best player of simulated soccer, 1999; runner-up 2000. Inventory management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry-standard methods. Dynamic channel assignment (Singh & Bertsekas, Nie & Haykin): world's best assigner of radio channels to mobile telephone calls. Elevator control (Crites & Barto): (probably) the world's best down-peak elevator controller. Many robots: navigation, bipedal walking, grasping, switching between skills, ... TD-Gammon and Jellyfish (Tesauro, Dahl): world's best backgammon player.

The Markov Property

By the state at step t, the book means whatever information is available to the agent at step t about its environment. The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations. Ideally, a state should summarize past sensations so as to retain all essential information, i.e., it should have the Markov property:

$$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$$

for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.

Markov Decision Processes

If a reinforcement learning task has the Markov property, it is basically a Markov decision process (MDP). If the state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give: the state and action sets, and the one-step dynamics defined by the transition probabilities

$$P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\} \quad \text{for all } s, s' \in S,\ a \in A(s),$$

and the expected rewards

$$R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\} \quad \text{for all } s, s' \in S,\ a \in A(s).$$
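To make the definition concrete, a finite MDP can be written down as plain tables; the two-state, two-action example below is invented purely for illustration.

```python
# A hypothetical finite MDP: P[s][a][s'] holds the transition probability
# P^a_{ss'}, and R[s][a][s'] holds the expected reward R^a_{ss'}.
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.9, "s1": 0.1}},
}
R = {
    "s0": {"stay": {"s0": 0.0}, "go": {"s0": 0.0, "s1": 1.0}},
    "s1": {"stay": {"s1": 2.0}, "go": {"s0": 0.0, "s1": 0.0}},
}
gamma = 0.9  # discount rate
```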

Value Functions

The value of a state is the expected return starting from that state; it depends on the agent's policy. State-value function for policy $\pi$:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\}$$

The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$. Action-value function for policy $\pi$:

$$Q^\pi(s,a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big\}$$

Bellman Equation for a Policy

The basic idea:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots = r_{t+1} + \gamma\,(r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots) = r_{t+1} + \gamma R_{t+1}$$

So:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$$

Or, without the expectation operator:

$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]$$

Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state:

$$V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s,a) = \max_{a \in A(s)} E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\} = \max_{a \in A(s)} \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^*(s')\big]$$

$V^*$ is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation for Q*

$$Q^*(s,a) = E\big\{r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\big|\, s_t = s, a_t = a\big\} = \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma \max_{a'} Q^*(s', a')\big]$$

$Q^*$ is the unique solution of this system of nonlinear equations.

The Reward Hypothesis

That all of what we mean by goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (reward).

Returns

Suppose the sequence of rewards after step t is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$ What do we want to maximize? In general, we want to maximize the expected return, $E\{R_t\}$, for each step t. Episodic tasks: the interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. Then

$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T,$$

where T is a final time step at which a terminal state is reached, ending an episode.

Returns for Continuing Tasks

Continuing tasks: the interaction does not have natural episodes. Discounted return:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$

where $\gamma$, $0 \le \gamma \le 1$, is the discount rate. $\gamma$ close to 0: shortsighted; $\gamma$ close to 1: farsighted.
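A small numerical illustration of the discounted return (the rewards and the value of gamma below are made up for the example):

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1}, for a finite list of future rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With (r_{t+1}, r_{t+2}, r_{t+3}) = (1, 0, 2) and gamma = 0.9:
# R_t = 1 + 0.9 * 0 + 0.81 * 2 = 2.62
print(discounted_return([1, 0, 2], 0.9))
```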

An Example: Pole Balancing

Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track. As an episodic task where the episode ends upon failure: reward = +1 for each step before failure, so the return is the number of steps before failure. As a continuing task with discounted return: reward = -1 upon failure and 0 otherwise, so the return is $-\gamma^k$ for k steps before failure. In either case, the return is maximized by avoiding failure for as long as possible.

Another Example

Get to the top of the hill as quickly as possible. Reward = -1 for each step where the car is not at the top of the hill, so the return is minus the number of steps before reaching the top of the hill. The return is maximized by minimizing the number of steps to reach the top of the hill.

Example: Tic-Tac-Toe

Goal: learn to play an optimal game against, e.g., a particular opponent, or against the optimally playing opponent. What is the state of the system? All possible board configurations. What is an action? Putting down a new X in an empty field. What is the reward? Say, +10 for winning and -1 for every move that does not win.

Dynamic Programming for Tic-Tac-Toe

Key idea: use the value function to organize and structure the search for good policies. Key ingredients: a model of the state transitions (i.e., of the opponent); an algorithm to compute the value function from a given policy; an algorithm to compute the policy from a given value function; and the proof that iterating between policy computation and value-function computation converges to the optimal policy and value function!

Computing the Model

Observe very many tic-tac-toe games. Count the number of times $n_{x_{n+1}, x_n}$ that the opponent makes a particular move in a particular state, and the total number of times $n_{x_n}$ that the state is visited. Thus we obtain a model of the opponent, indicating its state-transition probability

$$P(x_{n+1} \mid x_n) = \frac{n_{x_{n+1}, x_n}}{n_{x_n}}$$

For the given discrete-state example, this results in a state-transition matrix. Note: we could also learn this model in an on-line fashion, simultaneously with the policy and value function.
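A minimal sketch of this counting estimator, assuming the observed games are available as lists of successive board states (the function name and data layout are mine, for illustration):

```python
from collections import defaultdict

def estimate_transition_model(observed_games):
    """Estimate P(x_{n+1} | x_n) by counting opponent transitions and normalizing."""
    counts = defaultdict(lambda: defaultdict(int))
    for game in observed_games:                      # each game: a list of states
        for x, x_next in zip(game[:-1], game[1:]):
            counts[x][x_next] += 1
    model = {}
    for x, successors in counts.items():
        total = sum(successors.values())             # n_{x_n}
        model[x] = {x_next: c / total for x_next, c in successors.items()}
    return model
```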

Policy Evaluation

Goal: compute the value function under a given policy $\pi$ and model (state-transition matrix). Remember: the value function is the expected long-term reward from a given state:

$$V^\pi(x) = E\{r_{n+1} + \gamma r_{n+2} + \gamma^2 r_{n+3} + \cdots \mid x\} \quad (\gamma \in [0,1])$$
$$= E\{r_{n+1} + \gamma V^\pi(x_{n+1}) \mid x\} = \sum_a \pi(x,a) \sum_{x_{n+1}} P(x_{n+1} \mid x)\big(r_{n+1} + \gamma V^\pi(x_{n+1})\big)$$

For any given policy (stochastic or deterministic), repeated application of this update formula will lead to the correct value function under the given policy $\pi$:

$$V_{n+1}(x) = \sum_a \pi(x,a) \sum_{x_{n+1}} P(x_{n+1} \mid x)\big(r_{n+1} + \gamma V_n(x_{n+1})\big)$$
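A minimal sketch of iterative policy evaluation along these lines, written (as an assumption of mine, slightly more general than the slide's notation) with action-dependent transitions P[x][a][x'], rewards R[x][a][x'], and a stochastic tabular policy pi[x][a]:

```python
def policy_evaluation(pi, P, R, gamma, sweeps=100):
    """Repeatedly apply
    V(x) <- sum_a pi(x,a) sum_x' P(x'|x,a) * (R(x,a,x') + gamma * V(x'))."""
    V = {x: 0.0 for x in P}
    for _ in range(sweeps):
        V_new = {}
        for x in P:
            v = 0.0
            for a, prob_a in pi[x].items():
                for x_next, p in P[x][a].items():
                    v += prob_a * p * (R[x][a][x_next] + gamma * V[x_next])
            V_new[x] = v
        V = V_new
    return V
```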

Example: Grid World

Policy Improvement

Goal: compute a better policy given the value function. Bellman's principle of optimality: an optimal policy has to be locally optimal as well. Thus the policy can be improved by local improvements:

$$\pi_{n+1}(x) = \arg\max_a \sum_{x_{n+1}} P(x_{n+1} \mid x)\big(r_{n+1}(a,x) + \gamma V_n(x_{n+1})\big)$$
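The corresponding local improvement step, as a sketch under the same assumed P/R/V table layout as the policy evaluation sketch above:

```python
def policy_improvement(V, P, R, gamma):
    """Greedy policy with respect to V:
    pi(x) = argmax_a sum_x' P(x'|x,a) * (R(x,a,x') + gamma * V(x'))."""
    pi = {}
    for x in P:
        def q(a):
            return sum(p * (R[x][a][x_next] + gamma * V[x_next])
                       for x_next, p in P[x][a].items())
        pi[x] = max(P[x], key=q)   # pick the action with the best one-step lookahead
    return pi
```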

Example: Grid World (cont'd)

Computing the Optimal Policy

Policy iteration: update the value function (policy evaluation), then update the policy (policy improvement), and iterate until convergence. Policy iteration usually converges fairly fast. Value iteration: policy evaluation is expensive, since it takes several iterations to converge to the correct value function. Value iteration corresponds to a single policy-evaluation step followed by a policy-improvement step, which allows the explicit policy-update step to be omitted:

$$V_{n+1}(x) = \max_a \sum_{x_{n+1}} P(x_{n+1} \mid x)\big(r_{n+1}(a,x) + \gamma V_n(x_{n+1})\big)$$

Asynchronous DP: it is not necessary to update all states simultaneously; every state just has to be updated sufficiently often.
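A value iteration sketch under the same assumed table layout as before; it folds the max over actions directly into the value backup and sweeps until the values stop changing:

```python
def value_iteration(P, R, gamma, max_sweeps=1000, tol=1e-8):
    """V(x) <- max_a sum_x' P(x'|x,a) * (R(x,a,x') + gamma * V(x'))."""
    V = {x: 0.0 for x in P}
    for _ in range(max_sweeps):
        delta = 0.0
        for x in P:                                   # in-place (Gauss-Seidel) sweep
            best = max(
                sum(p * (R[x][a][x_next] + gamma * V[x_next])
                    for x_next, p in P[x][a].items())
                for a in P[x]
            )
            delta = max(delta, abs(best - V[x]))
            V[x] = best
        if delta < tol:                               # stop when values have converged
            break
    return V
```

The greedy policy can then be read off from the resulting V with the policy-improvement step sketched above.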

Monte Carlo Methods

Goal: learn the value function from experience only, without knowledge of a model of the environment (on-line learning). Advantage: real data is often easily obtained, while building models of the environment (e.g., by density estimation) can be very hard. Monte Carlo methods are episode-based: they assume a trial ends after a while (absorbing states, or a finite number of steps with discounted reward).

Monte Carlo Methods: Grid World Example

Monte Carlo policy evaluation: start at a random state; follow the policy, keeping the entire trajectory and its rewards in memory; after the goal is reached, update the values of all states along the trajectory, starting from the last state: each value becomes the discounted average of the rewards received after that state.
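A sketch of this episode-based evaluation (every-visit flavor), assuming a sample_episode() helper that runs the current policy from a random start and returns the visited (state, reward) pairs; the helper and data layout are hypothetical:

```python
def mc_policy_evaluation(sample_episode, num_episodes, gamma):
    """Average sampled discounted returns per visited state."""
    returns_sum, returns_count, V = {}, {}, {}
    for _ in range(num_episodes):
        episode = sample_episode()                # list of (state, reward after state)
        G = 0.0
        for state, reward in reversed(episode):   # walk backwards from the goal
            G = reward + gamma * G                # discounted return from this state
            returns_sum[state] = returns_sum.get(state, 0.0) + G
            returns_count[state] = returns_count.get(state, 0) + 1
            V[state] = returns_sum[state] / returns_count[state]
    return V
```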

Temporal Difference Learning

TD is a combination of Monte Carlo techniques and DP to compute a value function. TD allows a more natural (on-line) calculation of the value function, as it only needs to look at states that are neighbors in time, de-emphasizing knowledge of the spatial layout of the states. In its simplest form, TD (actually called TD(0)) updates the value function as

$$V(x(t)) \leftarrow V(x(t)) + \alpha\big(r(t+1) + \gamma V(x(t+1)) - V(x(t))\big), \quad \alpha \in [0,1]$$

In order to obtain data, we need to follow the current policy for a while until sufficient data has been experienced.
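The TD(0) backup itself is a one-liner; a minimal sketch (function and variable names are mine):

```python
def td0_update(V, x, r_next, x_next, alpha, gamma):
    """V(x(t)) <- V(x(t)) + alpha * (r(t+1) + gamma * V(x(t+1)) - V(x(t)))."""
    V[x] += alpha * (r_next + gamma * V[x_next] - V[x])

# Applied along a trajectory generated by the current policy, e.g.:
# for (x, r_next, x_next) in transitions:
#     td0_update(V, x, r_next, x_next, alpha=0.1, gamma=0.95)
```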

N-Step TD and TD(λ)

N-step TD is an update method between TD(0) and Monte Carlo methods (TD(1)). Recall the TD(0) update: $V(x(t)) \leftarrow V(x(t)) + \alpha\big(r(t+1) + \gamma V(x(t+1)) - V(x(t))\big)$, $\alpha \in [0,1]$. Instead of taking just two temporally adjacent states into account for an update, more of them can be used:

$$R^{(1)}(t) = r(t+1) + \gamma V(x(t+1))$$
$$R^{(2)}(t) = r(t+1) + \gamma r(t+2) + \gamma^2 V(x(t+2))$$
$$\vdots$$
$$R^{(n)}(t) = \sum_{i=1}^{n} \gamma^{i-1} r(t+i) + \gamma^n V(x(t+n))$$

When averaging over the $R^{(n)}(t)$ values, one obtains the TD(λ) method:

$$R^\lambda(t) = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R^{(n)}(t)$$

TD(λ) Implemented with Eligibility Traces

Using the concept of decaying activation traces (eligibility traces), TD(λ) can be implemented in a simple on-line version: every state x gets assigned an eligibility trace e(x), and the value function is updated as

$$\delta = r(t+1) + \gamma V(x(t+1)) - V(x(t))$$
$$e(x(t)) \leftarrow e(x(t)) + 1$$

and then, for all states x:

$$V(x) \leftarrow V(x) + \alpha\,\delta\,e(x), \qquad e(x) \leftarrow \gamma\lambda\,e(x)$$
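A sketch of the on-line TD(λ) loop with accumulating traces, assuming one episode's transitions are available as (x, r_next, x_next) tuples (my own data layout, not from the slides):

```python
def td_lambda_episode(transitions, V, alpha, gamma, lam):
    """On-line TD(lambda) with accumulating eligibility traces."""
    e = {x: 0.0 for x in V}
    for x, r_next, x_next in transitions:
        delta = r_next + gamma * V[x_next] - V[x]    # TD error
        e[x] += 1.0                                  # bump trace of the visited state
        for s in V:                                  # every state is nudged by its trace
            V[s] += alpha * delta * e[s]
            e[s] *= gamma * lam                      # traces decay
    return V
```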

Q-Learning, a Special Case of TD(0)

By building a value function (the Q-function) that is a function of both the states and the actions, Q-learning avoids the need for a model of the environment. One-step Q-learning:

$$Q(x(t),a(t)) \leftarrow Q(x(t),a(t)) + \alpha\Big(r(t+1) + \gamma \max_{a'} Q(x(t+1), a') - Q(x(t),a(t))\Big)$$

Note: it is not necessary to follow the policy for Q-learning; every state-action pair just has to be visited sufficiently often. The policy is simply the action that has the maximal Q-value in a particular state.
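A tabular one-step Q-learning sketch with ε-greedy exploration; the env.reset()/env.step() interface is the same assumption as in the interaction-loop sketch earlier, and the parameter values are placeholders:

```python
import random

def q_learning(env, actions, alpha=0.1, gamma=0.95, epsilon=0.1, episodes=500):
    """Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_a' Q(x',a') - Q(x,a))."""
    Q = {}
    q = lambda s, a: Q.get((s, a), 0.0)              # unseen pairs default to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:            # explore
                a = random.choice(actions)
            else:                                    # exploit
                a = max(actions, key=lambda act: q(s, act))
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(q(s_next, act) for act in actions)
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            s = s_next
    return Q
```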

The Exploration/Exploitation Dilemma

Suppose you form action-value estimates $Q_t(a) \approx Q^*(a)$. The greedy action at time t is $a_t^* = \arg\max_a Q_t(a)$. Choosing $a_t = a_t^*$ is exploitation; choosing $a_t \neq a_t^*$ is exploration. You can't exploit all the time; you can't explore all the time. You can never stop exploring; but you should always reduce exploring. Maybe.

ε-Greedy Action Selection

Greedy action selection: $a_t = a_t^* = \arg\max_a Q_t(a)$. ε-greedy selection: $a_t = a_t^*$ with probability $1-\varepsilon$, and a random action with probability $\varepsilon$. This is the simplest way to balance exploration and exploitation.
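As a sketch (Q_est is assumed to be a dict of current action-value estimates):

```python
import random

def epsilon_greedy(Q_est, actions, epsilon):
    """Greedy action with probability 1 - epsilon, uniformly random otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q_est[a])
```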

10-Armed Testbed

n = 10 possible actions. Each $Q^*(a)$ is chosen randomly from a normal distribution; each reward $r_t$ is also normal. 1000 plays per run; repeat the whole thing 2000 times and average the results. Action values are estimated by sample averages.

ε-greedy Methods on the 10-Armed Testbed

Softmax Action Selection

Softmax action selection methods grade action probabilities by estimated values. The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a on play t with probability

$$\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}},$$

where $\tau$ is the computational temperature.
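A sketch of Boltzmann (softmax) selection over a dict of estimates Q_est; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something from the slide:

```python
import math
import random

def softmax_action(Q_est, actions, tau):
    """Sample an action with probability proportional to exp(Q_t(a) / tau)."""
    m = max(Q_est[a] for a in actions)                       # for numerical stability
    prefs = [math.exp((Q_est[a] - m) / tau) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]
```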

Incremental Implementation

Recall the sample-average estimation method: the average of the first k rewards is (dropping the dependence on a)

$$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}.$$

Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:

$$Q_{k+1} = Q_k + \frac{1}{k+1}\big[r_{k+1} - Q_k\big]$$

This is a common form for update rules: NewEstimate = OldEstimate + StepSize [Target - OldEstimate].
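The incremental form as code (a one-line sketch; names are mine):

```python
def update_sample_average(Q_k, k, r_next):
    """Q_{k+1} = Q_k + (1 / (k + 1)) * (r_{k+1} - Q_k): incremental sample average."""
    return Q_k + (1.0 / (k + 1)) * (r_next - Q_k)
```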

Tracking a Nonstationary Problem

Choosing $Q_k$ to be a sample average is appropriate in a stationary problem, i.e., when none of the $Q^*(a)$ change over time, but not in a nonstationary problem. Better in the nonstationary case is

$$Q_{k+1} = Q_k + \alpha\big[r_{k+1} - Q_k\big] \quad \text{for constant } \alpha,\ 0 < \alpha \le 1,$$

which unrolls to

$$Q_k = (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} r_i,$$

an exponential, recency-weighted average.
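The constant-step-size counterpart, again as a minimal sketch:

```python
def update_constant_alpha(Q_k, r_next, alpha):
    """Q_{k+1} = Q_k + alpha * (r_{k+1} - Q_k): recency-weighted (exponential) average."""
    return Q_k + alpha * (r_next - Q_k)
```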

Optimistic Initial Values

All methods so far depend on $Q_0(a)$, i.e., they are biased. Suppose instead we initialize the action values optimistically, e.g., on the 10-armed testbed, use $Q_0(a) = 5$ for all a.