Lecture 23: Reinforcement Learning


COMP-424, November 23, 2006

Outline:
- MDPs revisited
- Model-based learning
- Monte Carlo value function estimation
- Temporal-difference (TD) learning
- Exploration

Recall: Markov Decision Processes
[Diagram: a small "career" MDP with states Unemployed (U), Grad School (G), Industry (I) and Academia (A), actions n = do nothing, i = apply to industry, g = apply to grad school, a = apply to academia, stochastic transitions between the states, and rewards including r = +1 (Industry) and r = +10 (Academia).]
An MDP consists of:
- A set of states S
- A set of actions A
- Expected rewards R(s, a)
- Transition probabilities T(s, a, s')
- Discount factor γ
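A minimal sketch of how such a tabular MDP could be stored in code. The states and actions follow the slide, but the numeric rewards and transition probabilities below are illustrative placeholders, not the values from the diagram:

```python
# Minimal tabular MDP container (placeholder numbers, not the slide's exact values).
states = ["U", "G", "I", "A"]        # Unemployed, Grad School, Industry, Academia
actions = ["n", "i", "g", "a"]       # do nothing, apply to industry / grad school / academia

# Expected reward R(s, a): here it depends only on the state, as in the diagram.
R = {("A", a): 10.0 for a in actions}
R.update({("I", a): 1.0 for a in actions})
R.update({(s, a): 0.0 for s in ("U", "G") for a in actions})

# Transition probabilities T(s, a, s') as nested dicts; placeholder numbers.
T = {
    ("U", "i"): {"I": 0.5, "U": 0.5},
    ("U", "g"): {"G": 0.5, "U": 0.5},
    ("U", "n"): {"U": 1.0},
    # ... the remaining (s, a) pairs would be filled in the same way
}
gamma = 0.9  # discount factor
```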

Recall: Policy Evaluation Problem
Suppose someone told us a policy for selecting actions, π : S × A → [0, 1]. How much return do we expect to get if we use it to behave?
V^π(s) = E_π[R_t | s_t = s] = E_π[ Σ_{k=1}^∞ γ^{k-1} r_{t+k} | s_t = s ]
If we knew this, we could then improve the policy (e.g., using policy iteration).

Iterative Policy Evaluation
1. Start with some initial guess V_0
2. During every iteration k, update the values of all states:
   V_{k+1}(s) ← Σ_a π(s, a) [ R(s, a) + γ Σ_{s'} T(s, a, s') V_k(s') ],  ∀s
3. Stop when the maximum change between two iterations is smaller than a desired threshold (the values stop changing)
The value of one state is updated based on the values of the states that can be reached from it.
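A minimal sketch of this sweep, assuming the tabular MDP representation (states, actions, R, T, gamma) sketched above and a policy given as probabilities pi[s][a]; the function and argument names are illustrative, not from the slides:

```python
def iterative_policy_evaluation(states, actions, R, T, gamma, pi, theta=1e-6):
    """Evaluate a fixed policy pi by repeated Bellman expectation sweeps."""
    V = {s: 0.0 for s in states}                      # initial guess V_0
    while True:
        V_new = {}
        for s in states:
            V_new[s] = sum(
                pi[s][a] * (R[(s, a)]
                            + gamma * sum(p * V[s2]
                                          for s2, p in T.get((s, a), {}).items()))
                for a in actions
            )
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < theta:                             # the values stopped changing
            return V
```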

How Is Learning Tied with Dynamic Programming?
- Observe transitions in the environment, learn an approximate model R̂(s, a), T̂(s, a, s')
  - Use maximum likelihood to compute the transition probabilities
  - Use supervised learning for the rewards
- Pretend the approximate model is correct and use it with any dynamic programming method
- This approach is called model-based reinforcement learning
- It has many believers, especially in the robotics community
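A minimal sketch of the maximum-likelihood model estimate, assuming the agent has logged a list of (s, a, r, s') transitions; the counting scheme is the standard one and all names are illustrative:

```python
from collections import defaultdict

def estimate_model(transitions):
    """Maximum-likelihood T_hat and empirical-mean R_hat from (s, a, r, s') samples."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = N(s, a, s')
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    T_hat = {sa: {s2: n / visits[sa] for s2, n in succ.items()}
             for sa, succ in counts.items()}
    R_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return R_hat, T_hat   # plug these into any dynamic programming method
```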

Monte Carlo Methods
- Suppose we have an episodic task: the agent interacts with the environment in trials or episodes, which terminate at some point (e.g., game playing)
- The agent behaves according to some policy π for a while, generating several trajectories. How can we compute V^π?
- Compute V^π(s) by averaging the observed returns after s from the trajectories in which s was visited.

Implementation of Monte Carlo Policy Evaluation
Let V_{n+1} be the estimate of the value of some state s after observing n + 1 trajectories starting at s:
V_{n+1} = 1/(n+1) Σ_{i=1}^{n+1} R_i
        = 1/(n+1) ( Σ_{i=1}^{n} R_i + R_{n+1} )
        = n/(n+1) · (1/n) Σ_{i=1}^{n} R_i + 1/(n+1) R_{n+1}
        = n/(n+1) V_n + 1/(n+1) R_{n+1}
        = V_n + 1/(n+1) (R_{n+1} - V_n)
If we do not want to keep counts of how many times states have been visited, we can use a learning-rate version:
V(s_t) ← V(s_t) + α (R_t - V(s_t))
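A minimal sketch of both incremental forms (exact running average and constant learning rate), assuming the return R following a visit to state s has already been computed; names are illustrative:

```python
def mc_update_running_average(V, N, s, R):
    """Exact incremental average: V_{n+1} = V_n + (R_{n+1} - V_n) / (n + 1)."""
    N[s] = N.get(s, 0) + 1
    V[s] = V.get(s, 0.0) + (R - V.get(s, 0.0)) / N[s]

def mc_update_learning_rate(V, s, R, alpha=0.1):
    """Learning-rate version: V(s) <- V(s) + alpha * (R - V(s))."""
    V[s] = V.get(s, 0.0) + alpha * (R - V.get(s, 0.0))
```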

What If the State Space Is Too Large?
- Represent the state as a vector of input variables
- Represent V as a linear/polynomial function, or as a neural network
- Use R_t - V(s_t) as the error signal!

Temporal-Difference (TD) Prediction
Monte Carlo uses the actual return, R_t, as the target for the value function:
V(s_t) ← V(s_t) + α [R_t - V(s_t)]
The simplest TD method, TD(0), instead uses an estimate of the return:
V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) - V(s_t)]
If V(s_{t+1}) were correct, this would be like a dynamic programming target!

TD Is a Hybrid between Dynamic Programming and Monte Carlo!
- Like DP, it bootstraps (computes the value of a state based on estimates of its successors)
- Like MC, it estimates expected values by sampling

TD Learning Algorithm
1. Initialize the value function, V(s) = 0, ∀s
2. Repeat as many times as wanted:
   (a) Pick a start state s for the current trial
   (b) Repeat for every time step t:
       i.   Choose action a based on policy π and the current state s
       ii.  Take action a, observe reward r and new state s'
       iii. Compute the TD error: δ ← r + γ V(s') - V(s)
       iv.  Update the value function: V(s) ← V(s) + α δ
       v.   s ← s'
       vi.  If s is not a terminal state, go to step 2(b)
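A minimal tabular TD(0) sketch of this algorithm, under the assumption that the environment exposes reset()/step(a) in the usual Gym-like style and the policy is a function of the state; these interface names are illustrative, not from the slides:

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, gamma=0.9, alpha=0.1, n_episodes=1000):
    """Tabular TD(0) evaluation of a fixed policy (illustrative env interface)."""
    V = defaultdict(float)                       # V(s) = 0 for all s
    for _ in range(n_episodes):
        s = env.reset()                          # pick a start state for the trial
        done = False
        while not done:
            a = policy(s)                        # choose action from policy pi
            s_next, r, done = env.step(a)        # observe reward and new state
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
            V[s] += alpha * delta                # update the value function
            s = s_next
    return V
```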

Example
Suppose you start with all-0 guesses and observe the following episodes:
B,1   B,1   B,1   B,1   B,0   A,0; B (reward not seen yet)
What would you predict for V(B)? What would you predict for V(A)?

Example: TD vs. Monte Carlo
- For B, it is clear that V(B) = 4/5.
- If you use Monte Carlo, at this point you can only predict your initial guess for A (which is 0).
- If you use TD, at this point you would predict 0 + 4/5, and you would adjust the value of A towards this target.

Example (continued)
Suppose you start with all-0 guesses and observe the following episodes:
B,1   B,1   B,1   B,1   B,0   A,0; B,0
What would you predict for V(B)? What would you predict for V(A)?

Example: Value Prediction
- The estimate for B would be 4/6.
- The estimate for A, if we use Monte Carlo, is 0; this minimizes the sum-squared error on the training data.
- If you were to learn a model out of this data and do dynamic programming, you would estimate that A goes to B, so the value of A would be 0 + 4/6.
- TD is an incremental algorithm: it would adjust the value of A towards 4/5, which is the current estimate for B (before the continuation from B is seen). This is closer to dynamic programming than Monte Carlo.
- TD estimates take the time sequence into account.
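The arithmetic behind these numbers, reconstructed under the assumption that the discount factor is 1 (a sketch, not text from the slides):

```python
# Reconstruction of the example's arithmetic (assuming gamma = 1).
returns_B = [1, 1, 1, 1, 0, 0]           # six observed one-step returns from B
V_B = sum(returns_B) / len(returns_B)    # 4/6 ~ 0.667

V_A_mc = 0.0                             # Monte Carlo: the only return observed from A was 0 + 0
V_A_dp = 0.0 + 1.0 * V_B                 # model-based / DP: A -> B with reward 0, so 0 + 4/6
# TD had already nudged V(A) towards 0 + V(B) = 4/5, the estimate of B at the time A was visited.
```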

Example: Eligibility Traces
- Suppose you estimated V(B) = 4/5, then saw A,0; B,0.
- The value of A is adjusted right away towards 4/5.
- But then the value of B is decreased from 4/5 to something like 4/6.
- It would be nice to propagate this information to A as well!

Eligibility Traces (TD(λ))
[Diagram: the TD error δ_t computed at state s_t is propagated backwards in time to s_{t-1}, s_{t-2}, s_{t-3}, ..., weighted by eligibility traces e_t.]
- On every time step t, we compute the TD error: δ_t = r_{t+1} + γ V(s_{t+1}) - V(s_t)
- "Shout" δ_t backwards to past states
- The strength of your voice decreases with temporal distance by γλ, where λ ∈ [0, 1] is a parameter
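A minimal tabular TD(λ) sketch with accumulating traces, under the same illustrative env/policy interface assumed for the TD(0) sketch above:

```python
from collections import defaultdict

def td_lambda_policy_evaluation(env, policy, gamma=0.9, lam=0.8, alpha=0.1, n_episodes=1000):
    """Tabular TD(lambda) with accumulating eligibility traces (illustrative interface)."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        e = defaultdict(float)                   # eligibility trace e(s), reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
            e[s] += 1.0                          # mark s as just visited
            for state in list(e):                # "shout" delta back to all past states
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam          # decay the trace with temporal distance
            s = s_next
    return V
```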

Advantages
- No model of the environment is required! TD only needs experience with the environment.
- On-line, incremental learning:
  - Can learn before knowing the final outcome
  - Less memory and peak computation are required
- Both TD and MC converge (under mild assumptions), but TD usually learns faster.

Large State Spaces: Adapt Supervised Learning Algorithms
- A training example has an input and a target output
- The error is measured based on the difference between the actual output and the desired (target) output
[Diagram: a supervised learning block that maps inputs to outputs, with the desired (target) output provided as training information.]

Value-Based Methods
- We will use a function approximator to represent the value function
- The input is a description of the state
- The output is the predicted value of the state
- The target output comes from the TD update rule: the target is r_{t+1} + γ V(s_{t+1})

On-line Gradient Descent TD
1. Initialize the weight vector w of the function approximator
2. Pick a start state s
3. Repeat for every time step t:
   (a) Choose action a based on policy π and the current state s
   (b) Take action a, observe immediate reward r and new state s'
   (c) Compute the TD error: δ ← r + γ V(s') - V(s)
   (d) Update the weight vector: w ← w + α δ ∇_w V(s)
   (e) s ← s'
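A minimal sketch with a linear function approximator (so ∇_w V(s) is just the feature vector), assuming a feature function phi(s) that returns a NumPy array and the same illustrative env/policy interface as above:

```python
import numpy as np

def gradient_td_linear(env, policy, phi, n_features, gamma=0.9, alpha=0.01, n_episodes=1000):
    """On-line gradient-descent TD(0) with a linear value function V(s) = w . phi(s)."""
    w = np.zeros(n_features)                     # initialize the weight vector
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v_s = w @ phi(s)
            v_next = 0.0 if done else w @ phi(s_next)
            delta = r + gamma * v_next - v_s     # TD error
            w += alpha * delta * phi(s)          # gradient of V(s) w.r.t. w is phi(s)
            s = s_next
    return w
```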

Observations
- For linear function approximators, the gradient is just the input feature vector
- For neural networks, this algorithm reduces to running backpropagation with the TD error at the output

RL Algorithms for Control
- TD learning (as above) is used to compute values for a given policy π
- Control methods aim to find the optimal policy
- In this case, the behavior policy will have to balance two important tasks:
  - Explore the environment in order to get information
  - Exploit the existing knowledge, by taking the action that currently seems best

Exploration
- In order to obtain the optimal solution, the agent must try all actions
- Simplest exploration scheme: ε-greedy
  - With probability 1 - ε, choose the action that currently appears best
  - With probability ε, choose an action uniformly at random
- Much research is done in this area!
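A minimal ε-greedy action-selection sketch, assuming tabular action values stored as Q[(s, a)]; the names are illustrative:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability 1 - epsilon pick the greedy action, otherwise a uniformly random one."""
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit current knowledge
```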

TD-Gammon (Tesauro, 1992-1995)
[Diagram: a neural network with 198 input units encoding the backgammon position, 40-80 hidden units, and an output giving the predicted probability of winning, V_t; the network is trained on the TD error V_{t+1} - V_t. A backgammon board is shown, with white pieces moving counterclockwise (points 24 down to 13) and black pieces moving clockwise (points 1 to 12).]

TD-Gammon: Training Procedure
- Immediate reward: +100 if win, -100 if lose, 0 for all other states
- Trained by playing 1.5 million games against itself
- Now approximately equal to the best human player

The Power of Learning from Experience
- Expert examples are expensive and scarce
- Experience is cheap and plentiful!

Applying Reinforcement Learning to Chess
- TD does not replace search; it gives a way of computing value functions
- Useful trick: instead of evaluating the state before your move, evaluate the state after your move (called the afterstate). This makes it easier to choose moves.
- Exploration is very important, because the game is deterministic and self-play can get into behavior loops.
- Example: KnightCap (Baxter, Tridgell & Weaver, 2000)

Success Stories
- TD-Gammon (Tesauro, 1992)
- Elevator dispatching (Crites and Barto, 1995): better than industry standard
- Inventory management (Van Roy et al.): 10-15% improvement over industry standards
- Job-shop scheduling for NASA space missions (Zhang and Dietterich, 1997)
- Dynamic channel assignment in cellular phones (Singh and Bertsekas, 1994)
- Robotic soccer (Stone et al., Riedmiller et al., ...)
- Helicopter control (Ng, 2003)
- Modelling neural reward systems (Schultz, Dayan and Montague, 1997)

Summary
- Reinforcement learning can be used to learn value functions directly from interaction with an environment
- Monte Carlo methods use samples of the actual return
- TD methods use just samples of the next transition!
- Both converge in the limit, but TD is usually faster
- It is easy (algorithmically) to combine RL with function approximation. In this case, it is much harder to establish the theoretical properties of the algorithms, but they often work well in practice.