Chapter 8: Generalization and Function Approximation

Chapter 8: Generalization and Function Approximation. Objectives of this chapter: look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part, and give an overview of function approximation (FA) methods and how they can be adapted to RL.

Any FA Method? In principle, yes: artificial neural networks, decision trees, multivariate regression methods, etc. But RL has some special requirements: we usually want to learn while interacting, and we need the ability to handle nonstationarity (among other things).

Gradient Descent Methods. Write the parameter vector as

$$\vec\theta_t = \big(\theta_t(1), \theta_t(2), \ldots, \theta_t(n)\big)^T$$

Assume $V_t$ is a (sufficiently smooth) differentiable function of $\vec\theta_t$, for all $s \in S$. Assume, for now, training examples of this form: (description of $s_t$, $V^\pi(s_t)$).

Performance Measures. Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution $P$:

$$\mathrm{MSE}(\vec\theta_t) = \sum_{s \in S} P(s)\,\big[V^\pi(s) - V_t(s)\big]^2$$

Why $P$? Why minimize MSE? Let us assume that $P$ is always the distribution of states at which backups are done. The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.

Gradient Descent. Let $f$ be any function of the parameter space. Its gradient at any point $\vec\theta_t$ in this space is:

$$\nabla_{\vec\theta} f(\vec\theta_t) = \left( \frac{\partial f(\vec\theta_t)}{\partial \theta(1)}, \frac{\partial f(\vec\theta_t)}{\partial \theta(2)}, \ldots, \frac{\partial f(\vec\theta_t)}{\partial \theta(n)} \right)^T$$

Iteratively move down the gradient:

$$\vec\theta_{t+1} = \vec\theta_t - \alpha\,\nabla_{\vec\theta} f(\vec\theta_t)$$

On-Line Gradient-Descent TD(λ). (The complete boxed algorithm appears as a figure on this slide and is not transcribed.)
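Since only the slide title survives, here is a minimal sketch of what on-line gradient-descent TD(λ) looks like for the prediction problem, assuming a linear value function (so the gradient of $V_t(s)$ is just the feature vector) and hypothetical env, policy, and features helpers that are not part of the book:

```python
import numpy as np

def gradient_td_lambda(env, policy, features, n_features,
                       alpha=0.01, gamma=1.0, lam=0.9, n_episodes=100):
    """On-line gradient-descent TD(lambda) for policy evaluation (sketch)."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        e = np.zeros(n_features)               # eligibility trace vector
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v = theta @ features(s)            # V_t(s) = theta . phi_s
            v_next = 0.0 if done else theta @ features(s_next)
            delta = r + gamma * v_next - v     # TD error
            e = gamma * lam * e + features(s)  # accumulating trace (grad of V is phi_s)
            theta += alpha * delta * e         # parameter update
            s = s_next
    return theta
```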

Linear Methods. Represent states as feature vectors: for each $s \in S$,

$$\vec\phi_s = \big(\phi_s(1), \phi_s(2), \ldots, \phi_s(n)\big)^T$$

The value estimate is then linear in the parameters:

$$V_t(s) = \vec\theta_t^{\,T}\vec\phi_s = \sum_{i=1}^{n} \theta_t(i)\,\phi_s(i)$$

What is $\nabla_{\vec\theta} V_t(s)$?

Nice Properties of Linear FA Methods. The gradient is very simple: $\nabla_{\vec\theta} V_t(s) = \vec\phi_s$. For MSE, the error surface is simple: a quadratic surface with a single minimum. Linear gradient-descent TD(λ) converges if the step size decreases appropriately and states are sampled on-line from the on-policy distribution. It converges to a parameter vector $\vec\theta_\infty$ with the property

$$\mathrm{MSE}(\vec\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\,\mathrm{MSE}(\vec\theta^{\,*}),$$

where $\vec\theta^{\,*}$ is the best parameter vector (Tsitsiklis & Van Roy, 1997).
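As a concrete illustration of the "gradient equals feature vector" property, a tiny sketch with made-up numbers:

```python
import numpy as np

theta = np.array([0.5, -0.2, 1.0])   # parameter vector (made-up values)
phi_s = np.array([1.0, 0.0, 2.0])    # feature vector for some state s

v_s = theta @ phi_s                  # linear value estimate V_t(s) = 2.5
grad_v = phi_s                       # gradient of V_t(s) w.r.t. theta is phi_s
```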

Coarse Coding

Shaping Generalization in Coarse Coding

Learning and Coarse Coding

Tile Coding. A binary feature for each tile. The number of features present at any one time is constant. Binary features mean the weighted sum is easy to compute. It is also easy to compute the indices of the features present.
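A minimal sketch of how tile indices might be computed for a 2-D state using several offset tilings; the grid size, offsets, and interface are illustrative assumptions, not the book's CMAC implementation:

```python
def tile_indices(x, y, n_tilings=4, tiles_per_dim=8,
                 x_range=(0.0, 1.0), y_range=(0.0, 1.0)):
    """Return one active tile index per tiling for a 2-D point (sketch)."""
    x01 = (x - x_range[0]) / (x_range[1] - x_range[0])
    y01 = (y - y_range[0]) / (y_range[1] - y_range[0])
    indices = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)        # each tiling is shifted slightly
        col = min(int((x01 + offset) * tiles_per_dim), tiles_per_dim - 1)
        row = min(int((y01 + offset) * tiles_per_dim), tiles_per_dim - 1)
        indices.append(t * tiles_per_dim**2 + row * tiles_per_dim + col)
    return indices

# Exactly n_tilings features are active for any state, so the linear value
# estimate is just the sum of n_tilings weights.
print(tile_indices(0.3, 0.7))   # e.g. [42, 106, 178, 243]
```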

Tile Coding, Cont. Variations include irregular tilings and hashing. Tile coding is also known as CMAC, the "cerebellar model arithmetic computer" (Albus, 1971).

Radial Basis Functions (RBFs), e.g., Gaussians:

$$\phi_s(i) = \exp\!\left( -\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2} \right)$$
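A small sketch computing Gaussian RBF features for a scalar state; the centers and width are arbitrary illustrative choices:

```python
import numpy as np

def rbf_features(s, centers, sigma):
    """Gaussian RBF features: phi_s(i) = exp(-(s - c_i)^2 / (2 sigma^2))."""
    return np.exp(-(s - centers) ** 2 / (2.0 * sigma ** 2))

centers = np.linspace(0.0, 1.0, 5)   # 5 centers spread over [0, 1]
sigma = 0.25                         # common width for all features
print(rbf_features(0.4, centers, sigma))
```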

Can You Beat the Curse of Dimensionality? Can you keep the number of features from going up exponentially with the dimension? Function complexity, not dimensionality, is the problem. Kanerva coding: select a set of binary prototypes and use Hamming distance as the distance measure; dimensionality is then no longer a problem, only complexity. Lazy learning schemes: remember all the data; to get a new value, find the nearest neighbors and interpolate, e.g., locally weighted regression.
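A minimal sketch of Kanerva-style features: binary prototypes are drawn at random, and a state's binary code activates every prototype within a Hamming-distance threshold. The prototype count, code length, and threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def kanerva_features(state_bits, prototypes, threshold):
    """Binary features: 1 where the Hamming distance to a prototype is <= threshold."""
    distances = np.sum(prototypes != state_bits, axis=1)   # Hamming distances
    return (distances <= threshold).astype(float)

prototypes = rng.integers(0, 2, size=(50, 16))   # 50 random 16-bit prototypes
state_bits = rng.integers(0, 2, size=16)         # binary description of a state
phi = kanerva_features(state_bits, prototypes, threshold=5)
print(int(phi.sum()), "prototypes active")
```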

Control with FA. Learning state-action values. Training examples of the form: (description of $(s_t, a_t)$, $v_t$). The general gradient-descent rule:

$$\vec\theta_{t+1} = \vec\theta_t + \alpha\,\big[v_t - Q_t(s_t, a_t)\big]\,\nabla_{\vec\theta} Q_t(s_t, a_t)$$

Gradient-descent Sarsa(λ) (backward view):

$$\vec\theta_{t+1} = \vec\theta_t + \alpha\,\delta_t\,\vec{e}_t$$

where

$$\delta_t = r_{t+1} + \gamma\,Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \quad\text{and}\quad \vec{e}_t = \gamma\lambda\,\vec{e}_{t-1} + \nabla_{\vec\theta} Q_t(s_t, a_t)$$
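A minimal sketch of gradient-descent Sarsa(λ) with a linear action-value function and ε-greedy action selection; env and features(s, a) (state-action feature vectors) are hypothetical helpers, not the book's implementation:

```python
import numpy as np

def sarsa_lambda(env, features, n_features, actions,
                 alpha=0.05, gamma=1.0, lam=0.9, epsilon=0.1, n_episodes=200):
    """Gradient-descent Sarsa(lambda), backward view, with linear FA (sketch)."""
    rng = np.random.default_rng(0)
    theta = np.zeros(n_features)

    def q(s, a):
        return theta @ features(s, a)

    def epsilon_greedy(s):
        if rng.random() < epsilon:
            return actions[rng.integers(len(actions))]
        return max(actions, key=lambda a: q(s, a))

    for _ in range(n_episodes):
        e = np.zeros(n_features)                  # eligibility traces
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            delta = r - q(s, a)                   # TD error, built in two steps
            if not done:
                a_next = epsilon_greedy(s_next)
                delta += gamma * q(s_next, a_next)
            e = gamma * lam * e + features(s, a)  # accumulate trace
            theta += alpha * delta * e
            if not done:
                s, a = s_next, a_next
    return theta
```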

GPI with Linear Gradient-Descent Watkins's Q(λ)

GPI with Linear Gradient-Descent Sarsa(λ)

Mountain-Car Task
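The task itself is shown only as a figure on the slide, so here is a sketch of the mountain-car dynamics as they are usually specified (reward of −1 per step, three throttle actions, position and velocity bounds); treat the exact constants as assumptions:

```python
import math

class MountainCar:
    """Sketch of the standard mountain-car task; constants are assumed, not verified."""
    MIN_POS, GOAL_POS = -1.2, 0.5
    MAX_VEL = 0.07

    def reset(self):
        self.pos, self.vel = -0.5, 0.0           # start near the bottom of the valley
        return (self.pos, self.vel)

    def step(self, action):                       # action in {-1, 0, +1}: reverse, coast, forward
        self.vel += 0.001 * action - 0.0025 * math.cos(3 * self.pos)
        self.vel = max(-self.MAX_VEL, min(self.MAX_VEL, self.vel))
        self.pos += self.vel
        if self.pos < self.MIN_POS:               # hitting the left wall zeroes the velocity
            self.pos, self.vel = self.MIN_POS, 0.0
        done = self.pos >= self.GOAL_POS          # goal is reaching the top of the right hill
        return (self.pos, self.vel), -1.0, done   # reward is -1 on every step
```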

Mountain Car with Radial Basis Functions

Mountain-Car Results

Baird's Counterexample

Baird's Counterexample, Cont.

Should We Bootstrap?

Summary. Generalization; adapting supervised-learning function approximation methods; gradient-descent methods; linear gradient-descent methods (radial basis functions, tile coding, Kanerva coding); nonlinear gradient-descent methods? backpropagation? Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction.

Value Prediction with FA. As usual, policy evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^\pi$. In earlier chapters, value functions were stored in lookup tables. Here, the value function estimate at time $t$, $V_t$, depends on a parameter vector $\vec\theta_t$, and only the parameter vector is updated; e.g., $\vec\theta_t$ could be the vector of connection weights of a neural network.

Adapt Supervised Learning Algorithms. (Diagram: inputs feed a supervised learning system, which produces outputs; the training info supplied is the desired (target) outputs.) Training example = {input, target output}. Error = (target output − actual output).

Backups as Training Examples. E.g., the TD(0) backup:

$$V(s_t) \leftarrow V(s_t) + \alpha\,\big[r_{t+1} + \gamma\,V(s_{t+1}) - V(s_t)\big]$$

As a training example, the input is a description of $s_t$ and the target output is $r_{t+1} + \gamma\,V(s_{t+1})$.
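To make the "backup as training example" view concrete, a tiny sketch that turns one observed transition into a supervised (input, target) pair; value_estimate is a hypothetical placeholder for the current value function:

```python
def td0_training_example(s, r_next, s_next, value_estimate, gamma=1.0):
    """Turn one transition (s, r_next, s_next) into a supervised pair (sketch)."""
    target = r_next + gamma * value_estimate(s_next)   # TD(0) target
    return s, target                                   # (input, target output)
```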

Gradient Descent, Cont. For the MSE given above and using the chain rule:

$$\begin{aligned}
\vec\theta_{t+1} &= \vec\theta_t - \tfrac{1}{2}\alpha\,\nabla_{\vec\theta}\,\mathrm{MSE}(\vec\theta_t) \\
&= \vec\theta_t - \tfrac{1}{2}\alpha\,\nabla_{\vec\theta} \sum_{s \in S} P(s)\,\big[V^\pi(s) - V_t(s)\big]^2 \\
&= \vec\theta_t + \alpha \sum_{s \in S} P(s)\,\big[V^\pi(s) - V_t(s)\big]\,\nabla_{\vec\theta} V_t(s)
\end{aligned}$$

Gradient Descent, Cont. Use just the sample gradient instead:

$$\vec\theta_{t+1} = \vec\theta_t - \tfrac{1}{2}\alpha\,\nabla_{\vec\theta}\big[V^\pi(s_t) - V_t(s_t)\big]^2 = \vec\theta_t + \alpha\,\big[V^\pi(s_t) - V_t(s_t)\big]\,\nabla_{\vec\theta} V_t(s_t)$$

Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if $\alpha$ decreases appropriately with $t$:

$$E\Big\{\big[V^\pi(s_t) - V_t(s_t)\big]\,\nabla_{\vec\theta} V_t(s_t)\Big\} = \sum_{s \in S} P(s)\,\big[V^\pi(s) - V_t(s)\big]\,\nabla_{\vec\theta} V_t(s)$$

But We Don't Have These Targets. Suppose we just have targets $v_t$ instead:

$$\vec\theta_{t+1} = \vec\theta_t + \alpha\,\big[v_t - V_t(s_t)\big]\,\nabla_{\vec\theta} V_t(s_t)$$

If each $v_t$ is an unbiased estimate of $V^\pi(s_t)$, i.e., $E\{v_t\} = V^\pi(s_t)$, then gradient descent converges to a local minimum (provided $\alpha$ decreases appropriately). E.g., with the Monte Carlo target $v_t = R_t$:

$$\vec\theta_{t+1} = \vec\theta_t + \alpha\,\big[R_t - V_t(s_t)\big]\,\nabla_{\vec\theta} V_t(s_t)$$
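A minimal sketch of gradient-descent Monte Carlo prediction with linear features, using the observed return $R_t$ as the target; run_episode (which returns a list of (state, reward) pairs generated under the policy) and features are hypothetical helpers:

```python
import numpy as np

def gradient_monte_carlo(run_episode, features, n_features,
                         alpha=0.01, gamma=1.0, n_episodes=500):
    """Gradient-descent MC prediction: the target v_t is the return R_t (sketch)."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        episode = run_episode()       # [(s_0, r_1), (s_1, r_2), ...] under the policy
        G = 0.0
        for s, r in reversed(episode):              # work backwards to accumulate returns
            G = r + gamma * G                       # return from state s onward
            v = theta @ features(s)
            theta += alpha * (G - v) * features(s)  # grad of linear V is features(s)
    return theta
```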

What About TD(λ) Targets?

$$\vec\theta_{t+1} = \vec\theta_t + \alpha\,\big[R_t^{\lambda} - V_t(s_t)\big]\,\nabla_{\vec\theta} V_t(s_t)$$

The λ-return $R_t^{\lambda}$ is not an unbiased estimate of $V^\pi(s_t)$ for $\lambda < 1$. But we do it anyway, using the backwards view:

$$\vec\theta_{t+1} = \vec\theta_t + \alpha\,\delta_t\,\vec{e}_t,$$

where

$$\delta_t = r_{t+1} + \gamma\,V_t(s_{t+1}) - V_t(s_t), \text{ as usual, and} \quad \vec{e}_t = \gamma\lambda\,\vec{e}_{t-1} + \nabla_{\vec\theta} V_t(s_t)$$
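For contrast with the backward-view TD(λ) code sketched earlier, here is a small sketch of the forward-view λ-return for a finite episode, mixing the n-step returns with weights $(1-\lambda)\lambda^{n-1}$ and putting the remaining weight on the full return; the rewards and values arrays are assumed inputs:

```python
def lambda_return(rewards, values, t, lam=0.9, gamma=1.0):
    """Forward-view lambda-return R_t^lambda for an episode of length T (sketch).

    rewards[k] is r_{k+1}; values[k] is V(s_k); values has length T+1 with
    values[T] = 0 for the terminal state.
    """
    T = len(rewards)
    G = 0.0             # n-step truncated sum of rewards, built incrementally
    discount = 1.0
    R_lambda = 0.0
    for n in range(1, T - t + 1):
        G += discount * rewards[t + n - 1]
        discount *= gamma
        n_step_return = G + discount * values[t + n]
        if t + n < T:
            R_lambda += (1 - lam) * lam ** (n - 1) * n_step_return
        else:
            R_lambda += lam ** (n - 1) * n_step_return   # remaining weight on the full return
    return R_lambda
```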