CSE 190: Reinforcement Learning: An Introduction. Chapter 8: Generalization and Function Approximation. Pop Quiz: What Function Are We Approximating?

CSE 190: Reinforcement Learning: An Introduction
Chapter 8: Generalization and Function Approximation

Objectives of this chapter:
- Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part: generalization.
- How to accomplish generalization by using parameterized functions that approximate the value or Q functions: function approximation.
Acknowledgment: a good number of these slides are cribbed from Rich Sutton.

Function Approximation Methods
Many possible methods: artificial neural networks, decision trees, multivariate regression methods, kernel density methods. But RL has some special requirements: we usually want to learn while interacting, and we want to be able to handle nonstationarity. In this chapter we focus on linear function approximators: simple models, simple learning rules.

Pop Quiz: What Function Are We Approximating?
The value function and/or the Q-value function. The methods we have used so far are called tabular methods: the V and Q functions were essentially tables, with V(s) represented in an array. But these are still functions: V(s) is a mapping from states to values, and Q(s, a) is a mapping from state-action pairs to values. The problem is that an entry in an array doesn't tell you anything about the entry in the next cell: no generalization. The goal is that learning something about a state should generalize to nearby states.
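To make the contrast concrete, here is a minimal Python sketch (not from the slides; the overlapping-interval feature function is a made-up illustration) of a tabular value array versus a parameterized linear value function V(s) = \theta^T F(s), where updating \theta for one state also moves the estimates of nearby states that share features.

```python
import numpy as np

# Tabular representation: one independent entry per state.
# Updating V_table[3] says nothing about V_table[4] -- no generalization.
V_table = np.zeros(10)
V_table[3] = 1.0

# Parameterized representation: V(s) = theta^T F(s).
def features(s, centers=np.linspace(0, 9, 5), width=2.0):
    """Hypothetical overlapping binary features: 1 if s is within `width` of a center."""
    return (np.abs(centers - s) <= width).astype(float)

theta = np.zeros(5)

def V(s):
    return theta @ features(s)

# A single update toward a target of 1.0 at s = 3 also changes V(2) and V(4),
# because those states activate some of the same features.
alpha = 0.5
theta += alpha * (1.0 - V(3)) * features(3)
print(V(3), V(4))
```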

Linear Function Approximators
Assume V is a (sufficiently smooth) differentiable function. Represent states as vectors of features (note: the notation here differs from the book):
F(s) = (f_1(s), f_2(s), ..., f_n(s))^T
Then represent the value as a linear combination of the features:
V(s) = \theta^T F(s) = \sum_{i=1}^{n} \theta_i f_i(s)
i.e., a simple weighted sum of the features. Often one of the features (e.g., f_0(s)) is a constant.

Gradient Descent Methods
Assume, for now, training examples of this form:
{(s_1, V^\pi(s_1)), (s_2, V^\pi(s_2)), ..., (s_T, V^\pi(s_T))}
where each state is described by its feature vector, s_i = (f_1(s_i), f_2(s_i), ..., f_n(s_i)). Our goal is V(s_i) \approx V^\pi(s_i) for all s \in S, achieved by modifying \theta = (\theta_1, \theta_2, ..., \theta_n)^T.

This requires a performance measure; a common and simple one is the sum-squared error (SSE) over a probability distribution P:
SSE(\theta) = \sum_{s \in S} P(s) [V^\pi(s) - V(s)]^2
Why P? Why minimize SSE? Let us assume that P is always the distribution of states at which backups are done: the on-policy distribution, the distribution created while following the policy being evaluated. Stronger results are available for this distribution.

Gradient Descent
Main idea: change \theta in order to go downhill in the sum-squared error. Which way is downhill? The negative of the slope of the SSE with respect to \theta; when the slope is a vector, it is called the gradient. For a two-parameter illustration, the current parameters are \theta = (\theta_1, \theta_2)^T, and the gradient (downhill direction) in the parameters is
-\nabla_\theta SSE(\theta) = -\left( \partial SSE / \partial \theta_1, \ \partial SSE / \partial \theta_2 \right)^T
Iteratively move down the gradient:
\theta_{t+1} = \theta_t - \alpha \nabla_\theta SSE(\theta_t)
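A small sketch of this objective and the batch update, under assumed inputs: `v_pi` is a dict of target values standing in for V^\pi(s), `P` a dict of state probabilities, and `features(s)` returns F(s) as a NumPy vector; none of these names come from the slides.

```python
import numpy as np

def sse(theta, states, v_pi, P, features):
    """SSE(theta) = sum_s P(s) * (V_pi(s) - theta^T F(s))^2."""
    return sum(P[s] * (v_pi[s] - theta @ features(s)) ** 2 for s in states)

def batch_gradient_step(theta, states, v_pi, P, features, alpha=0.1):
    """One full gradient-descent step: theta <- theta - alpha * grad SSE(theta)."""
    grad = np.zeros_like(theta)
    for s in states:
        error = v_pi[s] - theta @ features(s)
        grad += -2.0 * P[s] * error * features(s)   # d/d theta of P(s) * error^2
    return theta - alpha * grad
```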

Gradient Descent
What is -\nabla_\theta SSE?
-\nabla_\theta SSE = -\left( \frac{\partial SSE}{\partial \theta_1}, \frac{\partial SSE}{\partial \theta_2}, ..., \frac{\partial SSE}{\partial \theta_n} \right)^T
Looking at one component of this vector (for a single state s, with a factor of 1/2 included so the 2 cancels):
-\frac{\partial SSE}{\partial \theta_i} = -\frac{\partial}{\partial \theta_i} \frac{1}{2} (V^\pi(s) - V(s))^2 = -(V^\pi(s) - V(s)) \left(-\frac{\partial V(s)}{\partial \theta_i}\right) = (V^\pi(s) - V(s)) \frac{\partial V(s)}{\partial \theta_i}
So this is what -\nabla_\theta SSE is:
-\nabla_\theta SSE = (V^\pi(s) - V(s)) \left( \frac{\partial V(s)}{\partial \theta_1}, \frac{\partial V(s)}{\partial \theta_2}, ..., \frac{\partial V(s)}{\partial \theta_n} \right)^T
Now, what \nabla_\theta V(s) is depends on your function approximator.

Linear Function Approximators
Representing the value function as a linear combination of features makes the derivative very simple:
V(s) = \theta^T F(s) = \sum_{j=1}^{n} \theta_j f_j(s), so \frac{\partial V(s)}{\partial \theta_i} = f_i(s)
So the learning rule becomes:
\theta_i \leftarrow \theta_i + \alpha (V^\pi(s) - V(s)) f_i(s)
Why does this make sense?
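In practice the rule is applied per sample rather than over the whole sum; a minimal vectorized sketch (assuming NumPy arrays and a hypothetical `features` function):

```python
def linear_sgd_update(theta, s, v_target, features, alpha=0.1):
    """theta_i <- theta_i + alpha * (V_pi(s) - V(s)) * f_i(s), for all i at once.
    v_target stands in for the teacher V_pi(s); features(s) returns F(s)."""
    f = features(s)
    v_hat = theta @ f                      # V(s) = theta^T F(s)
    return theta + alpha * (v_target - v_hat) * f
```

Each weight moves in proportion to how strongly its feature is active in s, so the error at one state is shared across all states that activate the same features.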

Targets
The learning rule is \theta_i \leftarrow \theta_i + \alpha (V^\pi(s) - V(s)) f_i(s). Where does the teacher V^\pi(s) come from? We don't actually have access to V^\pi(s). However, if we have something whose expected value is V^\pi(s), we are guaranteed to converge to the optimal solution. This is true of the Monte Carlo return R_t, and of TD(\lambda)'s return R_t^\lambda (but only if \lambda = 1). It is not true of R_t^\lambda for \lambda < 1, or of dynamic programming targets, but these methods work well anyway. Using the TD error
\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)
we update the eligibility trace using the gradient:
e_t = \gamma \lambda e_{t-1} + \nabla_\theta V_t(s_t)

On-Line Gradient-Descent TD(\lambda)

Nice Properties of Linear FA Methods
The gradient is very simple: \nabla_\theta V_t(s) = F(s). For MSE, the error surface is simple: a quadratic surface with a single minimum. Linear gradient-descent TD(\lambda) converges provided the step size decreases appropriately and sampling is on-line (states sampled from the on-policy distribution). It converges to a parameter vector \theta_\infty with the property
MSE(\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma} MSE(\theta^*)
where \theta^* is the best parameter vector (Tsitsiklis & Van Roy, 1997).
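Putting the pieces together, here is a sketch (under stated assumptions, not the book's boxed pseudocode) of on-line gradient-descent TD(\lambda) for linear prediction. `env_step` is a hypothetical interface that takes a state, acts with the policy being evaluated, and returns (reward, next_state, done); `features(s)` returns F(s).

```python
import numpy as np

def linear_td_lambda(env_step, features, n_features, start_state,
                     alpha=0.05, gamma=0.99, lam=0.8, n_steps=10_000):
    """On-line gradient-descent TD(lambda) with a linear approximator (sketch).
    For linear V, grad_theta V(s) is just the feature vector F(s)."""
    theta = np.zeros(n_features)
    e = np.zeros(n_features)                  # eligibility trace
    s = start_state
    for _ in range(n_steps):
        r, s_next, done = env_step(s)         # one step under the evaluated policy
        f = features(s)
        v = theta @ f
        v_next = 0.0 if done else theta @ features(s_next)
        delta = r + gamma * v_next - v        # TD error delta_t
        e = gamma * lam * e + f               # e_t = gamma*lambda*e_{t-1} + grad V(s_t)
        theta += alpha * delta * e
        if done:
            e = np.zeros(n_features)          # traces reset at episode boundaries
            s = start_state
        else:
            s = s_next
    return theta
```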

Coarse Coding

Shaping Generalization in Coarse Coding

Learning and Coarse Coding

Tile Coding
- A binary feature for each tile.
- The number of features present at any one time is constant.
- Binary features mean the weighted sum is easy to compute.
- It is easy to compute the indices of the features present (a code sketch follows).
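A toy tile coder to illustrate those points; this is a simplified stand-in, not the CMAC or the tiling software distributed with the book, and the grid size and offsets are arbitrary choices.

```python
import numpy as np

def active_tiles(x, y, n_tilings=8, tiles_per_dim=10, lo=0.0, hi=1.0):
    """Return the indices of the binary features that are 'on' for the point (x, y).
    Each tiling is a uniform grid shifted by a different fraction of a tile width,
    so exactly n_tilings features are active for any input."""
    width = (hi - lo) / tiles_per_dim
    grid = tiles_per_dim + 1                  # one extra cell absorbs the offset
    indices = []
    for t in range(n_tilings):
        offset = t * width / n_tilings
        ix = min(int((x - lo + offset) / width), grid - 1)
        iy = min(int((y - lo + offset) / width), grid - 1)
        indices.append(t * grid * grid + iy * grid + ix)
    return indices

# Because the features are binary, the value is just a sum of the active weights:
theta = np.zeros(8 * 11 * 11)
def V(x, y):
    return sum(theta[i] for i in active_tiles(x, y))
```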

Tile Coding, Cont.
Irregular tilings, hashing, CMAC ("cerebellar model arithmetic computer", Albus 1971).

Radial Basis Functions (RBFs)
E.g., Gaussians:
f_i(s) = \exp\left( -\frac{\| s - c_i \|^2}{2 \sigma_i^2} \right)

Can you beat the curse of dimensionality?
Can you keep the number of features from going up exponentially with the dimension? Function complexity, not dimensionality, is the problem.
Kanerva coding: select a set of binary prototypes and use Hamming distance as the distance measure. Dimensionality is no longer a problem, only complexity.
Lazy learning schemes: remember all the data; to get a new value, find the nearest neighbors and interpolate, e.g., locally weighted regression.

Control with FA
Learning state-action values. The general gradient-descent rule:
\theta_{t+1} = \theta_t + \alpha [v_t - Q_t(s_t, a_t)] \nabla_\theta Q_t(s_t, a_t)
Gradient-descent Sarsa(\lambda) (backward view):
\theta_{t+1} = \theta_t + \alpha \delta_t e_t
where
\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)
e_t = \gamma \lambda e_{t-1} + \nabla_\theta Q_t(s_t, a_t)
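Below is a sketch in the same spirit: a Gaussian RBF feature vector, and a single backward-view gradient-descent Sarsa(\lambda) update with linear Q. The helper `q_features(s, a)` is an assumption for illustration (e.g., the RBF vector for s stacked into the block belonging to action a); it is not something defined in the slides.

```python
import numpy as np

def rbf_features(s, centers, sigma=0.2):
    """f_i(s) = exp(-||s - c_i||^2 / (2*sigma^2)), with a shared width sigma here.
    `centers` is an (n, d) array of prototype states; s is a length-d array."""
    d2 = np.sum((centers - np.asarray(s)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def sarsa_lambda_step(theta, e, s, a, r, s_next, a_next, done, q_features,
                      alpha=0.05, gamma=1.0, lam=0.9):
    """One backward-view gradient-descent Sarsa(lambda) update with linear Q.
    For linear Q, grad_theta Q(s, a) is just the feature vector q_features(s, a)."""
    f = q_features(s, a)
    q = theta @ f
    q_next = 0.0 if done else theta @ q_features(s_next, a_next)
    delta = r + gamma * q_next - q            # delta_t
    e = gamma * lam * e + f                   # accumulating trace
    theta = theta + alpha * delta * e
    return theta, e
```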

GPI Linear Gradient-Descent Watkins' Q(\lambda)

Linear Gradient-Descent Sarsa(\lambda) (with GPI)

Mountain-Car Task

Mountain Car with Radial Basis Functions

Mountain-Car Results

Baird's Counterexample

Baird's Counterexample, Cont.

Should We Bootstrap?

Summary
- Generalization
- Adapting supervised-learning function approximation methods
- Gradient-descent methods
- Linear gradient-descent methods: radial basis functions, tile coding, Kanerva coding
- Nonlinear gradient-descent methods? Backpropagation?
- Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction

Value Prediction with FA
As usual: policy evaluation (the prediction problem): for a given policy \pi, compute the state-value function V^\pi. In earlier chapters, value functions were stored in lookup tables. Here, the value function estimate at time t, V_t, depends on a parameter vector \theta_t, and only the parameter vector is updated. E.g., \theta_t could be the vector of connection weights of a neural network.

Adapt Supervised Learning Algorithms
Training info = desired (target) outputs. Inputs -> Supervised Learning System -> Outputs. Training example = {input, target output}. Error = (target output - actual output).

Backups as Training Examples
E.g., the TD(0) backup:
V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]
As a training example:
{ description of s_t, \ r_{t+1} + \gamma V(s_{t+1}) }
(input, target output)

Gradient Descent, Cont.
For the MSE given above and using the chain rule:
\theta_{t+1} = \theta_t - \frac{1}{2} \alpha \nabla_\theta MSE(\theta_t)
            = \theta_t - \frac{1}{2} \alpha \nabla_\theta \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)]^2
            = \theta_t + \alpha \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)] \nabla_\theta V_t(s)

Gradient Descent, Cont.
Use just the sample gradient instead:
\theta_{t+1} = \theta_t - \frac{1}{2} \alpha \nabla_\theta [V^\pi(s_t) - V_t(s_t)]^2
            = \theta_t + \alpha [V^\pi(s_t) - V_t(s_t)] \nabla_\theta V_t(s_t)
Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if \alpha decreases appropriately with t:
E\{ [V^\pi(s_t) - V_t(s_t)] \nabla_\theta V_t(s_t) \} = \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)] \nabla_\theta V_t(s)

But We Don't Have These Targets
Suppose we just have targets v_t instead:
\theta_{t+1} = \theta_t + \alpha [v_t - V_t(s_t)] \nabla_\theta V_t(s_t)
If each v_t is an unbiased estimate of V^\pi(s_t), i.e., E\{v_t\} = V^\pi(s_t), then gradient descent converges to a local minimum (provided \alpha decreases appropriately). E.g., the Monte Carlo target v_t = R_t:
\theta_{t+1} = \theta_t + \alpha [R_t - V_t(s_t)] \nabla_\theta V_t(s_t)
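As a concrete instance of using an unbiased target, here is a sketch of gradient Monte Carlo prediction with the return R_t as v_t; the episode format and the `features` function are assumptions for illustration.

```python
import numpy as np

def gradient_mc_prediction(episodes, features, n_features, alpha=0.01, gamma=1.0):
    """theta <- theta + alpha * (R_t - V(s_t)) * grad V(s_t), with linear V.
    `episodes` is assumed to be a list of trajectories, each a list of
    (s_t, r_{t+1}) pairs, so the return can be accumulated backwards."""
    theta = np.zeros(n_features)
    for episode in episodes:
        G = 0.0
        for s, r in reversed(episode):
            G = r + gamma * G                 # return R_t from state s
            f = features(s)
            theta += alpha * (G - theta @ f) * f
    return theta
```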

What about TD(\lambda) Targets?
\theta_{t+1} = \theta_t + \alpha [R_t^\lambda - V_t(s_t)] \nabla_\theta V_t(s_t)
R_t^\lambda is not an unbiased estimate of V^\pi(s_t) for \lambda < 1, but we do it anyway, using the backwards view:
\theta_{t+1} = \theta_t + \alpha \delta_t e_t
where
\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), as usual, and
e_t = \gamma \lambda e_{t-1} + \nabla_\theta V_t(s_t)

END