Generalization and Function Approximation


Suggested reading: Chapter 8 in R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

Contents:
Value prediction with function approximation
Gradient-descent methods
Linear methods
Control with function approximation

Problems with Tables

So far, value functions have been represented as a table with one entry for each state or for each state-action pair. This limits us to tasks with small numbers of states and actions, and it cannot handle state or action spaces that include continuous variables or complex sensations (such as a visual image).

Value Prediction with Function Approximation

As usual, we consider policy evaluation (the prediction problem): for a given policy \pi, compute the state-value function V^\pi.
In earlier chapters, value functions were stored in lookup tables. Here, the value-function estimate at time t, V_t, is a parameterized functional form with parameter vector \theta_t, and only the parameter vector is updated.
For example, V_t might be the function computed by an artificial neural network, with \theta_t the vector of connection weights.
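To make the idea of a parameterized value estimate concrete, here is a minimal sketch (not from the slides), assuming a hypothetical feature mapping `features(s)`; the linear form is just one possible choice of functional form:

```python
import numpy as np

def features(s):
    # Hypothetical feature mapping for a scalar state; a real task would
    # use a domain-specific encoding (tile coding, RBFs, ...).
    return np.array([1.0, s, s ** 2])

def value(theta, s):
    # V_t(s): a parameterized functional form, here a simple linear form.
    return float(theta @ features(s))

theta = np.zeros(3)        # parameter vector theta_t; only this is updated
print(value(theta, 0.5))   # 0.0 before any learning
```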

Adapt Supervised Learning Algorithms

Training info = desired (target) outputs. Inputs go into the supervised learning system, which produces outputs.
Training example = {input, target output}
Error = (target output - actual output)

Performance Measures

Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P:

MSE(\theta_t) = \sum_{s \in S} P(s) \, [V^\pi(s) - V_t(s)]^2

P is a distribution weighting the errors of different states. Let us assume that P is always the distribution of states from which the training examples are drawn, and thus the one at which backups are done.
The on-policy distribution is the distribution created while following the policy being evaluated; it describes the frequency with which states are encountered.
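As a small illustrative sketch (assumptions: `P` and `v_pi` are dictionaries over a finite state set, and `value_fn` is a parameterized estimate such as the one sketched earlier), the MSE objective can be written as:

```python
def mse(theta, states, P, v_pi, value_fn):
    # MSE(theta) = sum over s of P(s) * (V_pi(s) - V_theta(s))^2
    return sum(P[s] * (v_pi[s] - value_fn(theta, s)) ** 2 for s in states)
```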

Ideal Goal

Find a parameter vector \theta^* that attains the global optimum:

MSE(\theta^*) \le MSE(\theta) \quad \forall \theta

With complex function approximators we may instead have to settle for convergence to a local optimum.

Backups as Training Examples

Training example: { description of s_t , V^\pi(s_t) }, i.e. input and target output.

Backups as Training Examples (cont.)

We don't know the real target values, but we can use estimates. As a training example: { description of s_t , v_t }, again with input and target output.

Any FA Method?

In principle, yes: artificial neural networks, decision trees, multivariate regression methods, etc.
But RL has some special requirements: we usually want to learn while interacting, and the method must be able to handle non-stationarity, among other things.

Gradient Descent Methods

\theta_t = (\theta_t(1), \theta_t(2), \ldots, \theta_t(n))^T

Assume V_t is a (sufficiently smooth) differentiable function of \theta_t for all s \in S.

Recall: Gradient Descent

Let f be any function of the parameter space. Its gradient at any point \theta_t in this space is:

\nabla_\theta f(\theta_t) = \left( \frac{\partial f(\theta_t)}{\partial \theta(1)}, \frac{\partial f(\theta_t)}{\partial \theta(2)}, \ldots, \frac{\partial f(\theta_t)}{\partial \theta(n)} \right)^T

Iteratively move down the gradient:

\theta_{t+1} = \theta_t - \alpha \nabla_\theta f(\theta_t)

Gradient Descent (cont.)

For the MSE given above and using the chain rule:

\theta_{t+1} = \theta_t - \tfrac{1}{2} \alpha \nabla_\theta MSE(\theta_t)
            = \theta_t - \tfrac{1}{2} \alpha \nabla_\theta \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)]^2
            = \theta_t + \alpha \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)] \nabla_\theta V_t(s)

Alternatively, use just the sample gradient instead:

\theta_{t+1} = \theta_t - \tfrac{1}{2} \alpha \nabla_\theta [V^\pi(s_t) - V_t(s_t)]^2
            = \theta_t + \alpha [V^\pi(s_t) - V_t(s_t)] \nabla_\theta V_t(s_t)

Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if \alpha decreases appropriately with t:

E\big[ (V^\pi(s_t) - V_t(s_t)) \nabla_\theta V_t(s_t) \big] = \sum_{s \in S} P(s) [V^\pi(s) - V_t(s)] \nabla_\theta V_t(s)
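A sketch of one sample-gradient step, assuming hypothetical helpers `value_fn(theta, s)` and `grad_fn(theta, s)` (the latter returning \nabla_\theta V_t(s)); given a true target, this is exactly the update above:

```python
import numpy as np

def sample_gradient_step(theta, alpha, target, s, value_fn, grad_fn):
    # theta_{t+1} = theta_t + alpha * [target - V_t(s_t)] * grad V_t(s_t)
    error = target - value_fn(theta, s)
    return theta + alpha * error * grad_fn(theta, s)
```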

But We Don't Have These Targets

Suppose we just have targets v_t instead:

\theta_{t+1} = \theta_t + \alpha [v_t - V_t(s_t)] \nabla_\theta V_t(s_t)

e.g., the Monte Carlo target v_t = R_t:

\theta_{t+1} = \theta_t + \alpha [R_t - V_t(s_t)] \nabla_\theta V_t(s_t)

What about TD(\lambda) Targets?

\theta_{t+1} = \theta_t + \alpha [R^\lambda_t - V_t(s_t)] \nabla_\theta V_t(s_t)

This is not a true gradient method for \lambda < 1, but we do it anyway, using the backwards view:

\theta_{t+1} = \theta_t + \alpha \delta_t e_t

where \delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), as usual, and

e_t = \gamma \lambda e_{t-1} + \nabla_\theta V_t(s_t)

On-Line Gradient-Descent TD(\lambda)

Linear Methods

Represent states as feature vectors: for each s \in S,

\phi_s = (\phi_s(1), \phi_s(2), \ldots, \phi_s(n))^T

V_t(s) = \theta_t^T \phi_s = \sum_{i=1}^{n} \theta_t(i) \phi_s(i)

\nabla_\theta V_t(s) = ?
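The transcription does not reproduce the pseudocode from the on-line gradient-descent TD(\lambda) slide, so here is a minimal sketch for the linear case, where \nabla_\theta V_t(s) = \phi_s. The environment interface (`env_reset`, `env_step`) and the feature map `phi` are assumed placeholders for a concrete task:

```python
import numpy as np

def linear_td_lambda(env_reset, env_step, phi, n_features,
                     episodes=100, alpha=0.01, gamma=0.99, lam=0.8):
    """On-line gradient-descent TD(lambda) with a linear value function.

    Assumed interfaces: env_reset() -> initial state, and
    env_step(s) -> (r, s_next, done) under the policy being evaluated;
    phi(s) returns a feature vector of length n_features.
    """
    theta = np.zeros(n_features)
    for _ in range(episodes):
        s = env_reset()
        e = np.zeros(n_features)                       # eligibility trace e_t
        done = False
        while not done:
            r, s_next, done = env_step(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - theta @ phi(s)  # TD error delta_t
            e = gamma * lam * e + phi(s)                 # e_t = gamma*lambda*e_{t-1} + grad V_t(s)
            theta += alpha * delta * e                   # gradient-descent update
            s = s_next
    return theta
```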

Nice Properties of Linear FA Methods

The gradient is very simple: \nabla_\theta V_t(s) = \phi_s
For MSE, the error surface is simple: a quadratic surface with a single minimum.
Linear gradient-descent TD(\lambda) converges if the step size decreases appropriately over time and states are sampled on-line from the on-policy distribution. It converges to a parameter vector \theta_\infty with

MSE(\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma} MSE(\theta^*)

(Tsitsiklis & Van Roy, 1997), where \theta^* is the best parameter vector.

Coarse Coding

Consider a task in which the state set is continuous and two-dimensional. A state is then a point in 2-space, a vector with two real components. One kind of feature for this case corresponds to circles in state space: a binary feature is active whenever the state lies inside its circle.

Shaping Generalization in Coarse Coding

Learning and Coarse Coding

Linear function approximation based on coarse coding was used to learn a one-dimensional square-wave function; the values of this function were used as targets. With just one dimension, the receptive fields were intervals rather than circles. With broad features, the generalization tended to be broad. However, the final function learned was affected only slightly by the width of the features.
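For the one-dimensional experiment described above, the interval-shaped receptive fields might be encoded as follows (a sketch; the number of features, their spacing, and their width are assumptions):

```python
import numpy as np

def interval_features(x, centers, width):
    # 1-D coarse coding: feature i is 1 when x lies inside the interval of
    # the given width centred at centers[i], and 0 otherwise.
    return (np.abs(np.asarray(centers) - x) <= width / 2.0).astype(float)

centers = np.linspace(0.0, 1.0, 50)                 # assumed receptive-field layout
phi = interval_features(0.37, centers, width=0.2)   # broad features -> broad generalization
print(int(phi.sum()), "features active")
```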

Tile Coding

There is a binary feature for each tile.
The number of features present at any one time is constant.
Binary features make the weighted sum easy to compute.
It is easy to compute the indices of the features present.

Tile Coding (cont.)

Irregular tilings. Hashing.
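A toy grid-tiling sketch for a two-dimensional state, illustrating the properties listed above (one active binary feature per tiling, so the number of active features is constant). It is not the hashed tile coder used in reference implementations, and the offsets here are fixed rather than random:

```python
import numpy as np

def tile_features(x, y, n_tilings=10, tiles_per_dim=9, lo=0.0, hi=1.0):
    """Toy tile coding for a 2-D state (x, y) in [lo, hi]^2.

    Each tiling is a tiles_per_dim x tiles_per_dim grid shifted by a fixed
    offset; exactly one binary feature per tiling is active.
    """
    tile_width = (hi - lo) / tiles_per_dim
    offsets = [(i / n_tilings) * tile_width for i in range(n_tilings)]
    features = np.zeros(n_tilings * tiles_per_dim * tiles_per_dim)
    for t, off in enumerate(offsets):
        col = int(np.clip((x - lo + off) / tile_width, 0, tiles_per_dim - 1))
        row = int(np.clip((y - lo + off) / tile_width, 0, tiles_per_dim - 1))
        features[t * tiles_per_dim ** 2 + row * tiles_per_dim + col] = 1.0
    return features
```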

Radial Basis Functions (RBFs)

e.g., Gaussian features: each feature's response is largest at its center and falls off smoothly with the distance of the state from that center.

Can You Beat the Curse of Dimensionality?

Can you keep the number of features from going up exponentially with the dimension? Function complexity, not dimensionality, is the problem.
Kanerva coding: select a set of binary prototypes and use Hamming distance as the distance measure. Dimensionality is then no longer a problem, only complexity.
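A sketch of Gaussian RBF features, assuming a chosen set of prototype centers and width sigma (both design choices):

```python
import numpy as np

def rbf_features(s, centers, sigma):
    # phi_i(s) = exp(-||s - c_i||^2 / (2 * sigma^2))
    # centers: (n_features, state_dim) array of prototype states.
    d2 = np.sum((np.asarray(centers) - np.asarray(s)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))
```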

Control with Function Approximation

Learning state-action values. Training examples are of the form { description of (s_t, a_t) , v_t }.

The general gradient-descent rule:

\theta_{t+1} = \theta_t + \alpha [v_t - Q_t(s_t, a_t)] \nabla_\theta Q_t(s_t, a_t)

Gradient-descent Sarsa(\lambda) (backward view):

\theta_{t+1} = \theta_t + \alpha \delta_t e_t

where \delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)
and e_t = \gamma \lambda e_{t-1} + \nabla_\theta Q_t(s_t, a_t)

Continuous action spaces are a topic of ongoing research; here we stick with discrete action spaces.

GPI with Linear Gradient-Descent Sarsa(\lambda)
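The GPI pseudocode itself is not reproduced in the transcription; the following is a minimal sketch of linear gradient-descent Sarsa(\lambda) with an epsilon-greedy policy, under an assumed environment interface (`env_reset`, `env_step`) and state-action feature map `phi_sa`:

```python
import numpy as np

def linear_sarsa_lambda(env_reset, env_step, phi_sa, n_features, n_actions,
                        episodes=200, alpha=0.05, gamma=1.0, lam=0.9, eps=0.1):
    """Gradient-descent Sarsa(lambda), backward view, with linear Q.

    Assumed interfaces: env_reset() -> s, env_step(s, a) -> (r, s_next, done),
    phi_sa(s, a) -> feature vector of length n_features.
    """
    theta = np.zeros(n_features)

    def q(s, a):
        return theta @ phi_sa(s, a)

    def epsilon_greedy(s):
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        s = env_reset()
        a = epsilon_greedy(s)
        e = np.zeros(n_features)                 # eligibility trace
        done = False
        while not done:
            r, s_next, done = env_step(s, a)
            delta = r - q(s, a)                  # delta_t = r + gamma*Q(s',a') - Q(s,a)
            if not done:
                a_next = epsilon_greedy(s_next)
                delta += gamma * q(s_next, a_next)
            e = gamma * lam * e + phi_sa(s, a)   # accumulating trace
            theta += alpha * delta * e
            if done:
                break
            s, a = s_next, a_next
    return theta
```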

GPI with Linear Gradient-Descent Watkins's Q(\lambda)

Mountain-Car Task

Example: drive an underpowered car up a steep mountain road. The difficulty is that gravity is stronger than the car's engine; the only solution is to first move away from the goal and up the opposite slope on the left.
Reward: -1 on all time steps until the car reaches the goal, which ends the episode.
Three actions: full throttle forward, full throttle reverse, and zero throttle.
State: position and velocity.
To convert the two continuous state variables to binary features, ten 9x9 grid tilings were used, each offset by a random fraction of a tile width. The initial action values were all zero, which was optimistic (all true values are negative in this task), causing extensive exploration to occur.
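For reference, a sketch of the mountain-car dynamics as they are commonly implemented (constants follow the usual formulation of this benchmark; the tile-coded Sarsa agent itself is not shown here):

```python
import math

def mountain_car_step(position, velocity, action):
    """One step of the standard mountain-car dynamics (action in {-1, 0, +1}).

    Reward is -1 on every step; the episode ends at the goal, position >= 0.5.
    """
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    if position < -1.2:            # inelastic left wall: velocity reset to zero
        position, velocity = -1.2, 0.0
    done = position >= 0.5
    return -1.0, position, velocity, done
```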

Mountain-Car Task (cont.)

(Figure: the negative of the value function, i.e. the cost-to-go, learned during training.)
The Sarsa algorithm (using replacing traces and the optional clearing) solved this task, learning a near-optimal policy within 100 episodes.

Mountain-Car Results

Off-Policy Bootstrapping

Bootstrapping methods are more difficult to combine with function approximation than are non-bootstrapping methods. In value prediction with linear, gradient-descent function approximation, non-bootstrapping methods find minimal-MSE solutions for any distribution of training examples, whereas bootstrapping methods find only near-minimal-MSE solutions, and only for the on-policy distribution. Moreover, the quality of the MSE bound for TD(\lambda) gets worse the farther \lambda strays from 1, that is, the farther the method moves from its non-bootstrapping form:

MSE(\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma} MSE(\theta^*)

Off-policy bootstrapping combined with function approximation can even lead to divergence and infinite MSE. Off-policy control methods do not back up states (or state-action pairs) with exactly the same distribution with which the states would be encountered when following the estimation policy (the policy whose value function they are estimating).

Should We Bootstrap?

The effect of bootstrapping can be investigated by using a TD method with eligibility traces and varying \lambda from 0 (pure bootstrapping) to 1 (pure non-bootstrapping).

Example: Topological RL in Continuous Action Spaces

A mobile robot whose continuous action is the steering angle; eight infrared sensors are placed around the robot.
Goal: docking at a target using the camera image.

Example: Topological RL in Continuous Action Spaces (cont.)

(Figures: sensor input and reward signal.)

Neural Gas

For each input signal, the neural gas algorithm sorts the units of the network according to the distance of their reference vectors to the input. Based on this rank order, a certain number of units is adapted towards the input. Both the number of adapted units and the adaptation strength decrease according to a fixed schedule.

T. M. Martinetz, S. G. Berkovich, and K. J. Schulten. Neural-gas network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4), 1993.
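A sketch of one neural-gas adaptation step under the rank-based rule described above; the decay schedules for the adaptation strength `eps` and the neighbourhood range `lam` are left to the caller:

```python
import numpy as np

def neural_gas_step(W, x, eps, lam):
    """One neural-gas adaptation step (rank-based rule, sketch only).

    W: (n_units, dim) float array of reference vectors (modified in place),
    x: input signal, eps: adaptation strength, lam: neighbourhood range.
    """
    dists = np.linalg.norm(W - x, axis=1)
    ranks = np.argsort(np.argsort(dists))             # rank 0 = closest unit
    W += eps * np.exp(-ranks / lam)[:, None] * (x - W)
    return W
```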

Example: Topological RL in Continuous Action Spaces (several figure-only slides)


Growing Neural Gas

The growing neural gas (GNG) algorithm starts with very few units, and new units are inserted successively. To determine where to insert new units, local error measures are gathered during the adaptation process. Each new unit is inserted near the unit that has accumulated the highest error.

B. Fritzke. A growing neural gas network learns topologies. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 625-632. MIT Press, Cambridge, MA, 1995.

Stability-Plasticity Dilemma

The problem: learning in a non-stationary environment. Learning in artificial neural networks inevitably implies forgetting: later input patterns tend to wash out previous knowledge. A purely stable network is unable to absorb new knowledge from its interactions, whereas a purely plastic network cannot preserve its knowledge.

Life-Long Learning Cell Structures

Hamker, F. H. (2001). Life-long learning cell structures - continuously learning without catastrophic interference. Neural Networks, 14:551-573.

A node of the life-long learning cell structures carries, besides the width of its Gaussian, error and age counters. In contrast to the inherited error, which remains fixed until the node is selected for insertion, the error counters are defined as moving averages with their own individual time constants.

Life-Long Learning Cell Structures: Algorithm

Repeat for each pattern (adaptation of the representation layer):
locate the node b that best matches the input pattern by applying a distance measure, and also find the second-best node s;
determine the quality measure for learning B_L for b and for its neighbors c separately;
determine the individual input learning rate of b and of its neighbors c;
move the input weights of node b and of its neighbors toward the presented pattern according to the individual learning rates.

Repeat for each pattern (adaptation of the output layer):
calculate the activation of all nodes in the representation layer with an RBF;
determine the individual output learning rate for all nodes in the representation layer;
adapt the output weights to match the desired output by applying the delta rule.

Insertion of nodes in the representation layer:
if a sufficient number of patterns has been presented since the insertion criterion was last applied, check insertion: find the node q with the highest value of the insertion criterion. The insertion criterion depends on each node's quality measure for insertion B_I and on its age.

Insert a new node if the insertion criterion is satisfied. Then evaluate the previous insertion: if the last insertion was not successful, increase the insertion threshold. (A minimal code sketch of a single pattern presentation follows.)
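A heavily simplified sketch of a single pattern presentation, following the steps listed above; the per-node quality measures, individual learning rates, age counters, and insertion logic of Hamker (2001) are omitted, and `lr_in`, `lr_out`, `sigma` are stand-in parameters:

```python
import numpy as np

def llcs_pattern_step(W_in, W_out, x, target, sigma, lr_in, lr_out):
    """Simplified life-long-learning-cell-structure step (illustrative only).

    W_in: (n_nodes, in_dim) input weights, W_out: (out_dim, n_nodes)
    output weights; both are modified in place.
    """
    # Representation layer: the best-matching node moves toward the pattern.
    dists = np.linalg.norm(W_in - x, axis=1)
    b = int(np.argmin(dists))                 # best-matching node
    W_in[b] += lr_in * (x - W_in[b])

    # Output layer: RBF activations, then a delta-rule update.
    act = np.exp(-dists ** 2 / (2.0 * sigma ** 2))
    y = W_out @ act
    W_out += lr_out * np.outer(target - y, act)
    return W_in, W_out
```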

Life-Long Learning Cell Structures (several figure-only slides)


Summary

Generalization.
Adapting supervised-learning function approximation methods.
Gradient-descent methods.
Linear gradient-descent methods: radial basis functions, tile coding, Kanerva coding.
Nonlinear gradient-descent methods? Backpropagation?
Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction.

References

Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control.