CS599 Lecture 2 Function Approximation in RL

Sharlene Moody
5 years ago

1 CS599 Lecture 2: Function Approximation in RL
- Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part.
- Overview of function approximation (FA) methods and how they can be adapted to RL.
- Handout: Class Notes
- Reading assignment for next class: see web page

2 Why Function Approximation in RL?
There are four functions to be represented in RL:
- The value function $V$, or the action-value function $Q$
- The policy $\pi$
- The reward function $r$
- The state transition model $P$
For model-free RL, we can avoid $P$. Often, the reward function $r$ is assumed to be given or provided through the environment.

3 Problems of Function Approximation in RL
- Nonstationary problems
  - Nobody provides accurate targets, only incremental improvements over previous knowledge.
  - Many function approximators are not built for this.
- Incremental learning
- How to evaluate the max operator?
  - For extraction of the policy, e.g., in Q-learning, we need to find $\max_u Q(x,u)$; this is a nonlinear root-finding problem.
  - The greedy max operator can trigger a very different action in case of the slightest variation of the Q-function, which can lead to unstable RL.
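The sensitivity of the greedy max operator can be illustrated with a minimal sketch. The action values below are made-up numbers for a single state $x$ with three actions $u$; the helper name `greedy_action` is hypothetical and not part of the lecture.

```python
def greedy_action(q_values):
    """Return the index u of the largest Q(x, u); ties go to the lowest index."""
    return max(range(len(q_values)), key=lambda u: q_values[u])

# Hypothetical action values for one state x and actions u = 0, 1, 2.
q_before = [1.000, 0.999, 0.200]
# The same estimates after a tiny approximation error on action 1.
q_after = [1.000, 1.001, 0.200]

action_before = greedy_action(q_before)
action_after = greedy_action(q_after)
print(action_before, action_after)
```

A change of 0.002 in one estimate is enough to switch the greedy action, which is the kind of discontinuity that can destabilize learning when the Q-function is only approximate.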

4 Example: Incremental Learning

5 What Function Approximation Methods Exist?
- Artificial neural networks
- Decision trees
- Multivariate regression methods
- Gaussian processes
- Etc.
Currently, the trend is clearly towards learning systems that:
- Are linear in the parameters
- Have automatic complexity regularization (nonparametric methods)
- Are kernel-based

6 Gradient Descent Methods
$\theta_t = (\theta_t(1), \theta_t(2), \ldots, \theta_t(n))^T$, where $T$ denotes transpose.
Assume $V_t$ is a (sufficiently smooth) differentiable function of $\theta_t$, for all $s \in S$.
Assume, for now, training examples of this form: $\{\text{description of } s_t,\ V^\pi(s_t)\}$
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

7 Performance Measures
Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution $P$:
$\mathrm{MSE}(\theta_t) = \sum_{s \in S} P(s)\,[V^\pi(s) - V_t(s)]^2$
- Why $P$? Why minimize MSE?
- Let us assume that $P$ is always the distribution of states at which backups are done.
- The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
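As a small worked instance of the MSE definition, the sketch below evaluates the weighted sum for a three-state problem. The states, the distribution `p`, and both value tables are invented for illustration only.

```python
def weighted_mse(p, v_true, v_approx):
    """MSE = sum over states s of P(s) * (V_pi(s) - V_t(s))**2."""
    return sum(p[s] * (v_true[s] - v_approx[s]) ** 2 for s in p)

# Hypothetical three-state example; all numbers are made up.
p = {"A": 0.5, "B": 0.25, "C": 0.25}        # state distribution P
v_true = {"A": 1.0, "B": 2.0, "C": 4.0}     # V_pi(s)
v_approx = {"A": 1.0, "B": 1.0, "C": 2.0}   # V_t(s)

mse = weighted_mse(p, v_true, v_approx)
print(mse)
```

Because each squared error is weighted by $P(s)$, errors in rarely visited states contribute less to the total than errors in frequently visited ones.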

8 Gradient Descent
Let $f$ be any function of the parameter space. Its gradient at any point $\theta_t$ in this space is:
$\nabla_\theta f(\theta_t) = \left(\frac{\partial f(\theta_t)}{\partial \theta(1)}, \frac{\partial f(\theta_t)}{\partial \theta(2)}, \ldots, \frac{\partial f(\theta_t)}{\partial \theta(n)}\right)^T$
Iteratively move down the gradient:
$\theta_{t+1} = \theta_t - \alpha\,\nabla_\theta f(\theta_t)$
[Figure: a point $\theta_t = (\theta_t(1), \theta_t(2))^T$ in the plane with axes $\theta(1)$ and $\theta(2)$]
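The update rule can be sketched on a two-parameter example. The objective $f(\theta) = (\theta(1) - 3)^2 + 2(\theta(2) + 1)^2$, the step size, and the number of steps are all assumptions chosen so the minimum at $(3, -1)$ is easy to verify.

```python
def grad_f(theta):
    """Gradient of the hypothetical objective
    f(theta) = (theta[0] - 3)^2 + 2 * (theta[1] + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]

def gradient_descent(theta, alpha, steps):
    """Repeat theta <- theta - alpha * grad f(theta) for a fixed number of steps."""
    for _ in range(steps):
        g = grad_f(theta)
        theta = [th - alpha * gi for th, gi in zip(theta, g)]
    return theta

theta_final = gradient_descent([0.0, 0.0], alpha=0.1, steps=200)
print(theta_final)
```

With this step size each coordinate's error shrinks by a constant factor per step, so the iterate approaches the minimizer geometrically.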

9 On-Line Gradient-Descent TD(λ)

10 Linear Methods
Represent states as feature vectors: for each $s \in S$:
$\phi_s = (\phi_s(1), \phi_s(2), \ldots, \phi_s(n))^T$
$V_t(s) = \theta_t^T \phi_s = \sum_{i=1}^{n} \theta_t(i)\,\phi_s(i)$
$\nabla_\theta V_t(s) = \,?$
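A minimal sketch of the linear form, with made-up parameter and feature values. The finite-difference helper `numerical_gradient` is a hypothetical check added here to explore the gradient question numerically; it is not part of the lecture.

```python
def linear_value(theta, phi_s):
    """V_t(s) = theta^T phi_s = sum_i theta(i) * phi_s(i)."""
    return sum(t * f for t, f in zip(theta, phi_s))

def numerical_gradient(theta, phi_s, eps=1e-6):
    """Forward-difference estimate of dV/dtheta(i) for each component i."""
    base = linear_value(theta, phi_s)
    grad = []
    for i in range(len(theta)):
        bumped = list(theta)
        bumped[i] += eps
        grad.append((linear_value(bumped, phi_s) - base) / eps)
    return grad

theta = [0.5, -1.0, 2.0]   # hypothetical parameter vector
phi_s = [1.0, 0.0, 1.0]    # hypothetical feature vector of one state
value = linear_value(theta, phi_s)
grad = numerical_gradient(theta, phi_s)
print(value, grad)
```

In this example the numerical gradient matches the feature vector itself up to rounding, and it does not depend on the current value of `theta`.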

11 Nice Properties of Linear FA Methods
- The gradient is very simple: $\nabla_\theta V_t(s) = \phi_s$
- For MSE, the error surface is simple: a quadratic surface with a single minimum.
- Linear gradient-descent TD(λ) converges, provided that:
  - The step size decreases appropriately
  - On-line sampling is used (states sampled from the on-policy distribution)
- It converges to a parameter vector $\theta_\infty$ with the property
  $\mathrm{MSE}(\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\,\mathrm{MSE}(\theta^*)$ (Tsitsiklis & Van Roy, 1997),
  where $\theta^*$ is the best parameter vector.
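Linear gradient-descent TD(λ) with accumulating traces can be sketched as follows. This is an illustration under stated assumptions, not the algorithm box from the textbook: the two-state deterministic chain, the one-hot features, and the constant step size are all invented here (the convergence result above assumes a decreasing step size; a constant one suffices for this deterministic toy problem).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def td_lambda_episode(theta, episode, features, alpha, gamma, lam):
    """Run one episode of on-line linear TD(lambda) with accumulating traces.

    episode: list of (state, reward, next_state); next_state is None at termination.
    """
    e = [0.0] * len(theta)                      # eligibility trace, reset per episode
    for s, r, s_next in episode:
        phi = features[s]
        v = dot(theta, phi)
        v_next = 0.0 if s_next is None else dot(theta, features[s_next])
        delta = r + gamma * v_next - v          # TD error
        e = [gamma * lam * ei + fi for ei, fi in zip(e, phi)]   # gradient of V is phi
        theta = [ti + alpha * delta * ei for ti, ei in zip(theta, e)]
    return theta

# Hypothetical chain: s0 -> s1 -> terminal, with reward 1 on the final transition.
features = {"s0": [1.0, 0.0], "s1": [0.0, 1.0]}
episode = [("s0", 0.0, "s1"), ("s1", 1.0, None)]

theta = [0.0, 0.0]
for _ in range(500):
    theta = td_lambda_episode(theta, episode, features, alpha=0.1, gamma=0.9, lam=0.8)
print(theta)
```

With one-hot features the best parameter vector represents the true values exactly, so the learned weights approach $V(s_0) = 0.9$ and $V(s_1) = 1$.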

12 Coarse Coding

13 Shaping Generalization in Coarse Coding

14 Learning and Coarse Coding

15 Tile Coding
- Binary feature for each tile
- Number of features present at any one time is constant
- Binary features mean the weighted sum is easy to compute
- Easy to compute the indices of the features present
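The properties listed for tile coding can be seen in a one-dimensional sketch. The uniform offsets, the extra edge tile per tiling, and the function name `active_tiles` are assumptions made for this illustration; real tile coders differ in how they offset and index tilings.

```python
import math

def active_tiles(x, low, high, tiles_per_tiling, num_tilings):
    """Return one active feature index per tiling for a scalar x in [low, high]."""
    width = (high - low) / tiles_per_tiling
    indices = []
    for k in range(num_tilings):
        offset = k * width / num_tilings            # shift each tiling by a fraction of a tile
        tile = int(math.floor((x - low + offset) / width))
        tile = min(max(tile, 0), tiles_per_tiling)  # one extra tile absorbs the shift at the top edge
        indices.append(k * (tiles_per_tiling + 1) + tile)
    return indices

def tile_value(theta, indices):
    """With binary features the weighted sum is just a sum of the active weights."""
    return sum(theta[i] for i in indices)

idx_a = active_tiles(0.3, 0.0, 1.0, tiles_per_tiling=4, num_tilings=2)
idx_b = active_tiles(0.4, 0.0, 1.0, tiles_per_tiling=4, num_tilings=2)

theta = [0.0] * (2 * (4 + 1))   # one weight per tile across both tilings
theta[idx_a[0]] = 1.5           # hypothetical learned weight
print(idx_a, idx_b, tile_value(theta, idx_a), tile_value(theta, idx_b))
```

Exactly one tile per tiling is active for any input, and nearby inputs that share a tile also share its weight, which is how generalization arises.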

16 Tile Coding Cont.
- Irregular tilings
- Hashing
- CMAC: Cerebellar Model Arithmetic Computer (Albus, 1971)

17 Radial Basis Functions (RBFs)
e.g., Gaussians:
$\phi_s(i) = \exp\left(-\frac{\|s - c_i\|^2}{2\sigma_i^2}\right)$
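The Gaussian feature can be computed directly from its definition. In the sketch below the centers $c_i$, the widths $\sigma_i$, and the query state are made-up values.

```python
import math

def rbf_features(s, centers, sigmas):
    """phi_s(i) = exp(-||s - c_i||^2 / (2 * sigma_i^2)) for each center c_i."""
    feats = []
    for c, sigma in zip(centers, sigmas):
        sq_dist = sum((si - ci) ** 2 for si, ci in zip(s, c))
        feats.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    return feats

# Hypothetical 2-D state with two centers of unit width.
centers = [[0.0, 0.0], [1.0, 0.0]]
sigmas = [1.0, 1.0]
phi = rbf_features([0.0, 0.0], centers, sigmas)
print(phi)
```

Unlike the binary features of tile coding, each feature here varies smoothly between 0 and 1 and equals 1 only when the state sits exactly on the center.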

18 Can you beat the curse of dimensionality?
- Can you keep the number of features from going up exponentially with the dimension?
- Function complexity, not dimensionality, is the problem.
- Lazy learning schemes:
  - Remember all the data
  - To get a new value, find nearest neighbors and interpolate
  - e.g., locally weighted regression
- Nonparametric regression
  - e.g., Gaussian process regression
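The lazy-learning recipe (store the data, then interpolate among nearest neighbors at query time) can be sketched as below. Note that this is a kernel-weighted nearest-neighbor average, a simpler stand-in for locally weighted regression and not that method itself; the data set, `k`, and `bandwidth` are illustrative assumptions.

```python
import math

def knn_predict(query, data, k=2, bandwidth=1.0):
    """Predict y at a scalar query as a Gaussian-weighted average of the k nearest stored points.

    data: list of (x, y) pairs; nothing is fitted in advance.
    """
    neighbors = sorted(data, key=lambda xy: abs(xy[0] - query))[:k]
    weights = [math.exp(-((x - query) ** 2) / (2.0 * bandwidth ** 2)) for x, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# Hypothetical stored samples of some unknown function.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0)]
prediction = knn_predict(1.5, data, k=2)
print(prediction)
```

All computation happens at query time, so the cost of a prediction grows with the amount of stored data rather than with a fixed number of features.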

19 Control with FA
- Learning state-action values
- Training examples of the form: $\{\text{description of } (s_t, a_t),\ v_t\}$
- The general gradient-descent rule:
  $\theta_{t+1} = \theta_t + \alpha\,[v_t - Q_t(s_t, a_t)]\,\nabla_\theta Q_t(s_t, a_t)$
- Gradient-descent Sarsa(λ) (backward view):
  $\theta_{t+1} = \theta_t + \alpha\,\delta_t\,e_t$
  where
  $\delta_t = r_{t+1} + \gamma\,Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$
  $e_t = \gamma\lambda\,e_{t-1} + \nabla_\theta Q_t(s_t, a_t)$
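One step of the backward-view Sarsa(λ) update can be sketched for the linear case, where $Q_t(s,a) = \theta_t^T \phi_{s,a}$ and so $\nabla_\theta Q_t(s,a) = \phi_{s,a}$. The linear form, the state-action feature vectors, and all numeric values are assumptions for illustration; action selection is left out.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sarsa_lambda_step(theta, e, phi_sa, reward, phi_next, alpha, gamma, lam):
    """Apply one Sarsa(lambda) update; phi_next is None if the next state is terminal."""
    q = dot(theta, phi_sa)
    q_next = 0.0 if phi_next is None else dot(theta, phi_next)
    delta = reward + gamma * q_next - q                         # TD error delta_t
    e = [gamma * lam * ei + fi for ei, fi in zip(e, phi_sa)]    # e_t = gamma*lambda*e_{t-1} + grad Q
    theta = [ti + alpha * delta * ei for ti, ei in zip(theta, e)]
    return theta, e

# Hypothetical two-step episode with one-hot state-action features.
theta, e = [0.0, 0.0], [0.0, 0.0]
theta, e = sarsa_lambda_step(theta, e, [1.0, 0.0], 1.0, [0.0, 1.0],
                             alpha=0.5, gamma=0.9, lam=0.8)
theta, e = sarsa_lambda_step(theta, e, [0.0, 1.0], 2.0, None,
                             alpha=0.5, gamma=0.9, lam=0.8)
print(theta, e)
```

In the second step the trace still holds $\gamma\lambda = 0.72$ for the first state-action pair, so the later TD error also adjusts the earlier pair's weight.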

20 GPI Linear Gradient Descent Watkins Q(λ)

21 GPI with Linear Gradient Descent Sarsa(λ)

22 Compare Sarsa(λ) and Watkins Q(λ)

23 Mountain-Car Task

24 Mountain Car with Radial Basis Functions

25 Mountain-Car Results

26 Baird's Counterexample

27 Baird's Counterexample Cont.
