CSE 190: Reinforcement Learning: An Introduction
Chapter 8: Generalization and Function Approximation

Objectives of this chapter:
  Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part: Generalization
  How to accomplish generalization by using parameterized functions that approximate the Value or Q functions: Function Approximation
Acknowledgment: A good number of these slides are cribbed from Rich Sutton

Function Approximation Methods
Many possible methods:
  artificial neural networks
  decision trees
  multivariate regression methods
  kernel density methods
But RL has some special requirements:
  Usually want to learn while interacting
  Want to be able to handle nonstationarity
In this chapter, we focus on linear function approximators:
  Simple models
  Simple learning rules

Pop Quiz: What Function Are We Approximating?
The Value function and/or the Q-value function.
The methods we have used so far are called tabular methods: the V and Q functions were essentially tables, with V(s) represented in an array.
But these are still functions:
  V(s): a mapping from states to values
  Q(s,a): a mapping from states and actions to values
Problem: an entry in an array doesn't tell you anything about the entry in the next cell: no generalization.
Goal: learning something about a state should generalize to nearby states.
Linear Function Approximators: Gradient Descent Methods
Assume V is a sufficiently smooth (differentiable) function of the parameters.
Represent states as vectors of features (note: I am using different notation than the book):
  F(s) = (f_1(s), f_2(s), ..., f_n(s))^T
Then represent the Value as a linear combination of the features:
  V(s) = θ^T F(s) = Σ_{i=1}^{n} θ_i f_i(s)
i.e., a simple weighted sum of the features, for all s ∈ S. Often, one of the features (e.g., f_0(s)) is a constant.

Assume, for now, training examples of this form:
  {(s_1, V^π(s_1)), (s_2, V^π(s_2)), ..., (s_T, V^π(s_T))}
  where s_i = (f_1(s_i), f_2(s_i), ..., f_n(s_i))
Our goal: V(s_i) → V^π(s_i), by modifying θ = (θ_1, θ_2, ..., θ_n)^T.

This requires a performance measure. A common and simple one is the sum-squared error (SSE) over a probability distribution P:
  SSE(θ) = Σ_{s∈S} P(s) [V^π(s) − V(s)]²
Why P? Why minimize SSE?
Let us assume that P is always the distribution of states at which backups are done.
The on-policy distribution: the distribution created while following the policy being evaluated.
Stronger results are available for this distribution.

Gradient Descent
Main idea: change θ in order to go downhill in the Sum Squared Error.
Which way is downhill? The negative of the slope of the SSE with respect to θ.
When the slope is a vector, it's called the gradient.
Current parameters: θ = (θ_1, θ_2, ..., θ_n)^T
Gradient (downhill) in the parameters: −∇_θ SSE(θ)
Iteratively move down the gradient:
  θ_{t+1} = θ_t − α ∇_θ SSE(θ_t)
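A minimal sketch of these definitions in Python. The particular feature map (a constant plus two polynomial features of a scalar state), the parameter values, and the two-state distribution are invented for illustration; only the formulas V(s) = θ^T F(s) and the SSE come from the slides.

```python
import numpy as np

def features(s):
    """F(s): a constant feature f_0 plus two hand-picked features of a scalar state."""
    return np.array([1.0, s, s ** 2])

def value(theta, s):
    """V(s) = theta^T F(s): a simple weighted sum of the features."""
    return theta @ features(s)

def sse(theta, states, v_pi, p):
    """SSE(theta) = sum over s of P(s) * [V_pi(s) - V(s)]^2."""
    return sum(p_s * (v - value(theta, s)) ** 2
               for s, v, p_s in zip(states, v_pi, p))

theta = np.array([0.5, 1.0, -0.2])
print(value(theta, 2.0))   # 0.5 + 1.0*2 - 0.2*4, i.e. about 1.7
print(sse(theta, [1.0, 2.0], [1.5, 1.5], [0.5, 0.5]))
```

Gradient descent then amounts to nudging `theta` in the direction that decreases `sse`, which the next slides derive in closed form.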
Gradient Descent
What is −∇_θ SSE?
  −∇_θ SSE = −( ∂SSE/∂θ_1, ∂SSE/∂θ_2, ..., ∂SSE/∂θ_n )^T
Looking at one component of this vector (for a single state s):
  −∂SSE/∂θ_i = −∂/∂θ_i (1/2)(V^π(s) − V(s))²
            = −(V^π(s) − V(s)) · (−∂V(s)/∂θ_i)
            = (V^π(s) − V(s)) ∂V(s)/∂θ_i

Gradient Descent
So this is what −∇_θ SSE is:
  −∇_θ SSE = (V^π(s) − V(s)) ( ∂V(s)/∂θ_1, ∂V(s)/∂θ_2, ..., ∂V(s)/∂θ_n )^T
           = (V^π(s) − V(s)) ∇_θ V(s)
Now, what ∇_θ V(s) is depends on your function approximator.
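The identity above can be checked numerically. This sketch uses an arbitrary made-up differentiable approximator V(s) = θ_1·s + θ_2·s², plus invented values for the state, target, and parameters, and compares a finite-difference estimate of −∂/∂θ_i [½(V^π(s) − V(s))²] against (V^π(s) − V(s)) ∂V(s)/∂θ_i.

```python
import numpy as np

def V(theta, s):
    """An arbitrary differentiable approximator: V(s) = theta_1*s + theta_2*s^2."""
    return theta[0] * s + theta[1] * s ** 2

def half_sq_err(theta, s, v_pi):
    """One-state objective: (1/2) (V_pi(s) - V(s))^2."""
    return 0.5 * (v_pi - V(theta, s)) ** 2

theta, s, v_pi, eps = np.array([0.3, -0.1]), 1.5, 2.0, 1e-6

for i in range(2):
    bump = np.zeros(2)
    bump[i] = eps
    # finite-difference estimate of -dSSE/dtheta_i
    fd = -(half_sq_err(theta + bump, s, v_pi)
           - half_sq_err(theta - bump, s, v_pi)) / (2 * eps)
    # analytic form from the slide: (V_pi - V(s)) * dV(s)/dtheta_i
    dV_i = s if i == 0 else s ** 2
    analytic = (v_pi - V(theta, s)) * dV_i
    assert abs(fd - analytic) < 1e-6
print("gradient identity checked")
```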
Linear Function Approximators
Representing the Value function as a linear combination of features makes the derivative very simple:
  V(s) = θ^T F(s) = Σ_{j=1}^{n} θ_j f_j(s)
  ∂V(s)/∂θ_i = ∂/∂θ_i Σ_{j=1}^{n} θ_j f_j(s) = f_i(s)
So the learning rule becomes:
  θ_i ← θ_i + α (V^π(s) − V(s)) f_i(s)
Why does this make sense?

Targets
Where does the teacher V^π(s) come from? We don't actually have access to V^π(s).
However, if we have something whose expected value is V^π(s), we're guaranteed to converge to the optimal solution. This is true of:
  the Monte Carlo return R_t, and TD(λ)'s return R_t^λ (but only if λ = 1).
It is not true of R_t^λ for λ < 1, or of Dynamic Programming targets, but these methods work well anyway.
Using the TD error:
  δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
we update the eligibility trace using the gradient:
  e_t = γλ e_{t−1} + ∇_θ V(s_t)

On-Line Gradient-Descent TD(λ)

Nice Properties of Linear FA Methods
The gradient is very simple: ∇_θ V_t(s) = F(s)
For MSE, the error surface is simple: a quadratic surface with a single minimum.
Linear gradient-descent TD(λ) converges if:
  the step size decreases appropriately, and
  sampling is on-line (states sampled from the on-policy distribution).
It converges to a parameter vector θ_∞ with the property:
  MSE(θ_∞) ≤ ((1 − γλ)/(1 − γ)) MSE(θ*)   (Tsitsiklis & Van Roy, 1997)
where θ* is the best parameter vector.
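The pieces above (linear V, TD error δ_t, trace e_t, update θ ← θ + α δ_t e_t) can be sketched as on-line TD(λ) prediction on a toy problem. The 3-state chain environment, its one-hot features, and all constants below are invented for illustration; one-hot features just reduce linear FA to the tabular case.

```python
import numpy as np

n_features = 3

def F(s):
    """One-hot features over the 3 states (reduces linear FA to tabular)."""
    f = np.zeros(n_features)
    f[s] = 1.0
    return f

gamma, lam, alpha = 0.9, 0.8, 0.1
theta = np.zeros(n_features)
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    e = np.zeros(n_features)                 # eligibility trace e_t
    while s != 2:                            # state 2 is terminal
        # invented dynamics: advance with prob 0.5, else stay put;
        # reward 1 on entering the terminal state
        s_next = s + 1 if rng.random() < 0.5 else s
        r = 1.0 if s_next == 2 else 0.0
        v = theta @ F(s)
        v_next = 0.0 if s_next == 2 else theta @ F(s_next)
        delta = r + gamma * v_next - v       # TD error delta_t
        e = gamma * lam * e + F(s)           # e_t = gamma*lambda*e_{t-1} + grad V(s_t)
        theta = theta + alpha * delta * e
        s = s_next

print(theta)  # estimates for states 0 and 1; the terminal weight stays 0
```

State 1 is closer to the reward than state 0, so its learned value should come out larger.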
Coarse Coding

Shaping Generalization in Coarse Coding

Learning and Coarse Coding

Tile Coding
  Binary feature for each tile
  Number of features present at any one time is constant
  Binary features mean the weighted sum is easy to compute
  Easy to compute the indices of the features present
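A minimal sketch of tile coding for a 1-D state. The tile width, number of tilings, and offsets are invented for illustration (real implementations typically add hashing); it shows the two properties above: a constant number of active binary features, and a value computed by indexing and adding.

```python
def active_tiles(s, n_tilings=4, tiles_per_tiling=10, tile_width=0.1):
    """Return the index of the one active tile in each tiling."""
    indices = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings          # shift each tiling slightly
        tile = int((s + offset) / tile_width)
        tile = min(max(tile, 0), tiles_per_tiling - 1)
        indices.append(t * tiles_per_tiling + tile)  # flat feature index
    return indices

def value(theta, s):
    """Weighted sum over active binary features: just index and add."""
    return sum(theta[i] for i in active_tiles(s))

theta = [0.0] * 40
for i in active_tiles(0.35):
    theta[i] = 1.0
print(value(theta, 0.35))  # 4.0: exactly one active feature per tiling
```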
Tile Coding, Cont.
  Irregular tilings
  Hashing
  CMAC: Cerebellar Model Arithmetic Computer (Albus, 1971)

Radial Basis Functions (RBFs)
e.g., Gaussians:
  f_i(s) = exp( −‖s − c_i‖² / (2 σ_i²) )

Can you beat the curse of dimensionality?
Can you keep the number of features from going up exponentially with the dimension?
Function complexity, not dimensionality, is the problem.
Kanerva coding:
  Select a bunch of binary prototypes
  Use Hamming distance as the distance measure
  Dimensionality is no longer a problem, only complexity
Lazy learning schemes:
  Remember all the data
  To get a new value, find the nearest neighbors and interpolate
  e.g., locally-weighted regression

Control with FA
Learning state-action values. The general gradient-descent rule:
  θ_{t+1} = θ_t + α [v_t − Q(s_t, a_t)] ∇_θ Q(s_t, a_t)
Gradient-descent Sarsa(λ) (backward view):
  θ_{t+1} = θ_t + α δ_t e_t
  where δ_t = r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)
  and   e_t = γλ e_{t−1} + ∇_θ Q(s_t, a_t)
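The backward-view Sarsa(λ) equations above can be sketched on a toy control problem. The tiny corridor task (3 states, actions left/right, reward at the right end), the one-hot state-action features, and all constants are invented for illustration.

```python
import numpy as np

n_states, n_actions = 3, 2
gamma, lam, alpha, epsilon = 0.95, 0.9, 0.1, 0.1

def F(s, a):
    """One-hot features over (state, action) pairs."""
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

def q(theta, s, a):
    return theta @ F(s, a)

def policy(theta, s, rng):
    """Epsilon-greedy with respect to the current Q estimate."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax([q(theta, s, a) for a in range(n_actions)]))

rng = np.random.default_rng(1)
theta = np.zeros(n_states * n_actions)

for episode in range(300):
    s = 0
    a = policy(theta, s, rng)
    e = np.zeros_like(theta)                      # eligibility trace
    while True:
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        done = s_next == n_states - 1             # rightmost state is terminal
        r = 1.0 if done else 0.0
        a_next = policy(theta, s_next, rng)
        q_next = 0.0 if done else q(theta, s_next, a_next)
        delta = r + gamma * q_next - q(theta, s, a)   # TD error delta_t
        e = gamma * lam * e + F(s, a)                 # trace: gamma*lambda*e + grad Q
        theta = theta + alpha * delta * e
        if done:
            break
        s, a = s_next, a_next

# After learning, 'right' (a=1) should look better than 'left' in state 0.
print(q(theta, 0, 1) > q(theta, 0, 0))
```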
GPI with Linear Gradient-Descent Watkins' Q(λ)

Linear Gradient-Descent Sarsa(λ) (with GPI)

Mountain-Car Task
Mountain Car with Radial Basis Functions

Mountain-Car Results

Baird's Counterexample

Baird's Counterexample, Cont.
Should We Bootstrap?

Summary
  Generalization
  Adapting supervised-learning function approximation methods
  Gradient-descent methods
  Linear gradient-descent methods
    Radial basis functions
    Tile coding
    Kanerva coding
  Nonlinear gradient-descent methods? Backpropagation?
  Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction

Value Prediction with FA
As usual: Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.
In earlier chapters, value functions were stored in lookup tables. Here, the value function estimate at time t, V_t, depends on a parameter vector θ_t, and only the parameter vector is updated.
e.g., θ_t could be the vector of connection weights of a neural network.

Adapt Supervised Learning Algorithms
  Training info = desired (target) outputs
  Inputs → Supervised Learning System → Outputs
  Training example = {input, target output}
  Error = (target output − actual output)
Backups as Training Examples
e.g., the TD(0) backup:
  V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
As a training example:
  { description of s_t,  r_{t+1} + γ V(s_{t+1}) }
      (input)            (target output)

Gradient Descent, Cont.
For the MSE given above and using the chain rule:
  θ_{t+1} = θ_t − (1/2) α ∇_θ MSE(θ_t)
          = θ_t − (1/2) α ∇_θ Σ_{s∈S} P(s) [V^π(s) − V_t(s)]²
          = θ_t + α Σ_{s∈S} P(s) [V^π(s) − V_t(s)] ∇_θ V_t(s)

Gradient Descent, Cont.
Use just the sample gradient instead:
  θ_{t+1} = θ_t − (1/2) α ∇_θ [V^π(s_t) − V_t(s_t)]²
          = θ_t + α [V^π(s_t) − V_t(s_t)] ∇_θ V_t(s_t)
Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if α decreases appropriately with t:
  E{ [V^π(s_t) − V_t(s_t)] ∇_θ V_t(s_t) } = Σ_{s∈S} P(s) [V^π(s) − V_t(s)] ∇_θ V_t(s)

But We Don't Have These Targets
Suppose we just have targets v_t instead:
  θ_{t+1} = θ_t + α [ v_t − V_t(s_t) ] ∇_θ V_t(s_t)
If each v_t is an unbiased estimate of V^π(s_t), i.e., E{v_t} = V^π(s_t), then gradient descent converges to a local minimum (provided α decreases appropriately).
e.g., the Monte Carlo target v_t = R_t:
  θ_{t+1} = θ_t + α [ R_t − V_t(s_t) ] ∇_θ V_t(s_t)
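The Monte Carlo rule above can be sketched on a tiny episodic task invented for illustration: from state 0 the episode either moves to state 1 (which then terminates with reward 1) or terminates directly with reward 0, with equal probability, so V^π(1) = 1 and V^π(0) = 0.5. With undiscounted returns, each visited state's target is the return R_t from that episode.

```python
import numpy as np

def F(s):
    """One-hot features over the two nonterminal states."""
    f = np.zeros(2)
    f[s] = 1.0
    return f

alpha = 0.01
theta = np.zeros(2)
rng = np.random.default_rng(0)

for episode in range(5000):
    # generate one (undiscounted) episode starting in state 0
    if rng.random() < 0.5:
        visited, G = [0, 1], 1.0   # 0 -> 1 -> terminal, reward 1 at the end
    else:
        visited, G = [0], 0.0      # 0 -> terminal directly, reward 0
    # Monte Carlo update: the target for every visited state is the return
    for s in visited:
        v = theta @ F(s)
        theta = theta + alpha * (G - v) * F(s)   # theta += alpha [R_t - V] grad V

print(theta)  # should approach [0.5, 1.0]
```

Because R_t is an unbiased sample of V^π(s_t), the estimates hover around the true values; a fixed α leaves residual noise that a decreasing step size would remove.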
What about TD(λ) Targets?
  θ_{t+1} = θ_t + α [ R_t^λ − V_t(s_t) ] ∇_θ V_t(s_t)
This is not an unbiased target for λ < 1.
But we do it anyway, using the backwards view:
  θ_{t+1} = θ_t + α δ_t e_t
where:
  δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t), as usual, and
  e_t = γλ e_{t−1} + ∇_θ V_t(s_t)

END