Lecture 7: Value Function Approximation


1 Lecture 7: Value Function Approximation Joseph Modayil

2 Outline 1 Introduction 2 Incremental Methods 3 Batch Methods

3 Introduction Large-Scale Reinforcement Learning Reinforcement learning can be used to solve large problems, e.g. Backgammon: 10^20 states. Computer Go: 10^170 states. Helicopter/Mountain Car: continuous state space. Robots: informal state space (the physical universe). How can we scale up the model-free methods for prediction and control from the last two lectures?

4 Introduction Value Function Approximation So far we have represented the value function by a lookup table: every state s has an entry V(s), or every state-action pair (s, a) has an entry Q(s, a). Problem with large MDPs: there are too many states and/or actions to store in memory, and it is too slow to learn the value of each state individually. Solution for large MDPs: estimate the value function with function approximation, V_θ(s) ≈ v_π(s) or Q_θ(s, a) ≈ q_π(s, a). Generalise from seen states to unseen states. Update the parameter θ using MC or TD learning.

5 Introduction Which Function Approximator? There are many function approximators, e.g. artificial neural network, decision tree, nearest neighbour, Fourier / wavelet bases, coarse coding. In principle, any function approximator can be used. However, the choice may be affected by some properties of RL: Experience is not i.i.d.; successive time-steps are correlated. During control, the value function v_π(s) is non-stationary. The agent's actions affect the subsequent data it receives. Feedback is delayed, not instantaneous.

6 Introduction Classes of Function Approximation Tabular (no FA): a table with an entry for each MDP state. State aggregation: partition the environment states. Linear function approximation: fixed features (or a fixed kernel). Differentiable (nonlinear) function approximation: neural nets. So what should you choose? It depends on your goals. The top of this list has good theory but weak performance; the bottom has excellent performance but weak theory. Linear function approximation is a useful middle ground. Neural nets now commonly give the highest performance.

7 Gradient Descent Gradient Descent
Let J(θ) be a differentiable function of the parameter vector θ.
Define the gradient of J(θ) to be the vector of partial derivatives
∇_θ J(θ) = ( ∂J(θ)/∂θ_1 , ..., ∂J(θ)/∂θ_n )ᵀ
To find a local minimum of J(θ), adjust the parameter θ in the direction of the negative gradient,
Δθ = −(1/2) α ∇_θ J(θ)
where α is a step-size parameter.
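
A minimal sketch of this update in Python/NumPy (grad_J is a user-supplied gradient function; all names here are illustrative assumptions, not part of the lecture):

import numpy as np

def gradient_descent_step(theta, grad_J, alpha=0.1):
    # Move theta a small step in the direction of the negative gradient,
    # matching the slide's update Delta(theta) = -(1/2) * alpha * grad J(theta).
    return theta - 0.5 * alpha * grad_J(theta)

# Example on J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = gradient_descent_step(theta, lambda th: 2.0 * th)
# theta is now close to the minimiser [0, 0].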

8 Gradient Descent Value Function Approx. by Stochastic Gradient Descent
Goal: find the parameter vector θ minimising the mean-squared error between the approximate value fn V_θ(s) and the true value fn v_π(s),
J(θ) = E_π[ (v_π(S) − V_θ(S))² ]
Note: the notation E_π[·] means that the random variable S is drawn from a distribution induced by π, i.e. E_π[f(S)] = Σ_s f(s) d_π(s).
Gradient descent finds a local minimum:
Δθ = −(1/2) α ∇_θ J(θ) = α E_π[ (v_π(S) − V_θ(S)) ∇_θ V_θ(S) ]
Stochastic gradient descent samples the gradient:
Δθ = α (v_π(s) − V_θ(s)) ∇_θ V_θ(s)
The expected update is equal to the full gradient update.

9 Linear Function Approximation Feature Vectors
Represent a state by a feature vector
φ(s) = ( φ_1(s), ..., φ_n(s) )ᵀ
For example: distance of a robot from landmarks, trends in the stock market, piece and pawn configurations in chess.

10 Linear Function Approximation Linear Value Function Approximation
Approximate the value function by a linear combination of features:
V_θ(s) = φ(s)ᵀ θ = Σ_{j=1}^n φ_j(s) θ_j
The objective function is quadratic in the parameters θ:
J(θ) = E_π[ (v_π(S) − φ(S)ᵀ θ)² ]
Stochastic gradient descent converges on the global optimum. The update rule is particularly simple:
∇_θ V_θ(s) = φ(s)
Δθ = α (v_π(s) − V_θ(s)) φ(s)
Update = step-size × prediction error × feature vector
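
A small NumPy sketch of this linear update; the feature vector phi_s and the target value v_target are assumed to be supplied by the caller (both names are illustrative, not from the lecture):

import numpy as np

def linear_vfa_update(theta, phi_s, v_target, alpha=0.01):
    """One stochastic gradient step for V_theta(s) = phi(s) . theta."""
    prediction = phi_s @ theta
    # Update = step-size * prediction error * feature vector
    return theta + alpha * (v_target - prediction) * phi_s

In practice v_target is replaced by an MC or TD target, as described in the incremental prediction slides below.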

11 Linear Function Approximation Table Lookup Features
Table lookup can be implemented as a special case of linear value function approximation. Let the n states be given by S = {s^(1), ..., s^(n)}, and use table-lookup (indicator) features
φ_table(s) = ( 1(s = s^(1)), ..., 1(s = s^(n)) )ᵀ
The parameter vector θ then gives the value of each individual state:
V(s) = φ_table(s)ᵀ θ = Σ_{i=1}^n 1(s = s^(i)) θ_i
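
A tiny illustration of table-lookup features as one-hot vectors (phi_table is a hypothetical helper, assuming states are indexed 0..n-1):

import numpy as np

def phi_table(s, n_states):
    """Indicator features: phi_i(s) = 1 if s is the i-th state, else 0."""
    phi = np.zeros(n_states)
    phi[s] = 1.0
    return phi

# With these features, phi_table(s, n) @ theta simply reads out theta[s],
# so linear function approximation reduces to an ordinary lookup table.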

12 Coarse Coding Example Coarse Coding Example of linear value function approximation: coarse coding provides a large feature vector φ(s), and the parameter vector θ gives a value to each feature. (Figure: the original representation is mapped to an expanded representation with many features, which is then used to form the approximation.)

13 Coarse Coding Example Generalization in Coarse Coding

14 Coarse Coding Example Stochastic Gradient Descent with Coarse Coding

15 Incremental Prediction Algorithms Incremental Prediction Algorithms
We have assumed the true value function v_π(s) is given by a supervisor, but in RL there is no supervisor, only rewards. In practice, we substitute a target for v_π(s).
For MC, the target is the return G_t:
Δθ = α (G_t − V_θ(S_t)) ∇_θ V_θ(S_t)
For TD(0), the target is the TD target r + γ V_θ(s'):
Δθ = α (r + γ V_θ(s') − V_θ(s)) ∇_θ V_θ(s)
For TD(λ), the target is the λ-return G_t^λ:
Δθ = α (G_t^λ − V_θ(S_t)) ∇_θ V_θ(S_t)

16 Incremental Prediction Algorithms Monte-Carlo with Value Function Approximation
The return G_t is an unbiased, noisy sample of the true value v_π(S_t). We can therefore apply supervised learning to the "training data"
⟨S_1, G_1⟩, ⟨S_2, G_2⟩, ..., ⟨S_T, G_T⟩
For example, using linear Monte-Carlo policy evaluation:
Δθ = α (G_t − V_θ(S_t)) ∇_θ V_θ(S_t) = α (G_t − V_θ(S_t)) φ(S_t)
Monte-Carlo evaluation converges to a local optimum, even when using non-linear value function approximation.

17 Incremental Prediction Algorithms TD Learning with Value Function Approximation
The TD target R_{t+1} + γ V_θ(S_{t+1}) is a biased sample of the true value v_π(S_t). We can still apply supervised learning to the "training data"
⟨S_1, R_2 + γ V_θ(S_2)⟩, ⟨S_2, R_3 + γ V_θ(S_3)⟩, ..., ⟨S_{T−1}, R_T⟩
For example, using linear TD(0):
Δθ = α (R + γ V_θ(s') − V_θ(s)) ∇_θ V_θ(s) = α δ φ(s)
Linear TD(0) converges (close) to the global optimum.
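
A minimal sketch of one linear TD(0) update; the feature vectors and transition data are assumed to be supplied by the caller (all names are illustrative, not from the lecture):

import numpy as np

def linear_td0_update(theta, phi_s, phi_s_next, reward, done,
                      alpha=0.01, gamma=0.99):
    """One linear TD(0) update: theta += alpha * delta * phi(s)."""
    v_s = phi_s @ theta
    v_s_next = 0.0 if done else phi_s_next @ theta
    delta = reward + gamma * v_s_next - v_s  # TD error
    return theta + alpha * delta * phi_s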

18 Incremental Prediction Algorithms TD(λ) with Value Function Approximation
The λ-return G_t^λ is also a biased sample of the true value v_π(s). We can again apply supervised learning to the "training data"
⟨S_1, G_1^λ⟩, ⟨S_2, G_2^λ⟩, ..., ⟨S_{T−1}, G_{T−1}^λ⟩
Forward-view linear TD(λ):
Δθ = α (G_t^λ − V_θ(S_t)) ∇_θ V_θ(S_t) = α (G_t^λ − V_θ(S_t)) φ(S_t)
Backward-view linear TD(λ):
δ_t = R_{t+1} + γ V_θ(S_{t+1}) − V_θ(S_t)
e_t = γ λ e_{t−1} + φ(S_t)
Δθ = α δ_t e_t
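
A sketch of the backward view with accumulating eligibility traces, again assuming a hypothetical feature function phi and a list of recorded transitions (these helpers are assumptions, not part of the lecture):

import numpy as np

def linear_td_lambda_episode(theta, transitions, phi,
                             alpha=0.01, gamma=0.99, lam=0.9):
    """Backward-view linear TD(lambda) over one episode.

    transitions: list of (s, reward, s_next, done) tuples.
    """
    e = np.zeros_like(theta)  # eligibility trace
    for s, reward, s_next, done in transitions:
        v_s = phi(s) @ theta
        v_next = 0.0 if done else phi(s_next) @ theta
        delta = reward + gamma * v_next - v_s      # TD error
        e = gamma * lam * e + phi(s)               # decay and accumulate the trace
        theta = theta + alpha * delta * e          # update all recently visited features
    return theta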

19 Incremental Control Algorithms Control with Value Function Approximation
Generalised policy iteration with approximation: starting from parameters θ, alternate
Policy evaluation: approximate policy evaluation, Q_θ ≈ q_π
Policy improvement: ε-greedy policy improvement, π = ε-greedy(Q_θ)

20 Incremental Control Algorithms Action-Value Function Approximation
Approximate the action-value function, Q_θ(s, a) ≈ q_π(s, a). Minimise the mean-squared error between the approximate action-value fn Q_θ(s, a) and the true action-value fn q_π(s, a):
J(θ) = E_π[ (q_π(S, A) − Q_θ(S, A))² ]
Here E_π[·] means both S and A are drawn from a distribution induced by π. Use stochastic gradient descent to find a local minimum:
−(1/2) ∇_θ J(θ) = (q_π(s, a) − Q_θ(s, a)) ∇_θ Q_θ(s, a)
Δθ = α (q_π(s, a) − Q_θ(s, a)) ∇_θ Q_θ(s, a)

21 Incremental Control Algorithms Linear Action-Value Function Approximation
Represent state and action by a feature vector
φ(s, a) = ( φ_1(s, a), ..., φ_n(s, a) )ᵀ
Represent the action-value fn by a linear combination of features:
Q_θ(s, a) = φ(s, a)ᵀ θ = Σ_{j=1}^n φ_j(s, a) θ_j
Stochastic gradient descent update:
∇_θ Q_θ(s, a) = φ(s, a)
Δθ = α (q_π(s, a) − Q_θ(s, a)) φ(s, a)

22 Incremental Control Algorithms Incremental Linear Control Algorithms
As in prediction, we must substitute a target for q_π(s, a).
For MC, the target is the return G_t:
Δθ = α (G_t − Q_θ(S_t, A_t)) φ(S_t, A_t)
For Sarsa(0), the target is the TD target R_{t+1} + γ Q_θ(S_{t+1}, A_{t+1}):
Δθ = α (R_{t+1} + γ Q_θ(S_{t+1}, A_{t+1}) − Q_θ(S_t, A_t)) φ(S_t, A_t)
For forward-view Sarsa(λ), the target is the λ-return computed with action-values:
Δθ = α (G_t^λ − Q_θ(S_t, A_t)) φ(S_t, A_t)
For backward-view Sarsa(λ), the equivalent update is:
δ_t = R_{t+1} + γ Q_θ(S_{t+1}, A_{t+1}) − Q_θ(S_t, A_t)
e_t = γ λ e_{t−1} + φ(S_t, A_t)
Δθ = α δ_t e_t
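
A sketch of backward-view linear Sarsa(λ) with an ε-greedy behaviour policy. The state-action feature function phi_sa(s, a) and the environment step function env_step(s, a) returning (reward, s_next, done) are assumptions introduced for illustration:

import numpy as np

def epsilon_greedy(theta, s, actions, phi_sa, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy one under Q_theta."""
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    q_values = [phi_sa(s, a) @ theta for a in actions]
    return actions[int(np.argmax(q_values))]

def sarsa_lambda_episode(theta, s0, actions, phi_sa, env_step,
                         alpha=0.01, gamma=0.99, lam=0.9, eps=0.1):
    """Backward-view linear Sarsa(lambda) for one episode."""
    e = np.zeros_like(theta)
    s = s0
    a = epsilon_greedy(theta, s, actions, phi_sa, eps)
    done = False
    while not done:
        reward, s_next, done = env_step(s, a)
        q_sa = phi_sa(s, a) @ theta
        if done:
            delta = reward - q_sa
        else:
            a_next = epsilon_greedy(theta, s_next, actions, phi_sa, eps)
            delta = reward + gamma * phi_sa(s_next, a_next) @ theta - q_sa
        e = gamma * lam * e + phi_sa(s, a)   # accumulate the eligibility trace
        theta = theta + alpha * delta * e
        if not done:
            s, a = s_next, a_next
    return theta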

23 Mountain Car Linear Sarsa with Coarse Coding in Mountain Car

24 Mountain Car Linear Sarsa with Radial Basis Functions in Mountain Car

25 Mountain Car Study of λ: Should We Bootstrap?

26 Convergence Convergence Questions The previous results show that it is desirable to bootstrap, but now we consider convergence issues. When do incremental prediction algorithms converge? When using bootstrapping (i.e. TD with λ < 1)? When using linear value function approximation? When using off-policy learning? Ideally, we would like algorithms that converge in all cases.

27 Convergence Baird s Counterexample

28 Convergence Parameter Divergence in Baird s Counterexample

29 Convergence Convergence of Prediction Algorithms (✓ = converges, ✗ = may diverge)
On/Off-Policy  Algorithm  Table Lookup  Linear  Non-Linear
On-Policy      MC         ✓             ✓       ✓
               TD(0)      ✓             ✓       ✗
               TD(λ)      ✓             ✓       ✗
Off-Policy     MC         ✓             ✓       ✓
               TD(0)      ✓             ✗       ✗
               TD(λ)      ✓             ✗       ✗

30 Convergence Gruesome Threesome We have not quite achieved our ideal goal for prediction algorithms.

31 Convergence Gradient Temporal-Difference Learning
TD does not follow the gradient of any objective function. This is why TD can diverge when off-policy or when using non-linear function approximation. Gradient TD follows the true gradient of the projected Bellman error. (✓ = converges, ✗ = may diverge)
On/Off-Policy  Algorithm    Table Lookup  Linear  Non-Linear
On-Policy      MC           ✓             ✓       ✓
               TD           ✓             ✓       ✗
               Gradient TD  ✓             ✓       ✓
Off-Policy     MC           ✓             ✓       ✓
               TD           ✓             ✗       ✗
               Gradient TD  ✓             ✓       ✓

32 Convergence Convergence of Control Algorithms In practice, the tabular control learning algorithms are extended to find a control policy (with linear FA or with neural nets). In theory, many aspects of control are not as simple to specify under function approximation. For example, the starting state distribution is required before specifying an optimal policy, unlike in the tabular setting. The optimal policy can differ when starting from state s^(1) and from state s^(2), but the state aggregation may not be able to distinguish between them. Such situations commonly arise in large environments (e.g. robotics), and tracking (continually adapting the policy instead of converging to a fixed policy) is often preferred to convergence.

33 Batch Methods Batch Reinforcement Learning Gradient descent is simple and appealing, but it is not sample efficient. Batch methods seek to find the best fitting value function for a given set of past experience ("training data").

34 Batch Methods Least Squares Prediction Least Squares Prediction
Given a value function approximation V_θ(s) ≈ v_π(s) and experience D consisting of ⟨state, estimated value⟩ pairs
D = { ⟨S_1, V̂_1^π⟩, ⟨S_2, V̂_2^π⟩, ..., ⟨S_T, V̂_T^π⟩ }
which parameters θ give the best fitting value fn V_θ(s)? Least squares algorithms find the parameter vector θ minimising the sum-squared error between V_θ(S_t) and the target values V̂_t^π:
LS(θ) = Σ_{t=1}^T (V̂_t^π − V_θ(S_t))² = E_D[ (V̂^π − V_θ(S))² ]

35 Batch Methods Least Squares Prediction Stochastic Gradient Descent with Experience Replay
Given experience consisting of ⟨state, value⟩ pairs
D = { ⟨S_1, V̂_1^π⟩, ⟨S_2, V̂_2^π⟩, ..., ⟨S_T, V̂_T^π⟩ }
Repeat:
1. Sample a ⟨state, value⟩ pair from the experience: ⟨s, V̂^π⟩ ~ D
2. Apply a stochastic gradient descent update: Δθ = α (V̂^π − V_θ(s)) ∇_θ V_θ(s)
This converges to the least squares solution θ^π = argmin_θ LS(θ).
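
A minimal sketch of experience replay for linear least-squares prediction; the dataset of (features, target) pairs and its format are illustrative assumptions:

import numpy as np

def replay_sgd(dataset, n_features, alpha=0.01, n_updates=10000, seed=0):
    """Repeatedly sample (phi_s, v_target) pairs from a fixed batch and apply SGD.

    dataset: list of (phi_s, v_target) tuples collected in advance.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)
    for _ in range(n_updates):
        phi_s, v_target = dataset[rng.integers(len(dataset))]
        theta += alpha * (v_target - phi_s @ theta) * phi_s
    return theta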

36 Batch Methods Least Squares Prediction Linear Least Squares Prediction Experience replay finds the least squares solution, but it may take many iterations. Using linear value function approximation V_θ(s) = φ(s)ᵀ θ, we can solve for the least squares solution directly.

37 Batch Methods Least Squares Prediction Linear Least Squares Prediction (2)
At the minimum of LS(θ), the expected update must be zero: E_D[Δθ] = 0
α Σ_{t=1}^T φ(S_t) (V̂_t^π − φ(S_t)ᵀ θ) = 0
Σ_{t=1}^T φ(S_t) V̂_t^π = Σ_{t=1}^T φ(S_t) φ(S_t)ᵀ θ
θ = ( Σ_{t=1}^T φ(S_t) φ(S_t)ᵀ )⁻¹ Σ_{t=1}^T φ(S_t) V̂_t^π
For N features, the direct solution time is O(N³). The incremental solution time is O(N²) using Sherman-Morrison.
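
An illustrative sketch of the O(N²) incremental approach: maintain the inverse of A = Σ φφᵀ with the Sherman-Morrison rank-1 update instead of re-inverting (function and variable names are assumptions, not from the lecture):

import numpy as np

def sherman_morrison_update(A_inv, b, phi, v_target):
    """Rank-1 update of A_inv = (sum phi phi^T)^-1 and b = sum phi * v_target."""
    # Sherman-Morrison:
    #   (A + phi phi^T)^-1 = A^-1 - (A^-1 phi)(phi^T A^-1) / (1 + phi^T A^-1 phi)
    left = A_inv @ phi
    right = A_inv.T @ phi
    A_inv = A_inv - np.outer(left, right) / (1.0 + phi @ left)
    b = b + phi * v_target
    return A_inv, b   # the current estimate of theta is A_inv @ b

# A_inv is typically initialised to (1/epsilon) * I for a small epsilon > 0.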

38 Batch Methods Least Squares Prediction Linear Least Squares Prediction Algorithms
We do not know the true values v_t^π (we only have estimates V̂_t^π). In practice, our "training data" must use noisy or biased samples of v_t^π:
LSMC: Least Squares Monte-Carlo uses the return, v_t^π ≈ G_t
LSTD: Least Squares Temporal-Difference uses the TD target, v_t^π ≈ R_{t+1} + γ V_θ(S_{t+1})
LSTD(λ): Least Squares TD(λ) uses the λ-return, v_t^π ≈ G_t^λ
In each case, solve directly for the fixed point of MC / TD / TD(λ).

39 Batch Methods Least Squares Prediction Linear Least Squares Prediction Algorithms (2)
LSMC:
0 = Σ_{t=1}^T α (G_t − V_θ(S_t)) φ(S_t)
θ = ( Σ_{t=1}^T φ(S_t) φ(S_t)ᵀ )⁻¹ Σ_{t=1}^T φ(S_t) G_t
LSTD:
0 = Σ_{t=1}^T α (R_{t+1} + γ V_θ(S_{t+1}) − V_θ(S_t)) φ(S_t)
θ = ( Σ_{t=1}^T φ(S_t) (φ(S_t) − γ φ(S_{t+1}))ᵀ )⁻¹ Σ_{t=1}^T φ(S_t) R_{t+1}
LSTD(λ):
0 = Σ_{t=1}^T α δ_t e_t
θ = ( Σ_{t=1}^T e_t (φ(S_t) − γ φ(S_{t+1}))ᵀ )⁻¹ Σ_{t=1}^T e_t R_{t+1}
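
A compact sketch of the direct LSTD solve over a batch of transitions; the transition list format, the feature function phi, and the small regulariser are assumptions for illustration:

import numpy as np

def lstd(transitions, phi, n_features, gamma=0.99, reg=1e-6):
    """Direct LSTD solution from (s, reward, s_next, done) transitions."""
    A = reg * np.eye(n_features)   # small regulariser keeps A invertible
    b = np.zeros(n_features)
    for s, reward, s_next, done in transitions:
        phi_s = phi(s)
        phi_next = np.zeros(n_features) if done else phi(s_next)
        A += np.outer(phi_s, phi_s - gamma * phi_next)
        b += phi_s * reward
    return np.linalg.solve(A, b)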

40 Batch Methods Least Squares Prediction Convergence of Linear Least Squares Prediction Algorithms (✓ = converges, ✗ = may diverge, - = not applicable)
On/Off-Policy  Algorithm  Table Lookup  Linear  Non-Linear
On-Policy      MC         ✓             ✓       ✓
               LSMC       ✓             ✓       -
               TD         ✓             ✓       ✗
               LSTD       ✓             ✓       -
Off-Policy     MC         ✓             ✓       ✓
               LSMC       ✓             ✓       -
               TD         ✓             ✗       ✗
               LSTD       ✓             ✓       -

41 Batch Methods Least Squares Control Least Squares Policy Iteration
Generalised policy iteration with approximation: starting from parameters θ, alternate
Policy evaluation: policy evaluation by least squares Q-learning, Q_θ ≈ q_π
Policy improvement: greedy policy improvement, π = greedy(Q_θ)

42 Batch Methods Least Squares Control Least Squares Action-Value Function Approximation
Approximate the action-value function q_π(s, a) using a linear combination of features φ(s, a):
Q_θ(s, a) = φ(s, a)ᵀ θ ≈ q_π(s, a)
Minimise the least squares error between Q_θ(s, a) and q_π(s, a), from experience generated using policy π, consisting of ⟨(state, action), value⟩ pairs
D = { ⟨(S_1, A_1), V̂_1^π⟩, ⟨(S_2, A_2), V̂_2^π⟩, ..., ⟨(S_T, A_T), V̂_T^π⟩ }

43 Batch Methods Least Squares Control Least Squares Control
For policy evaluation, we want to use all experience efficiently. For control, we also want to improve the policy. This experience is generated from many policies, so to evaluate q_π(s, a) we must learn off-policy. We use the same idea as Q-learning:
Use experience generated by an old policy: S_t, A_t, R_{t+1}, S_{t+1} ~ π_old
Consider the alternative successor action a' = π_new(S_{t+1})
Update Q_θ(S_t, A_t) towards the value of the alternative action: R_{t+1} + γ Q_θ(S_{t+1}, a')

44 Batch Methods Least Squares Control Least Squares Q-Learning
Consider the following linear Q-learning update:
δ = R_{t+1} + γ Q_θ(S_{t+1}, π(S_{t+1})) − Q_θ(S_t, A_t)
Δθ = α δ φ(S_t, A_t)
LSTDQ algorithm: solve for total update = zero,
0 = Σ_{t=1}^T α (R_{t+1} + γ Q_θ(S_{t+1}, π(S_{t+1})) − Q_θ(S_t, A_t)) φ(S_t, A_t)
θ = ( Σ_{t=1}^T φ(S_t, A_t) (φ(S_t, A_t) − γ φ(S_{t+1}, π(S_{t+1})))ᵀ )⁻¹ Σ_{t=1}^T φ(S_t, A_t) R_{t+1}
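
A sketch of the LSTDQ solve, mirroring the LSTD sketch above but with state-action features and a supplied evaluation policy (the helper names phi_sa, policy, and the transition format are assumptions):

import numpy as np

def lstdq(transitions, phi_sa, policy, n_features, gamma=0.99, reg=1e-6):
    """LSTDQ: evaluate `policy` from off-policy (s, a, reward, s_next, done) data."""
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, a, reward, s_next, done in transitions:
        phi = phi_sa(s, a)
        if done:
            phi_next = np.zeros(n_features)
        else:
            phi_next = phi_sa(s_next, policy(s_next))  # successor action from the target policy
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * reward
    return np.linalg.solve(A, b)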

45 Batch Methods Least Squares Control Least Squares Policy Iteration Algorithm
The following pseudocode uses LSTDQ for policy evaluation. It repeatedly re-evaluates the experience D with different policies.
function LSPI-TD(D, π_0)
    π' ← π_0
    repeat
        π ← π'
        Q ← LSTDQ(π, D)
        for all s ∈ S do
            π'(s) ← argmax_{a ∈ A} Q(s, a)
        end for
    until (π ≈ π')
    return π
end function
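
A runnable Python sketch of this loop on top of the lstdq function sketched after the previous slide; the small discrete action set and the parameter-change stopping test are illustrative assumptions:

import numpy as np

def lspi(transitions, phi_sa, actions, n_features,
         gamma=0.99, max_iters=20, tol=1e-4):
    """Least Squares Policy Iteration: alternate LSTDQ evaluation and greedy improvement."""
    theta = np.zeros(n_features)

    def greedy_policy(s):
        # Greedy improvement with respect to the current parameters theta.
        q_values = [phi_sa(s, a) @ theta for a in actions]
        return actions[int(np.argmax(q_values))]

    for _ in range(max_iters):
        theta_new = lstdq(transitions, phi_sa, greedy_policy, n_features, gamma)
        if np.linalg.norm(theta_new - theta) < tol:   # parameters (and policy) stopped changing
            theta = theta_new
            break
        theta = theta_new
    return theta, greedy_policy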

46 Batch Methods Least Squares Control Least-Squares Policy Iteration Chain Walk Example
(Figure: the problematic chain MDP, with actions L and R and rewards r=0, r=1, r=1, r=0 along the chain. The caption notes that LSPI was applied using the same basis functions repeated for each of the two actions, so that each action gets its own parameters: φ(s, a) = [ I(a=L)·1, I(a=L)·s, I(a=L)·s², I(a=R)·1, I(a=R)·s, I(a=R)·s² ]ᵀ.)
Consider the 50-state version of this problem, where s is the state number.
Reward +1 in states 10 and 41, 0 elsewhere.
Optimal policy: R (1-9), L (10-25), R (26-41), L (42-50).
Experience: 10,000 steps from a random walk policy.
Features: 10 evenly spaced Gaussians (σ = 4) for each action.

47 Batch Methods Least Squares Control Least-Squares Policy Iteration LSPI in Chain Walk: Action-Value Function (Figure: the action-value function over iterations 1-7; exact values shown in solid blue, the function approximation in red dashes.)

48 Batch Methods Least Squares Control Questions?
