Reinforcement Learning

1 Reinforcement Learning: Function Approximation
Continuous state/action spaces, mean-squared error, gradient temporal-difference learning, least-squares temporal difference, least-squares policy iteration
Vien Ngo, Marc Toussaint. University of Stuttgart

2 Outline
- Function approximation
- Gradient-descent methods
- Least-squares temporal difference

3 Value Iteration in Continuous MDP
$$V^*(s) = \sup_a \left[ r(s, a) + \gamma \int P(s' \mid s, a) \, V^*(s') \, ds' \right]$$

4 Continuous states/actions in model-free RL
All of this is fine in small finite state and action spaces:
- $Q(s, a)$ is an $|S| \times |A|$ matrix of numbers.
- $\pi(a \mid s)$ is an $|S| \times |A|$ matrix of numbers.
In the following, two approaches for handling continuous states/actions:
- use function approximation to estimate $Q(s, a)$: gradient descent (TD with FA), LSPI;
- optimize a parameterized $\pi(a \mid s)$ (policy search, next lecture).

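5 Value Function Approximation
Estimate of the value function:
$$V_t(s) = V(s, \theta_t)$$
(Figure from Satinder Singh, RL: A tutorial, at videolectures.net. The parameter vector $\theta_t$ of this figure is written $\beta_t$ on the following slides.)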

6 Performance Measure
Minimize the mean-squared error (MSE) over some distribution $P$ of the states:
$$\mathrm{MSE}(\beta_t) = \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right]^2$$
where $V^\pi(s)$ is the true value function of the policy $\pi$.
In on-policy learning methods (e.g. SARSA), set $P$ to the stationary distribution of policy $\pi$.

7 Value Function Approximation
The estimated value function:
$$V(s, \beta_t) = \beta_t^\top \phi(s)$$
where $\beta \in \mathbb{R}^d$ is a vector of parameters and $\phi : S \to \mathbb{R}^d$ is a mapping from states to a $d$-dimensional feature space.
- Examples of features: polynomial, RBF, Fourier, wavelet bases, tile coding (these suffer from the curse of dimensionality).
- Nonparametric methods: k-nearest neighbor, nonparametric kernel smoothing, spline smoothers, Gaussian process regression, ...
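The linear form above and the MSE performance measure can be sketched in a few lines. This is only an illustration under stated assumptions: the scalar state space, the RBF centers and width, the uniform distribution P, and the sine curve standing in for the true value function are all made up for the example and do not come from the slides.

```python
import numpy as np

def rbf_features(s, centers, width):
    """Gaussian RBF feature vector phi(s) for a scalar state s."""
    return np.exp(-0.5 * ((s - centers) / width) ** 2)

def value(s, beta, centers, width):
    """Linear value estimate V(s, beta) = beta^T phi(s)."""
    return float(beta @ rbf_features(s, centers, width))

def weighted_mse(states, probs, v_true, beta, centers, width):
    """MSE(beta) = sum_s P(s) [V_pi(s) - V(s, beta)]^2 over a finite set of states."""
    v_hat = np.array([value(s, beta, centers, width) for s in states])
    return float(np.sum(probs * (v_true - v_hat) ** 2))

# Hypothetical setup: 5 RBFs on [0, 1], uniform P, and a stand-in for V_pi.
centers = np.linspace(0.0, 1.0, 5)
width = 0.25
states = np.linspace(0.0, 1.0, 21)
probs = np.full(len(states), 1.0 / len(states))
v_true = np.sin(np.pi * states)

# Least-squares fit of beta, used only as a reference point for the MSE.
Phi = np.array([rbf_features(s, centers, width) for s in states])
beta_fit, *_ = np.linalg.lstsq(Phi, v_true, rcond=None)

mse_zero = weighted_mse(states, probs, v_true, np.zeros(5), centers, width)
mse_fit = weighted_mse(states, probs, v_true, beta_fit, centers, width)
```

With a uniform P the least-squares fit minimizes the same objective up to a constant factor, so `mse_fit` cannot exceed `mse_zero`.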

8 Value Function Approximation
($10^{35}$ states, $10^5$ binary features and parameters. From Sutton, presentation at ICML 2009.)

9 TD(λ) with Function Approximation
The gradient at any point $\beta_t$:
$$\nabla_{\beta_t} \mathrm{MSE}(\beta_t) = -2 \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right] \nabla_{\beta_t} V(s, \beta_t) = -2 \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right] \phi(s)$$
Applying stochastic approximation and bootstrapping, we can iteratively update the parameters.
TD(0) with function approximation:
$$\beta_{t+1} = \beta_t + \alpha_t \left[ r_t + \gamma V(s', \beta_t) - V(s, \beta_t) \right] \phi(s)$$
TD(λ) (with eligibility trace):
$$e_{t+1} = \gamma \lambda e_t + \phi(s)$$
$$\beta_{t+1} = \beta_t + \alpha_t e_{t+1} \left[ r_t + \gamma V(s', \beta_t) - V(s, \beta_t) \right]$$
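The TD(λ) update with an eligibility trace can be tried on a small example. The sketch below assumes a five-state random walk (terminate on either end, reward 1 only on the right exit) and one-hot features, which is the tabular special case of linear function approximation. The environment, step size, and episode count are illustrative choices, not taken from the slides.

```python
import numpy as np

def td_lambda(num_episodes=5000, n_states=5, alpha=0.02, gamma=1.0, lam=0.5, seed=0):
    """TD(lambda) policy evaluation with linear features on a random walk."""
    rng = np.random.default_rng(seed)
    phi = np.eye(n_states)              # phi[s] is the (one-hot) feature vector of s
    beta = np.zeros(n_states)
    for _ in range(num_episodes):
        s = n_states // 2               # every episode starts in the middle
        e = np.zeros(n_states)          # eligibility trace
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            terminal = s_next < 0 or s_next >= n_states
            r = 1.0 if s_next >= n_states else 0.0
            v_next = 0.0 if terminal else beta @ phi[s_next]
            delta = r + gamma * v_next - beta @ phi[s]   # TD error
            e = gamma * lam * e + phi[s]
            beta = beta + alpha * delta * e
            if terminal:
                break
            s = s_next
    return beta

beta = td_lambda()
v_true = np.arange(1, 6) / 6.0          # known values of this random walk
```

With one-hot features the learned `beta` should land near `v_true`; with a coarser feature map it would instead approach the TD fixed point in the span of the features.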

10 TD(λ) with Function Approximation
(Gradient-descent SARSA(λ))
Repeat (for each episode):
    e ← 0
    s ← initial state s_0
    Repeat (for each step of the episode):
        a ← π(s)
        Take action a, observe r, s'
        e ← γλ e + φ(s, a)
        β ← β + α e [ r + γ Q(s', π(s'), β) − Q(s, a, β) ]
        s ← s'
    until s is terminal.
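A runnable version of this loop, as a minimal sketch: the deterministic four-state chain, the fixed "always move right" policy, and the one-hot features over state-action pairs are hypothetical choices made so that the result can be checked by hand (each step costs -1, so Q(s, right) equals minus the number of remaining steps).

```python
import numpy as np

N_STATES, N_ACTIONS, GOAL = 4, 2, 3     # chain 0..3; actions: 0 = left, 1 = right

def phi(s, a):
    """One-hot feature vector for the pair (s, a)."""
    f = np.zeros(N_STATES * N_ACTIONS)
    f[s * N_ACTIONS + a] = 1.0
    return f

def step(s, a):
    """Deterministic move; every step costs -1 until the goal is reached."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return -1.0, s_next

def policy(s):
    """Fixed policy pi: always move right."""
    return 1

def sarsa_lambda(episodes=200, alpha=0.5, gamma=1.0, lam=0.9):
    beta = np.zeros(N_STATES * N_ACTIONS)
    for _ in range(episodes):
        e = np.zeros_like(beta)         # eligibility trace, reset per episode
        s = 0
        while s != GOAL:
            a = policy(s)
            r, s_next = step(s, a)
            # Q of a terminal state is defined as 0
            q_next = 0.0 if s_next == GOAL else beta @ phi(s_next, policy(s_next))
            delta = r + gamma * q_next - beta @ phi(s, a)
            e = gamma * lam * e + phi(s, a)
            beta = beta + alpha * delta * e
            s = s_next
    return beta

beta = sarsa_lambda()
q_right = [float(beta @ phi(s, 1)) for s in range(GOAL)]
```

Because the policy never moves left, the parameters for the "left" action are never updated and stay at zero; an exploring policy would be needed to learn them.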

11 TD(λ) with Function Approximation
Convergence: TD(λ) with linear function approximation converges if the stochastic process $S_t$ is an ergodic Markov process whose stationary distribution is the same as the stationary distribution of the underlying MDP (e.g. the on-policy distribution).
The convergence property, with $\beta_\infty$ the limit of the iteration and $\beta^*$ the MSE-optimal parameters:
$$\mathrm{MSE}(\beta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma} \, \mathrm{MSE}(\beta^*)$$
(Tsitsiklis & Van Roy, An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997)
Is there a convergence guarantee for off-policy methods (e.g. Q-learning with linear function approximation)?

12 Gradient temporal difference learning
- GTD (gradient temporal difference learning)
- GTD2 (gradient temporal difference learning, version 2)
- TDC (temporal difference learning with corrections)
References:
1. Sutton, Szepesvári and Maei. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. NIPS.
2. Sutton, Maei, Precup, Bhatnagar, Silver, Szepesvári, Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML.

13 Value function geometry
Bellman operator: $T V = R + \gamma P V$
(Figure: the space spanned by the feature vectors.)
RMSBE: residual mean-squared Bellman error
RMSPBE: residual mean-squared projected Bellman error

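14 TD performance measure
- Error from the true value: $\| V_\beta - V^\pi \|$
- Error in the Bellman update (used in the previous section, gradient-descent methods): $\| V_\beta - T V_\beta \|$
- Error in the Bellman update after projection: $\| V_\beta - \Pi T V_\beta \|$, where $\Pi$ projects onto the space spanned by the feature vectors.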

15 TD performance measure
GTD(0): the norm of the expected TD update,
$$\mathrm{NEU}(\beta) = \mathbb{E}[\delta\phi]^\top \mathbb{E}[\delta\phi]$$
GTD2 and TDC: the norm of the expected TD update weighted by the covariance matrix of the features ($\delta$ is the TD error),
$$\mathrm{MSPBE}(\beta) = \mathbb{E}[\delta\phi]^\top \, \mathbb{E}[\phi\phi^\top]^{-1} \, \mathbb{E}[\delta\phi]$$
(GTD2 and TDC differ slightly in how they derive the approximate gradient direction.)
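Both objectives can be estimated from a batch of transitions by replacing the expectations with sample means. The sketch below does this on synthetic data: the random feature matrices and rewards are placeholders, and the function names are made up for the example. As a sanity check it also evaluates both objectives at the linear TD fixed point, where the expected TD update is zero.

```python
import numpy as np

def td_errors(beta, Phi, Phi_next, r, gamma):
    """TD errors delta_i = r_i + gamma * beta^T phi(s'_i) - beta^T phi(s_i)."""
    return r + gamma * Phi_next @ beta - Phi @ beta

def expected_td_update(beta, Phi, Phi_next, r, gamma):
    """Sample mean of delta * phi, an estimate of E[delta phi]."""
    delta = td_errors(beta, Phi, Phi_next, r, gamma)
    return (delta[:, None] * Phi).mean(axis=0)

def neu(beta, Phi, Phi_next, r, gamma):
    """NEU(beta) = E[delta phi]^T E[delta phi]."""
    g = expected_td_update(beta, Phi, Phi_next, r, gamma)
    return float(g @ g)

def mspbe(beta, Phi, Phi_next, r, gamma):
    """MSPBE(beta) = E[delta phi]^T E[phi phi^T]^{-1} E[delta phi]."""
    g = expected_td_update(beta, Phi, Phi_next, r, gamma)
    C = Phi.T @ Phi / len(Phi)          # sample feature covariance E[phi phi^T]
    return float(g @ np.linalg.solve(C, g))

# Synthetic batch: rows are phi(s_i), phi(s'_i), and rewards r_i.
rng = np.random.default_rng(0)
n, d, gamma = 200, 3, 0.9
Phi = rng.normal(size=(n, d))
Phi_next = rng.normal(size=(n, d))
r = rng.normal(size=n)

beta_zero = np.zeros(d)
# Linear TD fixed point on this batch: solves Phi^T (Phi - gamma Phi') beta = Phi^T r.
beta_td = np.linalg.solve(Phi.T @ (Phi - gamma * Phi_next), Phi.T @ r)
```

Both `neu(beta_td, ...)` and `mspbe(beta_td, ...)` come out as zero up to rounding, while they are strictly positive at `beta_zero`.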

16 Gradient-TD algorithms with linear function approximation
- are guaranteed to converge under both on-policy and general off-policy training;
- have a computational complexity of only $O(n)$ per step ($n$ is the number of features);
- in this sense, the curse of dimensionality is removed.

17 LSPI: Least Squares Policy Iteration
Gradient-descent methods are sensitive to the choice of learning rates and initial parameter values.
Least-squares temporal difference (LSTD) method: LSPI.
- Bellman residual minimization
- Least-squares fixed-point approximation

18 Bellman residual minimization
The Q-function for a given policy $\pi$ fulfills, for any $s, a$:
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid a, s) \, Q^\pi(s', \pi(s'))$$
If we have $n$ data points $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^n$, we require that this equation holds (approximately) for these $n$ data points:
$$\forall i : \; Q^\pi(s_i, a_i) = r_i + \gamma Q^\pi(s'_i, \pi(s'_i))$$
Written in vector notation: $Q = R + \gamma Q'$ with $n$-dimensional data vectors $Q, R, Q'$.
Written as an optimization problem: minimize the Bellman residual error.
True residual:
$$L(Q^\pi) = \| R + \gamma P \Pi Q^\pi - Q^\pi \|^2$$
On the data:
$$L(Q^\pi) = \sum_{i=1}^n \left[ Q^\pi(s_i, a_i) - r_i - \gamma Q^\pi(s'_i, \pi(s'_i)) \right]^2 = \| R - Q + \gamma Q' \|^2$$

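19 Bellman residual minimization
Substituting the linear approximation $Q^\pi \approx \Phi \beta^\pi$ into the Bellman equation gives the overconstrained system $(\Phi - \gamma P \Pi \Phi) \beta^\pi \approx R$. Its least-squares solution is the Bellman residual minimizing approximation:
$$\beta^\pi = \left( (\Phi - \gamma P \Pi \Phi)^\top (\Phi - \gamma P \Pi \Phi) \right)^{-1} (\Phi - \gamma P \Pi \Phi)^\top R$$
The solution $\beta^\pi$ of the system is unique since the columns of $\Phi$ (the basis functions) are linearly independent by definition.
(See Lagoudakis & Parr (JMLR 2003) for details.)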

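20 LSPI: Least Squares Fixed-Point Approximation
Project $T_\pi Q$ back onto $\mathrm{span}(\Phi)$:
$$\hat{T}_\pi(Q) = \Phi (\Phi^\top \Phi)^{-1} \Phi^\top (T_\pi Q)$$
The approximate fixed point:
$$\beta^\pi = \left( \Phi^\top (\Phi - \gamma P \Pi \Phi) \right)^{-1} \Phi^\top R$$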
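When the model is known, the Bellman residual solution and the fixed-point solution can both be computed directly and compared. The following is a sketch on a small random MDP: the sizes, the random transition matrix and rewards, the policy, and the two-column basis `Phi_small` are arbitrary choices for illustration. Here `P` has one row per state-action pair and `Pi` encodes a deterministic policy, following the notation of Lagoudakis & Parr.

```python
import numpy as np

gamma = 0.9
n_s, n_a = 3, 2
rng = np.random.default_rng(0)

# P[(s, a), s']: transition probabilities, each row sums to 1.
P = rng.random((n_s * n_a, n_s))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n_s * n_a)               # expected reward per state-action pair

# Pi[s', (s', a')] = 1 if a' = pi(s') for a deterministic policy pi.
pi = np.array([0, 1, 0])
Pi = np.zeros((n_s, n_s * n_a))
for s in range(n_s):
    Pi[s, s * n_a + pi[s]] = 1.0

def bellman_residual_solution(Phi):
    """Least-squares solution of (Phi - gamma P Pi Phi) beta = R."""
    M = Phi - gamma * P @ Pi @ Phi
    return np.linalg.solve(M.T @ M, M.T @ R)

def fixed_point_solution(Phi):
    """Solution of Phi^T (Phi - gamma P Pi Phi) beta = Phi^T R."""
    M = Phi - gamma * P @ Pi @ Phi
    return np.linalg.solve(Phi.T @ M, Phi.T @ R)

# Exact Q^pi from the Bellman equation, for reference.
q_exact = np.linalg.solve(np.eye(n_s * n_a) - gamma * P @ Pi, R)

# Full basis: both methods recover Q^pi exactly.
Phi_full = np.eye(n_s * n_a)
beta_br_full = bellman_residual_solution(Phi_full)
beta_fp_full = fixed_point_solution(Phi_full)

# Reduced basis (constant and linear index feature): the two generally differ.
Phi_small = np.column_stack([np.ones(n_s * n_a), np.arange(n_s * n_a)])
beta_br_small = bellman_residual_solution(Phi_small)
beta_fp_small = fixed_point_solution(Phi_small)
```

With a complete basis the two approximations coincide with the exact Q-function; the choice between them only matters once the basis cannot represent Q exactly.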

21 LSPI: Comparison of the two views
- The Bellman residual minimizing method focuses on the magnitude of the change.
- The least-squares fixed-point approximation focuses on the direction of the change.
- The least-squares fixed-point approximation is less stable and less predictable.
- The least-squares fixed-point method might nevertheless be preferable, because:
  - learning the Bellman residual minimizing approximation requires doubled samples;
  - experimentally, the fixed-point method often delivers policies that are superior.
(See Lagoudakis & Parr (JMLR 2003) for details.)

22 LSPI: LSTDQ algorithm
LSTDQ estimates $A = \Phi^\top (\Phi - \gamma P \Pi \Phi)$ and $b = \Phi^\top R$ from the samples:
For each (s, a, r, s') ∈ D:
    A ← A + φ(s, a) ( φ(s, a) − γ φ(s', π(s')) )ᵀ
    b ← b + φ(s, a) r
β ← A⁻¹ b
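The accumulation of A and b translates almost line by line into code. The sketch below assumes a feature function `phi(s, a)` returning a length-k array and a deterministic `policy(s)`; the two-state test problem (action a leads to state a, reward 1 for landing in state 1) is a hypothetical example chosen because its Q-values are easy to work out by hand.

```python
import numpy as np

def lstdq(samples, phi, policy, k, gamma):
    """LSTDQ: accumulate A and b over samples (s, a, r, s'), then solve A beta = b."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)

# Hypothetical test problem: 2 states, 2 actions, one-hot features.
def phi(s, a):
    f = np.zeros(4)
    f[2 * s + a] = 1.0
    return f

def always_one(s):
    return 1

# One sample per state-action pair: taking action a leads to state a.
samples = [(s, a, float(a == 1), a) for s in range(2) for a in range(2)]
beta = lstdq(samples, phi, always_one, k=4, gamma=0.9)
```

For this problem the policy "always choose action 1" collects reward 1 forever, so Q(s, 1) = 1 / (1 - 0.9) = 10 and Q(s, 0) = 0.9 * 10 = 9, which is what `beta` contains. With few or poorly distributed samples A can be singular; adding a small multiple of the identity to A is a common remedy that is not shown here.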

23 LSPI algorithm
given D
repeat
    π ← π'
    π' ← LSTDQ(π)    (π' is the greedy policy of β^π)
until π' ≈ π
return π
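The outer policy-iteration loop can be sketched as follows. The two-state problem, the one-hot features, and the stopping rule (stop when the parameter vector no longer changes) are assumptions made for this example; the helper names `lstdq`, `greedy`, and `lspi` are likewise just illustrative.

```python
import numpy as np

N_S, N_A, GAMMA = 2, 2, 0.9

def phi(s, a):
    """One-hot features over state-action pairs."""
    f = np.zeros(N_S * N_A)
    f[s * N_A + a] = 1.0
    return f

def lstdq(samples, policy):
    """Evaluate `policy` from the samples: solve A beta = b."""
    k = N_S * N_A
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        A += np.outer(f, f - GAMMA * phi(s_next, policy(s_next)))
        b += f * r
    return np.linalg.solve(A, b)

def greedy(beta):
    """Greedy policy with respect to Q(s, a) = beta^T phi(s, a)."""
    return lambda s: int(np.argmax([beta @ phi(s, a) for a in range(N_A)]))

def lspi(samples, max_iter=20, tol=1e-8):
    beta = np.zeros(N_S * N_A)
    for _ in range(max_iter):
        beta_new = lstdq(samples, greedy(beta))   # evaluate the current greedy policy
        if np.linalg.norm(beta_new - beta) < tol:
            break
        beta = beta_new
    return beta_new

# Taking action a leads to state a; landing in state 1 pays reward 1.
samples = [(s, a, float(a == 1), a) for s in range(N_S) for a in range(N_A)]
beta = lspi(samples)
policy = greedy(beta)
```

Starting from all-zero parameters the first greedy policy always picks action 0; one round of evaluation reveals that action 1 is better, and the next round confirms the improved policy.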

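24 LSPI: Riding a bike
(From Alma A. M. Rahat's simulation.)
State: $(\theta, \dot\theta, \omega, \dot\omega, \ddot\omega, \psi)$, where $\theta$ is the angle of the handlebar, $\omega$ is the vertical angle of the bicycle, and $\psi$ is the angle of the bicycle to the goal.
Actions: $(\tau, \nu)$, where $\tau \in \{-2, 0, 2\}$ is the torque applied to the handlebar and $\nu \in \{-0.02, 0, 0.02\}$ is the displacement of the rider.
For each $a$, the value function $Q(s, a)$ uses 20 features:
$$(1, \omega, \dot\omega, \omega^2, \dot\omega^2, \omega\dot\omega, \theta, \dot\theta, \theta^2, \dot\theta^2, \theta\dot\theta, \omega\theta, \omega\theta^2, \omega^2\theta, \psi, \psi^2, \psi\theta, \bar\psi, \bar\psi^2, \bar\psi\theta)$$
where $\bar\psi = \mathrm{sign}(\psi)\,\pi - \psi$.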
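The feature vector can be written out as a small function. This sketch assumes the 20-term basis of Lagoudakis & Parr (JMLR 2003), which includes the squared angular velocity of the tilt, and it assumes that the per-action copies of the basis are stacked into one long vector with zeros for all other actions. The function names and the `n_actions` argument are hypothetical.

```python
import math

def bicycle_basis(theta, theta_dot, omega, omega_dot, omega_ddot, psi):
    """20 basis functions of the bicycle state (omega_ddot is not used by the basis)."""
    sign = (psi > 0) - (psi < 0)
    psi_bar = sign * math.pi - psi
    return [
        1.0, omega, omega_dot, omega ** 2, omega_dot ** 2, omega * omega_dot,
        theta, theta_dot, theta ** 2, theta_dot ** 2, theta * theta_dot,
        omega * theta, omega * theta ** 2, omega ** 2 * theta,
        psi, psi ** 2, psi * theta,
        psi_bar, psi_bar ** 2, psi_bar * theta,
    ]

def state_action_features(state, action_index, n_actions):
    """Place the 20 state features in the block belonging to the chosen action."""
    block = bicycle_basis(*state)
    features = [0.0] * (len(block) * n_actions)
    start = action_index * len(block)
    features[start:start + len(block)] = block
    return features

initial_state = (0.0, 0.0, 0.0, 0.0, 0.0, math.pi / 2)
basis = bicycle_basis(*initial_state)
stacked = state_action_features(initial_state, action_index=2, n_actions=5)
```

Keeping a separate block per action means each action gets its own weight vector while LSTDQ still solves a single linear system.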

25 LSPI: Riding a bike
(Figure from Lagoudakis & Parr, JMLR 2003.)

26 LSPI: Riding a bike
Training samples were collected in advance by initializing the bicycle to a small random perturbation from the initial position $(0, 0, 0, 0, 0, \pi/2)$ and running each episode for up to 20 steps using a purely random policy.
Each successful ride must complete a distance of 2 kilometers.
This experiment was repeated 100 times.
(From Lagoudakis & Parr, JMLR 2003.)

27 Feature Selection/Building Problems
- Feature selection.
- Online/incremental feature learning.
Wu and Givan (2005); Keller et al. (2006); Mahadevan et al. (2006); Parr et al. (2007); Kolter and Ng (2009); Mahadevan and Liu (2010); Boots and Gordon (2010); Sun et al. (2011), etc.
