Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation
Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation

Rich Sutton, University of Alberta; Hamid Maei, University of Alberta; Doina Precup, McGill University; Shalabh Bhatnagar, Indian Institute of Science, Bangalore; David Silver, University of Alberta; Csaba Szepesvári, University of Alberta; Eric Wiewiora, University of Alberta
a breakthrough in RL
- function approximation in TD learning is now straightforward, as straightforward as it is in supervised learning
- TD learning can now be done as gradient descent on a novel Bellman error objective
limitations (for this paper)
- linear function approximation
- one-step TD methods (λ = 0)
- prediction (policy evaluation), not control
- all of these are being removed in current work
keys to the breakthrough
- a new Bellman error objective function
- an algorithmic trick: a second set of weights to estimate one of the sub-expectations and avoid the need for double sampling, introduced in prior work (Sutton, Szepesvári & Maei, 2008)
outline
- ways in which TD with FA has not been straightforward
- the new Bellman error objective function
- derivation of new algorithms (the trick)
- results (theory and experiments)
TD+FA was not straightforward
- with linear FA, off-policy methods such as Q-learning diverge on some problems (Baird, 1995)
- with nonlinear FA, even on-policy methods can diverge (Tsitsiklis & Van Roy, 1997)
- convergence guaranteed only for one very important special case: linear FA, learning about the policy being followed
- second-order or importance-sampling methods are complex, slow, or messy
- no true gradient-descent methods
Baird's counterexample
- a simple Markov chain with linear FA, all rewards zero
- deterministic, expectation-based full backups (as in DP)
- each state updated once per sweep (as in DP)
- weights can diverge to ±∞
(figure: RMSPBE error versus sweeps for GTD, GTD2, TDC, and TD)
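The divergence phenomenon can be reproduced in a few lines. This is a minimal two-state sketch in the spirit of Baird's counterexample, not the seven-state original; the features, step size, and discount are my own illustrative choices.

```python
import numpy as np

# Two states with linear features phi(s1) = [1], phi(s2) = [2]:
# the approximate values are theta and 2*theta, and all rewards are zero,
# so the true value function is zero for any theta* = 0.
# Updating only the s1 -> s2 transition (as off-policy training can do)
# makes the TD(0) update Delta_theta = alpha * (gamma*2*theta - theta),
# which grows without bound whenever gamma > 1/2.
gamma, alpha = 0.99, 0.1
theta = 1.0
history = [theta]
for _ in range(100):
    delta = 0.0 + gamma * (2 * theta) - theta   # TD error, reward = 0
    theta += alpha * delta * 1.0                # phi(s1) = 1
    history.append(theta)

print(history[-1])   # theta has blown up rather than shrunk toward 0
```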
outline
- ways in which TD with FA has not been straightforward
- the new Bellman error objective function
- derivation of new algorithms (the trick)
- results (theory and experiments)
e.g. linear value-function approximation in Computer Go
- state s: ~10^35 states
- feature vector φ_s: ~10^5 binary features
- parameter vector θ: ~10^5 parameters
- estimated value: V_θ(s) = θ⊤φ_s
Notation
- state transitions: s → s′ with reward r
- feature vectors: φ, φ′ ∈ R^n, n ≪ #states
- approximate values: V_θ(s) = θ⊤φ, with θ ∈ R^n the parameter vector
- TD error: δ = r + γθ⊤φ′ − θ⊤φ, γ ∈ [0, 1)
- TD(0) algorithm: Δθ = αδφ, α > 0
- true values: V*(s) = E[r | s] + γ Σ_{s′} P_{ss′} V*(s′)
- Bellman operator (over per-state vectors): TV = R + γPV, with V* = TV*
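Using this notation, the TD(0) update Δθ = αδφ can be sketched on a tiny deterministic chain. This is a hypothetical example of mine, not from the talk; with one-hot ("tabular") features the true values are representable, so TD(0) recovers them.

```python
import numpy as np

# Deterministic 3-state chain: s0 -(r=0)-> s1 -(r=1)-> terminal.
# One-hot features, so TD(0) can represent the true values
# V(s1) = 1 and V(s0) = gamma * V(s1) = 0.9.
gamma, alpha = 0.9, 0.5
phi = np.eye(2)                      # phi[s] is the feature vector of state s
theta = np.zeros(2)
for _ in range(200):                 # repeated episodes
    # transition s0 -> s1, reward 0
    delta = 0.0 + gamma * theta @ phi[1] - theta @ phi[0]
    theta += alpha * delta * phi[0]
    # transition s1 -> terminal (phi' = 0), reward 1
    delta = 1.0 + gamma * 0.0 - theta @ phi[1]
    theta += alpha * delta * phi[1]

print(theta)   # close to [0.9, 1.0]
```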
Value function geometry
(figure: V_θ lies in the space spanned by the feature vectors, weighted by the state visitation distribution D; T maps it to TV_θ outside that space, and Π projects back)
- T takes you outside the space; Π projects you back into it
- RMSBE: the distance from V_θ to TV_θ; previous work on gradient methods for TD minimized this objective function (Baird 1995, 1999)
- RMSPBE: the distance from V_θ to ΠTV_θ; a better objective function, the Mean Square Projected Bellman Error (MSPBE)
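The geometry can be made concrete numerically. The sketch below is my own example: it builds the projection matrix Π = Φ(Φ⊤DΦ)⁻¹Φ⊤D for a three-state chain, evaluates the MSPBE, and checks that the MSPBE is zero exactly at the TD fixed point.

```python
import numpy as np

# Three-state cyclic chain; the uniform d is its stationary distribution.
P = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
R = np.array([0., 0., 1.])          # expected reward on leaving each state
gamma = 0.9
d = np.array([1/3, 1/3, 1/3])
D = np.diag(d)

Phi = np.array([[1., 0.],           # two features for three states
                [0., 1.],
                [1., 1.]])

def mspbe(theta):
    V = Phi @ theta                                          # approximate values
    TV = R + gamma * P @ V                                   # Bellman operator
    Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D    # projection onto span(Phi)
    err = V - Pi @ TV
    return err @ D @ err

# The TD fixed point solves A theta = b with A = Phi^T D (I - gamma P) Phi
# and b = Phi^T D R; there the projected Bellman error vanishes.
A = Phi.T @ D @ (np.eye(3) - gamma * P) @ Phi
b = Phi.T @ D @ R
theta_star = np.linalg.solve(A, b)
print(mspbe(theta_star))            # ~0
print(mspbe(np.zeros(2)))           # > 0
```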
TD objective functions (to be minimized)
- Error from the true values: ‖V_θ − V*‖²_D (not TD)
- Error in the Bellman equation (Bellman residual): ‖V_θ − TV_θ‖²_D (not right)
- Error in the Bellman equation after projection (MSPBE): ‖V_θ − ΠTV_θ‖²_D (right!)
- Zero expected TD update: V_θ = ΠTV_θ (not an objective)
- Norm of the expected TD update: ‖E[Δθ_TD]‖²
- Expected squared TD error: E[δ²]
backwards-bootstrapping example
- episodes: A1 → B → end with reward 1; A2 → C → end with reward 0
- all transitions are deterministic, so Bellman error = TD error
- with function approximation, the two A states look the same; they share a single feature and must be given the same approximate value: V(A1) = V(A2) = 1/2
- clearly, the right solution is V(B) = 1, V(C) = 0
- but the solution that minimizes the Bellman error is V(B) = 3/4, V(C) = 1/4
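The claimed Bellman-error minimizer can be checked directly. Setting the derivative of the summed squared Bellman errors to zero gives a small linear system; this is my own verification sketch, with γ = 1 and the shared value V(A) = V(A1) = V(A2).

```python
import numpy as np

# Residuals (Bellman = TD errors here): at A1: V(B) - V(A); at A2: V(C) - V(A);
# at B: 1 - V(B); at C: 0 - V(C).
# Zeroing the gradient of the sum of squares over (V(A), V(B), V(C)) gives:
#   2 V(B) - V(A) = 1          (from d/dV(B))
#   2 V(C) - V(A) = 0          (from d/dV(C))
#   2 V(A) - V(B) - V(C) = 0   (from d/dV(A))
M = np.array([[-1.,  2.,  0.],
              [-1.,  0.,  2.],
              [ 2., -1., -1.]])
rhs = np.array([1., 0., 0.])
V_A, V_B, V_C = np.linalg.solve(M, rhs)
print(V_A, V_B, V_C)   # 0.5, 0.75, 0.25 -- not the intuitive V(B)=1, V(C)=0
```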
TD objective functions (to be minimized), revisited
- Error from the true values: ‖V_θ − V*‖²_D (not TD)
- Error in the Bellman equation (Bellman residual): ‖V_θ − TV_θ‖²_D (not right)
- Error in the Bellman equation after projection (MSPBE): ‖V_θ − ΠTV_θ‖²_D (right!)
- Zero expected TD update: V_θ = ΠTV_θ, E[Δθ_TD] = 0 (not an objective)
- Norm of the expected TD update: ‖E[Δθ_TD]‖² (previous work)
- Expected squared TD error: E[δ²] (not right; residual gradient)
outline
- ways in which TD with FA has not been straightforward
- the new Bellman error objective function
- derivation of new algorithms (the trick)
- results (theory and experiments)
Gradient-descent learning
1. Pick an objective function J(θ), a parameterized function to be minimized
2. Use calculus to analytically compute the gradient ∇_θ J(θ)
3. Find a "sample gradient" ∇_θ J_t(θ) that you can sample on every time step and whose expected value equals the gradient
4. Take small steps in θ proportional to the sample gradient: Δθ = −α ∇_θ J_t(θ)
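These four steps can be sketched for a supervised objective, J(θ) = E[(y − θ⊤x)²], where the per-sample gradient is an unbiased estimate of the true gradient. The data distribution, step size, and target weights are my own illustrative assumptions.

```python
import numpy as np

# Step 1: J(theta) = E[(y - theta^T x)^2].
# Step 2: grad J = E[-2 (y - theta^T x) x].
# Step 3: the per-sample quantity -2 (y - theta^T x) x has this as its expectation.
# Step 4: take small steps against the sample gradient.
rng = np.random.default_rng(0)
true_theta = np.array([2.0, -1.0])
theta = np.zeros(2)
alpha = 0.05
for _ in range(5000):
    x = rng.normal(size=2)                     # sample an input
    y = true_theta @ x + 0.01 * rng.normal()   # noisy target
    grad_sample = -2 * (y - theta @ x) * x     # sample gradient of J
    theta -= alpha * grad_sample               # small step downhill
print(theta)   # close to [2, -1]
```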
Derivation of the TDC algorithm
On a transition s → s′ with reward r and feature vectors φ, φ′:

Δθ = −(1/2) α ∇_θ J(θ)
   = −(1/2) α ∇_θ ‖V_θ − ΠTV_θ‖²_D
   = −(1/2) α ∇_θ ( E[δφ]⊤ E[φφ⊤]⁻¹ E[δφ] )
   = −α ( ∇_θ E[δφ]⊤ ) E[φφ⊤]⁻¹ E[δφ]
   = −α E[ ∇_θ( r + γθ⊤φ′ − θ⊤φ ) φ⊤ ] E[φφ⊤]⁻¹ E[δφ]
   = −α E[ (γφ′ − φ) φ⊤ ] E[φφ⊤]⁻¹ E[δφ]
   = α ( E[φφ⊤] − γ E[φ′φ⊤] ) E[φφ⊤]⁻¹ E[δφ]
   = α E[δφ] − αγ E[φ′φ⊤] E[φφ⊤]⁻¹ E[δφ]
   ≈ α E[δφ] − αγ E[φ′φ⊤] w
   ≈ αδφ − αγφ′(φ⊤w)    (sampling)

This is the main trick: w ∈ R^n is a second set of weights, maintained as an estimate of E[φφ⊤]⁻¹ E[δφ].
The complete TD with gradient correction (TDC) algorithm
On each transition s → s′ with reward r and feature vectors φ, φ′, update two parameters:
  θ ← θ + αδφ − αγφ′(φ⊤w)
  w ← w + β(δ − φ⊤w)φ
where δ = r + γθ⊤φ′ − θ⊤φ
- the first term of the θ update is exactly TD(0); the second is the gradient correction
- φ⊤w is an estimate of the TD error δ for the current state
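These two updates can be sketched in code. This is my own illustration, not the authors' implementation: it applies TDC to a single off-policy transition with φ = 1, φ′ = 2, and zero reward, the setting where plain TD(0) diverges for γ > 1/2; the step sizes are assumed two-timescale choices (β larger than α).

```python
# TDC on the transition where plain TD(0) diverges off-policy:
# phi = 1, phi' = 2, reward 0, so the true value function is zero.
gamma, alpha, beta = 0.99, 0.005, 0.05
theta, w = 1.0, 0.0
phi, phi_next, r = 1.0, 2.0, 0.0
for _ in range(20000):
    delta = r + gamma * theta * phi_next - theta * phi          # TD error
    theta += alpha * delta * phi - alpha * gamma * phi_next * (phi * w)
    w += beta * (delta - phi * w) * phi                         # second weight vector
print(theta)   # driven toward 0, the true solution
```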
outline
- ways in which TD with FA has not been straightforward
- the new Bellman error objective function
- derivation of new algorithms (the trick)
- results (theory and experiments)
Three new algorithms
- GTD, the original gradient-TD algorithm (Sutton, Szepesvári & Maei, 2008)
- GTD2, a second-generation GTD
- TDC
Convergence theorems
- for arbitrary on- or off-policy training
- all algorithms converge w.p. 1 to the TD fixed point
- GTD and GTD2 converge on a single time scale (α = β)
- TDC converges in a two-time-scale sense (α, β → 0 with α/β → 0)
Summary of empirical results on small problems
(figures: RMSPBE learning curves and step-size sensitivity for TD, GTD, GTD2, and TDC on four tasks: Random Walk with tabular features, Random Walk with inverted features, Random Walk with dependent features, and the Boyan Chain)
- usually GTD << GTD2 < TD, TDC
Computer Go experiment
- learn a linear value function (probability of winning) for 9x9 Go from self-play
- one million features, each corresponding to a template on a part of the Go board
- an established experimental testbed
(figure: RNEU ‖E[Δθ_TD]‖ versus step size α for TD, GTD, GTD2, and TDC)
conclusions
- the new algorithms are roughly as efficient as conventional TD on on-policy problems, but are guaranteed convergent under general off-policy training as well
- their key ideas appear to extend quite broadly: to control, general λ, nonlinear settings, DP, intra-option learning, TD nets...
- TD with FA is now straightforward
- the curse of dimensionality is removed
More informationModel-based Reinforcement Learning with State and Action Abstractions. Hengshuai Yao
Model-based Reinforcement Learning with State and Action Abstractions by Hengshuai Yao A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department of
More informationParametric value function approximation: A unified view
Parametric value function approximation: A unified view Matthieu Geist, Olivier Pietquin To cite this version: Matthieu Geist, Olivier Pietquin. Parametric value function approximation: A unified view.
More informationREINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationMachine Learning I Reinforcement Learning
Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:
More informationNatural Actor Critic Algorithms
Natural Actor Critic Algorithms Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee June 2009 Abstract We present four new reinforcement learning algorithms based on actor critic, function
More informationReinforcement learning
Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error
More informationApproximation Methods in Reinforcement Learning
2018 CS420, Machine Learning, Lecture 12 Approximation Methods in Reinforcement Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html Reinforcement
More informationCSC321 Lecture 22: Q-Learning
CSC321 Lecture 22: Q-Learning Roger Grosse Roger Grosse CSC321 Lecture 22: Q-Learning 1 / 21 Overview Second of 3 lectures on reinforcement learning Last time: policy gradient (e.g. REINFORCE) Optimize
More informationProf. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be
REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while
More informationTrust Region Policy Optimization
Trust Region Policy Optimization Yixin Lin Duke University yixin.lin@duke.edu March 28, 2017 Yixin Lin (Duke) TRPO March 28, 2017 1 / 21 Overview 1 Preliminaries Markov Decision Processes Policy iteration
More informationRegularization and Feature Selection in. the Least-Squares Temporal Difference
Regularization and Feature Selection in Least-Squares Temporal Difference Learning J. Zico Kolter kolter@cs.stanford.edu Andrew Y. Ng ang@cs.stanford.edu Computer Science Department, Stanford University,
More informationarxiv: v4 [cs.lg] 22 Oct 2018
Ahmed Touati Pierre-Luc Bacon 3 Doina Precup 3 4 Pascal Vincent 4 arxiv:705.093v4 [cs.lg Oct 08 Abstract Off-policy learning is key to scaling up reinforcement learning as it allows to learn about a target
More informationIntroduction to Reinforcement Learning. Part 6: Core Theory II: Bellman Equations and Dynamic Programming
Introduction to Reinforcement Learning Part 6: Core Theory II: Bellman Equations and Dynamic Programming Bellman Equations Recursive relationships among values that can be used to compute values The tree
More informationReinforcement learning an introduction
Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,
More informationLecture 9: Policy Gradient II 1
Lecture 9: Policy Gradient II 1 Emma Brunskill CS234 Reinforcement Learning. Winter 2019 Additional reading: Sutton and Barto 2018 Chp. 13 1 With many slides from or derived from David Silver and John
More informationREINFORCEMENT LEARNING
REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents
More informationReinforcement Learning: An Introduction. ****Draft****
i Reinforcement Learning: An Introduction Second edition, in progress ****Draft**** Richard S. Sutton and Andrew G. Barto c 2014, 2015 A Bradford Book The MIT Press Cambridge, Massachusetts London, England
More informationUniversity of Alberta
University of Alberta NEW REPRESENTATIONS AND APPROXIMATIONS FOR SEQUENTIAL DECISION MAKING UNDER UNCERTAINTY by Tao Wang A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment
More informationReinforcement Learning. Machine Learning, Fall 2010
Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30
More informationarxiv: v1 [cs.lg] 23 Oct 2017
Accelerated Reinforcement Learning K. Lakshmanan Department of Computer Science and Engineering Indian Institute of Technology (BHU), Varanasi, India Email: lakshmanank.cse@itbhu.ac.in arxiv:1710.08070v1
More informationReinforcement Learning
Reinforcement Learning Cyber Rodent Project Some slides from: David Silver, Radford Neal CSC411: Machine Learning and Data Mining, Winter 2017 Michael Guerzhoy 1 Reinforcement Learning Supervised learning:
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationOff-Policy Temporal-Difference Learning with Function Approximation
This is a digitally remastered version of a paper published in the Proceedings of the 18th International Conference on Machine Learning (2001, pp. 417 424. This version should be crisper but otherwise
More informationLecture 8: Policy Gradient
Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve
More informationBasics of reinforcement learning
Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system
More informationBalancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm
Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu
More informationIntroduction to Reinforcement Learning
CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.
More informationChristopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015
Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)
More informationPayments System Design Using Reinforcement Learning: A Progress Report
Payments System Design Using Reinforcement Learning: A Progress Report A. Desai 1 H. Du 1 R. Garratt 2 F. Rivadeneyra 1 1 Bank of Canada 2 University of California Santa Barbara 16th Payment and Settlement
More informationCS 598 Statistical Reinforcement Learning. Nan Jiang
CS 598 Statistical Reinforcement Learning Nan Jiang Overview What s this course about? A grad-level seminar course on theory of RL 3 What s this course about? A grad-level seminar course on theory of RL
More informationChapter 4: Dynamic Programming
Chapter 4: Dynamic Programming Objectives of this chapter: Overview of a collection of classical solution methods for MDPs known as dynamic programming (DP) Show how DP can be used to compute value functions,
More informationChapter 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1
Chapter 7: Eligibility Traces R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 Midterm Mean = 77.33 Median = 82 R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
More informationDeep Reinforcement Learning SISL. Jeremy Morton (jmorton2) November 7, Stanford Intelligent Systems Laboratory
Deep Reinforcement Learning Jeremy Morton (jmorton2) November 7, 2016 SISL Stanford Intelligent Systems Laboratory Overview 2 1 Motivation 2 Neural Networks 3 Deep Reinforcement Learning 4 Deep Learning
More informationLecture 9: Policy Gradient II (Post lecture) 2
Lecture 9: Policy Gradient II (Post lecture) 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 Additional reading: Sutton and Barto 2018 Chp. 13 2 With many slides from or derived from David Silver
More information