Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation

Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation. Rich Sutton (University of Alberta), Hamid Maei (University of Alberta), Doina Precup (McGill University), Shalabh Bhatnagar (Indian Institute of Science, Bangalore), David Silver (University of Alberta), Csaba Szepesvari (University of Alberta), Eric Wiewiora (University of Alberta)

a breakthrough in RL: function approximation in TD learning is now straightforward, as straightforward as it is in supervised learning; TD learning can now be done as gradient descent on a novel Bellman error objective

limitations (for this paper): linear function approximation; one-step TD methods (λ = 0); prediction (policy evaluation), not control. All of these limitations are being removed in current work.

keys to the breakthrough: a new Bellman error objective function, and an algorithmic trick: a second set of weights to estimate one of the sub-expectations and avoid the need for double sampling, introduced in prior work (Sutton, Szepesvari & Maei, 2008)

outline: ways in which TD with FA has not been straightforward; the new Bellman error objective function; derivation of new algorithms (the trick); results (theory and experiments)

TD+FA was not straightforward: with linear FA, off-policy methods such as Q-learning diverge on some problems (Baird, 1995); with nonlinear FA, even on-policy methods can diverge (Tsitsiklis & Van Roy, 1997); convergence is guaranteed only for one very important special case: linear FA, learning about the policy being followed; second-order or importance-sampling methods are complex, slow, or messy; there are no true gradient-descent methods

Baird's counterexample: a simple Markov chain; linear FA, all rewards zero; deterministic, expectation-based full backups (as in DP); each state updated once per sweep (as in DP); weights can diverge to ±∞. [Figure: RMSPBE error vs. sweeps for TD, GTD, GTD2, and TDC on this problem.]

outline: ways in which TD with FA has not been straightforward; the new Bellman error objective function; derivation of new algorithms (the trick); results (theory and experiments)

e.g. linear value-function approximation in Computer Go: a Go position $s$ (roughly $10^{35}$ states) is mapped to a sparse binary feature vector $\phi_s$ (roughly $10^5$ binary features and parameters); the estimated value is the inner product $\theta^\top\phi_s$ with the parameter vector $\theta$. [Figure: an example position, its feature vector, and a parameter vector; the active features select parameter values, e.g. $-2 + 5 = 3$, the estimated value.]

Notation:
state transitions: $s \xrightarrow{\,r\,} s'$
feature vectors: $\phi, \phi' \in \mathbb{R}^n$, with $n \ll$ #states
approximate values: $V_\theta(s) = \theta^\top\phi$, with parameter vector $\theta \in \mathbb{R}^n$
TD error: $\delta = r + \gamma\theta^\top\phi' - \theta^\top\phi$, with $\gamma \in [0, 1)$
TD(0) algorithm: $\Delta\theta = \alpha\delta\phi$, with $\alpha > 0$
true values: $V(s) = \mathbb{E}[r \mid s] + \gamma\sum_{s'} P_{ss'} V(s')$
Bellman operator (over per-state value vectors): $TV = R + \gamma P V$, with $V = TV$
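In this notation, one conventional linear TD(0) step is a one-liner; the following sketch is my own (the function name and signature are illustrative, not from the talk):

```python
import numpy as np

def td0_update(theta, phi, r, phi_next, gamma, alpha):
    """Conventional linear TD(0): Delta-theta = alpha * delta * phi."""
    delta = r + gamma * (theta @ phi_next) - theta @ phi   # TD error
    return theta + alpha * delta * phi
```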

Value function geometry: [Figure: the plane is the space spanned by the feature vectors, weighted by the state-visitation distribution $D$; $V_\theta$ lies in the plane; the Bellman operator $T$ takes you outside the space (to $TV_\theta$), and the projection $\Pi$ projects you back into it (to $\Pi T V_\theta$); the distance from $V_\theta$ to $TV_\theta$ is the RMSBE, and the distance from $V_\theta$ to $\Pi T V_\theta$ is the RMSPBE.] Previous work on gradient methods for TD minimized the Bellman-error objective (Baird 1995, 1999). A better objective function: the Mean Square Projected Bellman Error (MSPBE).

TD objective functions (to be minimized):
Error from the true values: $\|V_\theta - V\|^2_D$ (not TD)
Error in the Bellman equation (Bellman residual): $\|V_\theta - TV_\theta\|^2_D$ (not right)
Error in the Bellman equation after projection (MSPBE): $\|V_\theta - \Pi T V_\theta\|^2_D$ (right!)
Zero expected TD update: $V_\theta = \Pi T V_\theta \;\Leftrightarrow\; \mathbb{E}[\Delta\theta_{\mathrm{TD}}] = 0$ (not an objective)
Norm of the expected TD update: $\|\mathbb{E}[\Delta\theta_{\mathrm{TD}}]\|^2$
Expected squared TD error: $\mathbb{E}[\delta^2]$
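A useful identity from the paper rewrites the projected objective in expectation form (here $D$ is the diagonal matrix of state-visitation probabilities and the expectations are over transitions drawn from that distribution):

$$ J(\theta) \;=\; \|V_\theta - \Pi T V_\theta\|^2_D \;=\; \mathbb{E}[\delta\phi]^\top\, \mathbb{E}[\phi\phi^\top]^{-1}\, \mathbb{E}[\delta\phi] $$

This is the form that gets differentiated in the derivation of TDC below.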

backwards-bootstrapping example: the two A states look the same; they share a single feature and so must be given the same approximate value, $V(A1) = V(A2) = \tfrac{1}{2}$. A1 leads to B, which terminates with reward 1; A2 leads to C, which terminates with reward 0. All transitions are deterministic, so the Bellman error equals the TD error. Clearly the right solution is $V(B) = 1$, $V(C) = 0$. But with function approximation, the solution that minimizes the Bellman error is $V(B) = \tfrac{3}{4}$, $V(C) = \tfrac{1}{4}$.
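A quick numeric check of that claim (my own sketch, not part of the talk), assuming $\gamma = 1$, reward 1 on the transition out of B and reward 0 out of C: treat $V(B)$, $V(C)$, and the shared value $v_A$ as free parameters and minimize the sum of squared Bellman residuals, one per transition, by linear least squares.

```python
import numpy as np

# Unknowns x = [v_A, v_B, v_C]; one Bellman residual per deterministic transition.
A = np.array([
    [-1.0,  1.0,  0.0],   # A1 -> B : residual V(B) - v_A
    [-1.0,  0.0,  1.0],   # A2 -> C : residual V(C) - v_A
    [ 0.0, -1.0,  0.0],   # B -> end: residual 1 - V(B)  (the 1 moves to the right-hand side)
    [ 0.0,  0.0, -1.0],   # C -> end: residual 0 - V(C)
])
b = np.array([0.0, 0.0, -1.0, 0.0])
v_A, v_B, v_C = np.linalg.lstsq(A, b, rcond=None)[0]
print(v_A, v_B, v_C)      # prints 0.5 0.75 0.25, not the intuitive V(B)=1, V(C)=0
```

The shared feature drags $V(B)$ and $V(C)$ toward each other under the Bellman-residual objective, whereas the TD fixed point (and the MSPBE) gives $V(B) = 1$, $V(C) = 0$ here.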

Returning to the list of objective functions: the norm of the expected TD update, $\|\mathbb{E}[\Delta\theta_{\mathrm{TD}}]\|^2$, is the objective minimized in previous work (the original GTD algorithm); the expected squared TD error, $\mathbb{E}[\delta^2]$, is not right either: minimizing it gives the residual-gradient algorithm.

outline: ways in which TD with FA has not been straightforward; the new Bellman error objective function; derivation of new algorithms (the trick); results (theory and experiments)

Gradient-descent learning:
1. Pick an objective function $J(\theta)$, a parameterized function to be minimized.
2. Use calculus to analytically compute the gradient $\nabla_\theta J(\theta)$.
3. Find a sample gradient that you can sample on every time step and whose expected value equals the gradient.
4. Take small steps in $\theta$ proportional to the sample gradient: $\Delta\theta = -\alpha \nabla_\theta J_t(\theta)$.
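As one concrete instance of this recipe (my own illustration, not from the talk), take the "error from the true values" objective from the table above and estimate the true value with a sampled Monte Carlo return $G$; the sample gradient is $-2(G - \theta^\top\phi)\phi$, giving the update below (with the constant folded into $\alpha$):

```python
import numpy as np

def gradient_mc_update(theta, phi, G, alpha):
    """One stochastic gradient-descent step toward the sampled return G,
    i.e. on the 'error from the true values' objective (illustrative sketch)."""
    return theta + alpha * (G - theta @ phi) * phi
```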

Derivation of the TDC algorithm (for a transition $s \xrightarrow{\,r\,} s'$ with feature vectors $\phi$ and $\phi'$):
$\Delta\theta = -\tfrac{1}{2}\alpha \nabla_\theta J(\theta)$
$= -\tfrac{1}{2}\alpha \nabla_\theta \|V_\theta - \Pi T V_\theta\|^2_D$
$= -\tfrac{1}{2}\alpha \nabla_\theta \big( \mathbb{E}[\delta\phi]^\top \mathbb{E}[\phi\phi^\top]^{-1} \mathbb{E}[\delta\phi] \big)$
$= -\alpha \big( \nabla_\theta \mathbb{E}[\delta\phi] \big)^\top \mathbb{E}[\phi\phi^\top]^{-1} \mathbb{E}[\delta\phi]$
$= -\alpha\, \mathbb{E}\big[ \nabla_\theta\big(r + \gamma\theta^\top\phi' - \theta^\top\phi\big)\, \phi^\top \big]\, \mathbb{E}[\phi\phi^\top]^{-1} \mathbb{E}[\delta\phi]$
$= -\alpha\, \mathbb{E}\big[ (\gamma\phi' - \phi)\,\phi^\top \big]\, \mathbb{E}[\phi\phi^\top]^{-1} \mathbb{E}[\delta\phi]$
$= \alpha \big( \mathbb{E}[\phi\phi^\top] - \gamma \mathbb{E}[\phi'\phi^\top] \big)\, \mathbb{E}[\phi\phi^\top]^{-1} \mathbb{E}[\delta\phi]$
$= \alpha\, \mathbb{E}[\delta\phi] - \alpha\gamma\, \mathbb{E}[\phi'\phi^\top]\, \mathbb{E}[\phi\phi^\top]^{-1} \mathbb{E}[\delta\phi]$
$\approx \alpha\, \mathbb{E}[\delta\phi] - \alpha\gamma\, \mathbb{E}[\phi'\phi^\top]\, w$
$\approx \alpha\delta\phi - \alpha\gamma\phi'(\phi^\top w)$   (sampling)
This is the main trick! $w \in \mathbb{R}^n$ is a second set of weights, estimating $\mathbb{E}[\phi\phi^\top]^{-1}\mathbb{E}[\delta\phi]$.

The complete TD with gradient correction (TDC) algorithm: on each transition $s \xrightarrow{\,r\,} s'$ with feature vectors $\phi$ and $\phi'$, update two parameters:
$\theta \leftarrow \theta + \alpha\delta\phi - \alpha\gamma\phi'(\phi^\top w)$   (the TD(0) update plus a gradient correction)
$w \leftarrow w + \beta(\delta - \phi^\top w)\phi$
where $\delta = r + \gamma\theta^\top\phi' - \theta^\top\phi$, and $\phi^\top w$ is an estimate of the TD error ($\delta$) for the current state.
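In code, the two TDC updates amount to the following (a minimal sketch under the same notation; the function name, signature, and the surrounding environment loop are my own assumptions):

```python
import numpy as np

def tdc_update(theta, w, phi, r, phi_next, gamma, alpha, beta):
    """One TDC update for linear value-function approximation,
    following the two update rules on the slide above."""
    delta = r + gamma * (theta @ phi_next) - theta @ phi               # TD error
    theta = theta + alpha * delta * phi - alpha * gamma * phi_next * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi                             # second weight vector
    return theta, w
```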

outline: ways in which TD with FA has not been straightforward; the new Bellman error objective function; derivation of new algorithms (the trick); results (theory and experiments)

Three new algorithms: GTD, the original gradient-TD algorithm (Sutton, Szepesvari & Maei, 2008); GTD2, a second-generation GTD; and TDC.
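For comparison, here is my own rendering of the per-transition updates of the other two algorithms as given in the paper (GTD's second weight vector estimates $\mathbb{E}[\delta\phi]$; GTD2's estimates the same quantity as TDC's, $\mathbb{E}[\phi\phi^\top]^{-1}\mathbb{E}[\delta\phi]$):

```python
import numpy as np

def gtd_update(theta, u, phi, r, phi_next, gamma, alpha, beta):
    """Original GTD (2008): minimizes the norm of the expected TD update."""
    delta = r + gamma * (theta @ phi_next) - theta @ phi
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ u)
    u = u + beta * (delta * phi - u)           # u estimates E[delta * phi]
    return theta, u

def gtd2_update(theta, w, phi, r, phi_next, gamma, alpha, beta):
    """GTD2: minimizes the MSPBE, like TDC, but with a different factoring."""
    delta = r + gamma * (theta @ phi_next) - theta @ phi
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi     # same second-weight update as TDC
    return theta, w
```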

Convergence theorems: for arbitrary on- or off-policy training, all three algorithms converge with probability 1 to the TD fixed point. GTD and GTD2 converge on a single time scale (α = β); TDC converges in a two-time-scale sense (α, β → 0 with α/β → 0, so the second set of weights is updated on the faster time scale).

Summary of empirical results on small problems: [Figure: RMSPBE learning curves and step-size (α) sensitivity for TD, GTD, GTD2, and TDC on random-walk problems with tabular, inverted, and dependent features, and on the Boyan chain.] Usually GTD << GTD2 < TD, TDC.

Computer Go experiment: learn a linear value function (probability of winning) for 9x9 Go from self-play; one million features, each corresponding to a template on a part of the Go board; an established experimental testbed. [Figure: RNEU, based on $\|\mathbb{E}[\Delta\theta_{\mathrm{TD}}]\|$, as a function of step size α for TD, GTD, GTD2, and TDC.]

conclusions: the new algorithms are roughly as efficient as conventional TD on on-policy problems, but are guaranteed convergent under general off-policy training as well; their key ideas appear to extend quite broadly: to control, general λ, nonlinear settings, DP, intra-option learning, TD nets...; TD with FA is now straightforward; the curse of dimensionality is removed.