Bias-Variance Error Bounds for Temporal Difference Updates


Michael Kearns, AT&T Labs, mkearns@research.att.com
Satinder Singh, AT&T Labs, baveja@research.att.com

Abstract

We give the first rigorous upper bounds on the error of temporal difference (TD) algorithms for policy evaluation as a function of the amount of experience. These upper bounds prove exponentially fast convergence, with both the rate of convergence and the asymptote strongly dependent on the length of the backups k or the parameter λ. Our bounds give formal verification to the long-standing intuition that TD methods are subject to a bias-variance trade-off, and they lead to schedules for k and λ that are predicted to be better than any fixed values for these parameters. We give preliminary experimental confirmation of our theory for a version of the random walk problem.

1 Introduction

In the policy evaluation problem, we must predict the expected discounted return (or value) for a fixed policy π, given only the ability to generate experience in an unknown Markov decision process (MDP) M. A family of well-studied temporal difference (TD) algorithms [3] has been developed for this problem; these algorithms make use of repeated trajectories under π from the state(s) of interest, and perform iterative updates to the value function. The main difference between the TD variants lies in how far they look ahead in the trajectories. The TD(k) family of algorithms uses the first k rewards and the (current) value prediction at the (k+1)st state reached in making its update. The more commonly used TD(λ) family of algorithms uses exponentially weighted sums of TD(k) updates (with decay parameter λ). The smaller the value of k or λ, the less the algorithm depends on the actual rewards received in the trajectory, and the more it depends on the current predictions for the value function. Conversely, the larger the value of k or λ, the more the algorithm depends on the actual rewards obtained, with the current value function playing a lessened role. The extreme cases of TD(k = ∞) and TD(λ = 1) become the Monte Carlo algorithm, which updates each prediction to be the average of the discounted returns in the trajectories.

A long-standing question is whether it is better to use large or small values of the parameters k and λ. Watkins [5] informally discusses the trade-off that this decision gives rise to: larger values of the TD parameters suffer larger variance in the updates (since more stochastic reward terms appear), but also enjoy lower bias (since the error in the current value function predictions has less influence). This argument has largely remained an intuition. However, some conclusions arising from this intuition (for instance, that intermediate values of k and λ often yield the best performance in the short term) have been borne out experimentally [4, 2].

In this paper, we provide the first rigorous upper bounds on the error in the value functions of the TD algorithms as a function of the number of trajectories used. In other words, we give bounds on the learning curves of TD methods that hold for any MDP. These upper bounds decay exponentially fast, and are obtained by first deriving a one-step recurrence relating the errors before and after a TD update, and then iterating this recurrence for the desired number of steps. Of particular interest is the form of our bounds, since it formalizes the trade-off discussed above: the bounds consist of terms that are monotonically growing with k and λ (corresponding to the increased variance), and terms that are monotonically shrinking with k and λ (corresponding to the decreased influence of the current error).

Overall, our bounds provide the following contributions and predictions:

1. A formal theoretical explanation of the bias-variance trade-off in multi-step TD updates;
2. A proof of exponentially fast rates of convergence for any fixed k or λ;
3. A rigorous upper bound that predicts that larger values of k and λ lead to faster convergence, but to higher asymptotic error;
4. A formal explanation of the superiority of intermediate values of k and λ (U-shaped curves) for any fixed number of iterations;
5. The derivation of a decreasing schedule of k and λ that our bound predicts should beat any fixed value of these parameters.

Furthermore, we provide some preliminary experimental confirmation of our theory for the random walk problem. We note that some of the findings above were conjectured by Singh and Dayan [2] through analysis of specific MDPs.

2 Technical Preliminaries

Let M = (P, R) be an MDP consisting of the transition probabilities P(·|s, a) and the reward distributions R(·|s). For any policy π in M, and any start state s_0, a trajectory generated by π starting from s_0 is a random variable that is an infinite sequence of states and rewards:

τ = (s_0, r_0) → (s_1, r_1) → (s_2, r_2) → ···

Here each random reward r_i is distributed according to R(·|s_i), and each state s_{i+1} is distributed according to P(·|s_i, π(s_i)). For simplicity we will assume that the support of R(·|s_i) is [−1, +1]; however, all of our results easily generalize to the case of bounded variance.

We now recall the standard TD(k) (also known as k-step backup) and TD(λ) methods for updating an estimate of the value function. Given a trajectory τ generated by π from s_0, and given an estimate V̂(·) for the value function V^π(·), for any natural number k we define

TD(k, τ, V̂(·)) = (1 − α) V̂(s_0) + α [ r_0 + γ r_1 + ··· + γ^{k−1} r_{k−1} + γ^k V̂(s_k) ].

The TD(k) update based on τ is simply V̂(s_0) ← TD(k, τ, V̂(·)). It is implicit that the update is always applied to the estimate at the initial state of the trajectory, and we regard the discount factor γ and the learning rate α as being fixed. For any λ ∈ [0, 1], the TD(λ) update can now be expressed as an infinite linear combination of the TD(k) updates:

TD(λ, τ, V̂(·)) = (1 − λ) Σ_{k=1}^{∞} λ^{k−1} TD(k, τ, V̂(·)).

Given a sequence of trajectories τ_1, τ_2, τ_3, ..., we can simply apply either type of TD update sequentially. In either case, as k becomes large or λ approaches 1, the updates approach a Monte Carlo method, in which we use each trajectory τ_i entirely and ignore our current estimate V̂(·).
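As a concrete illustration of these definitions, the following Python sketch applies a single TD(k) or TD(λ) update at the start state of one trajectory. It is only a sketch: the representation of the trajectory as parallel lists of states and rewards is a choice made here, the trajectory is assumed to have been generated for at least k_max + 1 steps, and the infinite TD(λ) sum is truncated at k_max (the neglected weight is λ^{k_max}, negligible for λ < 1).

def td_k_update(V, states, rewards, k, gamma, alpha):
    """One TD(k) update at the trajectory's start state s_0 = states[0].

    Combines the first k discounted rewards with the current estimate at
    the k-th state reached, as in TD(k, tau, V) above.  V maps states to
    value estimates (e.g. a dict, or a list indexed by state).
    """
    target = sum(gamma ** j * rewards[j] for j in range(k))
    target += gamma ** k * V[states[k]]
    V[states[0]] = (1 - alpha) * V[states[0]] + alpha * target


def td_lambda_update(V, states, rewards, lam, gamma, alpha, k_max=200):
    """One TD(lambda) update at s_0: the (1 - lambda) * lambda^(k-1) weighted
    combination of the TD(k) targets, truncated at k_max terms."""
    target = 0.0
    discounted_rewards = 0.0   # running sum r_0 + gamma r_1 + ... + gamma^(k-1) r_{k-1}
    for k in range(1, k_max + 1):
        discounted_rewards += gamma ** (k - 1) * rewards[k - 1]
        k_step_target = discounted_rewards + gamma ** k * V[states[k]]
        target += (1 - lam) * lam ** (k - 1) * k_step_target
    V[states[0]] = (1 - alpha) * V[states[0]] + alpha * target

Setting lam = 0 recovers the one-step update TD(1), as the definition implies.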

As k becomes small or λ approaches 0, we rely heavily on the estimate V̂(·), and effectively use only a few steps of each τ_i. The common intuition is that early in the sequence of updates, the estimate V̂(·) is poor, and we are better off choosing k large or λ near 1. However, since the trajectories τ_i do obey the statistics of π, the value function estimates will eventually improve, at which point we may be better off bootstrapping by choosing small k or λ.

In order to provide a rigorous analysis of this intuition, we will study a framework which we call phased TD updates. This framework is intended to simplify the complexities of the moving average introduced by the learning rate. In each phase t, we are given n trajectories under π from every state s, where n is a parameter of the analysis. Thus, phase t consists of a set S(t) = {τ_s^i(t)}_{s,i}, where s ranges over all states, i ranges from 1 to n, and τ_s^i(t) is an independent random trajectory generated by π starting from state s. In phase t, phased TD averages all n of the trajectories in S(t) that start from state s to obtain its update of the value function estimate for s. In other words, the TD(k) updates become

V̂_{t+1}(s) ← (1/n) Σ_{i=1}^{n} [ r_0^i + γ r_1^i + ··· + γ^{k−1} r_{k−1}^i + γ^k V̂_t(s_k^i) ],

where the r_j^i are the rewards along trajectory τ_s^i(t), and s_k^i is the kth state reached along that trajectory. The TD(λ) updates become

V̂_{t+1}(s) ← (1/n) Σ_{i=1}^{n} (1 − λ) Σ_{k=1}^{∞} λ^{k−1} [ r_0^i + γ r_1^i + ··· + γ^{k−1} r_{k−1}^i + γ^k V̂_t(s_k^i) ].

Phased TD updates with a fixed value of n are analogous to standard TD updates with a constant learning rate [1]. In the ensuing sections, we provide a rigorous upper bound on the error in the value function estimates of phased TD updates as a function of the number of phases. This upper bound clearly captures the intuitions expressed above.
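A phase of phased TD(k) is thus just an average of n k-step backups at every state, all bootstrapping from the previous phase's estimate. A minimal sketch, assuming the phase's trajectories are supplied as a mapping from each start state to its n (states, rewards) pairs (the names and data layout are illustrative, not part of the framework):

def phased_td_k_phase(V, phase_trajectories, k, gamma):
    """One phase of phased TD(k).

    phase_trajectories[s] holds the n trajectories of S(t) that start at
    state s, each as a (states, rewards) pair with at least k+1 states.
    Every new estimate is the average of n k-step backups, all computed
    from the previous phase's estimate V (a synchronous update).
    """
    V_next = {}
    for s, trajectories in phase_trajectories.items():
        backups = []
        for states, rewards in trajectories:
            ret = sum(gamma ** j * rewards[j] for j in range(k))
            backups.append(ret + gamma ** k * V[states[k]])
        V_next[s] = sum(backups) / len(backups)
    return V_next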

3 Bounding the Error of TD Updates

Theorem 1 (Phased TD(k) Recurrence) Let S(t) be the set of trajectories generated by π in phase t (n trajectories from each state), let V̂_t(·) be the value function estimate of phased TD(k) after phase t, and let Δ_t = max_s {|V̂_t(s) − V^π(s)|}. Then for any 1 > δ > 0, with probability at least 1 − δ,

Δ_t ≤ [(1 − γ^k)/(1 − γ)] √(3 log(k/δ)/n) + γ^k Δ_{t−1}.    (1)

Here the error Δ_{t−1} after phase t − 1 is fixed, and the probability is taken over only the trajectories in S(t).

Proof: (Sketch) We begin by writing

V^π(s) = E[r_0 + γ r_1 + ··· + γ^{k−1} r_{k−1} + γ^k V^π(s_k)] = E[r_0] + γ E[r_1] + ··· + γ^{k−1} E[r_{k−1}] + γ^k E[V^π(s_k)].

Here the expectations are over a random trajectory under π; thus E[r_ℓ] (0 ≤ ℓ ≤ k − 1) denotes the expected value of the ℓth reward received, while E[V^π(s_k)] is the expected value of the true value function at the kth state reached. The phased TD(k) update sums the terms γ^ℓ (1/n) Σ_{i=1}^{n} r_ℓ^i, whose expectations are exactly the γ^ℓ E[r_ℓ] appearing above. By a standard large deviation analysis (omitted), the probability that any of these terms deviates by more than ε = √(3 log(k/δ)/n) from its expected value is at most δ. If no such deviation occurs, the total contribution of these terms to the error in the value function estimate is bounded by ε (1 − γ^k)/(1 − γ), giving rise to the variance term in our overall bound above. The remainder of the phased TD(k) update is simply γ^k (1/n) Σ_{i=1}^{n} V̂_{t−1}(s_k^i). But since γ^k |V̂_{t−1}(s_k^i) − V^π(s_k^i)| ≤ γ^k Δ_{t−1} by definition, the contribution to the error is at most γ^k Δ_{t−1}, which is the bias term of the bound. We note that a similar argument leads to bounds in expectation rather than the PAC-style bounds given here. □

Let us take a brief moment to analyze the qualitative behavior of Equation (1) as a function of k. For large values of k, the quantity γ^k becomes negligible, and the bound is approximately (1/(1 − γ)) √(3 log(k/δ)/n), giving almost all the weight to the error incurred by variance in the first k rewards, and negligible weight to the error in our current value function. At the other extreme, when k = 1 our reward variance contributes error only √(3 log(1/δ)/n), but the error in our current value function has weight γ. Thus, the first term increases with k, while the second term decreases with k, in a manner that formalizes the intuitive trade-off one faces when choosing between longer or shorter backups.

Equation (1) describes the effect of a single phase of TD(k) backups, but we can iterate this recurrence over many phases to derive an upper bound on the full learning curve for any value of k. Assuming that the recurrence holds for t consecutive steps, and assuming Δ_0 = 1 without loss of generality, solution of the recurrence (details omitted) yields

Δ_t ≤ [(1 − γ^{kt})/(1 − γ)] √(3 log(k/δ)/n) + γ^{kt}.    (2)

(Formally, we can apply Theorem 1 by choosing δ = δ′/(Nt), where N is the number of states in the MDP; then with probability at least 1 − δ′, the bound of Equation (1) will hold at every state for t consecutive steps.)

This bound makes a number of predictions about the effects of different values of k. First of all, as t approaches infinity, the bound on Δ_t approaches the value (1/(1 − γ)) √(3 log(k/δ)/n), which increases with k. Thus, the bound predicts that the asymptotic error of phased TD(k) updates is larger for larger k. On the other hand, the rate of convergence to this asymptote is γ^{kt}, which is always exponentially fast, but faster for larger k. Thus, in choosing a fixed value of k, we must choose between either rapid convergence to a worse asymptote, or slower convergence to a better asymptote. This prediction is illustrated graphically in Figure 1(a), where with all of the parameters besides k and t fixed (namely γ, δ, and n), we have plotted the bound of Equation (2) as a function of t for several different choices of k.

Note that while the plots of Figure 1(a) were obtained by choosing fixed values of k and iterating the recurrence of Equation (1), at each phase t we can instead use Equation (1) to choose the value of k that maximizes the predicted decrease in error from Δ_t to Δ_{t+1}. In other words, the recurrence immediately yields a schedule for k, along with an upper bound on the learning curve for this schedule that outperforms the upper bound on the learning curve for any fixed value of k. The learning curve for the schedule is also shown in Figure 1(a), and Figure 1(b) plots the schedule itself.

Another interesting set of plots is obtained by fixing the number of phases t, and computing for each k the error after t phases of TD(k) updates that is predicted by Equation (2). Such plots are given in Figure 1(c), and they clearly predict a unique minimum, that is, an optimal value of k for each fixed t (this can also be verified analytically from Equation (2)). For moderate values of t, values of k that are too small suffer from their overemphasis on a still-inaccurate value function approximation, while values of k that are too large suffer from their refusal to bootstrap. Of course, as t increases, the optimal value of k decreases, since small values of k have time to reach their superior asymptotes.
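The curves of Figure 1 and the schedule for k can be computed directly by iterating the recurrence of Equation (1). The following sketch does this using the parameter values of Figure 1; restricting the schedule to a finite menu of candidate k values is a simplification made here, since the theory allows any k at each phase.

import math

def td_k_bound_step(err_prev, k, gamma=0.9, n=3000, delta=0.1):
    """Right-hand side of the Equation (1) recurrence for phased TD(k)."""
    eps = math.sqrt(3 * math.log(k / delta) / n)
    return (1 - gamma ** k) / (1 - gamma) * eps + gamma ** k * err_prev

def fixed_k_curve(k, phases, err0=1.0, **kw):
    """Upper bound on Delta_t for t = 1..phases with a fixed k (Figure 1(a))."""
    curve, err = [], err0
    for _ in range(phases):
        err = td_k_bound_step(err, k, **kw)
        curve.append(err)
    return curve

def scheduled_k_curve(candidate_ks, phases, err0=1.0, **kw):
    """Greedy schedule: at each phase use the k whose predicted next error is
    smallest, i.e. the curve with the greatest one-step decrease (Figure 1(b))."""
    curve, schedule, err = [], [], err0
    for _ in range(phases):
        k_best = min(candidate_ks, key=lambda k: td_k_bound_step(err, k, **kw))
        err = td_k_bound_step(err, k_best, **kw)
        curve.append(err)
        schedule.append(k_best)
    return curve, schedule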

Figure 1: (a) Upper bounds on the learning curves of phased TD(k) for several values of k, as a function of the number of phases t (parameters n = 3000, γ = 0.9, δ = 0.1). Note that larger values of k lead to more rapid convergence, but to higher asymptotic errors. Both the theory and the curves suggest a (decreasing) schedule for k, intuitively obtained by always jumping to the learning curve that enjoys the greatest one-step decrease from the current error. This schedule can be efficiently computed from the analytical upper bounds, and leads to the best (lowest) of the learning curves plotted, which is significantly better than that for any fixed k. (b) The schedule for k derived from the theory, as a function of the number of phases. (c) For several values of the number of phases t, the upper bound on Δ_t for TD(k) as a function of k. These curves show the predicted trade-off, with a unique optimal value of k identified until t is sufficiently large to permit 1-step backups to converge to their optimal asymptotes.

We now go on to provide a similar analysis for the TD(λ) family of updates, beginning with the analogue to Theorem 1.

Theorem 2 (Phased TD(λ) Recurrence) Let S(t) be the set of trajectories generated by π in phase t (n trajectories from each state), let V̂_t(·) be the value function estimate of phased TD(λ) after phase t, and let Δ_t = max_s {|V̂_t(s) − V^π(s)|}. Then for any 1 > δ > 0, with probability at least 1 − δ,

Δ_t ≤ min_k { [(1 − (γλ)^k)/(1 − γλ)] √(3 log(k/δ)/n) + (γλ)^k/(1 − γ) } + [γ(1 − λ)/(1 − γλ)] Δ_{t−1}.    (3)

Here the error Δ_{t−1} after phase t − 1 is fixed, and the probability is taken over only the trajectories in S(t).

We omit the proof of this theorem, but it roughly follows that of Theorem 1. That proof exploited the fact that in TD(k) updates, we need only apply large deviation bounds to the rewards of a finite number (k) of averaged trajectory steps. In TD(λ), all of the rewards contribute to the update. However, we can always choose to bound the deviations of the first k steps, for any value of k, and assume maximum variance for the remainder (whose weight diminishes rapidly as we increase k). This logic is the source of the min_k{·} term of the bound. One can view Equation (3) as a variational upper bound, in the sense that it provides a family of upper bounds, one for each k, and then minimizes over the variational parameter k. The reader can verify that the terms appearing in Equation (3) exhibit a trade-off as a function of λ analogous to that exhibited by Equation (1) as a function of k.

In the interest of brevity, we move directly to the TD(λ) analogue of Equation (2). It will be notationally convenient to define k* = argmin_k {f(k)}, where f(k) is the function appearing inside the min_k{·} in Equation (3) (here we regard all parameters other than k as fixed). It can be shown that for Δ_0 = 1, repeated iteration of Equation (3) yields the t-phase inequality

Δ_t ≤ a (1 − b^t)/(1 − b) + b^t,    (4)

where

a = [(1 − (γλ)^{k*})/(1 − γλ)] √(3 log(k*/δ)/n) + (γλ)^{k*}/(1 − γ),    b = γ(1 − λ)/(1 − γλ).
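The analogous computation for TD(λ) evaluates the right-hand side of Equation (3), minimizing over the variational parameter k by direct search; iterating it from Δ_0 = 1 (or applying the closed form of Equation (4)) yields the curves of Figure 2. A sketch under the same parameter assumptions as before, with the search truncated at an arbitrary k_max:

import math

def td_lambda_bound_step(err_prev, lam, gamma=0.9, n=3000, delta=0.1, k_max=500):
    """Right-hand side of the Equation (3) recurrence for phased TD(lambda).

    The variance term is minimized over the variational parameter k by a
    direct search over 1..k_max; the bias term multiplies err_prev.
    """
    def variance_bound(k):
        eps = math.sqrt(3 * math.log(k / delta) / n)
        head = (1 - (gamma * lam) ** k) / (1 - gamma * lam) * eps  # first k rewards
        tail = (gamma * lam) ** k / (1 - gamma)                    # remaining rewards
        return head + tail
    a_term = min(variance_bound(k) for k in range(1, k_max + 1))
    b = gamma * (1 - lam) / (1 - gamma * lam)
    return a_term + b * err_prev

A greedy schedule for λ is obtained exactly as for k: at each phase, choose the λ whose predicted next error is smallest.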

Figure 2: (a) Upper bounds on the learning curves of phased TD(λ) for several values of λ, as a function of the number of phases (parameters n = 3000, γ = 0.9, δ = 0.1). The predictions are analogous to those for TD(k) in Figure 1, and we have again plotted the predicted best learning curve, obtained via a decreasing schedule of λ. (b) For several values of the number of phases t, the upper bound on Δ_t for TD(λ) as a function of λ.

While Equation (4) may be more difficult to parse than its TD(k) counterpart, the basic predictions and intuitions remain intact. As t approaches infinity, the bound on Δ_t asymptotes at a/(1 − b), and the rate of approach to this asymptote is simply b^t, which is again exponentially fast. Analysis of the derivative of b with respect to λ confirms that for all γ < 1, b is a decreasing function of λ; that is, the larger the λ, the faster the convergence. Analytically verifying that the asymptote a/(1 − b) increases with λ is more difficult due to the presence of k*, which involves a minimization operation. However, the learning curve plots of Figure 2(a) clearly show the predicted phenomenon: increasing λ yields faster convergence to a worse asymptote. As we did in the TD(k) case, we use our recurrence to derive a schedule for λ; Figure 2(a) also shows the predicted improvement in the learning curve from using such a schedule. Finally, Figure 2(b) again shows the non-monotonic predicted error as a function of λ for a fixed number of phases.

4 Some Experimental Confirmation

In order to test the various predictions made by our theory, we have performed a number of experiments using phased TD(k) on a version of the so-called random walk problem [4]. In this problem, we have a Markov process with 50 states arranged in a ring. At each step, there is probability 0.05 that we remain in our current state, and probability 0.95 that we advance one state clockwise around the ring. (Note that since we are only concerned with the evaluation of a fixed policy, we have simply defined a Markov process rather than a Markov decision process.) Two adjacent states on the ring have reward +1 and −1, respectively, while the remaining states have reward 0. The standard random walk problem has a chain of states, with an absorbing state at each end; here we chose a ring structure simply to avoid asymmetries in the states induced by the absorbing states.

To test the theory, we ran a series of simulations computing the phased TD(k) estimate of the value function in this Markov process. For several different values of k, we computed the error Δ_t in the value function estimate as a function of the number of phases t. (Δ_t is easily computed, since we can compute the true value function for this simple problem.) The resulting plot in Figure 3(a) is the experimental analogue of the theoretical predictions in Figure 1(a). We see that these predictions are qualitatively confirmed: larger k leads to faster convergence to an inferior asymptote.
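These simulations are straightforward to reproduce. The sketch below builds the ring process and runs phased TD(k) on it once, using the parameters stated above and in Figure 3 (50 states, stay probability 0.05, rewards +1 and −1 on two adjacent states, n = 40, γ = 0.98); the placement of the rewarding states, the zero initialization of V̂, and the random seed are arbitrary choices here, and Figure 3 averages many such runs.

import numpy as np

def make_ring(num_states=50, p_stay=0.05, gamma=0.98):
    """Ring Markov process: with prob. p_stay remain in place, otherwise
    advance one state clockwise.  Two adjacent states carry rewards +1 and
    -1 (placed at states 0 and 1 here); all other rewards are 0."""
    P = np.zeros((num_states, num_states))
    for s in range(num_states):
        P[s, s] = p_stay
        P[s, (s + 1) % num_states] = 1 - p_stay
    r = np.zeros(num_states)
    r[0], r[1] = 1.0, -1.0
    V_true = np.linalg.solve(np.eye(num_states) - gamma * P, r)   # exact values
    return P, r, V_true

def phased_td_k_run(P, r, V_true, k, phases, n=40, gamma=0.98, seed=0):
    """One run of phased TD(k): each phase draws n length-(k+1) trajectories
    from every state, averages the k-step backups, and records Delta_t."""
    rng = np.random.default_rng(seed)
    num_states = len(r)
    V = np.zeros(num_states)
    errors = []
    for _ in range(phases):
        V_next = np.empty(num_states)
        for s in range(num_states):
            backups = []
            for _ in range(n):
                state, ret = s, 0.0
                for j in range(k):
                    ret += gamma ** j * r[state]
                    state = rng.choice(num_states, p=P[state])
                backups.append(ret + gamma ** k * V[state])
            V_next[s] = np.mean(backups)
        V = V_next
        errors.append(np.max(np.abs(V - V_true)))
    return errors

# Example: one (unaveraged) learning curve for k = 5
# P, r, V_true = make_ring()
# print(phased_td_k_run(P, r, V_true, k=5, phases=30))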

Figure 3: (a) Empirical learning curves for TD(k) for several values of k on the random walk problem (parameters n = 40 and γ = 0.98). Each plot is averaged over 5000 runs of TD(k). Also shown is the learning curve (averaged over 5000 runs) for the empirical schedule computed from the TD(k) learning curves, which is better than any of these curves. (b) The empirical schedule.

Given these empirical learning curves, we can then compute the empirical schedule that they suggest. Namely, to determine experimentally a schedule for k that should outperform (at least) the values of k we tested in Figure 3(a), we used the empirical learning curves to determine, for any given value of Δ, which of the empirical curves enjoyed the greatest one-step decrease in error when its current error was (approximately) Δ. This is simply the empirical counterpart of the schedule computation suggested by the theory described above. The resulting experimental learning curve for this schedule is also shown in Figure 3(a), and the schedule itself in Figure 3(b). We see that there are significant improvements in the learning curve from using the schedule, and that the form of the schedule is qualitatively similar to the theoretical schedule of Figure 1(b).

5 Conclusion

We have given the first provable upper bounds on the error of TD methods for policy evaluation. These upper bounds have exponential rates of convergence, and clearly articulate the bias-variance trade-off that such methods obey.

References

[1] M. Kearns and S. Singh. Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms. NIPS, 1998.

[2] S. Singh and P. Dayan. Analytical Mean Squared Error Curves for Temporal Difference Learning. Machine Learning, 1998.

[3] R. S. Sutton. Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3:9-44, 1988.

[4] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[5] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, England, 1989.