Bias-Variance Bounds for Temporal Difference Updates

Michael Kearns, AT&T Labs (mkearns@research.att.com)
Satinder Singh, AT&T Labs (baveja@research.att.com)

Abstract

We give the first rigorous upper bounds on the error of temporal difference (TD) algorithms for policy evaluation as a function of the amount of experience. These upper bounds prove exponentially fast convergence, with both the rate of convergence and the asymptote strongly dependent on the length of the backups k or the parameter λ. Our bounds give formal verification to the long-standing intuition that TD methods are subject to a bias-variance trade-off, and they lead to schedules for k and λ that are predicted to be better than any fixed values for these parameters. We give preliminary experimental confirmation of our theory for a version of the random walk problem.

1 Introduction

In the policy evaluation problem, we must predict the expected discounted return (or value) for a fixed policy π, given only the ability to generate experience in an unknown Markov decision process (MDP) M. A family of well-studied temporal difference (or TD) [3] algorithms have been developed for this problem that make use of repeated trajectories under π from the state(s) of interest, and perform iterative updates to the value function. The main difference between the TD variants lies in how far they look ahead in the trajectories. The TD(k) family of algorithms use the first k rewards and the (current) value prediction at the (k+1)st state reached in making their update. The more commonly used TD(λ) family of algorithms use exponentially weighted sums of TD(k) updates (with decay parameter λ). The smaller the value for k or λ, the less the algorithm depends on the actual rewards received in the trajectory, and the more it depends on the current predictions for the value function. Conversely, the larger the value for k or λ, the more the algorithm depends on the actual rewards obtained, with the current value function playing a lessened role. The extreme cases of TD(k = ∞) and TD(λ = 1) become the Monte Carlo algorithm, which updates each prediction to be the average of the discounted returns in the trajectories.
A long-standing question is whether it is better to use large or small values of the parameters k and λ. Watkins [5] informally discusses the trade-off that this decision gives rise to: larger values for the TD parameters suffer larger variance in the updates (since more stochastic reward terms appear), but also enjoy lower bias (since the errors in the current value function predictions have less influence). This argument has largely remained an intuition. However, some conclusions arising from this intuition (for instance, that intermediate values of k and λ often yield the best performance in the short term) have been borne out experimentally [4, 2]. In this paper, we provide the first rigorous upper bounds on the error in the value functions of
the TD algorithms as a function of the number of trajectories used. In other words, we give bounds on the learning curves of TD methods that hold for any MDP. These upper bounds decay exponentially fast, and are obtained by first deriving a one-step recurrence relating the errors before and after a TD update, and then iterating this recurrence for the desired number of steps. Of particular interest is the form of our bounds, since it formalizes the trade-off discussed above: the bounds consist of terms that are monotonically growing with k and λ (corresponding to the increased variance), and terms that are monotonically shrinking with k and λ (corresponding to the decreased influence of the current error). Overall, our bounds provide the following contributions and predictions:

1. A formal theoretical explanation of the bias-variance trade-off in multi-step TD updates;
2. A proof of exponentially fast rates of convergence for any fixed k or λ;
3. A rigorous upper bound that predicts that larger values of k and λ lead to faster convergence, but to higher asymptotic error;
4. Formal explanation of the superiority of intermediate values of k and λ (U-shaped curves) for any fixed number of iterations;
5. Derivation of a decreasing schedule of k and λ that our bound predicts should beat any fixed value of these parameters.

Furthermore, we provide some preliminary experimental confirmation of our theory for the random walk problem. We note that some of the findings above were conjectured by Singh and Dayan [2] through analysis of specific MDPs.

2 Technical Preliminaries

Let M = (P, R) be an MDP consisting of the transition probabilities P(·|s, a) and the reward distributions R(·|s). For any policy π in M, and any start state s_0, a trajectory generated by π starting from s_0 is a random variable τ that is an infinite sequence of states and rewards: τ = (s_0, r_0) → (s_1, r_1) → (s_2, r_2) → ... Here each random reward r_i is distributed according to R(·|s_i), and each state s_{i+1} is distributed according to P(·|s_i, π(s_i)). For simplicity we will assume that the support of R(·|s_i) is [-1, +1].
However, all of our results easily generalize to the case of bounded variance. We now recall the standard TD(k) (also known as k-step backup) and TD(λ) methods for updating an estimate of the value function. Given a trajectory τ generated by π from s_0, and given an estimate V̂(·) for the value function V^π(·), for any natural number k we define

TD(k, τ, V̂(·)) = (1 - α) V̂(s_0) + α [r_0 + γ r_1 + ... + γ^{k-1} r_{k-1} + γ^k V̂(s_k)].

The TD(k) update based on τ is simply V̂(s_0) ← TD(k, τ, V̂(·)). It is implicit that the update is always applied to the estimate at the initial state of the trajectory, and we regard the discount factor γ and the learning rate α as being fixed. For any λ ∈ [0, 1], the TD(λ) update can now be easily expressed as an infinite linear combination of the TD(k) updates:

TD(λ, τ, V̂(·)) = Σ_{k=1}^∞ (1 - λ) λ^{k-1} TD(k, τ, V̂(·)).

Given a sequence of trajectories τ_1, τ_2, τ_3, ..., we can simply apply either type of TD update sequentially. In either case, as either k becomes large or λ approaches 1, the updates approach a Monte Carlo method, in which we use each trajectory τ_i entirely, and ignore our current estimate V̂(·). As k becomes small or λ approaches 0, we rely heavily on the estimate V̂(·), and
effectively use only a few steps of each τ_i. The common intuition is that early in the sequence of updates, the estimate V̂(·) is poor, and we are better off choosing k large or λ near 1. However, since the trajectories τ_i do obey the statistics of π, the value function estimates will eventually improve, at which point we may be better off bootstrapping by choosing small k or λ.

In order to provide a rigorous analysis of this intuition, we will study a framework which we call phased TD updates. This framework is intended to simplify the complexities of the moving average introduced by the learning rate α. In each phase t, we are given n trajectories under π from every state s, where n is a parameter of the analysis. Thus, phase t consists of a set S(t) = {τ_s^i(t)}, where s ranges over all states, i ranges from 1 to n, and τ_s^i(t) is an independent random trajectory generated by π starting from state s. In phase t, phased TD averages all n of the trajectories in S(t) that start from state s to obtain its update of the value function estimate for s. In other words, the TD(k) updates become

V̂_{t+1}(s) ← (1/n) Σ_{i=1}^n [r_0^i + γ r_1^i + ... + γ^{k-1} r_{k-1}^i + γ^k V̂_t(s_k^i)]

where the r_j^i are the rewards along trajectory τ_s^i(t), and s_k^i is the kth state reached along that trajectory. The TD(λ) updates become

V̂_{t+1}(s) ← (1/n) Σ_{i=1}^n Σ_{k=1}^∞ (1 - λ) λ^{k-1} [r_0^i + γ r_1^i + ... + γ^{k-1} r_{k-1}^i + γ^k V̂_t(s_k^i)].

Phased TD updates with a fixed value of n are analogous to standard TD updates with a constant learning rate α [1]. In the ensuing sections, we provide a rigorous upper bound on the error in the value function estimates of phased TD updates as a function of the number of phases. This upper bound clearly captures the intuitions expressed above.

3 Bounding the Error of TD Updates

Theorem 1 (Phased TD(k) Recurrence) Let S(t) be the set of trajectories generated by π in phase t (n trajectories from each state), let V̂_t(·) be the value function estimate of phased TD(k) after phase t, and let Δ_t = max_s {|V̂_t(s) - V^π(s)|}. Then for any 1 > δ > 0, with probability at least 1 - δ,

Δ_{t+1} ≤ ((1 - γ^k)/(1 - γ)) sqrt(3 log(k/δ)/n) + γ^k Δ_t.   (1)

Here the error Δ_t after phase t is fixed, and the probability is taken over only the trajectories in S(t).
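In code, one phase of the phased TD(k) update defined above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the trajectory representation (a list of (state, reward) pairs) and all names are our own choices.

```python
def phased_td_k_update(V, trajectories, k, gamma):
    """One phase of phased TD(k): for each state s, average the k-step
    returns of the n trajectories starting at s, bootstrapping with the
    current estimate V at the k-th state reached.

    V is a list of current value estimates, indexed by state;
    trajectories[s] is a list of n trajectories, each a list of
    (state, reward) pairs at least k+1 steps long.
    """
    V_next = [0.0] * len(V)
    for s, trajs in enumerate(trajectories):
        total = 0.0
        for traj in trajs:
            # discounted sum of the first k rewards along the trajectory...
            ret = sum(gamma ** j * traj[j][1] for j in range(k))
            # ...plus the bootstrap term gamma^k * V_t(s_k)
            ret += gamma ** k * V[traj[k][0]]
            total += ret
        V_next[s] = total / len(trajs)
    return V_next
```

Iterating this function, with fresh trajectories drawn in each phase, produces the value function sequence V̂_1, V̂_2, ... that the recurrences analyze.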
Proof (Sketch): We begin by writing

V^π(s) = E[r_0 + γ r_1 + ... + γ^{k-1} r_{k-1} + γ^k V^π(s_k)]
       = E[r_0] + γ E[r_1] + ... + γ^{k-1} E[r_{k-1}] + γ^k E[V^π(s_k)].

Here the expectations are over a random trajectory under π; thus E[r_ℓ] (0 ≤ ℓ ≤ k-1) denotes the expected value of the ℓth reward received, while E[V^π(s_k)] is the expected value of the true value function at the kth state reached. The phased TD(k) update sums the terms γ^ℓ (1/n) Σ_{i=1}^n r_ℓ^i, whose expectations are exactly the γ^ℓ E[r_ℓ] appearing above. By a standard large deviation analysis (omitted), the probability that any of the reward averages (1/n) Σ_{i=1}^n r_ℓ^i deviates by more than ε = sqrt(3 log(k/δ)/n) from its expected value is at most δ. If no such deviation occurs, the total contribution to the error in the value function estimate is bounded
by ε (1 - γ^k)/(1 - γ), giving rise to the variance term in our overall bound above. The remainder of the phased TD(k) update is simply γ^k (1/n) Σ_{i=1}^n V̂_t(s_k^i). But since |V̂_t(s_k^i) - V^π(s_k^i)| ≤ Δ_t by definition, the contribution to the error is at most γ^k Δ_t, which is the bias term of the bound. We note that a similar argument leads to bounds in expectation rather than the PAC-style bounds given here. □

Let us take a brief moment to analyze the qualitative behavior of Equation (1) as a function of k. For large values of k, the quantity γ^k becomes negligible, and the bound is approximately (1/(1 - γ)) sqrt(3 log(k/δ)/n), giving almost all the weight to the error incurred by variance in the first k rewards, and negligible weight to the error in our current value function. At the other extreme, when k = 1 our reward variance contributes error only sqrt(3 log(1/δ)/n), but the error Δ_t in our current value function has weight γ. Thus, the first term increases with k, while the second term decreases with k, in a manner that formalizes the intuitive trade-off that one faces when choosing between longer or shorter backups.

Equation (1) describes the effect of a single phase of TD(k) backups, but we can iterate this recurrence over many phases to derive an upper bound on the full learning curve for any value of k. Assuming that the recurrence holds for t consecutive steps, and assuming Δ_0 = 1/(1 - γ) without loss of generality, solution of the recurrence (details omitted) yields

Δ_t ≤ ((1 - γ^{kt})/(1 - γ)) sqrt(3 log(k/δ)/n) + γ^{kt}/(1 - γ).   (2)

This bound makes a number of predictions about the effects of different values for k. First of all, as t approaches infinity, the bound on Δ_t approaches the value (1/(1 - γ)) sqrt(3 log(k/δ)/n), which increases with k. Thus, the bound predicts that the asymptotic error of phased TD(k) updates is larger for larger k. On the other hand, the rate of convergence to this asymptote is γ^{kt}, which is always exponentially fast, but faster for larger k. Thus, in choosing a fixed value of k, we must choose between having either rapid convergence to a worse asymptote, or slower convergence to a better asymptote.
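The bound of Equation (2) is easy to evaluate numerically. The sketch below (with the illustrative parameter values used in Figure 1, n = 3000, γ = 0.9, δ = 0.1; the function name is ours) reproduces the qualitative behavior just described:

```python
import math

def td_k_bound(t, k, gamma=0.9, delta=0.1, n=3000):
    """Upper bound of Equation (2) on the error Delta_t after t phases of TD(k)."""
    eps = math.sqrt(3 * math.log(k / delta) / n)
    return (1 - gamma ** (k * t)) / (1 - gamma) * eps + gamma ** (k * t) / (1 - gamma)

# Larger k converges faster (gamma^{kt} shrinks sooner) but to a higher
# asymptote (the log(k/delta) factor grows): the predicted trade-off.
for k in (1, 5, 25):
    print(k, td_k_bound(3, k), td_k_bound(10 ** 6, k))
```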
This prediction is illustrated graphically in Figure 1(a), where with all of the parameters besides k and t fixed (namely γ, δ, and n), we have plotted the bound of Equation (2) as a function of t for several different choices of k. Note that while the plots of Figure 1(a) were obtained by choosing fixed values for k and iterating the recurrence of Equation (1), at each phase we can instead use Equation (1) to choose the value of k that maximizes the predicted decrease in error from Δ_t to Δ_{t+1}. In other words, the recurrence immediately yields a schedule for k, along with an upper bound on the learning curve for this schedule that outperforms the upper bound on the learning curve for any fixed value of k. The learning curve for the schedule is also shown in Figure 1(a), and Figure 1(b) plots the schedule itself. Another interesting set of plots is obtained by fixing the number of phases t, and computing for each k the error after t phases using TD(k) updates that is predicted by Equation (2). Such plots are given in Figure 1(c), and they clearly predict a unique minimum, that is, an optimal value of k for each fixed t (this can also be verified analytically from Equation (2)). For moderate values of t, values of k that are too small suffer from their overemphasis on a still-inaccurate value function approximation, while values of k that are too large suffer from their refusal to bootstrap. Of course, as t increases, the optimal value of k decreases, since small values of k have time to reach their superior asymptotes. We now go on to provide a similar analysis for the TD(λ) family of updates, beginning with the analogue to Theorem 1. Formally, we can apply Theorem 1 by choosing δ = δ_0/(Nt), where N is the number of states in the MDP. Then with probability at least 1 - δ_0, the bound of Equation (1) will hold at every state for t consecutive steps.
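The schedule just described can be computed directly from the one-phase recurrence of Equation (1): at each phase, pick the k minimizing the predicted next error bound. A minimal sketch under the same illustrative parameters (all names are ours):

```python
import math

def next_bound(delta_t, k, gamma=0.9, delta=0.1, n=3000):
    """One-phase bound of Equation (1), given the current error bound delta_t."""
    eps = math.sqrt(3 * math.log(k / delta) / n)
    return (1 - gamma ** k) / (1 - gamma) * eps + gamma ** k * delta_t

def k_schedule(num_phases, k_max=100, gamma=0.9):
    """Greedy schedule: at each phase choose the k with the smallest
    predicted next-phase error bound, starting from Delta_0 = 1/(1-gamma)."""
    d = 1 / (1 - gamma)
    ks = []
    for _ in range(num_phases):
        k = min(range(1, k_max + 1), key=lambda kk: next_bound(d, kk, gamma))
        d = next_bound(d, k, gamma)
        ks.append(k)
    return ks, d
```

Because the predicted error bound shrinks from phase to phase, the greedily chosen k shrinks with it, yielding the decreasing schedule of Figure 1(b).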
Figure 1: (a) Upper bounds on the learning curves of phased TD(k) for several values of k, as a function of the number of phases t (parameters n = 3000, γ = 0.9, δ = 0.1). Note that larger values of k lead to more rapid convergence, but to higher asymptotic errors. Both the theory and the curves suggest a (decreasing) schedule for k, intuitively obtained by always jumping to the learning curve that enjoys the greatest one-step decrease from the current error. This schedule can be efficiently computed from the analytical upper bounds, and leads to the best (lowest) of the learning curves plotted, which is significantly better than that for any fixed k. (b) The schedule for k derived from the theory, as a function of the number of phases. (c) For several values of the number of phases t, the upper bound on Δ_t for TD(k) as a function of k. These curves show the predicted trade-off, with a unique optimal value for k identified until t is sufficiently large to permit 1-step backups to converge to their optimal asymptotes.

Theorem 2 (Phased TD(λ) Recurrence) Let S(t) be the set of trajectories generated by π in phase t (n trajectories from each state), let V̂_t(·) be the value function estimate of phased TD(λ) after phase t, and let Δ_t = max_s {|V̂_t(s) - V^π(s)|}. Then for any 1 > δ > 0, with probability at least 1 - δ,

Δ_{t+1} ≤ min_k { ((1 - (γλ)^k)/(1 - γλ)) sqrt(3 log(k/δ)/n) + (γλ)^k/((1 - γλ)(1 - γ)) } + (γ(1 - λ)/(1 - γλ)) Δ_t.   (3)

Here the error Δ_t after phase t is fixed, and the probability is taken over only the trajectories in S(t).

We omit the proof of this theorem, but it roughly follows that of Theorem 1. That proof exploited the fact that in TD(k) updates, we only need to apply large deviation bounds to the rewards of a finite number (k) of averaged trajectory steps. In TD(λ), all of the rewards contribute to the update. However, we can always choose to bound the deviations of the first k steps, for any value of k, and assume maximum variance for the remainder (whose weight diminishes rapidly as we increase k).
This logic is the source of the min_k{·} term of the bound. One can view Equation (3) as a variational upper bound, in the sense that it provides a family of upper bounds, one for each k, and then minimizes over the variational parameter k. The reader can verify that the terms appearing in Equation (3) exhibit a trade-off as a function of λ analogous to that exhibited by Equation (1) as a function of k. In the interest of brevity, we move directly to the TD(λ) analogue of Equation (2). It will be notationally convenient to define k* = argmin_k {f(k)}, where f(k) is the function appearing inside the min_k{·} in Equation (3). (Here we regard all parameters other than k as fixed.) It can be shown that for Δ_0 = 1/(1 - γ), repeated iteration of Equation (3) yields the t-phase inequality

Δ_t ≤ a (1 - b^t)/(1 - b) + b^t/(1 - γ)   (4)

where

a = ((1 - (γλ)^{k*})/(1 - γλ)) sqrt(3 log(k*/δ)/n) + (γλ)^{k*}/((1 - γλ)(1 - γ))  and  b = γ(1 - λ)/(1 - γλ).
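The quantities k*, a, and b are straightforward to compute numerically. The sketch below (illustrative parameters as in Figure 2; all names are ours) also exposes the predicted behavior of the rate b and the asymptote a/(1 - b):

```python
import math

def td_lambda_terms(lam, gamma=0.9, delta=0.1, n=3000, k_max=200):
    """Compute k*, a and b of Equation (4) for phased TD(lambda)."""
    def f(k):
        # the variational expression inside the min over k in Equation (3)
        eps = math.sqrt(3 * math.log(k / delta) / n)
        return ((1 - (gamma * lam) ** k) / (1 - gamma * lam) * eps
                + (gamma * lam) ** k / ((1 - gamma * lam) * (1 - gamma)))
    k_star = min(range(1, k_max + 1), key=f)
    a, b = f(k_star), gamma * (1 - lam) / (1 - gamma * lam)
    return k_star, a, b

# Larger lambda gives a smaller b (faster convergence) but, numerically,
# a larger asymptote a / (1 - b), matching the plots of Figure 2.
```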
Figure 2: (a) Upper bounds on the learning curves of phased TD(λ) for several values of λ, as a function of the number of phases t (parameters n = 3000, γ = 0.9, δ = 0.1). The predictions are analogous to those for TD(k) in Figure 1, and we have again plotted the predicted best learning curve obtained via a decreasing schedule of λ. (b) For several values of the number of phases t, the upper bound on Δ_t for TD(λ) as a function of λ.

While Equation (4) may be more difficult to parse than its TD(k) counterpart, the basic predictions and intuitions remain intact. As t approaches infinity, the bound on Δ_t asymptotes at a/(1 - b), and the rate of approach to this asymptote is simply b^t, which is again exponentially fast. Analysis of the derivative of b with respect to λ confirms that for all γ < 1, b is a decreasing function of λ; that is, the larger the λ, the faster the convergence. Analytically verifying that the asymptote a/(1 - b) increases with λ is more difficult due to the presence of k*, which involves a minimization operation. However, the learning curve plots of Figure 2(a) clearly show the predicted phenomenon: increasing λ yields faster convergence to a worse asymptote. As we did for the TD(k) case, we use our recurrence to derive a schedule for λ; Figure 2(a) also shows the predicted improvement in the learning curve by using such a schedule. Finally, Figure 2(b) again shows the non-monotonic predicted error as a function of λ for a fixed number of phases.

4 Some Experimental Confirmation

In order to test the various predictions made by our theory, we have performed a number of experiments using phased TD(k) on a version of the so-called random walk problem [4]. In this problem, we have a Markov process with 5 states arranged in a ring. At each step, there is probability 0.05 that we remain in our current state, and probability 0.95 that we advance one state clockwise around the ring.
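As a concrete sketch, the ring process and its exact value function (obtained by solving the Bellman system V = r + γPV) can be set up as follows. This is our own minimal reconstruction; in particular, placing the +1 and -1 reward states at indices 0 and 1 is an arbitrary choice not fixed by the description.

```python
import numpy as np

N, gamma, p_stay = 5, 0.98, 0.05

# Transition matrix of the ring: stay put with probability 0.05,
# otherwise advance one state clockwise.
P = np.zeros((N, N))
for s in range(N):
    P[s, s] = p_stay
    P[s, (s + 1) % N] = 1 - p_stay

# Two adjacent states carry rewards +1 and -1; the rest carry 0.
r = np.zeros(N)
r[0], r[1] = 1.0, -1.0

# Exact value function: solve (I - gamma P) V = r.
V_true = np.linalg.solve(np.eye(N) - gamma * P, r)

def sample_trajectory(s0, length, rng):
    """Sample `length` (state, reward) pairs of the ring process from s0."""
    traj, s = [], s0
    for _ in range(length):
        traj.append((s, r[s]))
        if rng.random() >= p_stay:
            s = (s + 1) % N
    return traj
```

With V_true in hand, the error Δ_t of any phased TD(k) run on sampled trajectories can be measured directly, as in the experiments reported below.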
(Note that since we are only concerned with the evaluation of a fixed policy, we have simply defined a Markov process rather than a Markov decision process.) Two adjacent states on the ring have reward +1 and -1, respectively, while the remaining states have reward 0. The standard random walk problem has a chain of states, with an absorbing state at each end; here we chose a ring structure simply to avoid asymmetries in the states induced by the absorbing states. To test the theory, we ran a series of simulations computing the TD(k) estimate of the value function in this Markov process. For several different values of k, we computed the error Δ_t in the value function estimate as a function of the number of phases t. (Δ_t is easily computed, since we can compute the true value function for this simple problem.) The resulting plot in Figure 3(a) is the experimental analogue of the theoretical predictions in Figure 1(a). We see that these predictions are qualitatively confirmed: larger k leads to faster convergence to an inferior asymptote. Given these empirical learning curves, we can then compute the empirical schedule that they suggest. Namely, to determine experimentally a schedule for k that should outperform (at least) the values of k we tested in Figure 3(a), we used the empirical learning curves to determine, for any given value of Δ, which of the empirical curves enjoyed the greatest
one-step decrease in error when its current error was (approximately) Δ. This is simply the empirical counterpart of the schedule computation suggested by the theory described above. The resulting experimental learning curve for this schedule is also shown in Figure 3(a), and the schedule itself in Figure 3(b). We see that there are significant improvements in the learning curve from using the schedule, and that the form of the schedule is qualitatively similar to the theoretical schedule of Figure 1(b).

Figure 3: (a) Empirical learning curves for TD(k) for several values of k on the random walk problem (parameters n = 40 and γ = 0.98). Each plot is averaged over 5000 runs of TD(k). Also shown is the learning curve (averaged over 5000 runs) for the empirical schedule computed from the TD(k) learning curves, which is better than any of these curves. (b) The empirical schedule.

5 Conclusion

We have given the first provable upper bounds on the error of TD methods for policy evaluation. These upper bounds have exponential rates of convergence, and clearly articulate the bias-variance trade-off that such methods obey.

References

[1] M. Kearns and S. Singh. Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms. NIPS, 1998.
[2] S. Singh and P. Dayan. Analytical Mean Squared Error Curves for Temporal Difference Learning. Machine Learning, 1998.
[3] R. S. Sutton. Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3, 9-44, 1988.
[4] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[5] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, England, 1989.