Convergence of reinforcement learning with general function approximators


Vassilis A. Papavassiliou and Stuart Russell
Computer Science Division, U. of California, Berkeley, CA 94720-1776
{vassilis,russell}@cs.berkeley.edu

Abstract

A key open problem in reinforcement learning is to assure convergence when using a compact hypothesis class to approximate the value function. Although the standard temporal-difference learning algorithm has been shown to converge when the hypothesis class is a linear combination of fixed basis functions, it may diverge with a general (nonlinear) hypothesis class. This paper describes the Bridge algorithm, a new method for reinforcement learning, and shows that it converges to an approximate global optimum for any agnostically learnable hypothesis class. Convergence is demonstrated on a simple example for which temporal-difference learning fails. Weak conditions are identified under which the Bridge algorithm converges for any hypothesis class. Finally, connections are made between the complexity of reinforcement learning and the PAC-learnability of the hypothesis class.

1 Introduction

Reinforcement learning (RL) is a widely used method for learning to make decisions in complex, uncertain environments. Typically, an RL agent perceives and acts in an environment, receiving rewards that provide some indication of the quality of its actions. The agent's goal is to maximize the sum of rewards received. RL algorithms work by learning a value function that describes the long-term expected sum of rewards from each state; alternatively, they can learn a Q-function describing the value of each action in each state. These functions can then be used to make decisions. Temporal-difference (TD) learning [Sutton, 1988] is a commonly used family of reinforcement learning methods. TD algorithms operate by adjusting the value function to be locally consistent.
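As a concrete illustration of the TD idea (the three-state chain and all numbers below are invented for illustration, not taken from the paper), tabular TD(0) for value determination can be sketched as follows:

```python
import random

# Illustrative 3-state chain under a fixed policy (our numbers, not the paper's):
P = [[0.0, 1.0, 0.0],   # transition probabilities p(x2 | x)
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
r = [1.0, 0.0, -1.0]    # deterministic reward received in each state
gamma = 0.9

def td0(steps=20_000, alpha0=0.5, seed=0):
    """Adjust V toward local consistency: V(x) += a*(r + gamma*V(x2) - V(x))."""
    rng = random.Random(seed)
    V = [0.0, 0.0, 0.0]
    x = 0
    for t in range(steps):
        x2 = rng.choices(range(3), weights=P[x])[0]
        alpha = alpha0 / (1 + t / 1000)                # decaying step size
        V[x] += alpha * (r[x] + gamma * V[x2] - V[x])  # TD(0) update
        x = x2
    return V

V = td0()
# For this chain the true value function is approximately
# (0.701, -0.332, -0.369), and V approaches it.
```

With a decaying step size, the table converges to the fixed point of the backup operator defined below in Section 2.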
When used with function approximators, such as neural networks, that provide a compact parameterized representation of the value function, TD methods can solve real-world problems with very large state spaces. Because of this, one would like to know if such algorithms can be guaranteed to work, i.e., to converge and to return optimal solutions. The theoretical study of RL algorithms usually divides the problem into two aspects: exploration policies that can guarantee complete coverage of the environment, and value determination to find the value function that corresponds to a given policy. This paper concentrates on the second aspect. Prior work [Jaakkola et al., 1995] has established convergence of TD-learning with probability 1 when the value function is represented as a table where each state has its own entry. For large state spaces, however, compact parametric representations are required; for such representations, we are interested in whether an algorithm will converge to the function that is closest, by some metric, to the true value function (a form of agnostic learning). Gordon [1995] proved that TD converges in this sense for representations called averagers, on which the TD update is a max-norm contraction (see Section 2). Tsitsiklis and van Roy [1996] proved convergence and established error bounds for TD(λ) with linear combinations of fixed basis functions. With nonlinear representations, such as neural networks, TD has been observed to give suboptimal solutions [Bertsekas and Tsitsiklis, 1996] or even to diverge. This is a serious problem since most real problems require nonlinearity. Baird [1995] introduced residual algorithms, for which convergence can be proved when combined with a gradient descent learning method (such as used with neural networks). Unfortunately, the error in the resulting approximation can be arbitrarily large, and furthermore the method requires two independent visits to each sampled state.
This paper describes the Bridge algorithm, a new RL method for which we establish convergence and error bounds with any agnostically learnable representation. Section 2 provides the necessary definitions and notation. Section 3 explains the problem of nonconvergence and provides examples of this with TD. Section 4 outlines the Bridge algorithm, sketches the proof of convergence, and shows how it solves the examples for which TD fails. Section 5 briefly covers additional results on convergence to local optima for any representation and on the use of PAC-learning theory. Section 6 mentions some alternative techniques one might consider. The paper is necessarily technically dense given the space restrictions. The results, however, should be of broad interest to the AI and machine learning communities.

2 Definitions

2.1 MDP

A Markov decision process M = (S, A, p, r, γ) is a set of states S, a set of actions A, transition probability distributions p(·|x, a) that define the next-state distribution given a current state x and action a, reward distributions r(·|x, a) that define the distribution of real-valued reward received upon executing a in x, and a discount factor γ ∈ (0, 1). Since we are interested in the problem of value determination, we assume we are given a fixed policy (choice of action at each state). When executing only this fixed policy, the MDP actually becomes a Markov chain, and we may therefore also write the transition probabilities as p(·|x) and the reward distributions as r(·|x). We assume that we are able to define the stationary distribution π of the resulting Markov chain, and also that the rewards lie in the range [−R_max, R_max]. Let (X_1, X_2, X_3, ...) be a random trace in the Markov chain starting from x, i.e. X_1 = x, X_2 has distribution p(·|X_1) and X_k has distribution p(·|X_{k−1}). Let (R_1, R_2, R_3, ...) be the observed random rewards, i.e. R_k has distribution r(·|X_k). Define the true value function V at a state x to be the expected, discounted reward to go from state x:

    V(x) = E[R_1 + γR_2 + γ^2 R_3 + ...]

The problem of value determination is to determine the true value function or a good approximation to it. Classical TD solutions make use of the backup operator T, which takes an approximation v and produces a better approximation Tv:

    (Tv)(x) = E[R_1 + γv(X_2)] = E[R_1] + γ Σ_{x_2∈S} p(x_2|x) v(x_2)

An operator A is said to be a contraction with factor β < 1 under some norm ‖·‖ if

    ∀ u, v:  ‖Au − Av‖ ≤ β ‖u − v‖

If β = 1, A is said to be a nonexpansion. If we define the max-norm to be ‖v‖_max = max_{x∈S} |v(x)| and the π-norm to be ‖v‖_π = [Σ_{x∈S} v(x)^2 π(x)]^{1/2}, then T is a contraction with factor γ and fixed point V under both the max-norm and the π-norm [Tsitsiklis and van Roy, 1996]. Therefore repeated application of T (i.e. the iterative process v_{n+1} = Tv_n) converges to V. We will use T^j to represent the operator T applied j times. If the transition probabilities p and reward distributions r are known, then it is possible to compute Tv directly from its definition. However, if p and r are not known, then it is not possible to compute the expectation in the definition of T. In this case, by observing a sequence of states and rewards in the Markov chain, we can form an unbiased estimate of Tv.
Specifically, if we observe state x and reward r followed by state x_2, then the observed backed-up value, r + γv(x_2), is an unbiased estimate of (Tv)(x). Formally, we define P_Tv(·|x) to be the conditional probability density of observed backed-up values from state x:

    P_Tv(y|x) = Pr[R_1 + γv(X_2) = y | x]

where R_1 and X_2 are, as defined above, random variables with distributions r(·|x) and p(·|x) respectively. Thus if a random variable Y has associated density P_Tv(·|x), then E[Y] = (Tv)(x). Similarly, we define P_{T^j v}(·|x) to be the conditional probability density of j-step backed-up values observed from state x.

2.2 Function Approximation

As the state space becomes large or even infinite, it becomes infeasible to tabulate the value function for each state x, and we must resort to function approximation. Our approximation scheme consists of a hypothesis class H of representable functions and a learning operator Π which maps arbitrary value functions to functions in H. The standard ΠT-based approaches that use function approximation essentially compute or approximately compute the iterative process v_{n+1} = ΠTv_n. In practice, the ΠT mapping usually cannot be performed exactly because, even if we have access to the necessary expectation to compute (Tv)(x) exactly, it is infeasible to do so for all states x. Thus we perform an approximate Π mapping using samples. We will take the state sample distribution to be the stationary distribution π. In general, when we cannot compute (Tv)(x) exactly, we approximate ΠTv by generating samples (x, y) with sample distribution P_Tv(x, y) = π(x)P_Tv(y|x) and passing them to a learning algorithm for H. The joint probability density P_Tv (from which we generate the samples) simply combines π (from which we sample the state x) with the conditional probability density P_Tv(·|x) (from which we generate an estimate y for (Tv)(x)). In this paper we focus on agnostic learning. In this case, the learning operator Π seeks the hypothesis h ∈ H that best matches the target function, even though typically the target function is not in the hypothesis class.
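Both regimes just described can be sketched in a few lines (the chain and numbers are again illustrative assumptions, not the paper's): applying T exactly from a known model, and forming the unbiased sampled estimate of (Tv)(x):

```python
import random

# Illustrative 3-state chain (our numbers): P[x][x2] = p(x2|x), rbar[x] = E[R_1|x].
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]
rbar = [1.0, 0.0, 2.0]
gamma = 0.8

def T(v):
    """Exact backup: (Tv)(x) = E[R_1] + gamma * sum_x2 p(x2|x) v(x2)."""
    return [rbar[x] + gamma * sum(P[x][x2] * v[x2] for x2 in range(3))
            for x in range(3)]

# With p and r known, repeated application of the gamma-contraction T
# converges to the true value function V, the fixed point of T.
V = [0.0, 0.0, 0.0]
for _ in range(300):
    V = T(V)

# With p and r unknown, observed backed-up values r + gamma*v(x2) from
# sampled transitions average out to (Tv)(x).
v = [2.0, 0.0, -1.0]           # some current approximation
rng = random.Random(0)
n = 100_000
total = 0.0
for _ in range(n):
    x2 = rng.choices(range(3), weights=P[0])[0]
    reward = rng.gauss(rbar[0], 0.1)   # noisy reward with mean rbar[0]
    total += reward + gamma * v[x2]
est = total / n
exact = T(v)[0]
# est is close to exact, and V is (numerically) a fixed point of T
```

The Monte Carlo average plays the role of the sample distribution P_Tv described above: each sampled pair (x, y) has E[y | x] = (Tv)(x).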
If we measure distance using the π-norm, then we can define the learning operator Π for agnostic learning to be:

    Πv = argmin_{h∈H} ‖h − v‖_π

As already mentioned, in the typical case we do not have access to the exact function to be learned, but rather we can draw samples (x, y) from a sample distribution P such that the expected value of the conditional distribution P(·|x) is v(x). If, in addition, P samples the states according to π (or whatever distribution was used to measure distance in the previous definition), then an equivalent definition for agnostic learning is based on minimizing risk:

    ΠP = argmin_{h∈H} r(h, P)

where we define the risk of a hypothesis h with respect to a distribution P to be:

    r(h, P) = ∫_{S×ℝ} (h(x) − y)^2 dP(x, y)

In practice, Π is approximately performed by generating enough samples (x, y) from the sample distribution P so as to be able to estimate risk well, and thus to be able to output the hypothesis in H that has minimal risk with respect to this distribution. In the algorithm we present, we assume the ability to compute Π exactly for the given hypothesis class. This is certainly not a trivial assumption. In a later section, we briefly discuss an extension to our algorithm for the case where the learning step Π_{ε,δ} is a PAC-agnostic learning step rather than an exact agnostic learning step. Finally, let us define our goal. V* is defined to be the best approximation to V possible using H:

    V* = ΠV = argmin_{h∈H} ‖h − V‖_π

We seek techniques that return a value function ṽ that minimizes a relative or absolute error bound:

    ‖ṽ − V‖ ≤ κ ‖V* − V‖    or    ‖ṽ − V‖ ≤ ε

3 Nonconvergence of TD

In this section, we examine the non-convergence problem of TD when used with non-linear function approximators. We present simple examples which we will reconsider with the Bridge algorithm in the next section. As mentioned above, standard TD with function approximation is based on the iterative process v_{n+1} = ΠTv_n. If Π is a non-expansion under the same norm that makes T a contraction, then the composite operator ΠT is a contraction and this process will converge to some error bound relative to V*. For example, Tsitsiklis and van Roy [1996] consider a linear hypothesis class, for which Π is simply a projection. If one uses a nonlinear hypothesis class for which Π is not a nonexpansion, then this iterative process can either diverge or get stuck in a local minimum arbitrarily far from V*. We now give simple examples demonstrating ways in which TD can fail when it is used with a nonlinear hypothesis class.

Consider an MDP with two states where the probability of going from one state to the other is 1, and the rewards are also deterministic. The stationary distribution is 0.5 for each state, the discount factor is 0.8, and the hypothesis class is the subset of the value function space R^2 given by the v_1-axis, the v_2-axis and the two diagonals v_1 = v_2 and v_1 = −v_2. Formally, H = {(v_1, v_2) ∈ R^2 : v_1 = 0 or v_2 = 0 or |v_1| = |v_2|}. The learning operator Π projects a function onto the nearest of the 4 axes of H. For example, Π(6, 4) = (5, 5), Π(6, −4) = (5, −5) and Π(−7, −1) = (−7, 0). We first consider the case where the rewards of the two states are (r_1, r_2) = (10, −8). The true value function turns out to be V = (10, 0). Note that in this case, the true value function is actually in the hypothesis class. Starting from v_0 = (0, 0), the result of repeated applications of ΠT is shown in Figure 1(a). For example, the first step is ΠTv_0 = Π(r_1 + γv_0(2), r_2 + γv_0(1)) = Π(10, −8) = (9, −9). This process converges to (5, −5), which is a fixed point because ΠT(5, −5) = Π(6, −4) = (5, −5). Remember that V ∈ H, so the relative error bound ‖v − V‖/‖V* − V‖ is infinite. Thus ΠT can converge arbitrarily far (in terms of relative error bound) from the best representation in H of the true value function.

If we modify the rewards slightly to be (r_1, r_2) = (10, −7.8), then the true value function V = (10 4/9, 5/9) is no longer in H. The best representation of V is V* = ΠV = (10 4/9, 0). If we start from (0, 0) as above, we will again reach a suboptimal fixed point around (5, −5). However, starting from v_0 = (30, 0) (or even v_0 = (15, 0)), the result of repeated applications of ΠT, as shown in Figure 1(b), displays a different type of failure: oscillation between points approaching (7.5, 7.5) and (16, 0). As in the previous example, ‖V* − V‖ is small, so the relative error bound is large.

[Figure 1: (a) Suboptimal Fixed Point and (b) Oscillation]

4 The Bridge Algorithm

We begin with a high-level description of the algorithm (details are in the Appendix). This is followed by the convergence results and another look at the examples from the previous section. The main algorithm, BridgeValueDet, determines the value function within some error bound by making repeated calls to BridgeStep. We will now describe the first invocation of BridgeStep. Metaphorically, it consists of throwing a bridge across the treacherous terrain that is the hypothesis class, towards a point on the far side of the optimal solution. If the bridge lands somewhere close to where we aimed it, we will be able to walk along it in a productive direction (achieve a contraction). If the bridge lands far from our target, then we know that there isn't any H-expressible value function near our target on which the bridge could have landed (hence an error bound). This is made precise by Lemma 2 in the next section.

We are given an old approximation v from which we try to create a better approximation v_new. We basically have two tools to work with: T and Π. As can be seen in Figure 2 (and in the example in the previous section), if we combine these two operators in the standard way, v_new = ΠTv, we can get stuck in a local minimum. We will instead use them more creatively to guarantee progress or establish an error bound.

[Figure 2: Stuck in a local minimum]

We begin by using not T but rather T^j, where j is determined by the main algorithm BridgeValueDet before it calls BridgeStep. We can then ask the question: given that we know where v and T^j v are, what does that tell us about the location of V? It turns out that V is restricted to lie in some hypersphere whose position can be defined in terms of the positions of v and T^j v. This is made precise by Lemma 1 in the next section. The hypersphere is depicted in Figure 3 and, as required, V lies inside it.

[Figure 3: The bridge is aimed]

We now define a new operator B based on T^j and the identity operator I:

    B = I + (1/(1 − γ^j)) (T^j − I)

B simply amplifies the Bellman residual by a factor of 1/(1 − γ^j). As can be seen in Figure 3, Bv is the point on the far side of the hypersphere from v. This operator is what we use to throw a bridge. We aim the bridge for Bv, which is beyond anywhere our goal might be, i.e. the true value function V lies somewhere between v and Bv. The motivation for using B is in a sense to jump over all local minima between v and V. Ideally we would be able to represent Bv (just as in the standard approach we would want to represent Tv), but this function is most likely not in our class of representable functions. Therefore we must apply the Π operator to map it into H. The result, v̂ = ΠBv, is shown in Figure 4. The bridge is supported by v and v̂ and is shown as a line between them. In summary, we throw the bridge aiming for Bv, but Π determines the point v̂ on which it actually lands.

[Figure 4: The bridge is established]

In practice we perform the Π mapping by generating samples from an appropriate distribution and passing them to a learning algorithm for H. In particular, to compute ΠBv, we generate samples (x, y) according to the distribution:

    P_Bv(x, y) = π(x) P_Bv(y|x) = π(x) P_{T^j v}((y − v(x))(1 − γ^j) + v(x) | x)

The key feature of this distribution is that if a random variable Y has associated density P_Bv(·|x), then E[Y] = (Bv)(x). The final step is to walk along the bridge. The bridge is a line between v and v̂, and our new approximation v_new will be some point on this line (see Figure 5). This point is determined by projecting a point 1/γ̂ of the way from v to T^j v onto the line, where γ̂ is a function of the input parameters. (We could just project T^j v, but using γ̂ is a refinement that yields a better guaranteed effective contraction factor.)

[Figure 5: The new approximation]

Thus the new approximation v_new, which is not necessarily in H, is a weighted average of the old approximation v and v̂ ∈ H. Calculating the weights (α and 1 − α) in this average requires the ability to measure distance and risk. In particular, we need to measure the distance between v and v̂ and the risk of v and v̂ with respect to the distribution P_{T^j v}.
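The two-state failure examples of Section 3 are easy to reproduce numerically; a minimal sketch, with Π implemented as projection onto the nearest of the four axes:

```python
import math

gamma = 0.8

def T(v, r):
    """Backup for the deterministic two-state cycle."""
    return (r[0] + gamma * v[1], r[1] + gamma * v[0])

def proj(v):
    """Pi: project onto the nearest of the four axes of H
    (the v1-axis, the v2-axis, and the diagonals v1 = v2, v1 = -v2)."""
    a, b = v
    m, d = (a + b) / 2, (a - b) / 2
    return min([(a, 0.0), (0.0, b), (m, m), (d, -d)],
               key=lambda c: math.dist(v, c))

# Example 1: rewards (10, -8). V = (10, 0) is in H, yet iterating Pi T
# from (0, 0) converges to the suboptimal fixed point (5, -5).
v = (0.0, 0.0)
for _ in range(200):
    v = proj(T(v, (10.0, -8.0)))
# v is now (essentially) (5, -5)

# Example 2: rewards (10, -7.8). From (30, 0), Pi T ends up oscillating
# between points approaching (7.5, 7.5) and (16, 0).
w = (30.0, 0.0)
orbit = []
for _ in range(200):
    w = proj(T(w, (10.0, -7.8)))
    orbit.append(w)
# orbit settles into the two-cycle {(7.5, 7.5), (16, 0)}
```

Both failure modes match the behavior described in Section 3: a spurious fixed point far (in relative error) from V*, and a limit cycle.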
These three lengths (the distance between v and v̂, and the risks of v and v̂ under P_{T^j v}), together with γ̂, determine the relative position of v_new with respect to v and v̂ (see Figure 5). In practice we estimate the true risk with the empirical risk [Haussler, 1992], which we calculate using samples drawn from the distribution P_{T^j v}. We have just described a single invocation of BridgeStep, which represents the first iteration of the main algorithm. Each iteration builds a new bridge based on the previous one, so a generic iteration would begin with a v that was the v_new of the previous iteration (see Figure 6). In particular, the input v of a generic iteration is not in H, but is rather a linear combination of the initial approximation v_0 and all previous v̂ functions. Thus the final result is a tall weighted tree whose leaves are in H. If we insist on a final result that is in H, then we can apply a final Π mapping at the very end. Just as the standard TD algorithm was summarized as v_{n+1} = ΠTv_n, the Bridge algorithm can be essentially summarized as:

    v_{n+1} = (1 − α_n) v_n + α_n ΠB v_n = ((1 − α_n) I + α_n Π(I + (1/(1 − γ^j))(T^j − I))) v_n

[Figure 6: Generic iteration of BridgeStep]

4.1 Convergence of the Bridge algorithm

We will state the main convergence theorem for the Bridge algorithm, but space limitations allow us to state only the two most important lemmas used in the proof. We begin with a very useful observation about the geometric relationship between v, Tv and V.

Lemma 1: Let A be a contraction with contraction factor ζ under some norm, and let V be its fixed point. For any point v, let O = v + (1/(1 − ζ^2))(Av − v). Then

    ‖V − O‖ ≤ (ζ/(1 − ζ^2)) ‖Av − v‖

In words, given the positions of v and Av, let Δ = ‖Av − v‖. Then we know that the position of V has to be on or inside the hypersphere of radius ζΔ/(1 − ζ^2) centered at O (see Figure 7). This hypersphere is simply the set of points that are at least a factor of ζ closer to Av than to v. Note that the distance from v to the furthest point on the hypersphere is Δ/(1 − ζ). We apply Lemma 1 using T^j for A and γ^j for ζ. This defines a hypersphere inside of which the true value function V must lie.
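Lemma 1 can be checked numerically; the sketch below uses a randomly generated illustrative chain (our construction, not the paper's) and the π-norm, under which T contracts with factor γ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, j = 5, 0.9, 3
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)       # row-stochastic transition matrix
rbar = rng.uniform(-1, 1, n)
V = np.linalg.solve(np.eye(n) - gamma * P, rbar)   # true value function

# Stationary distribution pi (left eigenvector of P for eigenvalue 1),
# and the pi-norm, under which T is a gamma-contraction.
w, vecs = np.linalg.eig(P.T)
pi = np.abs(np.real(vecs[:, np.argmin(np.abs(w - 1))]))
pi /= pi.sum()

def norm_pi(u):
    return float(np.sqrt(np.sum(pi * u * u)))

def Tj(v):                              # j applications of the backup operator
    for _ in range(j):
        v = rbar + gamma * P @ v
    return v

# Lemma 1 with A = T^j and zeta = gamma^j:
# O = v + (Av - v)/(1 - zeta^2),  ||V - O|| <= zeta/(1 - zeta^2) * ||Av - v||
zeta = gamma ** j
slack = []
for _ in range(100):
    v = rng.uniform(-10, 10, n)
    Av = Tj(v)
    O = v + (Av - v) / (1 - zeta ** 2)
    slack.append(zeta / (1 - zeta ** 2) * norm_pi(Av - v) - norm_pi(V - O))
# every entry of slack is nonnegative: V lies inside the hypersphere
```

Since the π-norm is an inner-product norm, the set of points at least a factor of ζ closer to Av than to v really is a ball, which is what the check exercises.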
Lemma 1 is used mainly to prove Lemma 2, which characterizes the behavior of BridgeStep and provides most of the meat of the convergence proof.

Lemma 2: Given an approximation v and parameters γ̂ > 0 and j ≥ 1, BridgeStep(v, γ̂, j) returns a new approximation v_new that satisfies at least one of the following two conditions, where the error bound κ = errbound(γ̂, γ, j) is defined in the Appendix:

    ‖v_new − V‖ ≤ γ̂ ‖v − V‖    (Contraction)
    ‖v_new − V‖ ≤ κ ‖V* − V‖    (Error Bound)

[Figure 7: Hypersphere containing V]

Intuitively, if the bridge lands close to where we aimed it, we will achieve a contraction towards the goal. If the bridge lands far away, we will prove a relative error bound for the result. The key quantity that determines which of these two events happens is the angle θ formed between the bridge and the line from v to Bv. If v̂ = ΠBv is close to Bv, then θ will be small, the bridge will lie close to the hypersphere, and we will be able to walk along the bridge and make progress. If instead v̂ is far from Bv, then θ will be large and walking along the bridge will not take us closer to the goal, but we will be able to prove that we are already close enough.

Figure 8 shows the case where the angle θ is small. As described previously, the small hypersphere represents the set of points that are at least a factor of γ^j closer to T^j v than they are to v. This follows from applying Lemma 1 to the operator T^j. Now think of BridgeStep as an operator that takes v and returns v_new, and ask the question: what set of points are at least a factor of γ̂ (which is an input parameter to BridgeStep) closer to v_new than to v? Applying Lemma 1 to this question defines another, much larger hypersphere, which is depicted in Figure 8 with center at O for the case θ = arcsin γ̂ − arcsin γ^j. Note that this larger hypersphere completely contains the smaller hypersphere which contains V. Thus V also lies inside the larger hypersphere, and so v_new is at least a factor of γ̂ closer to V than v is. This holds for θ = arcsin γ̂ − arcsin γ^j; if θ is smaller than this, the achieved contraction is even better.

[Figure 8: Contraction is achieved when θ is small]

Figure 9 shows the case where the angle θ is large. θ is large when it is not possible to find a hypothesis in H close to Bv.
In fact we choose v̂ = ΠBv to be the closest such hypothesis, so the rest of H must lie further away. In particular, H must lie completely outside the big hypersphere depicted in Figure 9 with center at Bv, for otherwise v̂ would not be the closest hypothesis to Bv. Furthermore, we know that V must lie on or inside the small hypersphere in Figure 9. Thus there is a separation between H and V, and this separation allows us to prove, for any possible position of V, an upper bound on the relative error ‖v_new − V‖ / ‖V* − V‖.

[Figure 9: Relative error bound is established when θ is large]

It should be noted that in general we do not know V, and we cannot measure θ to determine which of the two conditions of Lemma 2 v_new satisfies. We only know that it satisfies at least one of them. By Lemma 2, if v already satisfies the relative error bound then so will v_new, because if v_new achieves a contraction over v, its error decreases. Thus each successive approximation is better than the one before, until we achieve the relative error bound, from which point every subsequent approximation will also achieve that bound. We now give the main result, which is guaranteed convergence to a relative or absolute error bound. Moreover, the maximum number of invocations of BridgeStep, and thus the maximum number of hypotheses in the linear combination, can be specified.

Theorem 1: Let κ > 1 and ε_0 > 0 be the desired relative and absolute error bounds respectively. Let N be an upper bound on the desired number of iterations. Then the algorithm BridgeValueDet(κ, ε_0, N) produces an approximation ṽ, consisting of a linear combination of at most N + 1 hypotheses from H, that satisfies at least one of either the relative error bound κ or the absolute error bound ε_0:

    ‖ṽ − V‖ ≤ κ ‖V* − V‖    or    ‖ṽ − V‖ ≤ ε_0

The proof of the theorem follows directly from Lemma 2: rewards are bounded, so the true value function is bounded, so the absolute error of the initial approximation can be bounded. If all N iterations achieve a contraction, then the absolute error will be smaller than requested.
If at least one of the iterations failed to achieve a contraction, then it achieved a relative error bound, and all subsequent iterations, including the last one, will achieve the requested relative error bound. Again, since we do not know which of the two conditions of Lemma 2 each iteration satisfies, we do not know whether the final answer ṽ satisfies the relative or the absolute error bound. We know only that it satisfies at least one of them.

Corollary 1: Let κ, ε_0 and N be as defined in Theorem 1. Let ṽ = BridgeValueDet(κ, ε_0, N), a linear combination of hypotheses from H. Then Πṽ, the result of mapping ṽ back into H, satisfies at least one of either the relative error bound 2κ + 1 or the absolute error bound ε_0(2 + 1/κ).

4.2 The Examples Revisited

We now reconsider the examples from Section 3. The main algorithm BridgeValueDet takes parameters κ, ε_0 and N, from which it computes the number of lookahead steps j to use to achieve the requested error bounds. Also, for each iteration it chooses a parameter γ̂_n which determines the contraction factor achieved or relative error bound established for that iteration. These two parameters, j and γ̂_n, are passed to BridgeStep at each iteration. In this section, we examine the effect of repeated applications of BridgeStep, using j = 3 and γ̂ = 0.99 for every iteration.

For the first example, with initial v_0 = (0, 0), the results of repeated applications of BridgeStep are shown in Figure 10(a). Because for this example ‖V* − V‖ = 0 (i.e. the true value function is in H), the relative error bound is always infinite. Therefore, by Lemma 2, every step achieves at least a contraction, and so the algorithm converges to the true value function.

For the second example, with v_0 = (30, 0), the results of repeated applications of BridgeStep are shown in Figure 10(b). Looking at the first step more closely, T^j v_0 = (10.2, 10.6), Bv_0 = (−10.7, 21.7) and v̂_0 = ΠBv_0 = (−16.2, 16.2). The dotted line between v_0 and v̂_0 is the bridge. v_1, being a weighted average of v_0 and v̂_0, lies on this bridge. Similarly, v_2 lies on the bridge between v_1 and v̂_1.

[Figure 10: Examples revisited with Bridge]
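The first invocation of BridgeStep on the second example can be reproduced directly; a sketch (ghat denotes γ̂, and the rounded values match those reported for Figure 10(b)):

```python
import math

# Two-state chain of Section 3: rewards (10, -7.8), gamma = 0.8, j = 3.
gamma, j, ghat = 0.8, 3, 0.99
r = (10.0, -7.8)

def T(v):
    return (r[0] + gamma * v[1], r[1] + gamma * v[0])

def proj(v):                           # Pi: nearest of the four axes of H
    a, b = v
    m, d = (a + b) / 2, (a - b) / 2
    return min([(a, 0.0), (0.0, b), (m, m), (d, -d)],
               key=lambda c: math.dist(v, c))

v0 = (30.0, 0.0)
Tjv = v0
for _ in range(j):
    Tjv = T(Tjv)                       # T^3 v0
Bv = tuple(v0[i] + (Tjv[i] - v0[i]) / (1 - gamma ** j) for i in range(2))
vhat = proj(Bv)
# Tjv ≈ (10.2, 10.6), Bv ≈ (-10.7, 21.7), vhat ≈ (-16.2, 16.2)

# The angle theta between the bridge (vhat - v0) and the aim (Bv - v0)
# is well below arcsin(ghat) - arcsin(gamma^j), so this step achieves a
# contraction (the small-angle case of Lemma 2). With pi = (0.5, 0.5),
# pi-norm angles coincide with Euclidean angles.
a = (vhat[0] - v0[0], vhat[1] - v0[1])
b = (Bv[0] - v0[0], Bv[1] - v0[1])
theta = math.acos((a[0] * b[0] + a[1] * b[1])
                  / (math.hypot(*a) * math.hypot(*b)))
threshold = math.asin(ghat) - math.asin(gamma ** j)
```

The operator B lengthens the three-step Bellman residual by 1/(1 − γ^3) ≈ 2.05, which is why the bridge is aimed far beyond T^3 v_0.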
Iteration:                     0    5    10   15   20   25   30   35   40   45
Relative error bound:        35.2 13.5  7.4  4.1  2.3  1.3  0.8  0.5  0.4  0.4
Effective contraction factor: .87  .89  .89  .89  .90  .91  .93  .97  .99   1

Figure 11: (a) Lemma 2 applied to second example (b) Linear combination

Figure 11(a) demonstrates Lemma 2 on every fifth application of BridgeStep. In particular, note that the effective contraction factor only exceeds β after the desired relative error bound errbound(β, γ, j) = 4.3 has been achieved. In fact, on this example the algorithm performs far better than the theory guarantees. Figure 11(b) shows the weights of the averages and the structure of the resulting linear combination after 7 steps.

5 Extensions

It is possible to extend the algorithm in many ways. In particular, relying on an exact, agnostic learning operator Π is not practical. Here we briefly discuss the use of two other learning operators, and we hope in the future to consider others still.

5.1 PAC(ε, δ) learning

Most significantly, we have extended our algorithm to the case where the learning step cannot be done exactly but is instead a PAC learning step Π_{ε,δ} (see [Papavassiliou and Russell, 1998]). We actually use the same ε and δ for every iteration, so the learning step Π_{ε,δ} has the same complexity for every iteration. This is simple but most likely not optimal. One appealing aspect of considering PAC agnostic learning is the potential availability of sample complexity results based on some measure of the complexity of H. Unfortunately, it is necessary to learn and estimate risk under the stationary distribution of the Markov chain. Simply running the chain to generate samples will only generate them correctly in the steady-state limit.
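The difficulty can be seen in a toy simulation: samples drawn by running a chain for only a few steps are distributed quite differently from the stationary distribution. The two-state chain below, and all of its numbers, are invented purely for illustration.

```python
import random

random.seed(0)

# A toy two-state Markov chain (transition probabilities are illustrative only).
P = {0: [(0, 0.9), (1, 0.1)],   # from state 0: stay with prob. 0.9
     1: [(0, 0.5), (1, 0.5)]}   # from state 1: move to 0 with prob. 0.5

def step(s):
    r, acc = random.random(), 0.0
    for nxt, p in P[s]:
        acc += p
        if r < acc:
            return nxt
    return nxt

# Stationary distribution solves pi = pi P:  pi(0) = 5/6, pi(1) = 1/6.
pi0 = 5.0 / 6.0

def empirical_freq(start, n_chains=20000, n_steps=1):
    """Fraction of independent runs that end in state 0 after n_steps."""
    hits = 0
    for _ in range(n_chains):
        s = start
        for _ in range(n_steps):
            s = step(s)
        hits += (s == 0)
    return hits / n_chains

# Started far from stationarity, short runs are biased; longer runs mix.
print(empirical_freq(1, n_steps=1))    # near 0.5, far from pi(0) = 0.833
print(empirical_freq(1, n_steps=25))   # close to pi(0)
```

The gap between the one-step frequency and pi(0) is exactly the bias that a sample complexity theory for Markov-chain data would have to account for, presumably through the mixing time.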
Therefore, computing sample complexity results for the risk estimation and agnostic learning steps requires extending the current state of the theory to the case where samples are generated from a Markov chain, rather than i.i.d. One would expect the sample complexity to depend on the mixing time of the Markov chain and the variance of the sample distribution. The form of these theorems will also determine the extent to which samples can be reused between the different risk estimation steps within an iteration, or even across iterations.

5.2 Suboptimal learning

Previous algorithms for this problem have been shown to converge for learning operators Π that are non-expansions:

    ‖ΠV₁ − ΠV₂‖ ≤ ‖V₁ − V₂‖.

The convergence results for Bridge hold for learning operators Π that perform agnostic learning. Unfortunately, there is a general lack of useful agnostic learning algorithms (the risk minimization step is typically intractable), so it would be beneficial to extend the results to learning systems that are not optimal. It is possible to weaken the conditions on the learning operator and give convergence results for Bridge that hold when Π satisfies the banana-fudge condition

    ‖ΠV₁ − V₁‖ ≤ k₁(‖V₁ − V₂‖) ‖ΠV₂ − V₂‖ + k₂(‖V₁ − V₂‖)

for some nondecreasing functions k₁ > 0 and k₂. The only modification necessary to Bridge is to include k₁ and k₂ in the calculation of the relative error bound errbound(β, γ, j). Note that for k₁(z) = 1 and k₂(z) = z, this condition reduces to

    ‖ΠV₁ − V₁‖ − ‖ΠV₂ − V₂‖ ≤ ‖V₁ − V₂‖,

which is in fact the property of agnostic learning that is used to derive the results in this paper. Intuitively, the nonexpansion condition for Π requires that two points that are close to each other are mapped close

to each other. The banana-fudge condition requires that two points that are close to each other are mapped a similar distance away, but they can be mapped in opposite directions and so end up very far from each other. The banana-fudge condition is obviously the weaker one, requiring only similarity in level of success and not similarity in outcome. It disallows the case where one function is learned very well, but another function very close to the first is learned very poorly. We are currently searching for learning algorithms that satisfy the banana-fudge condition, but unfortunately it seems most common practical learning algorithms do not.

6 Other Approaches

We briefly discuss other known alternatives to Bridge, as well as mention some of the new directions one might consider.

6.1 Alternatives

If H is convex and Π is the agnostic learning operator, then Π is a nonexpansion and V_{n+1} = ΠT V_n converges. For nonconvex H, an alternative approach to Bridge is to PAC-agnostically learn the convex hull of H using H at each iteration [Lee et al., 1995]. The resulting iterated procedure V_{n+1} = Π_{convex-hull(H)} T V_n converges, since Π_{convex-hull(H)} is a nonexpansion. Unfortunately, this algorithm requires many more agnostic learning steps per iteration than seems practical.

A noniterative method that returns the optimal answer is to reduce the value determination problem to a single instance of supervised learning by using the operator T^∞ (otherwise known as TD(λ = 1)). It does unlimited lookahead, has contraction factor γ^∞ = 0, and so it generates V* after just one iteration. Looked at another way, the distribution P_{T^∞}(·|x) has mean V*(x), and so ΠT^∞ = ΠV*. Unfortunately, there is empirical evidence that suggests the sample distribution P_{T^∞} is very hard to learn and requires very many samples (perhaps because it can have high variance). Strictly speaking, it is not necessary to back up values beyond the ε-horizon, which is log_γ(ε(1 − γ)/R_max).
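As a quick illustration of this truncation, the sketch below computes the ε-horizon m = ⌈log_γ(ε(1 − γ)/R_max)⌉, the smallest m for which the neglected tail γ^m R_max/(1 − γ) is at most ε. The numeric values of γ, ε, and R_max are arbitrary examples.

```python
import math

def eps_horizon(gamma, eps, r_max):
    """Smallest m with gamma^m * r_max / (1 - gamma) <= eps."""
    return math.ceil(math.log(eps * (1.0 - gamma) / r_max, gamma))

m = eps_horizon(gamma=0.9, eps=0.01, r_max=1.0)
print(m)

# Check directly: the tail of rewards beyond m steps is bounded by
# gamma^m * r_max / (1 - gamma), which should be at most eps.
print(0.9**m * 1.0 / (1.0 - 0.9) <= 0.01)
```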
Even this, however, may yield sample distributions with too much variance for practical use, although it is offset by the need to perform only a single learning step. Finally, it may be possible, using Lemma 1, to establish convergence rates and error bounds for the iterated procedure V_{n+1} = ΠT^m V_n, where m is less than the ε-horizon. However, m would probably have to be much larger than j, the number of lookahead steps used by Bridge, and so again we would expect bad sample complexity.

6.2 New Directions

There are many ways in which the basic tools used in constructing this algorithm might be used in constructing more powerful methods. Specifically, the geometric relationship between V, TV, and V* established in Lemma 1 is very useful in (1) providing geometric intuition to design new methods and (2) proving performance guarantees for these methods. One can think of many different ways to throw a bridge, and many different kinds of bridges to throw. For example, we establish W, the other end of the bridge, by learning the point V + (1/(1 − γ^j))(T^j V − V). This choice is rather arbitrary, picked to simplify the error bound analysis. One might try instead learning a point further from, or a little closer to, V. Once we establish W, we throw a one-dimensional, linear bridge from V to W and learn a point close to T^j V on this line (learning is equivalent to projection in linear hypothesis classes). One might try establishing more than two points with which to support the bridge. For example, given V and after establishing W(1), we could try establishing W(2) by learning a point strategically located far from both V and W(1) on the other side of the hypersphere defined by Lemma 1. Then we could throw a two-dimensional, planar bridge across these three points and project T^j V (or a point close by) onto this plane. We can continue in this way, considering methods that establish W(1), ..., W(n) and use an n-dimensional hyperplane to learn T^j V. In the logical limit, this method looks like a local version of [Lee et al., 1995], which learns the full convex hull.
It is local in that it only closes under weighted averaging those points of H that are closest to some point of the hypersphere defined by Lemma 1. Our current method, which only uses one-dimensional bridges, is effectively a light version of these convex-hull methods, in that before learning T^j V it closes under linear combinations only two points from H, namely V and W.

7 Conclusion

We have developed a method that reduces the value determination problem to the agnostic learning problem. Requesting that our algorithm halt in fewer iterations or with better error bounds pushes more of the complexity into the learning step, and in the limit effectively forces it to consider infinite lookahead, which is T^∞. Similarly, if we were to extend our algorithm to use more and more supports for the bridge, we suspect it would approximate the performance of the convex hull learning algorithm. Thus our method can be thought of as a more versatile, and hopefully more efficient, alternative to these aggressive methods. The key features that characterize our approach are (1) the complication of learning is abstracted into a learning operator Π, (2) we use a new operator B rather than being restricted to the backup operator T, (3) we form linear combinations of hypotheses from a class H rather than being limited to just H, and (4) we use Lemma 1 to prove convergence and error bound results. These techniques can be applied or modified to develop endless variations on Bridge, as well as completely new algorithms.

A big missing ingredient in justifying one method over another is sample complexity. In particular, we do not know how sample complexity depends on the lookahead j, or, in the case of Π_{ε,δ}, how it depends on δ, and so we cannot properly trade off these parameters to achieve the best performance. Our results are stated for the problem of value determination, but they apply to any situation with an operator that is a contraction with respect to a norm defined by a samplable distribution.
For the problem of value determination, the operator is T, the one-step backup operator, and it is a contraction under the norm defined by the stationary distribution of the Markov chain. As stated previously, this distribution can only be sampled exactly in the steady-state limit, so improvements in the theory are necessary. Finally, a big hurdle to a practical, implementable algorithm is the lack of useful, well-behaved (agnostic or not) learning algorithms. By applying the techniques used in developing Bridge, we hope to bridge the gap between the available supervised learning algorithms and those needed by theoretically justified reinforcement learning methods.

References

[Baird, 1995] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA, July 1995. Morgan Kaufmann.

[Bertsekas and Tsitsiklis, 1996] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Mass., 1996.

[Gordon, 1995] Geoffrey J. Gordon. Stable function approximation in dynamic programming. In Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA, July 1995. Morgan Kaufmann.

[Haussler, 1992] David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, 1992.

[Jaakkola et al., 1995] Tommi Jaakkola, Satinder P. Singh, and Michael I. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, pages 345-352, Cambridge, Massachusetts, 1995. MIT Press.

[Lee et al., 1995] W. S. Lee, P. L. Bartlett, and R. C. Williamson. On efficient agnostic learning of linear combinations of basis functions. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 369-376, 1995.

[Papavassiliou and Russell, 1998] V. Papavassiliou and S. Russell. Convergence of reinforcement learning with PAC function approximators. Technical Report UCB//CSD-98-1005, University of California, Berkeley, 1998.

[Sutton, 1988] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, August 1988.

[Tsitsiklis and Van Roy, 1996] John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. Technical Report LIDS-P-2322, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, 1996.
A Detailed Algorithm

Here we give details of the algorithm, as well as define the relative error bound errbound(β, γ, j). BridgeValueDet first calculates the necessary number of backup steps j in order to achieve the desired error bounds within the desired number of iterations. For each iteration it intelligently selects the parameter β_n and calls the subroutine BridgeStep with the current approximation. Finally, it detects when it can halt successfully and returns a weighted tree of hypotheses.

BridgeValueDet(ε, ε₀, N)
    κ_max = 2R_max / (ε₀(1 − γ));  β_goal = κ_max^{−1/N}
    Choose the smallest j such that errbound(β_goal, γ, j) ≤ ε
    n = 0;  β_total = 1;  V₀ = some initial hypothesis
    LOOP UNTIL β_total ≤ β_goal^N:
        Choose the smallest β_n such that errbound(β_n, γ, j) ≤ ε
        V_{n+1} = BridgeStep(V_n, β_n, j)
        β_total = β_total · β_n;  n = n + 1
    RETURN V_n

BridgeStep(V, β, j)
    B = I + (1/(1 − γ^j))(T^j − I)
    W = ΠBV;  u = ‖W − V‖
    v = r̂(W, P_{T^j V});  w = r̂(V, P_{T^j V})        (estimated risks)
    α = (w − v + u²) / (2u²)
    RETURN (1 − α)V + αW

errbound(β, γ, j)
    Compute the constants ζ₁, ζ₂, ζ₃, ζ₄, ζ₅ and s, each a closed-form function of β, γ, and j.
    IF ζ₄ ≥ ζ₅ THEN RETURN +∞
    ELSE RETURN the closed-form bound in ζ₁, ..., ζ₄ if s ≤ ζ₅,
         or the closed-form bound in ζ₁, ..., ζ₅ if s > ζ₅.
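To make the control flow above concrete, here is a deliberately simplified toy sketch of the bridge iteration: it uses an exact T^j on a small two-state Markov reward process, takes Π to be Euclidean projection onto a one-dimensional hypothesis line, and picks the bridge weight α by exact line search instead of the risk-estimate formula. Every numeric choice is invented for illustration; none of this reproduces the paper's actual parameter selection.

```python
import math

gamma, j = 0.8, 3   # illustrative discount and lookahead

# A toy two-state Markov reward process: V* = R + gamma * P V*  (T is affine).
P = [[0.9, 0.1],
     [0.5, 0.5]]
R = [1.0, 2.0]

def T(V):
    return [R[i] + gamma * sum(P[i][k] * V[k] for k in range(2)) for i in range(2)]

def Tj(V):
    for _ in range(j):
        V = T(V)
    return V

d = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]   # hypothesis line {(a, a)}

def proj(V):
    # Pi as Euclidean projection onto the line spanned by unit vector d.
    c = sum(vi * di for vi, di in zip(V, d))
    return [c * di for di in d]

def bridge_step(V):
    t = Tj(V)
    # Bridge end:  B V = V + (T^j V - V) / (1 - gamma^j),  then W = Pi(B V).
    BV = [vi + (ti - vi) / (1.0 - gamma**j) for vi, ti in zip(V, t)]
    W = proj(BV)
    # Exact line search for the point on the bridge [V, W] closest to T^j V.
    num = sum((ti - vi) * (wi - vi) for ti, vi, wi in zip(t, V, W))
    den = sum((wi - vi) ** 2 for wi, vi in zip(W, V)) or 1.0
    alpha = max(0.0, min(1.0, num / den))
    return [(1 - alpha) * vi + alpha * wi for vi, wi in zip(V, W)]

V = [0.0, 0.0]
for _ in range(30):
    V = bridge_step(V)
print(V)   # converges to a fixed point on the hypothesis line, about (6.805, 6.805)
```

Since V₀ and W both lie on the hypothesis line, every iterate stays on it, and the iteration contracts geometrically toward a fixed point on that line.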