An Analysis of Reinforcement Learning with Function Approximation
Francisco S. Melo, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Sean P. Meyn, Coordinated Science Lab, Urbana, IL 61801, USA
M. Isabel Ribeiro, Institute for Systems and Robotics, Lisboa, Portugal

Abstract

We address the problem of computing the optimal Q-function in Markov decision problems with infinite state-space. We analyze the convergence properties of several variations of Q-learning when combined with function approximation, extending the analysis of TD-learning in (Tsitsiklis & Van Roy, 1996a) to stochastic control settings. We identify conditions under which such approximate methods converge with probability 1. We conclude with a brief discussion on the general applicability of our results and compare them with several related works.

1. Introduction

Convergence of Q-learning with function approximation has been a long-standing open question in reinforcement learning (Sutton, 1999). In general, value-based reinforcement learning (RL) methods for optimal control behave poorly when combined with function approximation (Baird, 1995; Tsitsiklis & Van Roy, 1996a). In this paper, we address this problem by analyzing the convergence of Q-learning when combined with linear function approximation. We identify a set of conditions that imply the convergence of this approximation method with probability 1 (w.p.1), when a fixed learning policy is used, and provide an interpretation of the resulting approximation as the fixed point of a Bellman-like operator. This motivates the analysis of several variations of Q-learning when combined with linear function approximation. In particular, we study a variation of Q-learning using importance sampling and an on-policy variant of Q-learning (SARSA).

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

The paper is organized as follows. We start in Section 2 by describing Markov decision problems.
We proceed with our analysis of the Q-learning algorithm and its variants, and produce our main results in Section 3. We also compare our results with other related works in the RL literature. We conclude with some further discussion in Section 4.

2. Markov Decision Problems

Let (X, A, P, r, γ) be a Markov decision problem (MDP) with a compact state-space X ⊂ R^p and a finite action set A. The action-dependent kernel P_a defines the transition probabilities for the underlying controlled Markov chain {X_t} as

    P[X_{t+1} ∈ U | X_t = x, A_t = a] = P_a(x, U),

where U is any measurable subset of X. The A-valued process {A_t} represents the control process: A_t is the control action at time instant t.¹ Solving the MDP consists in determining the control process {A_t} maximizing the expected total discounted reward

    V({A_t}, x) = E[ Σ_{t=0}^∞ γ^t R(X_t, A_t) | X_0 = x ],

where 0 ≤ γ < 1 is a discount factor and R(x, a) represents a random reward received for taking action a ∈ A in state x ∈ X. For simplicity of notation, we consider a bounded deterministic function r : X × A × X → R assigning a reward r(x, a, y) every time a transition from x to y occurs after taking action a. This means that

    E[R(x, a)] = ∫_X r(x, a, y) P_a(x, dy).

¹ We take the control process {A_t} to be adapted to the σ-algebra induced by {X_t}.

The optimal value function V* is defined for each state x ∈ X as

    V*(x) = max_{A_t} V({A_t}, x) = max_{A_t} E[ Σ_{t=0}^∞ γ^t R(X_t, A_t) | X_0 = x ]

and verifies the Bellman optimality equation

    V*(x) = max_{a ∈ A} ∫_X [ r(x, a, y) + γ V*(y) ] P_a(x, dy).   (1)

V*(x) represents the expected total discounted reward received along an optimal trajectory starting at state x. We can also define the optimal Q-function Q* as

    Q*(x, a) = ∫_X [ r(x, a, y) + γ V*(y) ] P_a(x, dy),   (2)

representing the expected total discounted reward along a trajectory starting at state x obtained by choosing a as the first action and following the optimal policy thereafter. The control process {A_t} defined as

    A_t = arg max_{a ∈ A} Q*(X_t, a),   for all t,

is optimal in the sense that V({A_t}, x) = V*(x) and defines a mapping π* : X → A known as the optimal policy. The optimal policy determines the optimal decision rule for a given MDP. More generally, a (Markov) policy is any mapping π_t defined over X × A generating a control process {A_t} verifying, for all t,

    P[A_t = a | X_t = x] = π_t(x, a).

We write V^{π_t}(x) instead of V({A_t}, x) if the control process {A_t} is generated by policy π_t. A policy π_t is stationary if it does not depend on t and deterministic if it assigns probability 1 to a single action in each state, and is thus represented as a map π_t : X → A for every t. Notice that the optimal control process can be obtained from the optimal (stationary, deterministic) policy π*, which can in turn be obtained from Q*. Therefore, the optimal control problem is solved once the function Q* is known for all pairs (x, a). Now given any real function q defined over X × A, we define the Bellman operator

    (Hq)(x, a) = ∫_X [ r(x, a, y) + γ max_{u ∈ A} q(y, u) ] P_a(x, dy).   (3)

The function Q* in (2) is the fixed point of H and, since this operator is a contraction in the sup-norm, a fixed-point iteration can be used to determine Q* (at least theoretically).

2.1. The Q-Learning Algorithm

We previously suggested that a fixed-point iteration could be used to determine the function Q*. In practice, this requires two important conditions:

- The kernel P and the reward function r are known;
- The successive estimates for Q* can be represented compactly and stored in a computer with finite memory.

If P and/or r are not known, a fixed-point iteration using H is not possible. To solve this problem, Watkins proposed in 1989 the Q-learning algorithm (Watkins, 1989). Q-learning proceeds as follows: consider an MDP M = (X, A, P, r, γ) and suppose that {x_t} is an infinite sample trajectory of the underlying Markov chain obtained with some policy π_t. The corresponding sample control process is denoted as {a_t} and the sequence of obtained rewards as {r_t}. Given any initial estimate Q_0, Q-learning successively updates this estimate using the rule

    Q_{t+1}(x, a) = Q_t(x, a) + α_t(x, a) δ_t,   (4)

where {α_t} is a step-size sequence and δ_t is the temporal difference at time t,

    δ_t = r_t + γ max_{b ∈ A} Q_t(x_{t+1}, b) − Q_t(x_t, a_t).   (5)

If both X and A are finite sets, each estimate Q_t is simply a |X| × |A| matrix and can be represented explicitly in a computer. In that case, the convergence of Q-learning and several other related algorithms (such as TD(λ) or SARSA) has been thoroughly studied (see, for example, (Bertsekas & Tsitsiklis, 1996) and references therein). However, if either X or A is infinite or very large, explicitly representing each Q_t becomes infeasible and some form of compact representation is needed (e.g., using function approximation). In this paper, we address how several RL methods such as Q-learning and SARSA can be combined with function approximation and still retain their main convergence properties.
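The fixed-point iteration on H and the tabular update (4)-(5) can be sketched together on a small example. The 2-state, 2-action MDP below (its transition probabilities and rewards) is purely an illustrative assumption:

```python
import numpy as np

# Sketch: (i) compute Q* by fixed-point iteration on the Bellman operator H
# of (3); (ii) approach the same Q* with the tabular Q-learning rule (4)-(5).
# The MDP below is a hypothetical example, not taken from the paper.
rng = np.random.default_rng(0)
gamma = 0.5
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a][x, y]: transition kernel
              [[0.5, 0.5], [0.7, 0.3]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],    # r[a][x, y]: deterministic rewards
              [[0.5, 0.5], [1.0, 0.0]]])
n_actions, n_states = P.shape[0], P.shape[1]

def bellman_H(q):
    """(Hq)(x, a) = sum_y P_a(x, y) [ r(x, a, y) + gamma * max_u q(y, u) ]."""
    v = q.max(axis=1)                      # max_u q(y, u) for each state y
    return np.stack([(P[a] * (r[a] + gamma * v)).sum(axis=1)
                     for a in range(n_actions)], axis=1)

# (i) H is a sup-norm contraction with modulus gamma, so iterating it
# converges geometrically to its fixed point Q*.
q_star = np.zeros((n_states, n_actions))
for _ in range(100):
    q_star = bellman_H(q_star)

# (ii) Tabular Q-learning along a sample trajectory under a uniform policy,
# with per-pair step sizes alpha_t(x, a) = 1 / (visits to (x, a)).
Q = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))
x = 0
for t in range(20000):
    a = rng.integers(n_actions)
    y = rng.choice(n_states, p=P[a, x])
    d_t = r[a, x, y] + gamma * Q[y].max() - Q[x, a]   # temporal difference (5)
    visits[x, a] += 1
    Q[x, a] += d_t / visits[x, a]                     # update (4)
    x = y
```

In such a run the Q-learning iterates approach the fixed point of H computed in step (i), illustrating why the tabular algorithm converges when every pair (x, a) is visited infinitely often.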
3. Reinforcement Learning with Linear Function Approximation

In this section, we address the problem of determining the optimal Q-function for MDPs with infinite state-space X. Let Q = {Q_θ} be a family of real-valued functions defined in X × A. It is assumed that the function class is linearly parameterized, so that Q can be expressed as the linear span of a fixed set of M linearly independent functions φ_i : X × A → R. For each M-dimensional parameter vector θ ∈ R^M, the function Q_θ ∈ Q is defined as

    Q_θ(x, a) = Σ_{i=1}^M φ_i(x, a) θ(i) = φᵀ(x, a) θ,

where ᵀ represents the transpose operator. We will also denote the above function by Q(θ) to emphasize the dependence on θ over the dependence on (x, a).

Let π be a fixed stochastic, stationary policy and suppose that {x_t}, {a_t} and {r_t} are sample trajectories of states, actions and rewards obtained from the MDP using policy π. In the original Q-learning algorithm, the Q-values are updated according to (4). The temporal difference δ_t can be interpreted as a 1-step estimation error with respect to the optimal function Q*. The update rule in Q-learning moves the estimates Q_t closer to the desired function Q*, minimizing the expected value of δ_t. In our approximate setting, we apply the same underlying idea to obtain the update rule for approximate Q-learning:

    θ_{t+1} = θ_t + α_t ∇_θ Q_θ(x_t, a_t) δ_t = θ_t + α_t φ(x_t, a_t) δ_t,   (6)

where, as above, δ_t is the temporal difference at time t defined in (5). Notice that (6) updates θ_t using the temporal difference δ_t as the error. The gradient ∇_θ Q_θ provides the direction in which this update is performed. To establish convergence of the algorithm (6) we adopt an ODE argument, establishing the trajectories of the algorithm to closely follow those of an associated ODE with a globally asymptotically stable equilibrium point. This will require several regularity properties on the policy π and on its induced Markov chain that will, in turn, motivate the study of the on-policy version of the algorithm. This on-policy algorithm can be seen as an extension of SARSA to infinite settings.

3.1. Convergence of Q-Learning

We now proceed by identifying conditions that ensure the convergence of Q-learning with linear function approximation as described by (6).
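The mechanics of the update (6) can be sketched as follows. The small MDP, the basis functions, and the uniform learning policy are all illustrative assumptions; without a condition such as (7) below, this recursion is not guaranteed to converge:

```python
import numpy as np

# Sketch of the approximate Q-learning update (6): theta moves along
# phi(x_t, a_t), scaled by the temporal difference. All quantities below
# (MDP, features, behavior policy) are hypothetical illustrations.
rng = np.random.default_rng(1)
gamma = 0.9
n_states, n_actions, M = 3, 2, 4
phi = rng.standard_normal((n_states, n_actions, M))   # phi(x, a) in R^M
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a][x] over y
r = rng.uniform(0.0, 1.0, size=(n_actions, n_states, n_states))

theta = np.zeros(M)
x = 0
for t in range(20000):
    a = rng.integers(n_actions)               # fixed uniform learning policy
    y = rng.choice(n_states, p=P[a, x])
    # delta_t = r_t + gamma * max_b phi(y, b)^T theta - phi(x, a)^T theta
    d_t = r[a, x, y] + gamma * (phi[y] @ theta).max() - phi[x, a] @ theta
    theta = theta + (0.5 / (1 + t)) * phi[x, a] * d_t   # update (6)
    x = y
```

The decaying step sizes satisfy the usual conditions Σα_t = ∞, Σα_t² < ∞; whether the iterates actually converge depends on the interplay between the learning policy and the greedy maximization, which is exactly what the analysis below makes precise.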
Due to space limitations, we overlook some of the technical details in the proofs, which can easily be filled in.

We start by introducing some notation that will greatly simplify the presentation. Given an MDP M = (X, A, P, r, γ) with compact state-space X ⊂ R^p, let (X, P_π) be the Markov chain induced by a fixed policy π. We assume the chain (X, P_π) to be uniformly ergodic with invariant probability measure μ_X, and the policy π to verify π(x, a) > 0 for all a ∈ A and μ_X-almost all x ∈ X. We denote by μ_π the probability measure defined for each measurable set U ⊂ X and each action a ∈ A as

    μ_π(U × {a}) = ∫_U π(x, a) μ_X(dx).

Let now {φ_i, i = 1, ..., M} be a set of bounded, linearly independent basis functions to be used in our approximate Q-learning algorithm. We denote by Σ_π the matrix defined as

    Σ_π = E_π[ φ(x, a) φᵀ(x, a) ] = ∫_{X×A} φ φᵀ dμ_π.

Notice that the above expression is well-defined and independent of the initial distribution for the chain, due to our assumption of uniform ergodicity. For fixed θ ∈ R^M and x ∈ X, define the set of maximizing actions at x as

    A_x^θ = { a* ∈ A : φᵀ(x, a*) θ = max_{a} φᵀ(x, a) θ }

and the greedy policy with respect to θ as any policy that, at each state x, assigns positive probability only to actions in A_x^θ. Finally, let φ_x denote the row-vector φᵀ(x, a), where a is a random action generated according to the policy π at state x; likewise, let φ_x^θ denote the row-vector φᵀ(x, a_x^θ), where a_x^θ is now any action in A_x^θ. We now introduce the θ-dependent matrix

    Σ_π*(θ) = E_π[ (φ_x^θ)ᵀ φ_x^θ ].

By construction, both Σ_π and Σ_π*(θ) are positive definite, since the functions φ_i are assumed linearly independent. Notice also the difference between Σ_π and each Σ_π*(θ): the actions in the definition of the former are taken according to π, while in the latter they are taken greedily with respect to a particular θ. We are now in position to introduce our first result.

Theorem 1  Let M, π and {φ_i, i = 1, ..., M} be as defined above. If, for all θ,

    Σ_π > γ² Σ_π*(θ)   (7)

and the step-size sequence verifies

    Σ_t α_t = ∞,   Σ_t α_t² < ∞,

then the algorithm in (6) converges w.p.1 and the limit point θ* verifies the recursive relation

    Q(θ*) = Π_Q H Q(θ*),

where Π_Q is the orthogonal projection onto Q.²

² The orthogonal projection is naturally defined in the (infinite-dimensional) Hilbert space containing Q with inner product given by ⟨f, g⟩ = ∫_{X×A} f g dμ_π.

Proof  We establish the main statement of the theorem using a standard ODE argument. The assumptions on the chain (X, P_π) and basis functions {φ_i, i = 1, ..., M}, and the fact that π(x, a) > 0 for all a ∈ A and μ_X-almost every x ∈ X, ensure the applicability of Theorem 17 on page 239 of (Benveniste et al., 1990). Therefore, the convergence of the algorithm can be analyzed in terms of the stability of the equilibrium points of the associated ODE

    θ̇ = E_π[ φ_xᵀ ( r(x, a, y) + γ φ_y^θ θ − φ_x θ ) ],   (8)

where we omitted the explicit dependence of θ on t to avoid excessively cluttering the expression. If the ODE (8) has a globally asymptotically stable equilibrium point, this implies the algorithm (6) to converge w.p.1 (Benveniste et al., 1990).

Let then θ_1(t) and θ_2(t) be two trajectories of the ODE starting at different initial conditions, and let θ̃(t) = θ_1(t) − θ_2(t). From (8), we get

    (d/dt) ||θ̃||₂² = −2 θ̃ᵀ Σ_π θ̃ + 2γ E_π[ (φ_x θ̃)( φ_y^{θ₁} θ_1 − φ_y^{θ₂} θ_2 ) ].

Notice now that, from the definition of φ_y^{θ₁} and φ_y^{θ₂},

    φ_y^{θ₁} θ_2 ≤ φ_y^{θ₂} θ_2   and   φ_y^{θ₂} θ_1 ≤ φ_y^{θ₁} θ_1.

Taking this into account and defining the sets S₊ = { (x, a) : φᵀ(x, a) θ̃ > 0 } and S₋ = X × A \ S₊, the previous expression becomes

    (d/dt) ||θ̃||₂² ≤ −2 θ̃ᵀ Σ_π θ̃ + 2γ E_π[ (φ_x θ̃)( φ_y^{θ₁} θ̃ ) I_{S₊} ] + 2γ E_π[ (φ_x θ̃)( φ_y^{θ₂} θ̃ ) I_{S₋} ],

where I_S represents the indicator function for the set S. Applying Hölder's inequality to each of the expectations above, we get

    (d/dt) ||θ̃||₂² ≤ −2 θ̃ᵀ Σ_π θ̃ + 2γ √(E_π[ (φ_x θ̃)² I_{S₊} ]) √(E_π[ (φ_x^{θ₁} θ̃)² I_{S₊} ]) + 2γ √(E_π[ (φ_x θ̃)² I_{S₋} ]) √(E_π[ (φ_x^{θ₂} θ̃)² I_{S₋} ])

and a few simple computations finally yield

    (d/dt) ||θ̃||₂² ≤ −2 θ̃ᵀ Σ_π θ̃ + 2γ √(θ̃ᵀ Σ_π θ̃) · max( √(θ̃ᵀ Σ_π*(θ_1) θ̃), √(θ̃ᵀ Σ_π*(θ_2) θ̃) ).

Since, by assumption, Σ_π > γ² Σ_π*(θ), we can conclude from the expression above that (d/dt) ||θ̃||₂² < 0.
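This contraction can be observed numerically. In the sketch below, everything is an illustrative assumption: a small finite MDP, six evenly spaced unit feature vectors in R² (so that Σ_π = (1/2)I exactly under a uniform policy and state distribution), and a small γ, so that condition (7) holds. Two trajectories of the ODE (8), integrated with Euler steps, then approach each other:

```python
import numpy as np

# Euler integration of the ODE (8) from two initial conditions, under
# hypothetical choices for which condition (7) holds. mu below is a stand-in
# for the invariant measure mu_pi (uniform over state-action pairs).
rng = np.random.default_rng(4)
gamma, dt = 0.1, 0.01
n_states, n_actions, M = 3, 2, 2
angles = np.arange(6) * np.pi / 6                     # 6 evenly spaced unit vectors
phi = np.stack([np.cos(angles), np.sin(angles)], axis=1).reshape(n_states, n_actions, M)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.uniform(0.0, 1.0, size=(n_actions, n_states, n_states))
mu = 1.0 / (n_states * n_actions)

def ode_rhs(theta):
    """E_pi[ phi(x,a) ( r(x,a,y) + gamma * max_b phi(y,b)^T theta - phi(x,a)^T theta ) ]."""
    greedy_val = (phi @ theta).max(axis=1)            # max_b phi(y, b)^T theta
    out = np.zeros(M)
    for x in range(n_states):
        for a in range(n_actions):
            target = (P[a, x] * (r[a, x] + gamma * greedy_val)).sum()
            out += mu * phi[x, a] * (target - phi[x, a] @ theta)
    return out

theta1, theta2 = np.array([5.0, -3.0]), np.array([-4.0, 6.0])
d0 = np.linalg.norm(theta1 - theta2)
for _ in range(2000):                                 # integrate to time T = 20
    theta1 = theta1 + dt * ode_rhs(theta1)
    theta2 = theta2 + dt * ode_rhs(theta2)
d1 = np.linalg.norm(theta1 - theta2)
```

In such a run the distance between the two trajectories shrinks monotonically, matching the inequality (d/dt) ||θ̃||₂² < 0 derived above.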
This means, in particular, that θ̃(t) converges asymptotically to the origin, i.e., the ODE (8) is globally asymptotically stable. Since the ODE is autonomous (i.e., time-invariant), there exists one globally asymptotically stable equilibrium point for the ODE, which verifies the recursive relation

    θ* = Σ_π⁻¹ E_π[ φ_xᵀ ( r(x, a, y) + γ φ_y^{θ*} θ* ) ].   (9)

Since Σ_π is, by construction, positive definite, the inverse in (9) is well-defined. Multiplying (9) by φᵀ(x, a) on both sides yields the desired result. ∎

It is now important to observe that condition (7) is quite restrictive: since γ is usually taken close to 1, condition (7) essentially requires that, for every θ,

    max_{a ∈ A} φᵀ(x, a) θ ≈ Σ_{a ∈ A} π(x, a) φᵀ(x, a) θ.

Therefore, such a condition will seldom be met in practice, since it implies that the learning policy π is already close to the policy that the algorithm is meant to compute. In other words, the maximization above yields a policy close to the policy used during learning. And, when this is the case, the algorithm essentially behaves like an on-policy algorithm.

On the other hand, the above condition can be ensured by considering only a local maximization around the learning policy π. This is the most interesting aspect of the above result: it explicitly relates how much information the learning policy provides about greedy policies, as a function of γ. To better understand this, notice that each policy π is associated with a particular invariant measure on the induced chain (X, P_π). In particular, the measure associated with the learning policy may be very different from the one induced by the greedy/optimal policy. Taking into account the fact that γ measures, in a sense, the importance of the future, Theorem 1 basically states that:

- In problems where the performance of the agent greatly depends on future rewards (γ ≈ 1), the information provided by the learning policy can only be safely generalized to nearby greedy policies, in the sense of (7).
- In problems where the performance of the agent is less dependent on future rewards (γ ≪ 1), the information provided by the learning policy can be safely generalized to more general greedy policies.

Suppose then that the maximization in the update equation (6) is to be replaced by a local maximization. In other words, instead of maximizing over all actions in A, the algorithm should maximize over a small neighborhood of the learning policy π (in policy space). The difficulty with this approach is that such maximization can be hard to implement. The use of importance sampling can readily overcome such difficulty, by making the maximization in policy-space implicit. The algorithm thus obtained, which resembles in many aspects the one proposed in (Precup et al., 2001), is described by the update rule

    θ_{t+1} = θ_t + α_t φ(x_t, a_t) δ̂_t,   (10)

where the modified temporal difference δ̂_t is given by

    δ̂_t = r_t + γ [ π^θ(x_{t+1}, b) / π(x_{t+1}, b) ] Q_{θ_t}(x_{t+1}, b) − Q_{θ_t}(x_t, a_t),   (11)

where b is the action sampled from the learning policy π at state x_{t+1} and π^θ is, for example, a θ-dependent ε-greedy policy close to the learning policy π.³ A possible implementation of such algorithm is sketched in Figure 1.
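The reweighting in (11) can be sketched as follows. The next action is drawn from the learning policy π and its value is reweighted by the ratio π^θ/π, so the local maximization over policies never has to be carried out explicitly. The features, policies, and dynamics below are illustrative assumptions:

```python
import numpy as np

# Sketch of the importance-sampling variant (10)-(11). The epsilon-greedy
# pi_theta and the small MDP are hypothetical illustrations.
rng = np.random.default_rng(2)
gamma, eps = 0.9, 0.2
n_states, n_actions, M = 3, 2, 3
phi = rng.standard_normal((n_states, n_actions, M))
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.uniform(0.0, 1.0, size=(n_actions, n_states, n_states))
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform learning policy

def pi_theta(x, theta):
    """theta-dependent eps-greedy policy close to pi (everywhere positive)."""
    p = np.full(n_actions, eps / n_actions)
    p[(phi[x] @ theta).argmax()] += 1.0 - eps
    return p

theta = np.zeros(M)
x = 0
a = rng.choice(n_actions, p=pi[x])
for t in range(10000):
    y = rng.choice(n_states, p=P[a, x])
    b = rng.choice(n_actions, p=pi[y])                # next action drawn from pi
    rho = pi_theta(y, theta)[b] / pi[y, b]            # importance weight
    # Modified temporal difference (11)
    d_hat = r[a, x, y] + gamma * rho * (phi[y, b] @ theta) - phi[x, a] @ theta
    theta = theta + (0.5 / (1 + t)) * phi[x, a] * d_hat   # update (10)
    x, a = y, b
```

Because π^θ stays within ε of the learning policy, the importance weights remain bounded, which is what keeps the implicit maximization "local" in the sense discussed above.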
³ Given ε > 0, a policy π is ε-greedy with respect to a function Q_θ ∈ Q if, at each state x ∈ X, it chooses a random action with probability ε and a greedy action a ∈ A_x^θ with probability (1 − ε).

Algorithm 1 Modified Q-learning.
Require: Initial policy π_0;
 1: Initialize X_0 = x_0 and set N = 0;
 2: for t = 0 until T do
 3:   Sample A_t ~ π_N(x_t, ·);
 4:   Sample next-state X_{t+1} ~ P_{a_t}(x_t, ·);
 5:   r_t = r(x_t, a_t, x_{t+1});
 6:   Update θ_t according to (10);
 7: end for
 8: Set π_{N+1}(x, a) = π^θ(x, a);
 9: N = N + 1;
10: if π_N ≈ π_{N−1} then
11:   return π_N;
12: else
13:   Goto 2;
14: end if

We denote by π_N the behavior policy at iteration N of the algorithm; in the stopping condition for the algorithm, any adequate policy norm can be used.

3.2. SARSA with Linear Function Approximation

The analysis in the previous subsection suggests that on-policy algorithms may potentially yield more reliable convergence properties. Such a fact has already been observed in (Tsitsiklis & Van Roy, 1996a; Perkins & Pendrith, 2002). In this subsection we thus focus on on-policy algorithms. We analyze the convergence of SARSA when combined with linear function approximation. In our main result, we recover the essence of the result in (Perkins & Precup, 2003), although in a somewhat different setting. The main differences between our work and that in (Perkins & Precup, 2003) are discussed further ahead.

Once again, we consider a family Q of real-valued functions, the linear span of a fixed set of M linearly independent functions φ_i : X × A → R, and derive an on-policy algorithm to compute a parameter vector θ such that φᵀ(x, a)θ approximates the optimal Q-function. To this purpose, and unlike what has been done so far, we now consider a θ-dependent learning policy π_θ verifying π_θ(x, a) > 0 for all θ. In particular, we consider at each time step a learning policy π_{θ_t} that is ε-greedy with respect to φᵀ(x, a)θ_t and Lipschitz continuous with respect to θ, with Lipschitz constant C (with respect to some preferred metric).
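One way to obtain such an everywhere-positive learning policy that is Lipschitz continuous in θ is to smooth the ε-greedy rule with a softmax, whose temperature controls the Lipschitz constant. The sketch below drives the on-policy update analyzed in this subsection with such a policy; the MDP, features, and temperature are illustrative assumptions:

```python
import numpy as np

# Sketch of an on-policy (SARSA-style) update with linear features, using a
# smoothed eps-greedy policy: an eps-uniform term keeps pi_theta(x, a) > 0
# everywhere, and the softmax keeps it Lipschitz in theta. All quantities
# below are hypothetical illustrations.
rng = np.random.default_rng(3)
gamma, eps, tau = 0.9, 0.2, 2.0
n_states, n_actions, M = 3, 2, 3
phi = rng.standard_normal((n_states, n_actions, M))
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.uniform(0.0, 1.0, size=(n_actions, n_states, n_states))

def policy(x, theta):
    """Everywhere-positive, Lipschitz-in-theta policy; tau sets the smoothness."""
    q = tau * (phi[x] @ theta)
    p = np.exp(q - q.max())
    p /= p.sum()
    return eps / n_actions + (1.0 - eps) * p

theta = np.zeros(M)
x = 0
a = rng.choice(n_actions, p=policy(x, theta))
for t in range(10000):
    y = rng.choice(n_states, p=P[a, x])
    b = rng.choice(n_actions, p=policy(y, theta))     # on-policy next action
    # delta_t = r_t + gamma * phi(y, b)^T theta - phi(x, a)^T theta
    d_t = r[a, x, y] + gamma * (phi[y, b] @ theta) - phi[x, a] @ theta
    theta = theta + (0.5 / (1 + t)) * phi[x, a] * d_t  # SARSA-style update
    x, a = y, b
```

A sharper softmax (larger tau) makes the policy closer to greedy but increases its Lipschitz constant, which is precisely the trade-off that the constant C₀ in the theorem below constrains.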
We further assume that, for every fixed θ, the Markov chain (X, P_θ) induced by such a policy is uniformly ergodic.

Let then {x_t}, {a_t} and {r_t} be sample trajectories of states, actions and rewards obtained from the MDP M = (X, A, P, r, γ) using (at each time-step) the θ-dependent policy π_{θ_t}. The update rule for our approximate SARSA algorithm is:

    θ_{t+1} = θ_t + α_t φ(x_t, a_t) δ_t,   (12)

where δ_t is the temporal difference at time t,

    δ_t = r_t + γ φᵀ(x_{t+1}, a_{t+1}) θ_t − φᵀ(x_t, a_t) θ_t.

In order to use the SARSA algorithm above to approximate the optimal Q-function, it is necessary to slowly decay the exploration rate, ε, to zero, while guaranteeing the learning policy to verify the necessary regularity conditions (namely, Lipschitz continuity w.r.t. θ). However, as will soon become apparent, decreasing the exploration rate to zero will render our convergence result (and other related results) not applicable. We are now in position to introduce our main result.

Theorem 2  Let M, π_{θ_t} and {φ_i, i = 1, ..., M} be as defined above. Let C be the Lipschitz constant of the learning policy π_θ with respect to θ. Assume that the step-size sequence verifies

    Σ_t α_t = ∞,   Σ_t α_t² < ∞.

Then, there is C₀ > 0 such that, if C < C₀, the algorithm in (12) converges w.p.1.

Proof  We again use an ODE argument to establish the statement of the theorem. As before, the assumptions on (X, P_θ) and basis functions {φ_i, i = 1, ..., M}, and the fact that the learning policy is Lipschitz continuous with respect to θ and verifies π(x, a) > 0, ensure the applicability of Theorem 17 on page 239 of (Benveniste et al., 1990). Therefore, the convergence of the algorithm can be analyzed in terms of the stability of the associated ODE:

    θ̇ = E_θ[ φ_xᵀ ( r(x, a, y) + γ φ_y θ − φ_x θ ) ].   (13)

Notice that the expectation is taken with respect to the invariant measure of the chain and learning policy, both θ-dependent. To establish global asymptotic stability, we re-write (13) as

    θ̇(t) = A_θ θ(t) + b_θ,

where

    A_θ = E_θ[ φ_xᵀ ( γ φ_y − φ_x ) ];   b_θ = E_θ[ φ_xᵀ r(x, a, y) ].

An equilibrium point of (13) must verify θ* = −A_{θ*}⁻¹ b_{θ*}, and the existence of such an equilibrium point has been established in (de Farias & Van Roy, 2000) (Theorem 5.1). Let θ̃(t) = θ(t) − θ*. Then,

    (d/dt) ||θ̃||₂² = 2 θ̃ᵀ ( A_θ θ + b_θ ) = 2 θ̃ᵀ ( A_θ θ − A_{θ*} θ* + b_θ − b_{θ*} ).

Let

    λ_A = sup_{θ ≠ θ*} ||A_θ − A_{θ*}||₂ / ||θ − θ*||₂,   λ_b = sup_{θ ≠ θ*} ||b_θ − b_{θ*}||₂ / ||θ − θ*||₂,

where the norm in the definition of λ_A is the induced operator norm and the one in the definition of λ_b is the regular Euclidean norm. The previous expression thus becomes

    (d/dt) ||θ̃||₂² = 2 θ̃ᵀ A_{θ*} θ̃ + 2 θ̃ᵀ (A_θ − A_{θ*}) θ + 2 θ̃ᵀ (b_θ − b_{θ*}) ≤ 2 θ̃ᵀ A_{θ*} θ̃ + 2 (λ_A + λ_b) ||θ̃||₂².
Letting λ = λ_A + λ_b, the above expression can be written as

    (d/dt) ||θ̃||₂² ≤ 2 θ̃ᵀ ( A_{θ*} + λI ) θ̃.

The fact that the learning policy is assumed Lipschitz w.r.t. θ and the uniform ergodicity of the corresponding induced chain imply that A_θ and b_θ are also Lipschitz w.r.t. θ (with a different constant). This means that λ goes to zero with C and, therefore, for C sufficiently small, (A_{θ*} + λI) is a negative definite matrix.⁴ Therefore, the ODE (13) is globally asymptotically stable and the conclusion of the theorem follows. ∎

⁴ The fact that A_{θ*} is negative definite has been established in several works. See, for example, Lemma 3 in (Perkins & Precup, 2003) or, in a slightly different setting (easily extendable to our setting), the proof of Theorem 1 in (Tsitsiklis & Van Roy, 1996a).

Several remarks are now in order. First of all, Theorem 2 basically states that, for fixed ε, if the dependence of the learning policy π_θ on θ can be made sufficiently smooth, then SARSA converges w.p.1. This result is similar to the result in (Perkins & Precup, 2003), although the algorithms are not exactly similar: we consider a continuing task, while the algorithm featured in (Perkins & Precup, 2003) is implemented in an episodic fashion. Furthermore, in our case, convergence was established using an ODE argument, instead of the contraction argument in (Perkins & Precup, 2003). Nevertheless, both methods of proof are, in their essence, equivalent and the results in both papers concordant.

A second remark is related to the implementation of SARSA with a decaying exploration policy. The analysis of one such algorithm could be conducted using, once again, an ODE argument. In particular, SARSA could be described as a two-time-scale algorithm: the iterations of the main algorithm (corresponding to (12)) would develop on a faster time-scale and the decaying exploration rate would develop at a slower time-scale. The analysis in (Borkar, 1997) could then be replicated. However, it is well known that, as ε approaches zero, the learning policy will approach the greedy policy w.r.t. θ, which is, in general, discontinuous. Therefore, there is little hope that the smoothness condition in Theorem 2 (or its equivalent in (Perkins & Precup, 2003)) can be met as ε approaches zero.

4. Discussion

We now briefly discuss some of the assumptions in the above theorems. We start by emphasizing that all stated conditions are only sufficient, meaning that convergence may occur even if some (or all) of them fail to hold. We also discuss the relation between our results and other related works from the literature.

Secondly, uniform ergodicity of a Markov chain essentially means that the chain quickly converges to a stationary behavior uniformly over the state-space, and we can study any properties of the stationary chain by direct sampling.⁵ This property and the requirement that π(x, a) > 0 for all a ∈ A and μ_X-almost all x ∈ X can be interpreted as a continuous counterpart to the usual condition that all state-action pairs are visited infinitely often. In fact, uniform ergodicity implies that all the regions of the state-space with positive μ_X measure are sufficiently visited (Meyn & Tweedie, 1993), and the condition π(x, a) > 0 ensures that, at each state, every action is sufficiently tried. It appears to be a standard requirement in this continuous scenario, as it has also been used in other works (Tsitsiklis & Van Roy, 1996a; Singh et al., 1994; Perkins & Precup, 2003).⁶ The requirement that π(x, a) > 0 for all a ∈ A and μ_X-almost all x ∈ X also corresponds to the concept of fair control as introduced in (Borkar, 2000).

We also remark that the divergence example in (Gordon, 1996) is due to the fact that the learning policy fails to verify the Lipschitz continuity condition stated in Theorem 2 (as discussed in (Perkins & Pendrith, 2002)).

4.1. Related Work

In this paper, we analyzed how RL algorithms can be combined with linear function approximation to approximate the optimal Q-function in MDPs with infinite state-spaces. In the last decade or so, several authors have addressed this same problem from different perspectives.
We now briefly discuss several such approaches and their relation to the results in this paper.

⁵ Explicit bounds on the rate of convergence to stationarity are available in the literature (Meyn & Tweedie, 1994; Diaconis & Saloff-Coste, 1996; Rosenthal, 2002). However, for general chains, such bounds tend to be loose.

⁶ Most of these works make use of geometric ergodicity which, since we admit a compact state-space, is a consequence of uniform ergodicity.

One possible approach is to rely on soft-state aggregation (Singh et al., 1994; Gordon, 1995; Tsitsiklis & Van Roy, 1996b), partitioning the state-space into soft regions. Treating the soft regions as hyper-states, these methods then use standard learning methods (such as Q-learning or SARSA) to approximate the optimal Q-function. The main differences between such methods and those using linear function approximation (such as the ones portrayed here) are that, in the former, only one component of the parameter vector is updated at each iteration and the basis functions are, by construction, constrained to be positive and to add to one at each point of the state-space.

Sample-based methods (Ormoneit & Sen, 2002; Szepesvári & Smart, 2004) further generalize the applicability of soft-state aggregation methods by using spreading functions/kernels (Ribeiro & Szepesvári, 1996). Sample-based methods thus exhibit a superior convergence rate when compared with simple soft-state aggregation methods, although under somewhat more restrictive conditions.

Finally, RL with general linear function approximation was thoroughly studied in (Tsitsiklis & Van Roy, 1996a; Tadić, 2001). Posterior works extended the applicability of such results. In (Precup et al., 2001), an off-policy convergent algorithm was proposed that uses an importance-sampling principle similar to the one described in Section 3.
In (Perkins & Precup, 2003), the authors established the convergence of SARSA with linear function approximation.

4.2. Concluding Remarks

We conclude by observing that all methods analyzed here, as well as those surveyed above, experience a degradation in performance as the distance between the target function and the chosen linear space increases. If the functions in the chosen linear space provide only a poor approximation of the desired function, there are no practical guarantees on the usefulness of such approximation. The error bounds derived in (Tsitsiklis & Van Roy, 1996a) are reassuring in that they state that the performance of approximate TD gracefully degrades as the distance between the target function and the chosen linear space increases. Although we have not addressed such topic in our analysis, we expect the error bounds in (Tsitsiklis & Van Roy, 1996a) to carry over with little change to our setting.

Finally, we make no use of eligibility traces in our algorithms. However, it is to be expected that the methods described herein can easily be adapted to accommodate eligibility traces, eventually yielding better approximations (with tighter error bounds) (Tsitsiklis & Van Roy, 1996a).

Acknowledgements

The authors would like to acknowledge the many useful comments from Doina Precup and the anonymous reviewers. Francisco S. Melo acknowledges the support from the ICTI and the Portuguese FCT, under the Carnegie Mellon-Portugal Program. Sean Meyn gratefully acknowledges the financial support from the National Science Foundation (ECS ). Isabel Ribeiro was supported by the FCT in the frame of ISR-Lisboa pluriannual funding. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References

Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. Proc. 12th Int. Conf. Machine Learning.

Benveniste, A., Métivier, M., & Priouret, P. (1990). Adaptive algorithms and stochastic approximations, vol. 22. Springer-Verlag.

Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Athena Scientific.

Borkar, V. (1997). Stochastic approximation with two time scales. Systems & Control Letters, 29.

Borkar, V. (2000). A learning algorithm for discrete-time stochastic control. Probability in the Engineering and Informational Sciences, 14.

de Farias, D., & Van Roy, B. (2000). On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105.

Diaconis, P., & Saloff-Coste, L. (1996). Logarithmic Sobolev inequalities for finite Markov chains. Annals of Applied Probability, 6.

Gordon, G. (1995). Stable function approximation in dynamic programming (Technical Report CMU-CS ). School of Computer Science, Carnegie Mellon University.

Gordon, G. (1996). Chattering in SARSA(λ) (Technical Report). CMU Learning Lab Internal Report.

Meyn, S., & Tweedie, R. (1993). Markov chains and stochastic stability. Springer-Verlag.

Meyn, S., & Tweedie, R. (1994). Computable bounds for geometric convergence rates of Markov chains. Annals of Applied Probability, 4.

Ormoneit, D., & Sen, S. (2002). Kernel-based reinforcement learning. Machine Learning, 49.

Perkins, T., & Pendrith, M. (2002). On the existence of fixed-points for Q-learning and SARSA in partially observable domains. Proc. 19th Int. Conf. Machine Learning.

Perkins, T., & Precup, D. (2003). A convergent form of approximate policy iteration. Adv. Neural Information Proc. Systems.

Precup, D., Sutton, R., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. Proc. 18th Int. Conf. Machine Learning.

Ribeiro, C., & Szepesvári, C. (1996). Q-learning combined with spreading: Convergence and results. Proc. ISRF-IEE Int. Conf. Intelligent and Cognitive Systems.

Rosenthal, J. (2002). Quantitative convergence rates of Markov chains: A simple account. Electronic Communications in Probability, 7.

Singh, S., Jaakkola, T., & Jordan, M. (1994). Reinforcement learning with soft state aggregation. Adv. Neural Information Proc. Systems.

Sutton, R. (1999). Open theoretical questions in reinforcement learning. Lecture Notes in Computer Science, 1572.

Szepesvári, C., & Smart, W. (2004). Interpolation-based Q-learning. Proc. 21st Int. Conf. Machine Learning.

Tadić, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42.

Tsitsiklis, J., & Van Roy, B. (1996a). An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control, 42.

Tsitsiklis, J., & Van Roy, B. (1996b). Feature-based methods for large scale dynamic programming. Machine Learning, 22.

Watkins, C. (1989). Learning from delayed rewards. Doctoral dissertation, King's College, University of Cambridge.
More informationLectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs
Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent
More informationu!i = a T u = 0. Then S satisfies
Deterministic Conitions for Subspace Ientifiability from Incomplete Sampling Daniel L Pimentel-Alarcón, Nigel Boston, Robert D Nowak University of Wisconsin-Maison Abstract Consier an r-imensional subspace
More informationAn Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback
Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an
More information19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control
19 Eigenvalues, Eigenvectors, Orinary Differential Equations, an Control This section introuces eigenvalues an eigenvectors of a matrix, an iscusses the role of the eigenvalues in etermining the behavior
More informationLeast-Squares Regression on Sparse Spaces
Least-Squares Regression on Sparse Spaces Yuri Grinberg, Mahi Milani Far, Joelle Pineau School of Computer Science McGill University Montreal, Canaa {ygrinb,mmilan1,jpineau}@cs.mcgill.ca 1 Introuction
More information7.1 Support Vector Machine
67577 Intro. to Machine Learning Fall semester, 006/7 Lecture 7: Support Vector Machines an Kernel Functions II Lecturer: Amnon Shashua Scribe: Amnon Shashua 7. Support Vector Machine We return now to
More informationPDE Notes, Lecture #11
PDE Notes, Lecture # from Professor Jalal Shatah s Lectures Febuary 9th, 2009 Sobolev Spaces Recall that for u L loc we can efine the weak erivative Du by Du, φ := udφ φ C0 If v L loc such that Du, φ =
More informationMath Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors
Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+
More informationImproved Rate-Based Pull and Push Strategies in Large Distributed Networks
Improve Rate-Base Pull an Push Strategies in Large Distribute Networks Wouter Minnebo an Benny Van Hout Department of Mathematics an Computer Science University of Antwerp - imins Mielheimlaan, B-00 Antwerp,
More informationLower Bounds for the Smoothed Number of Pareto optimal Solutions
Lower Bouns for the Smoothe Number of Pareto optimal Solutions Tobias Brunsch an Heiko Röglin Department of Computer Science, University of Bonn, Germany brunsch@cs.uni-bonn.e, heiko@roeglin.org Abstract.
More informationAgmon Kolmogorov Inequalities on l 2 (Z d )
Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,
More informationThe canonical controllers and regular interconnection
Systems & Control Letters ( www.elsevier.com/locate/sysconle The canonical controllers an regular interconnection A.A. Julius a,, J.C. Willems b, M.N. Belur c, H.L. Trentelman a Department of Applie Mathematics,
More informationELEC3114 Control Systems 1
ELEC34 Control Systems Linear Systems - Moelling - Some Issues Session 2, 2007 Introuction Linear systems may be represente in a number of ifferent ways. Figure shows the relationship between various representations.
More informationJUST THE MATHS UNIT NUMBER DIFFERENTIATION 2 (Rates of change) A.J.Hobson
JUST THE MATHS UNIT NUMBER 10.2 DIFFERENTIATION 2 (Rates of change) by A.J.Hobson 10.2.1 Introuction 10.2.2 Average rates of change 10.2.3 Instantaneous rates of change 10.2.4 Derivatives 10.2.5 Exercises
More informationd dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1
Lecture 5 Some ifferentiation rules Trigonometric functions (Relevant section from Stewart, Seventh Eition: Section 3.3) You all know that sin = cos cos = sin. () But have you ever seen a erivation of
More informationLyapunov Functions. V. J. Venkataramanan and Xiaojun Lin. Center for Wireless Systems and Applications. School of Electrical and Computer Engineering,
On the Queue-Overflow Probability of Wireless Systems : A New Approach Combining Large Deviations with Lyapunov Functions V. J. Venkataramanan an Xiaojun Lin Center for Wireless Systems an Applications
More informationLinear First-Order Equations
5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)
More informationLeaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes
Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,
More informationMath 1B, lecture 8: Integration by parts
Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores
More informationCalculus and optimization
Calculus an optimization These notes essentially correspon to mathematical appenix 2 in the text. 1 Functions of a single variable Now that we have e ne functions we turn our attention to calculus. A function
More informationTractability results for weighted Banach spaces of smooth functions
Tractability results for weighte Banach spaces of smooth functions Markus Weimar Mathematisches Institut, Universität Jena Ernst-Abbe-Platz 2, 07740 Jena, Germany email: markus.weimar@uni-jena.e March
More informationSystems & Control Letters
Systems & ontrol Letters ( ) ontents lists available at ScienceDirect Systems & ontrol Letters journal homepage: www.elsevier.com/locate/sysconle A converse to the eterministic separation principle Jochen
More informationNested Saturation with Guaranteed Real Poles 1
Neste Saturation with Guarantee Real Poles Eric N Johnson 2 an Suresh K Kannan 3 School of Aerospace Engineering Georgia Institute of Technology, Atlanta, GA 3332 Abstract The global stabilization of asymptotically
More informationThe derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)
Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)
More informationTEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE
TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,
More informationLATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION
The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische
More informationSwitching Time Optimization in Discretized Hybrid Dynamical Systems
Switching Time Optimization in Discretize Hybri Dynamical Systems Kathrin Flaßkamp, To Murphey, an Sina Ober-Blöbaum Abstract Switching time optimization (STO) arises in systems that have a finite set
More informationSummary: Differentiation
Techniques of Differentiation. Inverse Trigonometric functions The basic formulas (available in MF5 are: Summary: Differentiation ( sin ( cos The basic formula can be generalize as follows: Note: ( sin
More informationClosed and Open Loop Optimal Control of Buffer and Energy of a Wireless Device
Close an Open Loop Optimal Control of Buffer an Energy of a Wireless Device V. S. Borkar School of Technology an Computer Science TIFR, umbai, Inia. borkar@tifr.res.in A. A. Kherani B. J. Prabhu INRIA
More informationChaos, Solitons and Fractals Nonlinear Science, and Nonequilibrium and Complex Phenomena
Chaos, Solitons an Fractals (7 64 73 Contents lists available at ScienceDirect Chaos, Solitons an Fractals onlinear Science, an onequilibrium an Complex Phenomena journal homepage: www.elsevier.com/locate/chaos
More informationSlide10 Haykin Chapter 14: Neurodynamics (3rd Ed. Chapter 13)
Slie10 Haykin Chapter 14: Neuroynamics (3r E. Chapter 13) CPSC 636-600 Instructor: Yoonsuck Choe Spring 2012 Neural Networks with Temporal Behavior Inclusion of feeback gives temporal characteristics to
More informationOptimal Control of Spatially Distributed Systems
Optimal Control of Spatially Distribute Systems Naer Motee an Ali Jababaie Abstract In this paper, we stuy the structural properties of optimal control of spatially istribute systems. Such systems consist
More informationA Note on Exact Solutions to Linear Differential Equations by the Matrix Exponential
Avances in Applie Mathematics an Mechanics Av. Appl. Math. Mech. Vol. 1 No. 4 pp. 573-580 DOI: 10.4208/aamm.09-m0946 August 2009 A Note on Exact Solutions to Linear Differential Equations by the Matrix
More informationMulti-robot Formation Control Using Reinforcement Learning Method
Multi-robot Formation Control Using Reinforcement Learning Metho Guoyu Zuo, Jiatong Han, an Guansheng Han School of Electronic Information & Control Engineering, Beijing University of Technology, Beijing
More informationCascaded redundancy reduction
Network: Comput. Neural Syst. 9 (1998) 73 84. Printe in the UK PII: S0954-898X(98)88342-5 Cascae reunancy reuction Virginia R e Sa an Geoffrey E Hinton Department of Computer Science, University of Toronto,
More informationPolynomial Inclusion Functions
Polynomial Inclusion Functions E. e Weert, E. van Kampen, Q. P. Chu, an J. A. Muler Delft University of Technology, Faculty of Aerospace Engineering, Control an Simulation Division E.eWeert@TUDelft.nl
More informationExponential Tracking Control of Nonlinear Systems with Actuator Nonlinearity
Preprints of the 9th Worl Congress The International Feeration of Automatic Control Cape Town, South Africa. August -9, Exponential Tracking Control of Nonlinear Systems with Actuator Nonlinearity Zhengqiang
More informationThermal conductivity of graded composites: Numerical simulations and an effective medium approximation
JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University
More informationSeparation of Variables
Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical
More informationPerturbation Analysis and Optimization of Stochastic Flow Networks
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. XX, NO. Y, MMM 2004 1 Perturbation Analysis an Optimization of Stochastic Flow Networks Gang Sun, Christos G. Cassanras, Yorai Wari, Christos G. Panayiotou,
More informationChapter 6: Energy-Momentum Tensors
49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.
More informationExpected Value of Partial Perfect Information
Expecte Value of Partial Perfect Information Mike Giles 1, Takashi Goa 2, Howar Thom 3 Wei Fang 1, Zhenru Wang 1 1 Mathematical Institute, University of Oxfor 2 School of Engineering, University of Tokyo
More informationWUCHEN LI AND STANLEY OSHER
CONSTRAINED DYNAMICAL OPTIMAL TRANSPORT AND ITS LAGRANGIAN FORMULATION WUCHEN LI AND STANLEY OSHER Abstract. We propose ynamical optimal transport (OT) problems constraine in a parameterize probability
More informationOn combinatorial approaches to compressed sensing
On combinatorial approaches to compresse sensing Abolreza Abolhosseini Moghaam an Hayer Raha Department of Electrical an Computer Engineering, Michigan State University, East Lansing, MI, U.S. Emails:{abolhos,raha}@msu.eu
More informationLecture 6: Calculus. In Song Kim. September 7, 2011
Lecture 6: Calculus In Song Kim September 7, 20 Introuction to Differential Calculus In our previous lecture we came up with several ways to analyze functions. We saw previously that the slope of a linear
More informationConvergence of Random Walks
Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of
More informationII. First variation of functionals
II. First variation of functionals The erivative of a function being zero is a necessary conition for the etremum of that function in orinary calculus. Let us now tackle the question of the equivalent
More informationBEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS. Mauro Boccadoro Magnus Egerstedt Paolo Valigi Yorai Wardi
BEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS Mauro Boccaoro Magnus Egerstet Paolo Valigi Yorai Wari {boccaoro,valigi}@iei.unipg.it Dipartimento i Ingegneria Elettronica
More informationThe Exact Form and General Integrating Factors
7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily
More informationSurvey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013
Survey Sampling Kosuke Imai Department of Politics, Princeton University February 19, 2013 Survey sampling is one of the most commonly use ata collection methos for social scientists. We begin by escribing
More informationMulti-agent Systems Reaching Optimal Consensus with Time-varying Communication Graphs
Preprints of the 8th IFAC Worl Congress Multi-agent Systems Reaching Optimal Consensus with Time-varying Communication Graphs Guoong Shi ACCESS Linnaeus Centre, School of Electrical Engineering, Royal
More informationMath 342 Partial Differential Equations «Viktor Grigoryan
Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite
More informationSituation awareness of power system based on static voltage security region
The 6th International Conference on Renewable Power Generation (RPG) 19 20 October 2017 Situation awareness of power system base on static voltage security region Fei Xiao, Zi-Qing Jiang, Qian Ai, Ran
More informationNumber of wireless sensors needed to detect a wildfire
Number of wireless sensors neee to etect a wilfire Pablo I. Fierens Instituto Tecnológico e Buenos Aires (ITBA) Physics an Mathematics Department Av. Maero 399, Buenos Aires, (C1106ACD) Argentina pfierens@itba.eu.ar
More informationConservation Laws. Chapter Conservation of Energy
20 Chapter 3 Conservation Laws In orer to check the physical consistency of the above set of equations governing Maxwell-Lorentz electroynamics [(2.10) an (2.12) or (1.65) an (1.68)], we examine the action
More informationEuler equations for multiple integrals
Euler equations for multiple integrals January 22, 2013 Contents 1 Reminer of multivariable calculus 2 1.1 Vector ifferentiation......................... 2 1.2 Matrix ifferentiation........................
More informationRobustness and Perturbations of Minimal Bases
Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important
More informationOptimal Control of Spatially Distributed Systems
Optimal Control of Spatially Distribute Systems Naer Motee an Ali Jababaie Abstract In this paper, we stuy the structural properties of optimal control of spatially istribute systems. Such systems consist
More informationGeneralized Tractability for Multivariate Problems
Generalize Tractability for Multivariate Problems Part II: Linear Tensor Prouct Problems, Linear Information, an Unrestricte Tractability Michael Gnewuch Department of Computer Science, University of Kiel,
More informationarxiv: v1 [math.mg] 10 Apr 2018
ON THE VOLUME BOUND IN THE DVORETZKY ROGERS LEMMA FERENC FODOR, MÁRTON NASZÓDI, AND TAMÁS ZARNÓCZ arxiv:1804.03444v1 [math.mg] 10 Apr 2018 Abstract. The classical Dvoretzky Rogers lemma provies a eterministic
More informationMulti-View Clustering via Canonical Correlation Analysis
Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in
More informationSharp Thresholds. Zachary Hamaker. March 15, 2010
Sharp Threshols Zachary Hamaker March 15, 2010 Abstract The Kolmogorov Zero-One law states that for tail events on infinite-imensional probability spaces, the probability must be either zero or one. Behavior
More informationLECTURE NOTES ON DVORETZKY S THEOREM
LECTURE NOTES ON DVORETZKY S THEOREM STEVEN HEILMAN Abstract. We present the first half of the paper [S]. In particular, the results below, unless otherwise state, shoul be attribute to G. Schechtman.
More informationA new proof of the sharpness of the phase transition for Bernoulli percolation on Z d
A new proof of the sharpness of the phase transition for Bernoulli percolation on Z Hugo Duminil-Copin an Vincent Tassion October 8, 205 Abstract We provie a new proof of the sharpness of the phase transition
More informationEstimation of the Maximum Domination Value in Multi-Dimensional Data Sets
Proceeings of the 4th East-European Conference on Avances in Databases an Information Systems ADBIS) 200 Estimation of the Maximum Domination Value in Multi-Dimensional Data Sets Eleftherios Tiakas, Apostolos.
More informationFLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction
FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number
More informationPerfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs
Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite
More informationConvergence rates of moment-sum-of-squares hierarchies for optimal control problems
Convergence rates of moment-sum-of-squares hierarchies for optimal control problems Milan Kora 1, Diier Henrion 2,3,4, Colin N. Jones 1 Draft of September 8, 2016 Abstract We stuy the convergence rate
More informationA PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
A PAC-Bayesian Approach to Spectrally-Normalize Margin Bouns for Neural Networks Behnam Neyshabur, Srinah Bhojanapalli, Davi McAllester, Nathan Srebro Toyota Technological Institute at Chicago {bneyshabur,
More informationPhysics 5153 Classical Mechanics. The Virial Theorem and The Poisson Bracket-1
Physics 5153 Classical Mechanics The Virial Theorem an The Poisson Bracket 1 Introuction In this lecture we will consier two applications of the Hamiltonian. The first, the Virial Theorem, applies to systems
More informationMonotonicity for excited random walk in high dimensions
Monotonicity for excite ranom walk in high imensions Remco van er Hofsta Mark Holmes March, 2009 Abstract We prove that the rift θ, β) for excite ranom walk in imension is monotone in the excitement parameter
More informationImplicit Differentiation
Implicit Differentiation Thus far, the functions we have been concerne with have been efine explicitly. A function is efine explicitly if the output is given irectly in terms of the input. For instance,
More informationθ x = f ( x,t) could be written as
9. Higher orer PDEs as systems of first-orer PDEs. Hyperbolic systems. For PDEs, as for ODEs, we may reuce the orer by efining new epenent variables. For example, in the case of the wave equation, (1)
More informationPermanent vs. Determinant
Permanent vs. Determinant Frank Ban Introuction A major problem in theoretical computer science is the Permanent vs. Determinant problem. It asks: given an n by n matrix of ineterminates A = (a i,j ) an
More informationthere is no special reason why the value of y should be fixed at y = 0.3. Any y such that
25. More on bivariate functions: partial erivatives integrals Although we sai that the graph of photosynthesis versus temperature in Lecture 16 is like a hill, in the real worl hills are three-imensional
More informationA Sketch of Menshikov s Theorem
A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p
More informationAll s Well That Ends Well: Supplementary Proofs
All s Well That Ens Well: Guarantee Resolution of Simultaneous Rigi Boy Impact 1:1 All s Well That Ens Well: Supplementary Proofs This ocument complements the paper All s Well That Ens Well: Guarantee
More informationAcute sets in Euclidean spaces
Acute sets in Eucliean spaces Viktor Harangi April, 011 Abstract A finite set H in R is calle an acute set if any angle etermine by three points of H is acute. We examine the maximal carinality α() of
More informationIPA Derivatives for Make-to-Stock Production-Inventory Systems With Backorders Under the (R,r) Policy
IPA Derivatives for Make-to-Stock Prouction-Inventory Systems With Backorers Uner the (Rr) Policy Yihong Fan a Benamin Melame b Yao Zhao c Yorai Wari Abstract This paper aresses Infinitesimal Perturbation
More informationLearning Automata in Games with Memory with Application to Circuit-Switched Routing
Learning Automata in Games with Memory with Application to Circuit-Switche Routing Murat Alanyali Abstract A general setting is consiere in which autonomous users interact by means of a finite-state controlle
More informationBalancing Expected and Worst-Case Utility in Contracting Models with Asymmetric Information and Pooling
Balancing Expecte an Worst-Case Utility in Contracting Moels with Asymmetric Information an Pooling R.B.O. erkkamp & W. van en Heuvel & A.P.M. Wagelmans Econometric Institute Report EI2018-01 9th January
More informationHow to Minimize Maximum Regret in Repeated Decision-Making
How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:
More informationEquilibrium in Queues Under Unknown Service Times and Service Value
University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University
More informationOn the number of isolated eigenvalues of a pair of particles in a quantum wire
On the number of isolate eigenvalues of a pair of particles in a quantum wire arxiv:1812.11804v1 [math-ph] 31 Dec 2018 Joachim Kerner 1 Department of Mathematics an Computer Science FernUniversität in
More informationSturm-Liouville Theory
LECTURE 5 Sturm-Liouville Theory In the three preceing lectures I emonstrate the utility of Fourier series in solving PDE/BVPs. As we ll now see, Fourier series are just the tip of the iceberg of the theory
More informationSchrödinger s equation.
Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of
More information'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21
Large amping in a structural material may be either esirable or unesirable, epening on the engineering application at han. For example, amping is a esirable property to the esigner concerne with limiting
More informationIntroduction to the Vlasov-Poisson system
Introuction to the Vlasov-Poisson system Simone Calogero 1 The Vlasov equation Consier a particle with mass m > 0. Let x(t) R 3 enote the position of the particle at time t R an v(t) = ẋ(t) = x(t)/t its
More informationOPTIMAL CONTROL OF A PRODUCTION SYSTEM WITH INVENTORY-LEVEL-DEPENDENT DEMAND
Applie Mathematics E-Notes, 5(005), 36-43 c ISSN 1607-510 Available free at mirror sites of http://www.math.nthu.eu.tw/ amen/ OPTIMAL CONTROL OF A PRODUCTION SYSTEM WITH INVENTORY-LEVEL-DEPENDENT DEMAND
More informationORDINARY DIFFERENTIAL EQUATIONS AND SINGULAR INTEGRALS. Gianluca Crippa
Manuscript submitte to AIMS Journals Volume X, Number 0X, XX 200X Website: http://aimsciences.org pp. X XX ORDINARY DIFFERENTIAL EQUATIONS AND SINGULAR INTEGRALS Gianluca Crippa Departement Mathematik
More informationCombinatorica 9(1)(1989) A New Lower Bound for Snake-in-the-Box Codes. Jerzy Wojciechowski. AMS subject classification 1980: 05 C 35, 94 B 25
Combinatorica 9(1)(1989)91 99 A New Lower Boun for Snake-in-the-Box Coes Jerzy Wojciechowski Department of Pure Mathematics an Mathematical Statistics, University of Cambrige, 16 Mill Lane, Cambrige, CB2
More information