Linear stochastic approximation driven by slowly varying Markov chains


Systems & Control Letters 50 (2003) 95-102

Vijay R. Konda, John N. Tsitsiklis
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA

Received 5 June 2002; received in revised form 30 December 2002; accepted 23 February 2003

Abstract

We study a linear stochastic approximation algorithm that arises in the context of reinforcement learning. The algorithm employs a decreasing step-size, and is driven by Markov noise with time-varying statistics. We show that under suitable conditions, the algorithm can track the changes in the statistics of the Markov noise, as long as these changes are slower than the rate at which the step-size of the algorithm goes to zero. © 2003 Elsevier B.V. All rights reserved.

Keywords: Stochastic approximation; Adaptive algorithms; Reinforcement learning

Corresponding author. E-mail addresses: konda@alum.mit.edu (V.R. Konda), jnt@mit.edu (J.N. Tsitsiklis).

0. Introduction

The convergence of stochastic approximation algorithms driven by ergodic noise sequences has been extensively studied in the literature [1,5,9,10]. Typical convergence results show that the iterates converge to a stationary point of the vector field obtained by averaging the update direction with respect to the statistics of the driving noise. In some applications, the statistics of the driving noise change with time. In such cases, the point to which the algorithm would converge if the noise were held stationary also changes with time. In this context, a meaningful objective is to have the stochastic approximation algorithm track this changing point closely, after an initial transient period. Algorithms of this type are often called adaptive, as they adapt themselves to the changing environment; for a textbook account of adaptive algorithms, see [1].

The tracking ability of adaptive algorithms has been analyzed in several contexts [6,14]. For example, it is known that the usual constant step-size stochastic approximation algorithms can adapt to changes in the statistics of the driving noise that are slow relative to the step-size of the algorithm. However, the tracking ability of decreasing step-size stochastic approximation algorithms has not been studied before, because of the nature of the assumptions that would be required: for the changes in the statistics of the driving noise to remain slow relative to the decreasing step-size, these changes would have to become progressively slower. Such an assumption would be too restrictive and unnatural in applications where the noise is exogenously generated.

In some contexts, however, the statistics of the driving noise depend only on a parameter that is deliberately changed by the user at a rate which is slower than the natural time scale of the stochastic approximation algorithm. In such cases, it becomes meaningful to study the tracking ability of decreasing step-size stochastic approximation. In this paper, we focus on linear iterations and establish that when the update direction and the statistics of the noise in the update direction depend on a slowly changing parameter, the algorithm can track such changes, in a strong sense to be explained later. To the best of our knowledge, this is the first result on the tracking ability of adaptive algorithms with decreasing step-sizes. Similar results are possible for methods involving nonlinear iterations under certain stability conditions (see, e.g., [4] for stability conditions in a single time-scale setting with a simpler noise model), but this direction is not pursued here.

The rest of the paper is organized as follows. In Section 1, we motivate the linear algorithms considered, within the context of reinforcement learning. In Section 2, we state the main result of this paper. The last three sections are devoted to the proof of this result.

1. Motivation

The motivation for the analysis of the linear iterations considered in this paper comes from simulation-based reinforcement learning methods for Markov decision problems [2,13], such as temporal difference (TD) learning [12], that are used to approximate the value function under a given fixed policy. More precisely, consider an ergodic finite-state Markov chain $\{X_k\}$ with transition probabilities $p(y\,|\,x)$. Let $g(x)$ be a one-stage reward function, and let $\bar g$ be its steady-state expectation. TD learning methods consider value functions that are linear in a prescribed set of basis functions $\phi_i(x)$,

$$\hat V(x; r) = \sum_i r_i \phi_i(x),$$

and adjust the vector $r$ of free parameters $r_i$ to obtain a sequence of approximations to the solution $V$ of the Poisson equation

$$V(x) = g(x) - \bar g + \sum_y p(y\,|\,x)\, V(y).$$

The parameters $\alpha_k$ and $r_k$, which denote the estimate of the average reward and the parameters of the approximate value function, are updated recursively as

$$\alpha_{k+1} = \alpha_k + \gamma_k \bigl(g(X_k) - \alpha_k\bigr),$$
$$r_{k+1} = r_k + \gamma_k \bigl(g(X_k) - \alpha_k + r_k'\phi(X_{k+1}) - r_k'\phi(X_k)\bigr) Z_k,$$

where

$$Z_k = \sum_{l=0}^{k} \lambda^{k-l} \phi(X_l)$$

is a so-called eligibility trace, and $\lambda$ is a constant with $0 \le \lambda \le 1$. Observe that the update equation for the vector $(\alpha_k, r_k)$ is of the form

$$r_{k+1} = r_k + \gamma_k \bigl(h(Y_{k+1}) - G(Y_{k+1})\, r_k\bigr),$$

where $Y_{k+1} = (X_k, X_{k+1}, Z_k)$ is a Markov chain whose transition probabilities are not affected by the parameters being updated. However, in the context of optimization over a parametric family of Markov chains, both the basis functions and the transition probabilities of the Markov chain $\{X_k\}$ depend on a policy parameter $\theta$ that is constantly changing. Therefore, in this context, the update takes the form

$$r_{k+1} = r_k + \gamma_k \bigl(h_{\theta_k}(Y_{k+1}) - G_{\theta_k}(Y_{k+1})\, r_k\bigr),$$

where $\theta_k$ denotes the value of $\theta$ at time $k$, and where $Y_{k+1}$ is generated from $Y_k$ using the transition probabilities corresponding to $\theta_k$. If $\theta_k$ is held constant at $\theta$, the algorithm would generally converge to some $\bar r(\theta)$. We would like to prove that even if $\theta_k$ moves slowly, then $r_k$ tracks $\bar r(\theta_k)$, in the sense that $r_k - \bar r(\theta_k)$ goes to zero. A result of this type, as developed in the next section, is necessary for analyzing the convergence properties of actor-critic algorithms, which combine temporal difference learning and slowly changing policies [7,8].
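For concreteness, the recursion above can be simulated directly. The following is a minimal sketch of the average-reward TD(lambda) iteration on a small synthetic chain; all names and numerical values (n_states, n_feat, the randomly generated P, g, phi, the choice lambda = 0.7, and the step-size 1/k) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal sketch of the average-reward TD(lambda) recursion described above,
# run on a small synthetic ergodic chain.  All names and constants here are
# illustrative choices, not prescribed by the paper.
rng = np.random.default_rng(0)

n_states, n_feat, lam = 5, 3, 0.7
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)          # transition probabilities p(y|x)
g = rng.random(n_states)                   # one-stage reward g(x)
phi = rng.random((n_states, n_feat))       # basis functions phi_i(x), one row per state

alpha = 0.0                                # estimate of the average reward
r = np.zeros(n_feat)                       # value-function parameters
z = np.zeros(n_feat)                       # eligibility trace Z_k
x = 0
for k in range(1, 200_000):
    gamma = 1.0 / k                        # decreasing step-size gamma_k
    y = rng.choice(n_states, p=P[x])       # sample X_{k+1}
    z = lam * z + phi[x]                   # Z_k = sum_l lambda^{k-l} phi(X_l)
    td = g[x] - alpha + phi[y] @ r - phi[x] @ r   # temporal difference
    alpha += gamma * (g[x] - alpha)
    r += gamma * td * z
    x = y

print("estimated average reward:", alpha)
```

Stacking $\alpha_k$ and $r_k$ into a single vector makes this pair an instance of the linear iteration driven by the Markov chain $Y_{k+1} = (X_k, X_{k+1}, Z_k)$ discussed above.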

2. Main result

Consider a stochastic process $\{Y_k\}$ taking values in a Polish (complete, separable, metric) space $\mathsf Y$, endowed with its Borel $\sigma$-field. Let $\{P_\theta(y, dy')\,;\ \theta \in \mathbb R^n\}$ be a parameterized family of transition kernels on $\mathsf Y$. Consider the following update equations for a vector $r_k \in \mathbb R^m$ and a parameter $\theta_k \in \mathbb R^n$:

$$r_{k+1} = r_k + \gamma_k \bigl(h_{\theta_k}(Y_{k+1}) - G_{\theta_k}(Y_{k+1})\, r_k + \xi_{k+1} r_k\bigr),$$
$$\theta_{k+1} = \theta_k + \beta_k H_{k+1}. \qquad (1)$$

In the above iteration, $\{h_\theta, G_\theta : \theta \in \mathbb R^n\}$ is a parameterized family of $m$-vector-valued and $m \times m$-matrix-valued measurable functions on $\mathsf Y$. Also, $H_k$ is a random process that drives the changes in the parameter $\theta_k$, which in turn affects the linear stochastic approximation updates of $r_k$. We now continue with our assumptions.

Assumption 1. The step-size sequence $\{\gamma_k\}$ is deterministic, non-increasing, and satisfies

$$\sum_k \gamma_k = \infty, \qquad \sum_k \gamma_k^2 < \infty.$$

Let $\mathcal F_k$ be the $\sigma$-field generated by $\{Y_l, H_l, r_l, \theta_l : l \le k\}$.

Assumption 2. For every measurable set $A \subset \mathsf Y$,

$$P(Y_{k+1} \in A \mid \mathcal F_k) = P(Y_{k+1} \in A \mid Y_k, \theta_k) = P_{\theta_k}(Y_k, A).$$

For any measurable function $f$ on $\mathsf Y$, let $P_\theta f$ denote the measurable function $y \mapsto \int P_\theta(y, dy') f(y')$. Also, for any vector $r$, let $|r|$ denote its Euclidean norm.

Assumption 3 (Existence and properties of solutions to the Poisson equation). For each $\theta$, there exist functions $\bar h(\theta) \in \mathbb R^m$, $\bar G(\theta) \in \mathbb R^{m \times m}$, $\hat h_\theta : \mathsf Y \to \mathbb R^m$, and $\hat G_\theta : \mathsf Y \to \mathbb R^{m \times m}$ that satisfy the following:

1. For each $y \in \mathsf Y$,
$$\hat h_\theta(y) = h_\theta(y) - \bar h(\theta) + (P_\theta \hat h_\theta)(y),$$
$$\hat G_\theta(y) = G_\theta(y) - \bar G(\theta) + (P_\theta \hat G_\theta)(y).$$
2. For some constant $C$ and for all $\theta$, we have $\max(|\bar h(\theta)|, |\bar G(\theta)|) \le C$.
3. For any $d > 0$, there exists $C_d > 0$ such that $\sup_k E[\,|f_{\theta_k}(Y_k)|^d\,] \le C_d$, where $f_\theta$ represents any of the functions $\hat h_\theta$, $h_\theta$, $\hat G_\theta$, $G_\theta$.
4. For some constant $C > 0$ and for all $\theta, \bar\theta \in \mathbb R^n$,
$$\max\bigl(|\bar h(\theta) - \bar h(\bar\theta)|,\ |\bar G(\theta) - \bar G(\bar\theta)|\bigr) \le C\,|\theta - \bar\theta|.$$
5. There exists a positive measurable function $C(\cdot)$ on $\mathsf Y$ such that, for each $d > 0$, $\sup_k E[C(Y_k)^d] < \infty$, and
$$|(P_\theta f_\theta)(y) - (P_{\bar\theta} f_{\bar\theta})(y)| \le C(y)\,|\theta - \bar\theta|,$$
where $f_\theta$ is any of the functions $\hat h_\theta$ and $\hat G_\theta$.

Assumption 4 (Slowly changing environment). The random process $\{H_k\}$ satisfies $\sup_k E[\,|H_k|^d\,] < \infty$ for all $d > 0$. Furthermore, the sequence $\{\beta_k\}$ is deterministic and satisfies

$$\sum_k \Bigl(\frac{\beta_k}{\gamma_k}\Bigr)^d < \infty$$

for some $d > 0$.

Assumption 5. The sequence $\{\xi_k\}$ is an $m \times m$-matrix-valued $\mathcal F_k$-martingale difference with bounded moments, i.e.,

$$E[\xi_{k+1} \mid \mathcal F_k] = 0, \qquad \sup_k E[\,|\xi_k|^d\,] < \infty \quad \forall d > 0.$$

Assumption 6 (Uniform positive definiteness). There exists some $a > 0$ such that, for all $r \in \mathbb R^m$ and $\theta \in \mathbb R^n$,

$$r'\, \bar G(\theta)\, r \ge a\,|r|^2.$$

Our main result is the following.

Theorem 7. If Assumptions 1-6 are satisfied, then

$$\lim_k\ \bigl|r_k - \bar G(\theta_k)^{-1}\, \bar h(\theta_k)\bigr| = 0, \quad \text{w.p.1.}$$

When $\theta_k$ is held fixed at some value $\theta$ for all $k$, our result states that $r_k$ converges to $\bar G(\theta)^{-1}\bar h(\theta)$, which is a special case of Theorem 17 on p. 239 of [1]. In fact, Assumptions 1 and 2 are the counterparts of Assumptions A.1 and A.2 of [1], and Assumption 3 is the counterpart of Assumptions A.3 and A.4 in [1]. Several sufficient conditions for this assumption are presented in [1] for the case where the state space $\mathsf Y$ of the process $Y_k$ is a subset of a Euclidean space. When $\mathsf Y$ is Polish, these conditions can be generalized using the techniques of [11]. There are two differences between our assumptions and those in [1]. A minor difference is that [1] considers vector-valued processes $Y_k$, whereas we consider more general processes $Y_k$. Accordingly, the bounds in [1] are stated in terms of $|Y_k|$, whereas our bounds are stated in terms of some positive functions of $Y_k$. The second difference, which is the more significant one, is that $\theta_k$ is changing, albeit slowly. For this reason, we need to use a different proof technique. Our proof combines various techniques used in [1,3,4]. In the next section, we present an overview of the proof and the intuition behind it.
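As a purely illustrative numerical check of the tracking statement in Theorem 7 (not part of the paper), one can run iteration (1) on a toy problem in which the Markov noise is replaced by i.i.d. perturbations of $h_\theta$ and $G_\theta$, and $\theta_k$ performs a slow random walk. All function choices, step-size exponents, and constants below are assumptions made only for the example.

```python
import numpy as np

# Toy illustration of iteration (1) and the tracking claim of Theorem 7.
# For simplicity, the Markov noise is replaced by i.i.d. perturbations of
# h_theta and G_theta, and theta_k performs a random walk on the slower
# time scale beta_k.  All choices below are illustrative assumptions.
rng = np.random.default_rng(1)
m = 2

def G_bar(theta):
    # Uniformly positive definite for all theta (cf. Assumption 6).
    return np.array([[2.0 + np.sin(theta), 0.3],
                     [0.3, 1.5 + 0.5 * np.cos(theta)]])

def h_bar(theta):
    # Bounded and Lipschitz in theta (cf. Assumption 3).
    return np.array([np.cos(theta), 1.0 + 0.2 * np.sin(2.0 * theta)])

r = np.zeros(m)
theta = 0.0
errors = []
for k in range(1, 500_001):
    gamma = 0.5 * k ** -0.6      # gamma_k: sum diverges, sum of squares converges
    beta = 0.5 * k ** -0.9       # beta_k: (beta_k/gamma_k)^d summable for large d
    h_obs = h_bar(theta) + 0.5 * rng.standard_normal(m)         # noisy h_theta(Y_{k+1})
    G_obs = G_bar(theta) + 0.5 * rng.standard_normal((m, m))    # noisy G_theta(Y_{k+1})
    r = r + gamma * (h_obs - G_obs @ r)
    theta = theta + beta * rng.standard_normal()                # slowly varying parameter
    if k % 100_000 == 0:
        target = np.linalg.solve(G_bar(theta), h_bar(theta))    # G(theta_k)^{-1} h(theta_k)
        errors.append(np.linalg.norm(r - target))

print("tracking error at checkpoints:", errors)
```

The recorded errors should shrink toward zero, which is the behavior that Theorem 7 guarantees under the stated assumptions.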

3. Overview of the proof

We note that the sequence $\hat\varepsilon_k = \bar G(\theta_k)\, r_k - \bar h(\theta_k)$ satisfies the iteration

$$\hat\varepsilon_{k+1} = \hat\varepsilon_k - \gamma_k \bar G(\theta_{k+1})\hat\varepsilon_k + \gamma_k \varepsilon^{(1)}_{k+1} + \gamma_k \varepsilon^{(2)}_{k+1},$$

where

$$\varepsilon^{(1)}_{k+1} = \bar G(\theta_{k+1})\bigl[\bigl(h_{\theta_k}(Y_{k+1}) - \bar h(\theta_k)\bigr) - \bigl(G_{\theta_k}(Y_{k+1}) - \bar G(\theta_k)\bigr) r_k + \xi_{k+1} r_k\bigr],$$
$$\varepsilon^{(2)}_{k+1} = \frac{1}{\gamma_k}\bigl[\bigl(\bar G(\theta_{k+1}) - \bar G(\theta_k)\bigr) r_k - \bigl(\bar h(\theta_{k+1}) - \bar h(\theta_k)\bigr)\bigr],$$

as can be verified by some simple algebra.
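For the reader's convenience, the "simple algebra" is the following expansion, using only the definitions and the update equation (1) introduced above:

$$\begin{aligned}
\hat\varepsilon_{k+1}
  &= \bar G(\theta_{k+1})\, r_{k+1} - \bar h(\theta_{k+1}) \\
  &= \bar G(\theta_{k+1})\, r_k - \bar h(\theta_{k+1})
     + \gamma_k\, \bar G(\theta_{k+1})
       \bigl(h_{\theta_k}(Y_{k+1}) - G_{\theta_k}(Y_{k+1})\, r_k + \xi_{k+1} r_k\bigr) \\
  &= \hat\varepsilon_k
     + \bigl(\bar G(\theta_{k+1}) - \bar G(\theta_k)\bigr) r_k
     - \bigl(\bar h(\theta_{k+1}) - \bar h(\theta_k)\bigr)
     + \gamma_k\, \bar G(\theta_{k+1})
       \bigl(h_{\theta_k}(Y_{k+1}) - G_{\theta_k}(Y_{k+1})\, r_k + \xi_{k+1} r_k\bigr) \\
  &= \hat\varepsilon_k + \gamma_k\, \varepsilon^{(2)}_{k+1}
     + \gamma_k\, \varepsilon^{(1)}_{k+1}
     + \gamma_k\, \bar G(\theta_{k+1})\bigl(\bar h(\theta_k) - \bar G(\theta_k)\, r_k\bigr) \\
  &= \hat\varepsilon_k - \gamma_k\, \bar G(\theta_{k+1})\, \hat\varepsilon_k
     + \gamma_k\, \varepsilon^{(1)}_{k+1} + \gamma_k\, \varepsilon^{(2)}_{k+1}.
\end{aligned}$$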

Assumption 3 implies that the vector $h_{\theta}(Y_{k+1})$ and the matrix $G_{\theta}(Y_{k+1})$ have expected values $\bar h(\theta)$ and $\bar G(\theta)$, respectively, under the steady-state distribution of the time-homogeneous Markov chain $Y_k$ with transition kernel $P_\theta$. Therefore, we expect that the effect of the error term $\varepsilon^{(1)}_{k+1}$ can be averaged out in the long term. Similarly, since $\theta_k$ is changing very slowly with respect to the step-size $\gamma_k$, we expect that $\varepsilon^{(2)}_{k+1}$ goes to zero. The proof consists of showing that the terms $\varepsilon^{(i)}_{k+1}$, $i = 1, 2$, are inconsequential in the limit, and of observing that the sequence $\{\hat\varepsilon_k\}$ converges to zero when these terms are absent. We formalize this intuition in the next two sections. Note that the error terms are affine in $r_k$, and can therefore be very large if $r_k$ is large. A key step in the proof is to show boundedness of the iterates $r_k$, and this is the subject of the next section. The main result is then proved in the last section. The approach and techniques used here are inspired by more general techniques developed in [1,4].

4. Proof of boundedness

Note that the difference between two successive iterates at time $k$ is of the order of $\gamma_k$, which goes to zero as $k$ goes to infinity. Therefore, to study the asymptotic behavior of the sequence $\{r_k\}$, we need to focus on a subsequence $\{r_{k_j}\}$, where the sequence of non-negative integers $\{k_j\}$ is defined by

$$k_0 = 0, \qquad k_{j+1} = \min\Bigl\{k > k_j : \sum_{l=k_j}^{k-1} \gamma_l \ge T\Bigr\}.$$

Here, $T$ is a positive constant that will be held fixed throughout the rest of the paper. The sequence $\{k_j\}$ is chosen so that any two successive elements are sufficiently far apart, resulting in a non-trivial difference between $r_{k_{j+1}}$ and $r_{k_j}$.

To obtain a relation between $r_{k_{j+1}}$ and $r_{k_j}$, we define a sequence $\{\hat r_k\}$ by

$$\hat r_k = \frac{r_k}{\max(1, |r_{k_j}|)} \qquad \text{for } k_j \le k \le k_{j+1}.$$

Note that $|\hat r_{k_j}| \le 1$ for all $j$. Furthermore, $\hat r_k$ is $\mathcal F_k$-adapted and satisfies

$$\hat r_{k+1} = \hat r_k + \gamma_k\Bigl(\frac{\bar h(\theta_k)}{\max(1, |r_{k_j}|)} - \bar G(\theta_k)\,\hat r_k + \hat\eta_{k+1}\Bigr), \qquad k_j \le k < k_{j+1},$$

where, for $k_j \le k < k_{j+1}$, the term

$$\hat\eta_{k+1} = \frac{h_{\theta_k}(Y_{k+1}) - \bar h(\theta_k)}{\max(1, |r_{k_j}|)} - \bigl(G_{\theta_k}(Y_{k+1}) - \bar G(\theta_k)\bigr)\hat r_k + \xi_{k+1}\hat r_k$$

can be viewed as perturbation noise. Similarly, for each $j$, define a sequence $\{\tilde r^{\,j}_k\}$ as follows:

$$\tilde r^{\,j}_{k_j} = \hat r_{k_j}, \qquad \tilde r^{\,j}_{k+1} = \tilde r^{\,j}_k + \gamma_k\Bigl(\frac{\bar h(\theta_k)}{\max(1, |r_{k_j}|)} - \bar G(\theta_k)\,\tilde r^{\,j}_k\Bigr), \qquad k_j \le k < k_{j+1}.$$

Note that the iteration satisfied by $\tilde r^{\,j}_k$ is the same as that of $\hat r_k$, except that the perturbation noise is not present in it. We will show that the perturbation noise is negligible, and that $\hat r_k$ tracks $\tilde r^{\,j}_k$, in the sense that

$$\lim_{j\to\infty}\ \max_{k_j \le k \le k_{j+1}} |\hat r_k - \tilde r^{\,j}_k| = 0, \quad \text{w.p.1.}$$

The next lemma provides bounds on the perturbation noise. It involves the stopping times $\tau_1(C)$, which are defined as follows: given some constant $C \ge 1$, let

$$\tau_1(C) = \min\{k \ge k_j : |\hat r_k| \ge C\}.$$

The stopping time $\tau_1(C)$ is the first time the random sequence $\{\hat r_k\}$ exits a ball of radius $C$. We will often use the simpler notation $\tau_1$ if the value of $C$ is clear from the context. The following lemma derives bounds on the effect of the perturbation noise before time $\tau_1$.

Lemma 8. For any given $C > 0$, there exists a constant $C_1 > 0$ such that, for all $j$, we have

$$E\Bigl[\max_{k_j \le k \le k_{j+1}} \Bigl|\sum_{l=k_j}^{k-1} \gamma_l\, \hat\eta_{l+1}\, I\{l+1 \le \tau_1\}\Bigr|^2\Bigr] \le C_1 \sum_{l=k_j}^{k_{j+1}} \gamma_l^2.$$

Proof. We only outline the proof, as this result is similar to Proposition 7 in [1, p. 228]. The main difference is that this proposition considers the sum $\sum_{l=0}^{k-1} \gamma_l \hat\eta_{l+1}$, whereas we are interested in the sum $\sum_{l=k_j}^{k-1} \gamma_l \hat\eta_{l+1}$. The difference in the initial limit of the summation does not affect the proof. However, the difference in the final limit of the summation could affect the proof, because $\hat r_k I\{\tau_1 = k\}$ is not bounded. We note, however, that the proof in [1] still goes through, as long as $\hat r_k I\{\tau_1 = k\}$ has bounded moments. To see that $\hat r_k I\{\tau_1 = k\}$ has bounded moments, we observe that it is affine in $\hat r_{k-1} I\{\tau_1 = k\}$, which is bounded, with the coefficients having bounded moments.

The outline of the rest of the proof is as follows. Consider a fixed $j$, and suppress the superscript $j$ to simplify notation. Note that the perturbation noise $\hat\eta_{l+1}$ is of the form

$$F_{\theta_l}(\hat r_l, Y_{l+1}) - \bar F_{\theta_l}(\hat r_l) + \xi_{l+1}\hat r_l,$$

where $\bar F_\theta(r)$ is the steady-state expectation of $F_\theta(r, Y_l)$, and where $Y_l$ is a Markov chain with transition kernel $P_\theta$. Using Assumption 3, it is easy to see that for each $\theta$, $r$ there exists a solution $\hat F_\theta(r, y)$ to the Poisson equation

$$\hat F_\theta(r, y) = F_\theta(r, y) - \bar F_\theta(r) + (P_\theta \hat F_\theta(r, \cdot))(y).$$

The perturbation noise can be expressed in terms of $\hat F$ as follows:

$$\begin{aligned}
\hat\eta_{l+1} &= \xi_{l+1}\hat r_l + F_{\theta_l}(\hat r_l, Y_{l+1}) - \bar F_{\theta_l}(\hat r_l) \\
&= \xi_{l+1}\hat r_l + \hat F_{\theta_l}(\hat r_l, Y_{l+1}) - (P_{\theta_l}\hat F_{\theta_l}(\hat r_l, \cdot))(Y_{l+1}) \\
&= \xi_{l+1}\hat r_l + \bigl[\hat F_{\theta_l}(\hat r_l, Y_{l+1}) - (P_{\theta_l}\hat F_{\theta_l}(\hat r_l, \cdot))(Y_l)\bigr] \\
&\quad + \bigl[(P_{\theta_{l-1}}\hat F_{\theta_{l-1}}(\hat r_{l-1}, \cdot))(Y_l) - (P_{\theta_l}\hat F_{\theta_l}(\hat r_l, \cdot))(Y_{l+1})\bigr] \\
&\quad + \bigl[(P_{\theta_l}\hat F_{\theta_l}(\hat r_l, \cdot))(Y_l) - (P_{\theta_l}\hat F_{\theta_l}(\hat r_{l-1}, \cdot))(Y_l)\bigr] \\
&\quad + \bigl[(P_{\theta_l}\hat F_{\theta_l}(\hat r_{l-1}, \cdot))(Y_l) - (P_{\theta_{l-1}}\hat F_{\theta_{l-1}}(\hat r_{l-1}, \cdot))(Y_l)\bigr].
\end{aligned}$$

To prove the lemma, it is sufficient to show that the desired inequality holds when $\hat\eta_{l+1}$ is replaced by each of the terms on the right-hand side of the above equation. Indeed, the term $\xi_{l+1}\hat r_l$ and the first bracketed term are martingale differences with bounded second moments, and the second bracketed term is the summand of a telescoping series. The last two terms are of the order of $|\hat r_l - \hat r_{l-1}|$ and $|\theta_l - \theta_{l-1}|$, respectively. Using these observations, it is easily shown that each of these terms satisfies the desired inequality.
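As a sanity check on this decomposition (a short verification using only the Poisson equation above), the four bracketed terms collapse as follows:

$$\begin{aligned}
&\bigl[\hat F_{\theta_l}(\hat r_l, Y_{l+1}) - (P_{\theta_l}\hat F_{\theta_l}(\hat r_l,\cdot))(Y_l)\bigr]
 + \bigl[(P_{\theta_{l-1}}\hat F_{\theta_{l-1}}(\hat r_{l-1},\cdot))(Y_l)
        - (P_{\theta_l}\hat F_{\theta_l}(\hat r_l,\cdot))(Y_{l+1})\bigr] \\
&\quad + \bigl[(P_{\theta_l}\hat F_{\theta_l}(\hat r_l,\cdot))(Y_l)
        - (P_{\theta_l}\hat F_{\theta_l}(\hat r_{l-1},\cdot))(Y_l)\bigr]
 + \bigl[(P_{\theta_l}\hat F_{\theta_l}(\hat r_{l-1},\cdot))(Y_l)
        - (P_{\theta_{l-1}}\hat F_{\theta_{l-1}}(\hat r_{l-1},\cdot))(Y_l)\bigr] \\
&= \hat F_{\theta_l}(\hat r_l, Y_{l+1}) - (P_{\theta_l}\hat F_{\theta_l}(\hat r_l,\cdot))(Y_{l+1})
 = F_{\theta_l}(\hat r_l, Y_{l+1}) - \bar F_{\theta_l}(\hat r_l),
\end{aligned}$$

so that adding $\xi_{l+1}\hat r_l$ indeed recovers $\hat\eta_{l+1}$.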

Lemma 8 indicates that, as long as $\hat r_k$ is bounded, the perturbation noise remains negligible. In the next lemma, we prove that the sequence $\{\hat r_k\}$ closely approximates $\{\tilde r^{\,j}_k\}$.

Lemma 9. $\lim_{j\to\infty} \max_{k_j \le k \le k_{j+1}} |\hat r_k - \tilde r^{\,j}_k| = 0$, w.p.1.

Proof. Since $\bar G(\cdot)$ is bounded, there exists a constant $D$ such that

$$|\hat r_{k+1} - \tilde r^{\,j}_{k+1}| \le D\sum_{l=k_j}^{k} \gamma_l\, |\hat r_l - \tilde r^{\,j}_l| + \Bigl|\sum_{l=k_j}^{k} \gamma_l\, \hat\eta_{l+1}\Bigr|$$

for every $k_j \le k < k_{j+1}$. Using the discrete Gronwall inequality,¹ it is easy to see that

$$\max_{k_j \le k \le k_{j+1}\wedge\tau_1} |\hat r_k - \tilde r^{\,j}_k| \le e^{D\bar T} \max_{k_j \le k \le k_{j+1}\wedge\tau_1} \Bigl|\sum_{l=k_j}^{k-1} \gamma_l\, \hat\eta_{l+1}\Bigr|,$$

where $\bar T = T + \gamma_0$. Therefore, for any $\delta > 0$, Lemma 8 and the Chebyshev inequality imply that

$$P\Bigl(\max_{k_j \le k \le k_{j+1}\wedge\tau_1} |\hat r_k - \tilde r^{\,j}_k| \ge \delta\Bigr) \le \frac{C_1}{\delta^2} \sum_{l=k_j}^{k_{j+1}} \gamma_l^2$$

for some $C_1 > 0$ that depends on the constant $C$ in the definition of the stopping time $\tau_1(C)$.

Consider the stopping time $\tau_2$ defined by

$$\tau_2 = \min\{k \ge k_j : |\hat r_k - \tilde r^{\,j}_k| \ge \delta\},$$

which is the first time that the sequence $\{\hat r_k\}$ exits from a tube of radius $\delta$ around the sequence $\{\tilde r^{\,j}_k\}$. To prove the lemma, we need bounds on $P(\tau_2 \le k_{j+1})$, whereas we only have bounds on $P(\tau_2 \le k_{j+1}\wedge\tau_1)$. Therefore, we need to relate the stopping times $\tau_1$ and $\tau_2$. To do this, note that

$$\sup_j\ \max_{k_j \le k \le k_{j+1}} |\tilde r^{\,j}_k| \le \bar C$$

for some constant $\bar C \ge 1$, because $\bar h(\cdot)$ and $\bar G(\cdot)$ are bounded, and the function $\bar G(\cdot)$ satisfies Assumption 6. At time $k = \tau_1(\bar C + \delta)$, we have $|\hat r_k| \ge \bar C + \delta$ and $|\tilde r^{\,j}_k| \le \bar C$, which implies that $|\hat r_k - \tilde r^{\,j}_k| \ge \delta$, i.e.,

$$\tau_2 \le \tau_1(\bar C + \delta).$$

Therefore,

$$P\Bigl(\max_{k_j \le k \le k_{j+1}} |\hat r_k - \tilde r^{\,j}_k| \ge \delta\Bigr) = P(\tau_2 \le k_{j+1}) = P\bigl(\tau_2 \le k_{j+1}\wedge\tau_1(\bar C + \delta)\bigr) \le \frac{C_1}{\delta^2} \sum_{l=k_j}^{k_{j+1}} \gamma_l^2.$$

The result follows from the summability of the series $\sum_k \gamma_k^2$ and the Borel-Cantelli lemma.

¹ For a non-negative sequence $\{\gamma_k\}$ and a constant $B \ge 0$, let $\{b_k\}$ be a sequence satisfying $b_0 = 0$ and $b_{k+1} \le \sum_{l=0}^{k} \gamma_l b_l + B$. Then, for every $k$, we have $b_{k+1} \le B\exp\bigl(\sum_{l=0}^{k} \gamma_l\bigr)$.

Now we are ready to prove boundedness, using the following simple results.

Lemma 10. Suppose that $0 \le \alpha < 1$ and that $\{a_k\}$, $\{\delta_k\}$ are non-negative sequences that satisfy $a_{k+1} \le \alpha a_k + \delta_k$.
(1) If $\sup_k \delta_k < \infty$, then $\sup_k a_k < \infty$.
(2) If $\delta_k \to 0$, then $a_k \to 0$.

Lemma 11. If an $m \times m$ matrix $G$ satisfies

$$r'\, G\, r \ge a|r|^2 \qquad \forall r \in \mathbb R^m,$$

then, for all sufficiently small $\gamma > 0$,

$$|(I - \gamma G)r| \le \bigl(1 - \tfrac{1}{2}\gamma a\bigr)|r| \qquad \forall r \in \mathbb R^m.$$
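The inequality in Lemma 11 follows from a short computation (a sketch, with constants not optimized; here $|G|$ denotes the induced matrix norm):

$$|(I - \gamma G) r|^2
  = |r|^2 - 2\gamma\, r' G r + \gamma^2 |G r|^2
  \le \bigl(1 - 2\gamma a + \gamma^2 |G|^2\bigr) |r|^2
  \le (1 - \gamma a)\, |r|^2
  \le \bigl(1 - \tfrac{1}{2}\gamma a\bigr)^2 |r|^2
  \qquad \text{for } 0 < \gamma \le a / |G|^2,$$

and taking square roots gives the claim (note that $a \le |G|$, so $\gamma a \le 1$ in this range).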

Lemma 12. $\sup_k |r_k| < \infty$, w.p.1.

Proof. Since $\bar h(\cdot)$ is bounded, Assumption 6 and Lemma 11 imply the following. There exists a constant $C$ such that, for all sufficiently large $k$ and for $k_j \le k < k_{j+1}$,

$$|\tilde r^{\,j}_{k+1}| \le \bigl(1 - \tfrac{1}{2}\gamma_k a\bigr)|\tilde r^{\,j}_k| + \frac{C\gamma_k}{\max(1, |r_{k_j}|)}.$$

Using the inequality $1 - x \le e^{-x}$, we have

$$|\tilde r^{\,j}_{k_{j+1}}| \le \exp\Bigl(-\frac{a}{2}\sum_{l=k_j}^{k_{j+1}-1}\gamma_l\Bigr)|\tilde r^{\,j}_{k_j}| + \frac{C\sum_{l=k_j}^{k_{j+1}-1}\gamma_l}{\max(1, |r_{k_j}|)}. \qquad (2)$$

This, along with Lemma 9, implies

$$\frac{|r_{k_{j+1}}|}{\max(1, |r_{k_j}|)} \le e^{-aT/2}\,\frac{|r_{k_j}|}{\max(1, |r_{k_j}|)} + \frac{C(T + \gamma_0)}{\max(1, |r_{k_j}|)} + \delta_j,$$

where $\delta_j \to 0$, w.p.1. Multiplying both sides by $\max(1, |r_{k_j}|)$ and using the fact that it is less than $1 + |r_{k_j}|$, we have

$$|r_{k_{j+1}}| \le (e^{-aT/2} + \delta_j)|r_{k_j}| + C(T + \gamma_0) + \delta_j.$$

Since $e^{-aT/2} < 1$ and $\delta_j \to 0$, w.p.1, it follows from Lemma 10(1) that

$$\sup_j |r_{k_j}| < \infty, \quad \text{w.p.1.}$$

Recall that $\tilde r^{\,j}_{k_j} = \hat r_{k_j}$ is bounded. Then, using Eq. (2), it follows that

$$\sup_j\ \max_{k_j \le k \le k_{j+1}} |\tilde r^{\,j}_k| < \infty, \quad \text{w.p.1.}$$

The boundedness of $r_k$ now follows from the observation

$$\sup_k |r_k| = \sup_j\Bigl[\max(1, |r_{k_j}|)\max_{k_j \le k \le k_{j+1}}|\hat r_k|\Bigr] \le \sup_j\Bigl[(1 + |r_{k_j}|)\Bigl(\max_{k_j \le k \le k_{j+1}}|\tilde r^{\,j}_k| + \max_{k_j \le k \le k_{j+1}}|\tilde r^{\,j}_k - \hat r_k|\Bigr)\Bigr],$$

the boundedness of $r_{k_j}$ and $\tilde r^{\,j}_k$, and Lemma 9.

5. Proof of Theorem 7

To prove Theorem 7, we consider the sequence $\hat\varepsilon_k = \bar G(\theta_k)\, r_k - \bar h(\theta_k)$. As noted in Section 3, this sequence evolves according to the iteration

$$\hat\varepsilon_{k+1} = \hat\varepsilon_k - \gamma_k \bar G(\theta_{k+1})\hat\varepsilon_k + \gamma_k \varepsilon^{(1)}_{k+1} + \gamma_k \varepsilon^{(2)}_{k+1},$$

where

$$\varepsilon^{(1)}_{k+1} = \bar G(\theta_{k+1})\bigl[\bigl(h_{\theta_k}(Y_{k+1}) - \bar h(\theta_k)\bigr) - \bigl(G_{\theta_k}(Y_{k+1}) - \bar G(\theta_k)\bigr) r_k + \xi_{k+1} r_k\bigr],$$
$$\varepsilon^{(2)}_{k+1} = \frac{1}{\gamma_k}\bigl[\bigl(\bar G(\theta_{k+1}) - \bar G(\theta_k)\bigr) r_k - \bigl(\bar h(\theta_{k+1}) - \bar h(\theta_k)\bigr)\bigr].$$

Lemma 13. $\sum_k \gamma_k \varepsilon^{(1)}_{k+1}$ converges, w.p.1.

Proof. The proof relies on Assumptions 1, 3, and 5, and is omitted because it is very similar to the proof of Lemma 2 in [2, p. 224].

Lemma 14. $\lim_k \varepsilon^{(2)}_k = 0$, w.p.1.

Proof. Using part 4 of Assumption 3 and the update equation for $\theta_k$, we have

$$|\varepsilon^{(2)}_{k+1}| \le C\,\frac{|\theta_{k+1} - \theta_k|}{\gamma_k}\,(|r_k| + 1) = C\,\frac{\beta_k}{\gamma_k}\,|H_{k+1}|\,(|r_k| + 1).$$

Since $\{H_k\}$ has bounded moments, the second part of Assumption 4 yields

$$\sum_k E\Bigl[\Bigl(\frac{\beta_k}{\gamma_k}|H_k|\Bigr)^d\Bigr] < \infty$$

for some $d > 0$. Therefore, $(\beta_k/\gamma_k)|H_k|$ converges to zero, w.p.1. The result follows from the boundedness of $\{r_k\}$.

Recall the notation $k_j$ from the previous section. For each $j$, define $\phi^j_k$, for $k \ge k_j$, as follows:

$$\phi^j_{k+1} = \bigl(I - \gamma_k \bar G(\theta_k)\bigr)\phi^j_k, \qquad \phi^j_{k_j} = \hat\varepsilon_{k_j}.$$

Lemma 15. $\lim_{j\to\infty} \max_{k_j \le k \le k_{j+1}} |\hat\varepsilon_k - \phi^j_k| = 0$, w.p.1.

Proof. Since $\bar G(\cdot)$ is bounded, there exists some $C$ such that, for every $j$ and $k_j \le k \le k_{j+1}$,

$$|\hat\varepsilon_k - \phi^j_k| \le C\sum_{l=k_j}^{k-1}\gamma_l\,|\hat\varepsilon_l - \phi^j_l| + \Bigl|\sum_{l=k_j}^{k-1}\gamma_l\,\varepsilon^{(1)}_{l+1}\Bigr| + \sum_{l=k_j}^{k-1}\gamma_l\,|\varepsilon^{(2)}_{l+1}|.$$

Using the discrete Gronwall inequality, it can be seen that

$$\max_{k_j \le k \le k_{j+1}} |\hat\varepsilon_k - \phi^j_k| \le e^{C\bar T}\max_{k_j \le k \le k_{j+1}}\Bigl|\sum_{l=k_j}^{k-1}\gamma_l\,\varepsilon^{(1)}_{l+1}\Bigr| + e^{C\bar T}\,\bar T\,\sup_{l \ge k_j}|\varepsilon^{(2)}_{l+1}|.$$

The result follows from the previous two lemmas.

We are now ready to complete the proof of Theorem 7. Using Assumption 6 and Lemma 11, we have

$$|\phi^j_{k_{j+1}}| \le e^{-aT/2}\,|\phi^j_{k_j}|.$$

Therefore, Lemma 15 implies that

$$|\hat\varepsilon_{k_{j+1}}| \le e^{-aT/2}\,|\hat\varepsilon_{k_j}| + \delta_j,$$

where $\delta_j \to 0$, w.p.1. Using Lemma 10(2), it follows that $\bar G(\theta_{k_j})\, r_{k_j} - \bar h(\theta_{k_j}) = \hat\varepsilon_{k_j}$ converges to zero. Using arguments similar to the closing arguments in the proof of Lemma 12, we finally conclude that

$$\lim_k\ |\bar G(\theta_k)\, r_k - \bar h(\theta_k)| = 0, \quad \text{w.p.1.}$$

Since Assumption 6 implies that $\bar G(\theta)$ is invertible with $|\bar G(\theta)^{-1}| \le 1/a$ for all $\theta$, this is equivalent to the statement of Theorem 7.

Acknowledgements

This research was partially supported by the National Science Foundation under contract ECS-9873451, and by the AFOSR under contract F49620-99-1-0320.

References

[1] A. Benveniste, M. Metivier, P. Priouret, Adaptive Algorithms and Stochastic Approximations, Springer, Berlin, 1990.
[2] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[3] V.S. Borkar, Stochastic approximation with two time scales, Systems Control Lett. 29 (1997) 291-294.
[4] V.S. Borkar, S.P. Meyn, The o.d.e. method for convergence of stochastic approximation and reinforcement learning, SIAM J. Control Optim. 38 (2) (2000) 447-469.
[5] M. Duflo, Random Iterative Models, Springer, New York, 1997.
[6] E. Eweda, O. Macchi, Convergence of an adaptive linear estimation algorithm, IEEE Trans. Automat. Control AC-29 (2) (1984) 119-127.
[7] V.R. Konda, Actor-critic algorithms, Ph.D. Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2002.
[8] V.R. Konda, J.N. Tsitsiklis, On actor-critic algorithms, SIAM J. Control Optim., to appear.
[9] H.J. Kushner, Asymptotic behavior of stochastic approximation and large deviations, IEEE Trans. Automat. Control 29 (1984) 984-990.
[10] H.J. Kushner, G.G. Yin, Stochastic Approximation Algorithms and Applications, Springer, New York, 1997.
[11] S.P. Meyn, R.L. Tweedie, Markov Chains and Stochastic Stability, Springer, London, 1993.
[12] R.S. Sutton, Learning to predict by the methods of temporal differences, Mach. Learning 3 (1988) 9-44.
[13] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[14] B. Widrow, J. McCool, M.G. Larimore, C.R. Johnson, Stationary and non-stationary learning characteristics of the LMS adaptive filter, Proc. IEEE 64 (8) (1976) 1151-1161.