A Least Squares Q-Learning Algorithm for Optimal Stopping Problems


LIDS REPORT 273
December 2006 (revised June 2007)

A Least Squares Q-Learning Algorithm for Optimal Stopping Problems

Huizhen Yu (janey.yu@cs.helsinki.fi)
Dimitri P. Bertsekas (dimitrib@mit.edu)

Abstract

We consider the solution of discounted optimal stopping problems using linear function approximation methods. A Q-learning algorithm for such problems, proposed by Tsitsiklis and Van Roy, is based on the method of temporal differences and stochastic approximation. We propose alternative algorithms, which are based on projected value iteration ideas and least squares. We prove the convergence of some of these algorithms and discuss their properties.

Huizhen Yu is with HIIT, University of Helsinki, Finland. Dimitri Bertsekas is with the Laboratory for Information and Decision Systems (LIDS), M.I.T., Cambridge, MA 02139.

1 Introduction

Optimal stopping problems are a special case of Markovian decision problems where the system evolves according to a discrete-time stochastic system equation, until an explicit stopping action is taken. At each state, there are two choices: either to stop and incur a state-dependent stopping cost, or to continue and move to a successor state according to some transition probabilities and incur a state-dependent continuation cost. Once the stopping action is taken, no further costs are incurred. The objective is to minimize the expected value of the total discounted cost. Examples are classical problems, such as search and sequential hypothesis testing, as well as recent applications in finance and the pricing of derivative financial instruments (see Tsitsiklis and Van Roy [TV99], Barraquand and Martineau [BM95], Longstaff and Schwartz [LS01]).

The problem can be solved in principle by dynamic programming (DP for short), but we are interested in problems with large state spaces where the DP solution is practically infeasible. It is then natural to consider approximate DP techniques where the optimal cost function or the Q-factors of the problem are approximated with a function from a chosen parametric class. Generally, cost function approximation methods are theoretically sound (i.e., are provably convergent) only for the single-policy case, where the cost function of a fixed stationary policy is evaluated. However, for the stopping problem of this paper, Tsitsiklis and Van Roy [TV99] introduced a linear function approximation to the optimal Q-factors, which they prove to be the unique solution of a projected form of Bellman's equation. While in general this equation may not have a solution, this difficulty does not occur in optimal stopping problems thanks to a critical fact: the mapping defining the Q-factors is a contraction mapping with respect to the weighted Euclidean norm corresponding to the steady-state distribution of the associated Markov chain. For textbook analyses, we refer to Bertsekas and Tsitsiklis [BT96], Section 6.8, and Bertsekas [Ber07], Section 6.4.

The algorithm of Tsitsiklis and Van Roy is based on single trajectory simulation and ideas related to the temporal differences method of Sutton [Sut88], and relies on the contraction property just mentioned. We propose a new algorithm, which is also based on single trajectory simulation and relies on the same contraction property, but uses different algorithmic ideas. It may be viewed as a fixed point iteration for solving the projected Bellman equation, and it relates to the least squares policy evaluation (LSPE) method first proposed by Bertsekas and Ioffe [BI96] and subsequently developed by Nedić and Bertsekas [NB03], Bertsekas, Borkar, and Nedić [BBN03], and Yu and Bertsekas [YB06] (see also the books [BT96] and [Ber07]). We prove the convergence of our method for finite-state models, and we discuss some variants.

The paper is organized as follows. In Section 2, we introduce the optimal stopping problem, and we derive the associated contraction properties of the mapping that defines Q-learning. In Section 3, we describe our LSPE-like algorithm, and we prove its convergence. We also discuss the convergence rate of the algorithm, and we provide a comparison with another algorithm that is related to the least squares temporal differences (LSTD) method, proposed by Bradtke and Barto [BB96], and further developed by Boyan [Boy99].
In Section 4, we describe some variants of the algorithm, which involve a reduced computational overhead per iteration. In this section, we also discuss the relation of our algorithms with the recent algorithm by Choi and Van Roy [CV06], which can be used to solve the same optimal stopping problem. In Section 5, we prove the convergence of some of the variants of Section 4. We give two alternative proofs, the first of which uses results from the o.d.e. (ordinary differential equation) line of convergence analysis of stochastic iterative algorithms, and the second of which is a direct proof reminiscent of the o.d.e. line of analysis.

A computational comparison of our methods with other algorithms for the optimal stopping problem is beyond the scope of the present paper. However, our analysis and the available results using least squares methods (Bradtke and Barto [BB96], Bertsekas and Ioffe [BI96], Boyan [Boy99], Bertsekas, Borkar, and Nedić [BBN03], Choi and Van Roy [CV06]) clearly suggest a superior performance to the algorithm of Tsitsiklis and Van Roy [TV99], and likely an improved convergence rate over the

method of Choi and Van Roy [CV06], at the expense of some additional overhead per iteration.

2 Q-Learning for Optimal Stopping Problems

We are given a Markov chain with state space $\{1,\ldots,n\}$, described by transition probabilities $p_{ij}$. We assume that the states form a single recurrent class, so the chain has a steady-state distribution vector $\pi = (\pi(1),\ldots,\pi(n))$ with $\pi(i) > 0$ for all states $i$. Given the current state $i$, we assume that we have two options: to stop and incur a cost $c(i)$, or to continue and incur a cost $g(i,j)$, where $j$ is the next state (there is no control to affect the corresponding transition probabilities). The problem is to minimize the associated $\alpha$-discounted infinite horizon cost, where $\alpha \in (0,1)$.

For a given state $i$, we associate a Q-factor with each of the two possible decisions. The Q-factor for the decision to stop is equal to $c(i)$. The Q-factor for the decision to continue is denoted by $Q(i)$. The optimal Q-factor for the decision to continue, denoted by $Q^*$, relates to the optimal cost function $J^*$ of the stopping problem by

$$Q^*(i) = \sum_{j=1}^n p_{ij}\bigl(g(i,j) + \alpha J^*(j)\bigr), \qquad i = 1,\ldots,n,$$

and

$$J^*(i) = \min\bigl\{c(i),\, Q^*(i)\bigr\}, \qquad i = 1,\ldots,n.$$

The value $Q^*(i)$ is equal to the cost of choosing to continue at the initial state $i$ and following an optimal policy afterwards. The function $Q^*$ satisfies Bellman's equation

$$Q^*(i) = \sum_{j=1}^n p_{ij}\Bigl(g(i,j) + \alpha \min\bigl\{c(j),\, Q^*(j)\bigr\}\Bigr), \qquad i = 1,\ldots,n. \tag{1}$$

Once the Q-factors $Q^*(i)$ are calculated, an optimal policy can be implemented by stopping at state $i$ if and only if $c(i) \le Q^*(i)$.

The Q-learning algorithm (Watkins [Wat89]) is

$$Q(i) := Q(i) + \gamma\Bigl(g(i,j) + \alpha \min\bigl\{c(j),\, Q(j)\bigr\} - Q(i)\Bigr),$$

where $i$ is the state at which we update the Q-factor, $j$ is a successor state, generated randomly according to the transition probabilities $p_{ij}$, and $\gamma$ is a small positive stepsize, which diminishes to 0 over time. The convergence of this algorithm is addressed by the general theory of Q-learning (see Watkins and Dayan [WD92], and Tsitsiklis [Tsi94]). However, for problems where the number of states $n$ is large, this algorithm is impractical.

Let us now consider the approximate evaluation of $Q^*(i)$. We introduce the mapping $F : \Re^n \mapsto \Re^n$ given by

$$(FQ)(i) = \sum_{j=1}^n p_{ij}\Bigl(g(i,j) + \alpha \min\bigl\{c(j),\, Q(j)\bigr\}\Bigr), \qquad i = 1,\ldots,n.$$

We denote by $FQ$ or $F(Q)$ the vector whose components are $(FQ)(i)$, $i = 1,\ldots,n$. By Eq. (1), the optimal Q-factor for the choice to continue, $Q^*$, is a fixed point of $F$, and it is the unique fixed point because $F$ is a sup-norm contraction mapping. For the approximation considered here, it turns out to be very important that $F$ is also a Euclidean contraction. Let $\|\cdot\|_\pi$ be the weighted Euclidean norm associated with the steady-state probability vector $\pi$, i.e.,

$$\|v\|_\pi^2 = \sum_{i=1}^n \pi(i)\, v(i)^2.$$
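The following small numerical sketch (our addition, not part of the report) makes the mapping $F$ and Bellman's equation (1) concrete: it builds a synthetic stopping problem with made-up data and computes $Q^*$ by iterating $F$, which converges because $F$ is a sup-norm contraction.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small synthetic stopping problem (illustrative data only).
n, alpha = 10, 0.95
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # transition probabilities p_ij
g = rng.standard_normal((n, n))                              # continuation costs g(i, j)
c = rng.standard_normal(n)                                   # stopping costs c(i)
g_bar = (P * g).sum(axis=1)                                  # expected per-stage cost

def F(Q):
    """(FQ)(i) = sum_j p_ij ( g(i,j) + alpha * min{c(j), Q(j)} ), cf. Eq. (1)."""
    return g_bar + alpha * P @ np.minimum(c, Q)

# Value iteration on Q-factors: Q_{k+1} = F(Q_k) converges to the unique fixed point Q*.
Q = np.zeros(n)
for _ in range(2000):
    Q_next = F(Q)
    if np.max(np.abs(Q_next - Q)) < 1e-12:
        Q = Q_next
        break
    Q = Q_next

J = np.minimum(c, Q)        # optimal cost: J*(i) = min{c(i), Q*(i)}
stop = c <= Q               # optimal policy: stop at i iff c(i) <= Q*(i)
print("Q* =", np.round(Q, 3))
print("stop set =", np.where(stop)[0])
```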

It has been shown by Tsitsiklis and Van Roy [TV99] (see also Bertsekas and Tsitsiklis [BT96]) that $F$ is a contraction with respect to this norm. For purposes of easy reference, we include the proof.

Lemma 1. The mapping $F$ is a contraction with respect to $\|\cdot\|_\pi$, with modulus $\alpha$.

Proof. For any two vectors $Q$ and $\bar Q$, we have

$$\bigl|(FQ)(i) - (F\bar Q)(i)\bigr| \le \alpha \sum_{j=1}^n p_{ij}\Bigl|\min\bigl\{c(j), Q(j)\bigr\} - \min\bigl\{c(j), \bar Q(j)\bigr\}\Bigr| \le \alpha \sum_{j=1}^n p_{ij}\bigl|Q(j) - \bar Q(j)\bigr|,$$

or, in vector notation,

$$|FQ - F\bar Q| \le \alpha P\,|Q - \bar Q|,$$

where $|x|$ denotes a vector whose components are the absolute values of the components of $x$. Hence,

$$\|FQ - F\bar Q\|_\pi \le \alpha\,\bigl\|P\,|Q - \bar Q|\bigr\|_\pi \le \alpha\,\|Q - \bar Q\|_\pi,$$

where the last inequality follows from the relation $\|PJ\|_\pi \le \|J\|_\pi$, which holds for every vector $J$ (see Tsitsiklis and Van Roy [TV97] or Bertsekas and Tsitsiklis [BT96], Lemma 6.4).

We consider Q-factor approximations using a linear approximation architecture

$$\tilde Q(i, r) = \phi(i)' r,$$

where $\phi(i)$ is an $s$-dimensional feature vector associated with state $i$. (In our notation, all vectors are viewed as column vectors, and prime denotes transposition.) We also write the vector $\tilde Q_r = \bigl(\tilde Q(1,r),\ldots,\tilde Q(n,r)\bigr)'$ in the compact form $\tilde Q_r = \Phi r$, where $\Phi$ is the $n \times s$ matrix whose rows are $\phi(i)'$, $i = 1,\ldots,n$. We assume that $\Phi$ has rank $s$, and we denote by $\Pi$ the projection mapping with respect to $\|\cdot\|_\pi$ on the subspace

$$S = \{\Phi r \mid r \in \Re^s\},$$

i.e., for all $J \in \Re^n$,

$$\Pi J = \arg\min_{\hat J \in S} \|J - \hat J\|_\pi.$$

Because $F$ is a contraction with respect to $\|\cdot\|_\pi$ with modulus $\alpha$, and $\Pi$ is nonexpansive, the mapping $\Pi F$ is a contraction with respect to $\|\cdot\|_\pi$ with modulus $\alpha$. Therefore, the mapping $\Pi F$ has a unique fixed point within the subspace $S$, which in view of the rank assumption on $\Phi$ can be uniquely represented as $\Phi r^*$. Thus $r^*$ is the unique solution of the equation

$$\Phi r^* = \Pi F(\Phi r^*).$$

Tsitsiklis and Van Roy [TV99] show that the error of this Q-factor approximation can be bounded by

$$\|\Phi r^* - Q^*\|_\pi \le \frac{1}{\sqrt{1-\alpha^2}}\,\|\Pi Q^* - Q^*\|_\pi.$$
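As a complement, the sketch below (again our own illustration, with synthetic data repeated for self-containment) computes the steady-state distribution $\pi$, the $\pi$-weighted projection $\Pi$ onto the span of an assumed feature matrix $\Phi$, and the fixed point $\Phi r^*$ of $\Pi F$ by exact projected value iteration; it then checks the error bound quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

n, s, alpha = 10, 3, 0.95
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.standard_normal((n, n)); c = rng.standard_normal(n)
g_bar = (P * g).sum(axis=1)
Phi = rng.standard_normal((n, s))                    # rank-s feature matrix (assumed)

# Steady-state distribution: normalized left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi = pi / pi.sum()
D = np.diag(pi)

def F(Q):
    return g_bar + alpha * P @ np.minimum(c, Q)

def project(J):
    """pi-weighted projection of J onto S = {Phi r}."""
    r = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ J)
    return Phi @ r

# Exact Q* (for the bound check) and projected value iteration Phi r_{t+1} = Pi F(Phi r_t).
Q = np.zeros(n)
for _ in range(3000):
    Q = F(Q)
J_proj = np.zeros(n)
for _ in range(3000):
    J_proj = project(F(J_proj))       # converges to Phi r*, the fixed point of Pi F

wnorm = lambda v: np.sqrt(np.sum(pi * v**2))
lhs = wnorm(J_proj - Q)
rhs = wnorm(project(Q) - Q) / np.sqrt(1 - alpha**2)
print(f"||Phi r* - Q*||_pi = {lhs:.4f} <= {rhs:.4f} = bound")
```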

[Figure 1: A conceptual view of projected value iteration (value iterate $F(\Phi r_t)$ followed by projection on $S$, yielding $\Phi r_{t+1}$) and of its simulation-based implementation, least squares policy evaluation (LSPE), which adds a simulation error to the projected iterate. $S$ is the subspace spanned by the basis functions.]

Furthermore, if we implement a policy $\mu$ that stops at state $i$ if and only if $c(i) \le \phi(i)' r^*$, then the cost of this policy, denoted by $J_\mu$, satisfies

$$\sum_{i=1}^n \pi(i)\bigl(J_\mu(i) - J^*(i)\bigr) \le \frac{2}{(1-\alpha)\sqrt{1-\alpha^2}}\,\|\Pi Q^* - Q^*\|_\pi.$$

These bounds indicate that if $Q^*$ is close to the subspace $S$ spanned by the basis functions, then the approximate Q-factor and its associated policy will also be close to the optimal.

The contraction property of $\Pi F$ suggests the fixed point iteration

$$\Phi r_{t+1} = \Pi F(\Phi r_t),$$

which in the related contexts of policy evaluation for discounted and average cost problems (see [BBN03, YB06, Ber07]) is known as projected value iteration [to distinguish it from the value iteration method, which is $Q_{t+1} = F(Q_t)$]; see Fig. 1. This iteration converges to the unique fixed point $\Phi r^*$ of $\Pi F$, but is not easily implemented because the dimension of the vector $F(\Phi r_t)$ is potentially very large. In the policy evaluation context, a simulation-based implementation of the iteration has been proposed, which does not suffer from this difficulty, because it uses simulation samples of the cost of various states in a least-squares type of parametric approximation of the value iteration method. This algorithm is known as least squares policy evaluation (LSPE), and can be conceptually viewed as taking the form

$$\Phi r_{t+1} = \Pi F(\Phi r_t) + \epsilon_t,$$

where $\epsilon_t$ is simulation noise which diminishes to 0 with probability 1 (w.p.1) as $t \to \infty$ (see Fig. 1). The algorithm to be introduced in the next section admits a similar conceptual interpretation, and its analysis has much in common with the analysis given in [BBN03, Ber07] for the case of single-policy evaluation. In fact, if the stopping option was not available [or equivalently, if $c(i)$ is so high that it is never optimal to stop], our Q-learning algorithm would coincide with the LSPE algorithm for approximate evaluation of the discounted cost function of a fixed stationary policy.

Let us also note that LSPE (like the temporal differences method) is actually a family of methods parameterized by a scalar $\lambda \in [0,1]$. Our Q-learning algorithm of the next section corresponds to LSPE(0), the case where $\lambda = 0$; we do not have a convenient Q-learning algorithm that parallels LSPE($\lambda$) for $\lambda > 0$.

3 A Least Squares Q-Learning Algorithm

3.1 Algorithm

We generate a single infinitely long simulation trajectory $(x_0, x_1, \ldots)$ corresponding to an unstopped system, i.e., using the transition probabilities $p_{ij}$.¹ Our algorithm starts with an initial guess $r_0$, and generates a parameter vector sequence $\{r_t\}$. Following the transition $(x_t, x_{t+1})$, we form the following least squares problem at each time $t$,

$$\min_{r \in \Re^s} \sum_{k=0}^{t} \Bigl(\phi(x_k)' r - g(x_k, x_{k+1}) - \alpha \min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}\Bigr)^2, \tag{2}$$

whose solution is

$$\hat r_{t+1} = \Bigl(\sum_{k=0}^{t}\phi(x_k)\phi(x_k)'\Bigr)^{-1}\sum_{k=0}^{t}\phi(x_k)\Bigl(g(x_k, x_{k+1}) + \alpha \min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}\Bigr). \tag{3}$$

Then we set

$$r_{t+1} = r_t + \gamma(\hat r_{t+1} - r_t), \tag{4}$$

where $\gamma$ is some fixed constant stepsize, whose range will be given later.²

This algorithm is related to the LSPE(0) algorithm, which is used for the approximate evaluation of a single stationary policy of a discounted Markovian decision problem, and is analyzed by Bertsekas and Ioffe [BI96], Nedić and Bertsekas [NB03], Bertsekas, Borkar, and Nedić [BBN03], and Yu and Bertsekas [YB06] (see also the recent book by Bertsekas [Ber07], Chapter 6). In particular, if there were no stopping action (or equivalently, if the stopping costs are so large that they are inconsequential), then, for $\gamma = 1$, the algorithm (3) becomes

$$r_{t+1} = \Bigl(\sum_{k=0}^{t}\phi(x_k)\phi(x_k)'\Bigr)^{-1}\sum_{k=0}^{t}\phi(x_k)\bigl(g(x_k, x_{k+1}) + \alpha\,\phi(x_{k+1})' r_t\bigr), \tag{5}$$

and is identical to the LSPE(0) algorithm for evaluating the policy that never stops.

On the other hand, we note that the least squares Q-learning algorithm (3) has much higher computation overhead than the LSPE(0) algorithm (5) for evaluating this policy. In the process of updating $r_t$ via Eq. (3), we can compute the matrix $\frac{1}{t+1}\sum_{k=0}^{t}\phi(x_k)\phi(x_k)'$ and the vector $\frac{1}{t+1}\sum_{k=0}^{t}\phi(x_k)\,g(x_k, x_{k+1})$ iteratively and efficiently, as in Eq. (5). The terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$, however, need to be recomputed for all the samples $x_{k+1}$, $k < t$. Intuitively, this computation corresponds to repartitioning the states into those at which to stop and those at which to continue, based on the current approximate Q-factors $\Phi r_t$. In Section 4, we will discuss how to reduce this extra overhead.

We will prove that the sequence $\{\Phi r_t\}$ generated by the least squares Q-learning algorithm (3) asymptotically converges to the unique fixed point of $\Pi F$. The idea of the proof is to show that the algorithm can be written as the iteration $\Phi r_{t+1} = \hat\Pi_t \hat F_t(\Phi r_t)$, or its damped version, where $\hat\Pi_t$ and $\hat F_t$ approximate $\Pi$ and $F$, respectively, within simulation error that asymptotically diminishes to 0 w.p.1.

¹ Multiple independent infinitely long trajectories can also be used similarly.

² We ignore the issues associated with the invertibility of the matrix in Eq. (3). They can be handled, for example, by adding a small positive multiple of the identity to the matrix if it is not invertible.
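Before turning to the convergence proof, the following sketch (ours; all problem data and parameter choices are assumptions for illustration) implements the simulation-based algorithm (2)-(4) on a synthetic problem. Note how the stopping/continuation terms are recomputed for all past samples at every iteration, which is the overhead discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic problem data (illustrative only).
n, s, alpha, gamma = 10, 3, 0.9, 1.0
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.standard_normal((n, n)); c = rng.standard_normal(n)
Phi = rng.standard_normal((n, s))

T = 5000
x = np.zeros(T + 1, dtype=int)
for t in range(T):                                   # one long unstopped trajectory
    x[t + 1] = rng.choice(n, p=P[x[t]])

r = np.zeros(s)
A = 1e-3 * np.eye(s)                                 # sum_k phi(x_k) phi(x_k)'  (regularized, cf. footnote 2)
b_fixed = np.zeros(s)                                # sum_k phi(x_k) g(x_k, x_{k+1})
for t in range(T):
    phi_t = Phi[x[t]]
    A += np.outer(phi_t, phi_t)
    b_fixed += phi_t * g[x[t], x[t + 1]]
    # Repartitioning: min{c(x_{k+1}), phi(x_{k+1})' r_t} recomputed for all past samples.
    succ = x[1:t + 2]                                # x_1, ..., x_{t+1}
    q_min = np.minimum(c[succ], Phi[succ] @ r)
    b = b_fixed + alpha * (Phi[x[:t + 1]].T @ q_min)
    r_hat = np.linalg.solve(A, b)                    # Eq. (3)
    r = r + gamma * (r_hat - r)                      # Eq. (4)

print("approximate Q-factors Phi r_T:", np.round(Phi @ r, 3))
```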

3.2 Convergence Proof

The iteration (3) can be written equivalently as

$$\hat r_{t+1} = \Bigl(\sum_{i=1}^n \hat\pi_t(i)\phi(i)\phi(i)'\Bigr)^{-1}\sum_{i=1}^n \hat\pi_t(i)\phi(i)\Bigl(\hat g_t(i) + \alpha\sum_{j=1}^n \hat\pi_t(j \mid i)\min\bigl\{c(j),\, \phi(j)' r_t\bigr\}\Bigr),$$

with $\hat\pi_t(i)$ and $\hat\pi_t(j \mid i)$ being the empirical frequencies defined by

$$\hat\pi_t(i) = \frac{\sum_{k=0}^{t}\delta(x_k = i)}{t+1}, \qquad \hat\pi_t(j \mid i) = \frac{\sum_{k=0}^{t}\delta(x_k = i,\, x_{k+1} = j)}{\sum_{k=0}^{t}\delta(x_k = i)},$$

where $\delta(\cdot)$ is the indicator function, and $\hat g_t$ is the empirical mean of the per-stage costs:

$$\hat g_t(i) = \frac{\sum_{k=0}^{t} g(x_k, x_{k+1})\,\delta(x_k = i)}{\sum_{k=0}^{t}\delta(x_k = i)}.$$

[In the case where $\sum_{k=0}^{t}\delta(x_k = i) = 0$, we define $\hat g_t(i) = 0$, $\hat\pi_t(j \mid i) = 0$ by convention.] In a more compact notation,

$$\Phi\hat r_{t+1} = \hat\Pi_t\hat F_t(\Phi r_t), \tag{6}$$

where the mappings $\hat\Pi_t$ and $\hat F_t$ are simulation-based approximations to $\Pi$ and $F$, respectively:

$$\hat\Pi_t = \Phi\bigl(\Phi'\hat D_t\Phi\bigr)^{-1}\Phi'\hat D_t, \qquad \hat F_t(J) = \hat g_t + \alpha\hat P_t\min\{c, J\}, \quad J \in \Re^n,$$

with

$$\hat D_t = \mathrm{diag}\bigl(\ldots, \hat\pi_t(i), \ldots\bigr), \qquad \hat P_t = \bigl[\hat\pi_t(j \mid i)\bigr]_{ij}.$$

With a stepsize $\gamma$, the least squares Q-learning iteration (4) is written as

$$\Phi r_{t+1} = (1-\gamma)\,\Phi r_t + \gamma\,\hat\Pi_t\hat F_t(\Phi r_t). \tag{7}$$

By ergodicity of the Markov chain, we have w.p.1,

$$\hat\pi_t \to \pi, \qquad \hat P_t \to P, \qquad \hat g_t \to \bar g, \qquad \text{as } t \to \infty,$$

where $\bar g$ denotes the expected per-stage cost vector with $\sum_{j=1}^n p_{ij}\,g(i,j)$ as the $i$-th component. For each $t$, denote the invariant distribution of $\hat P_t$ by $\tilde\pi_t$. We now have three distributions, $\pi$, $\hat\pi_t$, $\tilde\pi_t$, which define, respectively, three weighted Euclidean norms, $\|\cdot\|_\pi$, $\|\cdot\|_{\hat\pi_t}$, $\|\cdot\|_{\tilde\pi_t}$. The mappings we consider are non-expansive or contraction mappings with respect to one of these norms. In particular: the mapping $\hat\Pi_t$ is non-expansive with respect to $\|\cdot\|_{\hat\pi_t}$, since $\hat\Pi_t$ is projection with respect to $\|\cdot\|_{\hat\pi_t}$; and the mapping $\hat F_t$ is a contraction, with modulus $\alpha$, with respect to $\|\cdot\|_{\tilde\pi_t}$ (the proof of Lemma 1 can be used to show this).

We have the following facts, each being a consequence of the ones preceding it:

(i) $\hat\pi_t, \tilde\pi_t \to \pi$ w.p.1, as $t \to \infty$.

(ii) For any $\epsilon > 0$ and a sample trajectory with converging sequences $\hat\pi_t, \tilde\pi_t$, there exists a time $\bar t$ such that for all $t \ge \bar t$ and all states $i$,

$$\frac{1}{1+\epsilon} \le \frac{\hat\pi_t(i)}{\pi(i)} \le 1+\epsilon, \qquad \frac{1}{1+\epsilon} \le \frac{\tilde\pi_t(i)}{\pi(i)} \le 1+\epsilon, \qquad \frac{1}{1+\epsilon} \le \frac{\tilde\pi_t(i)}{\hat\pi_t(i)} \le 1+\epsilon.$$

(iii) Under the condition of (ii), for any $J \in \Re^n$, we have

$$\|J\|_\pi \le (1+\epsilon)\,\|J\|_{\hat\pi_t}, \qquad \|J\|_{\hat\pi_t} \le (1+\epsilon)\,\|J\|_{\tilde\pi_t}, \qquad \|J\|_{\tilde\pi_t} \le (1+\epsilon)\,\|J\|_\pi,$$

for all $t$ sufficiently large.

Fact (iii) implies the contraction of $\hat\Pi_t\hat F_t$ with respect to $\|\cdot\|_\pi$, as shown in the following lemma.

Lemma 2. Let $\hat\alpha \in (\alpha, 1)$. Then, w.p.1, $\hat\Pi_t\hat F_t$ is a $\|\cdot\|_\pi$-contraction mapping with modulus $\hat\alpha$ for all $t$ sufficiently large.

Proof. Consider a simulation trajectory from the set of probability 1 for which $\hat P_t \to P$ and $\hat\pi_t, \tilde\pi_t \to \pi$. Fix an $\epsilon > 0$. For any functions $J_1$ and $J_2$, using fact (iii) above and the non-expansiveness and contraction properties of $\hat\Pi_t$ and $\hat F_t$, respectively, we have for $t$ sufficiently large,

$$\begin{aligned}
\|\hat\Pi_t\hat F_t J_1 - \hat\Pi_t\hat F_t J_2\|_\pi &\le (1+\epsilon)\,\|\hat\Pi_t\hat F_t J_1 - \hat\Pi_t\hat F_t J_2\|_{\hat\pi_t} \\
&\le (1+\epsilon)\,\|\hat F_t J_1 - \hat F_t J_2\|_{\hat\pi_t} \\
&\le (1+\epsilon)^2\,\|\hat F_t J_1 - \hat F_t J_2\|_{\tilde\pi_t} \\
&\le (1+\epsilon)^2\,\alpha\,\|J_1 - J_2\|_{\tilde\pi_t} \\
&\le (1+\epsilon)^3\,\alpha\,\|J_1 - J_2\|_\pi.
\end{aligned}$$

Thus, by letting $\epsilon$ be such that $(1+\epsilon)^3\alpha < \hat\alpha < 1$, we see that $\hat\Pi_t\hat F_t$ is a $\|\cdot\|_\pi$-contraction mapping with modulus $\hat\alpha$ for all $t$ sufficiently large.

Proposition 1. For any constant stepsize $\gamma \in \bigl(0, \frac{2}{1+\alpha}\bigr)$, $r_t$ converges to $r^*$ w.p.1, as $t \to \infty$.

Proof. We choose $\bar t$ such that for all $t \ge \bar t$, the contraction property of Lemma 2 applies. We have for such $t$,

$$\begin{aligned}
\|\Phi r_{t+1} - \Phi r^*\|_\pi &= \bigl\|(1-\gamma)\bigl(\Phi r_t - \Phi r^*\bigr) + \gamma\bigl(\hat\Pi_t\hat F_t(\Phi r_t) - \Pi F(\Phi r^*)\bigr)\bigr\|_\pi \\
&\le |1-\gamma|\,\|\Phi r_t - \Phi r^*\|_\pi + \gamma\,\|\hat\Pi_t\hat F_t(\Phi r_t) - \hat\Pi_t\hat F_t(\Phi r^*)\|_\pi + \gamma\,\|\hat\Pi_t\hat F_t(\Phi r^*) - \Pi F(\Phi r^*)\|_\pi \\
&\le \bigl(|1-\gamma| + \gamma\hat\alpha\bigr)\,\|\Phi r_t - \Phi r^*\|_\pi + \gamma\,\epsilon_t, \tag{8}
\end{aligned}$$

where $\epsilon_t = \|\hat\Pi_t\hat F_t(\Phi r^*) - \Pi F(\Phi r^*)\|_\pi$. Because $\|\hat\Pi_t\hat F_t(\Phi r^*) - \Pi F(\Phi r^*)\|_\pi \to 0$, we have $\epsilon_t \to 0$. Thus, for $\gamma \le 1$, since $(1-\gamma) + \gamma\hat\alpha < 1$, it follows that $\Phi r_t \to \Phi r^*$, or equivalently, $r_t \to r^*$, w.p.1. Similarly, based on Eq. (8), in order to have $\|\Phi r_{t+1} - \Phi r^*\|_\pi$ converge to 0 under a stepsize $\gamma > 1$, it is sufficient that $(\gamma - 1) + \gamma\hat\alpha < 1$, or equivalently, $\gamma < \frac{2}{1+\hat\alpha}$. Hence $\Phi r_t$ converges to $\Phi r^*$ for any stepsize $\gamma \in \bigl(0, \frac{2}{1+\alpha}\bigr)$.

Note that the range of stepsizes for which convergence was shown includes $\gamma = 1$.

Remark 1. The convergence of $r_t$ implies that $\Phi r_t$ is bounded w.p.1. Using this fact, we can interpret the iteration of the least squares Q-learning algorithm, with the unit stepsize, for instance, as the deterministic fixed point iteration $\Pi F(\Phi r_t)$ plus an asymptotically diminishing stochastic

disturbance (see Fig. 1). In particular, the difference between $\Pi F(\Phi r_t)$ and the simulation-based fixed point iteration $\Phi r_{t+1} = \hat\Pi_t\hat F_t(\Phi r_t)$ is

$$\hat\Pi_t\hat F_t(\Phi r_t) - \Pi F(\Phi r_t) = \hat\Pi_t\hat g_t - \Pi\bar g + \alpha\bigl(\hat\Pi_t\hat P_t - \Pi P\bigr)\min\{c, \Phi r_t\},$$

and can be bounded by

$$\bigl\|\hat\Pi_t\hat F_t(\Phi r_t) - \Pi F(\Phi r_t)\bigr\| \le \bigl\|(\hat\Pi_t - \Pi)\hat g_t\bigr\| + \bigl\|\Pi(\hat g_t - \bar g)\bigr\| + \alpha\,\bigl\|(\hat\Pi_t\hat P_t - \Pi P)\min\{c, \Phi r_t\}\bigr\|,$$

where $\|\cdot\|$ is any norm. Since $\Phi r_t$ is bounded w.p.1, the bound on the right-hand side (r.h.s.) can be seen to asymptotically diminish to 0.

Remark 2. A slightly different proof of the convergence of $r_t$, reminiscent of the argument used later in Section 5, is to interpret the iteration $\Phi r_{t+1} = \hat\Pi_t\hat F_t(\Phi r_t)$ as $\Pi F(\Phi r_t)$ plus a stochastic disturbance whose magnitude is bounded by $\epsilon_t\bigl(1 + \|\Phi r_t\|\bigr)$, with $\epsilon_t$ asymptotically diminishing to 0. This can be seen from the discussion in the preceding remark. The convergence of $r_t$ can then be established using the contraction property of $\Pi F$. This will result in a shorter proof. However, the line of proof based on Lemma 2 is more insightful and transparent. Furthermore, Lemma 2 is of independent value, and in particular it will be used in the following convergence rate analysis.

3.3 Comparison to an LSTD Analogue

A natural alternative approach to finding $r^*$ that satisfies $\Phi r^* = \Pi F(\Phi r^*)$ is to replace $\Pi$ and $F$ with asymptotically convergent approximations. In particular, let $\tilde r_{t+1}$ be the solution of $\Phi r = \hat\Pi_t\hat F_t(\Phi r)$, i.e.,

$$\Phi\tilde r_{t+1} = \hat\Pi_t\hat F_t(\Phi\tilde r_{t+1}), \qquad t = 0, 1, \ldots.$$

(With probability 1 the solutions exist for $t$ sufficiently large, by Lemma 2.) The conceptual algorithm that generates the sequence $\{\tilde r_t\}$ may be viewed as the analogue of the LSTD method, proposed by Bradtke and Barto [BB96], and further developed by Boyan [Boy99] (see also the text by Bertsekas [Ber07], Chapter 6). For the optimal stopping problem this is not a viable algorithm, because it involves the solution of a nonlinear equation. It is introduced here as a vehicle for interpretation of our least squares Q-learning algorithm (2)-(4). In particular, we note that $\tilde r_{t+1}$ is the solution of the equation

$$\tilde r_{t+1} = \arg\min_{r \in \Re^s}\sum_{k=0}^{t}\Bigl(\phi(x_k)' r - g(x_k, x_{k+1}) - \alpha\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})'\tilde r_{t+1}\bigr\}\Bigr)^2, \tag{9}$$

so it is the fixed point of the arg min mapping in the r.h.s. of the above equation. On the other hand, the least squares Q-learning algorithm (2)-(4), with stepsize $\gamma = 1$, that generates $r_{t+1}$ can be viewed as a single iteration of a fixed point algorithm that aims to find $\tilde r_{t+1}$, starting from $r_t$.

This relation can be quantified further. Using an argument similar to the one used in [YB06] for evaluating the optimal asymptotic convergence rate of LSPE, we will show that with any stepsize in the range $\bigl(0, \frac{2}{1+\alpha}\bigr)$, the LSPE-like update $\Phi r_t$ converges to the LSTD-like update $\Phi\tilde r_t$ asymptotically at the rate of $O(1/t)$ [while we expect both to converge to $\Phi r^*$ at a slower rate $O(1/\sqrt t)$]. In particular, $t\,\|\Phi r_t - \Phi\tilde r_t\|$ is bounded w.p.1, thus $t^\beta\,\|\Phi r_t - \Phi\tilde r_t\|$ converges to zero w.p.1 for any $\beta < 1$. First we prove the following lemma.

Lemma 3. (i) $\tilde r_t \to r^*$ w.p.1, as $t \to \infty$. (ii) $t\,\|\tilde r_{t+1} - \tilde r_t\|$ is bounded w.p.1.

Proof. (i) Let $\hat\alpha \in (\alpha, 1)$. Similar to the proof of Prop. 1, using Lemma 2, we have that w.p.1, for all $t$ sufficiently large,

$$\begin{aligned}
\|\Phi\tilde r_{t+1} - \Phi r^*\|_\pi &= \bigl\|\hat\Pi_t\hat F_t(\Phi\tilde r_{t+1}) - \hat\Pi_t\hat F_t(\Phi r^*) + \hat\Pi_t\hat F_t(\Phi r^*) - \Pi F(\Phi r^*)\bigr\|_\pi \\
&\le \hat\alpha\,\|\Phi\tilde r_{t+1} - \Phi r^*\|_\pi + \|\hat\Pi_t\hat F_t(\Phi r^*) - \Pi F(\Phi r^*)\|_\pi.
\end{aligned}$$

Thus, w.p.1,

$$\|\Phi\tilde r_{t+1} - \Phi r^*\|_\pi \le \frac{1}{1-\hat\alpha}\,\|\hat\Pi_t\hat F_t(\Phi r^*) - \Pi F(\Phi r^*)\|_\pi \to 0, \qquad \text{as } t \to \infty,$$

implying that $\tilde r_t$ is bounded and $\tilde r_t \to r^*$ as $t \to \infty$.

(ii) By applying Lemma 2, we have that w.p.1, for all $t$ sufficiently large,

$$\begin{aligned}
\|\Phi\tilde r_{t+1} - \Phi\tilde r_t\|_\pi &= \bigl\|\hat\Pi_t\hat F_t(\Phi\tilde r_{t+1}) - \hat\Pi_t\hat F_t(\Phi\tilde r_t) + \hat\Pi_t\hat F_t(\Phi\tilde r_t) - \hat\Pi_{t-1}\hat F_{t-1}(\Phi\tilde r_t)\bigr\|_\pi \\
&\le \hat\alpha\,\|\Phi\tilde r_{t+1} - \Phi\tilde r_t\|_\pi + \|\hat\Pi_t\hat F_t(\Phi\tilde r_t) - \hat\Pi_{t-1}\hat F_{t-1}(\Phi\tilde r_t)\|_\pi,
\end{aligned}$$

which implies, by the definition of $\hat F_t$ and $\hat F_{t-1}$, that

$$\|\Phi\tilde r_{t+1} - \Phi\tilde r_t\|_\pi \le \frac{1}{1-\hat\alpha}\Bigl(\|\hat\Pi_t\hat g_t - \hat\Pi_{t-1}\hat g_{t-1}\|_\pi + \|\hat\Pi_t\hat P_t - \hat\Pi_{t-1}\hat P_{t-1}\|_\pi\,\bigl\|\min\{c, \Phi\tilde r_t\}\bigr\|_\pi\Bigr).$$

Evidently (as shown in [YB06]), $\|\hat\Pi_t\hat g_t - \hat\Pi_{t-1}\hat g_{t-1}\|_\pi$ and $\|\hat\Pi_t\hat P_t - \hat\Pi_{t-1}\hat P_{t-1}\|_\pi$ are bounded by $C/t$ for some constant $C$ and all $t$ sufficiently large. By the first part of our proof, $\Phi\tilde r_t$ is bounded. Hence, w.p.1, there exists some sample path-dependent constant $\bar C$ such that for all $t$ sufficiently large,

$$\|\Phi\tilde r_{t+1} - \Phi\tilde r_t\|_\pi \le \frac{\bar C}{t}.$$

Proposition 2. For any constant stepsize $\gamma \in \bigl(0, \frac{2}{1+\alpha}\bigr)$, $t\,\|\Phi r_t - \Phi\tilde r_t\|$ is bounded w.p.1.

Proof. The proof is similar to that in [YB06]. With the chosen stepsize, by Lemma 2,

$$\begin{aligned}
\|\Phi r_{t+1} - \Phi\tilde r_{t+1}\|_\pi &= \bigl\|(1-\gamma)\bigl(\Phi r_t - \Phi\tilde r_{t+1}\bigr) + \gamma\bigl(\hat\Pi_t\hat F_t(\Phi r_t) - \hat\Pi_t\hat F_t(\Phi\tilde r_{t+1})\bigr)\bigr\|_\pi \\
&\le \bar\alpha\,\|\Phi r_t - \Phi\tilde r_{t+1}\|_\pi \\
&\le \bar\alpha\,\|\Phi r_t - \Phi\tilde r_t\|_\pi + \bar\alpha\,\|\Phi\tilde r_{t+1} - \Phi\tilde r_t\|_\pi,
\end{aligned}$$

for some $\bar\alpha < 1$ and all $t$ sufficiently large. Multiplying both sides by $t+1$, using Lemma 3(ii), and defining $\zeta_t = t\,\|\Phi r_t - \Phi\tilde r_t\|_\pi$, we have that w.p.1, for some constant $C$, $\beta < 1$, and time $\bar t$,

$$\zeta_{t+1} \le \beta\,\zeta_t + C, \qquad t \ge \bar t,$$

which implies that

$$\zeta_t \le \beta^{t-\bar t}\,\zeta_{\bar t} + \frac{C}{1-\beta}, \qquad t \ge \bar t.$$

Hence $t\,\|\Phi r_t - \Phi\tilde r_t\|_\pi$ is bounded w.p.1.

4 Variants with Reduced Overhead per Iteration

At each iteration of the least squares Q-learning algorithm (3), (4), while updating $r_t$, it is necessary to recompute the terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$ for all the samples $x_{k+1}$, $k < t$. Intuitively, this corresponds to repartitioning the sampled states into those at which to stop and those at which to continue, based on the most recent approximate Q-factors $\Phi r_t$. In this section we discuss some variants of the algorithm that aim to reduce this computation.

4.1 First Variant

A simple way to reduce the overhead in iteration (3) is to forgo the repartitioning just mentioned. Thus, in this variant we replace the terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$ by $q(x_{k+1}, r_t)$, given by

$$q(x_{k+1}, r_t) = \begin{cases} c(x_{k+1}) & \text{if } k \in K, \\ \phi(x_{k+1})' r_t & \text{if } k \notin K, \end{cases}$$

where

$$K = \bigl\{k \mid c(x_{k+1}) \le \phi(x_{k+1})' r_k\bigr\}$$

is the set of states to stop based on the earlier approximate Q-factors $\Phi r_k$, rather than the most recent approximate Q-factors $\Phi r_t$. In particular, we replace the term $\sum_{k=0}^{t}\phi(x_k)\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$ in Eq. (3) with

$$\sum_{k=0}^{t}\phi(x_k)\,q(x_{k+1}, r_t) = \sum_{k \le t,\; k \in K}\phi(x_k)\,c(x_{k+1}) + \Bigl(\sum_{k \le t,\; k \notin K}\phi(x_k)\,\phi(x_{k+1})'\Bigr) r_t,$$

which can be efficiently updated at each time $t$ (see the sketch after this subsection).

Some other similar variants are possible, which employ a limited form of repartitioning the states into those to stop and those to continue. For example, one may repartition only the sampled states within a time window of the $m$ most recent time periods. In particular, in the preceding calculation, instead of the set $K$, we may use at time $t$ the set

$$K_t = \bigl\{k \mid k \in K_{t-1},\; k < t-m\bigr\} \cup \bigl\{k \mid t-m \le k \le t,\; c(x_{k+1}) \le \phi(x_{k+1})' r_t\bigr\},$$

starting with $K_0 = \{0\}$. Here $m = \infty$ corresponds to the algorithm of the preceding section, while $m = 1$ corresponds to the algorithm of the preceding paragraph. Thus the overhead for repartitioning per iteration is proportional to $m$, and remains bounded.

An important observation is that in the preceding variations, if $r_t$ converges, then asymptotically the terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$ and $q(x_{k+1}, r_t)$ coincide, and it can be seen that the limit of $r_t$ must satisfy the equation $\Phi r = \Pi F(\Phi r)$, so it must be equal to the unique solution $r^*$. However, at present we have no proof of convergence of $r_t$.
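A minimal sketch (our own, on synthetic data) of the first variant with no repartitioning: each sample is classified once, at the time it is generated, using the then-current iterate, and only the running sums below are maintained, so the per-iteration work is independent of $t$.

```python
import numpy as np

rng = np.random.default_rng(1)

n, s, alpha, gamma = 10, 3, 0.9, 1.0
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.standard_normal((n, n)); c = rng.standard_normal(n)
Phi = rng.standard_normal((n, s))

T = 20000
x = np.zeros(T + 1, dtype=int)
for t in range(T):
    x[t + 1] = rng.choice(n, p=P[x[t]])

r = np.zeros(s)
A = 1e-3 * np.eye(s)          # sum_k phi(x_k) phi(x_k)'  (regularized)
b = np.zeros(s)               # sum_k phi(x_k) g(x_k, x_{k+1})
u = np.zeros(s)               # sum over k in K of phi(x_k) c(x_{k+1})
M = np.zeros((s, s))          # sum over k not in K of phi(x_k) phi(x_{k+1})'
for t in range(T):
    phi_t, phi_next = Phi[x[t]], Phi[x[t + 1]]
    A += np.outer(phi_t, phi_t)
    b += phi_t * g[x[t], x[t + 1]]
    # Classify sample k = t once, using the current iterate r_t; the set K is never revisited.
    if c[x[t + 1]] <= phi_next @ r:
        u += phi_t * c[x[t + 1]]          # k joins K: stopping term is frozen
    else:
        M += np.outer(phi_t, phi_next)    # k not in K: continuation term stays linear in r
    r_hat = np.linalg.solve(A, b + alpha * (u + M @ r))
    r = r + gamma * (r_hat - r)

print("first-variant estimate Phi r_T:", np.round(Phi @ r, 3))
```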

4.2 Second Variant

Let us consider another variant, whereby we simply replace the terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$ in the least squares problem (2) with $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_k\bigr\}$. The idea is that for large $k$ and $t$, these two terms may be close enough to each other, so that convergence may still be maintained. Thus we consider the iteration

$$r_{t+1} = \arg\min_{r \in \Re^s}\sum_{k=0}^{t}\Bigl(\phi(x_k)' r - g(x_k, x_{k+1}) - \alpha\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_k\bigr\}\Bigr)^2. \tag{10}$$

This is a special case of an algorithm due to Choi and Van Roy [CV06], as we will discuss shortly. By carrying out the minimization over $r$, we can equivalently write Eq. (10) as

$$r_{t+1} = \frac{1}{t+1}\,B_{t+1}^{-1}\sum_{k=0}^{t}\phi(x_k)\Bigl(g(x_k, x_{k+1}) + \alpha\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_k\bigr\}\Bigr), \tag{11}$$

where we denote

$$B_t = \frac{1}{t}\sum_{k=0}^{t-1}\phi(x_k)\phi(x_k)'.$$

To gain some insight into this iteration, let us rewrite it as follows:

$$\begin{aligned}
r_{t+1} &= \frac{1}{t+1}\,B_{t+1}^{-1}\sum_{k=0}^{t-1}\phi(x_k)\Bigl(g(x_k, x_{k+1}) + \alpha\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_k\bigr\}\Bigr) \\
&\qquad + \frac{1}{t+1}\,B_{t+1}^{-1}\phi(x_t)\Bigl(g(x_t, x_{t+1}) + \alpha\min\bigl\{c(x_{t+1}),\, \phi(x_{t+1})' r_t\bigr\}\Bigr) \\
&= \frac{1}{t+1}\,B_{t+1}^{-1}\,t B_t\, r_t + \frac{1}{t+1}\,B_{t+1}^{-1}\phi(x_t)\Bigl(g(x_t, x_{t+1}) + \alpha\min\bigl\{c(x_{t+1}),\, \phi(x_{t+1})' r_t\bigr\}\Bigr) \\
&= \frac{1}{t+1}\,B_{t+1}^{-1}\bigl(t B_t + \phi(x_t)\phi(x_t)'\bigr) r_t + \frac{1}{t+1}\,B_{t+1}^{-1}\phi(x_t)\Bigl(g(x_t, x_{t+1}) + \alpha\min\bigl\{c(x_{t+1}),\, \phi(x_{t+1})' r_t\bigr\} - \phi(x_t)' r_t\Bigr),
\end{aligned}$$

and finally

$$r_{t+1} = r_t + \frac{1}{t+1}\,B_{t+1}^{-1}\phi(x_t)\Bigl(g(x_t, x_{t+1}) + \alpha\min\bigl\{c(x_{t+1}),\, \phi(x_{t+1})' r_t\bigr\} - \phi(x_t)' r_t\Bigr). \tag{12}$$

This iteration can be shown to converge to $r^*$. However, we will show by example that its rate of convergence can be inferior to the least squares Q-learning algorithm [cf. Eqs. (3)-(4)]. Accordingly, we consider another variant that aims to improve the practical (if not the theoretical) rate of convergence of iteration (10) [or equivalently (12)], and is new to our knowledge. In particular, we introduce a time window of size $m$, and we replace the terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$ in the least squares problem (2) with $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_{l_{k,t}}\bigr\}$, where

$$l_{k,t} = \min\{k+m, t\}.$$

In other words, we consider the algorithm

$$r_{t+1} = \arg\min_{r \in \Re^s}\sum_{k=0}^{t}\Bigl(\phi(x_k)' r - g(x_k, x_{k+1}) - \alpha\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_{l_{k,t}}\bigr\}\Bigr)^2. \tag{13}$$

Thus, at time $t$, the last $m$ terms in the least squares sum are identical to the ones in the corresponding sum for the least squares Q-learning algorithm [cf. Eq. (2)]. The terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_{l_{k,t}}\bigr\}$ remain constant after $m$ updates (when $l_{k,t}$ reaches the value $k+m$), so they do not need to be updated further. Note that in the first $m$ iterations, this iteration is identical to the least squares Q-learning algorithm of Section 3 with unit stepsize. An important issue is the size of $m$. For large $m$, the algorithm approaches the least squares Q-learning algorithm, while for $m = 1$, it is identical to the earlier variant (12).
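The windowed second variant, Eq. (13), can be implemented with bounded per-iteration overhead by freezing each sample's stopping/continuation term once it leaves the window. The sketch below (ours; the problem data and the choice $m = 5$ are assumptions for illustration) shows one way to do this.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(1)

n, s, alpha, m = 10, 3, 0.9, 5          # m: window size in Eq. (13)
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.standard_normal((n, n)); c = rng.standard_normal(n)
Phi = rng.standard_normal((n, s))

T = 20000
x = np.zeros(T + 1, dtype=int)
for t in range(T):
    x[t + 1] = rng.choice(n, p=P[x[t]])

r = np.zeros(s)
A = 1e-3 * np.eye(s)                    # sum_k phi(x_k) phi(x_k)'  (regularized)
b = np.zeros(s)                         # sum_k phi(x_k) g(x_k, x_{k+1})
z = np.zeros(s)                         # frozen terms: sum over k <= t-m of phi(x_k) min{c, phi' r_{k+m}}
window = deque()                        # the last m samples, re-evaluated at the current iterate

for t in range(T):
    phi_t = Phi[x[t]]
    A += np.outer(phi_t, phi_t)
    b += phi_t * g[x[t], x[t + 1]]
    window.append((phi_t, x[t + 1]))
    if len(window) > m:                 # sample k = t-m freezes its term at r_{k+m} = r_t
        phi_k, succ = window.popleft()
        z += phi_k * min(c[succ], Phi[succ] @ r)
    w = sum(phi_k * min(c[succ], Phi[succ] @ r) for phi_k, succ in window)
    r = np.linalg.solve(A, b + alpha * (z + w))      # Eq. (13)

print("windowed second-variant estimate Phi r_T:", np.round(Phi @ r, 3))
```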

4.3 Comparison with Other Algorithms

Let us now consider an algorithm, due to Choi and Van Roy [CV06], and referred to as the fixed point Kalman filter. It applies to more general problems, but when specialized to the optimal stopping problem, it takes the form

$$r_{t+1} = r_t + \gamma_t\,B_{t+1}^{-1}\phi(x_t)\Bigl(g(x_t, x_{t+1}) + \alpha\min\bigl\{c(x_{t+1}),\, \phi(x_{t+1})' r_t\bigr\} - \phi(x_t)' r_t\Bigr), \tag{14}$$

where $\gamma_t$ is a diminishing stepsize. The algorithm is motivated by Kalman filtering ideas, and the recursive least squares method in particular. It can also be viewed as a scaled version (with scaling matrix $B_{t+1}^{-1}$) of the method by Tsitsiklis and Van Roy [TV99], which has the form

$$r_{t+1} = r_t + \gamma_t\,\phi(x_t)\Bigl(g(x_t, x_{t+1}) + \alpha\min\bigl\{c(x_{t+1}),\, \phi(x_{t+1})' r_t\bigr\} - \phi(x_t)' r_t\Bigr). \tag{15}$$

Scaling is believed to be instrumental for enhancing the rate of convergence. It can be seen that when $\gamma_t = 1/(t+1)$, the iterations (12) and (14) coincide. However, the iterations (13) and (14) are different for a window size $m > 1$. As far as we know, the convergence proofs of [TV99] and [CV06] do not extend to iteration (13) or its modification that we will introduce in the next section, in part because of the dependence of $r_{t+1}$ on as many as $t - m$ past iterates through the time window.

The following example provides some insight into the behavior of the various algorithms discussed in this paper.

Example 1. This is a somewhat unusual example, which can be viewed as a simple DP model to estimate the mean of a random variable using a sequence of independent samples. It involves a Markov chain with a single state. At each time period, the cost produced at this state is a random variable taking one of $n$ possible values with equal probability.³ Let $g_k$ be the cost generated at the $k$th transition. The stopping cost is taken to be very high so that the stopping option does not affect the algorithms. We assume that the costs $g_k$ are independent and have zero mean and variance $\sigma^2$. The matrix $\Phi$ is taken to be the scalar 1, so that $\Phi r = r$ and $r^* = 0$.

Then, the least squares Q-learning algorithm of Section 3 with unit stepsize [cf. Eqs. (3) and (4)] takes the form

$$r_{t+1} = \frac{g_1 + \cdots + g_t}{t+1} + \alpha\,r_t. \tag{16}$$

The first variant (Section 4.1) also takes this form, regardless of the method used for repartitioning, since the stopping cost is so high that it does not materially affect the calculations. Since iteration (16) coincides with the LSPE(0) method for this example, the corresponding rate of convergence results apply (see Yu and Bertsekas [YB06]). In particular, as $t \to \infty$, $\sqrt t\, r_t$ converges in distribution to a Gaussian distribution with mean zero and variance $\sigma^2/(1-\alpha)^2$, so that $E\{r_t^2\}$ converges to 0 at the rate $1/t$, i.e., there is a constant $C$ such that

$$t\,E\{r_t^2\} \le C, \qquad t = 0, 1, \ldots.$$

The second variant [Section 4.2, with time window $m = 1$; cf. Eq. (12)] takes the form

$$r_{t+1} = \frac{g_t}{t+1} + \frac{t+\alpha}{t+1}\,r_t. \tag{17}$$

The fixed point Kalman filter algorithm [cf. Eq. (14)] and the Tsitsiklis and Van Roy algorithm [cf. Eq. (15)] are identical, because the scaling matrix $B_{t+1}$ is the scalar 1 in this example. They take the form

$$r_{t+1} = r_t + \gamma_t\bigl(g_t + \alpha r_t - r_t\bigr).$$

For a stepsize $\gamma_t = 1/(t+1)$, they are identical to the second variant (17).

We claim that iteration (17) converges more slowly than iteration (16), and that $t\,E\{r_t^2\} \to \infty$. To this end, we write

$$E\{r_{t+1}^2\} = \Bigl(\frac{t+\alpha}{t+1}\Bigr)^2 E\{r_t^2\} + \frac{\sigma^2}{(t+1)^2}.$$

³ A more conventional but equivalent example can be obtained by introducing states $1, \ldots, n$, one for each possible value of the cost per stage, and transition probabilities $p_{ij} = 1/n$ for all $i, j = 1, \ldots, n$.

Let $\zeta_t = t\,E\{r_t^2\}$. Then

$$\zeta_{t+1} = \frac{(t+\alpha)^2}{t(t+1)}\,\zeta_t + \frac{\sigma^2}{t+1}.$$

From this equation, for $\alpha > 1/2$, we have

$$\zeta_{t+1} \ge \zeta_t + \frac{\sigma^2}{t+1},$$

so $\zeta_t$ tends to $\infty$.

Finally, the variant of Section 4.2 with time window $m > 1$ [cf. Eq. (13)], for $t \ge m$ takes the form

$$r_{t+1} = \frac{g_1 + \cdots + g_t}{t+1} + \alpha\,\frac{r_m + r_{m+1} + \cdots + r_{t-1} + m\,r_t}{t+1}, \qquad t \ge m. \tag{18}$$

For $t < m$, it takes the form

$$r_{t+1} = \frac{g_1 + \cdots + g_t}{t+1} + \alpha\,r_t, \qquad t < m.$$

We may write iteration (18) as

$$r_{t+1} = \frac{g_t}{t+1} + \frac{t+\alpha}{t+1}\,r_t + \alpha\,\frac{m-1}{t+1}\bigl(r_t - r_{t-1}\bigr), \qquad t \ge m,$$

and it can be shown, again similar to iteration (17), that $t\,E\{r_t^2\} \to \infty$. This suggests that the use of $m > 1$ may affect the practical convergence rate of the algorithm, but is unlikely to affect the theoretical convergence rate.
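A quick Monte Carlo sketch (our addition) of the qualitative conclusion of Example 1, with $\sigma^2 = 1$ and $\alpha = 0.9$ assumed: the scaled error $t\,E\{r_t^2\}$ for iteration (16) settles near $\sigma^2/(1-\alpha)^2 = 100$, while for iteration (17) it keeps growing.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, T, runs = 0.9, 20000, 2000
checkpoints = {1000, 5000, 20000}

csum = np.zeros(runs)                   # running sum g_1 + ... + g_t, one per sample path
r16 = np.zeros(runs)                    # iterates of iteration (16)
r17 = np.zeros(runs)                    # iterates of iteration (17)
for t in range(1, T + 1):
    g_t = rng.standard_normal(runs)     # zero-mean, unit-variance cost g_t
    csum += g_t
    r16 = csum / (t + 1) + alpha * r16                     # (16): r_{t+1} = (g_1+...+g_t)/(t+1) + alpha r_t
    r17 = g_t / (t + 1) + (t + alpha) / (t + 1) * r17      # (17): r_{t+1} = g_t/(t+1) + ((t+alpha)/(t+1)) r_t
    if t + 1 in checkpoints:            # after this step the arrays hold r_{t+1}
        print(f"t = {t+1:6d}:   t*E[r_t^2]   (16) ~ {(t+1)*np.mean(r16**2):8.2f}"
              f"   (17) ~ {(t+1)*np.mean(r17**2):8.2f}")
```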

5 Convergence Analysis for Some Variants

In this section, we prove the convergence of the second variant, iteration (13), with a window size $m$. To simplify notation, we define the function $h$ by

$$h(x, r) = \min\bigl\{c(x),\, \phi(x)' r\bigr\},$$

and we write iteration (13) equivalently as

$$r_{t+1} = \frac{1}{t+1}\,B_{t+1}^{-1}\sum_{k=0}^{t}\phi(x_k)\Bigl(g(x_k, x_{k+1}) + \alpha\,h\bigl(x_{k+1},\, r_{l_{k,t}}\bigr)\Bigr), \tag{19}$$

where $l_{k,t} = \min\{k+m, t\}$.

Proposition 3. Let $r_t$ be defined by Eq. (19). Then w.p.1, $r_t \to r^*$ as $t \to \infty$.

In the remainder of this section we provide two alternative proofs. The first proof is based on the o.d.e. (ordinary differential equation) techniques for analyzing stochastic approximation algorithms, and makes use of theorems by Borkar [Bor06] and Borkar and Meyn [BM00], which are also given in the yet unpublished book by Borkar [Bor07] (Chapters 2, 3, and 6). We have adapted the theorems in these sources for our purposes (the subsequent Prop. 4) with the assistance of V. Borkar. This proof is relatively short, but requires familiarity with the intricate methodology of the o.d.e. line of analysis. We have also provided a direct, somewhat longer proof, which does not rely on references to o.d.e.-related sources, although it is in the same spirit as the o.d.e.-based proof. In an earlier version of this report, we used the line of argument of the direct proof to show the result of Prop. 3 with an additional assumption that guaranteed boundedness of the iterates $r_t$. We are indebted to V. Borkar, who gave us the idea and several suggestions regarding the first proof. These suggestions in turn motivated our modification of the second proof to weaken our boundedness assumption.

5.1 A Proof of Proposition 3 Based on O.D.E. Methods

First, we notice that Eq. (19) implies the following relation,

$$\begin{aligned}
(t+1)\,B_{t+1}\,r_{t+1} &= t\,B_t\,r_t + \phi(x_t)\bigl(g(x_t, x_{t+1}) + \alpha\,h(x_{t+1}, r_t)\bigr) + \sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{l_{k,t}}) - h(x_{k+1}, r_{l_{k,t-1}})\bigr) \\
&= \bigl(t\,B_t + \phi(x_t)\phi(x_t)'\bigr)r_t + \phi(x_t)\bigl(g(x_t, x_{t+1}) + \alpha\,h(x_{t+1}, r_t) - \phi(x_t)' r_t\bigr) \\
&\qquad + \sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{l_{k,t}}) - h(x_{k+1}, r_{l_{k,t-1}})\bigr).
\end{aligned}$$

Thus iteration (19) is equivalent to

$$\begin{aligned}
r_{t+1} = r_t &+ \frac{1}{t+1}\,B_{t+1}^{-1}\phi(x_t)\bigl(g(x_t, x_{t+1}) + \alpha\,h(x_{t+1}, r_t) - \phi(x_t)' r_t\bigr) \\
&+ \frac{1}{t+1}\,B_{t+1}^{-1}\sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{l_{k,t}}) - h(x_{k+1}, r_{l_{k,t-1}})\bigr). \tag{20}
\end{aligned}$$

The idea is to reduce iteration (20) to the following form and study its convergence:

$$r_{t+1} = r_t + \frac{1}{t+1}\,B^{-1}\phi(x_t)\bigl(g(x_t, x_{t+1}) + \alpha\,h(x_{t+1}, r_t) - \phi(x_t)' r_t\bigr) + \frac{1}{t+1}\,\omega_t, \tag{21}$$

where $B = \lim_{t\to\infty} B_t$, and $\omega_t$ is a noise sequence. It is worth pointing out that the effect of window size $m > 1$ will be neglected in our convergence analysis, and this does not contradict our favoring $m > 1$ over $m = 1$, because in general the asymptotic convergence rate of the iterations with and without the noise term can differ from each other.

We need the following result from the o.d.e. analysis of stochastic approximation, which only requires a rather weak assumption on the noise term.

A General Convergence Result

Consider the iteration

$$r_{t+1} = r_t + \gamma_t\bigl(H(y_t, r_t) + \omega_t\bigr), \tag{22}$$

where $\gamma_t$ is the stepsize (deterministic or random); $\{y_t\}$ is the state sequence of a Markov process; $H(y, r)$ is a function of $(y, r)$; and $\omega_t$ is the noise sequence. Let the norm $\|\cdot\|$ of $\Re^s$ be any norm. We assume the following.

Assumption 1. The function $H(y, r)$ is Lipschitz continuous in $r$ for all $y$, with the same Lipschitz constant. The stepsize $\gamma_t$ satisfies, w.p.1, $\sum_{t=0}^\infty\gamma_t = \infty$, $\sum_{t=0}^\infty\gamma_t^2 < \infty$, and $\gamma_t \le \gamma_{t-1}$ for all $t$ sufficiently large. The noise $\omega_t$ satisfies

$$\|\omega_t\| \le \epsilon_t\bigl(1 + \|r_t\|\bigr), \qquad \text{w.p.1}, \tag{23}$$

where $\epsilon_t$ is a scalar sequence that converges to 0 w.p.1, as $t \to \infty$.

The convergence of iteration (22) under Assumption 1 can be analyzed based on the analysis in Borkar [Bor06] on averaging of Markov noise, and the stability analysis in Borkar and Meyn [BM00] through the scaling limit o.d.e. We assume that $y_t$ is the state of a finite-state Markov chain. (The analysis of [BM00, Bor06, Bor07] applies to a much more general setting.) For a function $f(y)$, we denote by $E_0\{f(Y)\}$ the

expectation over $Y$ with respect to the invariant distribution of the Markov chain $\{y_t\}$. For a positive scalar $b$, define

$$H_b(y, r) = \frac{H(y, b\,r)}{b}, \qquad H_\infty(y, r) = \lim_{b\to\infty}\frac{H(y, b\,r)}{b},$$

where the existence of the limit defining $H_\infty(y, r)$ for every $(y, r)$ will be part of our assumption. We refer to

$$\dot r = E_0\{H(Y, r)\} \tag{24}$$

as our basic o.d.e. We consider also the scaled version of our basic o.d.e.,

$$\dot r = E_0\{H_b(Y, r)\}, \tag{25}$$

and the scaling limit o.d.e.,

$$\dot r = E_0\{H_\infty(Y, r)\}. \tag{26}$$

Proposition 4. Suppose that $H_\infty$ exists, the origin is an asymptotically stable equilibrium point of the scaling limit o.d.e. (26), and the basic o.d.e. (24) has a unique globally asymptotically stable equilibrium point $r^*$. Suppose also that Assumption 1 holds. Then iteration (22) converges to $r^*$ w.p.1.

Proof. Assume first that $\sup_t\|r_t\| < \infty$ w.p.1. Then the noise sequence $\omega_t$ satisfies $\lim_{t\to\infty}\|\omega_t\| = 0$ w.p.1. Applying the averaging result in [Bor06] (Corollary 3.1 and its preceding remark on noise, p. 44), we have that $r_t$ converges to $r^*$, the unique stable equilibrium of the basic o.d.e. $\dot r = E_0\{H(Y, r)\}$.

To establish the boundedness of $r_t$, we use the scaled o.d.e. (25) and the limit o.d.e. (26), together with the averaging argument of [Bor06]. The proof proceeds in three steps.

(i) For any given positive number $T$, we define time intervals $[t, k_t]$, where

$$k_t = \min\Bigl\{k \;\Big|\; k > t,\; \sum_{j=t}^{k}\gamma_j \ge T\Bigr\}.$$

For every such interval, we consider the scaled sequence

$$\hat r^t_j = \frac{r_j}{b_t}, \qquad j \in [t, k_t], \qquad \text{where } b_t = \max\bigl\{\|r_t\|, 1\bigr\}.$$

Then for $j \in [t, k_t)$,

$$\hat r^t_{j+1} = \hat r^t_j + \gamma_j\bigl(H_{b_t}(y_j, \hat r^t_j) + \omega^t_j\bigr),$$

where $\omega^t_j = \omega_j/b_t$ is the scaled noise and satisfies $\|\omega^t_j\| \le \epsilon_j\bigl(1 + \|\hat r^t_j\|\bigr)$. Using the Lipschitz continuity of $H(y, \cdot)$ and the discrete Gronwall inequality (Lemma 4.3 of [BM00]), we have that $\hat r^t_j$ is bounded on $[t, k_t]$ with the bound independent of $t$. Also, as a consequence, the noise satisfies $\|\omega^t_j\| \le \epsilon_j\,C_T$ with $C_T$ being some constant independent of $t$. These allow us to apply again the averaging analysis in [Bor06] to $\hat r^t_j$, $j \in [t, k_t]$, and obtain our convergence result, as we show next.

(ii) Let $x^t(u)$ be the solution of the scaled o.d.e. $\dot r = E_0\{H_{b_t}(Y, r)\}$ at time $u$ with initial condition $x^t(0) = \hat r^t_t$. Note that $x^t(\cdot)$ is a curve on $\Re^s$ whose argument is the natural continuous time. By applying the analysis in [Bor06] (the proof of Lemma 2.2, Lemma 2.3 itself, and the proof of Lemma 3.1), we have in particular that

$$\lim_{t\to\infty}\bigl\|\hat r^t_{k_t} - x^t(u_t)\bigr\| = 0, \qquad \text{w.p.1}, \tag{27}$$

where $u_t = \sum_{j=t}^{k_t-1}\gamma_j$ and $u_t \in [T, T+1)$ for $t$ sufficiently large.

(iii) Notice that $x^t(0) = \hat r^t_t$ is within the unit ball. Hence, by Lemma 4.4 of [BM00], for suitably chosen $T$, $x^t(u_t)$ would be close to the origin 0, had $b_t$ been sufficiently large, because the origin 0 is the globally asymptotically stable equilibrium point of the scaling limit o.d.e. (26). Thus, using Eq. (27) and applying the analysis of [BM00] (Theorem 2.1(i) and its proof in Section 4.1, together with Lemma 4.6), we can establish that $r_t$ is bounded w.p.1.

Expressing the Second Variant as Iteration (22)

We will show that our iteration (19), equivalently (20), can be written in the form (21) with the noise $\omega_t$ satisfying Eq. (23). Iteration (21) is a special case of iteration (22) with the following identifications. Let $y_t$ be the pair of states $(x_t, x_{t+1})$. For $y = (y_1, y_2)$, let $H(y, r)$ be the mapping from $\Re^s$ to $\Re^s$ defined by

$$H(y, r) = B^{-1}\phi(y_1)\bigl(g(y_1, y_2) + \alpha\,h(y_2, r) - \phi(y_1)' r\bigr).$$

Clearly, $H(y, r)$ is Lipschitz continuous in $r$, uniformly for all $y$. Let the stepsize $\gamma_t$ be $\frac{1}{t+1}$. We consider the associated o.d.e. and verify that they satisfy the corresponding assumptions of Prop. 4. For a function $f(y)$, $E_0\{f(Y)\} = \sum_{i,j}\pi(i)\,p_{ij}\,f(i,j)$.

Consider our basic o.d.e. $\dot r = E_0\{H(Y, r)\}$ and the mapping associated with its r.h.s.,

$$\Phi\,E_0\{H(Y, r)\} = \Pi F(\Phi r) - \Phi r.$$

Since $\Pi F(\Phi r)$ is a contraction mapping, its fixed point $r^*$ is the globally asymptotically stable equilibrium point of the basic o.d.e.

Consider the scaled o.d.e. $\dot r = E_0\{H_b(Y, r)\}$, and the scaling limit o.d.e., $\dot r = E_0\{H_\infty(Y, r)\}$. It is easy to see that $H_\infty$ exists, and

$$\Phi\,E_0\{H_b(Y, r)\} = \Pi\bigl(\bar g/b + \alpha P\min\{c/b, \Phi r\}\bigr) - \Phi r, \qquad \Phi\,E_0\{H_\infty(Y, r)\} = \alpha\,\Pi P\min\{0, \Phi r\} - \Phi r,$$

where $\bar g$ denotes the vector of per-stage costs whose components are defined by $\bar g(i) = \sum_j p_{ij}\,g(i,j)$. Since, by Lemma 1, the mappings $\Pi\bigl(\bar g/b + \alpha P\min\{c/b, \Phi r\}\bigr)$ and $\alpha\,\Pi P\min\{0, \Phi r\}$ are contractions with modulus $\alpha$, the scaled o.d.e. (25) and the scaling limit o.d.e. (26) have globally asymptotically stable equilibrium points, which we denote by $r_b$ and $r_\infty$, respectively. We have, in particular, $r_b = r^*$ for $b = 1$; $r_\infty = 0$, the origin of $\Re^s$; and $r_b$ converges to $r_\infty$ as $b \to \infty$.

We now show that the noise term $\omega_t$ satisfies Eq. (23) when we reduce iteration (20), i.e.,

$$\begin{aligned}
r_{t+1} = r_t &+ \frac{1}{t+1}\,B_{t+1}^{-1}\phi(x_t)\bigl(g(x_t, x_{t+1}) + \alpha\,h(x_{t+1}, r_t) - \phi(x_t)' r_t\bigr) \\
&+ \frac{1}{t+1}\,B_{t+1}^{-1}\sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{l_{k,t}}) - h(x_{k+1}, r_{l_{k,t-1}})\bigr),
\end{aligned}$$

to iteration (21), i.e.,

$$r_{t+1} = r_t + \frac{1}{t+1}\,B^{-1}\phi(x_t)\bigl(g(x_t, x_{t+1}) + \alpha\,h(x_{t+1}, r_t) - \phi(x_t)' r_t\bigr) + \frac{1}{t+1}\,\omega_t.$$

To simplify the notation, in the proofs and discussions we will use $o(1)$ to denote a scalar sequence that converges to 0 w.p.1; we can write Eq. (23), for instance, as $\|\omega_t\| \le o(1)\bigl(1 + \|r_t\|\bigr)$ for short. Since $B_{t+1}^{-1} \to B^{-1}$ w.p.1, as $t \to \infty$, replacing $B_{t+1}^{-1}$ by $B^{-1}$ in the second term of the r.h.s. of Eq. (20) will only introduce a noise term of magnitude $o(1)\bigl(1 + \|r_t\|\bigr)$. We aim to show that the third term of the r.h.s. of Eq. (20) can also be treated as a noise term of magnitude $o(1)\bigl(1 + \|r_t\|\bigr)$.

Since $r_{l_{k,t}}$ and $r_{l_{k,t-1}}$ differ only for the last $m$ values of $k$, for which $r_{l_{k,t}} = r_t$ and $r_{l_{k,t-1}} = r_{t-1}$, it can be seen that for all $t$ sufficiently large,

$$\Bigl\|\frac{1}{t+1}\,B_{t+1}^{-1}\sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{l_{k,t}}) - h(x_{k+1}, r_{l_{k,t-1}})\bigr)\Bigr\| \le \frac{C}{t+1}\,\|r_t - r_{t-1}\| \tag{28}$$

for some constant $C$. Therefore it is sufficient to show that $\|r_t - r_{t-1}\| = o(1)\bigl(1 + \|r_t\|\bigr)$. This is indicated in the following.

Lemma 4. For all $t$ sufficiently large and some path-dependent constant $C$,

$$\|r_{t+1} - r_t\| \le \frac{C}{t+1}\bigl(1 + \|r_t\|\bigr), \qquad \|r_{t+1} - r_t\| \le \frac{C}{t+1}\bigl(1 + \|r_{t+1}\|\bigr), \qquad \text{w.p.1.}$$

Proof. From Eqs. (20) and (28), it can be seen that for all $t \ge t_0$,

$$\|r_{t+1} - r_t\| \le \frac{C_1\|r_t - r_{t-1}\| + C_2\bigl(1 + \|r_t\|\bigr)}{t+1} \tag{29}$$

for some sufficiently large $t_0$ and suitably chosen positive constants $C_1$ and $C_2$. Define $K_1 = C_1 + C_2$ and $a(t, K_1) = 1 - \frac{K_1}{t+1}$. Choose $t_1 \ge t_0$ such that for all $t \ge t_1$, $a(t, K_1)$ is positive and

$$\frac{1}{a(t, K_1)}\,\frac{K_1}{t+1} \le 1, \qquad \frac{1}{a(t, K_1)}\,\frac{C_1}{t+1} \le \frac{1}{2}. \tag{30}$$

This is possible, since $a(t, K_1)$ tends to 1 and the expressions on the l.h.s. decrease to 0 as $t$ increases. For $t = t_1$, Eq. (29) can be written as, using $\|r_{t_1} - r_{t_1-1}\| \le \|r_{t_1}\| + \|r_{t_1-1}\|$,

$$\|r_{t_1+1} - r_{t_1}\| \le \frac{K_1\|r_{t_1}\| + C_1\|r_{t_1-1}\| + C_2}{t_1+1} \le \frac{K_1\|r_{t_1}\| + K_2}{t_1+1}, \tag{31}$$

where $K_2$ is defined, together with some scalar $\zeta \ge C_2$, by $K_2 = C_1\|r_{t_1-1}\| + C_2 + \zeta$, so that in particular $K_2 \ge 2C_2$. Note that in the following argument, $K_1$ and $K_2$ are fixed scalars (they do not depend on $t$). We will show by induction that Eq. (31) holds with $t_1$ replaced by $t$ for all $t \ge t_1$. This will imply the first relation in the lemma with $C = \max\{K_1, K_2\}$.

It is sufficient to verify that if Eq. (31) holds with $t_1$ replaced by some $t \ge t_1$, then it also holds with $t_1$ replaced by $t+1$, which we now set out to do. From the induction hypothesis, using $\|r_t\| \le \|r_t - r_{t+1}\| + \|r_{t+1}\|$, we have

$$\|r_{t+1} - r_t\| \le \frac{K_1}{t+1}\bigl(\|r_t - r_{t+1}\| + \|r_{t+1}\|\bigr) + \frac{K_2}{t+1}.$$

Subtracting $\frac{K_1}{t+1}\|r_t - r_{t+1}\|$ from both sides and dividing by $a(t, K_1)$, we obtain, using the definition of $a(t, K_1)$,

$$\|r_{t+1} - r_t\| \le \frac{1}{a(t, K_1)}\,\frac{K_1}{t+1}\,\|r_{t+1}\| + \frac{1}{a(t, K_1)}\,\frac{K_2}{t+1}. \tag{32}$$

Substituting into the r.h.s. of (29) with $t+1$ in place of $t$, we have

$$\begin{aligned}
\|r_{t+2} - r_{t+1}\| &\le \frac{C_1\|r_{t+1} - r_t\| + C_2\bigl(1 + \|r_{t+1}\|\bigr)}{t+2} \\
&\le \frac{\Bigl(\frac{C_1}{a(t,K_1)}\frac{K_1}{t+1} + C_2\Bigr)\|r_{t+1}\| + \frac{C_1}{a(t,K_1)}\frac{K_2}{t+1} + C_2}{t+2} \\
&\le \frac{K_1\|r_{t+1}\| + K_2}{t+2},
\end{aligned}$$

where the last inequality follows from the conditions (30) and the inequality $K_2 \ge 2C_2$: by (30), the coefficient of $\|r_{t+1}\|$ is at most $C_1 + C_2 = K_1$, while $\frac{C_1}{a(t,K_1)}\frac{K_2}{t+1} + C_2 \le \frac{K_2}{2} + \frac{K_2}{2} = K_2$, by our definitions of $K_2$ and $\zeta$. This completes our induction, showing the first claim.

The second relation of the lemma follows from the same induction: we have that for all $t \ge t_1$, Eq. (32) holds, i.e.,

$$\|r_{t+1} - r_t\| \le \frac{1}{a(t, K_1)}\,\frac{K_1}{t+1}\,\|r_{t+1}\| + \frac{1}{a(t, K_1)}\,\frac{K_2}{t+1}, \qquad t \ge t_1;$$

and since $a(t, K_1) \to 1$ as $t \to \infty$, the claim follows.

This completes the proof of Prop. 3.

5.2 An Alternative Direct Proof of Proposition 3

The method of proof uses a device that is typical of the o.d.e. approach: we define a sequence of times $k_j$, where the length of the interval $[k_j, k_{j+1}]$ is increasing with $j$ at the rate of a geometric progression. We then analyze the progress of the algorithm using a different clock, under which the interval $[k_j, k_{j+1}]$ is treated as a unit time that is long enough for the contraction property of the algorithm to manifest itself and to allow a convergence proof.

In particular, we will argue, roughly speaking, that $r_t$ changes slowly, so that during a considerable amount of time after $t$, namely $[t, t+\delta t]$ for some small scalar $\delta$, the terms $h\bigl(x_{k+1}, r_{l_{k,t}}\bigr)$ are close to $h(x_{k+1}, r_t)$. This will allow us to view $\Phi r_{t+\delta t}$ as the result of a contracting fixed point iteration applied on $\Phi r_t$, plus stochastic noise. Based on this interpretation, we will first show that the subsequence $r_k$ at times $k = (1+\delta)t, (1+\delta)^2 t, \ldots$ comes close to $r^*$ within a distance that is a decreasing function of $\delta$, and we will then show the convergence of the entire sequence.

To be properly viewed as time indices, the real numbers $\delta t$ and $(1+\delta)t$ need to be rounded to the closest integers, but to be concise we will skip this step and treat them as if they were integers. (The error introduced can be absorbed in the $o(1)$ factor.) Also, in various estimates we will simply denote by $\|\cdot\|$ the Euclidean norm $\|\cdot\|_\pi$ on $\Re^n$, except where noted. For convenience, we define the norm $\|\cdot\|$ on the space of $r$ by $\|r\| = \|\Phi r\|_\pi$. Thus $\|r\| = \|\Phi r\|$ in our notation.

An Ergodicity Property

In our convergence proof, we will argue that for a sample trajectory, the empirical state frequencies on the segments $[t, t+\delta t)$ approach the steady-state probabilities, as $t$ increases to $\infty$. This is fairly evident, but for completeness we include the proof.

Lemma 5. Let $\delta$ be a positive number, let $i$ and $j$ be two states such that $p_{ij} > 0$, and let $n_{\delta,t}(i,j)$ be the empirical frequency of a transition $(i,j)$ in the time interval $[t, t+\delta t)$. For any $t > 0$, let $t_k = (1+\delta)^k\,t$, $k = 0, 1, \ldots$. Then

$$\lim_{k\to\infty} n_{\delta,t_k}(i,j) = \pi(i)\,p_{ij}, \qquad \text{w.p.1.}$$

Proof. Fix the transition $(i,j)$. By renewal theory and the central limit theorem, for some constant $C_1$,

$$E\Bigl\{\bigl(n_{\delta,t_k}(i,j) - \pi(i)p_{ij}\bigr)^2\Bigr\} \le \frac{C_1}{\delta\,t_k}.$$

By Chebyshev's inequality,

$$P\Bigl\{\bigl|n_{\delta,t_k}(i,j) - \pi(i)p_{ij}\bigr| \ge \epsilon_k\Bigr\} \le \frac{C_1}{\epsilon_k^2\,\delta\,t_k}.$$

Let $\epsilon_k = t_k^{-\beta/2}$ for some $\beta \in (0,1)$. Thus

$$\lim_{k\to\infty}\epsilon_k = 0, \qquad \epsilon_k^2\,t_k = t_k^{1-\beta}.$$

Define the events $A_k$ by

$$A_k = \Bigl\{\bigl|n_{\delta,t_k}(i,j) - \pi(i)p_{ij}\bigr| \ge \epsilon_k\Bigr\}.$$

Then for some constants $C_2, C_3$,

$$\sum_{k=1}^\infty P\{A_k\} \le C_2\sum_{k=1}^\infty\frac{1}{t_k^{1-\beta}} = C_3\sum_{k=1}^\infty\frac{1}{(1+\delta)^{k(1-\beta)}} < \infty,$$

so by the Borel-Cantelli Lemma, w.p.1, only finitely many events $A_k$ occur, which proves our claim.

Note that although Lemma 5 is stated for one fixed $\delta$ and one fixed $t$, the conclusion holds for a countable number of $\delta$ and $t$ simultaneously. Thus it is valid to choose in the following proof any $t$, and any $\delta$ arbitrarily small, say, from the sequence $\{2^{-j}\}$. For conciseness, we will not mention this explicitly again.

Estimates Within Trajectory Segments of the Form $[t, t+\delta t]$

Consider a sample trajectory from a set of probability 1 for which the assertions in Lemmas 4 and 5 (for a countable number of $\delta$) hold. We can thus omit the statement "w.p.1" in the following analysis. Fix $\delta$, which can be an arbitrarily small positive scalar.

Lemma 6. For $t$ sufficiently large and $k \in [t, t+\delta t]$,

$$\bigl|\phi(i)' r_k - \phi(i)' r_t\bigr| \le \theta_\delta\bigl(1 + \|r_t\|\bigr), \qquad i = 1,\ldots,n, \tag{33}$$

where $\theta_\delta$ is a scalar independent of $t$, and $\theta_\delta \to 0$ as $\delta \to 0$.

Proof. For $m \ge 1$, using $r_{t+m} = r_t + \sum_{j=1}^m\bigl(r_{t+j} - r_{t+j-1}\bigr)$ and using Lemma 4 to bound $\|r_{t+j} - r_{t+j-1}\|$, we can bound $\|r_{t+m} - r_t\|$ for $t$ sufficiently large by

$$\begin{aligned}
\|r_{t+m} - r_t\| &\le \sum_{j=1}^m\frac{C}{t+j}\bigl(1 + \|r_{t+j}\|\bigr) \\
&\le \sum_{j=1}^m\frac{C}{t+j}\bigl(1 + \|r_t\| + \|r_{t+j} - r_t\|\bigr) \\
&\le \sum_{j=1}^m\frac{C}{t+j}\bigl(1 + \|r_t\|\bigr) + \sum_{j=1}^m\frac{C}{t+j}\,\|r_{t+j} - r_t\|
\end{aligned}$$

for some constant $C$, where the second inequality follows from the triangle inequality. By using the version of the discrete Gronwall inequality given as Lemma 4.3(i) of [BM00], we have for $m \le \delta t$,

$$\|r_{t+m} - r_t\| \le \Bigl(\sum_{j=1}^m\frac{C}{t+j}\Bigr)\bigl(1 + \|r_t\|\bigr)\,e^{\sum_{j=1}^m\frac{C}{t+j}} \le \delta C e^{\delta C}\bigl(1 + \|r_t\|\bigr).$$

By the equivalence of norms, there exists $C' > 0$ such that for all $r$, $\|\Phi r\|_\infty \le C'\|\Phi r\|_\pi = C'\|r\|$. The claim then follows by choosing $\theta_\delta = C'\,\delta C e^{\delta C}$, which converges to 0 as $\delta \to 0$.

Since $l_{k,t+\delta t} \in [t, t+\delta t)$ for $k \in [t, t+\delta t)$, inequality (33) implies that for all $k \in [t, t+\delta t)$,

$$\bigl|h\bigl(x_{k+1}, r_{l_{k,t+\delta t}}\bigr) - h(x_{k+1}, r_t)\bigr| \le \theta_\delta\bigl(1 + \|r_t\|\bigr). \tag{34}$$

Using this, we now write $\Phi r_{t+\delta t}$ in terms of $\Phi r_t$, $\Pi F(\Phi r_t)$, and residual terms.

Lemma 7. For $t$ sufficiently large,

$$\Phi r_{t+\delta t} = \frac{1}{1+\delta}\,\Phi r_t + \frac{\delta}{1+\delta}\,\Pi F(\Phi r_t) + e_{\delta,t}, \tag{35}$$

where $e_{\delta,t}$ satisfies

$$\|e_{\delta,t}\| \le \Bigl(\frac{\delta\,\theta_\delta}{1+\delta} + \epsilon_t\Bigr)\bigl(1 + \|\Phi r_t\|\bigr),$$

$\theta_\delta$ is a scalar independent of $t$ such that $\theta_\delta \to 0$ as $\delta \to 0$, and $\epsilon_t$ is a scalar sequence that converges to 0 as $t \to \infty$.

Proof. Assume $t$ is sufficiently large so that $\delta t > m$. We have by definition

$$\Phi r_{t+\delta t} = \frac{1}{t+\delta t}\,\Phi B_{t+\delta t}^{-1}\sum_{k=0}^{t-1}\phi(x_k)\bigl(g(x_k, x_{k+1}) + \alpha\,h(x_{k+1}, r_{k+m})\bigr) + \frac{1}{t+\delta t}\,\Phi B_{t+\delta t}^{-1}\sum_{k=t}^{t+\delta t-1}\phi(x_k)\bigl(g(x_k, x_{k+1}) + \alpha\,h\bigl(x_{k+1}, r_{l_{k,t+\delta t}}\bigr)\bigr). \tag{36}$$

We approximate separately the two terms on the r.h.s. Using an expression of $r_t$ similar to Eq. (19), we write the first term as

$$\frac{1}{t+\delta t}\,\Phi B_{t+\delta t}^{-1}\Bigl(t\,B_t\,r_t + \sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{k+m}) - h\bigl(x_{k+1}, r_{l_{k,t-1}}\bigr)\bigr)\Bigr) = \frac{1}{1+\delta}\,\Phi r_t + \epsilon^1_{\delta,t}, \tag{37}$$

where $\epsilon^1_{\delta,t}$ accounts for replacing $r_{l_{k,t-1}}$ with $r_{l_{k,t+\delta t}} = r_{k+m}$:

$$\epsilon^1_{\delta,t} = \frac{1}{1+\delta}\,\Phi\bigl(B_{t+\delta t}^{-1}B_t - I\bigr)r_t + \frac{1}{t+\delta t}\,\Phi B_{t+\delta t}^{-1}\sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{k+m}) - h\bigl(x_{k+1}, r_{l_{k,t-1}}\bigr)\bigr).$$

As $t \to \infty$, the differences $r_{k+m} - r_t$, $k \in [t-m+1, t-1]$, are bounded by a multiple of $1 + \|r_t\|$ that diminishes to 0 (by Lemma 4 and the proof of Lemma 6); also, $B_{t+\delta t}^{-1}B_t \to I$ as $t \to \infty$, by Lemma 5. Hence, we have

$$\epsilon^1_{\delta,t} = o(1)\bigl(1 + \|r_t\|\bigr). \tag{38}$$

We write the second term of Eq. (36) as

$$\frac{\delta}{1+\delta}\,\Phi\,B_{t+\delta t}^{-1}B^\delta_t\,\bigl(B^\delta_t\bigr)^{-1}\,\frac{1}{\delta t}\sum_{k=t}^{t+\delta t-1}\phi(x_k)\bigl(g(x_k, x_{k+1}) + \alpha\,h(x_{k+1}, r_t)\bigr) + \epsilon^2_{\delta,t}, \tag{39}$$

where

$$B^\delta_t = \frac{1}{\delta t}\sum_{k=t}^{t+\delta t-1}\phi(x_k)\phi(x_k)',$$

and $\epsilon^2_{\delta,t}$ accounts for the residual in substituting $r_t$ for $r_{l_{k,t+\delta t}}$ in $h\bigl(x_{k+1}, r_{l_{k,t+\delta t}}\bigr)$ for $k \in [t, t+\delta t)$. Thus $\epsilon^2_{\delta,t}$ satisfies, by Eq. (34),

$$\|\epsilon^2_{\delta,t}\| \le \frac{\delta}{1+\delta}\,\theta_\delta'\bigl(1 + \|r_t\|\bigr), \tag{40}$$

for some positive scalar $\theta_\delta'$, independent of $t$, which converges to 0 as $\delta \to 0$.

We further approximate the first term of Eq. (39) as follows. Using Lemma 5, we have that as $t \to \infty$, within the segment $[t, t+\delta t)$, the term

$$\Phi\,\bigl(B^\delta_t\bigr)^{-1}\,\frac{1}{\delta t}\sum_{k=t}^{t+\delta t-1}\phi(x_k)\bigl(g(x_k, x_{k+1}) + \alpha\,h(x_{k+1}, r_t)\bigr)$$

converges to $\Pi F(\Phi r_t)$. Using Lemma 5, we also have that $B_{t+\delta t}^{-1}B^\delta_t \to I$ as $t \to \infty$. Thus we can write the first term of Eq. (39) as

$$\frac{\delta}{1+\delta}\,\Pi F(\Phi r_t) + \epsilon^3_{\delta,t}, \tag{41}$$

where $\epsilon^3_{\delta,t}$ accounts for the residual in this approximation, and

$$\epsilon^3_{\delta,t} = o(1)\bigl(1 + \|r_t\|\bigr). \tag{42}$$

Putting Eqs. (37)-(42) together, we can write $\Phi r_{t+\delta t}$ as

$$\Phi r_{t+\delta t} = \frac{1}{1+\delta}\,\Phi r_t + \frac{\delta}{1+\delta}\,\Pi F(\Phi r_t) + \epsilon^1_{\delta,t} + \epsilon^2_{\delta,t} + \epsilon^3_{\delta,t},$$

where

$$\bigl\|\epsilon^1_{\delta,t} + \epsilon^2_{\delta,t} + \epsilon^3_{\delta,t}\bigr\| \le \Bigl(\frac{\delta\,\theta_\delta'}{1+\delta} + o(1)\Bigr)\bigl(1 + \|\Phi r_t\|\bigr). \tag{43}$$

The claim thus follows.

A Contraction Argument

Fix a scalar $\bar\alpha \in (\alpha, 1)$. Let $\theta_\delta$ and $\epsilon_t$ be as in the preceding lemma. Let $\delta$ be such that $\alpha + \theta_\delta < \bar\alpha$, and choose $\beta_\delta$ such that

$$\frac{1 + (\alpha + \theta_\delta)\,\delta}{1+\delta} < \beta_\delta < \frac{1 + \bar\alpha\,\delta}{1+\delta}.$$

Using the contraction property $\|\Pi F(\Phi r_t) - \Phi r^*\| \le \alpha\,\|\Phi r_t - \Phi r^*\|$, the preceding lemma, and the triangle inequality $\|\Phi r_t\| \le \|\Phi r_t - \Phi r^*\| + \|\Phi r^*\|$, we have

$$\begin{aligned}
\|\Phi r_{t+\delta t} - \Phi r^*\| &\le \frac{1 + \alpha\delta}{1+\delta}\,\|\Phi r_t - \Phi r^*\| + \Bigl(\frac{\delta\,\theta_\delta}{1+\delta} + \epsilon_t\Bigr)\bigl(1 + \|\Phi r_t\|\bigr) \\
&\le \Bigl(\frac{1 + (\alpha + \theta_\delta)\,\delta}{1+\delta} + \epsilon_t\Bigr)\|\Phi r_t - \Phi r^*\| + \Bigl(\frac{\delta\,\theta_\delta}{1+\delta} + \epsilon_t\Bigr)\bigl(1 + \|\Phi r^*\|\bigr). \tag{44}
\end{aligned}$$

For any arbitrarily small positive scalar $\epsilon$, we have $\epsilon_t\bigl(1 + \|\Phi r^*\|\bigr) < \epsilon$ and $\frac{1 + (\alpha + \theta_\delta)\delta}{1+\delta} + \epsilon_t < \beta_\delta$ for $t$ sufficiently large (since $\epsilon_t = o(1)$). Hence, inequality (44) implies that for any $\epsilon$, and for some fixed $\bar t$ independent of $\epsilon$, the sequence $\bigl\{\Phi r_{k_j} \mid k_j = (1+\delta)^j\,\bar t\bigr\}$ satisfies

$$\limsup_{j\to\infty}\,\|\Phi r_{k_j} - \Phi r^*\| \le \frac{\epsilon}{1-\beta_\delta} + \frac{\delta\,\theta_\delta}{(1+\delta)(1-\beta_\delta)}\bigl(1 + \|\Phi r^*\|\bigr). \tag{45}$$


More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course Value Iteration: the Idea 1. Let V 0 be any vector in R N A. LAZARIC Reinforcement

More information

DIFFERENTIAL TRAINING OF 1 ROLLOUT POLICIES

DIFFERENTIAL TRAINING OF 1 ROLLOUT POLICIES Appears in Proc. of the 35th Allerton Conference on Communication, Control, and Computing, Allerton Park, Ill., October 1997 DIFFERENTIAL TRAINING OF 1 ROLLOUT POLICIES by Dimitri P. Bertsekas 2 Abstract

More information

Quasi Stochastic Approximation American Control Conference San Francisco, June 2011

Quasi Stochastic Approximation American Control Conference San Francisco, June 2011 Quasi Stochastic Approximation American Control Conference San Francisco, June 2011 Sean P. Meyn Joint work with Darshan Shirodkar and Prashant Mehta Coordinated Science Laboratory and the Department of

More information

Lecture notes for Analysis of Algorithms : Markov decision processes

Lecture notes for Analysis of Algorithms : Markov decision processes Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Preconditioned Temporal Difference Learning

Preconditioned Temporal Difference Learning Hengshuai Yao Zhi-Qiang Liu School of Creative Media, City University of Hong Kong, Hong Kong, China hengshuai@gmail.com zq.liu@cityu.edu.hk Abstract This paper extends many of the recent popular policy

More information

Optimistic Policy Iteration and Q-learning in Dynamic Programming

Optimistic Policy Iteration and Q-learning in Dynamic Programming Optimistic Policy Iteration and Q-learning in Dynamic Programming Dimitri P. Bertsekas Laboratory for Information and Decision Systems Massachusetts Institute of Technology November 2010 INFORMS, Austin,

More information

Markov Decision Processes and Dynamic Programming

Markov Decision Processes and Dynamic Programming Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course In This Lecture A. LAZARIC Markov Decision Processes

More information

Incremental Constraint Projection-Proximal Methods for Nonsmooth Convex Optimization

Incremental Constraint Projection-Proximal Methods for Nonsmooth Convex Optimization Massachusetts Institute of Technology, Cambridge, MA Laboratory for Information and Decision Systems Report LIDS-P-907, July 013 Incremental Constraint Projection-Proximal Methods for Nonsmooth Convex

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

On Average Versus Discounted Reward Temporal-Difference Learning

On Average Versus Discounted Reward Temporal-Difference Learning Machine Learning, 49, 179 191, 2002 c 2002 Kluwer Academic Publishers. Manufactured in The Netherlands. On Average Versus Discounted Reward Temporal-Difference Learning JOHN N. TSITSIKLIS Laboratory for

More information

HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?

HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM? HOW O CHOOSE HE SAE RELEVANCE WEIGH OF HE APPROXIMAE LINEAR PROGRAM? YANN LE ALLEC AND HEOPHANE WEBER Abstract. he linear programming approach to approximate dynamic programming was introduced in [1].

More information

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs 2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter

More information

Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms

Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms Vladislav B. Tadić Abstract The almost sure convergence of two time-scale stochastic approximation algorithms is analyzed under

More information

Minimal Residual Approaches for Policy Evaluation in Large Sparse Markov Chains

Minimal Residual Approaches for Policy Evaluation in Large Sparse Markov Chains Minimal Residual Approaches for Policy Evaluation in Large Sparse Markov Chains Yao Hengshuai Liu Zhiqiang School of Creative Media, City University of Hong Kong at Chee Avenue, Kowloon, Hong Kong, S.A.R.,

More information

arxiv: v1 [math.oc] 9 Oct 2018

arxiv: v1 [math.oc] 9 Oct 2018 A Convex Optimization Approach to Dynamic Programming in Continuous State and Action Spaces Insoon Yang arxiv:1810.03847v1 [math.oc] 9 Oct 2018 Abstract A convex optimization-based method is proposed to

More information

Simulation-Based Optimization of Markov Reward Processes

Simulation-Based Optimization of Markov Reward Processes IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 46, NO. 2, FEBRUARY 2001 191 Simulation-Based Optimization of Markov Reward Processes Peter Marbach John N. Tsitsiklis, Fellow, IEEE Abstract This paper proposes

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

New Error Bounds for Approximations from Projected Linear Equations

New Error Bounds for Approximations from Projected Linear Equations Joint Technical Report of U.H. and M.I.T. Technical Report C-2008-43 Dept. Computer Science University of Helsinki July 2008; revised July 2009 and LIDS Report 2797 Dept. EECS M.I.T. New Error Bounds for

More information

ilstd: Eligibility Traces and Convergence Analysis

ilstd: Eligibility Traces and Convergence Analysis ilstd: Eligibility Traces and Convergence Analysis Alborz Geramifard Michael Bowling Martin Zinkevich Richard S. Sutton Department of Computing Science University of Alberta Edmonton, Alberta {alborz,bowling,maz,sutton}@cs.ualberta.ca

More information

Abstract Dynamic Programming

Abstract Dynamic Programming Abstract Dynamic Programming Dimitri P. Bertsekas Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Overview of the Research Monograph Abstract Dynamic Programming"

More information

arxiv: v1 [cs.lg] 23 Oct 2017

arxiv: v1 [cs.lg] 23 Oct 2017 Accelerated Reinforcement Learning K. Lakshmanan Department of Computer Science and Engineering Indian Institute of Technology (BHU), Varanasi, India Email: lakshmanank.cse@itbhu.ac.in arxiv:1710.08070v1

More information

A System Theoretic Perspective of Learning and Optimization

A System Theoretic Perspective of Learning and Optimization A System Theoretic Perspective of Learning and Optimization Xi-Ren Cao* Hong Kong University of Science and Technology Clear Water Bay, Kowloon, Hong Kong eecao@ee.ust.hk Abstract Learning and optimization

More information

INVERSE FUNCTION THEOREM and SURFACES IN R n

INVERSE FUNCTION THEOREM and SURFACES IN R n INVERSE FUNCTION THEOREM and SURFACES IN R n Let f C k (U; R n ), with U R n open. Assume df(a) GL(R n ), where a U. The Inverse Function Theorem says there is an open neighborhood V U of a in R n so that

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning) A.

More information

THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS

THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS RALPH HOWARD DEPARTMENT OF MATHEMATICS UNIVERSITY OF SOUTH CAROLINA COLUMBIA, S.C. 29208, USA HOWARD@MATH.SC.EDU Abstract. This is an edited version of a

More information

Stability theory is a fundamental topic in mathematics and engineering, that include every

Stability theory is a fundamental topic in mathematics and engineering, that include every Stability Theory Stability theory is a fundamental topic in mathematics and engineering, that include every branches of control theory. For a control system, the least requirement is that the system is

More information

1 The Observability Canonical Form

1 The Observability Canonical Form NONLINEAR OBSERVERS AND SEPARATION PRINCIPLE 1 The Observability Canonical Form In this Chapter we discuss the design of observers for nonlinear systems modelled by equations of the form ẋ = f(x, u) (1)

More information

I forgot to mention last time: in the Ito formula for two standard processes, putting

I forgot to mention last time: in the Ito formula for two standard processes, putting I forgot to mention last time: in the Ito formula for two standard processes, putting dx t = a t dt + b t db t dy t = α t dt + β t db t, and taking f(x, y = xy, one has f x = y, f y = x, and f xx = f yy

More information

Least-Squares λ Policy Iteration: Bias-Variance Trade-off in Control Problems

Least-Squares λ Policy Iteration: Bias-Variance Trade-off in Control Problems : Bias-Variance Trade-off in Control Problems Christophe Thiery Bruno Scherrer LORIA - INRIA Lorraine - Campus Scientifique - BP 239 5456 Vandœuvre-lès-Nancy CEDEX - FRANCE thierych@loria.fr scherrer@loria.fr

More information

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS 63 2.1 Introduction In this chapter we describe the analytical tools used in this thesis. They are Markov Decision Processes(MDP), Markov Renewal process

More information

Linear Least-squares Dyna-style Planning

Linear Least-squares Dyna-style Planning Linear Least-squares Dyna-style Planning Hengshuai Yao Department of Computing Science University of Alberta Edmonton, AB, Canada T6G2E8 hengshua@cs.ualberta.ca Abstract World model is very important for

More information

MATH4406 Assignment 5

MATH4406 Assignment 5 MATH4406 Assignment 5 Patrick Laub (ID: 42051392) October 7, 2014 1 The machine replacement model 1.1 Real-world motivation Consider the machine to be the entire world. Over time the creator has running

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo September 6, 2011 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization

Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Dimitri P. Bertsekas Laboratory for Information and Decision Systems Massachusetts Institute of Technology February 2014

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

Markov Decision Processes and Dynamic Programming

Markov Decision Processes and Dynamic Programming Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course How to model an RL problem The Markov Decision Process

More information

A strongly polynomial algorithm for linear systems having a binary solution

A strongly polynomial algorithm for linear systems having a binary solution A strongly polynomial algorithm for linear systems having a binary solution Sergei Chubanov Institute of Information Systems at the University of Siegen, Germany e-mail: sergei.chubanov@uni-siegen.de 7th

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

On the Approximate Solution of POMDP and the Near-Optimality of Finite-State Controllers

On the Approximate Solution of POMDP and the Near-Optimality of Finite-State Controllers On the Approximate Solution of POMDP and the Near-Optimality of Finite-State Controllers Huizhen (Janey) Yu (janey@mit.edu) Dimitri Bertsekas (dimitrib@mit.edu) Lab for Information and Decision Systems,

More information

Introduction to the Numerical Solution of IVP for ODE

Introduction to the Numerical Solution of IVP for ODE Introduction to the Numerical Solution of IVP for ODE 45 Introduction to the Numerical Solution of IVP for ODE Consider the IVP: DE x = f(t, x), IC x(a) = x a. For simplicity, we will assume here that

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

APPENDIX C: Measure Theoretic Issues

APPENDIX C: Measure Theoretic Issues APPENDIX C: Measure Theoretic Issues A general theory of stochastic dynamic programming must deal with the formidable mathematical questions that arise from the presence of uncountable probability spaces.

More information

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Markov Chains and Stochastic Sampling

Markov Chains and Stochastic Sampling Part I Markov Chains and Stochastic Sampling 1 Markov Chains and Random Walks on Graphs 1.1 Structure of Finite Markov Chains We shall only consider Markov chains with a finite, but usually very large,

More information

Infinite-Horizon Dynamic Programming

Infinite-Horizon Dynamic Programming 1/70 Infinite-Horizon Dynamic Programming http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: this slides is based on Prof. Mengdi Wang s and Prof. Dimitri Bertsekas lecture notes 2/70 作业

More information

Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate

Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate www.scichina.com info.scichina.com www.springerlin.com Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate WEI Chen & CHEN ZongJi School of Automation

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

Notes on Measure Theory and Markov Processes

Notes on Measure Theory and Markov Processes Notes on Measure Theory and Markov Processes Diego Daruich March 28, 2014 1 Preliminaries 1.1 Motivation The objective of these notes will be to develop tools from measure theory and probability to allow

More information

Approximate Optimal-Value Functions. Satinder P. Singh Richard C. Yee. University of Massachusetts.

Approximate Optimal-Value Functions. Satinder P. Singh Richard C. Yee. University of Massachusetts. An Upper Bound on the oss from Approximate Optimal-Value Functions Satinder P. Singh Richard C. Yee Department of Computer Science University of Massachusetts Amherst, MA 01003 singh@cs.umass.edu, yee@cs.umass.edu

More information

The convergence limit of the temporal difference learning

The convergence limit of the temporal difference learning The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector

More information

INTRODUCTION TO MARKOV DECISION PROCESSES

INTRODUCTION TO MARKOV DECISION PROCESSES INTRODUCTION TO MARKOV DECISION PROCESSES Balázs Csanád Csáji Research Fellow, The University of Melbourne Signals & Systems Colloquium, 29 April 2010 Department of Electrical and Electronic Engineering,

More information

Notes on Tabular Methods

Notes on Tabular Methods Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from

More information

Observer design for a general class of triangular systems

Observer design for a general class of triangular systems 1st International Symposium on Mathematical Theory of Networks and Systems July 7-11, 014. Observer design for a general class of triangular systems Dimitris Boskos 1 John Tsinias Abstract The paper deals

More information

NORMS ON SPACE OF MATRICES

NORMS ON SPACE OF MATRICES NORMS ON SPACE OF MATRICES. Operator Norms on Space of linear maps Let A be an n n real matrix and x 0 be a vector in R n. We would like to use the Picard iteration method to solve for the following system

More information

Regular Policies in Abstract Dynamic Programming

Regular Policies in Abstract Dynamic Programming August 2016 (Revised January 2017) Report LIDS-P-3173 Regular Policies in Abstract Dynamic Programming Dimitri P. Bertsekas Abstract We consider challenging dynamic programming models where the associated

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Stochastic Dynamic Programming. Jesus Fernandez-Villaverde University of Pennsylvania

Stochastic Dynamic Programming. Jesus Fernandez-Villaverde University of Pennsylvania Stochastic Dynamic Programming Jesus Fernande-Villaverde University of Pennsylvania 1 Introducing Uncertainty in Dynamic Programming Stochastic dynamic programming presents a very exible framework to handle

More information

5. Solving the Bellman Equation

5. Solving the Bellman Equation 5. Solving the Bellman Equation In the next two lectures, we will look at several methods to solve Bellman s Equation (BE) for the stochastic shortest path problem: Value Iteration, Policy Iteration and

More information

Lecture 12. F o s, (1.1) F t := s>t

Lecture 12. F o s, (1.1) F t := s>t Lecture 12 1 Brownian motion: the Markov property Let C := C(0, ), R) be the space of continuous functions mapping from 0, ) to R, in which a Brownian motion (B t ) t 0 almost surely takes its value. Let

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Q-Learning and Stochastic Approximation

Q-Learning and Stochastic Approximation MS&E338 Reinforcement Learning Lecture 4-04.11.018 Q-Learning and Stochastic Approximation Lecturer: Ben Van Roy Scribe: Christopher Lazarus Javier Sagastuy In this lecture we study the convergence of

More information

Online solution of the average cost Kullback-Leibler optimization problem

Online solution of the average cost Kullback-Leibler optimization problem Online solution of the average cost Kullback-Leibler optimization problem Joris Bierkens Radboud University Nijmegen j.bierkens@science.ru.nl Bert Kappen Radboud University Nijmegen b.kappen@science.ru.nl

More information

1 Stochastic Dynamic Programming

1 Stochastic Dynamic Programming 1 Stochastic Dynamic Programming Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Theorem 5.3. Let E/F, E = F (u), be a simple field extension. Then u is algebraic if and only if E/F is finite. In this case, [E : F ] = deg f u.

Theorem 5.3. Let E/F, E = F (u), be a simple field extension. Then u is algebraic if and only if E/F is finite. In this case, [E : F ] = deg f u. 5. Fields 5.1. Field extensions. Let F E be a subfield of the field E. We also describe this situation by saying that E is an extension field of F, and we write E/F to express this fact. If E/F is a field

More information

Convex Analysis and Optimization Chapter 2 Solutions

Convex Analysis and Optimization Chapter 2 Solutions Convex Analysis and Optimization Chapter 2 Solutions Dimitri P. Bertsekas with Angelia Nedić and Asuman E. Ozdaglar Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com

More information