A Least Squares Q-Learning Algorithm for Optimal Stopping Problems


LIDS REPORT 273
December 2006 (revised June 2007)

A Least Squares Q-Learning Algorithm for Optimal Stopping Problems

Huizhen Yu (janey.yu@cs.helsinki.fi)
Dimitri P. Bertsekas (dimitrib@mit.edu)

Abstract

We consider the solution of discounted optimal stopping problems using linear function approximation methods. A Q-learning algorithm for such problems, proposed by Tsitsiklis and Van Roy, is based on the method of temporal differences and stochastic approximation. We propose alternative algorithms, which are based on projected value iteration ideas and least squares. We prove the convergence of some of these algorithms and discuss their properties.

Huizhen Yu is with HIIT, University of Helsinki, Finland. Dimitri Bertsekas is with the Laboratory for Information and Decision Systems (LIDS), M.I.T., Cambridge, MA 02139.

1 Introduction

Optimal stopping problems are a special case of Markovian decision problems where the system evolves according to a discrete-time stochastic system equation, until an explicit stopping action is taken. At each state, there are two choices: either to stop and incur a state-dependent stopping cost, or to continue and move to a successor state according to some transition probabilities and incur a state-dependent continuation cost. Once the stopping action is taken, no further costs are incurred. The objective is to minimize the expected value of the total discounted cost. Examples are classical problems, such as search and sequential hypothesis testing, as well as recent applications in finance and the pricing of derivative financial instruments (see Tsitsiklis and Van Roy [TV99], Barraquand and Martineau [BM95], Longstaff and Schwartz [LS01]).

The problem can be solved in principle by dynamic programming (DP for short), but we are interested in problems with large state spaces where the DP solution is practically infeasible. It is then natural to consider approximate DP techniques where the optimal cost function or the Q-factors of the problem are approximated with a function from a chosen parametric class. Generally, cost function approximation methods are theoretically sound (i.e., are provably convergent) only for the single-policy case, where the cost function of a fixed stationary policy is evaluated. However, for the stopping problem of this paper, Tsitsiklis and Van Roy [TV99] introduced a linear function approximation to the optimal Q-factors, which they prove to be the unique solution of a projected form of Bellman's equation. While in general this equation may not have a solution, this difficulty does not occur in optimal stopping problems thanks to a critical fact: the mapping defining the Q-factors is a contraction mapping with respect to the weighted Euclidean norm corresponding to the steady-state distribution of the associated Markov chain. For textbook analyses, we refer to Bertsekas and Tsitsiklis [BT96], Section 6.8, and Bertsekas [Ber07], Section 6.4.

The algorithm of Tsitsiklis and Van Roy is based on single trajectory simulation and ideas related to the temporal differences method of Sutton [Sut88], and relies on the contraction property just mentioned. We propose a new algorithm, which is also based on single trajectory simulation and relies on the same contraction property, but uses different algorithmic ideas. It may be viewed as a fixed point iteration for solving the projected Bellman equation, and it relates to the least squares policy evaluation (LSPE) method first proposed by Bertsekas and Ioffe [BI96] and subsequently developed by Nedić and Bertsekas [NB03], Bertsekas, Borkar, and Nedić [BBN03], and Yu and Bertsekas [YB06] (see also the books [BT96] and [Ber07]). We prove the convergence of our method for finite-state models, and we discuss some variants.

The paper is organized as follows. In Section 2, we introduce the optimal stopping problem, and we derive the associated contraction properties of the mapping that defines Q-learning. In Section 3, we describe our LSPE-like algorithm, and we prove its convergence. We also discuss the convergence rate of the algorithm, and we provide a comparison with another algorithm that is related to the least squares temporal differences (LSTD) method, proposed by Bradtke and Barto [BB96], and further developed by Boyan [Boy99].
In Section 4, we describe some variants of the algorithm, which involve a reduced computational overhead per iteration. In this section, we also discuss the relation of our algorithms with the recent algorithm by Choi and Van Roy [CV06], which can be used to solve the same optimal stopping problem. In Section 5, we prove the convergence of some of the variants of Section 4. We give two alternative proofs, the first of which uses results from the o.d.e. (ordinary differential equation) line of convergence analysis of stochastic iterative algorithms, and the second of which is a direct proof reminiscent of the o.d.e. line of analysis.

A computational comparison of our methods with other algorithms for the optimal stopping problem is beyond the scope of the present paper. However, our analysis and the available results using least squares methods (Bradtke and Barto [BB96], Bertsekas and Ioffe [BI96], Boyan [Boy99], Bertsekas, Borkar, and Nedić [BBN03], Choi and Van Roy [CV06]) clearly suggest a superior performance to the algorithm of Tsitsiklis and Van Roy [TV99], and likely an improved convergence rate over the

method of Choi and Van Roy [CV06], at the expense of some additional overhead per iteration.

2 Q-Learning for Optimal Stopping Problems

We are given a Markov chain with state space $\{1,\ldots,n\}$, described by transition probabilities $p_{ij}$. We assume that the states form a single recurrent class, so the chain has a steady-state distribution vector $\pi = (\pi(1),\ldots,\pi(n))$ with $\pi(i) > 0$ for all states $i$. Given the current state $i$, we assume that we have two options: to stop and incur a cost $c(i)$, or to continue and incur a cost $g(i,j)$, where $j$ is the next state (there is no control to affect the corresponding transition probabilities). The problem is to minimize the associated $\alpha$-discounted infinite horizon cost, where $\alpha \in (0,1)$.

For a given state $i$, we associate a Q-factor with each of the two possible decisions. The Q-factor for the decision to stop is equal to $c(i)$. The Q-factor for the decision to continue is denoted by $Q(i)$. The optimal Q-factor for the decision to continue, denoted by $Q^*$, relates to the optimal cost function $J^*$ of the stopping problem by

$$Q^*(i) = \sum_{j=1}^n p_{ij}\bigl(g(i,j) + \alpha J^*(j)\bigr), \qquad i = 1,\ldots,n,$$

and

$$J^*(i) = \min\bigl\{c(i),\, Q^*(i)\bigr\}, \qquad i = 1,\ldots,n.$$

The value $Q^*(i)$ is equal to the cost of choosing to continue at the initial state $i$ and following an optimal policy afterwards. The function $Q^*$ satisfies Bellman's equation

$$Q^*(i) = \sum_{j=1}^n p_{ij}\Bigl(g(i,j) + \alpha \min\bigl\{c(j),\, Q^*(j)\bigr\}\Bigr), \qquad i = 1,\ldots,n. \tag{1}$$

Once the Q-factors $Q^*(i)$ are calculated, an optimal policy can be implemented by stopping at state $i$ if and only if $c(i) \le Q^*(i)$.

The Q-learning algorithm (Watkins [Wat89]) is

$$Q(i) := Q(i) + \gamma\Bigl(g(i,j) + \alpha \min\bigl\{c(j),\, Q(j)\bigr\} - Q(i)\Bigr),$$

where $i$ is the state at which we update the Q-factor, $j$ is a successor state, generated randomly according to the transition probabilities $p_{ij}$, and $\gamma$ is a small positive stepsize, which diminishes to 0 over time. The convergence of this algorithm is addressed by the general theory of Q-learning (see Watkins and Dayan [WD92], and Tsitsiklis [Tsi94]). However, for problems where the number of states $n$ is large, this algorithm is impractical.

Let us now consider the approximate evaluation of $Q^*(i)$. We introduce the mapping $F : \Re^n \mapsto \Re^n$ given by

$$(FQ)(i) = \sum_{j=1}^n p_{ij}\Bigl(g(i,j) + \alpha \min\bigl\{c(j),\, Q(j)\bigr\}\Bigr), \qquad i = 1,\ldots,n.$$

We denote by $FQ$ or $F(Q)$ the vector whose components are $(FQ)(i)$, $i = 1,\ldots,n$. By Eq. (1), the optimal Q-factor for the choice to continue, $Q^*$, is a fixed point of $F$, and it is the unique fixed point because $F$ is a sup-norm contraction mapping. For the approximation considered here, it turns out to be very important that $F$ is also a Euclidean contraction. Let $\|\cdot\|_\pi$ be the weighted Euclidean norm associated with the steady-state probability vector $\pi$, i.e.,

$$\|v\|_\pi^2 = \sum_{i=1}^n \pi(i)\, v(i)^2.$$
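The following small numerical sketch (our addition, not part of the report) makes the mapping $F$ and Bellman's equation (1) concrete: it builds a synthetic stopping problem with made-up data and computes $Q^*$ by iterating $F$, which converges because $F$ is a sup-norm contraction.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small synthetic stopping problem (illustrative data only).
n, alpha = 10, 0.95
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # transition probabilities p_ij
g = rng.standard_normal((n, n))                              # continuation costs g(i, j)
c = rng.standard_normal(n)                                   # stopping costs c(i)
g_bar = (P * g).sum(axis=1)                                  # expected per-stage cost

def F(Q):
    """(FQ)(i) = sum_j p_ij ( g(i,j) + alpha * min{c(j), Q(j)} ), cf. Eq. (1)."""
    return g_bar + alpha * P @ np.minimum(c, Q)

# Value iteration on Q-factors: Q_{k+1} = F(Q_k) converges to the unique fixed point Q*.
Q = np.zeros(n)
for _ in range(2000):
    Q_next = F(Q)
    if np.max(np.abs(Q_next - Q)) < 1e-12:
        Q = Q_next
        break
    Q = Q_next

J = np.minimum(c, Q)        # optimal cost: J*(i) = min{c(i), Q*(i)}
stop = c <= Q               # optimal policy: stop at i iff c(i) <= Q*(i)
print("Q* =", np.round(Q, 3))
print("stop set =", np.where(stop)[0])
```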

It has been shown by Tsitsiklis and Van Roy [TV99] (see also Bertsekas and Tsitsiklis [BT96]) that $F$ is a contraction with respect to this norm. For purposes of easy reference, we include the proof.

Lemma 1. The mapping $F$ is a contraction with respect to $\|\cdot\|_\pi$, with modulus $\alpha$.

Proof. For any two vectors $Q$ and $\bar Q$, we have

$$\bigl|(FQ)(i) - (F\bar Q)(i)\bigr| \le \alpha \sum_{j=1}^n p_{ij}\Bigl|\min\bigl\{c(j), Q(j)\bigr\} - \min\bigl\{c(j), \bar Q(j)\bigr\}\Bigr| \le \alpha \sum_{j=1}^n p_{ij}\bigl|Q(j) - \bar Q(j)\bigr|,$$

or, in vector notation,

$$|FQ - F\bar Q| \le \alpha P\,|Q - \bar Q|,$$

where $|x|$ denotes a vector whose components are the absolute values of the components of $x$. Hence,

$$\|FQ - F\bar Q\|_\pi \le \alpha\,\bigl\|P\,|Q - \bar Q|\bigr\|_\pi \le \alpha\,\|Q - \bar Q\|_\pi,$$

where the last inequality follows from the relation $\|PJ\|_\pi \le \|J\|_\pi$, which holds for every vector $J$ (see Tsitsiklis and Van Roy [TV97] or Bertsekas and Tsitsiklis [BT96], Lemma 6.4).

We consider Q-factor approximations using a linear approximation architecture

$$\tilde Q(i, r) = \phi(i)' r,$$

where $\phi(i)$ is an $s$-dimensional feature vector associated with state $i$. (In our notation, all vectors are viewed as column vectors, and prime denotes transposition.) We also write the vector $\tilde Q_r = \bigl(\tilde Q(1,r),\ldots,\tilde Q(n,r)\bigr)'$ in the compact form $\tilde Q_r = \Phi r$, where $\Phi$ is the $n \times s$ matrix whose rows are $\phi(i)'$, $i = 1,\ldots,n$. We assume that $\Phi$ has rank $s$, and we denote by $\Pi$ the projection mapping with respect to $\|\cdot\|_\pi$ on the subspace

$$S = \{\Phi r \mid r \in \Re^s\},$$

i.e., for all $J \in \Re^n$,

$$\Pi J = \arg\min_{\hat J \in S} \|J - \hat J\|_\pi.$$

Because $F$ is a contraction with respect to $\|\cdot\|_\pi$ with modulus $\alpha$, and $\Pi$ is nonexpansive, the mapping $\Pi F$ is a contraction with respect to $\|\cdot\|_\pi$ with modulus $\alpha$. Therefore, the mapping $\Pi F$ has a unique fixed point within the subspace $S$, which in view of the rank assumption on $\Phi$ can be uniquely represented as $\Phi r^*$. Thus $r^*$ is the unique solution of the equation

$$\Phi r^* = \Pi F(\Phi r^*).$$

Tsitsiklis and Van Roy [TV99] show that the error of this Q-factor approximation can be bounded by

$$\|\Phi r^* - Q^*\|_\pi \le \frac{1}{\sqrt{1-\alpha^2}}\,\|\Pi Q^* - Q^*\|_\pi.$$
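As a complement, the sketch below (again our own illustration, with synthetic data repeated for self-containment) computes the steady-state distribution $\pi$, the $\pi$-weighted projection $\Pi$ onto the span of an assumed feature matrix $\Phi$, and the fixed point $\Phi r^*$ of $\Pi F$ by exact projected value iteration; it then checks the error bound quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

n, s, alpha = 10, 3, 0.95
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.standard_normal((n, n)); c = rng.standard_normal(n)
g_bar = (P * g).sum(axis=1)
Phi = rng.standard_normal((n, s))                    # rank-s feature matrix (assumed)

# Steady-state distribution: normalized left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi = pi / pi.sum()
D = np.diag(pi)

def F(Q):
    return g_bar + alpha * P @ np.minimum(c, Q)

def project(J):
    """pi-weighted projection of J onto S = {Phi r}."""
    r = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ J)
    return Phi @ r

# Exact Q* (for the bound check) and projected value iteration Phi r_{t+1} = Pi F(Phi r_t).
Q = np.zeros(n)
for _ in range(3000):
    Q = F(Q)
J_proj = np.zeros(n)
for _ in range(3000):
    J_proj = project(F(J_proj))       # converges to Phi r*, the fixed point of Pi F

wnorm = lambda v: np.sqrt(np.sum(pi * v**2))
lhs = wnorm(J_proj - Q)
rhs = wnorm(project(Q) - Q) / np.sqrt(1 - alpha**2)
print(f"||Phi r* - Q*||_pi = {lhs:.4f} <= {rhs:.4f} = bound")
```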

[Figure 1: A conceptual view of projected value iteration (value iterate $F(\Phi r_t)$ followed by projection on $S$, yielding $\Phi r_{t+1}$) and of its simulation-based implementation, least squares policy evaluation (LSPE), which adds a simulation error to the projected iterate. $S$ is the subspace spanned by the basis functions.]

Furthermore, if we implement a policy $\mu$ that stops at state $i$ if and only if $c(i) \le \phi(i)' r^*$, then the cost of this policy, denoted by $J_\mu$, satisfies

$$\sum_{i=1}^n \pi(i)\bigl(J_\mu(i) - J^*(i)\bigr) \le \frac{2}{(1-\alpha)\sqrt{1-\alpha^2}}\,\|\Pi Q^* - Q^*\|_\pi.$$

These bounds indicate that if $Q^*$ is close to the subspace $S$ spanned by the basis functions, then the approximate Q-factor and its associated policy will also be close to the optimal.

The contraction property of $\Pi F$ suggests the fixed point iteration

$$\Phi r_{t+1} = \Pi F(\Phi r_t),$$

which in the related contexts of policy evaluation for discounted and average cost problems (see [BBN03, YB06, Ber07]) is known as projected value iteration [to distinguish it from the value iteration method, which is $Q_{t+1} = F(Q_t)$]; see Fig. 1. This iteration converges to the unique fixed point $\Phi r^*$ of $\Pi F$, but is not easily implemented because the dimension of the vector $F(\Phi r_t)$ is potentially very large. In the policy evaluation context, a simulation-based implementation of the iteration has been proposed, which does not suffer from this difficulty, because it uses simulation samples of the cost of various states in a least-squares type of parametric approximation of the value iteration method. This algorithm is known as least squares policy evaluation (LSPE), and can be conceptually viewed as taking the form

$$\Phi r_{t+1} = \Pi F(\Phi r_t) + \epsilon_t,$$

where $\epsilon_t$ is simulation noise which diminishes to 0 with probability 1 (w.p.1) as $t \to \infty$ (see Fig. 1). The algorithm to be introduced in the next section admits a similar conceptual interpretation, and its analysis has much in common with the analysis given in [BBN03, Ber07] for the case of single-policy evaluation. In fact, if the stopping option was not available [or equivalently, if $c(i)$ is so high that it is never optimal to stop], our Q-learning algorithm would coincide with the LSPE algorithm for approximate evaluation of the discounted cost function of a fixed stationary policy.

Let us also note that LSPE (like the temporal differences method) is actually a family of methods parameterized by a scalar $\lambda \in [0,1]$. Our Q-learning algorithm of the next section corresponds to LSPE(0), the case where $\lambda = 0$; we do not have a convenient Q-learning algorithm that parallels LSPE($\lambda$) for $\lambda > 0$.

3 A Least Squares Q-Learning Algorithm

3.1 Algorithm

We generate a single infinitely long simulation trajectory $(x_0, x_1, \ldots)$ corresponding to an unstopped system, i.e., using the transition probabilities $p_{ij}$.¹ Our algorithm starts with an initial guess $r_0$, and generates a parameter vector sequence $\{r_t\}$. Following the transition $(x_t, x_{t+1})$, we form the following least squares problem at each time $t$,

$$\min_{r \in \Re^s} \sum_{k=0}^{t} \Bigl(\phi(x_k)' r - g(x_k, x_{k+1}) - \alpha \min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}\Bigr)^2, \tag{2}$$

whose solution is

$$\hat r_{t+1} = \Bigl(\sum_{k=0}^{t}\phi(x_k)\phi(x_k)'\Bigr)^{-1}\sum_{k=0}^{t}\phi(x_k)\Bigl(g(x_k, x_{k+1}) + \alpha \min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}\Bigr). \tag{3}$$

Then we set

$$r_{t+1} = r_t + \gamma(\hat r_{t+1} - r_t), \tag{4}$$

where $\gamma$ is some fixed constant stepsize, whose range will be given later.²

This algorithm is related to the LSPE(0) algorithm, which is used for the approximate evaluation of a single stationary policy of a discounted Markovian decision problem, and is analyzed by Bertsekas and Ioffe [BI96], Nedić and Bertsekas [NB03], Bertsekas, Borkar, and Nedić [BBN03], and Yu and Bertsekas [YB06] (see also the recent book by Bertsekas [Ber07], Chapter 6). In particular, if there were no stopping action (or equivalently, if the stopping costs are so large that they are inconsequential), then, for $\gamma = 1$, the algorithm (3) becomes

$$r_{t+1} = \Bigl(\sum_{k=0}^{t}\phi(x_k)\phi(x_k)'\Bigr)^{-1}\sum_{k=0}^{t}\phi(x_k)\bigl(g(x_k, x_{k+1}) + \alpha\,\phi(x_{k+1})' r_t\bigr), \tag{5}$$

and is identical to the LSPE(0) algorithm for evaluating the policy that never stops.

On the other hand, we note that the least squares Q-learning algorithm (3) has much higher computation overhead than the LSPE(0) algorithm (5) for evaluating this policy. In the process of updating $r_t$ via Eq. (3), we can compute the matrix $\frac{1}{t+1}\sum_{k=0}^{t}\phi(x_k)\phi(x_k)'$ and the vector $\frac{1}{t+1}\sum_{k=0}^{t}\phi(x_k)\,g(x_k, x_{k+1})$ iteratively and efficiently, as in Eq. (5). The terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$, however, need to be recomputed for all the samples $x_{k+1}$, $k < t$. Intuitively, this computation corresponds to repartitioning the states into those at which to stop and those at which to continue, based on the current approximate Q-factors $\Phi r_t$. In Section 4, we will discuss how to reduce this extra overhead.

We will prove that the sequence $\{\Phi r_t\}$ generated by the least squares Q-learning algorithm (3) asymptotically converges to the unique fixed point of $\Pi F$. The idea of the proof is to show that the algorithm can be written as the iteration $\Phi r_{t+1} = \hat\Pi_t \hat F_t(\Phi r_t)$, or its damped version, where $\hat\Pi_t$ and $\hat F_t$ approximate $\Pi$ and $F$, respectively, within simulation error that asymptotically diminishes to 0 w.p.1.

¹ Multiple independent infinitely long trajectories can also be used similarly.

² We ignore the issues associated with the invertibility of the matrix in Eq. (3). They can be handled, for example, by adding a small positive multiple of the identity to the matrix if it is not invertible.
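Before turning to the convergence proof, the following sketch (ours; all problem data and parameter choices are assumptions for illustration) implements the simulation-based algorithm (2)-(4) on a synthetic problem. Note how the stopping/continuation terms are recomputed for all past samples at every iteration, which is the overhead discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic problem data (illustrative only).
n, s, alpha, gamma = 10, 3, 0.9, 1.0
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.standard_normal((n, n)); c = rng.standard_normal(n)
Phi = rng.standard_normal((n, s))

T = 5000
x = np.zeros(T + 1, dtype=int)
for t in range(T):                                   # one long unstopped trajectory
    x[t + 1] = rng.choice(n, p=P[x[t]])

r = np.zeros(s)
A = 1e-3 * np.eye(s)                                 # sum_k phi(x_k) phi(x_k)'  (regularized, cf. footnote 2)
b_fixed = np.zeros(s)                                # sum_k phi(x_k) g(x_k, x_{k+1})
for t in range(T):
    phi_t = Phi[x[t]]
    A += np.outer(phi_t, phi_t)
    b_fixed += phi_t * g[x[t], x[t + 1]]
    # Repartitioning: min{c(x_{k+1}), phi(x_{k+1})' r_t} recomputed for all past samples.
    succ = x[1:t + 2]                                # x_1, ..., x_{t+1}
    q_min = np.minimum(c[succ], Phi[succ] @ r)
    b = b_fixed + alpha * (Phi[x[:t + 1]].T @ q_min)
    r_hat = np.linalg.solve(A, b)                    # Eq. (3)
    r = r + gamma * (r_hat - r)                      # Eq. (4)

print("approximate Q-factors Phi r_T:", np.round(Phi @ r, 3))
```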

3.2 Convergence Proof

The iteration (3) can be written equivalently as

$$\hat r_{t+1} = \Bigl(\sum_{i=1}^n \hat\pi_t(i)\phi(i)\phi(i)'\Bigr)^{-1}\sum_{i=1}^n \hat\pi_t(i)\phi(i)\Bigl(\hat g_t(i) + \alpha\sum_{j=1}^n \hat\pi_t(j \mid i)\min\bigl\{c(j),\, \phi(j)' r_t\bigr\}\Bigr),$$

with $\hat\pi_t(i)$ and $\hat\pi_t(j \mid i)$ being the empirical frequencies defined by

$$\hat\pi_t(i) = \frac{\sum_{k=0}^{t}\delta(x_k = i)}{t+1}, \qquad \hat\pi_t(j \mid i) = \frac{\sum_{k=0}^{t}\delta(x_k = i,\, x_{k+1} = j)}{\sum_{k=0}^{t}\delta(x_k = i)},$$

where $\delta(\cdot)$ is the indicator function, and $\hat g_t$ is the empirical mean of the per-stage costs:

$$\hat g_t(i) = \frac{\sum_{k=0}^{t} g(x_k, x_{k+1})\,\delta(x_k = i)}{\sum_{k=0}^{t}\delta(x_k = i)}.$$

[In the case where $\sum_{k=0}^{t}\delta(x_k = i) = 0$, we define $\hat g_t(i) = 0$, $\hat\pi_t(j \mid i) = 0$ by convention.] In a more compact notation,

$$\Phi\hat r_{t+1} = \hat\Pi_t\hat F_t(\Phi r_t), \tag{6}$$

where the mappings $\hat\Pi_t$ and $\hat F_t$ are simulation-based approximations to $\Pi$ and $F$, respectively:

$$\hat\Pi_t = \Phi\bigl(\Phi'\hat D_t\Phi\bigr)^{-1}\Phi'\hat D_t, \qquad \hat F_t(J) = \hat g_t + \alpha\hat P_t\min\{c, J\}, \quad J \in \Re^n,$$

with

$$\hat D_t = \mathrm{diag}\bigl(\ldots, \hat\pi_t(i), \ldots\bigr), \qquad \hat P_t = \bigl[\hat\pi_t(j \mid i)\bigr]_{ij}.$$

With a stepsize $\gamma$, the least squares Q-learning iteration (4) is written as

$$\Phi r_{t+1} = (1-\gamma)\,\Phi r_t + \gamma\,\hat\Pi_t\hat F_t(\Phi r_t). \tag{7}$$

By ergodicity of the Markov chain, we have w.p.1,

$$\hat\pi_t \to \pi, \qquad \hat P_t \to P, \qquad \hat g_t \to \bar g, \qquad \text{as } t \to \infty,$$

where $\bar g$ denotes the expected per-stage cost vector with $\sum_{j=1}^n p_{ij}\,g(i,j)$ as the $i$-th component. For each $t$, denote the invariant distribution of $\hat P_t$ by $\tilde\pi_t$. We now have three distributions, $\pi$, $\hat\pi_t$, $\tilde\pi_t$, which define, respectively, three weighted Euclidean norms, $\|\cdot\|_\pi$, $\|\cdot\|_{\hat\pi_t}$, $\|\cdot\|_{\tilde\pi_t}$. The mappings we consider are non-expansive or contraction mappings with respect to one of these norms. In particular: the mapping $\hat\Pi_t$ is non-expansive with respect to $\|\cdot\|_{\hat\pi_t}$, since $\hat\Pi_t$ is projection with respect to $\|\cdot\|_{\hat\pi_t}$; and the mapping $\hat F_t$ is a contraction, with modulus $\alpha$, with respect to $\|\cdot\|_{\tilde\pi_t}$ (the proof of Lemma 1 can be used to show this).

We have the following facts, each being a consequence of the ones preceding it:

(i) $\hat\pi_t, \tilde\pi_t \to \pi$ w.p.1, as $t \to \infty$.

(ii) For any $\epsilon > 0$ and a sample trajectory with converging sequences $\hat\pi_t, \tilde\pi_t$, there exists a time $\bar t$ such that for all $t \ge \bar t$ and all states $i$,

$$\frac{1}{1+\epsilon} \le \frac{\hat\pi_t(i)}{\pi(i)} \le 1+\epsilon, \qquad \frac{1}{1+\epsilon} \le \frac{\tilde\pi_t(i)}{\pi(i)} \le 1+\epsilon, \qquad \frac{1}{1+\epsilon} \le \frac{\tilde\pi_t(i)}{\hat\pi_t(i)} \le 1+\epsilon.$$

(iii) Under the condition of (ii), for any $J \in \Re^n$, we have

$$\|J\|_\pi \le (1+\epsilon)\,\|J\|_{\hat\pi_t}, \qquad \|J\|_{\hat\pi_t} \le (1+\epsilon)\,\|J\|_{\tilde\pi_t}, \qquad \|J\|_{\tilde\pi_t} \le (1+\epsilon)\,\|J\|_\pi,$$

for all $t$ sufficiently large.

Fact (iii) implies the contraction of $\hat\Pi_t\hat F_t$ with respect to $\|\cdot\|_\pi$, as shown in the following lemma.

Lemma 2. Let $\hat\alpha \in (\alpha, 1)$. Then, w.p.1, $\hat\Pi_t\hat F_t$ is a $\|\cdot\|_\pi$-contraction mapping with modulus $\hat\alpha$ for all $t$ sufficiently large.

Proof. Consider a simulation trajectory from the set of probability 1 for which $\hat P_t \to P$ and $\hat\pi_t, \tilde\pi_t \to \pi$. Fix an $\epsilon > 0$. For any functions $J_1$ and $J_2$, using fact (iii) above and the non-expansiveness and contraction properties of $\hat\Pi_t$ and $\hat F_t$, respectively, we have for $t$ sufficiently large,

$$\begin{aligned}
\|\hat\Pi_t\hat F_t J_1 - \hat\Pi_t\hat F_t J_2\|_\pi &\le (1+\epsilon)\,\|\hat\Pi_t\hat F_t J_1 - \hat\Pi_t\hat F_t J_2\|_{\hat\pi_t} \\
&\le (1+\epsilon)\,\|\hat F_t J_1 - \hat F_t J_2\|_{\hat\pi_t} \\
&\le (1+\epsilon)^2\,\|\hat F_t J_1 - \hat F_t J_2\|_{\tilde\pi_t} \\
&\le (1+\epsilon)^2\,\alpha\,\|J_1 - J_2\|_{\tilde\pi_t} \\
&\le (1+\epsilon)^3\,\alpha\,\|J_1 - J_2\|_\pi.
\end{aligned}$$

Thus, by letting $\epsilon$ be such that $(1+\epsilon)^3\alpha < \hat\alpha < 1$, we see that $\hat\Pi_t\hat F_t$ is a $\|\cdot\|_\pi$-contraction mapping with modulus $\hat\alpha$ for all $t$ sufficiently large.

Proposition 1. For any constant stepsize $\gamma \in \bigl(0, \frac{2}{1+\alpha}\bigr)$, $r_t$ converges to $r^*$ w.p.1, as $t \to \infty$.

Proof. We choose $\bar t$ such that for all $t \ge \bar t$, the contraction property of Lemma 2 applies. We have for such $t$,

$$\begin{aligned}
\|\Phi r_{t+1} - \Phi r^*\|_\pi &= \bigl\|(1-\gamma)\bigl(\Phi r_t - \Phi r^*\bigr) + \gamma\bigl(\hat\Pi_t\hat F_t(\Phi r_t) - \Pi F(\Phi r^*)\bigr)\bigr\|_\pi \\
&\le |1-\gamma|\,\|\Phi r_t - \Phi r^*\|_\pi + \gamma\,\|\hat\Pi_t\hat F_t(\Phi r_t) - \hat\Pi_t\hat F_t(\Phi r^*)\|_\pi + \gamma\,\|\hat\Pi_t\hat F_t(\Phi r^*) - \Pi F(\Phi r^*)\|_\pi \\
&\le \bigl(|1-\gamma| + \gamma\hat\alpha\bigr)\,\|\Phi r_t - \Phi r^*\|_\pi + \gamma\,\epsilon_t, \tag{8}
\end{aligned}$$

where $\epsilon_t = \|\hat\Pi_t\hat F_t(\Phi r^*) - \Pi F(\Phi r^*)\|_\pi$. Because $\|\hat\Pi_t\hat F_t(\Phi r^*) - \Pi F(\Phi r^*)\|_\pi \to 0$, we have $\epsilon_t \to 0$. Thus, for $\gamma \le 1$, since $(1-\gamma) + \gamma\hat\alpha < 1$, it follows that $\Phi r_t \to \Phi r^*$, or equivalently, $r_t \to r^*$, w.p.1. Similarly, based on Eq. (8), in order to have $\|\Phi r_{t+1} - \Phi r^*\|_\pi$ converge to 0 under a stepsize $\gamma > 1$, it is sufficient that $(\gamma - 1) + \gamma\hat\alpha < 1$, or equivalently, $\gamma < \frac{2}{1+\hat\alpha}$. Hence $\Phi r_t$ converges to $\Phi r^*$ for any stepsize $\gamma \in \bigl(0, \frac{2}{1+\alpha}\bigr)$.

Note that the range of stepsizes for which convergence was shown includes $\gamma = 1$.

Remark 1. The convergence of $r_t$ implies that $\Phi r_t$ is bounded w.p.1. Using this fact, we can interpret the iteration of the least squares Q-learning algorithm, with the unit stepsize, for instance, as the deterministic fixed point iteration $\Pi F(\Phi r_t)$ plus an asymptotically diminishing stochastic

disturbance (see Fig. 1). In particular, the difference between $\Pi F(\Phi r_t)$ and the simulation-based fixed point iteration $\Phi r_{t+1} = \hat\Pi_t\hat F_t(\Phi r_t)$ is

$$\hat\Pi_t\hat F_t(\Phi r_t) - \Pi F(\Phi r_t) = \hat\Pi_t\hat g_t - \Pi\bar g + \alpha\bigl(\hat\Pi_t\hat P_t - \Pi P\bigr)\min\{c, \Phi r_t\},$$

and can be bounded by

$$\bigl\|\hat\Pi_t\hat F_t(\Phi r_t) - \Pi F(\Phi r_t)\bigr\| \le \bigl\|(\hat\Pi_t - \Pi)\hat g_t\bigr\| + \bigl\|\Pi(\hat g_t - \bar g)\bigr\| + \alpha\,\bigl\|(\hat\Pi_t\hat P_t - \Pi P)\min\{c, \Phi r_t\}\bigr\|,$$

where $\|\cdot\|$ is any norm. Since $\Phi r_t$ is bounded w.p.1, the bound on the right-hand side (r.h.s.) can be seen to asymptotically diminish to 0.

Remark 2. A slightly different proof of the convergence of $r_t$, reminiscent of the argument used later in Section 5, is to interpret the iteration $\Phi r_{t+1} = \hat\Pi_t\hat F_t(\Phi r_t)$ as $\Pi F(\Phi r_t)$ plus a stochastic disturbance whose magnitude is bounded by $\epsilon_t\bigl(1 + \|\Phi r_t\|\bigr)$, with $\epsilon_t$ asymptotically diminishing to 0. This can be seen from the discussion in the preceding remark. The convergence of $r_t$ can then be established using the contraction property of $\Pi F$. This will result in a shorter proof. However, the line of proof based on Lemma 2 is more insightful and transparent. Furthermore, Lemma 2 is of independent value, and in particular it will be used in the following convergence rate analysis.

3.3 Comparison to an LSTD Analogue

A natural alternative approach to finding $r^*$ that satisfies $\Phi r^* = \Pi F(\Phi r^*)$ is to replace $\Pi$ and $F$ with asymptotically convergent approximations. In particular, let $\tilde r_{t+1}$ be the solution of $\Phi r = \hat\Pi_t\hat F_t(\Phi r)$, i.e.,

$$\Phi\tilde r_{t+1} = \hat\Pi_t\hat F_t(\Phi\tilde r_{t+1}), \qquad t = 0, 1, \ldots.$$

(With probability 1 the solutions exist for $t$ sufficiently large, by Lemma 2.) The conceptual algorithm that generates the sequence $\{\tilde r_t\}$ may be viewed as the analogue of the LSTD method, proposed by Bradtke and Barto [BB96], and further developed by Boyan [Boy99] (see also the text by Bertsekas [Ber07], Chapter 6). For the optimal stopping problem this is not a viable algorithm, because it involves the solution of a nonlinear equation. It is introduced here as a vehicle for interpretation of our least squares Q-learning algorithm (2)-(4). In particular, we note that $\tilde r_{t+1}$ is the solution of the equation

$$\tilde r_{t+1} = \arg\min_{r \in \Re^s}\sum_{k=0}^{t}\Bigl(\phi(x_k)' r - g(x_k, x_{k+1}) - \alpha\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})'\tilde r_{t+1}\bigr\}\Bigr)^2, \tag{9}$$

so it is the fixed point of the arg min mapping in the r.h.s. of the above equation. On the other hand, the least squares Q-learning algorithm (2)-(4), with stepsize $\gamma = 1$, that generates $r_{t+1}$ can be viewed as a single iteration of a fixed point algorithm that aims to find $\tilde r_{t+1}$, starting from $r_t$.

This relation can be quantified further. Using an argument similar to the one used in [YB06] for evaluating the optimal asymptotic convergence rate of LSPE, we will show that with any stepsize in the range $\bigl(0, \frac{2}{1+\alpha}\bigr)$, the LSPE-like update $\Phi r_t$ converges to the LSTD-like update $\Phi\tilde r_t$ asymptotically at the rate of $O(1/t)$ [while we expect both to converge to $\Phi r^*$ at a slower rate $O(1/\sqrt t)$]. In particular, $t\,\|\Phi r_t - \Phi\tilde r_t\|$ is bounded w.p.1, thus $t^\beta\,\|\Phi r_t - \Phi\tilde r_t\|$ converges to zero w.p.1 for any $\beta < 1$. First we prove the following lemma.

Lemma 3. (i) $\tilde r_t \to r^*$ w.p.1, as $t \to \infty$. (ii) $t\,\|\tilde r_{t+1} - \tilde r_t\|$ is bounded w.p.1.

Proof. (i) Let $\hat\alpha \in (\alpha, 1)$. Similar to the proof of Prop. 1, using Lemma 2, we have that w.p.1, for all $t$ sufficiently large,

$$\begin{aligned}
\|\Phi\tilde r_{t+1} - \Phi r^*\|_\pi &= \bigl\|\hat\Pi_t\hat F_t(\Phi\tilde r_{t+1}) - \hat\Pi_t\hat F_t(\Phi r^*) + \hat\Pi_t\hat F_t(\Phi r^*) - \Pi F(\Phi r^*)\bigr\|_\pi \\
&\le \hat\alpha\,\|\Phi\tilde r_{t+1} - \Phi r^*\|_\pi + \|\hat\Pi_t\hat F_t(\Phi r^*) - \Pi F(\Phi r^*)\|_\pi.
\end{aligned}$$

Thus, w.p.1,

$$\|\Phi\tilde r_{t+1} - \Phi r^*\|_\pi \le \frac{1}{1-\hat\alpha}\,\|\hat\Pi_t\hat F_t(\Phi r^*) - \Pi F(\Phi r^*)\|_\pi \to 0, \qquad \text{as } t \to \infty,$$

implying that $\tilde r_t$ is bounded and $\tilde r_t \to r^*$ as $t \to \infty$.

(ii) By applying Lemma 2, we have that w.p.1, for all $t$ sufficiently large,

$$\begin{aligned}
\|\Phi\tilde r_{t+1} - \Phi\tilde r_t\|_\pi &= \bigl\|\hat\Pi_t\hat F_t(\Phi\tilde r_{t+1}) - \hat\Pi_t\hat F_t(\Phi\tilde r_t) + \hat\Pi_t\hat F_t(\Phi\tilde r_t) - \hat\Pi_{t-1}\hat F_{t-1}(\Phi\tilde r_t)\bigr\|_\pi \\
&\le \hat\alpha\,\|\Phi\tilde r_{t+1} - \Phi\tilde r_t\|_\pi + \|\hat\Pi_t\hat F_t(\Phi\tilde r_t) - \hat\Pi_{t-1}\hat F_{t-1}(\Phi\tilde r_t)\|_\pi,
\end{aligned}$$

which implies, by the definition of $\hat F_t$ and $\hat F_{t-1}$, that

$$\|\Phi\tilde r_{t+1} - \Phi\tilde r_t\|_\pi \le \frac{1}{1-\hat\alpha}\Bigl(\|\hat\Pi_t\hat g_t - \hat\Pi_{t-1}\hat g_{t-1}\|_\pi + \|\hat\Pi_t\hat P_t - \hat\Pi_{t-1}\hat P_{t-1}\|_\pi\,\bigl\|\min\{c, \Phi\tilde r_t\}\bigr\|_\pi\Bigr).$$

Evidently (as shown in [YB06]), $\|\hat\Pi_t\hat g_t - \hat\Pi_{t-1}\hat g_{t-1}\|_\pi$ and $\|\hat\Pi_t\hat P_t - \hat\Pi_{t-1}\hat P_{t-1}\|_\pi$ are bounded by $C/t$ for some constant $C$ and all $t$ sufficiently large. By the first part of our proof, $\Phi\tilde r_t$ is bounded. Hence, w.p.1, there exists some sample path-dependent constant $\bar C$ such that for all $t$ sufficiently large,

$$\|\Phi\tilde r_{t+1} - \Phi\tilde r_t\|_\pi \le \frac{\bar C}{t}.$$

Proposition 2. For any constant stepsize $\gamma \in \bigl(0, \frac{2}{1+\alpha}\bigr)$, $t\,\|\Phi r_t - \Phi\tilde r_t\|$ is bounded w.p.1.

Proof. The proof is similar to that in [YB06]. With the chosen stepsize, by Lemma 2,

$$\begin{aligned}
\|\Phi r_{t+1} - \Phi\tilde r_{t+1}\|_\pi &= \bigl\|(1-\gamma)\bigl(\Phi r_t - \Phi\tilde r_{t+1}\bigr) + \gamma\bigl(\hat\Pi_t\hat F_t(\Phi r_t) - \hat\Pi_t\hat F_t(\Phi\tilde r_{t+1})\bigr)\bigr\|_\pi \\
&\le \bar\alpha\,\|\Phi r_t - \Phi\tilde r_{t+1}\|_\pi \\
&\le \bar\alpha\,\|\Phi r_t - \Phi\tilde r_t\|_\pi + \bar\alpha\,\|\Phi\tilde r_{t+1} - \Phi\tilde r_t\|_\pi,
\end{aligned}$$

for some $\bar\alpha < 1$ and all $t$ sufficiently large. Multiplying both sides by $t+1$, using Lemma 3(ii), and defining $\zeta_t = t\,\|\Phi r_t - \Phi\tilde r_t\|_\pi$, we have that w.p.1, for some constant $C$, $\beta < 1$, and time $\bar t$,

$$\zeta_{t+1} \le \beta\,\zeta_t + C, \qquad t \ge \bar t,$$

which implies that

$$\zeta_t \le \beta^{t-\bar t}\,\zeta_{\bar t} + \frac{C}{1-\beta}, \qquad t \ge \bar t.$$

Hence $t\,\|\Phi r_t - \Phi\tilde r_t\|_\pi$ is bounded w.p.1.

4 Variants with Reduced Overhead per Iteration

At each iteration of the least squares Q-learning algorithm (3), (4), while updating $r_t$, it is necessary to recompute the terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$ for all the samples $x_{k+1}$, $k < t$. Intuitively, this corresponds to repartitioning the sampled states into those at which to stop and those at which to continue, based on the most recent approximate Q-factors $\Phi r_t$. In this section we discuss some variants of the algorithm that aim to reduce this computation.

4.1 First Variant

A simple way to reduce the overhead in iteration (3) is to forgo the repartitioning just mentioned. Thus, in this variant we replace the terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$ by $q(x_{k+1}, r_t)$, given by

$$q(x_{k+1}, r_t) = \begin{cases} c(x_{k+1}) & \text{if } k \in K, \\ \phi(x_{k+1})' r_t & \text{if } k \notin K, \end{cases}$$

where

$$K = \bigl\{k \mid c(x_{k+1}) \le \phi(x_{k+1})' r_k\bigr\}$$

is the set of states to stop based on the earlier approximate Q-factors $\Phi r_k$, rather than the most recent approximate Q-factors $\Phi r_t$. In particular, we replace the term $\sum_{k=0}^{t}\phi(x_k)\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$ in Eq. (3) with

$$\sum_{k=0}^{t}\phi(x_k)\,q(x_{k+1}, r_t) = \sum_{k \le t,\; k \in K}\phi(x_k)\,c(x_{k+1}) + \Bigl(\sum_{k \le t,\; k \notin K}\phi(x_k)\,\phi(x_{k+1})'\Bigr) r_t,$$

which can be efficiently updated at each time $t$ (see the sketch after this subsection).

Some other similar variants are possible, which employ a limited form of repartitioning the states into those to stop and those to continue. For example, one may repartition only the sampled states within a time window of the $m$ most recent time periods. In particular, in the preceding calculation, instead of the set $K$, we may use at time $t$ the set

$$K_t = \bigl\{k \mid k \in K_{t-1},\; k < t-m\bigr\} \cup \bigl\{k \mid t-m \le k \le t,\; c(x_{k+1}) \le \phi(x_{k+1})' r_t\bigr\},$$

starting with $K_0 = \{0\}$. Here $m = \infty$ corresponds to the algorithm of the preceding section, while $m = 1$ corresponds to the algorithm of the preceding paragraph. Thus the overhead for repartitioning per iteration is proportional to $m$, and remains bounded.

An important observation is that in the preceding variations, if $r_t$ converges, then asymptotically the terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$ and $q(x_{k+1}, r_t)$ coincide, and it can be seen that the limit of $r_t$ must satisfy the equation $\Phi r = \Pi F(\Phi r)$, so it must be equal to the unique solution $r^*$. However, at present we have no proof of convergence of $r_t$.
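A minimal sketch (our own, on synthetic data) of the first variant with no repartitioning: each sample is classified once, at the time it is generated, using the then-current iterate, and only the running sums below are maintained, so the per-iteration work is independent of $t$.

```python
import numpy as np

rng = np.random.default_rng(1)

n, s, alpha, gamma = 10, 3, 0.9, 1.0
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.standard_normal((n, n)); c = rng.standard_normal(n)
Phi = rng.standard_normal((n, s))

T = 20000
x = np.zeros(T + 1, dtype=int)
for t in range(T):
    x[t + 1] = rng.choice(n, p=P[x[t]])

r = np.zeros(s)
A = 1e-3 * np.eye(s)          # sum_k phi(x_k) phi(x_k)'  (regularized)
b = np.zeros(s)               # sum_k phi(x_k) g(x_k, x_{k+1})
u = np.zeros(s)               # sum over k in K of phi(x_k) c(x_{k+1})
M = np.zeros((s, s))          # sum over k not in K of phi(x_k) phi(x_{k+1})'
for t in range(T):
    phi_t, phi_next = Phi[x[t]], Phi[x[t + 1]]
    A += np.outer(phi_t, phi_t)
    b += phi_t * g[x[t], x[t + 1]]
    # Classify sample k = t once, using the current iterate r_t; the set K is never revisited.
    if c[x[t + 1]] <= phi_next @ r:
        u += phi_t * c[x[t + 1]]          # k joins K: stopping term is frozen
    else:
        M += np.outer(phi_t, phi_next)    # k not in K: continuation term stays linear in r
    r_hat = np.linalg.solve(A, b + alpha * (u + M @ r))
    r = r + gamma * (r_hat - r)

print("first-variant estimate Phi r_T:", np.round(Phi @ r, 3))
```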

4.2 Second Variant

Let us consider another variant, whereby we simply replace the terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$ in the least squares problem (2) with $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_k\bigr\}$. The idea is that for large $k$ and $t$, these two terms may be close enough to each other, so that convergence may still be maintained. Thus we consider the iteration

$$r_{t+1} = \arg\min_{r \in \Re^s}\sum_{k=0}^{t}\Bigl(\phi(x_k)' r - g(x_k, x_{k+1}) - \alpha\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_k\bigr\}\Bigr)^2. \tag{10}$$

This is a special case of an algorithm due to Choi and Van Roy [CV06], as we will discuss shortly. By carrying out the minimization over $r$, we can equivalently write Eq. (10) as

$$r_{t+1} = \frac{1}{t+1}\,B_{t+1}^{-1}\sum_{k=0}^{t}\phi(x_k)\Bigl(g(x_k, x_{k+1}) + \alpha\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_k\bigr\}\Bigr), \tag{11}$$

where we denote

$$B_t = \frac{1}{t}\sum_{k=0}^{t-1}\phi(x_k)\phi(x_k)'.$$

To gain some insight into this iteration, let us rewrite it as follows:

$$\begin{aligned}
r_{t+1} &= \frac{1}{t+1}\,B_{t+1}^{-1}\sum_{k=0}^{t-1}\phi(x_k)\Bigl(g(x_k, x_{k+1}) + \alpha\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_k\bigr\}\Bigr) \\
&\qquad + \frac{1}{t+1}\,B_{t+1}^{-1}\phi(x_t)\Bigl(g(x_t, x_{t+1}) + \alpha\min\bigl\{c(x_{t+1}),\, \phi(x_{t+1})' r_t\bigr\}\Bigr) \\
&= \frac{1}{t+1}\,B_{t+1}^{-1}\,t B_t\, r_t + \frac{1}{t+1}\,B_{t+1}^{-1}\phi(x_t)\Bigl(g(x_t, x_{t+1}) + \alpha\min\bigl\{c(x_{t+1}),\, \phi(x_{t+1})' r_t\bigr\}\Bigr) \\
&= \frac{1}{t+1}\,B_{t+1}^{-1}\bigl(t B_t + \phi(x_t)\phi(x_t)'\bigr) r_t + \frac{1}{t+1}\,B_{t+1}^{-1}\phi(x_t)\Bigl(g(x_t, x_{t+1}) + \alpha\min\bigl\{c(x_{t+1}),\, \phi(x_{t+1})' r_t\bigr\} - \phi(x_t)' r_t\Bigr),
\end{aligned}$$

and finally

$$r_{t+1} = r_t + \frac{1}{t+1}\,B_{t+1}^{-1}\phi(x_t)\Bigl(g(x_t, x_{t+1}) + \alpha\min\bigl\{c(x_{t+1}),\, \phi(x_{t+1})' r_t\bigr\} - \phi(x_t)' r_t\Bigr). \tag{12}$$

This iteration can be shown to converge to $r^*$. However, we will show by example that its rate of convergence can be inferior to the least squares Q-learning algorithm [cf. Eqs. (3)-(4)]. Accordingly, we consider another variant that aims to improve the practical (if not the theoretical) rate of convergence of iteration (10) [or equivalently (12)], and is new to our knowledge. In particular, we introduce a time window of size $m$, and we replace the terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_t\bigr\}$ in the least squares problem (2) with $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_{l_{k,t}}\bigr\}$, where

$$l_{k,t} = \min\{k+m, t\}.$$

In other words, we consider the algorithm

$$r_{t+1} = \arg\min_{r \in \Re^s}\sum_{k=0}^{t}\Bigl(\phi(x_k)' r - g(x_k, x_{k+1}) - \alpha\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_{l_{k,t}}\bigr\}\Bigr)^2. \tag{13}$$

Thus, at time $t$, the last $m$ terms in the least squares sum are identical to the ones in the corresponding sum for the least squares Q-learning algorithm [cf. Eq. (2)]. The terms $\min\bigl\{c(x_{k+1}),\, \phi(x_{k+1})' r_{l_{k,t}}\bigr\}$ remain constant after $m$ updates (when $l_{k,t}$ reaches the value $k+m$), so they do not need to be updated further. Note that in the first $m$ iterations, this iteration is identical to the least squares Q-learning algorithm of Section 3 with unit stepsize. An important issue is the size of $m$. For large $m$, the algorithm approaches the least squares Q-learning algorithm, while for $m = 1$, it is identical to the earlier variant (12).
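The windowed second variant, Eq. (13), can be implemented with bounded per-iteration overhead by freezing each sample's stopping/continuation term once it leaves the window. The sketch below (ours; the problem data and the choice $m = 5$ are assumptions for illustration) shows one way to do this.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(1)

n, s, alpha, m = 10, 3, 0.9, 5          # m: window size in Eq. (13)
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.standard_normal((n, n)); c = rng.standard_normal(n)
Phi = rng.standard_normal((n, s))

T = 20000
x = np.zeros(T + 1, dtype=int)
for t in range(T):
    x[t + 1] = rng.choice(n, p=P[x[t]])

r = np.zeros(s)
A = 1e-3 * np.eye(s)                    # sum_k phi(x_k) phi(x_k)'  (regularized)
b = np.zeros(s)                         # sum_k phi(x_k) g(x_k, x_{k+1})
z = np.zeros(s)                         # frozen terms: sum over k <= t-m of phi(x_k) min{c, phi' r_{k+m}}
window = deque()                        # the last m samples, re-evaluated at the current iterate

for t in range(T):
    phi_t = Phi[x[t]]
    A += np.outer(phi_t, phi_t)
    b += phi_t * g[x[t], x[t + 1]]
    window.append((phi_t, x[t + 1]))
    if len(window) > m:                 # sample k = t-m freezes its term at r_{k+m} = r_t
        phi_k, succ = window.popleft()
        z += phi_k * min(c[succ], Phi[succ] @ r)
    w = sum(phi_k * min(c[succ], Phi[succ] @ r) for phi_k, succ in window)
    r = np.linalg.solve(A, b + alpha * (z + w))      # Eq. (13)

print("windowed second-variant estimate Phi r_T:", np.round(Phi @ r, 3))
```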

4.3 Comparison with Other Algorithms

Let us now consider an algorithm, due to Choi and Van Roy [CV06], and referred to as the fixed point Kalman filter. It applies to more general problems, but when specialized to the optimal stopping problem, it takes the form

$$r_{t+1} = r_t + \gamma_t\,B_{t+1}^{-1}\phi(x_t)\Bigl(g(x_t, x_{t+1}) + \alpha\min\bigl\{c(x_{t+1}),\, \phi(x_{t+1})' r_t\bigr\} - \phi(x_t)' r_t\Bigr), \tag{14}$$

where $\gamma_t$ is a diminishing stepsize. The algorithm is motivated by Kalman filtering ideas, and the recursive least squares method in particular. It can also be viewed as a scaled version (with scaling matrix $B_{t+1}^{-1}$) of the method by Tsitsiklis and Van Roy [TV99], which has the form

$$r_{t+1} = r_t + \gamma_t\,\phi(x_t)\Bigl(g(x_t, x_{t+1}) + \alpha\min\bigl\{c(x_{t+1}),\, \phi(x_{t+1})' r_t\bigr\} - \phi(x_t)' r_t\Bigr). \tag{15}$$

Scaling is believed to be instrumental for enhancing the rate of convergence. It can be seen that when $\gamma_t = 1/(t+1)$, the iterations (12) and (14) coincide. However, the iterations (13) and (14) are different for a window size $m > 1$. As far as we know, the convergence proofs of [TV99] and [CV06] do not extend to iteration (13) or its modification that we will introduce in the next section, in part because of the dependence of $r_{t+1}$ on as many as $t - m$ past iterates through the time window.

The following example provides some insight into the behavior of the various algorithms discussed in this paper.

Example 1. This is a somewhat unusual example, which can be viewed as a simple DP model to estimate the mean of a random variable using a sequence of independent samples. It involves a Markov chain with a single state. At each time period, the cost produced at this state is a random variable taking one of $n$ possible values with equal probability.³ Let $g_k$ be the cost generated at the $k$th transition. The stopping cost is taken to be very high so that the stopping option does not affect the algorithms. We assume that the costs $g_k$ are independent and have zero mean and variance $\sigma^2$. The matrix $\Phi$ is taken to be the scalar 1, so that $\Phi r = r$ and $r^* = 0$.

Then, the least squares Q-learning algorithm of Section 3 with unit stepsize [cf. Eqs. (3) and (4)] takes the form

$$r_{t+1} = \frac{g_1 + \cdots + g_t}{t+1} + \alpha\,r_t. \tag{16}$$

The first variant (Section 4.1) also takes this form, regardless of the method used for repartitioning, since the stopping cost is so high that it does not materially affect the calculations. Since iteration (16) coincides with the LSPE(0) method for this example, the corresponding rate of convergence results apply (see Yu and Bertsekas [YB06]). In particular, as $t \to \infty$, $\sqrt t\, r_t$ converges in distribution to a Gaussian distribution with mean zero and variance $\sigma^2/(1-\alpha)^2$, so that $E\{r_t^2\}$ converges to 0 at the rate $1/t$, i.e., there is a constant $C$ such that

$$t\,E\{r_t^2\} \le C, \qquad t = 0, 1, \ldots.$$

The second variant [Section 4.2, with time window $m = 1$; cf. Eq. (12)] takes the form

$$r_{t+1} = \frac{g_t}{t+1} + \frac{t+\alpha}{t+1}\,r_t. \tag{17}$$

The fixed point Kalman filter algorithm [cf. Eq. (14)] and the Tsitsiklis and Van Roy algorithm [cf. Eq. (15)] are identical, because the scaling matrix $B_{t+1}$ is the scalar 1 in this example. They take the form

$$r_{t+1} = r_t + \gamma_t\bigl(g_t + \alpha r_t - r_t\bigr).$$

For a stepsize $\gamma_t = 1/(t+1)$, they are identical to the second variant (17).

We claim that iteration (17) converges more slowly than iteration (16), and that $t\,E\{r_t^2\} \to \infty$. To this end, we write

$$E\{r_{t+1}^2\} = \Bigl(\frac{t+\alpha}{t+1}\Bigr)^2 E\{r_t^2\} + \frac{\sigma^2}{(t+1)^2}.$$

³ A more conventional but equivalent example can be obtained by introducing states $1, \ldots, n$, one for each possible value of the cost per stage, and transition probabilities $p_{ij} = 1/n$ for all $i, j = 1, \ldots, n$.

Let $\zeta_t = t\,E\{r_t^2\}$. Then

$$\zeta_{t+1} = \frac{(t+\alpha)^2}{t(t+1)}\,\zeta_t + \frac{\sigma^2}{t+1}.$$

From this equation, for $\alpha > 1/2$, we have

$$\zeta_{t+1} \ge \zeta_t + \frac{\sigma^2}{t+1},$$

so $\zeta_t$ tends to $\infty$.

Finally, the variant of Section 4.2 with time window $m > 1$ [cf. Eq. (13)], for $t \ge m$ takes the form

$$r_{t+1} = \frac{g_1 + \cdots + g_t}{t+1} + \alpha\,\frac{r_m + r_{m+1} + \cdots + r_{t-1} + m\,r_t}{t+1}, \qquad t \ge m. \tag{18}$$

For $t < m$, it takes the form

$$r_{t+1} = \frac{g_1 + \cdots + g_t}{t+1} + \alpha\,r_t, \qquad t < m.$$

We may write iteration (18) as

$$r_{t+1} = \frac{g_t}{t+1} + \frac{t+\alpha}{t+1}\,r_t + \alpha\,\frac{m-1}{t+1}\bigl(r_t - r_{t-1}\bigr), \qquad t \ge m,$$

and it can be shown, again similar to iteration (17), that $t\,E\{r_t^2\} \to \infty$. This suggests that the use of $m > 1$ may affect the practical convergence rate of the algorithm, but is unlikely to affect the theoretical convergence rate.
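A quick Monte Carlo sketch (our addition) of the qualitative conclusion of Example 1, with $\sigma^2 = 1$ and $\alpha = 0.9$ assumed: the scaled error $t\,E\{r_t^2\}$ for iteration (16) settles near $\sigma^2/(1-\alpha)^2 = 100$, while for iteration (17) it keeps growing.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, T, runs = 0.9, 20000, 2000
checkpoints = {1000, 5000, 20000}

csum = np.zeros(runs)                   # running sum g_1 + ... + g_t, one per sample path
r16 = np.zeros(runs)                    # iterates of iteration (16)
r17 = np.zeros(runs)                    # iterates of iteration (17)
for t in range(1, T + 1):
    g_t = rng.standard_normal(runs)     # zero-mean, unit-variance cost g_t
    csum += g_t
    r16 = csum / (t + 1) + alpha * r16                     # (16): r_{t+1} = (g_1+...+g_t)/(t+1) + alpha r_t
    r17 = g_t / (t + 1) + (t + alpha) / (t + 1) * r17      # (17): r_{t+1} = g_t/(t+1) + ((t+alpha)/(t+1)) r_t
    if t + 1 in checkpoints:            # after this step the arrays hold r_{t+1}
        print(f"t = {t+1:6d}:   t*E[r_t^2]   (16) ~ {(t+1)*np.mean(r16**2):8.2f}"
              f"   (17) ~ {(t+1)*np.mean(r17**2):8.2f}")
```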

5 Convergence Analysis for Some Variants

In this section, we prove the convergence of the second variant, iteration (13), with a window size $m$. To simplify notation, we define the function $h$ by

$$h(x, r) = \min\bigl\{c(x),\, \phi(x)' r\bigr\},$$

and we write iteration (13) equivalently as

$$r_{t+1} = \frac{1}{t+1}\,B_{t+1}^{-1}\sum_{k=0}^{t}\phi(x_k)\Bigl(g(x_k, x_{k+1}) + \alpha\,h\bigl(x_{k+1},\, r_{l_{k,t}}\bigr)\Bigr), \tag{19}$$

where $l_{k,t} = \min\{k+m, t\}$.

Proposition 3. Let $r_t$ be defined by Eq. (19). Then w.p.1, $r_t \to r^*$ as $t \to \infty$.

In the remainder of this section we provide two alternative proofs. The first proof is based on the o.d.e. (ordinary differential equation) techniques for analyzing stochastic approximation algorithms, and makes use of theorems by Borkar [Bor06] and Borkar and Meyn [BM00], which are also given in the yet unpublished book by Borkar [Bor07] (Chapters 2, 3, and 6). We have adapted the theorems in these sources for our purposes (the subsequent Prop. 4) with the assistance of V. Borkar. This proof is relatively short, but requires familiarity with the intricate methodology of the o.d.e. line of analysis. We have also provided a direct, somewhat longer proof, which does not rely on references to o.d.e.-related sources, although it is in the same spirit as the o.d.e.-based proof. In an earlier version of this report, we used the line of argument of the direct proof to show the result of Prop. 3 with an additional assumption that guaranteed boundedness of the iterates $r_t$. We are indebted to V. Borkar, who gave us the idea and several suggestions regarding the first proof. These suggestions in turn motivated our modification of the second proof to weaken our boundedness assumption.

5.1 A Proof of Proposition 3 Based on O.D.E. Methods

First, we notice that Eq. (19) implies the following relation,

$$\begin{aligned}
(t+1)\,B_{t+1}\,r_{t+1} &= t\,B_t\,r_t + \phi(x_t)\bigl(g(x_t, x_{t+1}) + \alpha\,h(x_{t+1}, r_t)\bigr) + \sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{l_{k,t}}) - h(x_{k+1}, r_{l_{k,t-1}})\bigr) \\
&= \bigl(t\,B_t + \phi(x_t)\phi(x_t)'\bigr)r_t + \phi(x_t)\bigl(g(x_t, x_{t+1}) + \alpha\,h(x_{t+1}, r_t) - \phi(x_t)' r_t\bigr) \\
&\qquad + \sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{l_{k,t}}) - h(x_{k+1}, r_{l_{k,t-1}})\bigr).
\end{aligned}$$

Thus iteration (19) is equivalent to

$$\begin{aligned}
r_{t+1} = r_t &+ \frac{1}{t+1}\,B_{t+1}^{-1}\phi(x_t)\bigl(g(x_t, x_{t+1}) + \alpha\,h(x_{t+1}, r_t) - \phi(x_t)' r_t\bigr) \\
&+ \frac{1}{t+1}\,B_{t+1}^{-1}\sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{l_{k,t}}) - h(x_{k+1}, r_{l_{k,t-1}})\bigr). \tag{20}
\end{aligned}$$

The idea is to reduce iteration (20) to the following form and study its convergence:

$$r_{t+1} = r_t + \frac{1}{t+1}\,B^{-1}\phi(x_t)\bigl(g(x_t, x_{t+1}) + \alpha\,h(x_{t+1}, r_t) - \phi(x_t)' r_t\bigr) + \frac{1}{t+1}\,\omega_t, \tag{21}$$

where $B = \lim_{t\to\infty} B_t$, and $\omega_t$ is a noise sequence. It is worth pointing out that the effect of window size $m > 1$ will be neglected in our convergence analysis, and this does not contradict our favoring $m > 1$ over $m = 1$, because in general the asymptotic convergence rate of the iterations with and without the noise term can differ from each other.

We need the following result from the o.d.e. analysis of stochastic approximation, which only requires a rather weak assumption on the noise term.

A General Convergence Result

Consider the iteration

$$r_{t+1} = r_t + \gamma_t\bigl(H(y_t, r_t) + \omega_t\bigr), \tag{22}$$

where $\gamma_t$ is the stepsize (deterministic or random); $\{y_t\}$ is the state sequence of a Markov process; $H(y, r)$ is a function of $(y, r)$; and $\omega_t$ is the noise sequence. Let the norm $\|\cdot\|$ of $\Re^s$ be any norm. We assume the following.

Assumption 1. The function $H(y, r)$ is Lipschitz continuous in $r$ for all $y$, with the same Lipschitz constant. The stepsize $\gamma_t$ satisfies, w.p.1, $\sum_{t=0}^\infty\gamma_t = \infty$, $\sum_{t=0}^\infty\gamma_t^2 < \infty$, and $\gamma_t \le \gamma_{t-1}$ for all $t$ sufficiently large. The noise $\omega_t$ satisfies

$$\|\omega_t\| \le \epsilon_t\bigl(1 + \|r_t\|\bigr), \qquad \text{w.p.1}, \tag{23}$$

where $\epsilon_t$ is a scalar sequence that converges to 0 w.p.1, as $t \to \infty$.

The convergence of iteration (22) under Assumption 1 can be analyzed based on the analysis in Borkar [Bor06] on averaging of Markov noise, and the stability analysis in Borkar and Meyn [BM00] through the scaling limit o.d.e. We assume that $y_t$ is the state of a finite-state Markov chain. (The analysis of [BM00, Bor06, Bor07] applies to a much more general setting.) For a function $f(y)$, we denote by $E_0\{f(Y)\}$ the

expectation over $Y$ with respect to the invariant distribution of the Markov chain $\{y_t\}$. For a positive scalar $b$, define

$$H_b(y, r) = \frac{H(y, b\,r)}{b}, \qquad H_\infty(y, r) = \lim_{b\to\infty}\frac{H(y, b\,r)}{b},$$

where the existence of the limit defining $H_\infty(y, r)$ for every $(y, r)$ will be part of our assumption. We refer to

$$\dot r = E_0\{H(Y, r)\} \tag{24}$$

as our basic o.d.e. We consider also the scaled version of our basic o.d.e.,

$$\dot r = E_0\{H_b(Y, r)\}, \tag{25}$$

and the scaling limit o.d.e.,

$$\dot r = E_0\{H_\infty(Y, r)\}. \tag{26}$$

Proposition 4. Suppose that $H_\infty$ exists, the origin is an asymptotically stable equilibrium point of the scaling limit o.d.e. (26), and the basic o.d.e. (24) has a unique globally asymptotically stable equilibrium point $r^*$. Suppose also that Assumption 1 holds. Then iteration (22) converges to $r^*$ w.p.1.

Proof. Assume first that $\sup_t\|r_t\| < \infty$ w.p.1. Then the noise sequence $\omega_t$ satisfies $\lim_{t\to\infty}\|\omega_t\| = 0$ w.p.1. Applying the averaging result in [Bor06] (Corollary 3.1 and its preceding remark on noise, p. 44), we have that $r_t$ converges to $r^*$, the unique stable equilibrium of the basic o.d.e. $\dot r = E_0\{H(Y, r)\}$.

To establish the boundedness of $r_t$, we use the scaled o.d.e. (25) and the limit o.d.e. (26), together with the averaging argument of [Bor06]. The proof proceeds in three steps.

(i) For any given positive number $T$, we define time intervals $[t, k_t]$, where

$$k_t = \min\Bigl\{k \;\Big|\; k > t,\; \sum_{j=t}^{k}\gamma_j \ge T\Bigr\}.$$

For every such interval, we consider the scaled sequence

$$\hat r^t_j = \frac{r_j}{b_t}, \qquad j \in [t, k_t], \qquad \text{where } b_t = \max\bigl\{\|r_t\|, 1\bigr\}.$$

Then for $j \in [t, k_t)$,

$$\hat r^t_{j+1} = \hat r^t_j + \gamma_j\bigl(H_{b_t}(y_j, \hat r^t_j) + \omega^t_j\bigr),$$

where $\omega^t_j = \omega_j/b_t$ is the scaled noise and satisfies $\|\omega^t_j\| \le \epsilon_j\bigl(1 + \|\hat r^t_j\|\bigr)$. Using the Lipschitz continuity of $H(y, \cdot)$ and the discrete Gronwall inequality (Lemma 4.3 of [BM00]), we have that $\hat r^t_j$ is bounded on $[t, k_t]$ with the bound independent of $t$. Also, as a consequence, the noise satisfies $\|\omega^t_j\| \le \epsilon_j\,C_T$ with $C_T$ being some constant independent of $t$. These allow us to apply again the averaging analysis in [Bor06] to $\hat r^t_j$, $j \in [t, k_t]$, and obtain our convergence result, as we show next.

(ii) Let $x^t(u)$ be the solution of the scaled o.d.e. $\dot r = E_0\{H_{b_t}(Y, r)\}$ at time $u$ with initial condition $x^t(0) = \hat r^t_t$. Note that $x^t(\cdot)$ is a curve on $\Re^s$ whose argument is the natural continuous time. By applying the analysis in [Bor06] (the proof of Lemma 2.2, Lemma 2.3 itself, and the proof of Lemma 3.1), we have in particular that

$$\lim_{t\to\infty}\bigl\|\hat r^t_{k_t} - x^t(u_t)\bigr\| = 0, \qquad \text{w.p.1}, \tag{27}$$

where $u_t = \sum_{j=t}^{k_t-1}\gamma_j$ and $u_t \in [T, T+1)$ for $t$ sufficiently large.

(iii) Notice that $x^t(0) = \hat r^t_t$ is within the unit ball. Hence, by Lemma 4.4 of [BM00], for suitably chosen $T$, $x^t(u_t)$ would be close to the origin 0, had $b_t$ been sufficiently large, because the origin 0 is the globally asymptotically stable equilibrium point of the scaling limit o.d.e. (26). Thus, using Eq. (27) and applying the analysis of [BM00] (Theorem 2.1(i) and its proof in Section 4.1, together with Lemma 4.6), we can establish that $r_t$ is bounded w.p.1.

Expressing the Second Variant as Iteration (22)

We will show that our iteration (19), equivalently (20), can be written in the form (21) with the noise $\omega_t$ satisfying Eq. (23). Iteration (21) is a special case of iteration (22) with the following identifications. Let $y_t$ be the pair of states $(x_t, x_{t+1})$. For $y = (y_1, y_2)$, let $H(y, r)$ be the mapping from $\Re^s$ to $\Re^s$ defined by

$$H(y, r) = B^{-1}\phi(y_1)\bigl(g(y_1, y_2) + \alpha\,h(y_2, r) - \phi(y_1)' r\bigr).$$

Clearly, $H(y, r)$ is Lipschitz continuous in $r$, uniformly for all $y$. Let the stepsize $\gamma_t$ be $\frac{1}{t+1}$. We consider the associated o.d.e. and verify that they satisfy the corresponding assumptions of Prop. 4. For a function $f(y)$, $E_0\{f(Y)\} = \sum_{i,j}\pi(i)\,p_{ij}\,f(i,j)$.

Consider our basic o.d.e. $\dot r = E_0\{H(Y, r)\}$ and the mapping associated with its r.h.s.,

$$\Phi\,E_0\{H(Y, r)\} = \Pi F(\Phi r) - \Phi r.$$

Since $\Pi F(\Phi r)$ is a contraction mapping, its fixed point $r^*$ is the globally asymptotically stable equilibrium point of the basic o.d.e.

Consider the scaled o.d.e. $\dot r = E_0\{H_b(Y, r)\}$, and the scaling limit o.d.e., $\dot r = E_0\{H_\infty(Y, r)\}$. It is easy to see that $H_\infty$ exists, and

$$\Phi\,E_0\{H_b(Y, r)\} = \Pi\bigl(\bar g/b + \alpha P\min\{c/b, \Phi r\}\bigr) - \Phi r, \qquad \Phi\,E_0\{H_\infty(Y, r)\} = \alpha\,\Pi P\min\{0, \Phi r\} - \Phi r,$$

where $\bar g$ denotes the vector of per-stage costs whose components are defined by $\bar g(i) = \sum_j p_{ij}\,g(i,j)$. Since, by Lemma 1, the mappings $\Pi\bigl(\bar g/b + \alpha P\min\{c/b, \Phi r\}\bigr)$ and $\alpha\,\Pi P\min\{0, \Phi r\}$ are contractions with modulus $\alpha$, the scaled o.d.e. (25) and the scaling limit o.d.e. (26) have globally asymptotically stable equilibrium points, which we denote by $r_b$ and $r_\infty$, respectively. We have, in particular, $r_b = r^*$ for $b = 1$; $r_\infty = 0$, the origin of $\Re^s$; and $r_b$ converges to $r_\infty$ as $b \to \infty$.

We now show that the noise term $\omega_t$ satisfies Eq. (23) when we reduce iteration (20), i.e.,

$$\begin{aligned}
r_{t+1} = r_t &+ \frac{1}{t+1}\,B_{t+1}^{-1}\phi(x_t)\bigl(g(x_t, x_{t+1}) + \alpha\,h(x_{t+1}, r_t) - \phi(x_t)' r_t\bigr) \\
&+ \frac{1}{t+1}\,B_{t+1}^{-1}\sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{l_{k,t}}) - h(x_{k+1}, r_{l_{k,t-1}})\bigr),
\end{aligned}$$

to iteration (21), i.e.,

$$r_{t+1} = r_t + \frac{1}{t+1}\,B^{-1}\phi(x_t)\bigl(g(x_t, x_{t+1}) + \alpha\,h(x_{t+1}, r_t) - \phi(x_t)' r_t\bigr) + \frac{1}{t+1}\,\omega_t.$$

To simplify the notation, in the proofs and discussions we will use $o(1)$ to denote a scalar sequence that converges to 0 w.p.1; we can write Eq. (23), for instance, as $\|\omega_t\| \le o(1)\bigl(1 + \|r_t\|\bigr)$ for short. Since $B_{t+1}^{-1} \to B^{-1}$ w.p.1, as $t \to \infty$, replacing $B_{t+1}^{-1}$ by $B^{-1}$ in the second term of the r.h.s. of Eq. (20) will only introduce a noise term of magnitude $o(1)\bigl(1 + \|r_t\|\bigr)$. We aim to show that the third term of the r.h.s. of Eq. (20) can also be treated as a noise term of magnitude $o(1)\bigl(1 + \|r_t\|\bigr)$.

Since $r_{l_{k,t}}$ and $r_{l_{k,t-1}}$ differ only for the last $m$ values of $k$, for which $r_{l_{k,t}} = r_t$ and $r_{l_{k,t-1}} = r_{t-1}$, it can be seen that for all $t$ sufficiently large,

$$\Bigl\|\frac{1}{t+1}\,B_{t+1}^{-1}\sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{l_{k,t}}) - h(x_{k+1}, r_{l_{k,t-1}})\bigr)\Bigr\| \le \frac{C}{t+1}\,\|r_t - r_{t-1}\| \tag{28}$$

for some constant $C$. Therefore it is sufficient to show that $\|r_t - r_{t-1}\| = o(1)\bigl(1 + \|r_t\|\bigr)$. This is indicated in the following.

Lemma 4. For all $t$ sufficiently large and some path-dependent constant $C$,

$$\|r_{t+1} - r_t\| \le \frac{C}{t+1}\bigl(1 + \|r_t\|\bigr), \qquad \|r_{t+1} - r_t\| \le \frac{C}{t+1}\bigl(1 + \|r_{t+1}\|\bigr), \qquad \text{w.p.1.}$$

Proof. From Eqs. (20) and (28), it can be seen that for all $t \ge t_0$,

$$\|r_{t+1} - r_t\| \le \frac{C_1\|r_t - r_{t-1}\| + C_2\bigl(1 + \|r_t\|\bigr)}{t+1} \tag{29}$$

for some sufficiently large $t_0$ and suitably chosen positive constants $C_1$ and $C_2$. Define $K_1 = C_1 + C_2$ and $a(t, K_1) = 1 - \frac{K_1}{t+1}$. Choose $t_1 \ge t_0$ such that for all $t \ge t_1$, $a(t, K_1)$ is positive and

$$\frac{1}{a(t, K_1)}\,\frac{K_1}{t+1} \le 1, \qquad \frac{1}{a(t, K_1)}\,\frac{C_1}{t+1} \le \frac{1}{2}. \tag{30}$$

This is possible, since $a(t, K_1)$ tends to 1 and the expressions on the l.h.s. decrease to 0 as $t$ increases. For $t = t_1$, Eq. (29) can be written as, using $\|r_{t_1} - r_{t_1-1}\| \le \|r_{t_1}\| + \|r_{t_1-1}\|$,

$$\|r_{t_1+1} - r_{t_1}\| \le \frac{K_1\|r_{t_1}\| + C_1\|r_{t_1-1}\| + C_2}{t_1+1} \le \frac{K_1\|r_{t_1}\| + K_2}{t_1+1}, \tag{31}$$

where $K_2$ is defined, together with some scalar $\zeta \ge C_2$, by $K_2 = C_1\|r_{t_1-1}\| + C_2 + \zeta$, so that in particular $K_2 \ge 2C_2$. Note that in the following argument, $K_1$ and $K_2$ are fixed scalars (they do not depend on $t$). We will show by induction that Eq. (31) holds with $t_1$ replaced by $t$ for all $t \ge t_1$. This will imply the first relation in the lemma with $C = \max\{K_1, K_2\}$.

It is sufficient to verify that if Eq. (31) holds with $t_1$ replaced by some $t \ge t_1$, then it also holds with $t_1$ replaced by $t+1$, which we now set out to do. From the induction hypothesis, using $\|r_t\| \le \|r_t - r_{t+1}\| + \|r_{t+1}\|$, we have

$$\|r_{t+1} - r_t\| \le \frac{K_1}{t+1}\bigl(\|r_t - r_{t+1}\| + \|r_{t+1}\|\bigr) + \frac{K_2}{t+1}.$$

Subtracting $\frac{K_1}{t+1}\|r_t - r_{t+1}\|$ from both sides and dividing by $a(t, K_1)$, we obtain, using the definition of $a(t, K_1)$,

$$\|r_{t+1} - r_t\| \le \frac{1}{a(t, K_1)}\,\frac{K_1}{t+1}\,\|r_{t+1}\| + \frac{1}{a(t, K_1)}\,\frac{K_2}{t+1}. \tag{32}$$

Substituting into the r.h.s. of (29) with $t+1$ in place of $t$, we have

$$\begin{aligned}
\|r_{t+2} - r_{t+1}\| &\le \frac{C_1\|r_{t+1} - r_t\| + C_2\bigl(1 + \|r_{t+1}\|\bigr)}{t+2} \\
&\le \frac{\Bigl(\frac{C_1}{a(t,K_1)}\frac{K_1}{t+1} + C_2\Bigr)\|r_{t+1}\| + \frac{C_1}{a(t,K_1)}\frac{K_2}{t+1} + C_2}{t+2} \\
&\le \frac{K_1\|r_{t+1}\| + K_2}{t+2},
\end{aligned}$$

where the last inequality follows from the conditions (30) and the inequality $K_2 \ge 2C_2$: by (30), the coefficient of $\|r_{t+1}\|$ is at most $C_1 + C_2 = K_1$, while $\frac{C_1}{a(t,K_1)}\frac{K_2}{t+1} + C_2 \le \frac{K_2}{2} + \frac{K_2}{2} = K_2$, by our definitions of $K_2$ and $\zeta$. This completes our induction, showing the first claim.

The second relation of the lemma follows from the same induction: we have that for all $t \ge t_1$, Eq. (32) holds, i.e.,

$$\|r_{t+1} - r_t\| \le \frac{1}{a(t, K_1)}\,\frac{K_1}{t+1}\,\|r_{t+1}\| + \frac{1}{a(t, K_1)}\,\frac{K_2}{t+1}, \qquad t \ge t_1;$$

and since $a(t, K_1) \to 1$ as $t \to \infty$, the claim follows.

This completes the proof of Prop. 3.

5.2 An Alternative Direct Proof of Proposition 3

The method of proof uses a device that is typical of the o.d.e. approach: we define a sequence of times $k_j$, where the length of the interval $[k_j, k_{j+1}]$ is increasing with $j$ at the rate of a geometric progression. We then analyze the progress of the algorithm using a different clock, under which the interval $[k_j, k_{j+1}]$ is treated as a unit time that is long enough for the contraction property of the algorithm to manifest itself and to allow a convergence proof.

In particular, we will argue, roughly speaking, that $r_t$ changes slowly, so that during a considerable amount of time after $t$, namely $[t, t+\delta t]$ for some small scalar $\delta$, the terms $h\bigl(x_{k+1}, r_{l_{k,t}}\bigr)$ are close to $h(x_{k+1}, r_t)$. This will allow us to view $\Phi r_{t+\delta t}$ as the result of a contracting fixed point iteration applied on $\Phi r_t$, plus stochastic noise. Based on this interpretation, we will first show that the subsequence $r_k$ at times $k = (1+\delta)t, (1+\delta)^2 t, \ldots$ comes close to $r^*$ within a distance that is a decreasing function of $\delta$, and we will then show the convergence of the entire sequence.

To be properly viewed as time indices, the real numbers $\delta t$ and $(1+\delta)t$ need to be rounded to the closest integers, but to be concise we will skip this step and treat them as if they were integers. (The error introduced can be absorbed in the $o(1)$ factor.) Also, in various estimates we will simply denote by $\|\cdot\|$ the Euclidean norm $\|\cdot\|_\pi$ on $\Re^n$, except where noted. For convenience, we define the norm $\|\cdot\|$ on the space of $r$ by $\|r\| = \|\Phi r\|_\pi$. Thus $\|r\| = \|\Phi r\|$ in our notation.

An Ergodicity Property

In our convergence proof, we will argue that for a sample trajectory, the empirical state frequencies on the segments $[t, t+\delta t)$ approach the steady-state probabilities, as $t$ increases to $\infty$. This is fairly evident, but for completeness we include the proof.

Lemma 5. Let $\delta$ be a positive number, let $i$ and $j$ be two states such that $p_{ij} > 0$, and let $n_{\delta,t}(i,j)$ be the empirical frequency of a transition $(i,j)$ in the time interval $[t, t+\delta t)$. For any $t > 0$, let $t_k = (1+\delta)^k\,t$, $k = 0, 1, \ldots$. Then

$$\lim_{k\to\infty} n_{\delta,t_k}(i,j) = \pi(i)\,p_{ij}, \qquad \text{w.p.1.}$$

Proof. Fix the transition $(i,j)$. By renewal theory and the central limit theorem, for some constant $C_1$,

$$E\Bigl\{\bigl(n_{\delta,t_k}(i,j) - \pi(i)p_{ij}\bigr)^2\Bigr\} \le \frac{C_1}{\delta\,t_k}.$$

By Chebyshev's inequality,

$$P\Bigl\{\bigl|n_{\delta,t_k}(i,j) - \pi(i)p_{ij}\bigr| \ge \epsilon_k\Bigr\} \le \frac{C_1}{\epsilon_k^2\,\delta\,t_k}.$$

Let $\epsilon_k = t_k^{-\beta/2}$ for some $\beta \in (0,1)$. Thus

$$\lim_{k\to\infty}\epsilon_k = 0, \qquad \epsilon_k^2\,t_k = t_k^{1-\beta}.$$

Define the events $A_k$ by

$$A_k = \Bigl\{\bigl|n_{\delta,t_k}(i,j) - \pi(i)p_{ij}\bigr| \ge \epsilon_k\Bigr\}.$$

Then for some constants $C_2, C_3$,

$$\sum_{k=1}^\infty P\{A_k\} \le C_2\sum_{k=1}^\infty\frac{1}{t_k^{1-\beta}} = C_3\sum_{k=1}^\infty\frac{1}{(1+\delta)^{k(1-\beta)}} < \infty,$$

so by the Borel-Cantelli Lemma, w.p.1, only finitely many events $A_k$ occur, which proves our claim.

Note that although Lemma 5 is stated for one fixed $\delta$ and one fixed $t$, the conclusion holds for a countable number of $\delta$ and $t$ simultaneously. Thus it is valid to choose in the following proof any $t$, and any $\delta$ arbitrarily small, say, from the sequence $\{2^{-j}\}$. For conciseness, we will not mention this explicitly again.

Estimates Within Trajectory Segments of the Form $[t, t+\delta t]$

Consider a sample trajectory from a set of probability 1 for which the assertions in Lemmas 4 and 5 (for a countable number of $\delta$) hold. We can thus omit the statement "w.p.1" in the following analysis. Fix $\delta$, which can be an arbitrarily small positive scalar.

Lemma 6. For $t$ sufficiently large and $k \in [t, t+\delta t]$,

$$\bigl|\phi(i)' r_k - \phi(i)' r_t\bigr| \le \theta_\delta\bigl(1 + \|r_t\|\bigr), \qquad i = 1,\ldots,n, \tag{33}$$

where $\theta_\delta$ is a scalar independent of $t$, and $\theta_\delta \to 0$ as $\delta \to 0$.

Proof. For $m \ge 1$, using $r_{t+m} = r_t + \sum_{j=1}^m\bigl(r_{t+j} - r_{t+j-1}\bigr)$ and using Lemma 4 to bound $\|r_{t+j} - r_{t+j-1}\|$, we can bound $\|r_{t+m} - r_t\|$ for $t$ sufficiently large by

$$\begin{aligned}
\|r_{t+m} - r_t\| &\le \sum_{j=1}^m\frac{C}{t+j}\bigl(1 + \|r_{t+j}\|\bigr) \\
&\le \sum_{j=1}^m\frac{C}{t+j}\bigl(1 + \|r_t\| + \|r_{t+j} - r_t\|\bigr) \\
&\le \sum_{j=1}^m\frac{C}{t+j}\bigl(1 + \|r_t\|\bigr) + \sum_{j=1}^m\frac{C}{t+j}\,\|r_{t+j} - r_t\|
\end{aligned}$$

for some constant $C$, where the second inequality follows from the triangle inequality. By using the version of the discrete Gronwall inequality given as Lemma 4.3(i) of [BM00], we have for $m \le \delta t$,

$$\|r_{t+m} - r_t\| \le \Bigl(\sum_{j=1}^m\frac{C}{t+j}\Bigr)\bigl(1 + \|r_t\|\bigr)\,e^{\sum_{j=1}^m\frac{C}{t+j}} \le \delta C e^{\delta C}\bigl(1 + \|r_t\|\bigr).$$

By the equivalence of norms, there exists $C' > 0$ such that for all $r$, $\|\Phi r\|_\infty \le C'\|\Phi r\|_\pi = C'\|r\|$. The claim then follows by choosing $\theta_\delta = C'\,\delta C e^{\delta C}$, which converges to 0 as $\delta \to 0$.

Since $l_{k,t+\delta t} \in [t, t+\delta t)$ for $k \in [t, t+\delta t)$, inequality (33) implies that for all $k \in [t, t+\delta t)$,

$$\bigl|h\bigl(x_{k+1}, r_{l_{k,t+\delta t}}\bigr) - h(x_{k+1}, r_t)\bigr| \le \theta_\delta\bigl(1 + \|r_t\|\bigr). \tag{34}$$

Using this, we now write $\Phi r_{t+\delta t}$ in terms of $\Phi r_t$, $\Pi F(\Phi r_t)$, and residual terms.

Lemma 7. For $t$ sufficiently large,

$$\Phi r_{t+\delta t} = \frac{1}{1+\delta}\,\Phi r_t + \frac{\delta}{1+\delta}\,\Pi F(\Phi r_t) + e_{\delta,t}, \tag{35}$$

where $e_{\delta,t}$ satisfies

$$\|e_{\delta,t}\| \le \Bigl(\frac{\delta\,\theta_\delta}{1+\delta} + \epsilon_t\Bigr)\bigl(1 + \|\Phi r_t\|\bigr),$$

$\theta_\delta$ is a scalar independent of $t$ such that $\theta_\delta \to 0$ as $\delta \to 0$, and $\epsilon_t$ is a scalar sequence that converges to 0 as $t \to \infty$.

Proof. Assume $t$ is sufficiently large so that $\delta t > m$. We have by definition

$$\Phi r_{t+\delta t} = \frac{1}{t+\delta t}\,\Phi B_{t+\delta t}^{-1}\sum_{k=0}^{t-1}\phi(x_k)\bigl(g(x_k, x_{k+1}) + \alpha\,h(x_{k+1}, r_{k+m})\bigr) + \frac{1}{t+\delta t}\,\Phi B_{t+\delta t}^{-1}\sum_{k=t}^{t+\delta t-1}\phi(x_k)\bigl(g(x_k, x_{k+1}) + \alpha\,h\bigl(x_{k+1}, r_{l_{k,t+\delta t}}\bigr)\bigr). \tag{36}$$

We approximate separately the two terms on the r.h.s. Using an expression of $r_t$ similar to Eq. (19), we write the first term as

$$\frac{1}{t+\delta t}\,\Phi B_{t+\delta t}^{-1}\Bigl(t\,B_t\,r_t + \sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{k+m}) - h\bigl(x_{k+1}, r_{l_{k,t-1}}\bigr)\bigr)\Bigr) = \frac{1}{1+\delta}\,\Phi r_t + \epsilon^1_{\delta,t}, \tag{37}$$

where $\epsilon^1_{\delta,t}$ accounts for replacing $r_{l_{k,t-1}}$ with $r_{l_{k,t+\delta t}} = r_{k+m}$:

$$\epsilon^1_{\delta,t} = \frac{1}{1+\delta}\,\Phi\bigl(B_{t+\delta t}^{-1}B_t - I\bigr)r_t + \frac{1}{t+\delta t}\,\Phi B_{t+\delta t}^{-1}\sum_{k=0}^{t-1}\alpha\,\phi(x_k)\bigl(h(x_{k+1}, r_{k+m}) - h\bigl(x_{k+1}, r_{l_{k,t-1}}\bigr)\bigr).$$

As $t \to \infty$, the differences $r_{k+m} - r_t$, $k \in [t-m+1, t-1]$, are bounded by a multiple of $1 + \|r_t\|$ that diminishes to 0 (by Lemma 4 and the proof of Lemma 6); also, $B_{t+\delta t}^{-1}B_t \to I$ as $t \to \infty$, by Lemma 5. Hence, we have

$$\epsilon^1_{\delta,t} = o(1)\bigl(1 + \|r_t\|\bigr). \tag{38}$$

We write the second term of Eq. (36) as

$$\frac{\delta}{1+\delta}\,\Phi\,B_{t+\delta t}^{-1}B^\delta_t\,\bigl(B^\delta_t\bigr)^{-1}\,\frac{1}{\delta t}\sum_{k=t}^{t+\delta t-1}\phi(x_k)\bigl(g(x_k, x_{k+1}) + \alpha\,h(x_{k+1}, r_t)\bigr) + \epsilon^2_{\delta,t}, \tag{39}$$

where

$$B^\delta_t = \frac{1}{\delta t}\sum_{k=t}^{t+\delta t-1}\phi(x_k)\phi(x_k)',$$

and $\epsilon^2_{\delta,t}$ accounts for the residual in substituting $r_t$ for $r_{l_{k,t+\delta t}}$ in $h\bigl(x_{k+1}, r_{l_{k,t+\delta t}}\bigr)$ for $k \in [t, t+\delta t)$. Thus $\epsilon^2_{\delta,t}$ satisfies, by Eq. (34),

$$\|\epsilon^2_{\delta,t}\| \le \frac{\delta}{1+\delta}\,\theta_\delta'\bigl(1 + \|r_t\|\bigr), \tag{40}$$

for some positive scalar $\theta_\delta'$, independent of $t$, which converges to 0 as $\delta \to 0$.

We further approximate the first term of Eq. (39) as follows. Using Lemma 5, we have that as $t \to \infty$, within the segment $[t, t+\delta t)$, the term

$$\Phi\,\bigl(B^\delta_t\bigr)^{-1}\,\frac{1}{\delta t}\sum_{k=t}^{t+\delta t-1}\phi(x_k)\bigl(g(x_k, x_{k+1}) + \alpha\,h(x_{k+1}, r_t)\bigr)$$

converges to $\Pi F(\Phi r_t)$. Using Lemma 5, we also have that $B_{t+\delta t}^{-1}B^\delta_t \to I$ as $t \to \infty$. Thus we can write the first term of Eq. (39) as

$$\frac{\delta}{1+\delta}\,\Pi F(\Phi r_t) + \epsilon^3_{\delta,t}, \tag{41}$$

where $\epsilon^3_{\delta,t}$ accounts for the residual in this approximation, and

$$\epsilon^3_{\delta,t} = o(1)\bigl(1 + \|r_t\|\bigr). \tag{42}$$

Putting Eqs. (37)-(42) together, we can write $\Phi r_{t+\delta t}$ as

$$\Phi r_{t+\delta t} = \frac{1}{1+\delta}\,\Phi r_t + \frac{\delta}{1+\delta}\,\Pi F(\Phi r_t) + \epsilon^1_{\delta,t} + \epsilon^2_{\delta,t} + \epsilon^3_{\delta,t},$$

where

$$\bigl\|\epsilon^1_{\delta,t} + \epsilon^2_{\delta,t} + \epsilon^3_{\delta,t}\bigr\| \le \Bigl(\frac{\delta\,\theta_\delta'}{1+\delta} + o(1)\Bigr)\bigl(1 + \|\Phi r_t\|\bigr). \tag{43}$$

The claim thus follows.

A Contraction Argument

Fix a scalar $\bar\alpha \in (\alpha, 1)$. Let $\theta_\delta$ and $\epsilon_t$ be as in the preceding lemma. Let $\delta$ be such that $\alpha + \theta_\delta < \bar\alpha$, and choose $\beta_\delta$ such that

$$\frac{1 + (\alpha + \theta_\delta)\,\delta}{1+\delta} < \beta_\delta < \frac{1 + \bar\alpha\,\delta}{1+\delta}.$$

Using the contraction property $\|\Pi F(\Phi r_t) - \Phi r^*\| \le \alpha\,\|\Phi r_t - \Phi r^*\|$, the preceding lemma, and the triangle inequality $\|\Phi r_t\| \le \|\Phi r_t - \Phi r^*\| + \|\Phi r^*\|$, we have

$$\begin{aligned}
\|\Phi r_{t+\delta t} - \Phi r^*\| &\le \frac{1 + \alpha\delta}{1+\delta}\,\|\Phi r_t - \Phi r^*\| + \Bigl(\frac{\delta\,\theta_\delta}{1+\delta} + \epsilon_t\Bigr)\bigl(1 + \|\Phi r_t\|\bigr) \\
&\le \Bigl(\frac{1 + (\alpha + \theta_\delta)\,\delta}{1+\delta} + \epsilon_t\Bigr)\|\Phi r_t - \Phi r^*\| + \Bigl(\frac{\delta\,\theta_\delta}{1+\delta} + \epsilon_t\Bigr)\bigl(1 + \|\Phi r^*\|\bigr). \tag{44}
\end{aligned}$$

For any arbitrarily small positive scalar $\epsilon$, we have $\epsilon_t\bigl(1 + \|\Phi r^*\|\bigr) < \epsilon$ and $\frac{1 + (\alpha + \theta_\delta)\delta}{1+\delta} + \epsilon_t < \beta_\delta$ for $t$ sufficiently large (since $\epsilon_t = o(1)$). Hence, inequality (44) implies that for any $\epsilon$, and for some fixed $\bar t$ independent of $\epsilon$, the sequence $\bigl\{\Phi r_{k_j} \mid k_j = (1+\delta)^j\,\bar t\bigr\}$ satisfies

$$\limsup_{j\to\infty}\,\|\Phi r_{k_j} - \Phi r^*\| \le \frac{\epsilon}{1-\beta_\delta} + \frac{\delta\,\theta_\delta}{(1+\delta)(1-\beta_\delta)}\bigl(1 + \|\Phi r^*\|\bigr). \tag{45}$$


More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course Value Iteration: the Idea 1. Let V 0 be any vector in R N A. LAZARIC Reinforcement

More information

DIFFERENTIAL TRAINING OF 1 ROLLOUT POLICIES

DIFFERENTIAL TRAINING OF 1 ROLLOUT POLICIES Appears in Proc. of the 35th Allerton Conference on Communication, Control, and Computing, Allerton Park, Ill., October 1997 DIFFERENTIAL TRAINING OF 1 ROLLOUT POLICIES by Dimitri P. Bertsekas 2 Abstract

More information

Quasi Stochastic Approximation American Control Conference San Francisco, June 2011

Quasi Stochastic Approximation American Control Conference San Francisco, June 2011 Quasi Stochastic Approximation American Control Conference San Francisco, June 2011 Sean P. Meyn Joint work with Darshan Shirodkar and Prashant Mehta Coordinated Science Laboratory and the Department of

More information

Lecture notes for Analysis of Algorithms : Markov decision processes

Lecture notes for Analysis of Algorithms : Markov decision processes Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Preconditioned Temporal Difference Learning

Preconditioned Temporal Difference Learning Hengshuai Yao Zhi-Qiang Liu School of Creative Media, City University of Hong Kong, Hong Kong, China hengshuai@gmail.com zq.liu@cityu.edu.hk Abstract This paper extends many of the recent popular policy

More information

Optimistic Policy Iteration and Q-learning in Dynamic Programming

Optimistic Policy Iteration and Q-learning in Dynamic Programming Optimistic Policy Iteration and Q-learning in Dynamic Programming Dimitri P. Bertsekas Laboratory for Information and Decision Systems Massachusetts Institute of Technology November 2010 INFORMS, Austin,

More information

Markov Decision Processes and Dynamic Programming

Markov Decision Processes and Dynamic Programming Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course In This Lecture A. LAZARIC Markov Decision Processes

More information

Incremental Constraint Projection-Proximal Methods for Nonsmooth Convex Optimization

Incremental Constraint Projection-Proximal Methods for Nonsmooth Convex Optimization Massachusetts Institute of Technology, Cambridge, MA Laboratory for Information and Decision Systems Report LIDS-P-907, July 013 Incremental Constraint Projection-Proximal Methods for Nonsmooth Convex

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

On Average Versus Discounted Reward Temporal-Difference Learning

On Average Versus Discounted Reward Temporal-Difference Learning Machine Learning, 49, 179 191, 2002 c 2002 Kluwer Academic Publishers. Manufactured in The Netherlands. On Average Versus Discounted Reward Temporal-Difference Learning JOHN N. TSITSIKLIS Laboratory for

More information

HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?

HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM? HOW O CHOOSE HE SAE RELEVANCE WEIGH OF HE APPROXIMAE LINEAR PROGRAM? YANN LE ALLEC AND HEOPHANE WEBER Abstract. he linear programming approach to approximate dynamic programming was introduced in [1].

More information

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs 2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter

More information

Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms

Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms Vladislav B. Tadić Abstract The almost sure convergence of two time-scale stochastic approximation algorithms is analyzed under

More information

Minimal Residual Approaches for Policy Evaluation in Large Sparse Markov Chains

Minimal Residual Approaches for Policy Evaluation in Large Sparse Markov Chains Minimal Residual Approaches for Policy Evaluation in Large Sparse Markov Chains Yao Hengshuai Liu Zhiqiang School of Creative Media, City University of Hong Kong at Chee Avenue, Kowloon, Hong Kong, S.A.R.,

More information

arxiv: v1 [math.oc] 9 Oct 2018

arxiv: v1 [math.oc] 9 Oct 2018 A Convex Optimization Approach to Dynamic Programming in Continuous State and Action Spaces Insoon Yang arxiv:1810.03847v1 [math.oc] 9 Oct 2018 Abstract A convex optimization-based method is proposed to

More information

Simulation-Based Optimization of Markov Reward Processes

Simulation-Based Optimization of Markov Reward Processes IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 46, NO. 2, FEBRUARY 2001 191 Simulation-Based Optimization of Markov Reward Processes Peter Marbach John N. Tsitsiklis, Fellow, IEEE Abstract This paper proposes

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

New Error Bounds for Approximations from Projected Linear Equations

New Error Bounds for Approximations from Projected Linear Equations Joint Technical Report of U.H. and M.I.T. Technical Report C-2008-43 Dept. Computer Science University of Helsinki July 2008; revised July 2009 and LIDS Report 2797 Dept. EECS M.I.T. New Error Bounds for

More information

ilstd: Eligibility Traces and Convergence Analysis

ilstd: Eligibility Traces and Convergence Analysis ilstd: Eligibility Traces and Convergence Analysis Alborz Geramifard Michael Bowling Martin Zinkevich Richard S. Sutton Department of Computing Science University of Alberta Edmonton, Alberta {alborz,bowling,maz,sutton}@cs.ualberta.ca

More information

Abstract Dynamic Programming

Abstract Dynamic Programming Abstract Dynamic Programming Dimitri P. Bertsekas Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Overview of the Research Monograph Abstract Dynamic Programming"

More information

arxiv: v1 [cs.lg] 23 Oct 2017

arxiv: v1 [cs.lg] 23 Oct 2017 Accelerated Reinforcement Learning K. Lakshmanan Department of Computer Science and Engineering Indian Institute of Technology (BHU), Varanasi, India Email: lakshmanank.cse@itbhu.ac.in arxiv:1710.08070v1

More information

A System Theoretic Perspective of Learning and Optimization

A System Theoretic Perspective of Learning and Optimization A System Theoretic Perspective of Learning and Optimization Xi-Ren Cao* Hong Kong University of Science and Technology Clear Water Bay, Kowloon, Hong Kong eecao@ee.ust.hk Abstract Learning and optimization

More information

INVERSE FUNCTION THEOREM and SURFACES IN R n

INVERSE FUNCTION THEOREM and SURFACES IN R n INVERSE FUNCTION THEOREM and SURFACES IN R n Let f C k (U; R n ), with U R n open. Assume df(a) GL(R n ), where a U. The Inverse Function Theorem says there is an open neighborhood V U of a in R n so that

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning) A.

More information

THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS

THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS THE INVERSE FUNCTION THEOREM FOR LIPSCHITZ MAPS RALPH HOWARD DEPARTMENT OF MATHEMATICS UNIVERSITY OF SOUTH CAROLINA COLUMBIA, S.C. 29208, USA HOWARD@MATH.SC.EDU Abstract. This is an edited version of a

More information

Stability theory is a fundamental topic in mathematics and engineering, that include every

Stability theory is a fundamental topic in mathematics and engineering, that include every Stability Theory Stability theory is a fundamental topic in mathematics and engineering, that include every branches of control theory. For a control system, the least requirement is that the system is

More information

1 The Observability Canonical Form

1 The Observability Canonical Form NONLINEAR OBSERVERS AND SEPARATION PRINCIPLE 1 The Observability Canonical Form In this Chapter we discuss the design of observers for nonlinear systems modelled by equations of the form ẋ = f(x, u) (1)

More information

I forgot to mention last time: in the Ito formula for two standard processes, putting

I forgot to mention last time: in the Ito formula for two standard processes, putting I forgot to mention last time: in the Ito formula for two standard processes, putting dx t = a t dt + b t db t dy t = α t dt + β t db t, and taking f(x, y = xy, one has f x = y, f y = x, and f xx = f yy

More information

Least-Squares λ Policy Iteration: Bias-Variance Trade-off in Control Problems

Least-Squares λ Policy Iteration: Bias-Variance Trade-off in Control Problems : Bias-Variance Trade-off in Control Problems Christophe Thiery Bruno Scherrer LORIA - INRIA Lorraine - Campus Scientifique - BP 239 5456 Vandœuvre-lès-Nancy CEDEX - FRANCE thierych@loria.fr scherrer@loria.fr

More information

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS 63 2.1 Introduction In this chapter we describe the analytical tools used in this thesis. They are Markov Decision Processes(MDP), Markov Renewal process

More information

Linear Least-squares Dyna-style Planning

Linear Least-squares Dyna-style Planning Linear Least-squares Dyna-style Planning Hengshuai Yao Department of Computing Science University of Alberta Edmonton, AB, Canada T6G2E8 hengshua@cs.ualberta.ca Abstract World model is very important for

More information

MATH4406 Assignment 5

MATH4406 Assignment 5 MATH4406 Assignment 5 Patrick Laub (ID: 42051392) October 7, 2014 1 The machine replacement model 1.1 Real-world motivation Consider the machine to be the entire world. Over time the creator has running

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo September 6, 2011 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization

Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Dimitri P. Bertsekas Laboratory for Information and Decision Systems Massachusetts Institute of Technology February 2014

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

Markov Decision Processes and Dynamic Programming

Markov Decision Processes and Dynamic Programming Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course How to model an RL problem The Markov Decision Process

More information

A strongly polynomial algorithm for linear systems having a binary solution

A strongly polynomial algorithm for linear systems having a binary solution A strongly polynomial algorithm for linear systems having a binary solution Sergei Chubanov Institute of Information Systems at the University of Siegen, Germany e-mail: sergei.chubanov@uni-siegen.de 7th

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

On the Approximate Solution of POMDP and the Near-Optimality of Finite-State Controllers

On the Approximate Solution of POMDP and the Near-Optimality of Finite-State Controllers On the Approximate Solution of POMDP and the Near-Optimality of Finite-State Controllers Huizhen (Janey) Yu (janey@mit.edu) Dimitri Bertsekas (dimitrib@mit.edu) Lab for Information and Decision Systems,

More information

Introduction to the Numerical Solution of IVP for ODE

Introduction to the Numerical Solution of IVP for ODE Introduction to the Numerical Solution of IVP for ODE 45 Introduction to the Numerical Solution of IVP for ODE Consider the IVP: DE x = f(t, x), IC x(a) = x a. For simplicity, we will assume here that

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

APPENDIX C: Measure Theoretic Issues

APPENDIX C: Measure Theoretic Issues APPENDIX C: Measure Theoretic Issues A general theory of stochastic dynamic programming must deal with the formidable mathematical questions that arise from the presence of uncountable probability spaces.

More information

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Markov Chains and Stochastic Sampling

Markov Chains and Stochastic Sampling Part I Markov Chains and Stochastic Sampling 1 Markov Chains and Random Walks on Graphs 1.1 Structure of Finite Markov Chains We shall only consider Markov chains with a finite, but usually very large,

More information

Infinite-Horizon Dynamic Programming

Infinite-Horizon Dynamic Programming 1/70 Infinite-Horizon Dynamic Programming http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: this slides is based on Prof. Mengdi Wang s and Prof. Dimitri Bertsekas lecture notes 2/70 作业

More information

Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate

Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate www.scichina.com info.scichina.com www.springerlin.com Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate WEI Chen & CHEN ZongJi School of Automation

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

Notes on Measure Theory and Markov Processes

Notes on Measure Theory and Markov Processes Notes on Measure Theory and Markov Processes Diego Daruich March 28, 2014 1 Preliminaries 1.1 Motivation The objective of these notes will be to develop tools from measure theory and probability to allow

More information

Approximate Optimal-Value Functions. Satinder P. Singh Richard C. Yee. University of Massachusetts.

Approximate Optimal-Value Functions. Satinder P. Singh Richard C. Yee. University of Massachusetts. An Upper Bound on the oss from Approximate Optimal-Value Functions Satinder P. Singh Richard C. Yee Department of Computer Science University of Massachusetts Amherst, MA 01003 singh@cs.umass.edu, yee@cs.umass.edu

More information

The convergence limit of the temporal difference learning

The convergence limit of the temporal difference learning The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector

More information

INTRODUCTION TO MARKOV DECISION PROCESSES

INTRODUCTION TO MARKOV DECISION PROCESSES INTRODUCTION TO MARKOV DECISION PROCESSES Balázs Csanád Csáji Research Fellow, The University of Melbourne Signals & Systems Colloquium, 29 April 2010 Department of Electrical and Electronic Engineering,

More information

Notes on Tabular Methods

Notes on Tabular Methods Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from

More information

Observer design for a general class of triangular systems

Observer design for a general class of triangular systems 1st International Symposium on Mathematical Theory of Networks and Systems July 7-11, 014. Observer design for a general class of triangular systems Dimitris Boskos 1 John Tsinias Abstract The paper deals

More information

NORMS ON SPACE OF MATRICES

NORMS ON SPACE OF MATRICES NORMS ON SPACE OF MATRICES. Operator Norms on Space of linear maps Let A be an n n real matrix and x 0 be a vector in R n. We would like to use the Picard iteration method to solve for the following system

More information

Regular Policies in Abstract Dynamic Programming

Regular Policies in Abstract Dynamic Programming August 2016 (Revised January 2017) Report LIDS-P-3173 Regular Policies in Abstract Dynamic Programming Dimitri P. Bertsekas Abstract We consider challenging dynamic programming models where the associated

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Stochastic Dynamic Programming. Jesus Fernandez-Villaverde University of Pennsylvania

Stochastic Dynamic Programming. Jesus Fernandez-Villaverde University of Pennsylvania Stochastic Dynamic Programming Jesus Fernande-Villaverde University of Pennsylvania 1 Introducing Uncertainty in Dynamic Programming Stochastic dynamic programming presents a very exible framework to handle

More information

5. Solving the Bellman Equation

5. Solving the Bellman Equation 5. Solving the Bellman Equation In the next two lectures, we will look at several methods to solve Bellman s Equation (BE) for the stochastic shortest path problem: Value Iteration, Policy Iteration and

More information

Lecture 12. F o s, (1.1) F t := s>t

Lecture 12. F o s, (1.1) F t := s>t Lecture 12 1 Brownian motion: the Markov property Let C := C(0, ), R) be the space of continuous functions mapping from 0, ) to R, in which a Brownian motion (B t ) t 0 almost surely takes its value. Let

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Q-Learning and Stochastic Approximation

Q-Learning and Stochastic Approximation MS&E338 Reinforcement Learning Lecture 4-04.11.018 Q-Learning and Stochastic Approximation Lecturer: Ben Van Roy Scribe: Christopher Lazarus Javier Sagastuy In this lecture we study the convergence of

More information

Online solution of the average cost Kullback-Leibler optimization problem

Online solution of the average cost Kullback-Leibler optimization problem Online solution of the average cost Kullback-Leibler optimization problem Joris Bierkens Radboud University Nijmegen j.bierkens@science.ru.nl Bert Kappen Radboud University Nijmegen b.kappen@science.ru.nl

More information

1 Stochastic Dynamic Programming

1 Stochastic Dynamic Programming 1 Stochastic Dynamic Programming Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Theorem 5.3. Let E/F, E = F (u), be a simple field extension. Then u is algebraic if and only if E/F is finite. In this case, [E : F ] = deg f u.

Theorem 5.3. Let E/F, E = F (u), be a simple field extension. Then u is algebraic if and only if E/F is finite. In this case, [E : F ] = deg f u. 5. Fields 5.1. Field extensions. Let F E be a subfield of the field E. We also describe this situation by saying that E is an extension field of F, and we write E/F to express this fact. If E/F is a field

More information

Convex Analysis and Optimization Chapter 2 Solutions

Convex Analysis and Optimization Chapter 2 Solutions Convex Analysis and Optimization Chapter 2 Solutions Dimitri P. Bertsekas with Angelia Nedić and Asuman E. Ozdaglar Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com

More information