Quadratic approximate dynamic programming for input-affine systems


INTERNATIONAL JOURNAL OF ROBUST AND NONLINEAR CONTROL
Int. J. Robust. Nonlinear Control (2012). Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/rnc.2894

Quadratic approximate dynamic programming for input-affine systems

Arezou Keshavarz* and Stephen Boyd
Electrical Engineering, Stanford University, Stanford, CA, USA

SUMMARY

We consider the use of quadratic approximate value functions for stochastic control problems with input-affine dynamics and convex stage cost and constraints. Evaluating the approximate dynamic programming policy in such cases requires the solution of an explicit convex optimization problem, such as a quadratic program, which can be carried out efficiently. We describe a simple and general method for approximate value iteration that also relies on our ability to solve convex optimization problems, in this case typically a semidefinite program. Although we have no theoretical guarantee on the performance attained using our method, we observe that very good performance can be obtained in practice. Copyright 2012 John Wiley & Sons, Ltd.

Received 10 February 2012; Revised May 2012; Accepted July 2012

KEY WORDS: approximate dynamic programming; stochastic control; convex optimization

1. INTRODUCTION

We consider stochastic control problems with input-affine dynamics and convex stage costs, which arise in many applications in nonlinear control, supply chain, and finance. Such problems can be solved in principle using dynamic programming (DP), which expresses the optimal policy in terms of an optimization problem involving the value function (or a sequence of value functions in the time-varying case), which is a function on the state space R^n. The value function (or functions) can be found, in principle, by an iteration involving the Bellman operator, which maps functions on the state space into functions on the state space and involves an expectation and a minimization step; see [1] for an early work on DP and [2-4] for a more detailed overview.

For the special case when the dynamics is linear and the stage cost is convex quadratic, DP gives a complete and explicit solution of the problem, because in this case the value function is also quadratic, and the expectation and minimization steps both preserve quadratic functions and can be carried out explicitly. For other input-affine convex stochastic control problems, DP is difficult to carry out. The basic problem is that there is no general way to (exactly) represent a function on R^n, much less carry out the expectation and minimization steps that arise in the Bellman operator. For small state space dimension, say n ≤ 4, the important region of state space can be gridded (or otherwise represented by a finite number of points), and functions on it can be represented by a finite-dimensional set of functions, such as piecewise affine. Brute force computation can then be used to find the value function (or, more accurately, a good approximation of it) and also the optimal policy. But this approach will not scale to larger state space dimensions, because, in general, the number of basis functions needed to represent a function to a given accuracy grows exponentially in the state space dimension n (this is the curse of dimensionality).

*Correspondence to: Arezou Keshavarz, Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA 94305, USA. arezou@stanford.edu

For problems with larger dimensions, approximate dynamic programming (ADP) gives a method for finding a good, if not optimal, policy. In ADP, the control policy uses a surrogate or approximate value function in place of the true value function. The distinction with the brute force numerical method described previously is that in ADP we abandon the goal of approximating the true value function well and, instead, only hope to capture some of its key attributes. The approximate value function is also chosen so that the optimization problem that must be solved to evaluate the policy is tractable. It has been observed that good performance can be obtained with ADP even when the approximate value function is not a particularly good approximation of the true value function. (Of course, this depends on how the approximate value function is chosen.)

There are many methods for finding an appropriate approximate value function; for an overview of ADP, see [3]. The approximate value function can be chosen as the true value function of a simplified version of the stochastic control problem, for which we can easily obtain the value function. For example, we can form a linear quadratic approximation of the problem and use the (quadratic) value function obtained for the approximation as our approximate value function. Another example, used in model predictive control (MPC) or certainty-equivalent rollout (see, e.g., [5]), is to replace the stochastic terms in the original problem with constant values, say, their expectations. This results in an ordinary optimization problem, which (if it is convex) we can solve to obtain the value function. Recently, Wang et al. have developed a method that uses convex optimization to compute a quadratic lower bound on the true value function, which provides a good candidate for an approximate value function and also gives a lower bound on the optimal performance [6, 7]. This was extended in [8] to use as approximate value function the pointwise supremum of a family of quadratic lower bounds on the value function. These methods, however, cannot be used for general input-affine convex problems.

Approximate dynamic programming is closely related to inverse optimization, which has been studied in various fields such as control [9, 10], robotics [11, 12], and economics [13-15]. In inverse optimization, the goal is to find (or estimate or impute) the objective function in an underlying optimizing process, given observations of optimal (or nearly optimal) actions. In a stochastic control problem, the objective is related to the value function, so the goal is to approximate the value function, given samples of optimal or nearly optimal actions. Watkins et al. introduced an algorithm called Q-learning [16, 17], which observes the state-action pair at each step and updates a quality function on the basis of the obtained reward/cost.

Model predictive control, also known as receding-horizon control or rolling-horizon planning [18-20], is also closely related to ADP. In MPC, at each step, we plan for the next T time steps, using some estimates or predictions of future unknown values, and then execute only the action for the current time step. At the next step, we solve the same problem again, now using the value of the current state and updated values of the predictions based on any new information available. MPC can be interpreted as ADP, where the approximate value function is the optimal value of an optimization problem.
In this paper, we take a simple approach, restricting the approximate value functions to be quadratic. This implies that the expectation step can be carried out exactly and that evaluating the policy reduces to a tractable convex optimization problem, often a simple quadratic program (QP), in many practical cases. To choose a quadratic approximate value function, we use the same Bellman operator iteration that would result in the true value function (when some technical conditions hold), but we replace the Bellman operator with an approximation that maps quadratic functions to quadratic functions.

The idea of value iteration followed by a projection step back to the subset of candidate approximate value functions is an old one, explored by many authors [3, Chapter 6; 21; 22]. In the reinforcement learning community, two closely related methods are fitted value iteration and fitted Q-iteration [23-27]. Most of these works consider problems with continuous state space and finite action space, and some of them come with theoretical guarantees on convergence and performance bounds. For instance, the authors in [23] proposed a family of tree-based methods for fitting approximate Q-functions and provided convergence guarantees and performance bounds for some of the proposed methods. Riedmiller considered the same framework but employed a multilayer perceptron to fit (model-free) Q-functions and provided empirical performance results [24].

Munos and Szepesvári developed finite-time bounds for fitted value iteration for discounted problems with bounded rewards [26]. Antos et al. extended this work to continuous action spaces and developed performance bounds for a variant of fitted Q-iteration that uses stationary stochastic policies, where greedy action selection is replaced with a search over a restricted set of candidate policies [27]. There are also approximate versions of policy iteration and of the linear programming formulation of DP [28, 29], where the value function is replaced by a linear combination of a set of basis functions. Lagoudakis et al. considered the discounted finite-state case and proposed a projected policy iteration method [30] that minimizes the ℓ2 Bellman residual to obtain an approximate value function and policy at each step of the algorithm. Tsitsiklis and Van Roy established strong theoretical bounds on the suboptimality of the ADP policy for the discounted-cost finite-state case [31, 32].

Our contribution is to work out the details of a projected ADP method for the case of input-affine dynamics and convex stage costs, in the relatively new context of the availability of extremely fast and reliable solvers for convex optimization problems. We rely on our ability to (numerically) solve convex optimization problems with great speed and reliability. Recent advances have shown that using custom-generated solvers can speed up computation by orders of magnitude [33-38]. With solvers that are orders of magnitude faster, we can carry out approximate value iteration for many practical problems and obtain approximate value functions that are simple yet achieve very good performance.

The method we propose comes with no theoretical guarantees. There is no guarantee that the modified Bellman iterations will converge, and when they do (or when we simply terminate the iterations early), we do not have any bound on how suboptimal the resulting control policy is. On the other hand, we have found that the methods described in this paper work very well in practice. In cases in which the performance bounding methods of Wang et al. can be applied, we can confirm that our quadratic ADP policies are very good and, in many cases, nearly optimal, at least for specific problem instances. Our method has many of the same characteristics as proportional-integral-derivative (PID) control, widely used in industrial control. There are no theoretical guarantees that PID control will work well; indeed, it is easy to create a plant for which no PID controller results in a stable closed-loop system (which means the objective value is infinite). On the other hand, PID control, with some tuning of the parameters, can usually be made to work very well, or at least well enough, in many (if not most) practical applications.

Style of the paper. The style of the paper is informal but algorithmic and computationally oriented. We are very informal and cavalier in our mathematics, occasionally referring the reader to other work that covers the specific topic more formally and precisely. We do this so as to keep the ideas simple and clear, without becoming bogged down in technical details, and in any case, we do not make any mathematical claims about the methods described. The methods presented are offered as methods that can work very well in practice and not as methods for which we make any more specific or precise claim.
We also do not give the most general formulation possible; instead, we stick with what we hope gives a good trade-off of clarity, simplicity, and practical utility.

Outline. In Section 2, we describe three variations on the stochastic control problem. In Section 3, we review the DP solution of these problems, including the special case when the dynamics is linear and the stage cost functions are convex quadratic, as well as quadratic ADP, which is the method we propose. We describe a projected value iteration method for obtaining an approximate value function in Section 4. In the remaining three sections, we describe numerical examples: a traditional control problem with bounded inputs (Section 5), a multiperiod portfolio optimization problem (Section 6), and a vehicle control problem (Section 7).

2. INPUT-AFFINE CONVEX STOCHASTIC CONTROL

In this section, we set up three variations on the input-affine convex stochastic control problem: the finite-horizon case, the discounted infinite-horizon case, and the average-cost case. We start with the assumptions and definitions that are common to all cases.

2.1. The common model

Input-affine random dynamics. We consider a dynamical system with state x_t ∈ R^n, input or action u_t ∈ R^m, and input-affine random dynamics

x_{t+1} = f_t(x_t) + g_t(x_t) u_t,

where f_t : R^n → R^n and g_t : R^n → R^{n×m}. We assume that f_t and g_t are (possibly) random, that is, they depend on some random variables, with (f_t, g_t) and (f_τ, g_τ) independent for t ≠ τ. The initial state x_0 is also a random variable, independent of (f_t, g_t). We say the dynamics is time-invariant if the random variables (f_t, g_t) are identically distributed (and therefore IID).

Linear dynamics. We say the dynamics is linear (or, more accurately, affine) if f_t is an affine function of x_t and g_t does not depend on x_t, in which case we can write the dynamics in the more familiar form

x_{t+1} = A_t x_t + B_t u_t + c_t.

Here, A_t ∈ R^{n×n}, B_t ∈ R^{n×m}, and c_t ∈ R^n are random.

Closed-loop dynamics. We consider state feedback policies, that is, u_t = φ_t(x_t), t = 0, 1, ..., where φ_t : R^n → R^m is the policy at time t. The closed-loop dynamics are then

x_{t+1} = f_t(x_t) + g_t(x_t) φ_t(x_t),

which recursively defines a stochastic process for x_t (and u_t = φ_t(x_t)). We say the policy is time-invariant if it does not depend on t, in which case we denote it by φ.

Stage cost. The stage cost is given by ℓ_t(z, v), where ℓ_t : R^n × R^m → R ∪ {∞} is an extended-valued closed convex function. We use infinite stage cost to denote unallowed state-input pairs, that is, (joint) constraints on z and v. We will assume that for each state there exists an input with finite stage cost. The stage cost is time-invariant if ℓ_t does not depend on t, in which case we write it as ℓ.

Convex quadratic stage cost. An important special case is when the stage cost is a (convex) quadratic function of the form

\ell_t(z, v) = \frac{1}{2} \begin{bmatrix} z \\ v \\ 1 \end{bmatrix}^T \begin{bmatrix} Q_t & S_t & q_t \\ S_t^T & R_t & r_t \\ q_t^T & r_t^T & s_t \end{bmatrix} \begin{bmatrix} z \\ v \\ 1 \end{bmatrix}, \qquad \text{where } \begin{bmatrix} Q_t & S_t \\ S_t^T & R_t \end{bmatrix} \succeq 0,

that is, the quadratic part is positive semidefinite.

Quadratic program-representable stage cost. A stage cost is QP-representable if ℓ_t is a convex quadratic plus convex piecewise-linear function, subject to polyhedral constraints (which can be represented in the cost function using an indicator function). Minimizing such a function can be carried out by solving a QP.
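To make the common model concrete, the following Python sketch (our illustration, not code from the paper; all names are ours) evaluates a convex quadratic stage cost in the matrix form above and simulates one closed-loop step of the input-affine dynamics. The user-supplied `sample_ft_gt` is assumed to draw a realization of (f_t(x), g_t(x)); in the linear case it would return (A x + c, B) with c drawn from `rng`.

```python
import numpy as np

def quadratic_stage_cost(z, v, Q, S, R, q, r, s):
    """l(z, v) = 0.5 * [z; v; 1]^T [[Q, S, q], [S^T, R, r], [q^T, r^T, s]] [z; v; 1]."""
    w = np.concatenate([z, v, [1.0]])
    M = np.block([[Q, S, q[:, None]],
                  [S.T, R, r[:, None]],
                  [q[None, :], r[None, :], np.array([[s]])]])
    return 0.5 * w @ M @ w

def closed_loop_step(x, policy, sample_ft_gt, rng):
    """One step of x_{t+1} = f_t(x_t) + g_t(x_t) u_t under a state-feedback policy."""
    f, g = sample_ft_gt(x, rng)      # realization of the random pair (f_t(x), g_t(x))
    u = policy(x)                    # u_t = phi_t(x_t)
    return f + g @ u, u
```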

Stochastic control problem. In the problems we consider, the overall objective function, which we denote by J, is a sum or average of the expected stage costs. The associated stochastic control problem is to choose the policies φ_t (or the policy φ when the policy is time-invariant) so as to minimize J. We let J* denote the optimal (minimal) value of the objective, and we let φ*_t denote an optimal policy at time t. When the dynamics is linear, we call the stochastic control problem a linear convex stochastic control problem. When the dynamics is linear and the stage cost is convex quadratic, we call the stochastic control problem a linear quadratic stochastic control problem.

2.2. The problems

Finite horizon. In the finite-horizon problem, the objective is the expected value of the sum of the stage costs up to a horizon T:

J = \sum_{t=0}^{T} E\, \ell_t(x_t, u_t).

Discounted infinite horizon. In the discounted infinite-horizon problem, we assume that the dynamics, stage cost, and policy are time-invariant. The objective in this case is

J = \sum_{t=0}^{\infty} E\, \gamma^t \ell(x_t, u_t),

where γ ∈ (0, 1) is a discount factor.

Average cost. In the average-cost problem, we assume that the dynamics, stage cost, and policy are time-invariant, and take as objective the average stage cost,

J = \lim_{T \to \infty} \frac{1}{T+1} \sum_{t=0}^{T} E\, \ell(x_t, u_t).

We will simply assume that the expectations exist in all three objectives, that the sum exists in the discounted case, and that the limit exists in the average-cost case.
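For a fixed policy, each of these objectives can be estimated by plain Monte Carlo simulation of the closed loop, which is also how the policies in the examples of Sections 5-7 are evaluated. A minimal sketch follows (our own helper names; the discounted and average-cost estimates are truncated at the simulation horizon).

```python
import numpy as np

def rollout_costs(x0, policy, sample_ft_gt, stage_cost, T, rng):
    """Simulate the closed loop for T steps; return the realized stage costs."""
    x, costs = x0, []
    for _ in range(T):
        f, g = sample_ft_gt(x, rng)
        u = policy(x)
        costs.append(stage_cost(x, u))
        x = f + g @ u
    return np.array(costs)

def objective_estimates(costs, gamma):
    """Single-rollout estimates of the finite-horizon, discounted, and average objectives."""
    discounted = (gamma ** np.arange(len(costs))) * costs
    return costs.sum(), discounted.sum(), costs.mean()
```

Averaging these quantities over many independent rollouts gives the Monte Carlo estimates of J used later in the paper.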

3. DYNAMIC AND APPROXIMATE DYNAMIC PROGRAMMING

The stochastic control problems described earlier can be solved in principle using DP, which we sketch here for future reference; see [3, 4, 39] for (much) more detail. DP makes use of a function (time-varying in the finite-horizon case) V : R^n → R that characterizes the cost of starting from a given state when an optimal policy is used. The definition of V, its characterization, and the methods for computing it differ (slightly) for the three stochastic control problems we consider.

Finite horizon. In the finite-horizon case, we have a value function V_t for each t = 0, ..., T. The value function V_t(z) is the minimum cost-to-go starting from state x_t = z at time t, using an optimal policy for times t, ..., T:

V_t(z) = \sum_{\tau=t}^{T} E\left( \ell_\tau(x_\tau, \varphi_\tau^\star(x_\tau)) \mid x_t = z \right), \qquad t = 0, \ldots, T.

The optimal cost is given by J* = E V_0(x_0), where the expectation is over x_0. Although our definition of V_t refers to an optimal policy φ*_τ, we can find V_t (in principle) using a backward recursion that does not require knowledge of an optimal policy. We start with

V_T(z) = \min_v \ell_T(z, v)

and then find V_{T-1}, V_{T-2}, ..., V_0 from the recursion

V_t(z) = \min_v \left( \ell_t(z, v) + E\, V_{t+1}(f_t(z) + g_t(z) v) \right), \qquad t = T-1, \ldots, 0.

(The expectation here is over (f_t, g_t).) Defining the Bellman operator T_t for the finite-horizon problem as

(\mathcal{T}_t h)(z) = \min_v \left( \ell_t(z, v) + E\, h(f_t(z) + g_t(z) v) \right)

for h : R^n → R ∪ {∞}, we can express this recursion compactly as

V_t = \mathcal{T}_t V_{t+1}, \qquad t = T, T-1, \ldots, 0, \qquad (1)

with V_{T+1} = 0. We can express an optimal policy in terms of the value functions as

\varphi_t^\star(z) = \arg\min_v \left( \ell_t(z, v) + E\, V_{t+1}(f_t(z) + g_t(z) v) \right), \qquad t = 0, \ldots, T.

Discounted infinite horizon. For the discounted infinite-horizon problem, the value function is time-invariant, that is, it does not depend on t. The value function V(z) is the minimum cost-to-go from state x_0 = z at t = 0, using an optimal policy:

V(z) = \sum_{t=0}^{\infty} E\left( \gamma^t \ell(x_t, \varphi^\star(x_t)) \mid x_0 = z \right).

Thus, J* = E V(x_0). The value function cannot be constructed using a simple recursion, as in the finite-horizon case, but it is characterized as the unique solution of the Bellman fixed-point equation T V = V, where the Bellman operator T for the discounted infinite-horizon case is

(\mathcal{T} h)(z) = \min_v \left( \ell(z, v) + \gamma\, E\, h(f(z) + g(z) v) \right)

for h : R^n → R ∪ {∞}. The value function can be found (in principle) by several methods, including iteration of the Bellman operator,

V^{k+1} = \mathcal{T} V^k, \qquad k = 1, 2, \ldots, \qquad (2)

which is called value iteration. Under certain technical conditions, this iteration converges to V [2]; see [39, Chapter 9] for a detailed analysis of the convergence of value iteration. We can express an optimal policy in terms of the value function as

\varphi^\star(z) = \arg\min_v \left( \ell(z, v) + \gamma\, E\, V(f(z) + g(z) v) \right).

Average cost. For the average-cost problem, the value function is not the cost-to-go starting from a given state (which would have the same value J* for all states). Instead, it represents the differential sum of costs, that is,

V(z) = \sum_{t=0}^{\infty} E\left( \ell(x_t, \varphi^\star(x_t)) - J^\star \mid x_0 = z \right).

We can characterize the value function (up to an additive constant) and the optimal average cost as a solution of the fixed-point equation V + J* = T V, where T, the Bellman operator for the average-cost problem, is defined as

(\mathcal{T} h)(z) = \min_v \left( \ell(z, v) + E\, h(f(z) + g(z) v) \right)

for h : R^n → R ∪ {∞}. Notice that if V and J* satisfy the fixed-point equation, so do V + β and J* for any β ∈ R. Thus, we can choose a reference state x^ref and assume V(x^ref) = 0 without loss of generality. Under some technical assumptions [3, 4], the value function V and the optimal cost J* can be found (in principle) by several methods, including value iteration, which has the form

J^{k+1} = (\mathcal{T} V^k)(x^{\mathrm{ref}}), \qquad V^{k+1} = \mathcal{T} V^k - J^{k+1}, \qquad (3)

for k = 1, 2, .... Here V^k converges to V, and J^k converges to J* [3, 40]. We can express an optimal policy in terms of V as

\varphi^\star(z) = \arg\min_v \left( \ell(z, v) + E\, V(f(z) + g(z) v) \right).

3.1. Linear convex dynamic programming

When the dynamics is linear, the value function (or functions, in the finite-horizon case) is convex. To see this, we first note that the Bellman operator maps convex functions to convex functions. The (random) function f_t(z) + g_t(z)v = A_t z + B_t v + c_t is an affine function of (z, v); it follows that when h is convex, h(A_t z + B_t v + c_t) is a convex function of (z, v). Expectation preserves convexity, so

E\, h(A_t z + B_t v + c_t)

is a convex function of (z, v). Finally,

\min_v \left( \ell(z, v) + \gamma\, E\, h(A_t z + B_t v + c_t) \right)

is convex in z, because ℓ is convex and convexity is preserved under partial minimization [41, Section 3.2.5]. It follows that the value function is convex. In the finite-horizon case, this argument shows that each V_t is convex. In the infinite-horizon and average-cost cases, it shows that value iteration preserves convexity, so if we start with a convex initial guess (say, 0), all iterates are convex. The limit of a sequence of convex functions is convex, so we conclude that V is convex. Evaluating the optimal policy at x_t, that is, minimizing ℓ_t(x_t, v) + E V_{t+1}(f_t(x_t) + g_t(x_t)v) over v, involves solving a convex optimization problem. In the finite-horizon and average-cost cases, the discount factor is γ = 1; in the infinite-horizon cases (discounted and average cost), the stage cost and value function are time-invariant.

3.2. Linear quadratic dynamic programming

For linear quadratic stochastic control problems, we can effectively solve the stochastic control problem by using DP. The reason is simple: the Bellman operator maps convex quadratic functions to convex quadratic functions, so the value function is also convex quadratic (and we can compute it by value iteration). The argument is similar to the convex case: convex quadratic functions are preserved under expectation and partial minimization. One big difference, however, is that the Bellman iteration can actually be carried out in the linear quadratic case, using an explicit representation of convex quadratic functions and standard linear algebra operations.

The optimal policy is affine, with coefficients we can compute. To simplify notation, we consider the discounted infinite-horizon case; the other two cases are very similar.

We denote by Q_n the set of convex quadratic functions on R^n, that is, those of the form

h(z) = \frac{1}{2} \begin{bmatrix} z \\ 1 \end{bmatrix}^T \begin{bmatrix} H & h \\ h^T & p \end{bmatrix} \begin{bmatrix} z \\ 1 \end{bmatrix}, \qquad H \succeq 0.

With linear dynamics, we have

h(A_t z + B_t v + c_t) = \frac{1}{2} \begin{bmatrix} A_t z + B_t v + c_t \\ 1 \end{bmatrix}^T \begin{bmatrix} H & h \\ h^T & p \end{bmatrix} \begin{bmatrix} A_t z + B_t v + c_t \\ 1 \end{bmatrix},

where (A_t, B_t, c_t) is a random variable. Convex quadratic functions are closed under expectation, so E h(A_t z + B_t v + c_t) is convex quadratic; the details (including formulas for the coefficients) are given in the Appendix. The stage cost ℓ is convex quadratic, that is,

\ell(z, v) = \frac{1}{2} \begin{bmatrix} z \\ v \\ 1 \end{bmatrix}^T \begin{bmatrix} Q & S & q \\ S^T & R & r \\ q^T & r^T & s \end{bmatrix} \begin{bmatrix} z \\ v \\ 1 \end{bmatrix}, \qquad \text{where } \begin{bmatrix} Q & S \\ S^T & R \end{bmatrix} \succeq 0.

Thus, ℓ(z, v) + γ E h(A_t z + B_t v + c_t) is convex quadratic, and we write it as

\ell(z, v) + \gamma\, E\, h(A_t z + B_t v + c_t) = \frac{1}{2} \begin{bmatrix} z \\ v \\ 1 \end{bmatrix}^T M \begin{bmatrix} z \\ v \\ 1 \end{bmatrix}, \qquad M = \begin{bmatrix} M_{11} & M_{12} & M_{13} \\ M_{12}^T & M_{22} & M_{23} \\ M_{13}^T & M_{23}^T & M_{33} \end{bmatrix}, \qquad \begin{bmatrix} M_{11} & M_{12} \\ M_{12}^T & M_{22} \end{bmatrix} \succeq 0.

Partial minimization of ℓ(z, v) + γ E h(A_t z + B_t v + c_t) over v results in the quadratic function

(\mathcal{T} h)(z) = \min_v \left( \ell(z, v) + \gamma\, E\, h(A_t z + B_t v + c_t) \right) = \frac{1}{2} \begin{bmatrix} z \\ 1 \end{bmatrix}^T \begin{bmatrix} S_{11} & S_{12} \\ S_{12}^T & S_{22} \end{bmatrix} \begin{bmatrix} z \\ 1 \end{bmatrix},

where

\begin{bmatrix} S_{11} & S_{12} \\ S_{12}^T & S_{22} \end{bmatrix} = \begin{bmatrix} M_{11} & M_{13} \\ M_{13}^T & M_{33} \end{bmatrix} - \begin{bmatrix} M_{12} \\ M_{23}^T \end{bmatrix} M_{22}^{-1} \begin{bmatrix} M_{12}^T & M_{23} \end{bmatrix}

is the Schur complement of M with respect to its (2,2) block; when M_{22} is singular, we replace M_{22}^{-1} with the pseudo-inverse M_{22}^\dagger [41, Section A.5.5]. Furthermore, because

\begin{bmatrix} M_{11} & M_{12} \\ M_{12}^T & M_{22} \end{bmatrix} \succeq 0,

we can conclude that S_{11} ⪰ 0; thus, (T h)(z) is a convex quadratic function, that is, T h ∈ Q_n. This shows that convex quadratic functions are invariant under the Bellman operator T, from which it follows that the value function is convex quadratic. Again, in the finite-horizon case, this shows that each V_t is convex quadratic. In the infinite-horizon and average-cost cases, this argument shows that convex quadratic functions are preserved under value iteration, so if the initial guess is convex quadratic, all iterates are convex quadratic. The limit of a sequence of convex quadratic functions is convex quadratic; thus, we can conclude that V is a convex quadratic function.
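The following numpy sketch (our illustration) carries out one Bellman update T h for a quadratic h in the simplest setting, where (A, B, c) is deterministic so that the expectation is trivial; with random dynamics, the term T^T H T below would be replaced by its expectation, computed from the first and second moments of (A_t, B_t, c_t) as in the Appendix.

```python
import numpy as np

def bellman_lq(Hbig, L, A, B, c, gamma=1.0):
    """One Bellman update for h(w) = 0.5*[w;1]^T Hbig [w;1], quadratic stage cost
    0.5*[z;v;1]^T L [z;v;1], and deterministic affine dynamics w = A z + B v + c.
    Returns the (n+1)x(n+1) coefficient matrix of the convex quadratic T h."""
    n, m = A.shape[1], B.shape[1]
    T = np.block([[A, B, c[:, None]],
                  [np.zeros((1, n + m)), np.ones((1, 1))]])   # maps [z;v;1] -> [Az+Bv+c;1]
    M = L + gamma * T.T @ Hbig @ T
    iz, iv, ic = slice(0, n), slice(n, n + m), slice(n + m, n + m + 1)
    top = np.block([[M[iz, iz], M[iz, ic]],
                    [M[ic, iz], M[ic, ic]]])
    side = np.vstack([M[iz, iv], M[ic, iv]])
    # Schur complement of the (2,2) block; pinv handles a singular M22
    return top - side @ np.linalg.pinv(M[iv, iv]) @ side.T
```

Iterating this map (with γ < 1 in the discounted case, or with the recentering of (3) in the average-cost case) is value iteration for the linear quadratic problem.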

3.3. Approximate dynamic programming

Approximate dynamic programming is a general approach for obtaining a good suboptimal policy. The basic idea is to replace the true value function V (or V_t in the finite-horizon case) with an approximation V̂ (or V̂_t), which yields the ADP policies

\hat\varphi_t(x) = \arg\min_u \left( \ell_t(x, u) + E\, \hat V_{t+1}(f_t(x) + g_t(x) u) \right)

for the finite-horizon case,

\hat\varphi(x) = \arg\min_u \left( \ell(x, u) + \gamma\, E\, \hat V(f(x) + g(x) u) \right)

for the discounted infinite-horizon case, and

\hat\varphi(x) = \arg\min_u \left( \ell(x, u) + E\, \hat V(f(x) + g(x) u) \right)

for the average-cost case. These ADP policies reduce to the optimal policy for the choice V̂_t = V_t. The goal is to choose the approximate value functions so that the minimizations needed to compute φ̂(x) can be carried out efficiently, and so that the objective Ĵ attained by the resulting policy is small, hopefully close to J*. It has been observed in practice that good policies (i.e., ones with small objective values) can result even when V̂ is not a particularly good approximation of V.

3.4. Quadratic approximate dynamic programming

In quadratic ADP, we take V̂ ∈ Q_n (or V̂_t ∈ Q_n in the time-varying case), that is,

\hat V(z) = \frac{1}{2} \begin{bmatrix} z \\ 1 \end{bmatrix}^T \begin{bmatrix} P & p \\ p^T & q \end{bmatrix} \begin{bmatrix} z \\ 1 \end{bmatrix}, \qquad P \succeq 0.

With input-affine dynamics, we have

\hat V(f_t(z) + g_t(z) v) = \frac{1}{2} \begin{bmatrix} f_t(z) + g_t(z) v \\ 1 \end{bmatrix}^T \begin{bmatrix} P & p \\ p^T & q \end{bmatrix} \begin{bmatrix} f_t(z) + g_t(z) v \\ 1 \end{bmatrix},

where (f_t, g_t) is a random variable with known mean and covariance. Thus, for each fixed z, the expectation E V̂(f_t(z) + g_t(z)v) is a convex quadratic function of v, with coefficients that can be computed from the first and second moments of (f_t, g_t); see the Appendix. A special case is when the cost function ℓ(z, v) (or ℓ_t(z, v) in the time-varying case) is QP-representable. Then, for a given z, the quadratic ADP policy

\hat\varphi(z) = \arg\min_v \left( \ell(z, v) + E\, \hat V(f_t(z) + g_t(z) v) \right)

can be evaluated by solving a QP.
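To illustrate how the quadratic ADP policy is evaluated, here is a cvxpy sketch for the case of deterministic affine dynamics (our own code, not the CVXGEN implementation used later in the paper). The arguments `stage_cost` and `constraints_fn` are assumed to be user-supplied callables returning a convex cvxpy expression and a list of constraints; with random dynamics, the V̂ term would be replaced by its expectation, which is again quadratic (see the Appendix).

```python
import cvxpy as cp

def adp_policy(x, A, B, c, P, p, q, stage_cost, constraints_fn, gamma=1.0):
    """phi_hat(x) = argmin_u  l(x, u) + gamma * Vhat(A x + B u + c),
    where Vhat(w) = 0.5 * w^T P w + p^T w + 0.5 * q with P PSD."""
    u = cp.Variable(B.shape[1])
    x_next = A @ x + B @ u + c
    vhat = 0.5 * cp.quad_form(x_next, P) + p @ x_next + 0.5 * q
    prob = cp.Problem(cp.Minimize(stage_cost(x, u) + gamma * vhat),
                      constraints_fn(x, u))
    prob.solve()
    return u.value
```

When `stage_cost` is QP-representable (a quadratic plus piecewise-linear expression with polyhedral constraints), the problem solved here is a QP.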

4. PROJECTED VALUE ITERATION

Projected value iteration is a standard method for obtaining an approximate value function from a given (finite-dimensional) set of candidate approximate value functions, such as Q_n. The idea is to carry out value iteration, but to follow each Bellman operator step with a step that approximates the result with an element from the given set of candidate approximate value functions. The approximation step is typically called a projection and is denoted Π, because it maps into the set of candidate approximate value functions, but it need not be a projection (i.e., it need not minimize any distance measure). The resulting method for computing an approximate value function is called projected value iteration. Projected value iteration has been explored by many authors [3, Chapter 6; 21; 22; 25; 27]. Here, we consider scenarios in which the state space and action space are continuous and establish a framework to carry out projected value iteration for such problems.

Projected value iteration has the following form for our three problems. For the finite-horizon problem, we start from V̂_{T+1} = 0 and compute approximate value functions as

\hat V_t = \Pi\, \mathcal{T}_t \hat V_{t+1}, \qquad t = T, T-1, \ldots, 0.

For the discounted infinite-horizon problem, we have the iteration

\hat V^{(k+1)} = \Pi\, \mathcal{T} \hat V^{(k)}, \qquad k = 1, 2, \ldots.

There is no reason to believe that this iteration always converges, although as a practical matter it typically does. In any case, we terminate after some number of iterations, and we judge the whole method not in terms of convergence of projected value iteration but in terms of the performance of the policy obtained. For the average-cost problem, projected value iteration has the form

\hat J^{(k+1)} = (\mathcal{T} \hat V^{(k)})(x^{\mathrm{ref}}), \qquad \hat V^{(k+1)} = \Pi\left( \mathcal{T} \hat V^{(k)} - \hat J^{(k+1)} \right), \qquad k = 1, 2, \ldots.

Here too, we cannot guarantee convergence, although it typically occurs in practice. In these expressions, V̂^k is an element of the given set of candidate approximate value functions, but T V̂^k usually is not; indeed, we have no general way of representing T V̂^k. The iterations can be carried out, however, because Π only requires the evaluation of (T V̂^k)(x^(i)) (which are numbers) at a finite number of points x^(i).

4.1. Quadratic projected iteration

We now consider the case of quadratic ADP, that is, our set of candidate approximate value functions is Q_n. In this case, the expectation appearing in the Bellman operator can be carried out exactly, and we can evaluate (T V̂^k)(x) for any x by solving a convex optimization problem; when the stage cost is QP-representable, we can evaluate (T V̂^k)(x) by solving a QP.

We form Π T V̂^k as follows. We choose a set of sampling points x^(1), ..., x^(N) ∈ R^n and evaluate v^(i) = (T V̂^k)(x^(i)) for i = 1, ..., N. We take Π T V̂^k to be any solution of the least squares fitting problem

minimize \sum_{i=1}^{N} \left( V(x^{(i)}) - v^{(i)} \right)^2
subject to V \in Q_n,

with variable V. In other words, we simply fit a convex quadratic function to the values of the Bellman operator at the sample points. This requires the solution of a semidefinite program [10, 41].

Many variations on this basic projection method are possible. We can use any convex fitting criterion instead of a least squares criterion, for example, the sum of absolute errors. We can also add constraints on V that represent prior knowledge (beyond convexity). For example, we may have a quadratic lower bound on the true value function, and we can add this condition as a stronger constraint on V than convexity alone. (Quadratic lower bounds can be found using the methods described in [6, 43].)
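A sketch of the projection step, using cvxpy (our own code): fit V(z) = (1/2)[z;1]^T [[P, p],[p^T, q]][z;1] to the Bellman samples by least squares, with P PSD and, optionally, a quadratic lower bound V ≥ V_lb imposed as a semidefinite constraint on the coefficients. This is the semidefinite program referred to above; any SDP-capable solver (such as the one bundled with cvxpy) can be used.

```python
import cvxpy as cp

def fit_convex_quadratic(X, vals, V_lb=None):
    """Least-squares fit of a convex quadratic to samples vals[i] ~ (T Vhat_k)(X[i]).
    X has shape (N, n).  V_lb = (P0, p0, q0), if given, enforces V >= V_lb everywhere."""
    N, n = X.shape
    P, p, q = cp.Variable((n, n), PSD=True), cp.Variable(n), cp.Variable()
    # V(x_i) = 0.5 * x_i^T P x_i + p^T x_i + 0.5 * q, evaluated at every row of X
    preds = 0.5 * cp.sum(cp.multiply(X @ P, X), axis=1) + X @ p + 0.5 * q
    constraints = []
    if V_lb is not None:
        P0, p0, q0 = V_lb
        # V - V_lb >= 0 for all z  <=>  its coefficient matrix is PSD
        constraints.append(cp.bmat([[P - P0, cp.reshape(p - p0, (n, 1))],
                                    [cp.reshape(p - p0, (1, n)),
                                     cp.reshape(q - q0, (1, 1))]]) >> 0)
    cp.Problem(cp.Minimize(cp.sum_squares(preds - vals)), constraints).solve()
    return P.value, p.value, q.value
```

The variations mentioned above (a different convex fitting criterion, additional coefficient constraints) and the proximal term discussed next amount to modifying the objective of this same problem.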

Another variation on the simple fitting method described earlier adds proximal regularization to the fitting step. This means that we add a term of the form (ρ/2) ||V − V^k||_F^2 to the objective, which penalizes deviations of V from the previous value. Here, ρ > 0 is a parameter, and the norm is the Frobenius norm of the coefficient matrices,

\| V - V^k \|_F = \left\| \begin{bmatrix} P - P^k & p - p^k \\ (p - p^k)^T & q - q^k \end{bmatrix} \right\|_F.

Choice of sampling points. The choice of the evaluation points affects our projection method, so it is important to use at least a reasonable choice. An ideal choice would be to sample points from the state distribution in that period, under the final ADP policy chosen. But this cannot be done: we need to run projected value iteration to have the ADP policy, and we need the ADP policy to sample the points used by Π. A good method is to start with a reasonable choice of sample points and use these to construct an ADP policy. We then run this ADP policy to obtain a new sampling of x_t for each t and repeat the projected value iteration using these sample points. This sometimes gives an improvement in performance.

5. INPUT-CONSTRAINED LINEAR QUADRATIC CONTROL

Our first example is a traditional time-invariant linear control system, with dynamics

x_{t+1} = A x_t + B u_t + w_t,

which has the form used in this paper with f_t(z) = Az + w_t and g_t(z) = B. Here, A ∈ R^{n×n} and B ∈ R^{n×m} are known, and w_t is an IID random variable with zero mean E w_t = 0 and covariance E w_t w_t^T = W. The stage cost is the traditional quadratic one, plus a constraint on the input amplitude:

\ell(x_t, u_t) = \begin{cases} x_t^T Q x_t + u_t^T R u_t & \|u_t\|_\infty \le U^{\max} \\ \infty & \|u_t\|_\infty > U^{\max}, \end{cases}

where Q ⪰ 0 and R ≻ 0. We consider the average-cost problem.

Numerical instance. We consider a problem with n = 10 states, m = 2 inputs, U^max = 1, Q = 10 I, and R = I. The entries of A are drawn from a standard normal distribution, and the entries of B are drawn from a uniform distribution on [−0.5, 0.5]. The matrix A is then scaled so that its spectral radius is 1. The noise is Gaussian, w_t ∼ N(0, I).

Method parameters. We start from x_0 = 0 at iteration 0 and take V̂^(1) = V^quad, where V^quad is the (true) quadratic value function obtained when the stage cost is replaced with ℓ^quad(z, v) = z^T Q z + v^T R v for all (z, v). In each iteration of projected value iteration, we run the ADP policy for N = 1000 time steps and evaluate T V̂^(k) at each of these sample points. We then fit a new quadratic approximate value function V̂^(k+1), enforcing the lower bound V ≥ V^quad and using a proximal term with ρ = 0.1. Here the proximal term has little effect on the performance of the algorithm, and we could have dropped it at little cost. We use x_N to start the next iteration. For comparison, we also carry out our method with a more naive initialization and without the lower bound V ≥ V^quad.
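The initialization V^quad can be computed in closed form. The sketch below is our reading of that construction (standard unconstrained average-cost LQR via the discrete algebraic Riccati equation, using scipy); it is not the authors' code.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lq_value_function(A, B, Q, R, W):
    """Quadratic value function of the unconstrained average-cost problem with
    stage cost x'Qx + u'Ru: V_quad(z) = z^T P z (up to a constant), with optimal
    gain u = -K x and optimal average cost trace(P W)."""
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K, np.trace(P @ W)
```

Dropping the input constraint relaxes the problem, which is why V^quad can also serve as the lower bound enforced in the fitting step.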

Results. Figure 1 shows the performance J^(k) of the ADP policy (evaluated by a simulation run of 10,000 steps) after each round of projected value iteration; the performance of the policies obtained with the naive parameter choices is denoted J^(k)_naive. For this problem, we can evaluate a lower bound J^lb on J by using the method of [6]. This shows that the performance of our ADP policy at termination is at most 4% suboptimal (but likely much closer to optimal). The plots show that the naive initialization only costs a few additional projected value iteration steps. Close examination of the plots shows that the performance need not improve in each step; for example, the best performance (but only by a very small amount) with the more sophisticated initialization is obtained in iteration 9.

Figure 1. J^(k) (solid line) and J^(k)_naive (dashed line).

Computation timing. The timing results here are reported for a 3.4-GHz Intel Xeon processor running Linux. Evaluating the ADP policy requires solving a small QP. We use CVXGEN [38] to generate custom C code for this QP, which executes in around 50 μs. The simulation of 1000 time steps, which we use in each projected value iteration, requires around 0.05 s. The fitting is carried out using CVX [44] and takes around a second (this time could be reduced considerably).

6. MULTIPERIOD PORTFOLIO OPTIMIZATION

We consider a discounted infinite-horizon problem of optimizing a portfolio of n assets with discount factor γ. The position at time t is denoted by x_t ∈ R^n, where (x_t)_i is the dollar value of asset i at the beginning of period t. At each time t, we can buy and sell assets. We denote the trades by u_t ∈ R^n, where (u_t)_i is the trade for asset i: a positive (u_t)_i means that we buy asset i, and a negative (u_t)_i means that we sell asset i at time t. The position evolves according to the dynamics

x_{t+1} = \mathrm{diag}(r_t)(x_t + u_t),

where r_t is the vector of asset returns in period t. We express this in our form as f_t(z) = diag(r_t) z and g_t(z) = diag(r_t). The return vectors are IID with mean E r_t = r̄ and covariance E (r_t − r̄)(r_t − r̄)^T = Σ.

The stage cost consists of the following terms: the total gross cash put in, which is 1^T u_t; a risk penalty (a multiple of the variance of the post-trade portfolio); a quadratic transaction cost; and the constraint that the portfolio is long-only (which is x_t + u_t ≥ 0). It is given by

\ell(x, u) = \begin{cases} \mathbf{1}^T u + \kappa (x+u)^T \Sigma (x+u) + u^T \mathrm{diag}(s) u & x + u \ge 0 \\ \infty & \text{otherwise}, \end{cases}

where κ > 0 is the risk-aversion parameter and s_i > 0 models the price-impact cost for asset i. We assume that the initial portfolio x_0 is given.
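For this example, the expectation in the ADP policy can be written out explicitly: with V̂(w) = (1/2) w^T P w + p^T w (the constant does not affect the minimizer), E V̂(diag(r)(x+u)) = (1/2)(x+u)^T (P ∘ (Σ + r̄ r̄^T))(x+u) + (p ∘ r̄)^T (x+u), where ∘ denotes the elementwise product; this is a special case of the Appendix formulas. A cvxpy sketch of the resulting trade selection follows (our own code and variable names, with kappa the risk-aversion parameter).

```python
import cvxpy as cp
import numpy as np

def portfolio_policy(x, P, p, rbar, Sigma, gamma, kappa, s):
    """Quadratic-ADP trade: minimize stage cost + gamma * E Vhat(diag(r)(x+u))
    subject to the long-only constraint x + u >= 0."""
    u = cp.Variable(len(x))
    post = x + u                                        # post-trade portfolio
    stage = cp.sum(u) + kappa * cp.quad_form(post, Sigma) \
            + cp.sum(cp.multiply(s, cp.square(u)))      # cash in + risk + impact
    EP = P * (Sigma + np.outer(rbar, rbar))             # E[diag(r) P diag(r)]
    expected_v = 0.5 * cp.quad_form(post, EP) + (p * rbar) @ post
    cp.Problem(cp.Minimize(stage + gamma * expected_v), [post >= 0]).solve()
    return u.value
```

EP is positive semidefinite whenever P and Σ + r̄ r̄^T are (Schur product theorem), so the problem solved at each step is a convex QP.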

Numerical instance. We consider n = 20 assets, a discount factor γ = 0.99, and an initial portfolio x_0 = 0. The returns r_t are IID and lognormal, that is, log r_t ∼ N(μ, Σ̃). The means μ_i are drawn from a N(0, 0.01) distribution; the diagonal entries of Σ̃ are drawn from a uniform [0, 0.01] distribution, and the off-diagonal entries are generated as Σ̃_ij = C_ij (Σ̃_ii Σ̃_jj)^{1/2}, where C is a randomly generated correlation matrix. Because the returns are lognormal, we have

\bar r = \exp\left( \mu + \tfrac{1}{2}\, \mathrm{diag}(\tilde\Sigma) \right), \qquad \Sigma_{ij} = \bar r_i \bar r_j \left( \exp(\tilde\Sigma_{ij}) - 1 \right).

We take the risk-aversion parameter κ = 0.5 and draw the entries of s from a uniform [0, 1] distribution.

Method parameters. We start with V̂^(1) = V^quad, where V^quad is the (true) quadratic value function obtained when the long-only constraint is ignored; V^quad is therefore a lower bound on the true value function V. We choose the evaluation points in each iteration of projected value iteration by starting from x_0 and running the policy induced by V̂^(k) for 1000 steps. We then fit a quadratic approximate value function, enforcing the lower bound V ≥ V^quad and using a proximal term with ρ = 0.1. The proximal term has a much more significant effect in this problem: smaller values of ρ require many more iterations of projected value iteration to achieve the same performance.

Results. Figure 2 shows the performance of the ADP policies, where Ĵ^(k) corresponds to the ADP policy at iteration k of projected value iteration. In each case, we run 1000 Monte Carlo simulations, each for 1000 steps, and report the average total discounted cost. We also plot a lower bound J^lb obtained using the methods described in [43, 45]. After around 80 steps of projected value iteration, we arrive at a policy that is nearly optimal. It is also interesting to note that it takes around 40 projected value iteration steps before we produce a policy that is profitable (i.e., one with Ĵ ≤ 0).

Computation timing. Evaluating the ADP policy requires solving a small QP. We use CVXGEN [38] to generate custom C code for this QP, which executes in well under a millisecond. The simulation of 1000 time steps, which we use in each projected value iteration, requires a few hundredths of a second. The fitting is carried out using CVX [44] and takes around a second (this time could be reduced considerably).

Figure 2. Ĵ^(k) (solid line) and the lower bound J^lb (dashed line).
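The return moments above follow from the standard lognormal formulas; a small numpy sketch (our code) computes them directly from μ and Σ̃.

```python
import numpy as np

def lognormal_return_moments(mu, Sigma_tilde):
    """Mean rbar and covariance Sigma of r when log r ~ N(mu, Sigma_tilde)."""
    rbar = np.exp(mu + 0.5 * np.diag(Sigma_tilde))
    Sigma = np.outer(rbar, rbar) * (np.exp(Sigma_tilde) - 1.0)
    return rbar, Sigma
```

These are exactly the quantities r̄ and Σ that enter the stage cost, the ADP policy, and the expectation formulas of the Appendix.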

7. VEHICLE CONTROL

We consider a rigid-body vehicle moving in two dimensions, with continuous-time state x^c = (p, v, θ, ω), where p ∈ R^2 is the position, v ∈ R^2 is the velocity, θ ∈ R is the angle (orientation) of the vehicle, and ω is the angular velocity. The vehicle is subject to control forces u^c ∈ R^k, which put a net force and torque on the vehicle. The vehicle dynamics in continuous time is given by

\dot x^c = A^c x^c + B^c(x^c) u^c + w^c,

where w^c is a continuous-time random noise process and, in block form corresponding to (p, v, θ, ω),

A^c = \begin{bmatrix} 0 & I_2 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \qquad B^c(x) = \begin{bmatrix} 0 \\ R(x_5) C / m \\ 0 \\ D / I \end{bmatrix}, \qquad R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}.

The matrices C ∈ R^{2×k} and D ∈ R^{1×k} depend on where the forces are applied with respect to the center of mass of the vehicle; m is the vehicle mass and I its moment of inertia.

We consider a discretized forward Euler approximation of the vehicle dynamics with discretization epoch h. With x^c(ht) = x_t, u^c(ht) = u_t, and w^c(ht) = w_t, the discretized dynamics has the form

x_{t+1} = (I + h A^c) x_t + h B^c(x_t) u_t + h w_t,

which has an input-affine form with f_t(z) = (I + hA^c) z + h w_t and g_t(z) = h B^c(z). We assume that the w_t are IID with mean w̄ and covariance matrix W.

We consider a finite-horizon trajectory tracking problem, with desired trajectory x^des_t, t = 1, ..., T. The stage cost at time t is

\ell_t(z, v) = \begin{cases} v^T R v + (z - x^{\mathrm{des}}_t)^T Q (z - x^{\mathrm{des}}_t) & |v| \le u^{\max} \\ \infty & |v| > u^{\max}, \end{cases}

where R ∈ S^k_+, Q ∈ S^n_+, u^max ∈ R^k_+, and the inequalities on v are elementwise.

Numerical instance. We consider a vehicle with unit mass m = 1 and moment of inertia I = 1/50. There are k = 4 inputs applied to the vehicle, shown in Figure 3, at distance r = 1/40 from the center of mass of the vehicle. We consider T = 25 time steps, with a discretization epoch of h = 1/50. The matrices C and D encode the actuator geometry: each input contributes its force direction (lateral or longitudinal, in the body frame) to C and a moment arm of magnitude r, with the appropriate sign, to D. The maximum allowable input is u^max = (1.5, 1.5, 1.5, 1.5), and the input cost matrix is R = I/1000. The state cost matrix Q is diagonal, chosen as follows: we first choose the entries of Q to normalize the range of the state variables, based on x^des; we then multiply the weights of the first and second states by 10 to penalize a deviation in position more than a deviation in the speed or orientation of the vehicle.
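A numpy sketch of the discretized vehicle model follows (our code; C and D are the actuator-geometry matrices, whose particular numerical values from the paper's instance are not reproduced here).

```python
import numpy as np

def vehicle_g(x, C, D, m, inertia, h):
    """g_t(z) = h * B_c(z).  State ordering x = (p, v, theta, omega), so theta = x[4]."""
    theta = x[4]
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Bc = np.vstack([np.zeros((2, C.shape[1])),    # position rows: no direct force effect
                    R @ C / m,                    # body-frame forces -> acceleration
                    np.zeros((1, C.shape[1])),    # angle row
                    np.atleast_2d(D) / inertia])  # torque -> angular acceleration
    return h * Bc

def vehicle_f(x, A_c, w, h):
    """f_t(z) = (I + h A_c) z + h w_t for a sampled (or mean) disturbance w."""
    return (np.eye(len(x)) + h * A_c) @ x + h * w
```

Plugging these into the finite-horizon ADP policy gives, at each t, a small QP in the k inputs, exactly as in the timing discussion below.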

Figure 3. Vehicle diagram. The forces are applied at two positions, shown with large circles. At each position, two forces are applied: a lateral force (u_1 and u_3) and a longitudinal force (u_2 and u_4).

Method parameters. We start at the end of the horizon with V̂_T(z) = min_v ℓ_T(z, v), which is convex quadratic in z. At each step t of projected value iteration, we choose N = 1000 evaluation points, generated from a N(x^des_t, 0.5 I) distribution. We evaluate T_t V̂_{t+1} at each evaluation point and fit the quadratic approximate value function V̂_t.

Results. We calculate the average total cost by Monte Carlo simulation with 1000 samples, that is, 1000 simulations of the T-step horizon. We compare the performance of two ADP policies: the policy resulting from V̂_t and the policy resulting from V^quad_t, where V^quad_t is the quadratic value function obtained by linearizing the dynamics along the desired trajectory and ignoring the actuator limits. The quadratic ADP policy based on V̂_t attains a substantially smaller average total cost J than the cost J^quad attained with V^quad_t. Figure 4 shows two sample trajectories generated using V̂_t and V^quad_t.

Computation timing. Evaluating the ADP policy requires solving a small QP with four variables and eight constraints. We use CVXGEN [38] to generate custom C code for this QP, which executes in tens of microseconds. The simulation of 1000 time steps, which we use in each projected value iteration, requires a few hundredths of a second. The fitting is carried out using CVX [44] and takes around 0.5 s (this time could be reduced considerably).

Figure 4. Desired trajectory (dashed line), two sample trajectories obtained using V̂_t, t = 1, ..., T (left), and two sample trajectories obtained using V^quad_t (right).

8. CONCLUSIONS

We have shown how to carry out projected value iteration with quadratic approximate value functions for input-affine systems. Unlike value iteration (when the technical conditions hold), projected value iteration need not converge, but it typically yields a policy with good performance in a reasonable number of steps. Evaluating the quadratic ADP policy requires the solution of a (typically small) convex optimization problem, or more specifically a QP when the stage cost is QP-representable. Recently developed fast methods and code generation techniques allow such problems to be solved very quickly, which means that quadratic ADP can be carried out at kilohertz (or faster) rates. Our ability to evaluate the policy very quickly also makes projected value iteration practical, because many evaluations can be carried out in reasonable time. Although we offer no theoretical guarantees on the performance attained, we have observed that the performance is typically very good.

APPENDIX: EXPECTATION OF A QUADRATIC FUNCTION

In this appendix, we show that when h : R^n → R is a convex quadratic function, so is E h(f_t(z) + g_t(z)v), viewed as a function of v for fixed z. We compute the explicit coefficients A, a, and b in the expression

E\, h(f_t(z) + g_t(z) v) = \frac{1}{2} \begin{bmatrix} v \\ 1 \end{bmatrix}^T \begin{bmatrix} A & a \\ a^T & b \end{bmatrix} \begin{bmatrix} v \\ 1 \end{bmatrix}. \qquad (4)

These coefficients depend only on the first and second moments of (f_t, g_t). To simplify notation, we write f_t(z) as f and g_t(z) as g. Suppose h has the form

h(w) = \frac{1}{2} \begin{bmatrix} w \\ 1 \end{bmatrix}^T \begin{bmatrix} P & p \\ p^T & q \end{bmatrix} \begin{bmatrix} w \\ 1 \end{bmatrix},

with P ⪰ 0. With input-affine dynamics, we have

h(f + g v) = \frac{1}{2} \begin{bmatrix} f + g v \\ 1 \end{bmatrix}^T \begin{bmatrix} P & p \\ p^T & q \end{bmatrix} \begin{bmatrix} f + g v \\ 1 \end{bmatrix}.

Taking the expectation, we see that E h(f + gv) takes the form (4), with coefficients given by

A_{ij} = \sum_{k,l} P_{kl}\, E(g_{ki} g_{lj}), \qquad i, j = 1, \ldots, m,

a_i = \sum_{k,l} P_{kl}\, E(f_k g_{li}) + \sum_j p_j\, E(g_{ji}), \qquad i = 1, \ldots, m,

b = q + \sum_{i,j} P_{ij}\, E(f_i f_j) + 2 \sum_i p_i\, E(f_i).

In matrix form, A = E(g^T P g), a = E(g^T (P f + p)), and b = q + E(f^T P f) + 2 p^T E(f).
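The coefficient formulas can be checked numerically by replacing the moments with sample averages. A small sketch (our code; `f_samples` and `g_samples` are jointly drawn realizations of (f, g)):

```python
import numpy as np

def expected_quadratic_coeffs(P, p, q, f_samples, g_samples):
    """Sample-average estimates of (A, a, b) in
    E h(f + g v) = 0.5 * [v; 1]^T [[A, a], [a^T, b]] [v; 1],
    for h(w) = 0.5 * w^T P w + p^T w + 0.5 * q."""
    A = np.mean([g.T @ P @ g for g in g_samples], axis=0)
    a = np.mean([g.T @ (P @ f + p) for f, g in zip(f_samples, g_samples)], axis=0)
    b = q + np.mean([f @ P @ f + 2.0 * p @ f for f in f_samples])
    return A, a, b
```

As the formulas show, only the first and second moments of (f, g) enter; when these moments are known in closed form, the coefficients are computed directly rather than by sampling.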

ACKNOWLEDGEMENTS

We would like to thank an anonymous reviewer for pointing us to the literature in reinforcement learning on fitted value iteration and fitted Q-iteration algorithms.

REFERENCES

1. Bellman R. Dynamic Programming. Princeton University Press: Princeton, NJ, USA, 1957.
2. Bertsekas D. Dynamic Programming and Optimal Control: Volume 1. Athena Scientific: Nashua, NH, USA.
3. Bertsekas D. Dynamic Programming and Optimal Control: Volume 2. Athena Scientific: Nashua, NH, USA.
4. Puterman M. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.: New York, NY, USA, 1994.
5. Bertsekas DP, Castañón DA. Rollout algorithms for stochastic scheduling problems. Journal of Heuristics 1999; 5.
6. Wang Y, Boyd S. Performance bounds for linear stochastic control. Systems and Control Letters March 2009; 58(3).
7. Wang Y, Boyd S. Approximate dynamic programming via iterated Bellman inequalities. Manuscript.
8. O'Donoghue B, Wang Y, Boyd S. Min-max approximate dynamic programming. In Proceedings of the IEEE Multi-Conference on Systems and Control, September 2011.
9. Keshavarz A, Wang Y, Boyd S. Imputing a convex objective function. In Proceedings of the IEEE Multi-Conference on Systems and Control, September 2011.
10. Boyd S, El Ghaoui L, Feron E, Balakrishnan V. Linear Matrix Inequalities in System and Control Theory. SIAM: Philadelphia, 1994.
11. Ng A, Russell S. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning. Morgan Kaufmann: San Francisco, CA, USA, 2000.
12. Abbeel P, Coates A, Ng A. Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research 2010; 29.
13. Ackerberg D, Benkard C, Berry S, Pakes A. Econometric tools for analyzing market outcomes. In Handbook of Econometrics, Volume 6, Heckman J, Leamer E (eds), Chapter 63. Elsevier: North Holland.
14. Bajari P, Benkard C, Levin J. Estimating dynamic models of imperfect competition. Econometrica 2007; 75(5).
15. Nielsen T, Jensen F. Learning a decision maker's utility function from (possibly) inconsistent behavior. Artificial Intelligence 2004; 160.
16. Watkins C. Learning from delayed rewards. Ph.D. Thesis, Cambridge University, Cambridge, England, 1989.
17. Watkins C, Dayan P. Technical note: Q-learning. Machine Learning 1992; 8(3).
18. Kwon WH, Han S. Receding Horizon Control. Springer-Verlag: London, UK.
19. Whittle P. Optimization Over Time. John Wiley & Sons, Inc.: New York, NY, USA.
20. Goodwin GC, Seron MM, De Doná JA. Constrained Control and Estimation. Springer: New York, USA.
21. Judd KL. Numerical Methods in Economics. MIT Press: Cambridge, MA, USA.
22. Powell W. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley Series in Probability and Statistics. John Wiley & Sons: Hoboken, NJ, USA.
23. Ernst D, Geurts P, Wehenkel L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 2005; 6.
24. Riedmiller M. Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning. Springer: Porto, Portugal, 2005.
25. Munos R. Performance bounds in L_p-norm for approximate value iteration. SIAM Journal on Control and Optimization 2007; 46(2).
26. Munos R, Szepesvári C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research 2008; 9.
27. Antos A, Munos R, Szepesvári C. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems 20, 2008.
28. De Farias D, Van Roy B. The linear programming approach to approximate dynamic programming. Operations Research 2003; 51(6).
29. Desai V, Farias V, Moallemi C. Approximate dynamic programming via a smoothed linear program. Operations Research May-June 2012; 60(3).
30. Lagoudakis M, Parr R. Least-squares policy iteration. Journal of Machine Learning Research 2003; 4.
31. Tsitsiklis JN, Van Roy B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 1997; 42.
32. Van Roy B. Learning and value function approximation in complex decision processes. Ph.D. Thesis, Department of EECS, MIT, 1998.
33. Bemporad A, Morari M, Dua V, Pistikopoulos E. The explicit linear quadratic regulator for constrained systems. Automatica January 2002; 38(1).
34. Wang Y, Boyd S. Fast model predictive control using online optimization. In Proceedings of the 17th IFAC World Congress, 2008.
35. Mattingley JE, Boyd S. Real-time convex optimization in signal processing. IEEE Signal Processing Magazine 2010; 27(3).
36. Mattingley J, Wang Y, Boyd S. Code generation for receding horizon control. In IEEE Multi-Conference on Systems and Control, 2010.
37. Mattingley JE, Boyd S. Automatic code generation for real-time convex optimization. In Convex Optimization in Signal Processing and Communications, Palomar DP, Eldar YC (eds). Cambridge University Press: New York, USA, 2010.
38. Mattingley J, Boyd S. CVXGEN: a code generator for embedded convex optimization. Optimization and Engineering 2012; 13(1):1-27.


More information

UNDERSTAND MOTION IN ONE AND TWO DIMENSIONS

UNDERSTAND MOTION IN ONE AND TWO DIMENSIONS SUBAREA I. COMPETENCY 1.0 UNDERSTAND MOTION IN ONE AND TWO DIMENSIONS MECHANICS Skill 1.1 Calculating displacement, aerage elocity, instantaneous elocity, and acceleration in a gien frame of reference

More information

Estimation of Efficiency with the Stochastic Frontier Cost. Function and Heteroscedasticity: A Monte Carlo Study

Estimation of Efficiency with the Stochastic Frontier Cost. Function and Heteroscedasticity: A Monte Carlo Study Estimation of Efficiency ith the Stochastic Frontier Cost Function and Heteroscedasticity: A Monte Carlo Study By Taeyoon Kim Graduate Student Oklahoma State Uniersity Department of Agricultural Economics

More information

Mean-variance receding horizon control for discrete time linear stochastic systems

Mean-variance receding horizon control for discrete time linear stochastic systems Proceedings o the 17th World Congress The International Federation o Automatic Control Seoul, Korea, July 6 11, 008 Mean-ariance receding horizon control or discrete time linear stochastic systems Mar

More information

Econometrics II - EXAM Outline Solutions All questions have 25pts Answer each question in separate sheets

Econometrics II - EXAM Outline Solutions All questions have 25pts Answer each question in separate sheets Econometrics II - EXAM Outline Solutions All questions hae 5pts Answer each question in separate sheets. Consider the two linear simultaneous equations G with two exogeneous ariables K, y γ + y γ + x δ

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Dynamic Vehicle Routing with Moving Demands Part II: High speed demands or low arrival rates

Dynamic Vehicle Routing with Moving Demands Part II: High speed demands or low arrival rates Dynamic Vehicle Routing with Moing Demands Part II: High speed demands or low arrial rates Stephen L. Smith Shaunak D. Bopardikar Francesco Bullo João P. Hespanha Abstract In the companion paper we introduced

More information

Optimal Joint Detection and Estimation in Linear Models

Optimal Joint Detection and Estimation in Linear Models Optimal Joint Detection and Estimation in Linear Models Jianshu Chen, Yue Zhao, Andrea Goldsmith, and H. Vincent Poor Abstract The problem of optimal joint detection and estimation in linear models with

More information

Asymptotic Normality of an Entropy Estimator with Exponentially Decaying Bias

Asymptotic Normality of an Entropy Estimator with Exponentially Decaying Bias Asymptotic Normality of an Entropy Estimator with Exponentially Decaying Bias Zhiyi Zhang Department of Mathematics and Statistics Uniersity of North Carolina at Charlotte Charlotte, NC 28223 Abstract

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

IMPROVED MPC DESIGN BASED ON SATURATING CONTROL LAWS

IMPROVED MPC DESIGN BASED ON SATURATING CONTROL LAWS IMPROVED MPC DESIGN BASED ON SATURATING CONTROL LAWS D. Limon, J.M. Gomes da Silva Jr., T. Alamo and E.F. Camacho Dpto. de Ingenieria de Sistemas y Automática. Universidad de Sevilla Camino de los Descubrimientos

More information

WE consider finite-state Markov decision processes

WE consider finite-state Markov decision processes IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 54, NO. 7, JULY 2009 1515 Convergence Results for Some Temporal Difference Methods Based on Least Squares Huizhen Yu and Dimitri P. Bertsekas Abstract We consider

More information

A Globally Stabilizing Receding Horizon Controller for Neutrally Stable Linear Systems with Input Constraints 1

A Globally Stabilizing Receding Horizon Controller for Neutrally Stable Linear Systems with Input Constraints 1 A Globally Stabilizing Receding Horizon Controller for Neutrally Stable Linear Systems with Input Constraints 1 Ali Jadbabaie, Claudio De Persis, and Tae-Woong Yoon 2 Department of Electrical Engineering

More information

In: Proc. BENELEARN-98, 8th Belgian-Dutch Conference on Machine Learning, pp 9-46, 998 Linear Quadratic Regulation using Reinforcement Learning Stephan ten Hagen? and Ben Krose Department of Mathematics,

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

Adaptive linear quadratic control using policy. iteration. Steven J. Bradtke. University of Massachusetts.

Adaptive linear quadratic control using policy. iteration. Steven J. Bradtke. University of Massachusetts. Adaptive linear quadratic control using policy iteration Steven J. Bradtke Computer Science Department University of Massachusetts Amherst, MA 01003 bradtke@cs.umass.edu B. Erik Ydstie Department of Chemical

More information

Approximate dynamic programming for stochastic reachability

Approximate dynamic programming for stochastic reachability Approximate dynamic programming for stochastic reachability Nikolaos Kariotoglou, Sean Summers, Tyler Summers, Maryam Kamgarpour and John Lygeros Abstract In this work we illustrate how approximate dynamic

More information

Approximate active fault detection and control

Approximate active fault detection and control Approximate active fault detection and control Jan Škach Ivo Punčochář Miroslav Šimandl Department of Cybernetics Faculty of Applied Sciences University of West Bohemia Pilsen, Czech Republic 11th European

More information

6.231 DYNAMIC PROGRAMMING LECTURE 9 LECTURE OUTLINE

6.231 DYNAMIC PROGRAMMING LECTURE 9 LECTURE OUTLINE 6.231 DYNAMIC PROGRAMMING LECTURE 9 LECTURE OUTLINE Rollout algorithms Policy improvement property Discrete deterministic problems Approximations of rollout algorithms Model Predictive Control (MPC) Discretization

More information

SUPPLEMENTARY MATERIAL. Authors: Alan A. Stocker (1) and Eero P. Simoncelli (2)

SUPPLEMENTARY MATERIAL. Authors: Alan A. Stocker (1) and Eero P. Simoncelli (2) SUPPLEMENTARY MATERIAL Authors: Alan A. Stocker () and Eero P. Simoncelli () Affiliations: () Dept. of Psychology, Uniersity of Pennsylania 34 Walnut Street 33C Philadelphia, PA 94-68 U.S.A. () Howard

More information

Residual migration in VTI media using anisotropy continuation

Residual migration in VTI media using anisotropy continuation Stanford Exploration Project, Report SERGEY, Noember 9, 2000, pages 671?? Residual migration in VTI media using anisotropy continuation Tariq Alkhalifah Sergey Fomel 1 ABSTRACT We introduce anisotropy

More information

Distributed and Real-time Predictive Control

Distributed and Real-time Predictive Control Distributed and Real-time Predictive Control Melanie Zeilinger Christian Conte (ETH) Alexander Domahidi (ETH) Ye Pu (EPFL) Colin Jones (EPFL) Challenges in modern control systems Power system: - Frequency

More information

Lecture 21: Physical Brownian Motion II

Lecture 21: Physical Brownian Motion II Lecture 21: Physical Brownian Motion II Scribe: Ken Kamrin Department of Mathematics, MIT May 3, 25 Resources An instructie applet illustrating physical Brownian motion can be found at: http://www.phy.ntnu.edu.tw/jaa/gas2d/gas2d.html

More information

MATH4406 (Control Theory) Unit 6: The Linear Quadratic Regulator (LQR) and Model Predictive Control (MPC) Prepared by Yoni Nazarathy, Artem

MATH4406 (Control Theory) Unit 6: The Linear Quadratic Regulator (LQR) and Model Predictive Control (MPC) Prepared by Yoni Nazarathy, Artem MATH4406 (Control Theory) Unit 6: The Linear Quadratic Regulator (LQR) and Model Predictive Control (MPC) Prepared by Yoni Nazarathy, Artem Pulemotov, September 12, 2012 Unit Outline Goal 1: Outline linear

More information

Alleviating tuning sensitivity in Approximate Dynamic Programming

Alleviating tuning sensitivity in Approximate Dynamic Programming Alleviating tuning sensitivity in Approximate Dynamic Programming Paul Beuchat, Angelos Georghiou and John Lygeros Abstract Approximate Dynamic Programming offers benefits for large-scale systems compared

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning) A.

More information

A014 Uncertainty Analysis of Velocity to Resistivity Transforms for Near Field Exploration

A014 Uncertainty Analysis of Velocity to Resistivity Transforms for Near Field Exploration A014 Uncertainty Analysis of Velocity to Resistiity Transforms for Near Field Exploration D. Werthmüller* (Uniersity of Edinburgh), A.M. Ziolkowski (Uniersity of Edinburgh) & D.A. Wright (Uniersity of

More information

Control Theory : Course Summary

Control Theory : Course Summary Control Theory : Course Summary Author: Joshua Volkmann Abstract There are a wide range of problems which involve making decisions over time in the face of uncertainty. Control theory draws from the fields

More information

Math Mathematical Notation

Math Mathematical Notation Math 160 - Mathematical Notation Purpose One goal in any course is to properly use the language of that subject. Finite Mathematics is no different and may often seem like a foreign language. These notations

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

Introduction to Approximate Dynamic Programming

Introduction to Approximate Dynamic Programming Introduction to Approximate Dynamic Programming Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Approximate Dynamic Programming 1 Key References Bertsekas, D.P.

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Approximate Optimal-Value Functions. Satinder P. Singh Richard C. Yee. University of Massachusetts.

Approximate Optimal-Value Functions. Satinder P. Singh Richard C. Yee. University of Massachusetts. An Upper Bound on the oss from Approximate Optimal-Value Functions Satinder P. Singh Richard C. Yee Department of Computer Science University of Massachusetts Amherst, MA 01003 singh@cs.umass.edu, yee@cs.umass.edu

More information

Policy Gradient Reinforcement Learning for Robotics

Policy Gradient Reinforcement Learning for Robotics Policy Gradient Reinforcement Learning for Robotics Michael C. Koval mkoval@cs.rutgers.edu Michael L. Littman mlittman@cs.rutgers.edu May 9, 211 1 Introduction Learning in an environment with a continuous

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

Temporal-Difference Q-learning in Active Fault Diagnosis

Temporal-Difference Q-learning in Active Fault Diagnosis Temporal-Difference Q-learning in Active Fault Diagnosis Jan Škach 1 Ivo Punčochář 1 Frank L. Lewis 2 1 Identification and Decision Making Research Group (IDM) European Centre of Excellence - NTIS University

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Linearly-solvable Markov decision problems

Linearly-solvable Markov decision problems Advances in Neural Information Processing Systems 2 Linearly-solvable Markov decision problems Emanuel Todorov Department of Cognitive Science University of California San Diego todorov@cogsci.ucsd.edu

More information

NOTES ON THE REGULAR E-OPTIMAL SPRING BALANCE WEIGHING DESIGNS WITH CORRE- LATED ERRORS

NOTES ON THE REGULAR E-OPTIMAL SPRING BALANCE WEIGHING DESIGNS WITH CORRE- LATED ERRORS REVSTAT Statistical Journal Volume 3, Number 2, June 205, 9 29 NOTES ON THE REGULAR E-OPTIMAL SPRING BALANCE WEIGHING DESIGNS WITH CORRE- LATED ERRORS Authors: Bronis law Ceranka Department of Mathematical

More information

FINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES. Danlei Chu, Tongwen Chen, Horacio J. Marquez

FINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES. Danlei Chu, Tongwen Chen, Horacio J. Marquez FINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES Danlei Chu Tongwen Chen Horacio J Marquez Department of Electrical and Computer Engineering University of Alberta Edmonton

More information

Chapter 2 Optimal Control Problem

Chapter 2 Optimal Control Problem Chapter 2 Optimal Control Problem Optimal control of any process can be achieved either in open or closed loop. In the following two chapters we concentrate mainly on the first class. The first chapter

More information

Optimal Stopping Problems

Optimal Stopping Problems 2.997 Decision Making in Large Scale Systems March 3 MIT, Spring 2004 Handout #9 Lecture Note 5 Optimal Stopping Problems In the last lecture, we have analyzed the behavior of T D(λ) for approximating

More information

arxiv: v1 [math.ds] 19 Jul 2016

arxiv: v1 [math.ds] 19 Jul 2016 Mutually Quadratically Inariant Information Structures in Two-Team Stochastic Dynamic Games Marcello Colombino Roy S Smith and Tyler H Summers arxi:160705461 mathds 19 Jul 016 Abstract We formulate a two-team

More information

1 Problem Formulation

1 Problem Formulation Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning

More information

Bayesian Active Learning With Basis Functions

Bayesian Active Learning With Basis Functions Bayesian Active Learning With Basis Functions Ilya O. Ryzhov Warren B. Powell Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA IEEE ADPRL April 13, 2011 1 / 29

More information

The Art of Sequential Optimization via Simulations

The Art of Sequential Optimization via Simulations The Art of Sequential Optimization via Simulations Stochastic Systems and Learning Laboratory EE, CS* & ISE* Departments Viterbi School of Engineering University of Southern California (Based on joint

More information

Towards Green Distributed Storage Systems

Towards Green Distributed Storage Systems Towards Green Distributed Storage Systems Abdelrahman M. Ibrahim, Ahmed A. Zewail, and Aylin Yener Wireless Communications and Networking Laboratory (WCAN) Electrical Engineering Department The Pennsylania

More information

Modeling Highway Traffic Volumes

Modeling Highway Traffic Volumes Modeling Highway Traffic Volumes Tomáš Šingliar1 and Miloš Hauskrecht 1 Computer Science Dept, Uniersity of Pittsburgh, Pittsburgh, PA 15260 {tomas, milos}@cs.pitt.edu Abstract. Most traffic management

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Static Output Feedback Stabilisation with H Performance for a Class of Plants

Static Output Feedback Stabilisation with H Performance for a Class of Plants Static Output Feedback Stabilisation with H Performance for a Class of Plants E. Prempain and I. Postlethwaite Control and Instrumentation Research, Department of Engineering, University of Leicester,

More information

An Elastic Contact Problem with Normal Compliance and Memory Term

An Elastic Contact Problem with Normal Compliance and Memory Term Machine Dynamics Research 2012, Vol. 36, No 1, 15 25 Abstract An Elastic Contact Problem with Normal Compliance and Memory Term Mikäel Barboteu, Ahmad Ramadan, Mircea Sofonea Uniersité de Perpignan, France

More information

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updates Michael Kearns AT&T Labs mkearns@research.att.com Satinder Singh AT&T Labs baveja@research.att.com Abstract We give the first rigorous upper bounds

More information

Dynamic Vehicle Routing with Heterogeneous Demands

Dynamic Vehicle Routing with Heterogeneous Demands Dynamic Vehicle Routing with Heterogeneous Demands Stephen L. Smith Marco Paone Francesco Bullo Emilio Frazzoli Abstract In this paper we study a ariation of the Dynamic Traeling Repairperson Problem DTRP

More information

Notes on Tabular Methods

Notes on Tabular Methods Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Q-Learning and SARSA: Machine learningbased stochastic control approaches for financial trading

Q-Learning and SARSA: Machine learningbased stochastic control approaches for financial trading Q-Learning and SARSA: Machine learningbased stochastic control approaches for financial trading Marco CORAZZA (corazza@unive.it) Department of Economics Ca' Foscari University of Venice CONFERENCE ON COMPUTATIONAL

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Dynamic Vehicle Routing with Moving Demands Part II: High speed demands or low arrival rates

Dynamic Vehicle Routing with Moving Demands Part II: High speed demands or low arrival rates ACC 9, Submitted St. Louis, MO Dynamic Vehicle Routing with Moing Demands Part II: High speed demands or low arrial rates Stephen L. Smith Shaunak D. Bopardikar Francesco Bullo João P. Hespanha Abstract

More information

On Regression-Based Stopping Times

On Regression-Based Stopping Times On Regression-Based Stopping Times Benjamin Van Roy Stanford University October 14, 2009 Abstract We study approaches that fit a linear combination of basis functions to the continuation value function

More information

Duality in Robust Dynamic Programming: Pricing Convertibles, Stochastic Games and Control

Duality in Robust Dynamic Programming: Pricing Convertibles, Stochastic Games and Control Duality in Robust Dynamic Programming: Pricing Convertibles, Stochastic Games and Control Shyam S Chandramouli Abstract Many decision making problems that arise in inance, Economics, Inventory etc. can

More information

A Regularization Framework for Learning from Graph Data

A Regularization Framework for Learning from Graph Data A Regularization Framework for Learning from Graph Data Dengyong Zhou Max Planck Institute for Biological Cybernetics Spemannstr. 38, 7076 Tuebingen, Germany Bernhard Schölkopf Max Planck Institute for

More information

Project Discussions: SNL/ADMM, MDP/Randomization, Quadratic Regularization, and Online Linear Programming

Project Discussions: SNL/ADMM, MDP/Randomization, Quadratic Regularization, and Online Linear Programming Project Discussions: SNL/ADMM, MDP/Randomization, Quadratic Regularization, and Online Linear Programming Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305,

More information

Online solution of the average cost Kullback-Leibler optimization problem

Online solution of the average cost Kullback-Leibler optimization problem Online solution of the average cost Kullback-Leibler optimization problem Joris Bierkens Radboud University Nijmegen j.bierkens@science.ru.nl Bert Kappen Radboud University Nijmegen b.kappen@science.ru.nl

More information

Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil

Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil Charles W. Anderson 1, Douglas C. Hittle 2, Alon D. Katz 2, and R. Matt Kretchmar 1 1 Department of Computer Science Colorado

More information

Optimal Switching of DC-DC Power Converters using Approximate Dynamic Programming

Optimal Switching of DC-DC Power Converters using Approximate Dynamic Programming Optimal Switching of DC-DC Power Conerters using Approximate Dynamic Programming Ali Heydari, Member, IEEE Abstract Optimal switching between different topologies in step-down DC-DC oltage conerters, with

More information

Dynamics of a Bertrand Game with Heterogeneous Players and Different Delay Structures. 1 Introduction

Dynamics of a Bertrand Game with Heterogeneous Players and Different Delay Structures. 1 Introduction ISSN 749-889 (print), 749-897 (online) International Journal of Nonlinear Science Vol.5(8) No., pp.9-8 Dynamics of a Bertrand Game with Heterogeneous Players and Different Delay Structures Huanhuan Liu,

More information

Fast Linear Iterations for Distributed Averaging 1

Fast Linear Iterations for Distributed Averaging 1 Fast Linear Iterations for Distributed Averaging 1 Lin Xiao Stephen Boyd Information Systems Laboratory, Stanford University Stanford, CA 943-91 lxiao@stanford.edu, boyd@stanford.edu Abstract We consider

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

A Geometric Review of Linear Algebra

A Geometric Review of Linear Algebra A Geometric Reiew of Linear Algebra The following is a compact reiew of the primary concepts of linear algebra. The order of presentation is unconentional, with emphasis on geometric intuition rather than

More information

A Hierarchy of Suboptimal Policies for the Multi-period, Multi-echelon, Robust Inventory Problem

A Hierarchy of Suboptimal Policies for the Multi-period, Multi-echelon, Robust Inventory Problem A Hierarchy of Suboptimal Policies for the Multi-period, Multi-echelon, Robust Inventory Problem Dimitris J. Bertsimas Dan A. Iancu Pablo A. Parrilo Sloan School of Management and Operations Research Center,

More information