Quadratic approximate dynamic programming for input-affine systems


INTERNATIONAL JOURNAL OF ROBUST AND NONLINEAR CONTROL
Int. J. Robust. Nonlinear Control (2012). Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/rnc.2894

Quadratic approximate dynamic programming for input-affine systems

Arezou Keshavarz* and Stephen Boyd
Electrical Engineering, Stanford University, Stanford, CA, USA

SUMMARY

We consider the use of quadratic approximate value functions for stochastic control problems with input-affine dynamics and convex stage cost and constraints. Evaluating the approximate dynamic programming policy in such cases requires the solution of an explicit convex optimization problem, such as a quadratic program, which can be carried out efficiently. We describe a simple and general method for approximate value iteration that also relies on our ability to solve convex optimization problems, in this case typically a semidefinite program. Although we have no theoretical guarantee on the performance attained using our method, we observe that very good performance can be obtained in practice. Copyright 2012 John Wiley & Sons, Ltd.

Received 10 February 2012; Revised May 2012; Accepted July 2012

KEY WORDS: approximate dynamic programming; stochastic control; convex optimization

1. INTRODUCTION

We consider stochastic control problems with input-affine dynamics and convex stage costs, which arise in many applications in nonlinear control, supply chain, and finance. Such problems can be solved in principle using dynamic programming (DP), which expresses the optimal policy in terms of an optimization problem involving the value function (or a sequence of value functions in the time-varying case), which is a function on the state space R^n. The value function (or functions) can be found, in principle, by an iteration involving the Bellman operator, which maps functions on the state space into functions on the state space and involves an expectation and a minimization step; see [1] for an early work on DP and [2-4] for a more detailed overview.

For the special case when the dynamics is linear and the stage cost is convex quadratic, DP gives a complete and explicit solution of the problem, because in this case the value function is also quadratic, and the expectation and minimization steps both preserve quadratic functions and can be carried out explicitly. For other input-affine convex stochastic control problems, DP is difficult to carry out. The basic problem is that there is no general way to (exactly) represent a function on R^n, much less carry out the expectation and minimization steps that arise in the Bellman operator. For small state space dimension, say n ≤ 4, the important region of state space can be gridded (or otherwise represented by a finite number of points), and functions on it can be represented by a finite-dimensional set of functions, such as piecewise affine. Brute force computation can then be used to find the value function (or, more accurately, a good approximation of it) and also the optimal policy. But this approach will not scale to larger state space dimensions, because, in general, the number of basis functions needed to represent a function to a given accuracy grows exponentially in the state space dimension n (this is the curse of dimensionality).

*Correspondence to: Arezou Keshavarz, Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA 94305, USA. arezou@stanford.edu

For problems with larger dimensions, approximate dynamic programming (ADP) gives a method for finding a good, if not optimal, policy. In ADP, the control policy uses a surrogate or approximate value function in place of the true value function. The distinction with the brute force numerical method described previously is that in ADP we abandon the goal of approximating the true value function well and, instead, only hope to capture some of its key attributes. The approximate value function is also chosen so that the optimization problem that must be solved to evaluate the policy is tractable. It has been observed that good performance can be obtained with ADP even when the approximate value function is not a particularly good approximation of the true value function. (Of course, this depends on how the approximate value function is chosen.)

There are many methods for finding an appropriate approximate value function; for an overview of ADP, see [3]. The approximate value function can be chosen as the true value function of a simplified version of the stochastic control problem, for which we can easily obtain the value function. For example, we can form a linear quadratic approximation of the problem and use the (quadratic) value function obtained for the approximation as our approximate value function. Another example, used in model predictive control (MPC) or certainty-equivalent rollout (see, e.g., [5]), is to replace the stochastic terms in the original problem with constant values, say, their expectations. This results in an ordinary optimization problem, which (if it is convex) we can solve to obtain the value function. Recently, Wang et al. have developed a method that uses convex optimization to compute a quadratic lower bound on the true value function, which provides a good candidate for an approximate value function and also gives a lower bound on the optimal performance [6, 7]. This was extended in [8] to use as approximate value function the pointwise supremum of a family of quadratic lower bounds on the value function. These methods, however, cannot be used for general input-affine convex problems.

Approximate dynamic programming is closely related to inverse optimization, which has been studied in various fields such as control [9, 10], robotics [11, 12], and economics [13-15]. In inverse optimization, the goal is to find (or estimate or impute) the objective function in an underlying optimizing process, given observations of optimal (or nearly optimal) actions. In a stochastic control problem, the objective is related to the value function, so the goal is to approximate the value function, given samples of optimal or nearly optimal actions. Watkins et al. introduced an algorithm called Q-learning [16, 17], which observes the state-action pair at each step and updates a quality function on the basis of the obtained reward/cost.

Model predictive control, also known as receding-horizon control or rolling-horizon planning [18-20], is also closely related to ADP. In MPC, at each step, we plan for the next T time steps, using some estimates or predictions of future unknown values, and then execute only the action for the current time step. At the next step, we solve the same problem again, now using the value of the current state and updated values of the predictions based on any new information available. MPC can be interpreted as ADP, where the approximate value function is the optimal value of an optimization problem.
In this paper, we take a simple approach, restricting the approximate value functions to be quadratic. This implies that the expectation step can be carried out exactly and that evaluating the policy reduces to a tractable convex optimization problem, often a simple quadratic program (QP), in many practical cases. To choose a quadratic approximate value function, we use the same Bellman operator iteration that would result in the true value function (when some technical conditions hold), but we replace the Bellman operator with an approximation that maps quadratic functions to quadratic functions.

The idea of value iteration followed by a projection step back to the subset of candidate approximate value functions is an old one, explored by many authors [3, Chapter 6; 21; 22]. In the reinforcement learning community, two closely related methods are fitted value iteration and fitted Q-iteration [23-27]. Most of these works consider problems with continuous state space and finite action space, and some of them come with theoretical guarantees on convergence and performance bounds. For instance, the authors in [23] proposed a family of tree-based methods for fitting approximate Q-functions and provided convergence guarantees and performance bounds for some of the proposed methods. Riedmiller considered the same framework but employed a multilayer perceptron to fit (model-free) Q-functions and provided empirical performance results [24].

Munos and Szepesvári developed finite-time bounds for fitted value iteration for discounted problems with bounded rewards [26]. Antos et al. extended this work to continuous action spaces and developed performance bounds for a variant of fitted Q-iteration that uses stationary stochastic policies, where greedy action selection is replaced with a search over a restricted set of candidate policies [27]. There are also approximate versions of policy iteration and of the linear programming formulation of DP [28, 29], where the value function is replaced by a linear combination of a set of basis functions. Lagoudakis et al. considered the discounted finite-state case and proposed a projected policy iteration method [30] that minimizes the ℓ2 Bellman residual to obtain an approximate value function and policy at each step of the algorithm. Tsitsiklis and Van Roy established strong theoretical bounds on the suboptimality of the ADP policy for the discounted-cost finite-state case [31, 32].

Our contribution is to work out the details of a projected ADP method for the case of input-affine dynamics and convex stage costs, in the relatively new context of the availability of extremely fast and reliable solvers for convex optimization problems. We rely on our ability to (numerically) solve convex optimization problems with great speed and reliability. Recent advances have shown that using custom-generated solvers can speed up computation by orders of magnitude [33-38]. With solvers that are orders of magnitude faster, we can carry out approximate value iteration for many practical problems and obtain approximate value functions that are simple yet achieve very good performance.

The method we propose comes with no theoretical guarantees. There is no guarantee that the modified Bellman iterations will converge, and when they do (or when we simply terminate the iterations early), we do not have any bound on how suboptimal the resulting control policy is. On the other hand, we have found that the methods described in this paper work very well in practice. In cases in which the performance bounding methods of Wang et al. can be applied, we can confirm that our quadratic ADP policies are very good and, in many cases, nearly optimal, at least for specific problem instances. Our method has many of the same characteristics as proportional-integral-derivative (PID) control, widely used in industrial control. There are no theoretical guarantees that PID control will work well; indeed, it is easy to create a plant for which no PID controller results in a stable closed-loop system (which means the objective value is infinite). On the other hand, PID control, with some tuning of the parameters, can usually be made to work very well, or at least well enough, in many (if not most) practical applications.

Style of the paper. The style of the paper is informal but algorithmic and computationally oriented. We are very informal and cavalier in our mathematics, occasionally referring the reader to other work that covers the specific topic more formally and precisely. We do this so as to keep the ideas simple and clear, without becoming bogged down in technical details, and in any case, we do not make any mathematical claims about the methods described. The methods presented are offered as methods that can work very well in practice and not as methods for which we make any more specific or precise claim.
We also do not give the most general formulation possible; instead, we stick with what we hope gives a good trade-off of clarity, simplicity, and practical utility.

Outline. In Section 2, we describe three variations on the stochastic control problem. In Section 3, we review the DP solution of these problems, including the special case when the dynamics is linear and the stage cost functions are convex quadratic, as well as quadratic ADP, which is the method we propose. We describe a projected value iteration method for obtaining an approximate value function in Section 4. In the remaining three sections, we describe numerical examples: a traditional control problem with bounded inputs (Section 5), a multiperiod portfolio optimization problem (Section 6), and a vehicle control problem (Section 7).

2. INPUT-AFFINE CONVEX STOCHASTIC CONTROL

In this section, we set up three variations on the input-affine convex stochastic control problem: the finite-horizon case, the discounted infinite-horizon case, and the average-cost case. We start with the assumptions and definitions that are common to all cases.

2.1. The common model

Input-affine random dynamics. We consider a dynamical system with state x_t ∈ R^n, input or action u_t ∈ R^m, and input-affine random dynamics

x_{t+1} = f_t(x_t) + g_t(x_t) u_t,

where f_t : R^n → R^n and g_t : R^n → R^{n×m}. We assume that f_t and g_t are (possibly) random, that is, they depend on some random variables, with (f_t, g_t) and (f_τ, g_τ) independent for t ≠ τ. The initial state x_0 is also a random variable, independent of (f_t, g_t). We say the dynamics is time-invariant if the random variables (f_t, g_t) are identically distributed (and therefore IID).

Linear dynamics. We say the dynamics is linear (or, more accurately, affine) if f_t is an affine function of x_t and g_t does not depend on x_t, in which case we can write the dynamics in the more familiar form

x_{t+1} = A_t x_t + B_t u_t + c_t.

Here, A_t ∈ R^{n×n}, B_t ∈ R^{n×m}, and c_t ∈ R^n are random.

Closed-loop dynamics. We consider state feedback policies, that is, u_t = φ_t(x_t), t = 0, 1, ..., where φ_t : R^n → R^m is the policy at time t. The closed-loop dynamics are then

x_{t+1} = f_t(x_t) + g_t(x_t) φ_t(x_t),

which recursively defines a stochastic process for x_t (and u_t = φ_t(x_t)). We say the policy is time-invariant if it does not depend on t, in which case we denote it by φ.

Stage cost. The stage cost is given by ℓ_t(z, v), where ℓ_t : R^n × R^m → R ∪ {∞} is an extended-valued closed convex function. We use infinite stage cost to denote unallowed state-input pairs, that is, (joint) constraints on z and v. We will assume that for each state there exists an input with finite stage cost. The stage cost is time-invariant if ℓ_t does not depend on t, in which case we write it as ℓ.

Convex quadratic stage cost. An important special case is when the stage cost is a (convex) quadratic function of the form

\ell_t(z, v) = \frac{1}{2} \begin{bmatrix} z \\ v \\ 1 \end{bmatrix}^T \begin{bmatrix} Q_t & S_t & q_t \\ S_t^T & R_t & r_t \\ q_t^T & r_t^T & s_t \end{bmatrix} \begin{bmatrix} z \\ v \\ 1 \end{bmatrix}, \qquad \text{where } \begin{bmatrix} Q_t & S_t \\ S_t^T & R_t \end{bmatrix} \succeq 0,

that is, the quadratic part is positive semidefinite.

Quadratic program-representable stage cost. A stage cost is QP-representable if ℓ_t is a convex quadratic plus convex piecewise-linear function, subject to polyhedral constraints (which can be represented in the cost function using an indicator function). Minimizing such a function can be carried out by solving a QP.
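To make the common model concrete, the following Python sketch (our illustration, not code from the paper; all names are ours) evaluates a convex quadratic stage cost in the matrix form above and simulates one closed-loop step of the input-affine dynamics. The user-supplied `sample_ft_gt` is assumed to draw a realization of (f_t(x), g_t(x)); in the linear case it would return (A x + c, B) with c drawn from `rng`.

```python
import numpy as np

def quadratic_stage_cost(z, v, Q, S, R, q, r, s):
    """l(z, v) = 0.5 * [z; v; 1]^T [[Q, S, q], [S^T, R, r], [q^T, r^T, s]] [z; v; 1]."""
    w = np.concatenate([z, v, [1.0]])
    M = np.block([[Q, S, q[:, None]],
                  [S.T, R, r[:, None]],
                  [q[None, :], r[None, :], np.array([[s]])]])
    return 0.5 * w @ M @ w

def closed_loop_step(x, policy, sample_ft_gt, rng):
    """One step of x_{t+1} = f_t(x_t) + g_t(x_t) u_t under a state-feedback policy."""
    f, g = sample_ft_gt(x, rng)      # realization of the random pair (f_t(x), g_t(x))
    u = policy(x)                    # u_t = phi_t(x_t)
    return f + g @ u, u
```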

Stochastic control problem. In the problems we consider, the overall objective function, which we denote by J, is a sum or average of the expected stage costs. The associated stochastic control problem is to choose the policies φ_t (or the policy φ when the policy is time-invariant) so as to minimize J. We let J* denote the optimal (minimal) value of the objective, and we let φ*_t denote an optimal policy at time t. When the dynamics is linear, we call the stochastic control problem a linear convex stochastic control problem. When the dynamics is linear and the stage cost is convex quadratic, we call the stochastic control problem a linear quadratic stochastic control problem.

2.2. The problems

Finite horizon. In the finite-horizon problem, the objective is the expected value of the sum of the stage costs up to a horizon T:

J = \sum_{t=0}^{T} E\, \ell_t(x_t, u_t).

Discounted infinite horizon. In the discounted infinite-horizon problem, we assume that the dynamics, stage cost, and policy are time-invariant. The objective in this case is

J = \sum_{t=0}^{\infty} E\, \gamma^t \ell(x_t, u_t),

where γ ∈ (0, 1) is a discount factor.

Average cost. In the average-cost problem, we assume that the dynamics, stage cost, and policy are time-invariant, and take as objective the average stage cost,

J = \lim_{T \to \infty} \frac{1}{T+1} \sum_{t=0}^{T} E\, \ell(x_t, u_t).

We will simply assume that the expectations exist in all three objectives, that the sum exists in the discounted case, and that the limit exists in the average-cost case.
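For a fixed policy, each of these objectives can be estimated by plain Monte Carlo simulation of the closed loop, which is also how the policies in the examples of Sections 5-7 are evaluated. A minimal sketch follows (our own helper names; the discounted and average-cost estimates are truncated at the simulation horizon).

```python
import numpy as np

def rollout_costs(x0, policy, sample_ft_gt, stage_cost, T, rng):
    """Simulate the closed loop for T steps; return the realized stage costs."""
    x, costs = x0, []
    for _ in range(T):
        f, g = sample_ft_gt(x, rng)
        u = policy(x)
        costs.append(stage_cost(x, u))
        x = f + g @ u
    return np.array(costs)

def objective_estimates(costs, gamma):
    """Single-rollout estimates of the finite-horizon, discounted, and average objectives."""
    discounted = (gamma ** np.arange(len(costs))) * costs
    return costs.sum(), discounted.sum(), costs.mean()
```

Averaging these quantities over many independent rollouts gives the Monte Carlo estimates of J used later in the paper.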

3. DYNAMIC AND APPROXIMATE DYNAMIC PROGRAMMING

The stochastic control problems described earlier can be solved in principle using DP, which we sketch here for future reference; see [3, 4, 39] for (much) more detail. DP makes use of a function (time-varying in the finite-horizon case) V : R^n → R that characterizes the cost of starting from a given state when an optimal policy is used. The definition of V, its characterization, and the methods for computing it differ (slightly) for the three stochastic control problems we consider.

Finite horizon. In the finite-horizon case, we have a value function V_t for each t = 0, ..., T. The value function V_t(z) is the minimum cost-to-go starting from state x_t = z at time t, using an optimal policy for times t, ..., T:

V_t(z) = \sum_{\tau=t}^{T} E\left( \ell_\tau(x_\tau, \varphi_\tau^\star(x_\tau)) \mid x_t = z \right), \qquad t = 0, \ldots, T.

The optimal cost is given by J* = E V_0(x_0), where the expectation is over x_0. Although our definition of V_t refers to an optimal policy φ*_τ, we can find V_t (in principle) using a backward recursion that does not require knowledge of an optimal policy. We start with

V_T(z) = \min_v \ell_T(z, v)

and then find V_{T-1}, V_{T-2}, ..., V_0 from the recursion

V_t(z) = \min_v \left( \ell_t(z, v) + E\, V_{t+1}(f_t(z) + g_t(z) v) \right), \qquad t = T-1, \ldots, 0.

(The expectation here is over (f_t, g_t).) Defining the Bellman operator T_t for the finite-horizon problem as

(\mathcal{T}_t h)(z) = \min_v \left( \ell_t(z, v) + E\, h(f_t(z) + g_t(z) v) \right)

for h : R^n → R ∪ {∞}, we can express this recursion compactly as

V_t = \mathcal{T}_t V_{t+1}, \qquad t = T, T-1, \ldots, 0, \qquad (1)

with V_{T+1} = 0. We can express an optimal policy in terms of the value functions as

\varphi_t^\star(z) = \arg\min_v \left( \ell_t(z, v) + E\, V_{t+1}(f_t(z) + g_t(z) v) \right), \qquad t = 0, \ldots, T.

Discounted infinite horizon. For the discounted infinite-horizon problem, the value function is time-invariant, that is, it does not depend on t. The value function V(z) is the minimum cost-to-go from state x_0 = z at t = 0, using an optimal policy:

V(z) = \sum_{t=0}^{\infty} E\left( \gamma^t \ell(x_t, \varphi^\star(x_t)) \mid x_0 = z \right).

Thus, J* = E V(x_0). The value function cannot be constructed using a simple recursion, as in the finite-horizon case, but it is characterized as the unique solution of the Bellman fixed-point equation T V = V, where the Bellman operator T for the discounted infinite-horizon case is

(\mathcal{T} h)(z) = \min_v \left( \ell(z, v) + \gamma\, E\, h(f(z) + g(z) v) \right)

for h : R^n → R ∪ {∞}. The value function can be found (in principle) by several methods, including iteration of the Bellman operator,

V^{k+1} = \mathcal{T} V^k, \qquad k = 1, 2, \ldots, \qquad (2)

which is called value iteration. Under certain technical conditions, this iteration converges to V [2]; see [39, Chapter 9] for a detailed analysis of the convergence of value iteration. We can express an optimal policy in terms of the value function as

\varphi^\star(z) = \arg\min_v \left( \ell(z, v) + \gamma\, E\, V(f(z) + g(z) v) \right).

Average cost. For the average-cost problem, the value function is not the cost-to-go starting from a given state (which would have the same value J* for all states). Instead, it represents the differential sum of costs, that is,

V(z) = \sum_{t=0}^{\infty} E\left( \ell(x_t, \varphi^\star(x_t)) - J^\star \mid x_0 = z \right).

We can characterize the value function (up to an additive constant) and the optimal average cost as a solution of the fixed-point equation V + J* = T V, where T, the Bellman operator for the average-cost problem, is defined as

(\mathcal{T} h)(z) = \min_v \left( \ell(z, v) + E\, h(f(z) + g(z) v) \right)

for h : R^n → R ∪ {∞}. Notice that if V and J* satisfy the fixed-point equation, so do V + β and J* for any β ∈ R. Thus, we can choose a reference state x^ref and assume V(x^ref) = 0 without loss of generality. Under some technical assumptions [3, 4], the value function V and the optimal cost J* can be found (in principle) by several methods, including value iteration, which has the form

J^{k+1} = (\mathcal{T} V^k)(x^{\mathrm{ref}}), \qquad V^{k+1} = \mathcal{T} V^k - J^{k+1}, \qquad (3)

for k = 1, 2, .... Here V^k converges to V, and J^k converges to J* [3, 40]. We can express an optimal policy in terms of V as

\varphi^\star(z) = \arg\min_v \left( \ell(z, v) + E\, V(f(z) + g(z) v) \right).

3.1. Linear convex dynamic programming

When the dynamics is linear, the value function (or functions, in the finite-horizon case) is convex. To see this, we first note that the Bellman operator maps convex functions to convex functions. The (random) function f_t(z) + g_t(z)v = A_t z + B_t v + c_t is an affine function of (z, v); it follows that when h is convex, h(A_t z + B_t v + c_t) is a convex function of (z, v). Expectation preserves convexity, so

E\, h(A_t z + B_t v + c_t)

is a convex function of (z, v). Finally,

\min_v \left( \ell(z, v) + \gamma\, E\, h(A_t z + B_t v + c_t) \right)

is convex in z, because ℓ is convex and convexity is preserved under partial minimization [41, Section 3.2.5]. It follows that the value function is convex. In the finite-horizon case, this argument shows that each V_t is convex. In the infinite-horizon and average-cost cases, it shows that value iteration preserves convexity, so if we start with a convex initial guess (say, 0), all iterates are convex. The limit of a sequence of convex functions is convex, so we conclude that V is convex. Evaluating the optimal policy at x_t, that is, minimizing ℓ_t(x_t, v) + E V_{t+1}(f_t(x_t) + g_t(x_t)v) over v, involves solving a convex optimization problem. In the finite-horizon and average-cost cases, the discount factor is γ = 1; in the infinite-horizon cases (discounted and average cost), the stage cost and value function are time-invariant.

3.2. Linear quadratic dynamic programming

For linear quadratic stochastic control problems, we can effectively solve the stochastic control problem by using DP. The reason is simple: the Bellman operator maps convex quadratic functions to convex quadratic functions, so the value function is also convex quadratic (and we can compute it by value iteration). The argument is similar to the convex case: convex quadratic functions are preserved under expectation and partial minimization. One big difference, however, is that the Bellman iteration can actually be carried out in the linear quadratic case, using an explicit representation of convex quadratic functions and standard linear algebra operations.

The optimal policy is affine, with coefficients we can compute. To simplify notation, we consider the discounted infinite-horizon case; the other two cases are very similar.

We denote by Q_n the set of convex quadratic functions on R^n, that is, those of the form

h(z) = \frac{1}{2} \begin{bmatrix} z \\ 1 \end{bmatrix}^T \begin{bmatrix} H & h \\ h^T & p \end{bmatrix} \begin{bmatrix} z \\ 1 \end{bmatrix}, \qquad H \succeq 0.

With linear dynamics, we have

h(A_t z + B_t v + c_t) = \frac{1}{2} \begin{bmatrix} A_t z + B_t v + c_t \\ 1 \end{bmatrix}^T \begin{bmatrix} H & h \\ h^T & p \end{bmatrix} \begin{bmatrix} A_t z + B_t v + c_t \\ 1 \end{bmatrix},

where (A_t, B_t, c_t) is a random variable. Convex quadratic functions are closed under expectation, so E h(A_t z + B_t v + c_t) is convex quadratic; the details (including formulas for the coefficients) are given in the Appendix. The stage cost ℓ is convex quadratic, that is,

\ell(z, v) = \frac{1}{2} \begin{bmatrix} z \\ v \\ 1 \end{bmatrix}^T \begin{bmatrix} Q & S & q \\ S^T & R & r \\ q^T & r^T & s \end{bmatrix} \begin{bmatrix} z \\ v \\ 1 \end{bmatrix}, \qquad \text{where } \begin{bmatrix} Q & S \\ S^T & R \end{bmatrix} \succeq 0.

Thus, ℓ(z, v) + γ E h(A_t z + B_t v + c_t) is convex quadratic, and we write it as

\ell(z, v) + \gamma\, E\, h(A_t z + B_t v + c_t) = \frac{1}{2} \begin{bmatrix} z \\ v \\ 1 \end{bmatrix}^T M \begin{bmatrix} z \\ v \\ 1 \end{bmatrix}, \qquad M = \begin{bmatrix} M_{11} & M_{12} & M_{13} \\ M_{12}^T & M_{22} & M_{23} \\ M_{13}^T & M_{23}^T & M_{33} \end{bmatrix}, \qquad \begin{bmatrix} M_{11} & M_{12} \\ M_{12}^T & M_{22} \end{bmatrix} \succeq 0.

Partial minimization of ℓ(z, v) + γ E h(A_t z + B_t v + c_t) over v results in the quadratic function

(\mathcal{T} h)(z) = \min_v \left( \ell(z, v) + \gamma\, E\, h(A_t z + B_t v + c_t) \right) = \frac{1}{2} \begin{bmatrix} z \\ 1 \end{bmatrix}^T \begin{bmatrix} S_{11} & S_{12} \\ S_{12}^T & S_{22} \end{bmatrix} \begin{bmatrix} z \\ 1 \end{bmatrix},

where

\begin{bmatrix} S_{11} & S_{12} \\ S_{12}^T & S_{22} \end{bmatrix} = \begin{bmatrix} M_{11} & M_{13} \\ M_{13}^T & M_{33} \end{bmatrix} - \begin{bmatrix} M_{12} \\ M_{23}^T \end{bmatrix} M_{22}^{-1} \begin{bmatrix} M_{12}^T & M_{23} \end{bmatrix}

is the Schur complement of M with respect to its (2,2) block; when M_{22} is singular, we replace M_{22}^{-1} with the pseudo-inverse M_{22}^\dagger [41, Section A.5.5]. Furthermore, because

\begin{bmatrix} M_{11} & M_{12} \\ M_{12}^T & M_{22} \end{bmatrix} \succeq 0,

we can conclude that S_{11} ⪰ 0; thus, (T h)(z) is a convex quadratic function, that is, T h ∈ Q_n. This shows that convex quadratic functions are invariant under the Bellman operator T, from which it follows that the value function is convex quadratic. Again, in the finite-horizon case, this shows that each V_t is convex quadratic. In the infinite-horizon and average-cost cases, this argument shows that convex quadratic functions are preserved under value iteration, so if the initial guess is convex quadratic, all iterates are convex quadratic. The limit of a sequence of convex quadratic functions is convex quadratic; thus, we can conclude that V is a convex quadratic function.
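The following numpy sketch (our illustration) carries out one Bellman update T h for a quadratic h in the simplest setting, where (A, B, c) is deterministic so that the expectation is trivial; with random dynamics, the term T^T H T below would be replaced by its expectation, computed from the first and second moments of (A_t, B_t, c_t) as in the Appendix.

```python
import numpy as np

def bellman_lq(Hbig, L, A, B, c, gamma=1.0):
    """One Bellman update for h(w) = 0.5*[w;1]^T Hbig [w;1], quadratic stage cost
    0.5*[z;v;1]^T L [z;v;1], and deterministic affine dynamics w = A z + B v + c.
    Returns the (n+1)x(n+1) coefficient matrix of the convex quadratic T h."""
    n, m = A.shape[1], B.shape[1]
    T = np.block([[A, B, c[:, None]],
                  [np.zeros((1, n + m)), np.ones((1, 1))]])   # maps [z;v;1] -> [Az+Bv+c;1]
    M = L + gamma * T.T @ Hbig @ T
    iz, iv, ic = slice(0, n), slice(n, n + m), slice(n + m, n + m + 1)
    top = np.block([[M[iz, iz], M[iz, ic]],
                    [M[ic, iz], M[ic, ic]]])
    side = np.vstack([M[iz, iv], M[ic, iv]])
    # Schur complement of the (2,2) block; pinv handles a singular M22
    return top - side @ np.linalg.pinv(M[iv, iv]) @ side.T
```

Iterating this map (with γ < 1 in the discounted case, or with the recentering of (3) in the average-cost case) is value iteration for the linear quadratic problem.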

3.3. Approximate dynamic programming

Approximate dynamic programming is a general approach for obtaining a good suboptimal policy. The basic idea is to replace the true value function V (or V_t in the finite-horizon case) with an approximation V̂ (or V̂_t), which yields the ADP policies

\hat\varphi_t(x) = \arg\min_u \left( \ell_t(x, u) + E\, \hat V_{t+1}(f_t(x) + g_t(x) u) \right)

for the finite-horizon case,

\hat\varphi(x) = \arg\min_u \left( \ell(x, u) + \gamma\, E\, \hat V(f(x) + g(x) u) \right)

for the discounted infinite-horizon case, and

\hat\varphi(x) = \arg\min_u \left( \ell(x, u) + E\, \hat V(f(x) + g(x) u) \right)

for the average-cost case. These ADP policies reduce to the optimal policy for the choice V̂_t = V_t. The goal is to choose the approximate value functions so that the minimizations needed to compute φ̂(x) can be carried out efficiently, and so that the objective Ĵ attained by the resulting policy is small, hopefully close to J*. It has been observed in practice that good policies (i.e., ones with small objective values) can result even when V̂ is not a particularly good approximation of V.

3.4. Quadratic approximate dynamic programming

In quadratic ADP, we take V̂ ∈ Q_n (or V̂_t ∈ Q_n in the time-varying case), that is,

\hat V(z) = \frac{1}{2} \begin{bmatrix} z \\ 1 \end{bmatrix}^T \begin{bmatrix} P & p \\ p^T & q \end{bmatrix} \begin{bmatrix} z \\ 1 \end{bmatrix}, \qquad P \succeq 0.

With input-affine dynamics, we have

\hat V(f_t(z) + g_t(z) v) = \frac{1}{2} \begin{bmatrix} f_t(z) + g_t(z) v \\ 1 \end{bmatrix}^T \begin{bmatrix} P & p \\ p^T & q \end{bmatrix} \begin{bmatrix} f_t(z) + g_t(z) v \\ 1 \end{bmatrix},

where (f_t, g_t) is a random variable with known mean and covariance. Thus, for each fixed z, the expectation E V̂(f_t(z) + g_t(z)v) is a convex quadratic function of v, with coefficients that can be computed from the first and second moments of (f_t, g_t); see the Appendix. A special case is when the cost function ℓ(z, v) (or ℓ_t(z, v) in the time-varying case) is QP-representable. Then, for a given z, the quadratic ADP policy

\hat\varphi(z) = \arg\min_v \left( \ell(z, v) + E\, \hat V(f_t(z) + g_t(z) v) \right)

can be evaluated by solving a QP.
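To illustrate how the quadratic ADP policy is evaluated, here is a cvxpy sketch for the case of deterministic affine dynamics (our own code, not the CVXGEN implementation used later in the paper). The arguments `stage_cost` and `constraints_fn` are assumed to be user-supplied callables returning a convex cvxpy expression and a list of constraints; with random dynamics, the V̂ term would be replaced by its expectation, which is again quadratic (see the Appendix).

```python
import cvxpy as cp

def adp_policy(x, A, B, c, P, p, q, stage_cost, constraints_fn, gamma=1.0):
    """phi_hat(x) = argmin_u  l(x, u) + gamma * Vhat(A x + B u + c),
    where Vhat(w) = 0.5 * w^T P w + p^T w + 0.5 * q with P PSD."""
    u = cp.Variable(B.shape[1])
    x_next = A @ x + B @ u + c
    vhat = 0.5 * cp.quad_form(x_next, P) + p @ x_next + 0.5 * q
    prob = cp.Problem(cp.Minimize(stage_cost(x, u) + gamma * vhat),
                      constraints_fn(x, u))
    prob.solve()
    return u.value
```

When `stage_cost` is QP-representable (a quadratic plus piecewise-linear expression with polyhedral constraints), the problem solved here is a QP.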

4. PROJECTED VALUE ITERATION

Projected value iteration is a standard method for obtaining an approximate value function from a given (finite-dimensional) set of candidate approximate value functions, such as Q_n. The idea is to carry out value iteration, but to follow each Bellman operator step with a step that approximates the result with an element from the given set of candidate approximate value functions. The approximation step is typically called a projection and is denoted Π, because it maps into the set of candidate approximate value functions, but it need not be a projection (i.e., it need not minimize any distance measure). The resulting method for computing an approximate value function is called projected value iteration. Projected value iteration has been explored by many authors [3, Chapter 6; 21; 22; 25; 27]. Here, we consider scenarios in which the state space and action space are continuous and establish a framework to carry out projected value iteration for such problems.

Projected value iteration has the following form for our three problems. For the finite-horizon problem, we start from V̂_{T+1} = 0 and compute approximate value functions as

\hat V_t = \Pi\, \mathcal{T}_t \hat V_{t+1}, \qquad t = T, T-1, \ldots, 0.

For the discounted infinite-horizon problem, we have the iteration

\hat V^{(k+1)} = \Pi\, \mathcal{T} \hat V^{(k)}, \qquad k = 1, 2, \ldots.

There is no reason to believe that this iteration always converges, although as a practical matter it typically does. In any case, we terminate after some number of iterations, and we judge the whole method not in terms of convergence of projected value iteration but in terms of the performance of the policy obtained. For the average-cost problem, projected value iteration has the form

\hat J^{(k+1)} = (\mathcal{T} \hat V^{(k)})(x^{\mathrm{ref}}), \qquad \hat V^{(k+1)} = \Pi\left( \mathcal{T} \hat V^{(k)} - \hat J^{(k+1)} \right), \qquad k = 1, 2, \ldots.

Here too, we cannot guarantee convergence, although it typically occurs in practice. In these expressions, V̂^k is an element of the given set of candidate approximate value functions, but T V̂^k usually is not; indeed, we have no general way of representing T V̂^k. The iterations can be carried out, however, because Π only requires the evaluation of (T V̂^k)(x^(i)) (which are numbers) at a finite number of points x^(i).

4.1. Quadratic projected iteration

We now consider the case of quadratic ADP, that is, our set of candidate approximate value functions is Q_n. In this case, the expectation appearing in the Bellman operator can be carried out exactly, and we can evaluate (T V̂^k)(x) for any x by solving a convex optimization problem; when the stage cost is QP-representable, we can evaluate (T V̂^k)(x) by solving a QP.

We form Π T V̂^k as follows. We choose a set of sampling points x^(1), ..., x^(N) ∈ R^n and evaluate v^(i) = (T V̂^k)(x^(i)) for i = 1, ..., N. We take Π T V̂^k to be any solution of the least squares fitting problem

minimize \sum_{i=1}^{N} \left( V(x^{(i)}) - v^{(i)} \right)^2
subject to V \in Q_n,

with variable V. In other words, we simply fit a convex quadratic function to the values of the Bellman operator at the sample points. This requires the solution of a semidefinite program [10, 41].

Many variations on this basic projection method are possible. We can use any convex fitting criterion instead of a least squares criterion, for example, the sum of absolute errors. We can also add constraints on V that represent prior knowledge (beyond convexity). For example, we may have a quadratic lower bound on the true value function, and we can add this condition as a stronger constraint on V than convexity alone. (Quadratic lower bounds can be found using the methods described in [6, 43].)
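A sketch of the projection step, using cvxpy (our own code): fit V(z) = (1/2)[z;1]^T [[P, p],[p^T, q]][z;1] to the Bellman samples by least squares, with P PSD and, optionally, a quadratic lower bound V ≥ V_lb imposed as a semidefinite constraint on the coefficients. This is the semidefinite program referred to above; any SDP-capable solver (such as the one bundled with cvxpy) can be used.

```python
import cvxpy as cp

def fit_convex_quadratic(X, vals, V_lb=None):
    """Least-squares fit of a convex quadratic to samples vals[i] ~ (T Vhat_k)(X[i]).
    X has shape (N, n).  V_lb = (P0, p0, q0), if given, enforces V >= V_lb everywhere."""
    N, n = X.shape
    P, p, q = cp.Variable((n, n), PSD=True), cp.Variable(n), cp.Variable()
    # V(x_i) = 0.5 * x_i^T P x_i + p^T x_i + 0.5 * q, evaluated at every row of X
    preds = 0.5 * cp.sum(cp.multiply(X @ P, X), axis=1) + X @ p + 0.5 * q
    constraints = []
    if V_lb is not None:
        P0, p0, q0 = V_lb
        # V - V_lb >= 0 for all z  <=>  its coefficient matrix is PSD
        constraints.append(cp.bmat([[P - P0, cp.reshape(p - p0, (n, 1))],
                                    [cp.reshape(p - p0, (1, n)),
                                     cp.reshape(q - q0, (1, 1))]]) >> 0)
    cp.Problem(cp.Minimize(cp.sum_squares(preds - vals)), constraints).solve()
    return P.value, p.value, q.value
```

The variations mentioned above (a different convex fitting criterion, additional coefficient constraints) and the proximal term discussed next amount to modifying the objective of this same problem.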

Another variation on the simple fitting method described earlier adds proximal regularization to the fitting step. This means that we add a term of the form (ρ/2) ||V − V^k||_F^2 to the objective, which penalizes deviations of V from the previous value. Here, ρ > 0 is a parameter, and the norm is the Frobenius norm of the coefficient matrices,

\| V - V^k \|_F = \left\| \begin{bmatrix} P - P^k & p - p^k \\ (p - p^k)^T & q - q^k \end{bmatrix} \right\|_F.

Choice of sampling points. The choice of the evaluation points affects our projection method, so it is important to use at least a reasonable choice. An ideal choice would be to sample points from the state distribution in that period, under the final ADP policy chosen. But this cannot be done: we need to run projected value iteration to have the ADP policy, and we need the ADP policy to sample the points used by Π. A good method is to start with a reasonable choice of sample points and use these to construct an ADP policy. We then run this ADP policy to obtain a new sampling of x_t for each t and repeat the projected value iteration using these sample points. This sometimes gives an improvement in performance.

5. INPUT-CONSTRAINED LINEAR QUADRATIC CONTROL

Our first example is a traditional time-invariant linear control system, with dynamics

x_{t+1} = A x_t + B u_t + w_t,

which has the form used in this paper with f_t(z) = Az + w_t and g_t(z) = B. Here, A ∈ R^{n×n} and B ∈ R^{n×m} are known, and w_t is an IID random variable with zero mean E w_t = 0 and covariance E w_t w_t^T = W. The stage cost is the traditional quadratic one, plus a constraint on the input amplitude:

\ell(x_t, u_t) = \begin{cases} x_t^T Q x_t + u_t^T R u_t & \|u_t\|_\infty \le U^{\max} \\ \infty & \|u_t\|_\infty > U^{\max}, \end{cases}

where Q ⪰ 0 and R ≻ 0. We consider the average-cost problem.

Numerical instance. We consider a problem with n = 10 states, m = 2 inputs, U^max = 1, Q = 10 I, and R = I. The entries of A are drawn from a standard normal distribution, and the entries of B are drawn from a uniform distribution on [−0.5, 0.5]. The matrix A is then scaled so that its spectral radius is 1. The noise is Gaussian, w_t ∼ N(0, I).

Method parameters. We start from x_0 = 0 at iteration 0 and take V̂^(1) = V^quad, where V^quad is the (true) quadratic value function obtained when the stage cost is replaced with ℓ^quad(z, v) = z^T Q z + v^T R v for all (z, v). In each iteration of projected value iteration, we run the ADP policy for N = 1000 time steps and evaluate T V̂^(k) at each of these sample points. We then fit a new quadratic approximate value function V̂^(k+1), enforcing the lower bound V ≥ V^quad and using a proximal term with ρ = 0.1. Here the proximal term has little effect on the performance of the algorithm, and we could have dropped it at little cost. We use x_N to start the next iteration. For comparison, we also carry out our method with a more naive initialization and without the lower bound V ≥ V^quad.
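The initialization V^quad can be computed in closed form. The sketch below is our reading of that construction (standard unconstrained average-cost LQR via the discrete algebraic Riccati equation, using scipy); it is not the authors' code.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lq_value_function(A, B, Q, R, W):
    """Quadratic value function of the unconstrained average-cost problem with
    stage cost x'Qx + u'Ru: V_quad(z) = z^T P z (up to a constant), with optimal
    gain u = -K x and optimal average cost trace(P W)."""
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K, np.trace(P @ W)
```

Dropping the input constraint relaxes the problem, which is why V^quad can also serve as the lower bound enforced in the fitting step.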

Results. Figure 1 shows the performance J^(k) of the ADP policy (evaluated by a simulation run of 10,000 steps) after each round of projected value iteration; the performance of the policies obtained with the naive parameter choices is denoted J^(k)_naive. For this problem, we can evaluate a lower bound J^lb on J by using the method of [6]. This shows that the performance of our ADP policy at termination is at most 4% suboptimal (but likely much closer to optimal). The plots show that the naive initialization only costs a few additional projected value iteration steps. Close examination of the plots shows that the performance need not improve in each step; for example, the best performance (but only by a very small amount) with the more sophisticated initialization is obtained in iteration 9.

Figure 1. J^(k) (solid line) and J^(k)_naive (dashed line).

Computation timing. The timing results here are reported for a 3.4-GHz Intel Xeon processor running Linux. Evaluating the ADP policy requires solving a small QP. We use CVXGEN [38] to generate custom C code for this QP, which executes in around 50 μs. The simulation of 1000 time steps, which we use in each projected value iteration, requires around 0.05 s. The fitting is carried out using CVX [44] and takes around a second (this time could be reduced considerably).

6. MULTIPERIOD PORTFOLIO OPTIMIZATION

We consider a discounted infinite-horizon problem of optimizing a portfolio of n assets with discount factor γ. The position at time t is denoted by x_t ∈ R^n, where (x_t)_i is the dollar value of asset i at the beginning of period t. At each time t, we can buy and sell assets. We denote the trades by u_t ∈ R^n, where (u_t)_i is the trade for asset i: a positive (u_t)_i means that we buy asset i, and a negative (u_t)_i means that we sell asset i at time t. The position evolves according to the dynamics

x_{t+1} = \mathrm{diag}(r_t)(x_t + u_t),

where r_t is the vector of asset returns in period t. We express this in our form as f_t(z) = diag(r_t) z and g_t(z) = diag(r_t). The return vectors are IID with mean E r_t = r̄ and covariance E (r_t − r̄)(r_t − r̄)^T = Σ.

The stage cost consists of the following terms: the total gross cash put in, which is 1^T u_t; a risk penalty (a multiple of the variance of the post-trade portfolio); a quadratic transaction cost; and the constraint that the portfolio is long-only (which is x_t + u_t ≥ 0). It is given by

\ell(x, u) = \begin{cases} \mathbf{1}^T u + \kappa (x+u)^T \Sigma (x+u) + u^T \mathrm{diag}(s) u & x + u \ge 0 \\ \infty & \text{otherwise}, \end{cases}

where κ > 0 is the risk-aversion parameter and s_i > 0 models the price-impact cost for asset i. We assume that the initial portfolio x_0 is given.
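For this example, the expectation in the ADP policy can be written out explicitly: with V̂(w) = (1/2) w^T P w + p^T w (the constant does not affect the minimizer), E V̂(diag(r)(x+u)) = (1/2)(x+u)^T (P ∘ (Σ + r̄ r̄^T))(x+u) + (p ∘ r̄)^T (x+u), where ∘ denotes the elementwise product; this is a special case of the Appendix formulas. A cvxpy sketch of the resulting trade selection follows (our own code and variable names, with kappa the risk-aversion parameter).

```python
import cvxpy as cp
import numpy as np

def portfolio_policy(x, P, p, rbar, Sigma, gamma, kappa, s):
    """Quadratic-ADP trade: minimize stage cost + gamma * E Vhat(diag(r)(x+u))
    subject to the long-only constraint x + u >= 0."""
    u = cp.Variable(len(x))
    post = x + u                                        # post-trade portfolio
    stage = cp.sum(u) + kappa * cp.quad_form(post, Sigma) \
            + cp.sum(cp.multiply(s, cp.square(u)))      # cash in + risk + impact
    EP = P * (Sigma + np.outer(rbar, rbar))             # E[diag(r) P diag(r)]
    expected_v = 0.5 * cp.quad_form(post, EP) + (p * rbar) @ post
    cp.Problem(cp.Minimize(stage + gamma * expected_v), [post >= 0]).solve()
    return u.value
```

EP is positive semidefinite whenever P and Σ + r̄ r̄^T are (Schur product theorem), so the problem solved at each step is a convex QP.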

Numerical instance. We consider n = 20 assets, a discount factor γ = 0.99, and an initial portfolio x_0 = 0. The returns r_t are IID and lognormal, that is, log r_t ∼ N(μ, Σ̃). The means μ_i are drawn from a N(0, 0.01) distribution; the diagonal entries of Σ̃ are drawn from a uniform [0, 0.01] distribution, and the off-diagonal entries are generated as Σ̃_ij = C_ij (Σ̃_ii Σ̃_jj)^{1/2}, where C is a randomly generated correlation matrix. Because the returns are lognormal, we have

\bar r = \exp\left( \mu + \tfrac{1}{2}\, \mathrm{diag}(\tilde\Sigma) \right), \qquad \Sigma_{ij} = \bar r_i \bar r_j \left( \exp(\tilde\Sigma_{ij}) - 1 \right).

We take the risk-aversion parameter κ = 0.5 and draw the entries of s from a uniform [0, 1] distribution.

Method parameters. We start with V̂^(1) = V^quad, where V^quad is the (true) quadratic value function obtained when the long-only constraint is ignored; V^quad is therefore a lower bound on the true value function V. We choose the evaluation points in each iteration of projected value iteration by starting from x_0 and running the policy induced by V̂^(k) for 1000 steps. We then fit a quadratic approximate value function, enforcing the lower bound V ≥ V^quad and using a proximal term with ρ = 0.1. The proximal term has a much more significant effect in this problem: smaller values of ρ require many more iterations of projected value iteration to achieve the same performance.

Results. Figure 2 shows the performance of the ADP policies, where Ĵ^(k) corresponds to the ADP policy at iteration k of projected value iteration. In each case, we run 1000 Monte Carlo simulations, each for 1000 steps, and report the average total discounted cost. We also plot a lower bound J^lb obtained using the methods described in [43, 45]. After around 80 steps of projected value iteration, we arrive at a policy that is nearly optimal. It is also interesting to note that it takes around 40 projected value iteration steps before we produce a policy that is profitable (i.e., one with Ĵ ≤ 0).

Computation timing. Evaluating the ADP policy requires solving a small QP. We use CVXGEN [38] to generate custom C code for this QP, which executes in well under a millisecond. The simulation of 1000 time steps, which we use in each projected value iteration, requires a few hundredths of a second. The fitting is carried out using CVX [44] and takes around a second (this time could be reduced considerably).

Figure 2. Ĵ^(k) (solid line) and the lower bound J^lb (dashed line).
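The return moments above follow from the standard lognormal formulas; a small numpy sketch (our code) computes them directly from μ and Σ̃.

```python
import numpy as np

def lognormal_return_moments(mu, Sigma_tilde):
    """Mean rbar and covariance Sigma of r when log r ~ N(mu, Sigma_tilde)."""
    rbar = np.exp(mu + 0.5 * np.diag(Sigma_tilde))
    Sigma = np.outer(rbar, rbar) * (np.exp(Sigma_tilde) - 1.0)
    return rbar, Sigma
```

These are exactly the quantities r̄ and Σ that enter the stage cost, the ADP policy, and the expectation formulas of the Appendix.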

7. VEHICLE CONTROL

We consider a rigid-body vehicle moving in two dimensions, with continuous-time state x^c = (p, v, θ, ω), where p ∈ R^2 is the position, v ∈ R^2 is the velocity, θ ∈ R is the angle (orientation) of the vehicle, and ω is the angular velocity. The vehicle is subject to control forces u^c ∈ R^k, which put a net force and torque on the vehicle. The vehicle dynamics in continuous time is given by

\dot x^c = A^c x^c + B^c(x^c) u^c + w^c,

where w^c is a continuous-time random noise process and, in block form corresponding to (p, v, θ, ω),

A^c = \begin{bmatrix} 0 & I_2 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \qquad B^c(x) = \begin{bmatrix} 0 \\ R(x_5) C / m \\ 0 \\ D / I \end{bmatrix}, \qquad R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}.

The matrices C ∈ R^{2×k} and D ∈ R^{1×k} depend on where the forces are applied with respect to the center of mass of the vehicle; m is the vehicle mass and I its moment of inertia.

We consider a discretized forward Euler approximation of the vehicle dynamics with discretization epoch h. With x^c(ht) = x_t, u^c(ht) = u_t, and w^c(ht) = w_t, the discretized dynamics has the form

x_{t+1} = (I + h A^c) x_t + h B^c(x_t) u_t + h w_t,

which has an input-affine form with f_t(z) = (I + hA^c) z + h w_t and g_t(z) = h B^c(z). We assume that the w_t are IID with mean w̄ and covariance matrix W.

We consider a finite-horizon trajectory tracking problem, with desired trajectory x^des_t, t = 1, ..., T. The stage cost at time t is

\ell_t(z, v) = \begin{cases} v^T R v + (z - x^{\mathrm{des}}_t)^T Q (z - x^{\mathrm{des}}_t) & |v| \le u^{\max} \\ \infty & |v| > u^{\max}, \end{cases}

where R ∈ S^k_+, Q ∈ S^n_+, u^max ∈ R^k_+, and the inequalities on v are elementwise.

Numerical instance. We consider a vehicle with unit mass m = 1 and moment of inertia I = 1/50. There are k = 4 inputs applied to the vehicle, shown in Figure 3, at distance r = 1/40 from the center of mass of the vehicle. We consider T = 25 time steps, with a discretization epoch of h = 1/50. The matrices C and D encode the actuator geometry: each input contributes its force direction (lateral or longitudinal, in the body frame) to C and a moment arm of magnitude r, with the appropriate sign, to D. The maximum allowable input is u^max = (1.5, 1.5, 1.5, 1.5), and the input cost matrix is R = I/1000. The state cost matrix Q is diagonal, chosen as follows: we first choose the entries of Q to normalize the range of the state variables, based on x^des; we then multiply the weights of the first and second states by 10 to penalize a deviation in position more than a deviation in the speed or orientation of the vehicle.
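A numpy sketch of the discretized vehicle model follows (our code; C and D are the actuator-geometry matrices, whose particular numerical values from the paper's instance are not reproduced here).

```python
import numpy as np

def vehicle_g(x, C, D, m, inertia, h):
    """g_t(z) = h * B_c(z).  State ordering x = (p, v, theta, omega), so theta = x[4]."""
    theta = x[4]
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Bc = np.vstack([np.zeros((2, C.shape[1])),    # position rows: no direct force effect
                    R @ C / m,                    # body-frame forces -> acceleration
                    np.zeros((1, C.shape[1])),    # angle row
                    np.atleast_2d(D) / inertia])  # torque -> angular acceleration
    return h * Bc

def vehicle_f(x, A_c, w, h):
    """f_t(z) = (I + h A_c) z + h w_t for a sampled (or mean) disturbance w."""
    return (np.eye(len(x)) + h * A_c) @ x + h * w
```

Plugging these into the finite-horizon ADP policy gives, at each t, a small QP in the k inputs, exactly as in the timing discussion below.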

Figure 3. Vehicle diagram. The forces are applied at two positions, shown with large circles. At each position, two forces are applied: a lateral force (u_1 and u_3) and a longitudinal force (u_2 and u_4).

Method parameters. We start at the end of the horizon with V̂_T(z) = min_v ℓ_T(z, v), which is convex quadratic in z. At each step t of projected value iteration, we choose N = 1000 evaluation points, generated from a N(x^des_t, 0.5 I) distribution. We evaluate T_t V̂_{t+1} at each evaluation point and fit the quadratic approximate value function V̂_t.

Results. We calculate the average total cost by Monte Carlo simulation with 1000 samples, that is, 1000 simulations of the T-step horizon. We compare the performance of two ADP policies: the policy resulting from V̂_t and the policy resulting from V^quad_t, where V^quad_t is the quadratic value function obtained by linearizing the dynamics along the desired trajectory and ignoring the actuator limits. The quadratic ADP policy based on V̂_t attains a substantially smaller average total cost J than the cost J^quad attained with V^quad_t. Figure 4 shows two sample trajectories generated using V̂_t and V^quad_t.

Computation timing. Evaluating the ADP policy requires solving a small QP with four variables and eight constraints. We use CVXGEN [38] to generate custom C code for this QP, which executes in tens of microseconds. The simulation of 1000 time steps, which we use in each projected value iteration, requires a few hundredths of a second. The fitting is carried out using CVX [44] and takes around 0.5 s (this time could be reduced considerably).

Figure 4. Desired trajectory (dashed line), two sample trajectories obtained using V̂_t, t = 1, ..., T (left), and two sample trajectories obtained using V^quad_t (right).

8. CONCLUSIONS

We have shown how to carry out projected value iteration with quadratic approximate value functions for input-affine systems. Unlike value iteration (when the technical conditions hold), projected value iteration need not converge, but it typically yields a policy with good performance in a reasonable number of steps. Evaluating the quadratic ADP policy requires the solution of a (typically small) convex optimization problem, or more specifically a QP when the stage cost is QP-representable. Recently developed fast methods and code generation techniques allow such problems to be solved very quickly, which means that quadratic ADP can be carried out at kilohertz (or faster) rates. Our ability to evaluate the policy very quickly also makes projected value iteration practical, because many evaluations can be carried out in reasonable time. Although we offer no theoretical guarantees on the performance attained, we have observed that the performance is typically very good.

APPENDIX: EXPECTATION OF A QUADRATIC FUNCTION

In this appendix, we show that when h : R^n → R is a convex quadratic function, so is E h(f_t(z) + g_t(z)v), viewed as a function of v for fixed z. We compute the explicit coefficients A, a, and b in the expression

E\, h(f_t(z) + g_t(z) v) = \frac{1}{2} \begin{bmatrix} v \\ 1 \end{bmatrix}^T \begin{bmatrix} A & a \\ a^T & b \end{bmatrix} \begin{bmatrix} v \\ 1 \end{bmatrix}. \qquad (4)

These coefficients depend only on the first and second moments of (f_t, g_t). To simplify notation, we write f_t(z) as f and g_t(z) as g. Suppose h has the form

h(w) = \frac{1}{2} \begin{bmatrix} w \\ 1 \end{bmatrix}^T \begin{bmatrix} P & p \\ p^T & q \end{bmatrix} \begin{bmatrix} w \\ 1 \end{bmatrix},

with P ⪰ 0. With input-affine dynamics, we have

h(f + g v) = \frac{1}{2} \begin{bmatrix} f + g v \\ 1 \end{bmatrix}^T \begin{bmatrix} P & p \\ p^T & q \end{bmatrix} \begin{bmatrix} f + g v \\ 1 \end{bmatrix}.

Taking the expectation, we see that E h(f + gv) takes the form (4), with coefficients given by

A_{ij} = \sum_{k,l} P_{kl}\, E(g_{ki} g_{lj}), \qquad i, j = 1, \ldots, m,

a_i = \sum_{k,l} P_{kl}\, E(f_k g_{li}) + \sum_j p_j\, E(g_{ji}), \qquad i = 1, \ldots, m,

b = q + \sum_{i,j} P_{ij}\, E(f_i f_j) + 2 \sum_i p_i\, E(f_i).

In matrix form, A = E(g^T P g), a = E(g^T (P f + p)), and b = q + E(f^T P f) + 2 p^T E(f).
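The coefficient formulas can be checked numerically by replacing the moments with sample averages. A small sketch (our code; `f_samples` and `g_samples` are jointly drawn realizations of (f, g)):

```python
import numpy as np

def expected_quadratic_coeffs(P, p, q, f_samples, g_samples):
    """Sample-average estimates of (A, a, b) in
    E h(f + g v) = 0.5 * [v; 1]^T [[A, a], [a^T, b]] [v; 1],
    for h(w) = 0.5 * w^T P w + p^T w + 0.5 * q."""
    A = np.mean([g.T @ P @ g for g in g_samples], axis=0)
    a = np.mean([g.T @ (P @ f + p) for f, g in zip(f_samples, g_samples)], axis=0)
    b = q + np.mean([f @ P @ f + 2.0 * p @ f for f in f_samples])
    return A, a, b
```

As the formulas show, only the first and second moments of (f, g) enter; when these moments are known in closed form, the coefficients are computed directly rather than by sampling.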

ACKNOWLEDGEMENTS

We would like to thank an anonymous reviewer for pointing us to the literature in reinforcement learning on fitted value iteration and fitted Q-iteration algorithms.

REFERENCES

1. Bellman R. Dynamic Programming. Princeton University Press: Princeton, NJ, USA, 1957.
2. Bertsekas D. Dynamic Programming and Optimal Control: Volume 1. Athena Scientific: Nashua, NH, USA.
3. Bertsekas D. Dynamic Programming and Optimal Control: Volume 2. Athena Scientific: Nashua, NH, USA.
4. Puterman M. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.: New York, NY, USA, 1994.
5. Bertsekas DP, Castañón DA. Rollout algorithms for stochastic scheduling problems. Journal of Heuristics 1999; 5.
6. Wang Y, Boyd S. Performance bounds for linear stochastic control. Systems and Control Letters March 2009; 58(3).
7. Wang Y, Boyd S. Approximate dynamic programming via iterated Bellman inequalities. Manuscript.
8. O'Donoghue B, Wang Y, Boyd S. Min-max approximate dynamic programming. In Proceedings of the IEEE Multi-Conference on Systems and Control, September 2011.
9. Keshavarz A, Wang Y, Boyd S. Imputing a convex objective function. In Proceedings of the IEEE Multi-Conference on Systems and Control, September 2011.
10. Boyd S, El Ghaoui L, Feron E, Balakrishnan V. Linear Matrix Inequalities in System and Control Theory. SIAM: Philadelphia, 1994.
11. Ng A, Russell S. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning. Morgan Kaufmann: San Francisco, CA, USA, 2000.
12. Abbeel P, Coates A, Ng A. Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research 2010; 29.
13. Ackerberg D, Benkard C, Berry S, Pakes A. Econometric tools for analyzing market outcomes. In Handbook of Econometrics, Volume 6, Heckman J, Leamer E (eds), Chapter 63. Elsevier: North Holland.
14. Bajari P, Benkard C, Levin J. Estimating dynamic models of imperfect competition. Econometrica 2007; 75(5).
15. Nielsen T, Jensen F. Learning a decision maker's utility function from (possibly) inconsistent behavior. Artificial Intelligence 2004; 160.
16. Watkins C. Learning from delayed rewards. Ph.D. Thesis, Cambridge University, Cambridge, England, 1989.
17. Watkins C, Dayan P. Technical note: Q-learning. Machine Learning 1992; 8(3).
18. Kwon WH, Han S. Receding Horizon Control. Springer-Verlag: London, UK.
19. Whittle P. Optimization Over Time. John Wiley & Sons, Inc.: New York, NY, USA.
20. Goodwin GC, Seron MM, De Doná JA. Constrained Control and Estimation. Springer: New York, USA.
21. Judd KL. Numerical Methods in Economics. MIT Press: Cambridge, MA, USA.
22. Powell W. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley Series in Probability and Statistics. John Wiley & Sons: Hoboken, NJ, USA.
23. Ernst D, Geurts P, Wehenkel L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 2005; 6.
24. Riedmiller M. Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning. Springer: Porto, Portugal, 2005.
25. Munos R. Performance bounds in L_p-norm for approximate value iteration. SIAM Journal on Control and Optimization 2007; 46(2).
26. Munos R, Szepesvári C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research 2008; 9.
27. Antos A, Munos R, Szepesvári C. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems 20, 2008.
28. De Farias D, Van Roy B. The linear programming approach to approximate dynamic programming. Operations Research 2003; 51(6).
29. Desai V, Farias V, Moallemi C. Approximate dynamic programming via a smoothed linear program. Operations Research May-June 2012; 60(3).
30. Lagoudakis M, Parr R. Least-squares policy iteration. Journal of Machine Learning Research 2003; 4.
31. Tsitsiklis JN, Van Roy B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 1997; 42.
32. Van Roy B. Learning and value function approximation in complex decision processes. Ph.D. Thesis, Department of EECS, MIT, 1998.
33. Bemporad A, Morari M, Dua V, Pistikopoulos E. The explicit linear quadratic regulator for constrained systems. Automatica January 2002; 38(1).
34. Wang Y, Boyd S. Fast model predictive control using online optimization. In Proceedings of the 17th IFAC World Congress, 2008.
35. Mattingley JE, Boyd S. Real-time convex optimization in signal processing. IEEE Signal Processing Magazine 2010; 27(3).
36. Mattingley J, Wang Y, Boyd S. Code generation for receding horizon control. In IEEE Multi-Conference on Systems and Control, 2010.
37. Mattingley JE, Boyd S. Automatic code generation for real-time convex optimization. In Convex Optimization in Signal Processing and Communications, Palomar DP, Eldar YC (eds). Cambridge University Press: New York, USA, 2010.
38. Mattingley J, Boyd S. CVXGEN: a code generator for embedded convex optimization. Optimization and Engineering 2012; 13(1):1-27.


More information

UNDERSTAND MOTION IN ONE AND TWO DIMENSIONS

UNDERSTAND MOTION IN ONE AND TWO DIMENSIONS SUBAREA I. COMPETENCY 1.0 UNDERSTAND MOTION IN ONE AND TWO DIMENSIONS MECHANICS Skill 1.1 Calculating displacement, aerage elocity, instantaneous elocity, and acceleration in a gien frame of reference

More information

Estimation of Efficiency with the Stochastic Frontier Cost. Function and Heteroscedasticity: A Monte Carlo Study

Estimation of Efficiency with the Stochastic Frontier Cost. Function and Heteroscedasticity: A Monte Carlo Study Estimation of Efficiency ith the Stochastic Frontier Cost Function and Heteroscedasticity: A Monte Carlo Study By Taeyoon Kim Graduate Student Oklahoma State Uniersity Department of Agricultural Economics

More information

Mean-variance receding horizon control for discrete time linear stochastic systems

Mean-variance receding horizon control for discrete time linear stochastic systems Proceedings o the 17th World Congress The International Federation o Automatic Control Seoul, Korea, July 6 11, 008 Mean-ariance receding horizon control or discrete time linear stochastic systems Mar

More information

Econometrics II - EXAM Outline Solutions All questions have 25pts Answer each question in separate sheets

Econometrics II - EXAM Outline Solutions All questions have 25pts Answer each question in separate sheets Econometrics II - EXAM Outline Solutions All questions hae 5pts Answer each question in separate sheets. Consider the two linear simultaneous equations G with two exogeneous ariables K, y γ + y γ + x δ

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Dynamic Vehicle Routing with Moving Demands Part II: High speed demands or low arrival rates

Dynamic Vehicle Routing with Moving Demands Part II: High speed demands or low arrival rates Dynamic Vehicle Routing with Moing Demands Part II: High speed demands or low arrial rates Stephen L. Smith Shaunak D. Bopardikar Francesco Bullo João P. Hespanha Abstract In the companion paper we introduced

More information

Optimal Joint Detection and Estimation in Linear Models

Optimal Joint Detection and Estimation in Linear Models Optimal Joint Detection and Estimation in Linear Models Jianshu Chen, Yue Zhao, Andrea Goldsmith, and H. Vincent Poor Abstract The problem of optimal joint detection and estimation in linear models with

More information

Asymptotic Normality of an Entropy Estimator with Exponentially Decaying Bias

Asymptotic Normality of an Entropy Estimator with Exponentially Decaying Bias Asymptotic Normality of an Entropy Estimator with Exponentially Decaying Bias Zhiyi Zhang Department of Mathematics and Statistics Uniersity of North Carolina at Charlotte Charlotte, NC 28223 Abstract

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

IMPROVED MPC DESIGN BASED ON SATURATING CONTROL LAWS

IMPROVED MPC DESIGN BASED ON SATURATING CONTROL LAWS IMPROVED MPC DESIGN BASED ON SATURATING CONTROL LAWS D. Limon, J.M. Gomes da Silva Jr., T. Alamo and E.F. Camacho Dpto. de Ingenieria de Sistemas y Automática. Universidad de Sevilla Camino de los Descubrimientos

More information

WE consider finite-state Markov decision processes

WE consider finite-state Markov decision processes IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 54, NO. 7, JULY 2009 1515 Convergence Results for Some Temporal Difference Methods Based on Least Squares Huizhen Yu and Dimitri P. Bertsekas Abstract We consider

More information

A Globally Stabilizing Receding Horizon Controller for Neutrally Stable Linear Systems with Input Constraints 1

A Globally Stabilizing Receding Horizon Controller for Neutrally Stable Linear Systems with Input Constraints 1 A Globally Stabilizing Receding Horizon Controller for Neutrally Stable Linear Systems with Input Constraints 1 Ali Jadbabaie, Claudio De Persis, and Tae-Woong Yoon 2 Department of Electrical Engineering

More information

In: Proc. BENELEARN-98, 8th Belgian-Dutch Conference on Machine Learning, pp 9-46, 998 Linear Quadratic Regulation using Reinforcement Learning Stephan ten Hagen? and Ben Krose Department of Mathematics,

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

Adaptive linear quadratic control using policy. iteration. Steven J. Bradtke. University of Massachusetts.

Adaptive linear quadratic control using policy. iteration. Steven J. Bradtke. University of Massachusetts. Adaptive linear quadratic control using policy iteration Steven J. Bradtke Computer Science Department University of Massachusetts Amherst, MA 01003 bradtke@cs.umass.edu B. Erik Ydstie Department of Chemical

More information

Approximate dynamic programming for stochastic reachability

Approximate dynamic programming for stochastic reachability Approximate dynamic programming for stochastic reachability Nikolaos Kariotoglou, Sean Summers, Tyler Summers, Maryam Kamgarpour and John Lygeros Abstract In this work we illustrate how approximate dynamic

More information

Approximate active fault detection and control

Approximate active fault detection and control Approximate active fault detection and control Jan Škach Ivo Punčochář Miroslav Šimandl Department of Cybernetics Faculty of Applied Sciences University of West Bohemia Pilsen, Czech Republic 11th European

More information

6.231 DYNAMIC PROGRAMMING LECTURE 9 LECTURE OUTLINE

6.231 DYNAMIC PROGRAMMING LECTURE 9 LECTURE OUTLINE 6.231 DYNAMIC PROGRAMMING LECTURE 9 LECTURE OUTLINE Rollout algorithms Policy improvement property Discrete deterministic problems Approximations of rollout algorithms Model Predictive Control (MPC) Discretization

More information

SUPPLEMENTARY MATERIAL. Authors: Alan A. Stocker (1) and Eero P. Simoncelli (2)

SUPPLEMENTARY MATERIAL. Authors: Alan A. Stocker (1) and Eero P. Simoncelli (2) SUPPLEMENTARY MATERIAL Authors: Alan A. Stocker () and Eero P. Simoncelli () Affiliations: () Dept. of Psychology, Uniersity of Pennsylania 34 Walnut Street 33C Philadelphia, PA 94-68 U.S.A. () Howard

More information

Residual migration in VTI media using anisotropy continuation

Residual migration in VTI media using anisotropy continuation Stanford Exploration Project, Report SERGEY, Noember 9, 2000, pages 671?? Residual migration in VTI media using anisotropy continuation Tariq Alkhalifah Sergey Fomel 1 ABSTRACT We introduce anisotropy

More information

Distributed and Real-time Predictive Control

Distributed and Real-time Predictive Control Distributed and Real-time Predictive Control Melanie Zeilinger Christian Conte (ETH) Alexander Domahidi (ETH) Ye Pu (EPFL) Colin Jones (EPFL) Challenges in modern control systems Power system: - Frequency

More information

Lecture 21: Physical Brownian Motion II

Lecture 21: Physical Brownian Motion II Lecture 21: Physical Brownian Motion II Scribe: Ken Kamrin Department of Mathematics, MIT May 3, 25 Resources An instructie applet illustrating physical Brownian motion can be found at: http://www.phy.ntnu.edu.tw/jaa/gas2d/gas2d.html

More information

MATH4406 (Control Theory) Unit 6: The Linear Quadratic Regulator (LQR) and Model Predictive Control (MPC) Prepared by Yoni Nazarathy, Artem

MATH4406 (Control Theory) Unit 6: The Linear Quadratic Regulator (LQR) and Model Predictive Control (MPC) Prepared by Yoni Nazarathy, Artem MATH4406 (Control Theory) Unit 6: The Linear Quadratic Regulator (LQR) and Model Predictive Control (MPC) Prepared by Yoni Nazarathy, Artem Pulemotov, September 12, 2012 Unit Outline Goal 1: Outline linear

More information

Alleviating tuning sensitivity in Approximate Dynamic Programming

Alleviating tuning sensitivity in Approximate Dynamic Programming Alleviating tuning sensitivity in Approximate Dynamic Programming Paul Beuchat, Angelos Georghiou and John Lygeros Abstract Approximate Dynamic Programming offers benefits for large-scale systems compared

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning) A.

More information

A014 Uncertainty Analysis of Velocity to Resistivity Transforms for Near Field Exploration

A014 Uncertainty Analysis of Velocity to Resistivity Transforms for Near Field Exploration A014 Uncertainty Analysis of Velocity to Resistiity Transforms for Near Field Exploration D. Werthmüller* (Uniersity of Edinburgh), A.M. Ziolkowski (Uniersity of Edinburgh) & D.A. Wright (Uniersity of

More information

Control Theory : Course Summary

Control Theory : Course Summary Control Theory : Course Summary Author: Joshua Volkmann Abstract There are a wide range of problems which involve making decisions over time in the face of uncertainty. Control theory draws from the fields

More information

Math Mathematical Notation

Math Mathematical Notation Math 160 - Mathematical Notation Purpose One goal in any course is to properly use the language of that subject. Finite Mathematics is no different and may often seem like a foreign language. These notations

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

Introduction to Approximate Dynamic Programming

Introduction to Approximate Dynamic Programming Introduction to Approximate Dynamic Programming Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Approximate Dynamic Programming 1 Key References Bertsekas, D.P.

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Approximate Optimal-Value Functions. Satinder P. Singh Richard C. Yee. University of Massachusetts.

Approximate Optimal-Value Functions. Satinder P. Singh Richard C. Yee. University of Massachusetts. An Upper Bound on the oss from Approximate Optimal-Value Functions Satinder P. Singh Richard C. Yee Department of Computer Science University of Massachusetts Amherst, MA 01003 singh@cs.umass.edu, yee@cs.umass.edu

More information

Policy Gradient Reinforcement Learning for Robotics

Policy Gradient Reinforcement Learning for Robotics Policy Gradient Reinforcement Learning for Robotics Michael C. Koval mkoval@cs.rutgers.edu Michael L. Littman mlittman@cs.rutgers.edu May 9, 211 1 Introduction Learning in an environment with a continuous

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

Temporal-Difference Q-learning in Active Fault Diagnosis

Temporal-Difference Q-learning in Active Fault Diagnosis Temporal-Difference Q-learning in Active Fault Diagnosis Jan Škach 1 Ivo Punčochář 1 Frank L. Lewis 2 1 Identification and Decision Making Research Group (IDM) European Centre of Excellence - NTIS University

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Linearly-solvable Markov decision problems

Linearly-solvable Markov decision problems Advances in Neural Information Processing Systems 2 Linearly-solvable Markov decision problems Emanuel Todorov Department of Cognitive Science University of California San Diego todorov@cogsci.ucsd.edu

More information

NOTES ON THE REGULAR E-OPTIMAL SPRING BALANCE WEIGHING DESIGNS WITH CORRE- LATED ERRORS

NOTES ON THE REGULAR E-OPTIMAL SPRING BALANCE WEIGHING DESIGNS WITH CORRE- LATED ERRORS REVSTAT Statistical Journal Volume 3, Number 2, June 205, 9 29 NOTES ON THE REGULAR E-OPTIMAL SPRING BALANCE WEIGHING DESIGNS WITH CORRE- LATED ERRORS Authors: Bronis law Ceranka Department of Mathematical

More information

FINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES. Danlei Chu, Tongwen Chen, Horacio J. Marquez

FINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES. Danlei Chu, Tongwen Chen, Horacio J. Marquez FINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES Danlei Chu Tongwen Chen Horacio J Marquez Department of Electrical and Computer Engineering University of Alberta Edmonton

More information

Chapter 2 Optimal Control Problem

Chapter 2 Optimal Control Problem Chapter 2 Optimal Control Problem Optimal control of any process can be achieved either in open or closed loop. In the following two chapters we concentrate mainly on the first class. The first chapter

More information

Optimal Stopping Problems

Optimal Stopping Problems 2.997 Decision Making in Large Scale Systems March 3 MIT, Spring 2004 Handout #9 Lecture Note 5 Optimal Stopping Problems In the last lecture, we have analyzed the behavior of T D(λ) for approximating

More information

arxiv: v1 [math.ds] 19 Jul 2016

arxiv: v1 [math.ds] 19 Jul 2016 Mutually Quadratically Inariant Information Structures in Two-Team Stochastic Dynamic Games Marcello Colombino Roy S Smith and Tyler H Summers arxi:160705461 mathds 19 Jul 016 Abstract We formulate a two-team

More information

1 Problem Formulation

1 Problem Formulation Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning

More information

Bayesian Active Learning With Basis Functions

Bayesian Active Learning With Basis Functions Bayesian Active Learning With Basis Functions Ilya O. Ryzhov Warren B. Powell Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA IEEE ADPRL April 13, 2011 1 / 29

More information

The Art of Sequential Optimization via Simulations

The Art of Sequential Optimization via Simulations The Art of Sequential Optimization via Simulations Stochastic Systems and Learning Laboratory EE, CS* & ISE* Departments Viterbi School of Engineering University of Southern California (Based on joint

More information

Towards Green Distributed Storage Systems

Towards Green Distributed Storage Systems Towards Green Distributed Storage Systems Abdelrahman M. Ibrahim, Ahmed A. Zewail, and Aylin Yener Wireless Communications and Networking Laboratory (WCAN) Electrical Engineering Department The Pennsylania

More information

Modeling Highway Traffic Volumes

Modeling Highway Traffic Volumes Modeling Highway Traffic Volumes Tomáš Šingliar1 and Miloš Hauskrecht 1 Computer Science Dept, Uniersity of Pittsburgh, Pittsburgh, PA 15260 {tomas, milos}@cs.pitt.edu Abstract. Most traffic management

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Static Output Feedback Stabilisation with H Performance for a Class of Plants

Static Output Feedback Stabilisation with H Performance for a Class of Plants Static Output Feedback Stabilisation with H Performance for a Class of Plants E. Prempain and I. Postlethwaite Control and Instrumentation Research, Department of Engineering, University of Leicester,

More information

An Elastic Contact Problem with Normal Compliance and Memory Term

An Elastic Contact Problem with Normal Compliance and Memory Term Machine Dynamics Research 2012, Vol. 36, No 1, 15 25 Abstract An Elastic Contact Problem with Normal Compliance and Memory Term Mikäel Barboteu, Ahmad Ramadan, Mircea Sofonea Uniersité de Perpignan, France

More information

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updates Michael Kearns AT&T Labs mkearns@research.att.com Satinder Singh AT&T Labs baveja@research.att.com Abstract We give the first rigorous upper bounds

More information

Dynamic Vehicle Routing with Heterogeneous Demands

Dynamic Vehicle Routing with Heterogeneous Demands Dynamic Vehicle Routing with Heterogeneous Demands Stephen L. Smith Marco Paone Francesco Bullo Emilio Frazzoli Abstract In this paper we study a ariation of the Dynamic Traeling Repairperson Problem DTRP

More information

Notes on Tabular Methods

Notes on Tabular Methods Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Q-Learning and SARSA: Machine learningbased stochastic control approaches for financial trading

Q-Learning and SARSA: Machine learningbased stochastic control approaches for financial trading Q-Learning and SARSA: Machine learningbased stochastic control approaches for financial trading Marco CORAZZA (corazza@unive.it) Department of Economics Ca' Foscari University of Venice CONFERENCE ON COMPUTATIONAL

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Dynamic Vehicle Routing with Moving Demands Part II: High speed demands or low arrival rates

Dynamic Vehicle Routing with Moving Demands Part II: High speed demands or low arrival rates ACC 9, Submitted St. Louis, MO Dynamic Vehicle Routing with Moing Demands Part II: High speed demands or low arrial rates Stephen L. Smith Shaunak D. Bopardikar Francesco Bullo João P. Hespanha Abstract

More information

On Regression-Based Stopping Times

On Regression-Based Stopping Times On Regression-Based Stopping Times Benjamin Van Roy Stanford University October 14, 2009 Abstract We study approaches that fit a linear combination of basis functions to the continuation value function

More information

Duality in Robust Dynamic Programming: Pricing Convertibles, Stochastic Games and Control

Duality in Robust Dynamic Programming: Pricing Convertibles, Stochastic Games and Control Duality in Robust Dynamic Programming: Pricing Convertibles, Stochastic Games and Control Shyam S Chandramouli Abstract Many decision making problems that arise in inance, Economics, Inventory etc. can

More information

A Regularization Framework for Learning from Graph Data

A Regularization Framework for Learning from Graph Data A Regularization Framework for Learning from Graph Data Dengyong Zhou Max Planck Institute for Biological Cybernetics Spemannstr. 38, 7076 Tuebingen, Germany Bernhard Schölkopf Max Planck Institute for

More information

Project Discussions: SNL/ADMM, MDP/Randomization, Quadratic Regularization, and Online Linear Programming

Project Discussions: SNL/ADMM, MDP/Randomization, Quadratic Regularization, and Online Linear Programming Project Discussions: SNL/ADMM, MDP/Randomization, Quadratic Regularization, and Online Linear Programming Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305,

More information

Online solution of the average cost Kullback-Leibler optimization problem

Online solution of the average cost Kullback-Leibler optimization problem Online solution of the average cost Kullback-Leibler optimization problem Joris Bierkens Radboud University Nijmegen j.bierkens@science.ru.nl Bert Kappen Radboud University Nijmegen b.kappen@science.ru.nl

More information

Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil

Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil Charles W. Anderson 1, Douglas C. Hittle 2, Alon D. Katz 2, and R. Matt Kretchmar 1 1 Department of Computer Science Colorado

More information

Optimal Switching of DC-DC Power Converters using Approximate Dynamic Programming

Optimal Switching of DC-DC Power Converters using Approximate Dynamic Programming Optimal Switching of DC-DC Power Conerters using Approximate Dynamic Programming Ali Heydari, Member, IEEE Abstract Optimal switching between different topologies in step-down DC-DC oltage conerters, with

More information

Dynamics of a Bertrand Game with Heterogeneous Players and Different Delay Structures. 1 Introduction

Dynamics of a Bertrand Game with Heterogeneous Players and Different Delay Structures. 1 Introduction ISSN 749-889 (print), 749-897 (online) International Journal of Nonlinear Science Vol.5(8) No., pp.9-8 Dynamics of a Bertrand Game with Heterogeneous Players and Different Delay Structures Huanhuan Liu,

More information

Fast Linear Iterations for Distributed Averaging 1

Fast Linear Iterations for Distributed Averaging 1 Fast Linear Iterations for Distributed Averaging 1 Lin Xiao Stephen Boyd Information Systems Laboratory, Stanford University Stanford, CA 943-91 lxiao@stanford.edu, boyd@stanford.edu Abstract We consider

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

A Geometric Review of Linear Algebra

A Geometric Review of Linear Algebra A Geometric Reiew of Linear Algebra The following is a compact reiew of the primary concepts of linear algebra. The order of presentation is unconentional, with emphasis on geometric intuition rather than

More information

A Hierarchy of Suboptimal Policies for the Multi-period, Multi-echelon, Robust Inventory Problem

A Hierarchy of Suboptimal Policies for the Multi-period, Multi-echelon, Robust Inventory Problem A Hierarchy of Suboptimal Policies for the Multi-period, Multi-echelon, Robust Inventory Problem Dimitris J. Bertsimas Dan A. Iancu Pablo A. Parrilo Sloan School of Management and Operations Research Center,

More information