Quadratic approximate dynamic programming for input-affine systems
INTERNATIONAL JOURNAL OF ROBUST AND NONLINEAR CONTROL
Int. J. Robust Nonlinear Control (2012). Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/rnc.2894

Quadratic approximate dynamic programming for input-affine systems

Arezou Keshavarz* and Stephen Boyd
Electrical Engineering, Stanford University, Stanford, CA, USA

SUMMARY

We consider the use of quadratic approximate value functions for stochastic control problems with input-affine dynamics and convex stage cost and constraints. Evaluating the approximate dynamic programming policy in such cases requires the solution of an explicit convex optimization problem, such as a quadratic program, which can be carried out efficiently. We describe a simple and general method for approximate value iteration that also relies on our ability to solve convex optimization problems, in this case, typically a semidefinite program. Although we have no theoretical guarantee on the performance attained using our method, we observe that very good performance can be obtained in practice. Copyright 2012 John Wiley & Sons, Ltd.

Received February 2012; Revised May 2012; Accepted 9 July 2012

KEY WORDS: approximate dynamic programming; stochastic control; convex optimization

1. INTRODUCTION

We consider stochastic control problems with input-affine dynamics and convex stage costs, which arise in many applications in nonlinear control, supply chain, and finance. Such problems can be solved in principle using dynamic programming (DP), which expresses the optimal policy in terms of an optimization problem involving the value function (or a sequence of value functions in the time-varying case), which is a function on the state space R^n. The value function (or functions) can be found, in principle, by an iteration involving the Bellman operator, which maps functions on the state space into functions on the state space and involves an expectation and a minimization step; see [1] for an early work on DP and [2-4] for a more detailed overview.
For the special case when the dynamics is linear and the stage cost is convex quadratic, DP gives a complete and explicit solution of the problem, because in this case, the value function is also quadratic and the expectation and minimization steps both preserve quadratic functions and can be carried out explicitly. For other input-affine convex stochastic control problems, DP is difficult to carry out. The basic problem is that there is no general way to (exactly) represent a function on R^n, much less carry out the expectation and minimization steps that arise in the Bellman operator. For small state space dimension, say, n <= 4, the important region of state space can be gridded (or otherwise represented by a finite number of points), and functions on it can be represented by a finite-dimensional set of functions, such as piecewise affine. Brute force computation can then be used to find the value function (or, more accurately, a good approximation of it) and also the optimal policy. But this approach will not scale to larger state space dimensions, because, in general, the number of basis functions needed to represent a function to a given accuracy grows exponentially in the state space dimension n (this is the curse of dimensionality).

*Correspondence to: Arezou Keshavarz, Electrical Engineering, Stanford University, Packard 243, 350 Serra Mall, Stanford, CA 94305, USA. arezou@stanford.edu

Copyright 2012 John Wiley & Sons, Ltd.
For problems with larger dimensions, approximate dynamic programming (ADP) gives a method for finding a good, if not optimal, policy. In ADP, the control policy uses a surrogate or approximate value function in place of the true value function. The distinction with the brute force numerical method described previously is that in ADP, we abandon the goal of approximating the true value function well and, instead, only hope to capture some of its key attributes. The approximate value function is also chosen so that the optimization problem that must be solved to evaluate the policy is tractable. It has been observed that good performance can be obtained with ADP, even when the approximate value function is not a particularly good approximation of the true value function. (Of course, this depends on how the approximate value function is chosen.) There are many methods for finding an appropriate approximate value function; for an overview of ADP, see [3]. The approximate value function can be chosen as the true value function of a simplified version of the stochastic control problem, for which we can easily obtain the value function. For example, we can form a linear quadratic approximation of the problem and use the (quadratic) value function obtained for the approximation as our approximate value function. Another example, used in model predictive control (MPC) or certainty-equivalent rollout (see, e.g., [5]), is to replace the stochastic terms in the original problem with constant values, say, their expectations. This results in an ordinary optimization problem, which (if it is convex) we can solve to obtain the value function. Recently, Wang et al. have developed a method that uses convex optimization to compute a quadratic lower bound on the true value function, which provides a good candidate for an approximate value function and also gives a lower bound on the optimal performance [6, 7].
This was extended in [8] to use as approximate value function the pointwise supremum of a family of quadratic lower bounds on the value function. These methods, however, cannot be used for general input-affine convex problems. Approximate dynamic programming is closely related to inverse optimization, which has been studied in various fields such as control [9, 10], robotics [11, 12], and economics [13-15]. In inverse optimization, the goal is to find (or estimate or impute) the objective function in an underlying optimizing process, given observations of optimal (or nearly optimal) actions. In a stochastic control problem, the objective is related to the value function, so the goal is to approximate the value function, given samples of optimal or nearly optimal actions. Watkins et al. introduced an algorithm called Q-learning [16, 17], which observes the state-action pair at each step and updates a quality function on the basis of the obtained reward/cost. Model predictive control, also known as receding-horizon control or rolling-horizon planning [18-20], is also closely related to ADP. In MPC, at each step, we plan for the next T time steps, using some estimates or predictions of future unknown values, and then execute only the action for the current time step. At the next step, we solve the same problem again, now using the value of the current state and updated values of the predictions based on any new information available. MPC can be interpreted as ADP, where the approximate value function is the optimal value of an optimization problem. In this paper, we take a simple approach, by restricting the approximate value functions to be quadratic. This implies that the expectation step can be carried out exactly and that evaluating the policy reduces to a tractable convex optimization problem, often a simple quadratic program (QP), in many practical cases.
To choose a quadratic approximate value function, we use the same Bellman operator iteration that would result in the true value function (when some technical conditions hold), but we replace the Bellman operator with an approximation that maps quadratic functions to quadratic functions. The idea of value iteration, followed by a projection step back to the subset of candidate approximate value functions, is an old one, explored by many authors [3, Chapter 6; , Chapter ; ]. In the reinforcement learning community, two closely related methods are fitted value iteration and fitted Q-iteration [23-27]. Most of these works consider problems with continuous state space and finite action space, and some of them come with theoretical guarantees on convergence and performance bounds. For instance, the authors in [23] proposed a family of tree-based methods for fitting approximate Q-functions and provided convergence guarantees and performance bounds for some of the proposed methods. Riedmiller considered the same framework but employed a multilayered perceptron to fit (model-free) Q-functions and provided empirical performance results [24]. Munos and
Szepesvári developed finite-time bounds for fitted value iteration for discounted problems with bounded rewards [26]. Antos et al. extended this work to continuous action spaces and developed performance bounds for a variant of fitted Q-iteration by using stationary stochastic policies, where greedy action selection is replaced with a search over a restricted set of candidate policies [27]. There are also approximate versions of policy iteration and of the linear programming formulation of DP [28, 29], where the value function is replaced by a linear combination of a set of basis functions. Lagoudakis et al. considered the discounted finite state case and proposed a projected policy iteration method [30] that minimizes the l2 Bellman residual to obtain an approximate value function and policy at each step of the algorithm. Tsitsiklis and Van Roy established strong theoretical bounds on the suboptimality of the ADP policy, for the discounted cost finite state case [31, 32]. Our contribution is to work out the details of a projected ADP method for the case of input-affine dynamics and convex stage costs, in the relatively new context of the availability of extremely fast and reliable solvers for convex optimization problems. We rely on our ability to (numerically) solve convex optimization problems with great speed and reliability. Recent advances have shown that using custom-generated solvers can speed up computation by orders of magnitude [33-38]. With solvers that are orders of magnitude faster, we can carry out approximate value iteration for many practical problems and obtain approximate value functions that are simple yet achieve very good performance. The method we propose comes with no theoretical guarantees.
There is no theoretical guarantee that the modified Bellman iterations will converge, and when they do (or when we simply terminate the iterations early), we do not have any bound on how suboptimal the resulting control policy is. On the other hand, we have found that the methods described in this paper work very well in practice. In cases in which the performance bounding methods of Wang et al. can be applied, we can confirm that our quadratic ADP policies are very good and, in many cases, nearly optimal, at least for specific cases. Our method has many of the same characteristics as proportional-integral-derivative (PID) control, widely used in industrial control. There are no theoretical guarantees that PID control will work well; indeed, it is easy to create a plant for which no PID controller results in a stable closed-loop system (which means the objective value is infinite). On the other hand, PID control, with some tuning of the parameters, can usually be made to work very well, or at least well enough, in many (if not most) practical applications.

Style of the paper. The style of the paper is informal but algorithmic and computationally oriented. We are very informal and cavalier in our mathematics, occasionally referring the reader to other work that covers the specific topic more formally and precisely. We do this so as to keep the ideas simple and clear, without becoming bogged down in technical details, and in any case, we do not make any mathematical claims about the methods described. The methods presented are offered as methods that can work very well in practice and not as methods for which we make any more specific or precise claim. We also do not give the most general formulation possible; instead, we stick with what we hope gives a good trade-off of clarity, simplicity, and practical utility.

Outline. In Section 2, we describe three variations on the stochastic control problem.
In Section 3, we review the DP solution of the problems, including the special case when the dynamics is linear and the stage cost functions are convex quadratic, as well as quadratic ADP, which is the method we propose. We describe a projected value iteration method for obtaining an approximate value function in Section 4. In the remaining three sections, we describe numerical examples: a traditional control problem with bounded inputs (Section 5), a multiperiod portfolio optimization problem (Section 6), and a vehicle control problem (Section 7).

2. INPUT-AFFINE CONVEX STOCHASTIC CONTROL

In this section, we set up three variations on the input-affine convex stochastic control problem: the finite-horizon case, the discounted infinite-horizon case, and the average-cost case. We start with the assumptions and definitions that are common to all cases.
2.1. The common model

Input-affine random dynamics. We consider a dynamical system with state x_t in R^n, input or action u_t in R^m, and input-affine random dynamics

x_{t+1} = f_t(x_t) + g_t(x_t) u_t,

where f_t : R^n -> R^n and g_t : R^n -> R^{n x m}. We assume that f_t and g_t are (possibly) random, that is, they depend on some random variables, with (f_t, g_t) and (f_tau, g_tau) independent for t != tau. The initial state x_0 is also a random variable, independent of (f_t, g_t). We say the dynamics is time-invariant if the random variables (f_t, g_t) are identically distributed (and therefore IID).

Linear dynamics. We say the dynamics is linear (or, more accurately, affine) if f_t is an affine function of x_t and g_t does not depend on x_t, in which case we can write the dynamics in the more familiar form

x_{t+1} = A_t x_t + B_t u_t + c_t.

Here, A_t in R^{n x n}, B_t in R^{n x m}, and c_t in R^n are random.

Closed-loop dynamics. We consider state feedback policies, that is, u_t = phi_t(x_t), t = 0, 1, ..., where phi_t : R^n -> R^m is the policy at time t. The closed-loop dynamics are then

x_{t+1} = f_t(x_t) + g_t(x_t) phi_t(x_t),

which recursively defines a stochastic process for x_t (and u_t = phi_t(x_t)). We say the policy is time-invariant if it does not depend on t, in which case we denote it by phi.

Stage cost. The stage cost is given by l_t(z, v), where l_t : R^n x R^m -> R u {+inf} is an extended-valued closed convex function. We use infinite stage cost to denote unallowed state-input pairs, that is, (joint) constraints on z and v. We will assume that for each state, there exists an input with finite stage cost. The stage cost is time-invariant if l_t does not depend on t, in which case we write it as l.

Convex quadratic stage cost. An important special case is when the stage cost is a (convex) quadratic function, of the form

l_t(z, v) = (1/2) [z; v; 1]^T [Q_t, S_t, q_t; S_t^T, R_t, r_t; q_t^T, r_t^T, s_t] [z; v; 1],

where

[Q_t, S_t; S_t^T, R_t] >= 0,

that is, this block is positive semidefinite.

Quadratic program-representable stage cost.
A stage cost is QP-representable if l_t is a convex quadratic plus convex piecewise-linear function, subject to polyhedral constraints (which can be represented in the cost function using an indicator function). Minimizing such a function can be carried out by solving a QP.
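To make the setup concrete, the closed-loop recursion above can be simulated directly. The sketch below is ours, not from the paper: it uses a hypothetical linear instance f_t(z) = Az + w_t, g_t(z) = B, a simple (not optimized) linear policy u_t = K x_t, and a quadratic stage cost, and estimates the average stage cost by simulation. All problem data here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 4, 2, 5000

# Hypothetical problem data (illustrative only, not the paper's instance).
A = 0.8 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
K = -0.05 * B.T                        # a simple linear state feedback gain
Q, R = np.eye(n), 0.1 * np.eye(m)      # convex quadratic stage cost

x = np.zeros(n)
costs = []
for t in range(T):
    u = K @ x                          # state feedback policy u_t = K x_t
    costs.append(x @ Q @ x + u @ R @ u)
    w = 0.1 * rng.standard_normal(n)   # process noise makes f_t random
    x = A @ x + B @ u + w              # input-affine (here linear) dynamics

print(f"estimated average stage cost: {np.mean(costs):.4f}")
```

With IID noise and a stabilizing policy, the running average converges to the average-cost objective J for that policy.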
Stochastic control problem. In the problems we consider, the overall objective function, which we denote by J, is a sum or average of the expected stage costs. The associated stochastic control problem is to choose the policies phi_t (or the policy phi when the policy is time-invariant) so as to minimize J. We let J* denote the optimal (minimal) value of the objective, and we let phi_t* denote an optimal policy at time t. When the dynamics is linear, we call the stochastic control problem a linear convex stochastic control problem. When the dynamics is linear and the stage cost is convex quadratic, we call the stochastic control problem a linear quadratic stochastic control problem.

2.2. The problems

Finite horizon. In the finite-horizon problem, the objective is the expected value of the sum of the stage costs up to a horizon T:

J = sum_{t=0}^T E l_t(x_t, u_t).

Discounted infinite horizon. In the discounted infinite-horizon problem, we assume that the dynamics, stage cost, and policy are time-invariant. The objective in this case is

J = sum_{t=0}^inf E gamma^t l(x_t, u_t),

where gamma in (0, 1) is a discount factor.

Average cost. In the average-cost problem, we assume that the dynamics, stage cost, and policy are time-invariant, and we take as objective the average stage cost,

J = lim_{T->inf} (1/(T+1)) sum_{t=0}^T E l(x_t, u_t).

We will simply assume that the expectation exists in all three objectives, that the sum exists in the discounted case, and that the limit exists in the average-cost case.

3. DYNAMIC AND APPROXIMATE DYNAMIC PROGRAMMING

The stochastic control problems described earlier can be solved in principle using DP, which we sketch here for future reference. See [3, 4, 39] for (much) more detail. DP makes use of a function (time-varying in the finite-horizon case) V : R^n -> R that characterizes the cost of starting from a given state, when an optimal policy is used.
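Before detailing the three cases, it may help to see the brute-force approach from the introduction in miniature: for a scalar discounted problem, we can grid the state space and iterate the Bellman operator directly. This is our illustrative sketch, with made-up dynamics, cost, and noise, not an example from the paper.

```python
import numpy as np

# Toy gridded value iteration: x_{t+1} = a*x + b*u + w, stage cost x^2 + u^2,
# discount gamma. All problem data here are hypothetical.
a, b, gamma = 1.0, 1.0, 0.95
xs = np.linspace(-5.0, 5.0, 201)         # state grid
us = np.linspace(-2.0, 2.0, 41)          # candidate inputs
ws = np.array([-0.2, 0.0, 0.2])          # discrete noise values, equal weights

V = np.zeros_like(xs)
for _ in range(150):                     # value iteration: V <- T V
    V_new = np.empty_like(V)
    for i, x in enumerate(xs):
        xn = a * x + b * us[:, None] + ws[None, :]    # next states, (41, 3)
        EV = np.interp(xn, xs, V).mean(axis=1)        # E V(x_{t+1}) per input
        V_new[i] = np.min(x**2 + us**2 + gamma * EV)  # Bellman minimization
    V = V_new

print(f"V(0) is approximately {V[len(xs)//2]:.3f}")
```

Each sweep costs (grid points) x (inputs) x (noise values) work; repeating this per state-space dimension is exactly the exponential blow-up in n that motivates quadratic ADP.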
The definition of V, its characterization, and methods for computing it differ (slightly) for the three stochastic control problems we consider.

Finite horizon. In the finite-horizon case, we have a value function V_t for each t = 0, ..., T. The value function V_t(z) is the minimum cost-to-go starting from state x_t = z at time t, using an optimal policy for times t, ..., T:

V_t(z) = sum_{tau=t}^T E( l_tau(x_tau, phi_tau*(x_tau)) | x_t = z ), t = 0, ..., T.

The optimal cost is given by J* = E V_0(x_0), where the expectation is over x_0. Although our definition of V_t refers to an optimal policy phi_t*, we can find V_t (in principle) using a backward recursion that does not require knowledge of an optimal policy. We start with

V_T(z) = min_v l_T(z, v)
and then find V_{T-1}, V_{T-2}, ..., V_0 from the recursion

V_t(z) = min_v ( l_t(z, v) + E V_{t+1}(f_t(z) + g_t(z) v) ), for t = T-1, T-2, ..., 0.

(The expectation here is over (f_t, g_t).) Defining the Bellman operator T_t for the finite-horizon problem as

(T_t h)(z) = min_v ( l_t(z, v) + E h(f_t(z) + g_t(z) v) ),

for h : R^n -> R u {+inf}, we can express the aforementioned recursion compactly as

V_t = T_t V_{t+1}, t = T, T-1, ..., 0,    (1)

with V_{T+1} = 0. We can express an optimal policy in terms of the value functions as

phi_t*(z) = argmin_v ( l_t(z, v) + E V_{t+1}(f_t(z) + g_t(z) v) ), t = 0, ..., T.

Discounted infinite horizon. For the discounted infinite-horizon problem, the value function is time-invariant, that is, it does not depend on t. The value function V(z) is the minimum cost-to-go from state x_0 = z at t = 0, using an optimal policy:

V(z) = sum_{t=0}^inf E( gamma^t l(x_t, phi*(x_t)) | x_0 = z ).

Thus, J* = E V(x_0). The value function cannot be constructed using a simple recursion, as in the finite-horizon case, but it is characterized as the unique solution of the Bellman fixed-point equation T V = V, where the Bellman operator T for the discounted infinite-horizon case is

(T h)(z) = min_v ( l(z, v) + gamma E h(f(z) + g(z) v) ),

for h : R^n -> R u {+inf}. The value function can be found (in principle) by several methods, including iteration of the Bellman operator,

V^(k+1) = T V^(k), k = 1, 2, ...,    (2)

which is called value iteration. Under certain technical conditions, this iteration converges to V [1]; see [39, Chapter 9] for a detailed analysis of the convergence of value iteration. We can express an optimal policy in terms of the value function as

phi*(z) = argmin_v ( l(z, v) + gamma E V(f(z) + g(z) v) ).

Average cost. For the average-cost problem, the value function is not the cost-to-go starting from a given state (which would have the same value J* for all states). Instead, it represents the differential sum of costs, that is,

V(z) = sum_{t=0}^inf E( l(x_t, phi*(x_t)) - J* | x_0 = z ).
We can characterize the value function (up to an additive constant) and the optimal average cost as a solution of the fixed-point equation V + J* = T V, where T, the Bellman operator for the average-cost problem, is defined as

(T h)(z) = min_v ( l(z, v) + E h(f(z) + g(z) v) )
for h : R^n -> R u {+inf}. Notice that if V and J* satisfy the fixed-point equation, so do V + beta and J* for any beta in R. Thus, we can choose a reference state x_ref and assume V(x_ref) = 0 without loss of generality. Under some technical assumptions [3, 4], the value function V and the optimal cost J* can be found (in principle) by several methods, including value iteration, which has the form

J^(k+1) = (T V^(k))(x_ref),    V^(k+1) = T V^(k) - J^(k+1),    (3)

for k = 1, 2, .... V^(k) converges to V, and J^(k) converges to J* [3, 40]. We can express an optimal policy in terms of V as

phi*(z) = argmin_v ( l(z, v) + E V(f(z) + g(z) v) ).

3.1. Linear convex dynamic programming

When the dynamics is linear, the value function (or functions, in the finite-horizon case) is convex. To see this, we first note that the Bellman operator maps convex functions to convex functions. The (random) function f_t(z) + g_t(z) v = A_t z + B_t v + c_t is an affine function of (z, v); it follows that when h is convex, h(f_t(z) + g_t(z) v) = h(A_t z + B_t v + c_t) is a convex function of (z, v). Expectation preserves convexity, so

E h(f_t(z) + g_t(z) v) = E h(A_t z + B_t v + c_t)

is a convex function of (z, v). Finally,

min_v ( l(z, v) + E h(A_t z + B_t v + c_t) )

is convex in z, because l is convex and convexity is preserved under partial minimization [41, Section 3.2.5]. It follows that the value function is convex. In the finite-horizon case, the aforementioned argument shows that each V_t is convex. In the infinite-horizon and average-cost cases, the aforementioned argument shows that value iteration preserves convexity, so if we start with a convex initial guess (say, 0), all iterates are convex. The limit of a sequence of convex functions is convex, so we conclude that V is convex. Evaluating the optimal policy at x_t, that is, minimizing l_t(x_t, v) + E V_{t+1}(f_t(x_t) + g_t(x_t) v) over v, involves solving a convex optimization problem. In the finite-horizon and average-cost cases, the discount factor is 1.
In the infinite-horizon cases (discounted and average cost), the stage cost and value function are time-invariant.

3.2. Linear quadratic dynamic programming

For linear quadratic stochastic control problems, we can effectively solve the stochastic control problem by using DP. The reason is simple: the Bellman operator maps convex quadratic functions to convex quadratic functions, so the value function is also convex quadratic (and we can compute it by value iteration). The argument is similar to the convex case: convex quadratic functions are preserved under expectation and partial minimization. One big difference, however, is that the Bellman iteration can actually be carried out in the linear quadratic case, using an explicit representation of convex quadratic functions, and standard linear
algebra operations. The optimal policy is affine, with coefficients we can compute. To simplify notation, we consider the discounted infinite-horizon case; the other two cases are very similar. We denote the set of convex quadratic functions on R^n, that is, those of the form

h(z) = (1/2) [z; 1]^T [H, h; h^T, p] [z; 1],

where H >= 0, by Q_n. With linear dynamics, we have

h(A_t z + B_t v + c_t) = (1/2) [A_t z + B_t v + c_t; 1]^T [H, h; h^T, p] [A_t z + B_t v + c_t; 1],

where (A_t, B_t, c_t) is a random variable. Convex quadratic functions are closed under expectation, so E h(A_t z + B_t v + c_t) is convex quadratic; the details (including formulas for the coefficients) are given in the Appendix. The stage cost l is convex quadratic, that is,

l(z, v) = (1/2) [z; v; 1]^T [Q, S, q; S^T, R, r; q^T, r^T, s] [z; v; 1],

where [Q, S; S^T, R] >= 0. Thus, l(z, v) + gamma E h(A_t z + B_t v + c_t) is convex quadratic. We write

l(z, v) + gamma E h(A_t z + B_t v + c_t) = (1/2) [z; v; 1]^T M [z; v; 1],

where

M = [M_11, M_12, M_13; M_12^T, M_22, M_23; M_13^T, M_23^T, M_33], with [M_11, M_12; M_12^T, M_22] >= 0.

Partial minimization of l(z, v) + gamma E h(A_t z + B_t v + c_t) over v results in the quadratic function

(T h)(z) = min_v ( l(z, v) + gamma E h(A_t z + B_t v + c_t) ) = (1/2) [z; 1]^T [S_11, S_12; S_12^T, S_22] [z; 1],

where

[S_11, S_12; S_12^T, S_22] = [M_11, M_13; M_13^T, M_33] - [M_12; M_23^T] M_22^{-1} [M_12^T, M_23]

is the Schur complement of M with respect to its (2,2) block; when M_22 is singular, we can replace M_22^{-1} with the pseudo-inverse [41, Section A.5.5]. Furthermore, because

[M_11, M_12; M_12^T, M_22] >= 0,

we can conclude that S_11 >= 0; thus, (T h)(z) is a convex quadratic function, that is, T h in Q_n. This shows that convex quadratic functions are invariant under the Bellman operator T. From this, it follows that the value function is convex quadratic. Again, in the finite-horizon case, this shows that each V_t is convex quadratic. In the infinite-horizon and average-cost cases, this argument shows that convex quadratic functions are preserved under value iteration, so if the initial guess is convex quadratic, all iterates are convex quadratic.
The limit of a sequence of convex quadratic functions is convex quadratic; thus, we can conclude that V is a convex quadratic function.
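The partial minimization above is a small linear-algebra computation. The following sketch (with randomly generated, hypothetical coefficient data) carries out one such step via the Schur complement and checks it against direct minimization over v:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 2

# A convex quadratic in (z, v, 1): (1/2) [z; v; 1]^T M [z; v; 1].
# We make M positive definite so the minimizer over v is unique.
F = rng.standard_normal((n + m + 1, n + m + 1))
M = F.T @ F + np.eye(n + m + 1)

keep = list(range(n)) + [n + m]          # indices of the (z, 1) blocks
vidx = list(range(n, n + m))             # indices of the v block

M_kk = M[np.ix_(keep, keep)]
M_kv = M[np.ix_(keep, vidx)]
M_vv = M[np.ix_(vidx, vidx)]

# Schur complement of the v block: coefficients of the minimized quadratic.
S = M_kk - M_kv @ np.linalg.solve(M_vv, M_kv.T)

# Check at a random point z: minimize explicitly over v and compare.
z = rng.standard_normal(n)
zk = np.concatenate([z, [1.0]])
v_star = -np.linalg.solve(M_vv, M_kv.T @ zk)     # argmin over v
full = np.concatenate([z, v_star, [1.0]])
val_min = 0.5 * full @ M @ full                  # minimized value, direct
val_schur = 0.5 * zk @ S @ zk                    # value via Schur complement
print(abs(val_min - val_schur) < 1e-9)
```

Since M is positive definite here, S is positive definite as well, which is the matrix form of the claim that T h remains a convex quadratic.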
3.3. Approximate dynamic programming

Approximate dynamic programming is a general approach for obtaining a good suboptimal policy. The basic idea is to replace the true value function V (or V_t in the finite-horizon case) with an approximation V-hat (V-hat_t), which yields the ADP policies

phi-hat_t(x) = argmin_u ( l_t(x, u) + E V-hat_{t+1}(f_t(x) + g_t(x) u) )

for the finite-horizon case,

phi-hat(x) = argmin_u ( l(x, u) + gamma E V-hat(f(x) + g(x) u) )

for the discounted infinite-horizon case, and

phi-hat(x) = argmin_u ( l(x, u) + E V-hat(f(x) + g(x) u) )

for the average-cost case. These ADP policies reduce to the optimal policy for the choice V-hat_t = V_t. The goal is to choose the approximate value functions so that the minimizations needed to compute phi-hat(x) can be carried out efficiently, and so that the objective value J-hat attained by the resulting policy is small, that is, we hope, close to J*. It has been observed in practice that good policies (i.e., ones with small objective values) can result even when V-hat is not a particularly good approximation of V.

3.4. Quadratic approximate dynamic programming

In quadratic ADP, we take V-hat in Q_n (or V-hat_t in Q_n in the time-varying case), that is,

V-hat(z) = (1/2) [z; 1]^T [P, p; p^T, q] [z; 1],

where P >= 0. With input-affine dynamics, we have

V-hat(f_t(z) + g_t(z) v) = (1/2) [f_t(z) + g_t(z) v; 1]^T [P, p; p^T, q] [f_t(z) + g_t(z) v; 1],

where (f_t, g_t) is a random variable with known mean and covariance matrix. Thus, the expectation E V-hat(f_t(z) + g_t(z) v) is a convex quadratic function of (z, v), with coefficients that can be computed using the first and second moments of (f_t, g_t); see the Appendix. A special case is when the cost function l(z, v) (or l_t(z, v) in the time-varying case) is QP-representable. Then, for a given z, the quadratic ADP policy

phi-hat(z) = argmin_v ( l(z, v) + E V-hat(f_t(z) + g_t(z) v) )

can be evaluated by solving a QP.
4. PROJECTED VALUE ITERATION

Projected value iteration is a standard method that can be used to obtain an approximate value function from a given (finite-dimensional) set of candidate approximate value functions, such as Q_n. The idea is to carry out value iteration but to follow each Bellman operator step with a step that approximates the result with an element from the given set of candidate approximate value functions. The approximation step is typically called a projection and is denoted Pi, because it maps into the set of candidate approximate value functions, but it need
not minimize any distance measure). The resulting method for computing an approximate value function is called projected value iteration, and it has been explored by many authors [3, Chapter 6; , Chapter ; ; 25; 27]. Here, we consider scenarios in which the state space and action space are continuous and establish a framework to carry out projected value iteration for those problems. Projected value iteration has the following form for our three problems. For the finite-horizon problem, we start from V-hat_{T+1} = 0 and compute approximate value functions as

V-hat_t = Pi T_t V-hat_{t+1}, t = T, T-1, ..., 0.

For the discounted infinite-horizon problem, we have the iteration

V-hat^(k+1) = Pi T V-hat^(k), k = 1, 2, ....

There is no reason to believe that this iteration always converges, although as a practical matter, it typically does. In any case, we terminate after some number of iterations, and we judge the whole method not in terms of convergence of projected value iteration but in terms of the performance of the policy obtained. For the average-cost problem, projected value iteration has the form

J-hat^(k+1) = (T V-hat^(k))(x_ref),    V-hat^(k+1) = Pi( T V-hat^(k) - J-hat^(k+1) ), k = 1, 2, ....

Here too, we cannot guarantee convergence, although the iteration typically converges in practice. In the aforementioned expressions, the V-hat^(k) are elements of the given set of candidate approximate value functions, but T V-hat^(k) usually is not; indeed, we have no general way of representing T V-hat^(k). The iterations given previously can be carried out, however, because Pi only requires the evaluation of (T V-hat^(k))(x^(i)) (which are numbers) at a finite number of points x^(i).

4.1. Quadratic projected iteration

We now consider the case of quadratic ADP, that is, our subset of candidate approximate value functions is Q_n. In this case, the expectation appearing in the Bellman operator can be carried out exactly.
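The claim that the expectation can be carried out exactly reduces, for linear dynamics, to a moment computation. In the simplest special case, where only the offset c is random with mean c-bar and covariance Sigma_c, we have E (Az + Bv + c)^T P (Az + Bv + c) = mu^T P mu + tr(P Sigma_c) with mu = Az + Bv + c-bar. Below is our quick numerical check of this identity with hypothetical data; the paper's Appendix gives the general formulas.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 2

# Hypothetical data: fixed A, B; random offset c with known mean/covariance.
A = 0.5 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
c_bar = rng.standard_normal(n)
sd = np.array([0.3, 0.4, 0.5])
Sigma_c = np.diag(sd**2)                  # covariance of c (diagonal here)

Pf = rng.standard_normal((n, n))
P = Pf.T @ Pf + np.eye(n)                 # P >= 0: quadratic h(x) = x^T P x

z = rng.standard_normal(n)
v = rng.standard_normal(m)
mean_next = A @ z + B @ v + c_bar

# Exact expectation from first and second moments.
exact = mean_next @ P @ mean_next + np.trace(P @ Sigma_c)

# Monte Carlo estimate for comparison.
cs = c_bar + sd * rng.standard_normal((200_000, n))
x_next = (A @ z + B @ v) + cs
mc = np.einsum('ij,jk,ik->i', x_next, P, x_next).mean()
print(f"exact {exact:.3f}, monte carlo {mc:.3f}")
```

The exact value costs one matrix-vector pass per evaluation, which is what makes the Bellman step tractable inside the policy and the projection.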
We can evaluate (T V-hat^(k))(x) for any x by solving a convex optimization problem; when the stage cost is QP-representable, we can evaluate (T V-hat^(k))(x) by solving a QP. We form Pi T V-hat^(k) as follows. We choose a set of sampling points x^(1), ..., x^(N) in R^n and evaluate y^(i) = (T V-hat^(k))(x^(i)), for i = 1, ..., N. We take Pi T V-hat^(k) to be any solution of the least squares fitting problem

minimize   sum_{i=1}^N ( V(x^(i)) - y^(i) )^2
subject to V in Q_n,

with variable V. In other words, we simply fit a convex quadratic function to the values of the Bellman operator at the sample points. This requires the solution of a semidefinite program [0, 4]. Many variations on this basic projection method are possible. We can use any convex fitting criterion instead of a least squares criterion, for example, the sum of absolute errors. We can also add constraints on V that represent prior knowledge (beyond convexity). For example, we may have a quadratic lower bound on the true value function, and we can add this condition as a stronger constraint on V than simple convexity. (Quadratic lower bounds can be found using the methods described in [6, 43].) Another variation on the simple fitting method described earlier adds proximal regularization to the fitting step. This means that we add a term of the form (mu/2) ||V - V^(k)||_F to the objective, which
penalizes deviations of V from the previous value. Here, mu > 0 is a parameter, and the norm is the Frobenius norm of the coefficient matrices,

||V - V^(k)||_F = || [P, p; p^T, q] - [P^(k), p^(k); (p^(k))^T, q^(k)] ||_F.

Choice of sampling points. The choice of the evaluation points x^(i) affects our projection method, so it is important to use at least a reasonable choice. An ideal choice would be to sample points from the state distribution in each period, under the final ADP policy chosen. But this cannot be done: we need to run projected value iteration to obtain the ADP policy, and we need the ADP policy to sample the points used for Pi. A good method is to start with a reasonable choice of sample points and use these to construct an ADP policy. Then we run this ADP policy to obtain a new sampling of x_t for each t. We now repeat the projected value iteration, using these sample points. This sometimes gives an improvement in performance.

5. INPUT-CONSTRAINED LINEAR QUADRATIC CONTROL

Our first example is a traditional time-invariant linear control system, with dynamics

x_{t+1} = A x_t + B u_t + w_t,

which has the form used in this paper with

f_t(z) = A z + w_t,    g_t(z) = B.

Here, A in R^{n x n} and B in R^{n x m} are known, and w_t is an IID random variable with zero mean E w_t = 0 and covariance E w_t w_t^T = W. The stage cost is the traditional quadratic one, plus a constraint on the input amplitude:

l(x_t, u_t) = x_t^T Q x_t + u_t^T R u_t for ||u_t|| <= U_max, and l(x_t, u_t) = +inf for ||u_t|| > U_max,

where Q >= 0 and R >= 0. We consider the average-cost problem.

Numerical instance. We consider a problem with n = 10, m = 2, U_max = 1, Q = 10 I, and R = I. The entries of A are chosen from N(0, 1), and the entries of B are chosen from a uniform distribution on [-0.5, 0.5]. The matrix A is then scaled so that sigma_max(A) = 1. The noise distribution is normal, w_t ~ N(0, I).

Method parameters.
We start from x_0 = 0 at iteration 0, and V̂^(1) = V_quad, where V_quad is the (true) quadratic value function when the stage cost is replaced with ℓ_quad(x, u) = xᵀQx + uᵀRu for all (x, u). In each iteration of projected value iteration, we run the ADP policy for N = 1000 time steps and evaluate T V̂^(k) at each of these sample points. We then fit a new quadratic approximate value function V̂^(k+1), enforcing the lower bound V ⪰ V_quad and a proximal term with ρ = 0.1. Here, the proximal term has little effect on the performance of the algorithm, and we could have eliminated it at not much cost to the performance. We use x_N to start the next iteration. For comparison, we also carry out our method with the more naive initial choice V̂^(1)(x) = xᵀQx, and we drop the lower bound V ⪰ V_quad.

Results. Figure 1 shows the performance J^(k) of the ADP policy (evaluated by a simulation run of 10,000 steps) after each round of projected value iteration. The performance of the policies obtained using the naive parameter choices is denoted by J^(k)_naive. For this problem, we can evaluate a lower bound J_lb on J by using the method of [6]. This shows that the performance of our ADP
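Because the input constraint in this example is a Euclidean ball, the ADP policy can even be evaluated without a generic QP solver: minimizing a convex quadratic over ‖u‖₂ ≤ U_max is a trust-region problem, solvable by bisection on the Lagrange multiplier. The sketch below is our own construction (names are ours), not the generated code used in the paper:

```python
import numpy as np

def adp_policy(x, A, B, R, P, p, w_bar, U_max):
    """Minimize u'Ru + E Vhat(Ax + Bu + w) over ||u||_2 <= U_max, with
    Vhat(z) = z'Pz + p'z + q; the constant and the x'Qx term do not
    affect the minimizer. Requires R > 0 and P >= 0."""
    m = B.shape[1]
    H = R + B.T @ P @ B                          # quadratic coefficient in u
    g = B.T @ (2.0 * P @ (A @ x + w_bar) + p)    # linear coefficient in u
    u = np.linalg.solve(2.0 * H, -g)             # unconstrained minimizer
    if np.linalg.norm(u) <= U_max:
        return u
    # bisect on lam >= 0 so that ||u(lam)|| = U_max, with
    # u(lam) = -(2H + 2*lam*I)^{-1} g (norm is decreasing in lam)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(np.linalg.solve(2.0 * (H + hi * np.eye(m)), -g)) > U_max:
        hi *= 2.0
    for _ in range(60):
        lam = 0.5 * (lo + hi)
        u = np.linalg.solve(2.0 * (H + lam * np.eye(m)), -g)
        lo, hi = (lam, hi) if np.linalg.norm(u) > U_max else (lo, lam)
    return u
```

The unconstrained branch returns the exact stationary point of the quadratic; the bisection branch returns a point on the boundary of the ball, accurate to the bisection tolerance.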
policy at termination is at most 4% suboptimal (but likely much closer to optimal). The plots show that the naive initialization only results in a few more projected value iteration steps. Close examination of the plots shows that the performance need not improve in each step. For example, the best performance (but only by a very small amount) with the more sophisticated initialization is obtained in iteration 9.

Figure 1. J^(k) (solid line) and J^(k)_naive (dashed line).

Computation timing. The timing results here are reported for a 3.4-GHz Intel Xeon processor running Linux. Evaluating the ADP policy requires solving a small QP with 10 variables and 8 constraints. We use CVXGEN [38] to generate custom C code for this QP, which executes in around 50 μs. The simulation of 1000 time steps, which we use in each projected value iteration, requires around 0.05 s. The fitting is carried out using CVX [44] and takes around a second (this time could be sped up considerably).

6. MULTIPERIOD PORTFOLIO OPTIMIZATION

We consider a discounted infinite-horizon problem of optimizing a portfolio of n assets with discount factor γ. The position at time t is denoted by x_t ∈ R^n, where (x_t)_i denotes the dollar value of asset i at the beginning of time t. At each time t, we can buy and sell assets. We denote the trades by u_t ∈ R^n, where (u_t)_i denotes the trade for asset i. A positive (u_t)_i means that we buy asset i, and a negative (u_t)_i means that we sell asset i at time t. The position evolves according to the dynamics

x_{t+1} = diag(r_t)(x_t + u_t),

where r_t is the vector of asset returns in period t. We express this in our form as

f_t(z) = diag(r_t) z,  g_t(z) = diag(r_t).

The return vectors are IID with mean E r_t = r̄ and covariance E (r_t − r̄)(r_t − r̄)ᵀ = Σ.
The stage cost consists of the following terms: the total gross cash put in, which is 1ᵀu_t; a risk penalty (a multiple of the variance of the post-trade portfolio); a quadratic transaction cost; and the constraint that the portfolio is long-only (which is x_t + u_t ≥ 0). It is given by

ℓ(x, u) = 1ᵀu + λ(x + u)ᵀΣ(x + u) + uᵀ diag(s) u if x + u ≥ 0, and ∞ otherwise,

where λ > 0 is the risk aversion parameter and s_i > 0 models the price-impact cost for asset i. We assume that the initial portfolio x_0 is given.
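As a concrete check of the stage cost above, here is a direct NumPy transcription (the function name, and the use of `lam` for the risk-aversion parameter, are ours):

```python
import numpy as np

def stage_cost(x, u, Sigma, s, lam):
    """Portfolio stage cost: gross cash put in (1'u), a risk penalty
    lam * (post-trade variance), a quadratic transaction cost, and an
    infinite penalty enforcing the long-only constraint x + u >= 0."""
    post = x + u                      # post-trade portfolio
    if np.any(post < 0):
        return np.inf                 # long-only constraint violated
    return u.sum() + lam * post @ Sigma @ post + u @ (s * u)
```

For example, with Σ = I, s = (1, 1), λ = 0.5, x = (1, 0), and u = (0, 1), the cost is 1 (cash in) + 1 (risk) + 1 (transaction) = 3, while any trade that drives a position negative costs +∞.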
Numerical instance. We consider n = 10 assets, a discount factor of γ = 0.99, and an initial portfolio of x_0 = 0. The returns r_t are IID and have a lognormal distribution, that is, log r_t ∼ N(μ, Σ̃). The means μ_i are chosen from a N(0, 0.01) distribution; the diagonal entries of Σ̃ are chosen from a uniform [0, 0.01] distribution, and the off-diagonal entries are generated using Σ̃_ij = C_ij (Σ̃_ii Σ̃_jj)^{1/2}, where C is a correlation matrix, generated randomly. Because the returns are lognormal, we have

r̄ = exp(μ + (1/2) diag(Σ̃)),  Σ_ij = r̄_i r̄_j (exp(Σ̃_ij) − 1).

We take λ = 0.5 and choose the entries of s from a uniform [0, 1] distribution.

Method parameters. We start with V̂^(1) = V_quad, where V_quad is the (true) quadratic value function when the long-only constraint is ignored. Furthermore, V_quad is a lower bound on the true value function V. We choose the evaluation points in each iteration of projected value iteration by starting from x_0 and running the policy induced by V̂^(k) for 1000 steps. We then fit a quadratic approximate value function, enforcing the lower bound V ⪰ V_quad and a proximal term with ρ = 0.1. The proximal term in this problem has a much more significant effect on the performance of projected value iteration: smaller values of ρ require many more iterations of projected value iteration to achieve the same performance.

Results. Figure 2 demonstrates the performance of the ADP policies, where J^(k) corresponds to the ADP policy at iteration k of projected value iteration. In each case, we run 1000 Monte Carlo simulations, each for 1000 steps, and report the average total discounted cost across the 1000 time steps. We also plot a lower bound J_lb obtained using the methods described in [43, 45]. We can see that after around 80 steps of projected value iteration, we arrive at a policy that is nearly optimal.
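The closed-form moment formulas for lognormal returns are easy to verify numerically; this sketch (names ours) computes r̄ and Σ from μ and Σ̃:

```python
import numpy as np

def lognormal_return_moments(mu, Sigma_t):
    """Mean and covariance of r when log r ~ N(mu, Sigma_t):
    r_bar = exp(mu + diag(Sigma_t)/2),
    Sigma_ij = r_bar_i * r_bar_j * (exp(Sigma_t_ij) - 1)."""
    r_bar = np.exp(mu + 0.5 * np.diag(Sigma_t))
    Sigma = np.outer(r_bar, r_bar) * (np.exp(Sigma_t) - 1.0)
    return r_bar, Sigma
```

A Monte Carlo check (draw log-returns from N(μ, Σ̃), exponentiate, and compare the sample mean and covariance against `r_bar` and `Sigma`) agrees with these formulas to sampling accuracy.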
It is also interesting to note that it takes around 40 projected value iteration steps before we produce a policy that is profitable (i.e., that has J ≤ 0).

Computation timing. Evaluating the ADP policy requires solving a small QP with 10 variables and 10 constraints. We use CVXGEN [38] to generate custom C code for this QP, which executes in a few microseconds. The simulation of 1000 time steps, which we use in each projected value iteration, requires a few hundredths of a second. The fitting is carried out using CVX [44] and takes around a second (this time could be sped up considerably).

Figure 2. J^(k) (solid line) and a lower bound J_lb (dashed line).
7. VEHICLE CONTROL

We consider a rigid body vehicle moving in two dimensions, with continuous-time state x_c = (p, v, θ, ω), where p ∈ R² is the position, v ∈ R² is the velocity, θ ∈ R is the angle (orientation) of the vehicle, and ω is the angular velocity. The vehicle is subject to control forces u_c ∈ R^k, which put a net force and torque on the vehicle. The vehicle dynamics in continuous time are given by

ẋ_c = A_c x_c + B_c(x_c) u_c + w_c,

where w_c is a continuous-time random noise process and

A_c = [0, I, 0, 0; 0, 0, 0, 0; 0, 0, 0, 1; 0, 0, 0, 0],  B_c(x) = [0; R(x₅)C/m; 0; D/I],

with R(θ) the 2×2 planar rotation

R(θ) = [cos θ, −sin θ; sin θ, cos θ].

The matrices C ∈ R^{2×k} and D ∈ R^{1×k} depend on where the forces are applied with respect to the center of mass of the vehicle. We consider a discretized forward Euler approximation of the vehicle dynamics with discretization epoch h. With x_c(ht) = x_t, u_c(ht) = u_t, and w_c(ht) = w_t, the discretized dynamics have the form

x_{t+1} = (I + hA_c)x_t + hB_c(x_t)u_t + hw_t,

which has an input-affine form with

f_t(z) = (I + hA_c)z + hw_t,  g_t(z) = hB_c(z).

We assume that the w_t are IID with mean w̄ and covariance matrix W. We consider a finite-horizon trajectory tracking problem, with desired trajectory given as x_t^des, t = 1, …, T. The stage cost at time t is

ℓ_t(x, u) = uᵀRu + (x − x_t^des)ᵀQ(x − x_t^des) if |u| ≤ u_max, and ∞ if |u| > u_max,

where R ∈ S₊^k, Q ∈ R^{n×n}, u_max ∈ R₊^k, and the inequalities in u are element-wise.

Numerical instance. We consider a vehicle with unit mass m = 1 and moment of inertia I = 1/50. There are k = 4 inputs applied to the vehicle, shown in Figure 3, at distance r = 1/40 to the center of mass of the vehicle. We consider T = 25 time steps, with a discretization epoch of h = 1/50. The matrices C and D are

C = [1, 0, 1, 0; 0, 1, 0, 1],  D = r [1, 0, −1, 0].

The maximum allowable input is u_max = (1.5, 1.5, 1.5, 1.5), and the input cost matrix is R = I/1000. The state cost matrix Q is a diagonal matrix, chosen as follows: we first choose the entries of Q to normalize the range of the state variables, based on x^des.
Then we multiply the weights of the first and second states by 10, to penalize a deviation in the position more than a deviation in the speed or orientation of the vehicle.
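The forward Euler step of the vehicle model can be sketched directly (names are ours; we use the standard planar rotation for R(θ), an assumption, since only its sin/cos structure survives in this transcription):

```python
import numpy as np

def euler_step(x, u, h, m, inertia, C, D, w=None):
    """One forward-Euler step of the planar rigid-body model with
    state x = (p, v, theta, omega): p' = p + h*v, v' = v + h*R(theta)Cu/m,
    theta' = theta + h*omega, omega' = omega + h*Du/inertia, plus noise."""
    p, v, theta, omega = x[:2], x[2:4], x[4], x[5]
    if w is None:
        w = np.zeros(6)
    Rth = np.array([[np.cos(theta), -np.sin(theta)],   # assumed sign
                    [np.sin(theta),  np.cos(theta)]])  # convention
    dx = np.concatenate([v, Rth @ (C @ u) / m, [omega], (D @ u) / inertia])
    return x + h * dx + h * w

# example C and D (hypothetical, matching the two-thruster layout sketch):
C = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
D = (1.0 / 40.0) * np.array([[1.0, 0.0, -1.0, 0.0]])
```

With zero input and zero noise, the step simply integrates the velocity into the position; an opposed pair of lateral forces produces a pure torque (no net force), which is the intended role of D.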
Figure 3. Vehicle diagram. The forces are applied at two positions, shown with large circles. At each position, there are two forces applied: a lateral force (u₁ and u₃) and a longitudinal force (u₂ and u₄).

Method parameters. We start at the end of the horizon, with V̂_T(x) = min_u ℓ_T(x, u), which is quadratic and convex in x. At each step t of projected value iteration, we choose N = 1000 evaluation points, generated using a N(x_t^des, 0.5I) distribution. We evaluate T_t V̂_{t+1} at each evaluation point and fit the quadratic approximate value function V̂_t.

Results. We calculate the average total cost by using Monte Carlo simulation with 1000 samples, that is, 1000 simulations of 25 steps. We compare the performance of two ADP policies: the ADP policy resulting from V̂_t results in J = 106, and the ADP policy resulting from V_t^quad results in J^quad = 145. Here, V_t^quad is the quadratic value function obtained by linearizing the dynamics along the desired trajectory and ignoring the actuator limits. Figure 4 shows two sample trajectories generated using V̂_t and V_t^quad.

Computation timing. Evaluating the ADP policy requires solving a small QP with four variables and eight constraints. We use CVXGEN [38] to generate custom C code for this QP, which executes in around 10 μs. The simulation of 1000 time steps, which we use in each projected value iteration, requires a few hundredths of a second. The fitting is carried out using CVX [44] and takes around 0.5 s (this time could be sped up considerably).

Figure 4. Desired trajectory (dashed line); two sample trajectories obtained using V̂_t, t = 1, …, T (left); and two sample trajectories obtained using V_t^quad (right).
8. CONCLUSIONS

We have shown how to carry out projected value iteration with quadratic approximate value functions for input-affine systems. Unlike value iteration (when the technical conditions hold), projected value iteration need not converge, but it typically yields a policy with good performance in a reasonable number of steps. Evaluating the quadratic ADP policy requires the solution of a (typically small) convex optimization problem, or more specifically a QP when the stage cost is QP-representable. Recently developed fast methods and code generation techniques allow such problems to be solved very quickly, which means that quadratic ADP can be carried out at kilohertz (or faster) rates. Our ability to evaluate the policy very quickly makes projected value iteration practical, because many evaluations can be carried out in reasonable time. Although we offer no theoretical guarantees on the performance attained, we have observed that the performance is typically very good.

APPENDIX: EXPECTATION OF QUADRATIC FUNCTION

In this appendix, we show that when h: R^n → R is a convex quadratic function, so is E h(f_t(z) + g_t(z)v). We compute the explicit coefficients A, a, and b in the expression

E h(f_t(z) + g_t(z)v) = (1/2) [v; 1]ᵀ [A, a; aᵀ, b] [v; 1].    (4)

These coefficients depend only on the first and second moments of (f_t, g_t). To simplify notation, we write f_t(z) as f and g_t(z) as g. Suppose h has the form

h(z) = (1/2) [z; 1]ᵀ [P, p; pᵀ, q] [z; 1],

with P ⪰ 0. With input-affine dynamics, we have

h(f + gv) = (1/2) [f + gv; 1]ᵀ [P, p; pᵀ, q] [f + gv; 1].

Taking the expectation, we see that E h(f + gv) takes the form of (4), with coefficients given by

A_ij = Σ_{k,l} P_kl E(g_ki g_lj),  i, j = 1, …, m,
a_i = Σ_{k,l} P_kl E(g_ki f_l) + Σ_j p_j E(g_ji),  i = 1, …, m,
b = q + Σ_{i,j} P_ij E(f_i f_j) + 2 Σ_i p_i E(f_i).

ACKNOWLEDGEMENTS

We would like to thank an anonymous reviewer for pointing us to the literature in reinforcement learning on fitted value iteration and fitted Q-iteration algorithms.

REFERENCES

1. Bellman R. Dynamic Programming.
Princeton University Press: Princeton, NJ, USA, 1957.
2. Bertsekas D. Dynamic Programming and Optimal Control: Volume 1. Athena Scientific: Nashua, NH, USA.
3. Bertsekas D. Dynamic Programming and Optimal Control: Volume 2. Athena Scientific: Nashua, NH, USA.
4. Puterman M. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.: New York, NY, USA, 1994.
5. Bertsekas DP, Castañón DA. Rollout algorithms for stochastic scheduling problems. Journal of Heuristics 1999; 5.
6. Wang Y, Boyd S. Performance bounds for linear stochastic control. Systems and Control Letters March 2009; 58(3).
7. Wang Y, Boyd S. Approximate dynamic programming via iterated Bellman inequalities. Manuscript.
8. O'Donoghue B, Wang Y, Boyd S. Min-max approximate dynamic programming. In Proceedings of the IEEE Multi-Conference on Systems and Control, September 2011.
9. Keshavarz A, Wang Y, Boyd S. Imputing a convex objective function. In Proceedings of the IEEE Multi-Conference on Systems and Control, September 2011.
10. Boyd S, El Ghaoui L, Feron E, Balakrishnan V. Linear Matrix Inequalities in System and Control Theory. SIAM Books: Philadelphia, 1994.
11. Ng A, Russell S. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning. Morgan Kaufmann: San Francisco, CA, USA, 2000.
12. Abbeel P, Coates A, Ng A. Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research 2010; 29.
13. Ackerberg D, Benkard C, Berry S, Pakes A. Econometric tools for analyzing market outcomes. In Handbook of Econometrics, Volume 6, Heckman J, Leamer E (eds), chapter 63. Elsevier: North Holland, 2007.
14. Bajari P, Benkard C, Levin J. Estimating dynamic models of imperfect competition. Econometrica 2007; 75(5).
15. Nielsen T, Jensen F. Learning a decision maker's utility function from (possibly) inconsistent behavior. Artificial Intelligence 2004; 160.
16. Watkins C. Learning from delayed rewards. Ph.D. Thesis, Cambridge University, Cambridge, England, 1989.
17. Watkins C, Dayan P. Technical note: Q-learning. Machine Learning 1992; 8(3).
18. Kwon WH, Han S. Receding Horizon Control. Springer-Verlag: London, UK, 2005.
19. Whittle P. Optimization Over Time. John Wiley & Sons, Inc.: New York, NY, USA, 1982.
20. Goodwin GC, Seron MM, De Doná JA. Constrained Control and Estimation.
Springer: New York, USA, 2005.
21. Judd KL. Numerical Methods in Economics. MIT Press: Cambridge, Massachusetts, USA, 1998.
22. Powell W. Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley Series in Probability and Statistics. John Wiley & Sons: Hoboken, NJ, USA, 2007.
23. Ernst D, Geurts P, Wehenkel L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 2005; 6.
24. Riedmiller M. Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning. Springer: Porto, Portugal, 2005.
25. Munos R. Performance bounds in L_p-norm for approximate value iteration. SIAM Journal on Control and Optimization 2007; 46(2).
26. Munos R, Szepesvári C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research 2008; 9.
27. Antos A, Munos R, Szepesvári C. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems 20, 2008.
28. De Farias D, Van Roy B. The linear programming approach to approximate dynamic programming. Operations Research 2003; 51(6).
29. Desai V, Farias V, Moallemi C. Approximate dynamic programming via a smoothed linear program. Operations Research May–June 2012; 60(3).
30. Lagoudakis M, Parr R. Least-squares policy iteration. Journal of Machine Learning Research 2003; 4.
31. Tsitsiklis JN, Van Roy B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 1997; 42.
32. Van Roy B. Learning and value function approximation in complex decision processes. Ph.D. Thesis, Department of EECS, MIT, 1998.
33. Bemporad A, Morari M, Dua V, Pistikopoulos E. The explicit linear quadratic regulator for constrained systems. Automatica January 2002; 38(1).
34. Wang Y, Boyd S. Fast model predictive control using online optimization. In Proceedings of the 17th IFAC World Congress, 2008.
35. Mattingley JE, Boyd S. Real-time convex optimization in signal processing.
IEEE Signal Processing Magazine 2010; 27(3).
36. Mattingley J, Wang Y, Boyd S. Code generation for receding horizon control. In IEEE Multi-Conference on Systems and Control, 2010.
37. Mattingley JE, Boyd S. Automatic code generation for real-time convex optimization. In Convex Optimization in Signal Processing and Communications, Palomar DP, Eldar YC (eds). Cambridge University Press: New York, USA, 2010.
38. Mattingley J, Boyd S. CVXGEN: a code generator for embedded convex optimization. Optimization and Engineering 2012; 13(1).
More informationReinforcement Learning. Introduction
Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control
More informationLecture 1: March 7, 2018
Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights
More informationApproximate Optimal-Value Functions. Satinder P. Singh Richard C. Yee. University of Massachusetts.
An Upper Bound on the oss from Approximate Optimal-Value Functions Satinder P. Singh Richard C. Yee Department of Computer Science University of Massachusetts Amherst, MA 01003 singh@cs.umass.edu, yee@cs.umass.edu
More informationPolicy Gradient Reinforcement Learning for Robotics
Policy Gradient Reinforcement Learning for Robotics Michael C. Koval mkoval@cs.rutgers.edu Michael L. Littman mlittman@cs.rutgers.edu May 9, 211 1 Introduction Learning in an environment with a continuous
More informationIntroduction to Reinforcement Learning. CMPT 882 Mar. 18
Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and
More informationTemporal-Difference Q-learning in Active Fault Diagnosis
Temporal-Difference Q-learning in Active Fault Diagnosis Jan Škach 1 Ivo Punčochář 1 Frank L. Lewis 2 1 Identification and Decision Making Research Group (IDM) European Centre of Excellence - NTIS University
More informationToday s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning
CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides
More informationLinearly-solvable Markov decision problems
Advances in Neural Information Processing Systems 2 Linearly-solvable Markov decision problems Emanuel Todorov Department of Cognitive Science University of California San Diego todorov@cogsci.ucsd.edu
More informationNOTES ON THE REGULAR E-OPTIMAL SPRING BALANCE WEIGHING DESIGNS WITH CORRE- LATED ERRORS
REVSTAT Statistical Journal Volume 3, Number 2, June 205, 9 29 NOTES ON THE REGULAR E-OPTIMAL SPRING BALANCE WEIGHING DESIGNS WITH CORRE- LATED ERRORS Authors: Bronis law Ceranka Department of Mathematical
More informationFINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES. Danlei Chu, Tongwen Chen, Horacio J. Marquez
FINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES Danlei Chu Tongwen Chen Horacio J Marquez Department of Electrical and Computer Engineering University of Alberta Edmonton
More informationChapter 2 Optimal Control Problem
Chapter 2 Optimal Control Problem Optimal control of any process can be achieved either in open or closed loop. In the following two chapters we concentrate mainly on the first class. The first chapter
More informationOptimal Stopping Problems
2.997 Decision Making in Large Scale Systems March 3 MIT, Spring 2004 Handout #9 Lecture Note 5 Optimal Stopping Problems In the last lecture, we have analyzed the behavior of T D(λ) for approximating
More informationarxiv: v1 [math.ds] 19 Jul 2016
Mutually Quadratically Inariant Information Structures in Two-Team Stochastic Dynamic Games Marcello Colombino Roy S Smith and Tyler H Summers arxi:160705461 mathds 19 Jul 016 Abstract We formulate a two-team
More information1 Problem Formulation
Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning
More informationBayesian Active Learning With Basis Functions
Bayesian Active Learning With Basis Functions Ilya O. Ryzhov Warren B. Powell Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA IEEE ADPRL April 13, 2011 1 / 29
More informationThe Art of Sequential Optimization via Simulations
The Art of Sequential Optimization via Simulations Stochastic Systems and Learning Laboratory EE, CS* & ISE* Departments Viterbi School of Engineering University of Southern California (Based on joint
More informationTowards Green Distributed Storage Systems
Towards Green Distributed Storage Systems Abdelrahman M. Ibrahim, Ahmed A. Zewail, and Aylin Yener Wireless Communications and Networking Laboratory (WCAN) Electrical Engineering Department The Pennsylania
More informationModeling Highway Traffic Volumes
Modeling Highway Traffic Volumes Tomáš Šingliar1 and Miloš Hauskrecht 1 Computer Science Dept, Uniersity of Pittsburgh, Pittsburgh, PA 15260 {tomas, milos}@cs.pitt.edu Abstract. Most traffic management
More informationStochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions
International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.
More informationStatic Output Feedback Stabilisation with H Performance for a Class of Plants
Static Output Feedback Stabilisation with H Performance for a Class of Plants E. Prempain and I. Postlethwaite Control and Instrumentation Research, Department of Engineering, University of Leicester,
More informationAn Elastic Contact Problem with Normal Compliance and Memory Term
Machine Dynamics Research 2012, Vol. 36, No 1, 15 25 Abstract An Elastic Contact Problem with Normal Compliance and Memory Term Mikäel Barboteu, Ahmad Ramadan, Mircea Sofonea Uniersité de Perpignan, France
More informationBias-Variance Error Bounds for Temporal Difference Updates
Bias-Variance Bounds for Temporal Difference Updates Michael Kearns AT&T Labs mkearns@research.att.com Satinder Singh AT&T Labs baveja@research.att.com Abstract We give the first rigorous upper bounds
More informationDynamic Vehicle Routing with Heterogeneous Demands
Dynamic Vehicle Routing with Heterogeneous Demands Stephen L. Smith Marco Paone Francesco Bullo Emilio Frazzoli Abstract In this paper we study a ariation of the Dynamic Traeling Repairperson Problem DTRP
More informationNotes on Tabular Methods
Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from
More information6 Reinforcement Learning
6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,
More informationReinforcement Learning and Control
CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make
More informationQ-Learning and SARSA: Machine learningbased stochastic control approaches for financial trading
Q-Learning and SARSA: Machine learningbased stochastic control approaches for financial trading Marco CORAZZA (corazza@unive.it) Department of Economics Ca' Foscari University of Venice CONFERENCE ON COMPUTATIONAL
More informationPrioritized Sweeping Converges to the Optimal Value Function
Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science
More informationDynamic Vehicle Routing with Moving Demands Part II: High speed demands or low arrival rates
ACC 9, Submitted St. Louis, MO Dynamic Vehicle Routing with Moing Demands Part II: High speed demands or low arrial rates Stephen L. Smith Shaunak D. Bopardikar Francesco Bullo João P. Hespanha Abstract
More informationOn Regression-Based Stopping Times
On Regression-Based Stopping Times Benjamin Van Roy Stanford University October 14, 2009 Abstract We study approaches that fit a linear combination of basis functions to the continuation value function
More informationDuality in Robust Dynamic Programming: Pricing Convertibles, Stochastic Games and Control
Duality in Robust Dynamic Programming: Pricing Convertibles, Stochastic Games and Control Shyam S Chandramouli Abstract Many decision making problems that arise in inance, Economics, Inventory etc. can
More informationA Regularization Framework for Learning from Graph Data
A Regularization Framework for Learning from Graph Data Dengyong Zhou Max Planck Institute for Biological Cybernetics Spemannstr. 38, 7076 Tuebingen, Germany Bernhard Schölkopf Max Planck Institute for
More informationProject Discussions: SNL/ADMM, MDP/Randomization, Quadratic Regularization, and Online Linear Programming
Project Discussions: SNL/ADMM, MDP/Randomization, Quadratic Regularization, and Online Linear Programming Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305,
More informationOnline solution of the average cost Kullback-Leibler optimization problem
Online solution of the average cost Kullback-Leibler optimization problem Joris Bierkens Radboud University Nijmegen j.bierkens@science.ru.nl Bert Kappen Radboud University Nijmegen b.kappen@science.ru.nl
More informationReinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil
Reinforcement Learning, Neural Networks and PI Control Applied to a Heating Coil Charles W. Anderson 1, Douglas C. Hittle 2, Alon D. Katz 2, and R. Matt Kretchmar 1 1 Department of Computer Science Colorado
More informationOptimal Switching of DC-DC Power Converters using Approximate Dynamic Programming
Optimal Switching of DC-DC Power Conerters using Approximate Dynamic Programming Ali Heydari, Member, IEEE Abstract Optimal switching between different topologies in step-down DC-DC oltage conerters, with
More informationDynamics of a Bertrand Game with Heterogeneous Players and Different Delay Structures. 1 Introduction
ISSN 749-889 (print), 749-897 (online) International Journal of Nonlinear Science Vol.5(8) No., pp.9-8 Dynamics of a Bertrand Game with Heterogeneous Players and Different Delay Structures Huanhuan Liu,
More informationFast Linear Iterations for Distributed Averaging 1
Fast Linear Iterations for Distributed Averaging 1 Lin Xiao Stephen Boyd Information Systems Laboratory, Stanford University Stanford, CA 943-91 lxiao@stanford.edu, boyd@stanford.edu Abstract We consider
More informationIntroduction to Reinforcement Learning
CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.
More informationA Geometric Review of Linear Algebra
A Geometric Reiew of Linear Algebra The following is a compact reiew of the primary concepts of linear algebra. The order of presentation is unconentional, with emphasis on geometric intuition rather than
More informationA Hierarchy of Suboptimal Policies for the Multi-period, Multi-echelon, Robust Inventory Problem
A Hierarchy of Suboptimal Policies for the Multi-period, Multi-echelon, Robust Inventory Problem Dimitris J. Bertsimas Dan A. Iancu Pablo A. Parrilo Sloan School of Management and Operations Research Center,
More information