Alleviating tuning sensitivity in Approximate Dynamic Programming


Paul Beuchat, Angelos Georghiou and John Lygeros

This research was partially funded by the European Commission under the project Local4Global. All authors are with the Automatic Control Laboratory, ETH Zürich, Switzerland. beuchatp@control.ee.ethz.ch

Abstract— Approximate Dynamic Programming offers benefits for large-scale systems compared to other synthesis and control methodologies. A common technique for approximating the Dynamic Program is to solve the corresponding Linear Program. The major drawback of this approach is that the online performance is very sensitive to the choice of tuning parameters, in particular the state relevance weighting. Our work aims at alleviating this sensitivity. To achieve this, we propose a point-wise maximum of multiple Q-functions for the online policy, and show that this immunizes against tuning errors in the parameter selection process. We formulate the resulting problem as a convex optimization problem and demonstrate the effectiveness of the approach using a stylized portfolio optimization problem. The approach offers a benefit for large-scale systems where the cost of a parameter tuning process is prohibitively high.

I. INTRODUCTION

Stochastic optimal control provides a framework to describe many challenges across the field of engineering. The objective is to find a policy for decision making that optimizes the performance of the dynamical system under consideration. Dynamic Programming (DP) provides a method to solve stochastic optimal control problems, for which the key is to solve the Bellman equation [1]. Although a powerful result, computing an exact solution to the Bellman equation, called the optimal cost-to-go function, is in general intractable and inevitably leads to the curse of dimensionality [2]. Approximate Dynamic Programming (ADP) is a term covering methods that attempt to approximate the solution of the Bellman equation, [3], [4]. In particular, the Linear Programming (LP) approach to ADP, first suggested in 1985 [5], introduces a set of parameters that need to be selected by the practitioner and that strongly affect the quality of the solution. In this paper, we address this sensitivity by proposing a systematic way to partially immunize the solution quality against poor choices of the tuning parameters.

The LP approach to ADP is stated as follows: given a set of basis functions, find a linear combination of them that best approximates the optimal cost-to-go function, called the approximate value function. Based on this, the online policy is the one-step minimization of the approximate value function, called the approximate greedy policy. If a set of basis functions and coefficients of the linear combination can be found that closely approximate the optimal cost-to-go function, then the method can be a very powerful tool. For example, the LP approach enjoyed notable success in the applications of playing backgammon [6], elevator scheduling [7], and stochastic reachability problems [8]. However, these examples required significant trial-and-error tuning in order to find a suitable choice of basis functions and the best linear combination. For other applications the trial-and-error work involved in tuning prohibits the use of this method. Hence, alleviating the tuning effort will expand the scope of applications for the LP approach.
Despite the rich choice of basis functions, see [9] and [10], choosing the optimal coefficients of the linear combination remains a difficult problem. The key parameter used throughout the literature to tune the coefficients is called the state relevance weighting. This tuning parameter specifies which regions of the state space are important for the approximation. However, the regions of importance depend on the behaviour of the system when the approximate greedy policy is played online, and the policy in turn depends on the choice of the state relevance weighting. This circular dependence of the tuning parameter leads to the difficulties experienced. Different approaches have been suggested for tuning the state relevance weighting. In [11] the authors use the initial distribution as the state relevance weighting. Although a natural choice, it leads to poor online performance if the system evolves to regions of the state space different from the initial distribution. The authors of [12] eliminate the state relevance weighting from the formulation at the expense of increased complexity when evaluating the online policy. Their approach is a variant of Model Predictive Control and hence faces difficulties similar to those researched in that field [13].

The contributions of this paper are twofold. First, we propose a policy that allows the practitioner to choose multiple state relevance weightings and that automatically leverages the best performance of each without requiring trial-and-error tuning. Second, we provide bounds to guarantee that our proposed approach will perform at least as well as any individual choice of the tuning parameter. Finally, we show through numerical examples that the proposed policy immunizes against poor choices of the state relevance weighting. Our proposed approach extends the point-wise maximum of approximate value functions suggested in [11], and uses the Q-function formulation, see [14], to reduce the computational burden of the online policy.

The structure of this paper is as follows. In Section II we present the DP formulation considered, and in Section III we present our proposed policy using the Value function formulation of the LP approach to DP. This motivates Section IV, where we use the Q-function formulation to propose a tractable, point-wise maximum, greedy policy.

Section IV also provides performance guarantees for the computed solution. In Section V we demonstrate the performance of the proposed approach, and we conclude in Section VI.

Notation: R_+ is the space of non-negative scalars; Z_+ is the space of positive integers; S^n is the space of n × n real symmetric matrices; I_n is the n × n identity matrix; 1_n is the vector of ones of size n; (·)^⊤ is the matrix transpose; given f : X → R, the infinity norm is ‖f‖_∞ = sup_{x∈X} |f(x)|, and the weighted 1-norm is ‖f‖_{1,c} = ∫_X |f(x)| c(dx). The term intractable is used throughout the paper. We loosely define intractable to mean that the computational burden of any existing solution method prohibits finding a solution in reasonable time.

II. DYNAMIC PROGRAMMING (DP) FORMULATION

This section introduces the problem formulation and states the DP as the solution to the Bellman equation. We consider infinite horizon, discounted cost, stochastic optimal control problems. The system is described by discrete-time dynamics over continuous state and action spaces. The state of the system at time t is x_t ∈ X ⊆ R^{n_x}. The system state is influenced by the control decisions u_t ∈ U ⊆ R^{n_u} and the stochastic disturbance ξ_t ∈ Ξ ⊆ R^{n_ξ}. In this setting, the state evolves according to the function g : X × U × Ξ → X as

  x_{t+1} = g(x_t, u_t, ξ_t).

At time t, the system incurs the stage cost γ^t l(x_t, u_t), where γ ∈ [0,1) is the discount factor, and the objective is to minimize the infinite sum of the stage costs. The optimal Value function, V* : X → R, characterizes the solution of this stochastic optimal control problem. It represents the cost-to-go from any state of the system if the optimal control policy is played. The optimal Value function is the solution of the Bellman equation [1],

  V*(x) = min_{u∈U} { l(x,u) + γ E[ V*(g(x,u,ξ)) ] } =: (T V*)(x),  (1)

for all x ∈ X, where T is the Bellman operator, and the minimand Q*(x,u) := l(x,u) + γ E[ V*(g(x,u,ξ)) ] is the optimal Q-function, Q* : (X × U) → R. The Q-function represents the cost of making decision u now and then playing optimally from the next time step onward. The optimal control actions are generated via the greedy policy:

  π*(x) = arg min_{u∈U} l(x,u) + γ E[ V*(g(x,u,ξ)) ] = arg min_{u∈U} Q*(x,u).  (2)

The Bellman equation (1) can be equivalently written in terms of Q* as follows:

  Q*(x,u) = l(x,u) + γ E[ min_{v∈U} Q*(g(x,u,ξ), v) ] =: (F Q*)(x,u),  (3)

for all x ∈ X and u ∈ U. Equation (3) defines the F-operator, the equivalent of T for Q-functions. The operators T and F are both monotone and γ-contractive, see [14].

Solving (1) exactly is only tractable under strong assumptions on the problem structure, namely unconstrained Linear Quadratic Gaussian problems [15]. In other cases, the popular LP approach to ADP can be used to approximate the solution of (1). This method is presented in the next sections.
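To make the operators T and F concrete before moving to the LP approach, the following minimal sketch (an illustration, not part of the original paper) applies the F-operator of (3) to a tabular Q-function on a coarsely discretized scalar example; the dynamics g(x,u,ξ) = x + u + ξ, the quadratic stage cost, the grids, and the sample sizes are all assumptions made purely for illustration. Iterating F repeatedly is simply Q-value iteration, and its convergence reflects the γ-contraction property noted above.

import numpy as np

# Illustrative toy problem (an assumption, not the paper's example):
# dynamics g(x,u,xi) = x + u + xi, stage cost l(x,u) = x^2 + u^2, discount gamma.
gamma = 0.9
X  = np.linspace(-5.0, 5.0, 41)                              # discretized state grid
U  = np.linspace(-1.0, 1.0, 11)                              # discretized input grid
xi = np.random.default_rng(0).normal(0.0, 0.2, 50)           # disturbance samples for E[.]

def F_operator(Q):
    """(FQ)(x,u) = l(x,u) + gamma * E[ min_v Q(g(x,u,xi), v) ], cf. (3)."""
    Q_new = np.empty_like(Q)
    min_Q = Q.min(axis=1)                                    # min over v at each grid state
    for i, x in enumerate(X):
        for j, u in enumerate(U):
            x_next = x + u + xi                              # sampled successor states
            idx = np.abs(X[:, None] - x_next[None, :]).argmin(axis=0)   # nearest grid points
            Q_new[i, j] = x**2 + u**2 + gamma * min_Q[idx].mean()
    return Q_new

Q = np.zeros((len(X), len(U)))
for _ in range(200):                                         # Q-value iteration: gamma-contraction
    Q = F_operator(Q)
greedy_u = U[Q.argmin(axis=1)]                               # greedy policy (2) read off the grid

On such a small grid the greedy policy can be read off directly; for the continuous state and action spaces considered in this paper exactly this becomes intractable, which motivates the LP approach developed next.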
III. VALUE FUNCTION APPROACH TO ADP

This section presents a method to obtain an approximation of V* through the solution of an LP. This is done by approximating the so-called exact LP, whose solution is V*. We highlight the sensitivity of approximate solutions to the tuning parameters introduced, and then propose a policy that immunizes against this sensitivity.

A. Iterated Bellman Inequality and the Exact LP

Equation (1) is relaxed to the iterated Bellman inequality,

  V(x) ≤ (T^M V)(x), ∀ x ∈ X,  (4)

for some M ∈ Z_+, where T^M denotes M applications of the Bellman operator. As shown in [11], any V satisfying (4) will be a point-wise under-estimator of V* over the set X.

The exact LP associated with (1) is formulated as follows:

  max_V  ∫_X V(x) c(dx)
  s.t.   V ∈ F(X),  V(x) ≤ (T^M V)(x), ∀ x ∈ X.  (5)

As shown in [16, Section 6.3], the solutions of (1) and (5) coincide when F(X) is the function space of real-valued measurable functions on X with finite weighted 1-norm, and c(·) is any finite measure on X that assigns positive mass to all open subsets of X. Taking M = 1 here and in the subsequent analysis corresponds to the formulation originally proposed in [17]. Although (1) and (5) are equivalent for all M ∈ Z_+, the benefit becomes apparent after the approximation is made. As explained in [11], problem (5) with M > 1 has a larger feasible region than with M = 1.

Solving (5) for V*, and implementing (2), is in general intractable. The difficulties can be categorized as follows:
(D1) F(X) is an infinite dimensional function space;
(D2) problem (5) involves an infinite number of constraints;
(D3) the multidimensional integral in the objective of (5);
(D4) the multidimensional integral over the disturbance in the Bellman operator T and in the greedy policy (2);
(D5) for arbitrary V ∈ F(X), the greedy policy (2) may be intractable.
Thus, methods that exactly solve (5) and (2) will suffer from the curse of dimensionality in at least one of these aspects, see [18, Section 2]. To gain computational tractability, in the following we restrict the function space F(X) so as to simultaneously overcome (D1)–(D5).

B. The Approximate LP

As suggested in [5], we restrict the admissible value functions to those that can be expressed as a linear combination of basis functions. In particular, given basis functions ˆV^{(i)}(x) : R^{n_x} → R, we parameterize a restricted function space as

  ˆF(X) = { ˆV(·) | ˆV(x) = Σ_{i=1}^K α_i ˆV^{(i)}(x) } ⊆ F(X),

for some α_i ∈ R. Hence an element of the set is specified by a set of α_i's. An approximate solution to (5) can be obtained through the solution of the following approximate LP:

  max_{ˆV}  ∫_X ˆV(x) c(dx)
  s.t.      ˆV ∈ ˆF(X),  ˆV(x) ≤ (T^M ˆV)(x), ∀ x ∈ X,  (6)

where the optimization variables are the α_i's in the definition of ˆF(X). The only change from (5) is that F(X) is replaced by ˆF(X). The iterated Bellman inequality is not a convex constraint on the optimization variables. As presented in [11, Section 3.4], it can be replaced by a constraint that is convex in the α_i's and implies the iterated Bellman inequality.

Difficulty (D1) has been overcome in problem (6), as ˆF(X) is parameterized by a finite dimensional decision variable. However, difficulties (D2)–(D4) are still present and are overcome by matching the choice of basis functions to the problem instance. The details of choosing the basis functions are omitted and the reader is referred to the following examples for guidance. The space of quadratic functions overcomes (D2)–(D4) for constrained LQG problems, see [11] for the details of the S-lemma procedure used to reformulate (6). For problems with polynomial dynamics, costs, and constraints, see [10], where sums-of-squares techniques and polynomial basis functions are used. In [8], radial basis functions are used to approximate stochastic reachability problems. Piece-wise constant approximate Value functions are used in [19] to address a perimeter surveillance control problem. Sampling based alternatives for overcoming (D2) are suggested in [20] and [21].

Let ˆV* denote the optimizer of problem (6). Then a natural choice for the online policy is

  ˆπ(x) = arg min_{u∈U} l(x,u) + γ E[ ˆV*(g(x,u,ξ)) ],  (7)

called an approximate greedy policy. Unless ˆV* is restricted to be convex when solving (6), difficulty (D5) will still be present. If convexity of ˆV* is not enforced, results from global polynomial optimization [22] may assist. The policy we propose in Section III-D requires that the ˆV* are convex, and hence that a convexity constraint is added to (6).

C. Choice of the weighting c(·)

As discussed in [16], the choice of c(·) does not affect problem (5). Intuitively speaking, the reason is that the space F(X) is rich enough to satisfy V(x) ≤ (T^M V)(x) with equality, point-wise for all x ∈ X. In contrast, once the restriction ˆF(X) ⊆ F(X) is made this is no longer true. The choice of c(·), referred to as the state relevance weighting, provides a trade-off between elements of ˆF(X) over the set of states. Thus, c(·) is a tuning parameter of the approximate LP and influences the optimizer ˆV*.

A good approximation of the value function should achieve near optimal online performance when it replaces V* in the greedy policy. Intuitively, we see from (7) that the online policy depends on the gradient of the approximate value function: two value functions that differ by a constant will make identical decisions. However, the approximate LP finds the closest fit to V* relative to the choice of c(·) and does not attempt to match the gradient of V*. We now provide the intuition behind the approach proposed in the next subsection.
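Before turning to that intuition, the following sketch illustrates one way problem (6) could be posed numerically for a single fixed choice of c(·). It uses convex quadratic basis functions, M = 1, sample averages for the objective integral and for the expectation over ξ, and sampled (x,u) pairs for the Bellman inequality in the spirit of the constraint-sampling approaches of [20], [21], rather than the exact S-lemma reformulation of [11]. The dynamics, cost, and sample sizes are illustrative assumptions, not the paper's setup.

import numpy as np
import cvxpy as cp

# Illustrative problem data (assumptions, not the paper's example).
n_x, n_u, gamma = 2, 1, 0.95
rng = np.random.default_rng(1)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
def g(x, u, xi): return A @ x + B @ u + xi          # linear dynamics
def l(x, u):     return float(x @ x + u @ u)        # quadratic stage cost

# Restricted function space: convex quadratics V(x) = x'Px + p'x + s.
P = cp.Variable((n_x, n_x), symmetric=True)
p = cp.Variable(n_x)
s = cp.Variable()
def V_hat(x):                                       # affine in (P, p, s) for a numeric x
    return cp.sum(cp.multiply(P, np.outer(x, x))) + p @ x + s

xs_c  = rng.normal(0.0, 1.0, (100, n_x))            # samples drawn from the weighting c(.)
pairs = [(rng.normal(0.0, 2.0, n_x), rng.uniform(-1.0, 1.0, n_u)) for _ in range(150)]
xis   = rng.normal(0.0, 0.1, (20, n_x))             # disturbance samples for the expectation

constraints = [P >> 0]                               # convexity, needed for the point-wise max policy
for x, u in pairs:
    # Bellman inequality V(x) <= l(x,u) + gamma*E[V(g(x,u,xi))], enforced at sampled (x,u) pairs.
    EV_next = sum(V_hat(g(x, u, xi)) for xi in xis) / len(xis)
    constraints.append(V_hat(x) <= l(x, u) + gamma * EV_next)

objective = cp.Maximize(sum(V_hat(x) for x in xs_c) / len(xs_c))
cp.Problem(objective, constraints).solve()           # needs an SDP-capable solver, e.g. SCS
P_j, p_j, s_j = P.value, p.value, s.value            # one member V_hat_j of the family for this c(.)

Solving this for several different weightings c_j(·) yields the family of approximate value functions used in the point-wise maximum construction below.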
Consider two choices of the state relevance weighting, c_A(·) and c_B(·), that separately place weight on narrow, disjoint regions of the state space, denoted A and B, and zero weight elsewhere. For each choice, the solution of (6), ˆV_A and ˆV_B respectively, will be the closest under-estimator of V* over the respective region. On region A, ˆV_B will be lower than ˆV_A, otherwise it would not be the solution of (6), and the reverse holds on region B. Thus, if we construct an approximate Value function as the point-wise maximum of ˆV_A and ˆV_B, it is expected to give the best estimate of V*. Finally, by fitting V* closely over a larger region, it is expected that the gradient approximation will be improved. In this way, our proposed point-wise maximum approach immunizes against the tuning errors that occur when choosing a single c(·).

D. Point-wise maximum Value function and policy

We solve problem (6) for several choices of c(·), and denote by ˆV_j the solution for a corresponding c_j(·). Letting J denote an index set with elements j, we define the point-wise maximum Value function as follows:

  ˆV_pwm(x) := max_{j∈J} { ˆV_j(x) }, ∀ x ∈ X.

Problem (6) ensures that each ˆV_j is a point-wise under-estimator of V*. Hence ˆV_pwm is a better under-estimator of V* in the following sense:

  V*(x) − ˆV_pwm(x) ≤ V*(x) − ˆV_j(x), ∀ x ∈ X, ∀ j ∈ J.

The natural choice for the online policy is now to use ˆV_pwm in the approximate greedy policy, i.e.,

  ˆπ(x) = arg min_{u∈U} l(x,u) + γ E[ max_{j∈J} { ˆV_j(g(x,u,ξ)) } ].  (8)

However, the point-wise maximum value function re-introduces difficulty (D4): as ˆV_pwm ∉ ˆF(X), evaluating (8) may not be tractable. The difficulty in (8) is that evaluating the expectation over the disturbance requires Monte Carlo sampling, and this makes the optimization over u prohibitively slow. Exchanging the expectation and maximization in (8) circumvents this difficulty, and leads to

  ˆπ(x) = arg min_{u∈U} l(x,u) + γ max_{j∈J} { E[ ˆV_j(g(x,u,ξ)) ] },  (9)

a sketch of whose evaluation is given below.
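The sketch below illustrates how (9) can be evaluated in practice when the ˆV_j are convex quadratics, the dynamics are affine in u, and the expectation is replaced by a fixed sample average; the maximum over j is handled with an epigraph variable. The fitted coefficients (P_j, p_j, s_j), the box description of U, and the helper arguments are hypothetical, so this is an illustration under those assumptions rather than the paper's implementation.

import numpy as np
import cvxpy as cp

def pwm_value_policy(x, V_list, stage_cost, g_affine, xis, U_box, gamma=0.95):
    """Evaluate policy (9): argmin_u stage_cost(x,u) + gamma * max_j (1/S) sum_s V_j(g(x,u,xi_s)).
    V_list holds hypothetical fitted coefficients (P_j, p_j, s_j) with P_j PSD, so that
    V_j(z) = z'P_j z + p_j'z + s_j is convex; g_affine(x, u, xi) must be affine in u."""
    u = cp.Variable(U_box.shape[0])
    t = cp.Variable()                                   # epigraph variable for the max over j
    cons = [u >= U_box[:, 0], u <= U_box[:, 1]]         # box description of U (an assumption)
    for (P_j, p_j, s_j) in V_list:
        samples = []
        for xi in xis:
            z = g_affine(x, u, xi)                      # affine in u, so quad_form(z, P_j) is convex
            samples.append(cp.quad_form(z, P_j) + z @ p_j + s_j)
        cons.append(sum(samples) / len(xis) <= t)       # sample average of E[V_j(g(x,u,xi))]
    cp.Problem(cp.Minimize(stage_cost(x, u) + gamma * t), cons).solve()
    return u.value

# e.g. for the portfolio model of Section V: g_affine = lambda x, u, xi: cp.multiply(xi, x + u)

Note that the disturbance samples enter every constraint, which is precisely the online burden that the Q-function policy of the next section avoids.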

This approximation induces a tractable reformulation, and by Jensen's inequality (9) is still a lower bound. A similar approach was proposed in [12] in the context of min-max approximate dynamic programming. It is not clear, however, how the exchange will affect the performance of the approximate greedy policy. In the next section we propose an alternative formulation in terms of Q-functions that alleviates the need to use Jensen's inequality.

IV. Q-FUNCTION APPROACH TO ADP

In this section, we alternatively define the greedy policy using Q-functions instead of Value functions. We will show that the resulting greedy policy does not suffer from difficulty (D4). Additionally, the greedy policy can be efficiently computed when using a point-wise maximum of approximate Q-functions. Finally, in Section IV-D, we provide error bounds for point-wise maximum Q-functions.

A. Iterated F-operator Inequality and the Approximate LP

The Bellman equation for the Q-function formulation, (3), is relaxed to the iterated F-operator inequality,

  Q(x,u) ≤ (F^M Q)(x,u), ∀ x ∈ X, u ∈ U,  (10)

for some M ∈ Z_+ being the number of iterations, where F^M denotes M applications of the F-operator. As shown in [14], any Q satisfying (10) will be a point-wise under-estimator of Q* over the set (X × U).

An exact LP reformulation of (3) is analogous to (5) and also requires optimization over an infinite dimensional function space. For brevity we omit this formulation and move directly to the function approximation. Similar to Section III-B, we restrict the admissible Q-functions to those that can be expressed as a linear combination of basis functions. In particular, given basis functions ˆQ^{(i)}(x,u) : R^{n_x} × R^{n_u} → R, we parameterize a restricted function space as

  ˆF(X × U) = { ˆQ(·,·) | ˆQ(x,u) = Σ_{i=1}^K α_i ˆQ^{(i)}(x,u) },

for some α_i ∈ R. Using ˆF(X × U), an approximate solution to (3) is obtained through the solution of the following approximate LP:

  max_{ˆQ}  ∫_{X×U} ˆQ(x,u) c(d(x,u))
  s.t.      ˆQ ∈ ˆF(X × U),  ˆQ(x,u) ≤ (F^M ˆQ)(x,u), ∀ x ∈ X, u ∈ U,  (11)

where the optimization variables are the α_i's in the definition of ˆF(X × U); hence the LP is finite dimensional. The weighting parameter in the objective, c(·,·), needs to be defined over the (X × U) space. Again, the iterated F-operator inequality is not a convex constraint on the optimization variables. However, it can be replaced by a constraint that is convex in the α_i's and implies that the iterated F-operator inequality is satisfied. The reformulation, which combines the reformulations found in [11] and [14], is presented in Appendix II.

The difficulties (D1)–(D5), described for the Value function formulation, apply equally to the Q-function formulation. Similar to the discussion in Section III-B, difficulties (D2)–(D4) remain present in (11). Overcoming (D2)–(D4) requires the basis functions of ˆF(X × U) to be chosen appropriately for a particular problem instance. The quadratic, polynomial, and radial basis functions described in [11], [10], and [8] for the Value function formulation can be used equivalently for the Q-function formulation.

Let ˆQ* denote the optimizer of problem (11). Then a natural choice for an approximate greedy policy is

  ˆπ(x) = arg min_{u∈U} ˆQ*(x,u).  (12)

Overcoming difficulty (D5) for Q-functions requires that ˆQ* is restricted to be convex in u when solving (11).

B. Choice of the weighting c(·,·)

The weighting in the objective of (11), c(·,·), is analogous to the weighting in the Value function formulation: it influences the solution of the approximate LP and is difficult to choose.
In general, it is not possible to find a c(·,·) such that the associated ˆQ* approximates Q* equally well over the whole state-by-input space. It is not clear how one would choose c(·,·) such that ˆQ* provides: (i) a tight under-estimate of Q*, and (ii) near optimal online performance. Next, we introduce the point-wise maximum Q-function, which removes the need to choose a single c(·,·) by combining the best fit over multiple approximate Q-functions.

C. Point-wise maximum Q-functions and policy

We propose the point-wise maximum of Q-functions as a method to alleviate the burden of tuning a single weighting parameter. We will show that: (i) the proposed method induces a computationally tractable policy; and (ii) a better under-estimator results in a tighter lower bound on Q*, and potentially better performance of the online policy. We now define the point-wise maximum Q-function. To this end, we solve problem (11) for several choices of c(·,·) and denote by ˆQ_j the solution for a corresponding c_j(·,·). Letting J denote an index set, ˆQ_pwm is defined as

  ˆQ_pwm(x,u) := max_{j∈J} { ˆQ_j(x,u) },  (13)

point-wise for all x ∈ X and u ∈ U. Replacing Q* by ˆQ_pwm in equation (12) leads to the approximate greedy policy

  ˆπ(x) = arg min_{u∈U} max_{j∈J} { ˆQ_j(x,u) }.  (14)

The advantage of problem (14) compared to (8) is that it does not involve the multidimensional integration over the disturbance, and hence avoids the re-introduction of difficulty (D4). Thus, (14) is equivalently reformulated as

  ˆπ(x) = arg min_{u∈U, t} t   s.t.  ˆQ_j(x,u) ≤ t, ∀ j ∈ J.  (15)

This reformulation is a convex optimization program if each ˆQ_j is restricted to be convex in u when solving (11).
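For concreteness, a minimal sketch of evaluating (15) when each ˆQ_j is a convex-in-u quadratic of the form used later in Section V (block coefficients P_xx, P_xu, P_uu, p_x, p_u, s, with P_uu positive semidefinite) might look as follows; the box description of U and all names are illustrative assumptions. In contrast to the sketch given for (9), no disturbance samples are needed online.

import numpy as np
import cvxpy as cp

def pwm_q_policy(x, Q_list, U_box):
    """Evaluate the point-wise maximum greedy policy (15):
         min_{u,t} t   s.t.   Q_j(x,u) <= t for all j in J,   u in U.
    Each Q_j(x,u) = x'Pxx x + 2 x'Pxu u + u'Puu u + px'x + pu'u + s with Puu PSD,
    so every constraint is a convex quadratic in u and the problem is a convex QCQP."""
    u = cp.Variable(U_box.shape[0])
    t = cp.Variable()
    cons = [u >= U_box[:, 0], u <= U_box[:, 1]]          # box description of U (an assumption)
    for (Pxx, Pxu, Puu, px, pu, s) in Q_list:
        const = float(x @ Pxx @ x + px @ x + s)          # terms independent of u
        lin   = 2.0 * (x @ Pxu) + pu                     # coefficients of the terms linear in u
        cons.append(const + lin @ u + cp.quad_form(u, Puu) <= t)
    cp.Problem(cp.Minimize(t), cons).solve()
    return u.value                                       # called once per time step online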

The numerical example presented in Section V uses convex quadratics as the basis function space. This means that ˆQ_pwm is a convex piece-wise quadratic function, and solving (15) is a convex Quadratically Constrained Quadratic Program (QCQP).

D. Fitting bound for the approximate Q-function

We now provide a bound on how closely a solution of (11) approximates Q*. This result allows us to show that ˆQ_pwm will provide a better estimate of Q* than the ˆQ_j from which it is composed. Combining the ideas from [11] and [14], we prove the following theorem.

Theorem 4.1: Let Q* be the solution of (3) and let ˆQ* be the solution of (11) for a given choice of ˆF(X × U) and c(·,·). Then the following bound holds:

  ‖Q* − ˆQ*‖_{1,c(x,u)} ≤ (2 / (1 − γ^M)) min_{ˆQ ∈ ˆF(X×U)} ‖Q* − ˆQ‖_∞.  (16)

Proof: See Appendix I.

The theorem says that when Q* is close to the span of the restricted function space, then the under-estimator ˆQ* will also be close to Q*. In fact, the bound indicates that if Q* ∈ ˆF(X × U), the approximate LP will recover ˆQ* = Q*. Notice that Theorem 4.1 holds for any choice of ˆF(X × U) and any choice of c(·,·).

We now argue that Theorem 4.1 applies also to ˆQ_pwm as defined in (13). This will allow us to conclude that ˆQ_pwm provides a better estimate of Q*. A valid choice of the restricted function space is the following point-wise maximum of N approximate Q-functions:

  ˆF^N_pwm(X × U) = { ˆQ(·,·) | ˆQ(x,u) = max_{k=1,...,N} { ˆQ_k(x,u) }, ˆQ_k(x,u) ∈ ˆF(X × U) }.

Difficulties (D1)–(D4) still exist if one attempts to solve (11) using ˆF^N_pwm(X × U) as the restricted function space. However, the bound given in Theorem 4.1 applies regardless. Note that the right-hand side of the bound in Theorem 4.1 depends only on the restricted function space and not on the solution of the approximate LP. Therefore, ˆF(X × U) ⊆ ˆF^N_pwm(X × U) implies that the minimization on the right-hand side of the bound has a larger feasible region, and hence achieves a tighter bound, for the point-wise maximum function space.

For the theorem to apply to our choice of ˆQ_pwm in (13), we must show that it is feasible for the approximate LP with ˆF^N_pwm(X × U) as the restricted function space. This is achieved by the following lemma.

Lemma 4.2: Let { ˆQ_j }_{j∈J} be Q-functions such that for each j ∈ J the following inequality holds:

  ˆQ_j(x,u) ≤ (F^M ˆQ_j)(x,u), ∀ x ∈ X, u ∈ U.

Then the function ˆQ_pwm(x,u), defined in (13), also satisfies the iterated F-operator inequality, i.e., ˆQ_pwm(x,u) ≤ (F^M ˆQ_pwm)(x,u).

Proof: See Appendix I.

Although we cannot tractably solve (11) with ˆF^N_pwm(X × U) as the restricted function space, Lemma 4.2 states that ˆQ_pwm is a feasible point of that problem. Therefore, there likely exists a choice of c(·,·) such that ˆQ_pwm is the solution. Theorem 4.1 states that under this choice of c(·,·), ˆQ_pwm approximates Q* per the bound given in the theorem.

We close this section by re-iterating the benefits of ˆQ_pwm compared to ˆV_pwm. Both give improved lower bounds, but ˆQ_pwm has the advantage that the policy can be implemented without introducing an additional approximation, as in the case of (8). This will be further demonstrated in the following numerical example.

V. NUMERICAL RESULTS

In this section, we present a numerical case study to highlight the benefits of the proposed point-wise maximum policies, see Sections III-D and IV-C. We use a dynamic portfolio optimization example taken directly from [12] to compare the online performance of both the Value function and Q-function formulations.
The model is briefly described here, using the same notation as [12, Section IV]. The task is to manage a portfolio of n assets with the objective of maximizing revenue over a discounted infinite horizon. The state of the system, x_t ∈ R^n, is the value of the assets in the portfolio, while the input, u_t ∈ R^n, is the amount of each asset to buy or sell. By convention, a negative value for an element of u_t means the respective asset is sold, while a positive value means it is purchased. The stochastic disturbance affecting the system is the return of the assets over a time period, denoted ξ_t ∈ R^n. Under the influence of u_t and ξ_t, the value of the portfolio evolves over time as

  x_{t+1} = diag(ξ_t) (x_t + u_t).

The dynamics represent an example of a linear system affected by multiplicative uncertainty. The transaction fees are parameterized by κ ∈ R_+ and R ∈ R^{n×n}, and hence the stage cost incurred at each time step is given by

  1_n^⊤ u_t + κ ‖u_t‖_1 + u_t^⊤ R u_t.

The first term represents the gross cash from purchases and sales, and the final two terms represent the transaction cost. As revenue corresponds to a negative cost, the objective is to minimize the discounted infinite sum of the stage costs. The discount factor γ represents the time value of money. A restriction is placed on the risk of the portfolio by enforcing the following constraint on the return variance over a time step:

  (x_t + u_t)^⊤ Σ̂ (x_t + u_t) ≤ l,

where l ∈ R_+ is the maximum variance allowed, and Σ̂ is the covariance of the uncertain return ξ_t. Table I lists the parameter values that we use in the numerical instance presented here.
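To make the online evaluation of this model concrete, the following sketch simulates the dynamics and accumulates the discounted stage cost for a given policy. The log-normal return model mirrors the parameters of Table I, while the function names and the tolerance on the risk check are assumptions for illustration, and the policy is assumed to already enforce the risk constraint (e.g. as an additional convex constraint in (15)).

import numpy as np

def simulate_portfolio(policy, x0, n_steps, gamma, kappa, R, Sigma_hat, l_risk, mu_log, Sigma_log, seed=0):
    """Roll out x_{t+1} = diag(xi_t)(x_t + u_t) and accumulate the discounted stage cost
       1'u + kappa*||u||_1 + u'Ru (negative revenue), as described above."""
    rng = np.random.default_rng(seed)
    n, x, cost = x0.size, x0.copy(), 0.0
    for t in range(n_steps):
        u = policy(x)                                        # e.g. the point-wise maximum Q policy
        post = x + u                                         # post-trade asset positions
        assert post @ Sigma_hat @ post <= l_risk + 1e-6      # risk (return-variance) constraint
        cost += gamma**t * (np.ones(n) @ u + kappa * np.abs(u).sum() + u @ R @ u)
        xi = np.exp(rng.multivariate_normal(mu_log, Sigma_log))   # log-normal asset returns
        x  = xi * post                                       # diag(xi)(x_t + u_t)
    return cost

Averaging many such rollouts gives the kind of online performance estimates reported below.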

TABLE I: Parameters used for the portfolio example
  n = 8;  γ = 0.96;  x_0 ~ N(0_n, 10 I_n);  log(ξ_t) ~ N(0_n, 0.5 I_n);  κ = 0.04;
  R = diag(0.028, 0.034, 0.020, 0.026, 0.023, 0.022, 0.024, 0.027).

We solved both the Value function and the Q-function approximate LP using quadratic basis functions. The equations for fitting the Value functions and Q-functions via (6) and (11) follow directly from [12, Equations (4)–(7)]. The quadratic basis functions used are parameterized as follows:

  ˆQ(x,u) = [x; u]^⊤ [ P_xx  P_xu ; P_xu^⊤  P_uu ] [x; u] + [p_x; p_u]^⊤ [x; u] + s_Q,
  ˆV(x) = x^⊤ P x + p^⊤ x + s_V,

where P, P_xx, P_uu ∈ S^n, P_xu ∈ R^{n×n}, p, p_x, p_u ∈ R^n, and s_V, s_Q ∈ R are the coefficients of the linear combinations in the basis function sets ˆF(X) and ˆF(X × U). A positive semi-definite constraint is enforced on P and P_uu to make the point-wise maximum policies convex and tractable. This restricts the Value functions to be convex and the Q-functions to be convex in u.

A family of 10 weighting parameters was used. One of the weighting parameters, denoted c_0(·), was chosen to be the same as the initial distribution over X, and represents the method suggested in [11]. The remaining 9 weighting parameters, denoted {c_j(·)}_{j=1}^{9}, were chosen to be low variance normal distributions centred at random locations in the state space. For the Q-function formulation, the same 10 weighting parameters were expanded with a fixed uniform distribution over U. For each c_j(·), problems (6) and (11) were solved to obtain ˆV_j and ˆQ_j, j = 1,...,9. The lower bound and online performance were computed for the individual ˆV_j and ˆQ_j, and also for ˆV_pwm and ˆQ_pwm with the point-wise maximum taken over the whole family. The lower bound was computed as the average over 2000 samples, while the online performance was averaged over 2000 simulations, each of length 300 time steps. The results are shown in Table II.

TABLE II: Lower bound and online performance
  Row 1:  Online, ˆV fitted with c_0
  Row 2:  Online, ˆQ fitted with c_0
  Row 3:  Online, ˆQ_j, j = 1,...,9: [−3.0, 42.8]
  Row 4:  Online, ˆV_j, j = 1,...,9: [−95.1, 42.8]
  Row 5:  Online, ˆV_pwm: 45.2
  Row 6:  Online, ˆQ_pwm: 45.5
  Row 7:  Lower bound, ˆQ_pwm: 65.1
  Row 8:  Lower bound, ˆQ fitted with c_0
  Row 9:  Lower bound, ˆQ_j, j = 1,...,9: [−568.4, 65.1]
  Row 10: Lower bound, ˆV_pwm: 65.8
  Row 11: Lower bound, ˆV fitted with c_0
  Row 12: Lower bound, ˆV_j, j = 1,...,9: [−220.7, 66.0]

The results show that when using a single choice of c(·), both the lower bound and the online performance can be arbitrarily bad. This is shown by the large range of values in rows 3, 4, 9, and 12 of Table II. This large range indicates that at least one of the weighting parameters was a poor choice. As the point-wise maximum function uses all 10 choices of c(·), it also includes this poor choice of the weighting parameter. We see from the results that the point-wise maximum function achieves the tightest lower bound and the best online performance. This highlights that the point-wise maximum function immunizes against the poor choice of the weighting parameter. Using the initial distribution as the weighting parameter gives reasonable results for this example, see rows 1, 2, 8, and 11 of Table II; thus, the suggestion of [11] is reasonable. The benefit of the point-wise maximum policy is that the practitioner can explore other choices of c(·) without risk of degraded performance. Finally, we note that in the best case the Q-functions perform slightly better than the Value functions, indicating that exchanging the expectation and maximization in (8) had little impact for this example. However, in the worst case, the Q-function performs very badly for a poor choice of the weighting parameter, highlighting further the importance of using our proposed approach to immunize against such sensitivity.
This example demonstrates the features and trends indicated by the theory.

VI. CONCLUSIONS

In this paper, we addressed the difficulty of tuning the state relevance weighting parameter in the Linear Programming approach to Approximate Dynamic Programming. We proposed an approximate greedy policy that alleviates the tuning sensitivity of previous methods by allowing for a family of parameters to be used and automatically choosing the best parameter at each time step. This is achieved by using a point-wise maximum of functions that individually under-estimate the optimal cost-to-go. We render the online policy tractable by using Q-functions. We proved that the proposed approach gives a satisfactory lower bound on the best achievable cost, and used a numerical example to demonstrate that the online performance is indeed immunized against poor choices of the weighting parameter. Future work will include improved theoretical bounds on the online performance of the policy and a deeper understanding of when the Value function or the Q-function approach is preferable.

APPENDIX I
PROOFS OF THEOREM 4.1 AND LEMMA 4.2

The proof requires two auxiliary lemmas that are presented first; we then present the proof of Theorem 4.1. Lemma 1.2 provides a point-wise bound on how much the M-iterated F-operator inequality is violated for any given Q-function. This is used in the proof of Lemma 1.3, which shows that, given a ˆQ ∈ ˆF(X × U), it can be shifted downwards by a certain constant amount to satisfy the iterated F-operator inequality. The constant by which it is shifted relates directly to the constant on the right-hand side of Theorem 4.1. We start by stating the monotone and contractive properties of the F-operator, which are needed in the proofs.

Proposition 1.1: The F-operator is (i) monotone and (ii) γ-contractive, i.e., for any given Q_1, Q_2 : (X × U) → R,

  (i)  Q_1(x,u) ≤ Q_2(x,u), ∀ x ∈ X, u ∈ U  ⟹  (F Q_1)(x,u) ≤ (F Q_2)(x,u), ∀ x ∈ X, u ∈ U;
  (ii) ‖F Q_1 − F Q_2‖_∞ ≤ γ ‖Q_1 − Q_2‖_∞.

These properties can be found in [14].

Lemma 1.2: Let M ∈ Z_+ and let Q : (X × U) → R be any Q-function. Then violations of the iterated F-operator inequality can be bounded as

  (F^M Q)(x,u) ≥ Q(x,u) − (1 + γ^M) ‖Q* − Q‖_∞,

for all x ∈ X, u ∈ U.

Proof: Starting from the terms not involving γ^M,

  Q(x,u) − ‖Q − Q*‖_∞ − (F^M Q)(x,u) ≤ Q*(x,u) − (F^M Q)(x,u) ≤ ‖F^M Q* − F^M Q‖_∞ ≤ γ^M ‖Q* − Q‖_∞,

for all x ∈ X, u ∈ U. The first inequality follows from the definition of the ∞-norm, and the second inequality comes from Q*(x,u) = (F Q*)(x,u) and the ∞-norm definition. Finally, the third inequality is due to the γ-contractive property of the F-operator. Re-arranging, the result follows.

Lemma 1.3: Let ˆQ(x,u) ∈ ˆF(X × U) be an arbitrary element of the basis function set, and let Q̃(x,u) be the Q-function defined as

  Q̃(x,u) = ˆQ(x,u) − ((1 + γ^M)/(1 − γ^M)) ‖Q* − ˆQ‖_∞,  (17)

where the second term is a constant downwards shift. Then Q̃(x,u) satisfies the iterated F-operator inequality, i.e., Q̃(x,u) ≤ (F^M Q̃)(x,u) for all x ∈ X, u ∈ U, and if ˆF(X × U) allows for affine combinations of the basis functions, then Q̃ is also an element of ˆF(X × U).

Proof: For notational convenience, let β ∈ R denote the constant downwards shift term, so that Q̃ = ˆQ + β with β = −((1 + γ^M)/(1 − γ^M)) ‖Q* − ˆQ‖_∞. Using the definition of the F-operator we see that for any function Q(x,u),

  (F(Q + β))(x,u) = l(x,u) + γ E[ min_{v∈U} ( Q(g(x,u,ξ), v) + β ) ] = (F Q)(x,u) + γ β,

where the equalities hold for all x ∈ X, u ∈ U. The first equality comes from the definition of the F-operator, and the second holds because β is an additive constant in the objective of the minimization. Iterating the same argument M times leads to

  (F^M (Q + β))(x,u) = (F^{M−1}(F(Q + β)))(x,u) = (F^{M−1}((F Q) + γ β))(x,u) = (F^{M−2}(F^2 Q + γ^2 β))(x,u) = ... = (F^M Q)(x,u) + γ^M β,  (18)

where the equalities hold point-wise for all x ∈ X, u ∈ U. Now we show that Q̃ satisfies the iterated F-operator inequality:

  (F^M Q̃)(x,u) = (F^M ˆQ)(x,u) + γ^M β
               ≥ ˆQ(x,u) − (1 + γ^M) ‖Q* − ˆQ‖_∞ − γ^M ((1 + γ^M)/(1 − γ^M)) ‖Q* − ˆQ‖_∞
               = ˆQ(x,u) − ((1 + γ^M)/(1 − γ^M)) ‖Q* − ˆQ‖_∞ = Q̃(x,u),

where the first equality comes from (18), the inequality is a direct application of Lemma 1.2 to the term (F^M ˆQ) and holds for all x ∈ X, u ∈ U, and the final equalities follow from (17). Finally, if ˆF(X × U) allows for affine combinations of the basis functions, then ˆQ ∈ ˆF(X × U) implies Q̃ ∈ ˆF(X × U), as the downwards shift term is an additive constant.

Now we have all the ingredients to prove Theorem 4.1.

Proof of Theorem 4.1: Given any approximate Q-function from the restricted function space, ˆQ(x,u) ∈ ˆF(X × U), Lemma 1.3 allows us to construct

  Q̃(x,u) = ˆQ(x,u) − ((1 + γ^M)/(1 − γ^M)) ‖Q* − ˆQ‖_∞ ∈ ˆF(X × U),

which is feasible for (11). It can be shown that maximizing ∫_{X×U} ˆQ(x,u) c(d(x,u)) subject to constraints that ensure ˆQ is an under-estimator of Q*, as in (11), is equivalent to minimizing ‖Q* − ˆQ‖_{1,c(x,u)}; see [17, Lemma 1] for an example proof. Thus, starting from the left-hand side of (16),

  ‖Q* − ˆQ*‖_{1,c(x,u)} ≤ ‖Q* − Q̃‖_{1,c(x,u)} ≤ ‖Q* − Q̃‖_∞ ≤ ‖Q* − ˆQ‖_∞ + ‖ˆQ − Q̃‖_∞
    = ‖Q* − ˆQ‖_∞ + ((1 + γ^M)/(1 − γ^M)) ‖Q* − ˆQ‖_∞ = (2/(1 − γ^M)) ‖Q* − ˆQ‖_∞,

where the first inequality holds because Q̃ is also feasible for (11), the second inequality holds by assuming, without loss of generality, that c(·,·) is a probability distribution, the third inequality is an application of the triangle inequality, the first equality stems directly from the definition of Q̃, and the final equality is an algebraic manipulation. As this argument holds for any ˆQ ∈ ˆF(X × U), the result follows.

Proof of Lemma 4.2: Starting from the definition of ˆQ_pwm we get

  ˆQ_j(x,u) ≤ ˆQ_pwm(x,u), ∀ j ∈ J
  ⟹ (F^M ˆQ_j)(x,u) ≤ (F^M ˆQ_pwm)(x,u), ∀ j ∈ J
  ⟺ max_{j∈J} { (F^M ˆQ_j)(x,u) } ≤ (F^M ˆQ_pwm)(x,u),

where the inequalities hold point-wise for all x ∈ X, u ∈ U. The first implication follows from the monotonicity property of the F-operator, and the equivalence holds because j appears only on the left-hand side of the inequality. The assumption of the lemma that the iterated F-operator inequality is satisfied for each j implies the following inequality:

  max_{j∈J} { ˆQ_j(x,u) } ≤ max_{k∈J} { (F^M ˆQ_k)(x,u) },

which holds point-wise for all x ∈ X, u ∈ U. Noting that the left-hand side of this inequality is the definition of ˆQ_pwm(x,u), the claim follows.

APPENDIX II
REFORMULATION OF THE F-OPERATOR INEQUALITY

This convex alternative applies to (11). A sufficient condition for ˆQ ∈ ˆF(X × U) to satisfy the iterated F-operator inequality,

  ˆQ(x,u) ≤ (F^M ˆQ)(x,u),  (19a)

is the existence of ˆQ_j ∈ ˆF(X × U) and ˆV_j ∈ ˆF(X), j = 1,...,M, with ˆQ_1 identified with ˆQ, such that

  ˆQ_j(x,u) ≤ l(x,u) + γ E[ ˆV_j(g(x,u,ξ)) ],  j = 1,...,M,
  ˆV_{j−1}(x) ≤ ˆQ_j(x,u),  j = 2,...,M,                     (19b)
  ˆV_M(x) ≤ ˆQ(x,u),

where all inequalities hold for all x ∈ X and u ∈ U. The definitions of ˆF(X) and ˆF(X × U) are given in Sections III-B and IV-A respectively. The reformulation (19b) is linear in the additional Value function and Q-function variables introduced. Hence (19b) is a tractable set of constraints, given that ˆF(X) and ˆF(X × U) were chosen to overcome difficulties (D1)–(D4). Note that (19a) implies (19b) only when infinite dimensional function spaces are used for the additional variables, meaning that it is intractable to use an equivalent reformulation of the F-operator inequality.

REFERENCES
[1] R. E. Bellman, "On the theory of dynamic programming," Proceedings of the National Academy of Sciences of the United States of America, vol. 38, no. 8, 1952.
[2] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. 1. Athena Scientific, Belmont, MA, 2005.
[3] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed. Wiley, 2011.
[4] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[5] P. J. Schweitzer and A. Seidmann, "Generalized polynomial approximations in Markovian decision processes," Journal of Mathematical Analysis and Applications, vol. 110, 1985.
[6] G. Tesauro, "Practical issues in temporal difference learning," Machine Learning, vol. 8, 1992.
[7] A. G. Barto and R. H. Crites, "Improving elevator performance using reinforcement learning," Advances in Neural Information Processing Systems, vol. 8, 1996.
[8] N. Kariotoglou, S. Summers, T. Summers, M. Kamgarpour, and J. Lygeros, "Approximate dynamic programming for stochastic reachability," in European Control Conference (ECC), 2013.
[9] A. Keshavarz and S. Boyd, "Quadratic approximate dynamic programming for input-affine systems," International Journal of Robust and Nonlinear Control, vol. 24, no. 3, 2012.
[10] T. Summers, K. Kunz, N. Kariotoglou, M. Kamgarpour, S. Summers, and J. Lygeros, "Approximate dynamic programming via sum of squares programming," in European Control Conference (ECC), 2013.
[11] Y. Wang, B. O'Donoghue, and S. Boyd, "Approximate dynamic programming via iterated Bellman inequalities," International Journal of Robust and Nonlinear Control, 2014.
[12] B. O'Donoghue, Y. Wang, and S. Boyd, "Min-max approximate dynamic programming," in IEEE International Symposium on Computer-Aided Control System Design (CACSD), 2011.
[13] J. B. Rawlings and D. Q. Mayne, Model Predictive Control: Theory and Design. Nob Hill Publishing, 2009.
[14] R. Cogill, M. Rotkowitz, B. Van Roy, and S. Lall, "An approximate dynamic programming approach to decentralized control of stochastic systems," in Control of Uncertain Systems: Modelling, Approximation, and Design. Springer, 2006.
[15] J. Casti, "The linear-quadratic control problem: some recent results and outstanding problems," SIAM Review, vol. 22, no. 4, 1980.
[16] O. Hernández-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria, vol. 30. Springer Science & Business Media, 2012.
[17] D. P. de Farias and B. Van Roy, "The linear programming approach to approximate dynamic programming," Operations Research, vol. 51, no. 6, 2003.
[18] W. B. Powell, "What you should know about approximate dynamic programming," Naval Research Logistics (NRL), vol. 56, no. 3, 2009.
[19] K. Krishnamoorthy, M. Pachter, S. Darbha, and P. Chandler, "Approximate dynamic programming with state aggregation applied to UAV perimeter patrol," International Journal of Robust and Nonlinear Control, vol. 21, no. 12, 2011.
[20] D. P. de Farias and B. Van Roy, "On constraint sampling in the linear programming approach to approximate dynamic programming," Mathematics of Operations Research, vol. 29, no. 3, 2004.
[21] T. Sutter, P. M. Esfahani, and J. Lygeros, "Approximation of constrained average cost Markov control processes," in IEEE 53rd Annual Conference on Decision and Control (CDC), 2014.
[22] J. B. Lasserre, "Global optimization with polynomials and the problem of moments," SIAM Journal on Optimization, vol. 11, no. 3, 2001.


More information

Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds

Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds Marek Petrik IBM T.J. Watson Research Center, Yorktown, NY, USA Abstract Approximate dynamic programming is a popular method

More information

Value and Policy Iteration

Value and Policy Iteration Chapter 7 Value and Policy Iteration 1 For infinite horizon problems, we need to replace our basic computational tool, the DP algorithm, which we used to compute the optimal cost and policy for finite

More information

arxiv: v1 [math.oc] 23 Oct 2017

arxiv: v1 [math.oc] 23 Oct 2017 Stability Analysis of Optimal Adaptive Control using Value Iteration Approximation Errors Ali Heydari arxiv:1710.08530v1 [math.oc] 23 Oct 2017 Abstract Adaptive optimal control using value iteration initiated

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

6.231 DYNAMIC PROGRAMMING LECTURE 6 LECTURE OUTLINE

6.231 DYNAMIC PROGRAMMING LECTURE 6 LECTURE OUTLINE 6.231 DYNAMIC PROGRAMMING LECTURE 6 LECTURE OUTLINE Review of Q-factors and Bellman equations for Q-factors VI and PI for Q-factors Q-learning - Combination of VI and sampling Q-learning and cost function

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

Structured Problems and Algorithms

Structured Problems and Algorithms Integer and quadratic optimization problems Dept. of Engg. and Comp. Sci., Univ. of Cal., Davis Aug. 13, 2010 Table of contents Outline 1 2 3 Benefits of Structured Problems Optimization problems may become

More information

Stabilization of constrained linear systems via smoothed truncated ellipsoids

Stabilization of constrained linear systems via smoothed truncated ellipsoids Preprints of the 8th IFAC World Congress Milano (Italy) August 28 - September 2, 2 Stabilization of constrained linear systems via smoothed truncated ellipsoids A. Balestrino, E. Crisostomi, S. Grammatico,

More information

In: Proc. BENELEARN-98, 8th Belgian-Dutch Conference on Machine Learning, pp 9-46, 998 Linear Quadratic Regulation using Reinforcement Learning Stephan ten Hagen? and Ben Krose Department of Mathematics,

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Biasing Approximate Dynamic Programming with a Lower Discount Factor

Biasing Approximate Dynamic Programming with a Lower Discount Factor Biasing Approximate Dynamic Programming with a Lower Discount Factor Marek Petrik, Bruno Scherrer To cite this version: Marek Petrik, Bruno Scherrer. Biasing Approximate Dynamic Programming with a Lower

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Optimal Decentralized Control of Coupled Subsystems With Control Sharing

Optimal Decentralized Control of Coupled Subsystems With Control Sharing IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 58, NO. 9, SEPTEMBER 2013 2377 Optimal Decentralized Control of Coupled Subsystems With Control Sharing Aditya Mahajan, Member, IEEE Abstract Subsystems that

More information

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 35, No., May 010, pp. 84 305 issn 0364-765X eissn 156-5471 10 350 084 informs doi 10.187/moor.1090.0440 010 INFORMS On the Power of Robust Solutions in Two-Stage

More information

Robustness of policies in Constrained Markov Decision Processes

Robustness of policies in Constrained Markov Decision Processes 1 Robustness of policies in Constrained Markov Decision Processes Alexander Zadorojniy and Adam Shwartz, Senior Member, IEEE Abstract We consider the optimization of finite-state, finite-action Markov

More information

Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes

Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes Andrew Mastin and Patrick Jaillet Abstract We analyze losses resulting from uncertain transition probabilities in Markov

More information

arxiv: v3 [math.oc] 25 Apr 2018

arxiv: v3 [math.oc] 25 Apr 2018 Problem-driven scenario generation: an analytical approach for stochastic programs with tail risk measure Jamie Fairbrother *, Amanda Turner *, and Stein W. Wallace ** * STOR-i Centre for Doctoral Training,

More information

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017 The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course Value Iteration: the Idea 1. Let V 0 be any vector in R N A. LAZARIC Reinforcement

More information

Optimal Stopping Problems

Optimal Stopping Problems 2.997 Decision Making in Large Scale Systems March 3 MIT, Spring 2004 Handout #9 Lecture Note 5 Optimal Stopping Problems In the last lecture, we have analyzed the behavior of T D(λ) for approximating

More information

OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS

OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS Xiaofei Fan-Orzechowski Department of Applied Mathematics and Statistics State University of New York at Stony Brook Stony

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Tube Model Predictive Control Using Homothety & Invariance

Tube Model Predictive Control Using Homothety & Invariance Tube Model Predictive Control Using Homothety & Invariance Saša V. Raković rakovic@control.ee.ethz.ch http://control.ee.ethz.ch/~srakovic Collaboration in parts with Mr. Mirko Fiacchini Automatic Control

More information

An iterative procedure for constructing subsolutions of discrete-time optimal control problems

An iterative procedure for constructing subsolutions of discrete-time optimal control problems An iterative procedure for constructing subsolutions of discrete-time optimal control problems Markus Fischer version of November, 2011 Abstract An iterative procedure for constructing subsolutions of

More information

Multi-Attribute Bayesian Optimization under Utility Uncertainty

Multi-Attribute Bayesian Optimization under Utility Uncertainty Multi-Attribute Bayesian Optimization under Utility Uncertainty Raul Astudillo Cornell University Ithaca, NY 14853 ra598@cornell.edu Peter I. Frazier Cornell University Ithaca, NY 14853 pf98@cornell.edu

More information

ASIGNIFICANT research effort has been devoted to the. Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach

ASIGNIFICANT research effort has been devoted to the. Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 42, NO 6, JUNE 1997 771 Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach Xiangbo Feng, Kenneth A Loparo, Senior Member, IEEE,

More information

Lecture 6: Conic Optimization September 8

Lecture 6: Conic Optimization September 8 IE 598: Big Data Optimization Fall 2016 Lecture 6: Conic Optimization September 8 Lecturer: Niao He Scriber: Juan Xu Overview In this lecture, we finish up our previous discussion on optimality conditions

More information

A Tighter Variant of Jensen s Lower Bound for Stochastic Programs and Separable Approximations to Recourse Functions

A Tighter Variant of Jensen s Lower Bound for Stochastic Programs and Separable Approximations to Recourse Functions A Tighter Variant of Jensen s Lower Bound for Stochastic Programs and Separable Approximations to Recourse Functions Huseyin Topaloglu School of Operations Research and Information Engineering, Cornell

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

MFM Practitioner Module: Risk & Asset Allocation. John Dodson. January 25, 2012

MFM Practitioner Module: Risk & Asset Allocation. John Dodson. January 25, 2012 MFM Practitioner Module: Risk & Asset Allocation January 25, 2012 Optimizing Allocations Once we have 1. chosen the markets and an investment horizon 2. modeled the markets 3. agreed on an objective with

More information

Mathematical Optimization Models and Applications

Mathematical Optimization Models and Applications Mathematical Optimization Models and Applications Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 1, 2.1-2,

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Theory in Model Predictive Control :" Constraint Satisfaction and Stability!

Theory in Model Predictive Control : Constraint Satisfaction and Stability! Theory in Model Predictive Control :" Constraint Satisfaction and Stability Colin Jones, Melanie Zeilinger Automatic Control Laboratory, EPFL Example: Cessna Citation Aircraft Linearized continuous-time

More information

Adaptive Nonlinear Model Predictive Control with Suboptimality and Stability Guarantees

Adaptive Nonlinear Model Predictive Control with Suboptimality and Stability Guarantees Adaptive Nonlinear Model Predictive Control with Suboptimality and Stability Guarantees Pontus Giselsson Department of Automatic Control LTH Lund University Box 118, SE-221 00 Lund, Sweden pontusg@control.lth.se

More information

SOME RESOURCE ALLOCATION PROBLEMS

SOME RESOURCE ALLOCATION PROBLEMS SOME RESOURCE ALLOCATION PROBLEMS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

More information

Average Reward Parameters

Average Reward Parameters Simulation-Based Optimization of Markov Reward Processes: Implementation Issues Peter Marbach 2 John N. Tsitsiklis 3 Abstract We consider discrete time, nite state space Markov reward processes which depend

More information

A Geometric Characterization of the Power of Finite Adaptability in Multistage Stochastic and Adaptive Optimization

A Geometric Characterization of the Power of Finite Adaptability in Multistage Stochastic and Adaptive Optimization MATHEMATICS OF OPERATIONS RESEARCH Vol. 36, No., February 20, pp. 24 54 issn 0364-765X eissn 526-547 360 0024 informs doi 0.287/moor.0.0482 20 INFORMS A Geometric Characterization of the Power of Finite

More information

Information Structures, the Witsenhausen Counterexample, and Communicating Using Actions

Information Structures, the Witsenhausen Counterexample, and Communicating Using Actions Information Structures, the Witsenhausen Counterexample, and Communicating Using Actions Pulkit Grover, Carnegie Mellon University Abstract The concept of information-structures in decentralized control

More information

On optimal quadratic Lyapunov functions for polynomial systems

On optimal quadratic Lyapunov functions for polynomial systems On optimal quadratic Lyapunov functions for polynomial systems G. Chesi 1,A.Tesi 2, A. Vicino 1 1 Dipartimento di Ingegneria dell Informazione, Università disiena Via Roma 56, 53100 Siena, Italy 2 Dipartimento

More information

Nonlinear L 2 -gain analysis via a cascade

Nonlinear L 2 -gain analysis via a cascade 9th IEEE Conference on Decision and Control December -7, Hilton Atlanta Hotel, Atlanta, GA, USA Nonlinear L -gain analysis via a cascade Peter M Dower, Huan Zhang and Christopher M Kellett Abstract A nonlinear

More information

Lecture 1. Stochastic Optimization: Introduction. January 8, 2018

Lecture 1. Stochastic Optimization: Introduction. January 8, 2018 Lecture 1 Stochastic Optimization: Introduction January 8, 2018 Optimization Concerned with mininmization/maximization of mathematical functions Often subject to constraints Euler (1707-1783): Nothing

More information