Alleviating tuning sensitivity in Approximate Dynamic Programming


Paul Beuchat, Angelos Georghiou and John Lygeros

This research was partially funded by the European Commission under the project Local4Global. All authors are with the Automatic Control Laboratory, ETH Zürich, Switzerland. beuchatp@control.ee.ethz.ch

Abstract— Approximate Dynamic Programming offers benefits for large-scale systems compared to other synthesis and control methodologies. A common technique for approximating the Dynamic Program is to solve the corresponding Linear Program. The major drawback of this approach is that the online performance is very sensitive to the choice of tuning parameters, in particular the state relevance weighting. Our work aims at alleviating this sensitivity. To achieve this, we propose a point-wise maximum of multiple Q-functions for the online policy, and show that this immunizes against tuning errors in the parameter selection process. We formulate the resulting problem as a convex optimization problem and demonstrate the effectiveness of the approach using a stylized portfolio optimization problem. The approach offers a benefit for large-scale systems where the cost of a parameter tuning process is prohibitively high.

I. INTRODUCTION

Stochastic optimal control provides a framework to describe many challenges across the field of engineering. The objective is to find a policy for decision making that optimizes the performance of the dynamical system under consideration. Dynamic Programming (DP) provides a method to solve stochastic optimal control problems, for which the key is to solve the Bellman equation [1]. Although a powerful result, computing an exact solution to the Bellman equation, called the optimal cost-to-go function, is in general intractable and inevitably leads to the curse of dimensionality [2]. Approximate Dynamic Programming (ADP) is a term covering methods that attempt to approximate the solution of the Bellman equation, [3], [4]. In particular, the Linear Programming (LP) approach to ADP, first suggested in 1985 [5], introduces a set of parameters that need to be selected by the practitioner and that strongly affect the quality of the solution. In this paper, we address this sensitivity by proposing a systematic way to partially immunize the solution quality against poor choices of the tuning parameters.

The LP approach to ADP is stated as follows: given a set of basis functions, find a linear combination of them that best approximates the optimal cost-to-go function, called the approximate value function. Based on this, the online policy is the one-step minimization of the approximate value function, called the approximate greedy policy. If a set of basis functions and coefficients of the linear combination can be found that closely approximate the optimal cost-to-go function, then the method can be a very powerful tool. For example, the LP approach enjoyed notable success in the applications of playing backgammon [6], elevator scheduling [7], and stochastic reachability problems [8]. However, these examples required significant trial-and-error tuning in order to find a suitable choice of basis functions and the best linear combination. For other applications the trial-and-error work involved in tuning prohibits the use of this method. Hence, alleviating the tuning effort will expand the scope of applications for the LP approach.
Despite the rich choice of basis functions, see [9] and [10], choosing the optimal coefficients of the linear combination remains a difficult problem. The key parameter used throughout the literature to tune the coefficients is called the state relevance weighting. This tuning parameter specifies which regions of the state space are important for the approximation. However, the regions of importance depend on the behaviour of the system when the approximate greedy policy is played online, and the policy in turn depends on the choice of the state relevance weighting. This circular dependence of the tuning parameter leads to the difficulties experienced. Different approaches have been suggested for tuning the state relevance weighting. In [11] the authors use the initial distribution as the state relevance weighting. Although a natural choice, it leads to poor online performance if the system evolves to regions of the state space different from the initial distribution. The authors of [12] eliminate the state relevance weighting from the formulation at the expense of increased complexity when evaluating the online policy. Their approach is a variant of Model Predictive Control and hence faces difficulties similar to those researched in that field [13].

The contributions of this paper are twofold. First, we propose a policy that allows the practitioner to choose multiple state relevance weightings and that automatically leverages the best performance of each without requiring trial-and-error tuning. Second, we provide bounds to guarantee that our proposed approach will perform at least as well as any individual choice of the tuning parameter. Finally, we show through numerical examples that the proposed policy immunizes against poor choices of the state relevance weighting. Our proposed approach extends the point-wise maximum of approximate value functions suggested in [11], and uses the Q-function formulation, see [14], to reduce the computational burden of the online policy.

The structure of this paper is as follows. In Section II we present the DP formulation considered, and in Section III we present our proposed policy using the Value function formulation of the LP approach to DP. This motivates Section IV, where we use the Q-function formulation to propose a tractable, point-wise maximum, greedy policy.

Section IV also provides performance guarantees for the computed solution. In Section V we demonstrate the performance of the proposed approach, and we conclude in Section VI.

Notation: R_+ is the space of non-negative scalars; Z_+ is the space of positive integers; S^n is the space of n × n real symmetric matrices; I_n is the n × n identity matrix; 1_n is the vector of ones of size n; (·)^⊤ is the matrix transpose; given f : X → R, the infinity norm is ‖f‖_∞ = sup_{x∈X} |f(x)|, and the weighted 1-norm is ‖f‖_{1,c} = ∫_X |f(x)| c(dx). The term intractable is used throughout the paper. We loosely define intractable to mean that the computational burden of any existing solution method prohibits finding a solution in reasonable time.

II. DYNAMIC PROGRAMMING (DP) FORMULATION

This section introduces the problem formulation and states the DP as the solution to the Bellman equation. We consider infinite horizon, discounted cost, stochastic optimal control problems. The system is described by discrete-time dynamics over continuous state and action spaces. The state of the system at time t is x_t ∈ X ⊆ R^{n_x}. The system state is influenced by the control decisions u_t ∈ U ⊆ R^{n_u} and the stochastic disturbance ξ_t ∈ Ξ ⊆ R^{n_ξ}. In this setting, the state evolves according to the function g : X × U × Ξ → X as

  x_{t+1} = g(x_t, u_t, ξ_t).

At time t, the system incurs the stage cost γ^t l(x_t, u_t), where γ ∈ [0,1) is the discount factor, and the objective is to minimize the infinite sum of the stage costs. The optimal Value function, V* : X → R, characterizes the solution of this stochastic optimal control problem. It represents the cost-to-go from any state of the system if the optimal control policy is played. The optimal Value function is the solution of the Bellman equation [1],

  V*(x) = min_{u∈U} { l(x,u) + γ E[ V*(g(x,u,ξ)) ] } =: (T V*)(x),  (1)

for all x ∈ X, where T is the Bellman operator, and the minimand Q*(x,u) := l(x,u) + γ E[ V*(g(x,u,ξ)) ] is the optimal Q-function, Q* : (X × U) → R. The Q-function represents the cost of making decision u now and then playing optimally from the next time step onward. The optimal control actions are generated via the greedy policy:

  π*(x) = arg min_{u∈U} l(x,u) + γ E[ V*(g(x,u,ξ)) ] = arg min_{u∈U} Q*(x,u).  (2)

The Bellman equation (1) can be equivalently written in terms of Q* as follows:

  Q*(x,u) = l(x,u) + γ E[ min_{v∈U} Q*(g(x,u,ξ), v) ] =: (F Q*)(x,u),  (3)

for all x ∈ X and u ∈ U. Equation (3) defines the F-operator, the equivalent of T for Q-functions. The operators T and F are both monotone and γ-contractive, see [14].

Solving (1) exactly is only tractable under strong assumptions on the problem structure, namely unconstrained Linear Quadratic Gaussian problems [15]. In other cases, the popular LP approach to ADP can be used to approximate the solution of (1). This method is presented in the next sections.
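To make the operators T and F concrete before moving to the LP approach, the following minimal sketch (an illustration, not part of the original paper) applies the F-operator of (3) to a tabular Q-function on a coarsely discretized scalar example; the dynamics g(x,u,ξ) = x + u + ξ, the quadratic stage cost, the grids, and the sample sizes are all assumptions made purely for illustration. Iterating F repeatedly is simply Q-value iteration, and its convergence reflects the γ-contraction property noted above.

import numpy as np

# Illustrative toy problem (an assumption, not the paper's example):
# dynamics g(x,u,xi) = x + u + xi, stage cost l(x,u) = x^2 + u^2, discount gamma.
gamma = 0.9
X  = np.linspace(-5.0, 5.0, 41)                              # discretized state grid
U  = np.linspace(-1.0, 1.0, 11)                              # discretized input grid
xi = np.random.default_rng(0).normal(0.0, 0.2, 50)           # disturbance samples for E[.]

def F_operator(Q):
    """(FQ)(x,u) = l(x,u) + gamma * E[ min_v Q(g(x,u,xi), v) ], cf. (3)."""
    Q_new = np.empty_like(Q)
    min_Q = Q.min(axis=1)                                    # min over v at each grid state
    for i, x in enumerate(X):
        for j, u in enumerate(U):
            x_next = x + u + xi                              # sampled successor states
            idx = np.abs(X[:, None] - x_next[None, :]).argmin(axis=0)   # nearest grid points
            Q_new[i, j] = x**2 + u**2 + gamma * min_Q[idx].mean()
    return Q_new

Q = np.zeros((len(X), len(U)))
for _ in range(200):                                         # Q-value iteration: gamma-contraction
    Q = F_operator(Q)
greedy_u = U[Q.argmin(axis=1)]                               # greedy policy (2) read off the grid

On such a small grid the greedy policy can be read off directly; for the continuous state and action spaces considered in this paper exactly this becomes intractable, which motivates the LP approach developed next.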
III. VALUE FUNCTION APPROACH TO ADP

This section presents a method to obtain an approximation of V* through the solution of an LP. This is done by approximating the so-called exact LP, whose solution is V*. We highlight the sensitivity of approximate solutions to the tuning parameters introduced, and then propose a policy that immunizes against this sensitivity.

A. Iterated Bellman Inequality and the Exact LP

Equation (1) is relaxed to the iterated Bellman inequality,

  V(x) ≤ (T^M V)(x), ∀ x ∈ X,  (4)

for some M ∈ Z_+, where T^M denotes M applications of the Bellman operator. As shown in [11], any V satisfying (4) will be a point-wise under-estimator of V* over the set X.

The exact LP associated with (1) is formulated as follows:

  max_V  ∫_X V(x) c(dx)
  s.t.   V ∈ F(X),  V(x) ≤ (T^M V)(x), ∀ x ∈ X.  (5)

As shown in [16, Section 6.3], the solutions of (1) and (5) coincide when F(X) is the function space of real-valued measurable functions on X with finite weighted 1-norm, and c(·) is any finite measure on X that assigns positive mass to all open subsets of X. Taking M = 1 here and in the subsequent analysis corresponds to the formulation originally proposed in [17]. Although (1) and (5) are equivalent for all M ∈ Z_+, the benefit becomes apparent after the approximation is made. As explained in [11], problem (5) with M > 1 has a larger feasible region than with M = 1.

Solving (5) for V*, and implementing (2), is in general intractable. The difficulties can be categorized as follows:
(D1) F(X) is an infinite dimensional function space;
(D2) problem (5) involves an infinite number of constraints;
(D3) the multidimensional integral in the objective of (5);
(D4) the multidimensional integral over the disturbance in the Bellman operator T and in the greedy policy (2);
(D5) for arbitrary V ∈ F(X), the greedy policy (2) may be intractable.
Thus, methods that exactly solve (5) and (2) will suffer from the curse of dimensionality in at least one of these aspects, see [18, Section 2]. To gain computational tractability, in the following we restrict the function space F(X) so as to simultaneously overcome (D1)–(D5).

B. The Approximate LP

As suggested in [5], we restrict the admissible value functions to those that can be expressed as a linear combination of basis functions. In particular, given basis functions ˆV^{(i)}(x) : R^{n_x} → R, we parameterize a restricted function space as

  ˆF(X) = { ˆV(·) | ˆV(x) = Σ_{i=1}^K α_i ˆV^{(i)}(x) } ⊆ F(X),

for some α_i ∈ R. Hence an element of the set is specified by a set of α_i's. An approximate solution to (5) can be obtained through the solution of the following approximate LP:

  max_{ˆV}  ∫_X ˆV(x) c(dx)
  s.t.      ˆV ∈ ˆF(X),  ˆV(x) ≤ (T^M ˆV)(x), ∀ x ∈ X,  (6)

where the optimization variables are the α_i's in the definition of ˆF(X). The only change from (5) is that F(X) is replaced by ˆF(X). The iterated Bellman inequality is not a convex constraint on the optimization variables. As presented in [11, Section 3.4], it can be replaced by a constraint that is convex in the α_i's and implies the iterated Bellman inequality.

Difficulty (D1) has been overcome in problem (6), as ˆF(X) is parameterized by a finite dimensional decision variable. However, difficulties (D2)–(D4) are still present and are overcome by matching the choice of basis functions to the problem instance. The details of choosing the basis functions are omitted and the reader is referred to the following examples for guidance. The space of quadratic functions overcomes (D2)–(D4) for constrained LQG problems, see [11] for the details of the S-lemma procedure used to reformulate (6). For problems with polynomial dynamics, costs, and constraints, see [10], where sums-of-squares techniques and polynomial basis functions are used. In [8], radial basis functions are used to approximate stochastic reachability problems. Piece-wise constant approximate Value functions are used in [19] to address a perimeter surveillance control problem. Sampling based alternatives for overcoming (D2) are suggested in [20] and [21].

Let ˆV* denote the optimizer of problem (6). Then a natural choice for the online policy is

  ˆπ(x) = arg min_{u∈U} l(x,u) + γ E[ ˆV*(g(x,u,ξ)) ],  (7)

called an approximate greedy policy. Unless ˆV* is restricted to be convex when solving (6), difficulty (D5) will still be present. If convexity of ˆV* is not enforced, results from global polynomial optimization [22] may assist. The policy we propose in Section III-D requires that the ˆV* are convex, and hence that a convexity constraint is added to (6).

C. Choice of the weighting c(·)

As discussed in [16], the choice of c(·) does not affect problem (5). Intuitively speaking, the reason is that the space F(X) is rich enough to satisfy V(x) ≤ (T^M V)(x) with equality, point-wise for all x ∈ X. In contrast, once the restriction ˆF(X) ⊆ F(X) is made this is no longer true. The choice of c(·), referred to as the state relevance weighting, provides a trade-off between elements of ˆF(X) over the set of states. Thus, c(·) is a tuning parameter of the approximate LP and influences the optimizer ˆV*.

A good approximation of the value function should achieve near optimal online performance when it replaces V* in the greedy policy. Intuitively, we see from (7) that the online policy depends on the gradient of the approximate value function: two value functions that differ by a constant will make identical decisions. However, the approximate LP finds the closest fit to V* relative to the choice of c(·) and does not attempt to match the gradient of V*. We now provide the intuition behind the approach proposed in the next subsection.
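Before turning to that intuition, the following sketch illustrates one way problem (6) could be posed numerically for a single fixed choice of c(·). It uses convex quadratic basis functions, M = 1, sample averages for the objective integral and for the expectation over ξ, and sampled (x,u) pairs for the Bellman inequality in the spirit of the constraint-sampling approaches of [20], [21], rather than the exact S-lemma reformulation of [11]. The dynamics, cost, and sample sizes are illustrative assumptions, not the paper's setup.

import numpy as np
import cvxpy as cp

# Illustrative problem data (assumptions, not the paper's example).
n_x, n_u, gamma = 2, 1, 0.95
rng = np.random.default_rng(1)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
def g(x, u, xi): return A @ x + B @ u + xi          # linear dynamics
def l(x, u):     return float(x @ x + u @ u)        # quadratic stage cost

# Restricted function space: convex quadratics V(x) = x'Px + p'x + s.
P = cp.Variable((n_x, n_x), symmetric=True)
p = cp.Variable(n_x)
s = cp.Variable()
def V_hat(x):                                       # affine in (P, p, s) for a numeric x
    return cp.sum(cp.multiply(P, np.outer(x, x))) + p @ x + s

xs_c  = rng.normal(0.0, 1.0, (100, n_x))            # samples drawn from the weighting c(.)
pairs = [(rng.normal(0.0, 2.0, n_x), rng.uniform(-1.0, 1.0, n_u)) for _ in range(150)]
xis   = rng.normal(0.0, 0.1, (20, n_x))             # disturbance samples for the expectation

constraints = [P >> 0]                               # convexity, needed for the point-wise max policy
for x, u in pairs:
    # Bellman inequality V(x) <= l(x,u) + gamma*E[V(g(x,u,xi))], enforced at sampled (x,u) pairs.
    EV_next = sum(V_hat(g(x, u, xi)) for xi in xis) / len(xis)
    constraints.append(V_hat(x) <= l(x, u) + gamma * EV_next)

objective = cp.Maximize(sum(V_hat(x) for x in xs_c) / len(xs_c))
cp.Problem(objective, constraints).solve()           # needs an SDP-capable solver, e.g. SCS
P_j, p_j, s_j = P.value, p.value, s.value            # one member V_hat_j of the family for this c(.)

Solving this for several different weightings c_j(·) yields the family of approximate value functions used in the point-wise maximum construction below.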
Consider two choices of the state relevance weighting, c_A(·) and c_B(·), that separately place weight on narrow, disjoint regions of the state space, denoted A and B, and zero weight elsewhere. For each choice, the solution of (6), ˆV_A and ˆV_B respectively, will be the closest under-estimator of V* over the respective region. On region A, ˆV_B will be lower than ˆV_A, otherwise it would not be the solution of (6), and the reverse holds on region B. Thus, if we construct an approximate Value function as the point-wise maximum of ˆV_A and ˆV_B, it is expected to give the best estimate of V*. Finally, by fitting V* closely over a larger region, it is expected that the gradient approximation will be improved. In this way, our proposed point-wise maximum approach immunizes against the tuning errors that occur when choosing a single c(·).

D. Point-wise maximum Value function and policy

We solve problem (6) for several choices of c(·), and denote by ˆV_j the solution for a corresponding c_j(·). Letting J denote an index set with elements j, we define the point-wise maximum Value function as follows:

  ˆV_pwm(x) := max_{j∈J} { ˆV_j(x) }, ∀ x ∈ X.

Problem (6) ensures that each ˆV_j is a point-wise under-estimator of V*. Hence ˆV_pwm is a better under-estimator of V* in the following sense:

  V*(x) − ˆV_pwm(x) ≤ V*(x) − ˆV_j(x), ∀ x ∈ X, ∀ j ∈ J.

The natural choice for the online policy is now to use ˆV_pwm in the approximate greedy policy, i.e.,

  ˆπ(x) = arg min_{u∈U} l(x,u) + γ E[ max_{j∈J} { ˆV_j(g(x,u,ξ)) } ].  (8)

However, the point-wise maximum value function re-introduces difficulty (D4): as ˆV_pwm ∉ ˆF(X), evaluating (8) may not be tractable. The difficulty in (8) is that evaluating the expectation over the disturbance requires Monte Carlo sampling, and this makes the optimization over u prohibitively slow. Exchanging the expectation and maximization in (8) circumvents this difficulty, and leads to

  ˆπ(x) = arg min_{u∈U} l(x,u) + γ max_{j∈J} { E[ ˆV_j(g(x,u,ξ)) ] },  (9)

a sketch of whose evaluation is given below.
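The sketch below illustrates how (9) can be evaluated in practice when the ˆV_j are convex quadratics, the dynamics are affine in u, and the expectation is replaced by a fixed sample average; the maximum over j is handled with an epigraph variable. The fitted coefficients (P_j, p_j, s_j), the box description of U, and the helper arguments are hypothetical, so this is an illustration under those assumptions rather than the paper's implementation.

import numpy as np
import cvxpy as cp

def pwm_value_policy(x, V_list, stage_cost, g_affine, xis, U_box, gamma=0.95):
    """Evaluate policy (9): argmin_u stage_cost(x,u) + gamma * max_j (1/S) sum_s V_j(g(x,u,xi_s)).
    V_list holds hypothetical fitted coefficients (P_j, p_j, s_j) with P_j PSD, so that
    V_j(z) = z'P_j z + p_j'z + s_j is convex; g_affine(x, u, xi) must be affine in u."""
    u = cp.Variable(U_box.shape[0])
    t = cp.Variable()                                   # epigraph variable for the max over j
    cons = [u >= U_box[:, 0], u <= U_box[:, 1]]         # box description of U (an assumption)
    for (P_j, p_j, s_j) in V_list:
        samples = []
        for xi in xis:
            z = g_affine(x, u, xi)                      # affine in u, so quad_form(z, P_j) is convex
            samples.append(cp.quad_form(z, P_j) + z @ p_j + s_j)
        cons.append(sum(samples) / len(xis) <= t)       # sample average of E[V_j(g(x,u,xi))]
    cp.Problem(cp.Minimize(stage_cost(x, u) + gamma * t), cons).solve()
    return u.value

# e.g. for the portfolio model of Section V: g_affine = lambda x, u, xi: cp.multiply(xi, x + u)

Note that the disturbance samples enter every constraint, which is precisely the online burden that the Q-function policy of the next section avoids.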

This approximation induces a tractable reformulation, and by Jensen's inequality (9) is still a lower bound. A similar approach was proposed in [12] in the context of min-max approximate dynamic programming. It is not clear, however, how the exchange will affect the performance of the approximate greedy policy. In the next section we propose an alternative formulation in terms of Q-functions that alleviates the need to use Jensen's inequality.

IV. Q-FUNCTION APPROACH TO ADP

In this section, we alternatively define the greedy policy using Q-functions instead of Value functions. We will show that the resulting greedy policy does not suffer from difficulty (D4). Additionally, the greedy policy can be efficiently computed when using a point-wise maximum of approximate Q-functions. Finally, in Section IV-D, we provide error bounds for point-wise maximum Q-functions.

A. Iterated F-operator Inequality and the Approximate LP

The Bellman equation for the Q-function formulation, (3), is relaxed to the iterated F-operator inequality,

  Q(x,u) ≤ (F^M Q)(x,u), ∀ x ∈ X, u ∈ U,  (10)

for some M ∈ Z_+ being the number of iterations, where F^M denotes M applications of the F-operator. As shown in [14], any Q satisfying (10) will be a point-wise under-estimator of Q* over the set (X × U).

An exact LP reformulation of (3) is analogous to (5) and also requires optimization over an infinite dimensional function space. For brevity we omit this formulation and move directly to the function approximation. Similar to Section III-B, we restrict the admissible Q-functions to those that can be expressed as a linear combination of basis functions. In particular, given basis functions ˆQ^{(i)}(x,u) : R^{n_x} × R^{n_u} → R, we parameterize a restricted function space as

  ˆF(X × U) = { ˆQ(·,·) | ˆQ(x,u) = Σ_{i=1}^K α_i ˆQ^{(i)}(x,u) },

for some α_i ∈ R. Using ˆF(X × U), an approximate solution to (3) is obtained through the solution of the following approximate LP:

  max_{ˆQ}  ∫_{X×U} ˆQ(x,u) c(d(x,u))
  s.t.      ˆQ ∈ ˆF(X × U),  ˆQ(x,u) ≤ (F^M ˆQ)(x,u), ∀ x ∈ X, u ∈ U,  (11)

where the optimization variables are the α_i's in the definition of ˆF(X × U); hence the LP is finite dimensional. The weighting parameter in the objective, c(·,·), needs to be defined over the (X × U) space. Again, the iterated F-operator inequality is not a convex constraint on the optimization variables. However, it can be replaced by a constraint that is convex in the α_i's and implies that the iterated F-operator inequality is satisfied. The reformulation, which combines the reformulations found in [11] and [14], is presented in Appendix II.

The difficulties (D1)–(D5), described for the Value function formulation, apply equally to the Q-function formulation. Similar to the discussion in Section III-B, difficulties (D2)–(D4) remain present in (11). Overcoming (D2)–(D4) requires the basis functions of ˆF(X × U) to be chosen appropriately for a particular problem instance. The quadratic, polynomial, and radial basis functions described in [11], [10], and [8] for the Value function formulation can be used equivalently for the Q-function formulation.

Let ˆQ* denote the optimizer of problem (11). Then a natural choice for an approximate greedy policy is

  ˆπ(x) = arg min_{u∈U} ˆQ*(x,u).  (12)

Overcoming difficulty (D5) for Q-functions requires that ˆQ* is restricted to be convex in u when solving (11).

B. Choice of the weighting c(·,·)

The weighting in the objective of (11), c(·,·), is analogous to the weighting in the Value function formulation: it influences the solution of the approximate LP and is difficult to choose.
In general, it is not possible to find a c(·,·) such that the associated ˆQ* approximates Q* equally well over the whole state-by-input space. It is not clear how one would choose c(·,·) such that ˆQ* provides: (i) a tight under-estimate of Q*, and (ii) near optimal online performance. Next, we introduce the point-wise maximum Q-function, which removes the need to choose a single c(·,·) by combining the best fit over multiple approximate Q-functions.

C. Point-wise maximum Q-functions and policy

We propose the point-wise maximum of Q-functions as a method to alleviate the burden of tuning a single weighting parameter. We will show that: (i) the proposed method induces a computationally tractable policy; and (ii) a better under-estimator results in a tighter lower bound on Q*, and potentially better performance of the online policy. We now define the point-wise maximum Q-function. To this end, we solve problem (11) for several choices of c(·,·) and denote by ˆQ_j the solution for a corresponding c_j(·,·). Letting J denote an index set, ˆQ_pwm is defined as

  ˆQ_pwm(x,u) := max_{j∈J} { ˆQ_j(x,u) },  (13)

point-wise for all x ∈ X and u ∈ U. Replacing Q* by ˆQ_pwm in equation (12) leads to the approximate greedy policy

  ˆπ(x) = arg min_{u∈U} max_{j∈J} { ˆQ_j(x,u) }.  (14)

The advantage of problem (14) compared to (8) is that it does not involve the multidimensional integration over the disturbance, and hence avoids the re-introduction of difficulty (D4). Thus, (14) is equivalently reformulated as

  ˆπ(x) = arg min_{u∈U, t} t   s.t.  ˆQ_j(x,u) ≤ t, ∀ j ∈ J.  (15)

This reformulation is a convex optimization program if each ˆQ_j is restricted to be convex in u when solving (11).
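For concreteness, a minimal sketch of evaluating (15) when each ˆQ_j is a convex-in-u quadratic of the form used later in Section V (block coefficients P_xx, P_xu, P_uu, p_x, p_u, s, with P_uu positive semidefinite) might look as follows; the box description of U and all names are illustrative assumptions. In contrast to the sketch given for (9), no disturbance samples are needed online.

import numpy as np
import cvxpy as cp

def pwm_q_policy(x, Q_list, U_box):
    """Evaluate the point-wise maximum greedy policy (15):
         min_{u,t} t   s.t.   Q_j(x,u) <= t for all j in J,   u in U.
    Each Q_j(x,u) = x'Pxx x + 2 x'Pxu u + u'Puu u + px'x + pu'u + s with Puu PSD,
    so every constraint is a convex quadratic in u and the problem is a convex QCQP."""
    u = cp.Variable(U_box.shape[0])
    t = cp.Variable()
    cons = [u >= U_box[:, 0], u <= U_box[:, 1]]          # box description of U (an assumption)
    for (Pxx, Pxu, Puu, px, pu, s) in Q_list:
        const = float(x @ Pxx @ x + px @ x + s)          # terms independent of u
        lin   = 2.0 * (x @ Pxu) + pu                     # coefficients of the terms linear in u
        cons.append(const + lin @ u + cp.quad_form(u, Puu) <= t)
    cp.Problem(cp.Minimize(t), cons).solve()
    return u.value                                       # called once per time step online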

The numerical example presented in Section V uses convex quadratics as the basis function space. This means that ˆQ_pwm is a convex piece-wise quadratic function, and solving (15) is a convex Quadratically Constrained Quadratic Program (QCQP).

D. Fitting bound for the approximate Q-function

We now provide a bound on how closely a solution of (11) approximates Q*. This result allows us to show that ˆQ_pwm will provide a better estimate of Q* than the ˆQ_j from which it is composed. Combining the ideas from [11] and [14], we prove the following theorem.

Theorem 4.1: Let Q* be the solution of (3) and let ˆQ* be the solution of (11) for a given choice of ˆF(X × U) and c(·,·). Then the following bound holds:

  ‖Q* − ˆQ*‖_{1,c(x,u)} ≤ (2 / (1 − γ^M)) min_{ˆQ ∈ ˆF(X×U)} ‖Q* − ˆQ‖_∞.  (16)

Proof: See Appendix I.

The theorem says that when Q* is close to the span of the restricted function space, then the under-estimator ˆQ* will also be close to Q*. In fact, the bound indicates that if Q* ∈ ˆF(X × U), the approximate LP will recover ˆQ* = Q*. Notice that Theorem 4.1 holds for any choice of ˆF(X × U) and any choice of c(·,·).

We now argue that Theorem 4.1 applies also to ˆQ_pwm as defined in (13). This will allow us to conclude that ˆQ_pwm provides a better estimate of Q*. A valid choice of the restricted function space is the following point-wise maximum of N approximate Q-functions:

  ˆF^N_pwm(X × U) = { ˆQ(·,·) | ˆQ(x,u) = max_{k=1,...,N} { ˆQ_k(x,u) }, ˆQ_k(x,u) ∈ ˆF(X × U) }.

Difficulties (D1)–(D4) still exist if one attempts to solve (11) using ˆF^N_pwm(X × U) as the restricted function space. However, the bound given in Theorem 4.1 applies regardless. Note that the right-hand side of the bound in Theorem 4.1 depends only on the restricted function space and not on the solution of the approximate LP. Therefore, ˆF(X × U) ⊆ ˆF^N_pwm(X × U) implies that the minimization on the right-hand side of the bound has a larger feasible region, and hence achieves a tighter bound, for the point-wise maximum function space.

For the theorem to apply to our choice of ˆQ_pwm in (13), we must show that it is feasible for the approximate LP with ˆF^N_pwm(X × U) as the restricted function space. This is achieved by the following lemma.

Lemma 4.2: Let { ˆQ_j }_{j∈J} be Q-functions such that for each j ∈ J the following inequality holds:

  ˆQ_j(x,u) ≤ (F^M ˆQ_j)(x,u), ∀ x ∈ X, u ∈ U.

Then the function ˆQ_pwm(x,u), defined in (13), also satisfies the iterated F-operator inequality, i.e., ˆQ_pwm(x,u) ≤ (F^M ˆQ_pwm)(x,u).

Proof: See Appendix I.

Although we cannot tractably solve (11) with ˆF^N_pwm(X × U) as the restricted function space, Lemma 4.2 states that ˆQ_pwm is a feasible point of that problem. Therefore, there likely exists a choice of c(·,·) such that ˆQ_pwm is the solution. Theorem 4.1 states that under this choice of c(·,·), ˆQ_pwm approximates Q* per the bound given in the theorem.

We close this section by re-iterating the benefits of ˆQ_pwm compared to ˆV_pwm. Both give improved lower bounds, but ˆQ_pwm has the advantage that the policy can be implemented without introducing an additional approximation, as in the case of (8). This will be further demonstrated in the following numerical example.

V. NUMERICAL RESULTS

In this section, we present a numerical case study to highlight the benefits of the proposed point-wise maximum policies, see Sections III-D and IV-C. We use a dynamic portfolio optimization example taken directly from [12] to compare the online performance of both the Value function and Q-function formulations.
The model is briefly described here, using the same notation as [12, Section IV]. The task is to manage a portfolio of n assets with the objective of maximizing revenue over a discounted infinite horizon. The state of the system, x_t ∈ R^n, is the value of the assets in the portfolio, while the input, u_t ∈ R^n, is the amount of each asset to buy or sell. By convention, a negative value for an element of u_t means the respective asset is sold, while a positive value means it is purchased. The stochastic disturbance affecting the system is the return of the assets over a time period, denoted ξ_t ∈ R^n. Under the influence of u_t and ξ_t, the value of the portfolio evolves over time as

  x_{t+1} = diag(ξ_t) (x_t + u_t).

The dynamics represent an example of a linear system affected by multiplicative uncertainty. The transaction fees are parameterized by κ ∈ R_+ and R ∈ R^{n×n}, and hence the stage cost incurred at each time step is given by

  1_n^⊤ u_t + κ ‖u_t‖_1 + u_t^⊤ R u_t.

The first term represents the gross cash from purchases and sales, and the final two terms represent the transaction cost. As revenue corresponds to a negative cost, the objective is to minimize the discounted infinite sum of the stage costs. The discount factor γ represents the time value of money. A restriction is placed on the risk of the portfolio by enforcing the following constraint on the return variance over a time step:

  (x_t + u_t)^⊤ Σ̂ (x_t + u_t) ≤ l,

where l ∈ R_+ is the maximum variance allowed, and Σ̂ is the covariance of the uncertain return ξ_t. Table I lists the parameter values that we use in the numerical instance presented here.
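To make the online evaluation of this model concrete, the following sketch simulates the dynamics and accumulates the discounted stage cost for a given policy. The log-normal return model mirrors the parameters of Table I, while the function names and the tolerance on the risk check are assumptions for illustration, and the policy is assumed to already enforce the risk constraint (e.g. as an additional convex constraint in (15)).

import numpy as np

def simulate_portfolio(policy, x0, n_steps, gamma, kappa, R, Sigma_hat, l_risk, mu_log, Sigma_log, seed=0):
    """Roll out x_{t+1} = diag(xi_t)(x_t + u_t) and accumulate the discounted stage cost
       1'u + kappa*||u||_1 + u'Ru (negative revenue), as described above."""
    rng = np.random.default_rng(seed)
    n, x, cost = x0.size, x0.copy(), 0.0
    for t in range(n_steps):
        u = policy(x)                                        # e.g. the point-wise maximum Q policy
        post = x + u                                         # post-trade asset positions
        assert post @ Sigma_hat @ post <= l_risk + 1e-6      # risk (return-variance) constraint
        cost += gamma**t * (np.ones(n) @ u + kappa * np.abs(u).sum() + u @ R @ u)
        xi = np.exp(rng.multivariate_normal(mu_log, Sigma_log))   # log-normal asset returns
        x  = xi * post                                       # diag(xi)(x_t + u_t)
    return cost

Averaging many such rollouts gives the kind of online performance estimates reported below.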

TABLE I: Parameters used for the portfolio example
  n = 8;  γ = 0.96;  x_0 ~ N(0_n, 10 I_n);  log(ξ_t) ~ N(0_n, 0.5 I_n);  κ = 0.04;
  R = diag(0.028, 0.034, 0.020, 0.026, 0.023, 0.022, 0.024, 0.027).

We solved both the Value function and the Q-function approximate LP using quadratic basis functions. The equations for fitting the Value functions and Q-functions via (6) and (11) follow directly from [12, Equations (4)–(7)]. The quadratic basis functions used are parameterized as follows:

  ˆQ(x,u) = [x; u]^⊤ [ P_xx  P_xu ; P_xu^⊤  P_uu ] [x; u] + [p_x; p_u]^⊤ [x; u] + s_Q,
  ˆV(x) = x^⊤ P x + p^⊤ x + s_V,

where P, P_xx, P_uu ∈ S^n, P_xu ∈ R^{n×n}, p, p_x, p_u ∈ R^n, and s_V, s_Q ∈ R are the coefficients of the linear combinations in the basis function sets ˆF(X) and ˆF(X × U). A positive semi-definite constraint is enforced on P and P_uu to make the point-wise maximum policies convex and tractable. This restricts the Value functions to be convex and the Q-functions to be convex in u.

A family of 10 weighting parameters was used. One of the weighting parameters, denoted c_0(·), was chosen to be the same as the initial distribution over X, and represents the method suggested in [11]. The remaining 9 weighting parameters, denoted {c_j(·)}_{j=1}^{9}, were chosen to be low variance normal distributions centred at random locations in the state space. For the Q-function formulation, the same 10 weighting parameters were expanded with a fixed uniform distribution over U. For each c_j(·), problems (6) and (11) were solved to obtain ˆV_j and ˆQ_j, j = 1,...,9. The lower bound and online performance were computed for the individual ˆV_j and ˆQ_j, and also for ˆV_pwm and ˆQ_pwm with the point-wise maximum taken over the whole family. The lower bound was computed as the average over 2000 samples, while the online performance was averaged over 2000 simulations, each of length 300 time steps. The results are shown in Table II.

TABLE II: Lower bound and online performance
  Row 1:  Online, ˆV fitted with c_0
  Row 2:  Online, ˆQ fitted with c_0
  Row 3:  Online, ˆQ_j, j = 1,...,9: [−3.0, 42.8]
  Row 4:  Online, ˆV_j, j = 1,...,9: [−95.1, 42.8]
  Row 5:  Online, ˆV_pwm: 45.2
  Row 6:  Online, ˆQ_pwm: 45.5
  Row 7:  Lower bound, ˆQ_pwm: 65.1
  Row 8:  Lower bound, ˆQ fitted with c_0
  Row 9:  Lower bound, ˆQ_j, j = 1,...,9: [−568.4, 65.1]
  Row 10: Lower bound, ˆV_pwm: 65.8
  Row 11: Lower bound, ˆV fitted with c_0
  Row 12: Lower bound, ˆV_j, j = 1,...,9: [−220.7, 66.0]

The results show that when using a single choice of c(·), both the lower bound and the online performance can be arbitrarily bad. This is shown by the large range of values in rows 3, 4, 9, and 12 of Table II. This large range indicates that at least one of the weighting parameters was a poor choice. As the point-wise maximum function uses all 10 choices of c(·), it also includes this poor choice of the weighting parameter. We see from the results that the point-wise maximum function achieves the tightest lower bound and the best online performance. This highlights that the point-wise maximum function immunizes against the poor choice of the weighting parameter. Using the initial distribution as the weighting parameter gives reasonable results for this example, see rows 1, 2, 8, and 11 of Table II; thus, the suggestion of [11] is reasonable. The benefit of the point-wise maximum policy is that the practitioner can explore other choices of c(·) without risk of degraded performance. Finally, we note that in the best case the Q-functions perform slightly better than the Value functions, indicating that exchanging the expectation and maximization in (8) had little impact for this example. However, in the worst case, the Q-function performs very badly for a poor choice of the weighting parameter, highlighting further the importance of using our proposed approach to immunize against such sensitivity.
This example demonstrates the features and trends indicated by the theory.

VI. CONCLUSIONS

In this paper, we addressed the difficulty of tuning the state relevance weighting parameter in the Linear Programming approach to Approximate Dynamic Programming. We proposed an approximate greedy policy that alleviates the tuning sensitivity of previous methods by allowing for a family of parameters to be used and automatically choosing the best parameter at each time step. This is achieved by using a point-wise maximum of functions that individually under-estimate the optimal cost-to-go. We render the online policy tractable by using Q-functions. We proved that the proposed approach gives a satisfactory lower bound on the best achievable cost, and used a numerical example to demonstrate that the online performance is indeed immunized against poor choices of the weighting parameter. Future work will include improved theoretical bounds on the online performance of the policy and a deeper understanding of when the Value function or the Q-function approach is preferable.

APPENDIX I
PROOFS OF THEOREM 4.1 AND LEMMA 4.2

The proof requires two auxiliary lemmas that are presented first; we then present the proof of Theorem 4.1. Lemma 1.2 provides a point-wise bound on how much the M-iterated F-operator inequality is violated for any given Q-function. This is used in the proof of Lemma 1.3, which shows that, given a ˆQ ∈ ˆF(X × U), it can be shifted downwards by a certain constant amount to satisfy the iterated F-operator inequality. The constant by which it is shifted relates directly to the constant on the right-hand side of Theorem 4.1. We start by stating the monotone and contractive properties of the F-operator, which are needed in the proofs.

Proposition 1.1: The F-operator is (i) monotone and (ii) γ-contractive, i.e., for any given Q_1, Q_2 : (X × U) → R,

  (i)  Q_1(x,u) ≤ Q_2(x,u), ∀ x ∈ X, u ∈ U  ⟹  (F Q_1)(x,u) ≤ (F Q_2)(x,u), ∀ x ∈ X, u ∈ U;
  (ii) ‖F Q_1 − F Q_2‖_∞ ≤ γ ‖Q_1 − Q_2‖_∞.

These properties can be found in [14].

Lemma 1.2: Let M ∈ Z_+ and let Q : (X × U) → R be any Q-function. Then violations of the iterated F-operator inequality can be bounded as

  (F^M Q)(x,u) ≥ Q(x,u) − (1 + γ^M) ‖Q* − Q‖_∞,

for all x ∈ X, u ∈ U.

Proof: Starting from the terms not involving γ^M,

  Q(x,u) − ‖Q − Q*‖_∞ − (F^M Q)(x,u) ≤ Q*(x,u) − (F^M Q)(x,u) ≤ ‖F^M Q* − F^M Q‖_∞ ≤ γ^M ‖Q* − Q‖_∞,

for all x ∈ X, u ∈ U. The first inequality follows from the definition of the ∞-norm, and the second inequality comes from Q*(x,u) = (F Q*)(x,u) and the ∞-norm definition. Finally, the third inequality is due to the γ-contractive property of the F-operator. Re-arranging, the result follows.

Lemma 1.3: Let ˆQ(x,u) ∈ ˆF(X × U) be an arbitrary element of the basis function set, and let Q̃(x,u) be the Q-function defined as

  Q̃(x,u) = ˆQ(x,u) − ((1 + γ^M)/(1 − γ^M)) ‖Q* − ˆQ‖_∞,  (17)

where the second term is a constant downwards shift. Then Q̃(x,u) satisfies the iterated F-operator inequality, i.e., Q̃(x,u) ≤ (F^M Q̃)(x,u) for all x ∈ X, u ∈ U, and if ˆF(X × U) allows for affine combinations of the basis functions, then Q̃ is also an element of ˆF(X × U).

Proof: For notational convenience, let β ∈ R denote the constant downwards shift term, so that Q̃ = ˆQ + β with β = −((1 + γ^M)/(1 − γ^M)) ‖Q* − ˆQ‖_∞. Using the definition of the F-operator we see that for any function Q(x,u),

  (F(Q + β))(x,u) = l(x,u) + γ E[ min_{v∈U} ( Q(g(x,u,ξ), v) + β ) ] = (F Q)(x,u) + γ β,

where the equalities hold for all x ∈ X, u ∈ U. The first equality comes from the definition of the F-operator, and the second holds because β is an additive constant in the objective of the minimization. Iterating the same argument M times leads to

  (F^M (Q + β))(x,u) = (F^{M−1}(F(Q + β)))(x,u) = (F^{M−1}((F Q) + γ β))(x,u) = (F^{M−2}(F^2 Q + γ^2 β))(x,u) = ... = (F^M Q)(x,u) + γ^M β,  (18)

where the equalities hold point-wise for all x ∈ X, u ∈ U. Now we show that Q̃ satisfies the iterated F-operator inequality:

  (F^M Q̃)(x,u) = (F^M ˆQ)(x,u) + γ^M β
               ≥ ˆQ(x,u) − (1 + γ^M) ‖Q* − ˆQ‖_∞ − γ^M ((1 + γ^M)/(1 − γ^M)) ‖Q* − ˆQ‖_∞
               = ˆQ(x,u) − ((1 + γ^M)/(1 − γ^M)) ‖Q* − ˆQ‖_∞ = Q̃(x,u),

where the first equality comes from (18), the inequality is a direct application of Lemma 1.2 to the term (F^M ˆQ) and holds for all x ∈ X, u ∈ U, and the final equalities follow from (17). Finally, if ˆF(X × U) allows for affine combinations of the basis functions, then ˆQ ∈ ˆF(X × U) implies Q̃ ∈ ˆF(X × U), as the downwards shift term is an additive constant.

Now we have all the ingredients to prove Theorem 4.1.

Proof of Theorem 4.1: Given any approximate Q-function from the restricted function space, ˆQ(x,u) ∈ ˆF(X × U), Lemma 1.3 allows us to construct

  Q̃(x,u) = ˆQ(x,u) − ((1 + γ^M)/(1 − γ^M)) ‖Q* − ˆQ‖_∞ ∈ ˆF(X × U),

which is feasible for (11). It can be shown that maximizing ∫_{X×U} ˆQ(x,u) c(d(x,u)) subject to constraints that ensure ˆQ is an under-estimator of Q*, as in (11), is equivalent to minimizing ‖Q* − ˆQ‖_{1,c(x,u)}; see [17, Lemma 1] for an example proof. Thus, starting from the left-hand side of (16),

  ‖Q* − ˆQ*‖_{1,c(x,u)} ≤ ‖Q* − Q̃‖_{1,c(x,u)} ≤ ‖Q* − Q̃‖_∞ ≤ ‖Q* − ˆQ‖_∞ + ‖ˆQ − Q̃‖_∞
    = ‖Q* − ˆQ‖_∞ + ((1 + γ^M)/(1 − γ^M)) ‖Q* − ˆQ‖_∞ = (2/(1 − γ^M)) ‖Q* − ˆQ‖_∞,

where the first inequality holds because Q̃ is also feasible for (11), the second inequality holds by assuming, without loss of generality, that c(·,·) is a probability distribution, the third inequality is an application of the triangle inequality, the first equality stems directly from the definition of Q̃, and the final equality is an algebraic manipulation. As this argument holds for any ˆQ ∈ ˆF(X × U), the result follows.

Proof of Lemma 4.2: Starting from the definition of ˆQ_pwm we get

  ˆQ_j(x,u) ≤ ˆQ_pwm(x,u), ∀ j ∈ J
  ⟹ (F^M ˆQ_j)(x,u) ≤ (F^M ˆQ_pwm)(x,u), ∀ j ∈ J
  ⟺ max_{j∈J} { (F^M ˆQ_j)(x,u) } ≤ (F^M ˆQ_pwm)(x,u),

where the inequalities hold point-wise for all x ∈ X, u ∈ U. The first implication follows from the monotonicity property of the F-operator, and the equivalence holds because j appears only on the left-hand side of the inequality. The assumption of the lemma that the iterated F-operator inequality is satisfied for each j implies the following inequality:

  max_{j∈J} { ˆQ_j(x,u) } ≤ max_{k∈J} { (F^M ˆQ_k)(x,u) },

which holds point-wise for all x ∈ X, u ∈ U. Noting that the left-hand side of this inequality is the definition of ˆQ_pwm(x,u), the claim follows.

APPENDIX II
REFORMULATION OF THE F-OPERATOR INEQUALITY

This convex alternative applies to (11). A sufficient condition for ˆQ ∈ ˆF(X × U) to satisfy the iterated F-operator inequality,

  ˆQ(x,u) ≤ (F^M ˆQ)(x,u),  (19a)

is the existence of ˆQ_j ∈ ˆF(X × U) and ˆV_j ∈ ˆF(X), j = 1,...,M, with ˆQ_1 identified with ˆQ, such that

  ˆQ_j(x,u) ≤ l(x,u) + γ E[ ˆV_j(g(x,u,ξ)) ],  j = 1,...,M,
  ˆV_{j−1}(x) ≤ ˆQ_j(x,u),  j = 2,...,M,                     (19b)
  ˆV_M(x) ≤ ˆQ(x,u),

where all inequalities hold for all x ∈ X and u ∈ U. The definitions of ˆF(X) and ˆF(X × U) are given in Sections III-B and IV-A respectively. The reformulation (19b) is linear in the additional Value function and Q-function variables introduced. Hence (19b) is a tractable set of constraints, given that ˆF(X) and ˆF(X × U) were chosen to overcome difficulties (D1)–(D4). Note that (19a) implies (19b) only when infinite dimensional function spaces are used for the additional variables, meaning that it is intractable to use an equivalent reformulation of the F-operator inequality.

REFERENCES
[1] R. E. Bellman, "On the theory of dynamic programming," Proceedings of the National Academy of Sciences of the United States of America, vol. 38, no. 8, 1952.
[2] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. 1. Athena Scientific, Belmont, MA, 2005.
[3] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed. Wiley, 2011.
[4] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[5] P. J. Schweitzer and A. Seidmann, "Generalized polynomial approximations in Markovian decision processes," Journal of Mathematical Analysis and Applications, vol. 110, 1985.
[6] G. Tesauro, "Practical issues in temporal difference learning," Machine Learning, vol. 8, 1992.
[7] A. G. Barto and R. H. Crites, "Improving elevator performance using reinforcement learning," Advances in Neural Information Processing Systems, vol. 8, 1996.
[8] N. Kariotoglou, S. Summers, T. Summers, M. Kamgarpour, and J. Lygeros, "Approximate dynamic programming for stochastic reachability," in European Control Conference (ECC), 2013.
[9] A. Keshavarz and S. Boyd, "Quadratic approximate dynamic programming for input-affine systems," International Journal of Robust and Nonlinear Control, vol. 24, no. 3, 2012.
[10] T. Summers, K. Kunz, N. Kariotoglou, M. Kamgarpour, S. Summers, and J. Lygeros, "Approximate dynamic programming via sum of squares programming," in European Control Conference (ECC), 2013.
[11] Y. Wang, B. O'Donoghue, and S. Boyd, "Approximate dynamic programming via iterated Bellman inequalities," International Journal of Robust and Nonlinear Control, 2014.
[12] B. O'Donoghue, Y. Wang, and S. Boyd, "Min-max approximate dynamic programming," in IEEE International Symposium on Computer-Aided Control System Design (CACSD), 2011.
[13] J. B. Rawlings and D. Q. Mayne, Model Predictive Control: Theory and Design. Nob Hill Publishing, 2009.
[14] R. Cogill, M. Rotkowitz, B. Van Roy, and S. Lall, "An approximate dynamic programming approach to decentralized control of stochastic systems," in Control of Uncertain Systems: Modelling, Approximation, and Design. Springer, 2006.
[15] J. Casti, "The linear-quadratic control problem: some recent results and outstanding problems," SIAM Review, vol. 22, no. 4, 1980.
[16] O. Hernández-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria, vol. 30. Springer Science & Business Media, 2012.
[17] D. P. de Farias and B. Van Roy, "The linear programming approach to approximate dynamic programming," Operations Research, vol. 51, no. 6, 2003.
[18] W. B. Powell, "What you should know about approximate dynamic programming," Naval Research Logistics (NRL), vol. 56, no. 3, 2009.
[19] K. Krishnamoorthy, M. Pachter, S. Darbha, and P. Chandler, "Approximate dynamic programming with state aggregation applied to UAV perimeter patrol," International Journal of Robust and Nonlinear Control, vol. 21, no. 12, 2011.
[20] D. P. de Farias and B. Van Roy, "On constraint sampling in the linear programming approach to approximate dynamic programming," Mathematics of Operations Research, vol. 29, no. 3, 2004.
[21] T. Sutter, P. M. Esfahani, and J. Lygeros, "Approximation of constrained average cost Markov control processes," in IEEE 53rd Annual Conference on Decision and Control (CDC), 2014.
[22] J. B. Lasserre, "Global optimization with polynomials and the problem of moments," SIAM Journal on Optimization, vol. 11, no. 3, 2001.


More information

Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds

Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds Approximate Dynamic Programming By Minimizing Distributionally Robust Bounds Marek Petrik IBM T.J. Watson Research Center, Yorktown, NY, USA Abstract Approximate dynamic programming is a popular method

More information

Value and Policy Iteration

Value and Policy Iteration Chapter 7 Value and Policy Iteration 1 For infinite horizon problems, we need to replace our basic computational tool, the DP algorithm, which we used to compute the optimal cost and policy for finite

More information

arxiv: v1 [math.oc] 23 Oct 2017

arxiv: v1 [math.oc] 23 Oct 2017 Stability Analysis of Optimal Adaptive Control using Value Iteration Approximation Errors Ali Heydari arxiv:1710.08530v1 [math.oc] 23 Oct 2017 Abstract Adaptive optimal control using value iteration initiated

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

6.231 DYNAMIC PROGRAMMING LECTURE 6 LECTURE OUTLINE

6.231 DYNAMIC PROGRAMMING LECTURE 6 LECTURE OUTLINE 6.231 DYNAMIC PROGRAMMING LECTURE 6 LECTURE OUTLINE Review of Q-factors and Bellman equations for Q-factors VI and PI for Q-factors Q-learning - Combination of VI and sampling Q-learning and cost function

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

Structured Problems and Algorithms

Structured Problems and Algorithms Integer and quadratic optimization problems Dept. of Engg. and Comp. Sci., Univ. of Cal., Davis Aug. 13, 2010 Table of contents Outline 1 2 3 Benefits of Structured Problems Optimization problems may become

More information

Stabilization of constrained linear systems via smoothed truncated ellipsoids

Stabilization of constrained linear systems via smoothed truncated ellipsoids Preprints of the 8th IFAC World Congress Milano (Italy) August 28 - September 2, 2 Stabilization of constrained linear systems via smoothed truncated ellipsoids A. Balestrino, E. Crisostomi, S. Grammatico,

More information

In: Proc. BENELEARN-98, 8th Belgian-Dutch Conference on Machine Learning, pp 9-46, 998 Linear Quadratic Regulation using Reinforcement Learning Stephan ten Hagen? and Ben Krose Department of Mathematics,

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Biasing Approximate Dynamic Programming with a Lower Discount Factor

Biasing Approximate Dynamic Programming with a Lower Discount Factor Biasing Approximate Dynamic Programming with a Lower Discount Factor Marek Petrik, Bruno Scherrer To cite this version: Marek Petrik, Bruno Scherrer. Biasing Approximate Dynamic Programming with a Lower

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Optimal Decentralized Control of Coupled Subsystems With Control Sharing

Optimal Decentralized Control of Coupled Subsystems With Control Sharing IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 58, NO. 9, SEPTEMBER 2013 2377 Optimal Decentralized Control of Coupled Subsystems With Control Sharing Aditya Mahajan, Member, IEEE Abstract Subsystems that

More information

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 35, No., May 010, pp. 84 305 issn 0364-765X eissn 156-5471 10 350 084 informs doi 10.187/moor.1090.0440 010 INFORMS On the Power of Robust Solutions in Two-Stage

More information

Robustness of policies in Constrained Markov Decision Processes

Robustness of policies in Constrained Markov Decision Processes 1 Robustness of policies in Constrained Markov Decision Processes Alexander Zadorojniy and Adam Shwartz, Senior Member, IEEE Abstract We consider the optimization of finite-state, finite-action Markov

More information

Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes

Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes Andrew Mastin and Patrick Jaillet Abstract We analyze losses resulting from uncertain transition probabilities in Markov

More information

arxiv: v3 [math.oc] 25 Apr 2018

arxiv: v3 [math.oc] 25 Apr 2018 Problem-driven scenario generation: an analytical approach for stochastic programs with tail risk measure Jamie Fairbrother *, Amanda Turner *, and Stein W. Wallace ** * STOR-i Centre for Doctoral Training,

More information

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017 The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course Value Iteration: the Idea 1. Let V 0 be any vector in R N A. LAZARIC Reinforcement

More information

Optimal Stopping Problems

Optimal Stopping Problems 2.997 Decision Making in Large Scale Systems March 3 MIT, Spring 2004 Handout #9 Lecture Note 5 Optimal Stopping Problems In the last lecture, we have analyzed the behavior of T D(λ) for approximating

More information

OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS

OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS Xiaofei Fan-Orzechowski Department of Applied Mathematics and Statistics State University of New York at Stony Brook Stony

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Tube Model Predictive Control Using Homothety & Invariance

Tube Model Predictive Control Using Homothety & Invariance Tube Model Predictive Control Using Homothety & Invariance Saša V. Raković rakovic@control.ee.ethz.ch http://control.ee.ethz.ch/~srakovic Collaboration in parts with Mr. Mirko Fiacchini Automatic Control

More information

An iterative procedure for constructing subsolutions of discrete-time optimal control problems

An iterative procedure for constructing subsolutions of discrete-time optimal control problems An iterative procedure for constructing subsolutions of discrete-time optimal control problems Markus Fischer version of November, 2011 Abstract An iterative procedure for constructing subsolutions of

More information

Multi-Attribute Bayesian Optimization under Utility Uncertainty

Multi-Attribute Bayesian Optimization under Utility Uncertainty Multi-Attribute Bayesian Optimization under Utility Uncertainty Raul Astudillo Cornell University Ithaca, NY 14853 ra598@cornell.edu Peter I. Frazier Cornell University Ithaca, NY 14853 pf98@cornell.edu

More information

ASIGNIFICANT research effort has been devoted to the. Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach

ASIGNIFICANT research effort has been devoted to the. Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 42, NO 6, JUNE 1997 771 Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach Xiangbo Feng, Kenneth A Loparo, Senior Member, IEEE,

More information

Lecture 6: Conic Optimization September 8

Lecture 6: Conic Optimization September 8 IE 598: Big Data Optimization Fall 2016 Lecture 6: Conic Optimization September 8 Lecturer: Niao He Scriber: Juan Xu Overview In this lecture, we finish up our previous discussion on optimality conditions

More information

A Tighter Variant of Jensen s Lower Bound for Stochastic Programs and Separable Approximations to Recourse Functions

A Tighter Variant of Jensen s Lower Bound for Stochastic Programs and Separable Approximations to Recourse Functions A Tighter Variant of Jensen s Lower Bound for Stochastic Programs and Separable Approximations to Recourse Functions Huseyin Topaloglu School of Operations Research and Information Engineering, Cornell

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

MFM Practitioner Module: Risk & Asset Allocation. John Dodson. January 25, 2012

MFM Practitioner Module: Risk & Asset Allocation. John Dodson. January 25, 2012 MFM Practitioner Module: Risk & Asset Allocation January 25, 2012 Optimizing Allocations Once we have 1. chosen the markets and an investment horizon 2. modeled the markets 3. agreed on an objective with

More information

Mathematical Optimization Models and Applications

Mathematical Optimization Models and Applications Mathematical Optimization Models and Applications Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 1, 2.1-2,

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Theory in Model Predictive Control :" Constraint Satisfaction and Stability!

Theory in Model Predictive Control : Constraint Satisfaction and Stability! Theory in Model Predictive Control :" Constraint Satisfaction and Stability Colin Jones, Melanie Zeilinger Automatic Control Laboratory, EPFL Example: Cessna Citation Aircraft Linearized continuous-time

More information

Adaptive Nonlinear Model Predictive Control with Suboptimality and Stability Guarantees

Adaptive Nonlinear Model Predictive Control with Suboptimality and Stability Guarantees Adaptive Nonlinear Model Predictive Control with Suboptimality and Stability Guarantees Pontus Giselsson Department of Automatic Control LTH Lund University Box 118, SE-221 00 Lund, Sweden pontusg@control.lth.se

More information

SOME RESOURCE ALLOCATION PROBLEMS

SOME RESOURCE ALLOCATION PROBLEMS SOME RESOURCE ALLOCATION PROBLEMS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

More information

Average Reward Parameters

Average Reward Parameters Simulation-Based Optimization of Markov Reward Processes: Implementation Issues Peter Marbach 2 John N. Tsitsiklis 3 Abstract We consider discrete time, nite state space Markov reward processes which depend

More information

A Geometric Characterization of the Power of Finite Adaptability in Multistage Stochastic and Adaptive Optimization

A Geometric Characterization of the Power of Finite Adaptability in Multistage Stochastic and Adaptive Optimization MATHEMATICS OF OPERATIONS RESEARCH Vol. 36, No., February 20, pp. 24 54 issn 0364-765X eissn 526-547 360 0024 informs doi 0.287/moor.0.0482 20 INFORMS A Geometric Characterization of the Power of Finite

More information

Information Structures, the Witsenhausen Counterexample, and Communicating Using Actions

Information Structures, the Witsenhausen Counterexample, and Communicating Using Actions Information Structures, the Witsenhausen Counterexample, and Communicating Using Actions Pulkit Grover, Carnegie Mellon University Abstract The concept of information-structures in decentralized control

More information

On optimal quadratic Lyapunov functions for polynomial systems

On optimal quadratic Lyapunov functions for polynomial systems On optimal quadratic Lyapunov functions for polynomial systems G. Chesi 1,A.Tesi 2, A. Vicino 1 1 Dipartimento di Ingegneria dell Informazione, Università disiena Via Roma 56, 53100 Siena, Italy 2 Dipartimento

More information

Nonlinear L 2 -gain analysis via a cascade

Nonlinear L 2 -gain analysis via a cascade 9th IEEE Conference on Decision and Control December -7, Hilton Atlanta Hotel, Atlanta, GA, USA Nonlinear L -gain analysis via a cascade Peter M Dower, Huan Zhang and Christopher M Kellett Abstract A nonlinear

More information

Lecture 1. Stochastic Optimization: Introduction. January 8, 2018

Lecture 1. Stochastic Optimization: Introduction. January 8, 2018 Lecture 1 Stochastic Optimization: Introduction January 8, 2018 Optimization Concerned with mininmization/maximization of mathematical functions Often subject to constraints Euler (1707-1783): Nothing

More information