Optimal Scheduling for Reference Tracking or State Regulation using Reinforcement Learning


Optimal Scheduling for Reference Tracking or State Regulation using Reinforcement Learning

Ali Heydari

Abstract

The problem of optimal control of autonomous nonlinear switching systems with infinite-horizon cost functions, for the purpose of tracking a family of reference signals or regulating the states, is investigated. A reinforcement learning scheme is presented which learns the solution and provides the schedule between the modes in feedback form, without enforcing a mode sequence or a number of switchings. This is done through a value iteration based approach. The convergence of the iterative learning scheme to the optimal solution is proved. After answering different analytical questions about the solution, the learning algorithm is presented. Finally, numerical analyses are provided to evaluate the performance of the developed technique in practice.

I. INTRODUCTION

Optimal scheduling between different modes/subsystems in the control of switching systems is a challenging problem in the controls engineering discipline, and numerous research papers have emerged in the literature on it within the last decade, [] [4]. The reason for this attention and the vast effort spent on these problems is the fact that many real-world control problems can be classified as switching problems, including problems in mechanical and aerospace systems [5], [6], electronics [7], chemical processes [8], and bioengineering [2], [9]. Conventional optimal control methods generally fail to provide solutions to switching problems, because in switching problems the solution includes discrete decisions, namely, suitable switchings between the modes. One of the most common approaches to solving optimal switching problems is freezing the mode sequence, i.e., the order of active modes, as well as the number of switches, and optimizing only the switching times.
Note that once the mode sequence and the number of switches are fixed, the only unknowns are the switching times. Nonlinear programming is an approach followed by different researchers [] [7], in which the gradient of the cost function with respect to the switching instants is utilized to optimize the switching times under a pre-selected mode sequence and number of switchings. Ideas for admitting a free mode sequence were presented in [5] and [6]. In [5], a two-stage optimization algorithm was developed, in which one stage updates the switching times and the other modifies the mode sequence. In a recent study, another nonlinear programming based solution for the case of a free mode sequence was proposed in [20]. Nonlinear programming based methods generally lead to open-loop solutions for a given/fixed initial condition. Each time the initial condition is changed, another set of numerical calculations has to be conducted in order to find the new optimal switching times. The dependency of the solutions on the selected initial condition leads to a limitation: for example, in finding the optimal switching between gears of a manual transmission car in order to accelerate to a desired speed, a calculated solution will be valid for implementation only if the initial speed of the car is exactly the one for which the problem was numerically solved. Otherwise, the solution will not take the car to the desired speed. In [3], the validity of the results was extended to different initial conditions within a pre-selected set, by determining the switching parameter that minimizes the worst possible cost over all trajectories starting in the selected set of initial states. Discretization of the state space, in order to end up with a finite number of choices, is an approach followed in [8], where dynamic programming was used for solving the problem. Refs.
[2] and [22] investigated the use of (relaxed/approximate) dynamic programming for different problems including optimal switching. Genetic algorithms are yet another approach for finding a numerical solution for a given initial condition [9]. An optimization scheme was developed in [9] to find both the optimal mode sequence and the switching times for positive linear systems. The demonstrated potential of Reinforcement Learning (RL) and Approximate Dynamic Programming (ADP) in solving conventional optimal control problems, [23]–[37], motivated the author of this study to utilize ADP

Assistant Professor of Mechanical Engineering, South Dakota School of Mines and Technology, Rapid City, SD 57701, ali.heydari@sdsmt.edu.

for solving optimal switching problems in the past. The results were solutions to problems with a fixed switching sequence [38], a free switching sequence with autonomous subsystems [39], and a free switching sequence with controlled subsystems [40], along with applications of the developed ideas to multi-therapeutic treatment of HIV disease [2] and to aerospace vehicles [6]. All these developments, however, deal with fixed-final-time problems, i.e., problems with finite-horizon cost functions. Many real-world problems, on the other hand, have an infinite horizon, e.g., regulation of a system with on-off actuators. The motivation behind this work is providing a solution for such problems. In simultaneous but independent research, Refs. [3] and [4] proposed a different ADP based solution to switching problems. In that method, the number of functions that need to be learned at each training iteration grows exponentially with the number of iterations and soon becomes prohibitive. Moreover, in those developments the training is done for a single selected initial condition. Another investigation of solving switching problems using ADP was reported in [4]. The differences, compared with this study, are the approach itself and the point that the initial conditions are assumed to be known a priori in that study. Considering this background, the current study aims at extending the developments in [2], [6], [38]–[40], and particularly the solution proposed in [39], to problems with infinite-horizon cost functions. For the sake of generality, tracking a time-varying signal is selected, because once the reference signal is set to zero, the solution immediately reduces to regulation of the states. The main challenge in directly extending the results of [39] to infinite-horizon problems is the fact that the so-called value function, sometimes called the cost-to-go function, can easily be learned in a backward fashion in fixed-final-time problems.
But once the horizon is infinite, this is not possible: there is no final time to start from. An idea based on the ADP/RL approaches to conventional optimal control problems is using value iteration [23] to learn the desired function. Doing so raises multiple questions, including the convergence of the iterations, the optimality of the limit function of the sequence resulting from the iterations, and the continuity of the result, so that it can be approximated by neural networks (NNs). The analytical contribution of this work is developing novel and rigorous, yet straightforward and easy to understand, answers to these questions. The approach followed for the convergence analyses is motivated by [42] and is unlike the well-established ideas for convergence of ADP in conventional problems, including [2] and [27]. The former was adapted in [22] and the latter was adapted in different studies including [30], [32], [4], [43]. More specifically, the idea proposed in this study is establishing an analogy between the time-to-go in finite-horizon problems and the iteration index of value iteration in infinite-horizon problems. This idea is the key to the convergence, optimality, and continuity proofs presented here. Readers interested in convergence and continuity analyses of value iteration are referred to [22] for another approach, based on a transformation of continuous-time switching problems. Besides these theoretical analyses, another contribution of this work is the resulting controller, which approximates the optimal solution to tracking/regulation problems with infinite-horizon cost functions. The proposed controller provides solutions for different initial conditions without any need for retraining. Moreover, once the NN is trained, the result remains valid for tracking different reference signals that share the same dynamical model, for example, those generated using different initial conditions.
Another interesting feature of the proposed solution is that it calculates the solution in feedback form, in the sense that the solution is directly calculated based on the instantaneous state of the system and the reference signal. Finally, the proposed method does not assume a fixed mode sequence or a fixed number of switches. The solution, including the number of switchings, the order of the modes, and the switching times, is calculated such that the cost function is minimized. While the class of problems investigated in this study is different from the one in [4], the bases of the presented solutions can be compared.
a) Only one neural network needs to be trained and implemented in the method proposed in this study, while in the other method the required number of critic networks grows exponentially with the iterations.
b) The tracking problem is investigated in its general form in this study, while the solution proposed for tracking problems in [4] is limited to a certain type of tracking problems which are convertible to regulation type problems.
c) The method proposed in [4] is valid for a single and unique reference signal, while the method proposed in this study provides the solution for tracking a family of reference signals.
d) The method proposed in this work is valid for different initial conditions without any need for retraining, while the training algorithm in [4] is based on a selected initial condition, as shown in the training algorithm in [30], on which [4] is based.
The rest of this paper is organized as follows. The problem is formulated in Section II and the proposed solution is detailed in Section III. Section IV presents the convergence analyses and the answers to the analytical questions raised in Section III. Afterwards, Section V details the implementation of the proposed method. Section VI discusses the extension of the results to the case of optimal regulation of the states, and Section VII presents the numerical analyses and simulations. Finally, concluding remarks are given in Section VIII.

II. PROBLEM FORMULATION

The problem subject to this study is forcing the states of the system to track a given time-varying signal. The decision variable, however, is the active mode of the given switching system, which can be arbitrarily selected at each instant. More specifically, let the system subject to scheduling/control be given by M modes or subsystems with the known dynamics

x_{k+1} = f_i(x_k), ∀k ∈ N, i ∈ I,   (1)

where f_i : R^n → R^n is continuous for every i ∈ I := {1, 2, ..., M}, N denotes the set of non-negative integers, and the positive integer n is the dimension of the state vector x_k. Subscript k in x_k represents the discrete time index and subscript i in f_i(.) represents the respective mode/subsystem. Denoting the active mode at instant k by i_k ∈ I, a switching schedule identifies i_k for every k ∈ N. Once a switching schedule is selected, the system can operate from k = 0 to k = ∞. The problem is defined as finding a switching schedule that forces the states (or a combination of their elements) to track a reference signal r_k ∈ R^m (or a combination of its elements) with the known dynamics

r_{k+1} = F(r_k),   (2)

and given initial condition r_0 ∈ R^m, where F : R^m → R^m is a continuous function. This objective can be fulfilled by minimizing the cost function

J = Σ_{k=0}^{∞} Q(x_k, r_k),   (3)

where the convex, continuous, and positive (semi-)definite function Q : R^n × R^m → R_+ penalizes the state error with respect to the desired reference r_k. For example, Q(x_k, r_k) := ||x_k − r_k||^2, if m = n, represents the objective of x_k tracking r_k.
Another example could be having the square of the nth element of the state vector track the cube of the mth element of the reference, through Q(x_k, r_k) := |x_k(n)^2 − r_k(m)^3|^2, where the lth element of a vector y is denoted by y(l). In other words, Q(.,.) can be any non-negative convex and continuous function which returns zero only when the desired tracking is achieved. The set of non-negative reals is denoted by R_+.

Assumption 1. There exists at least one switching schedule for every given initial condition x_0 and r_0 in some selected compact sets using which cost function (3) is bounded.

Assumption 2. The dynamics of the subsystems are known.

Assumption 1 guarantees that the optimal solution exists and leads to a finite cost; otherwise, it would not be optimal compared with the assumed existing switching schedule. Assumption 2 clarifies the point that this study does not incorporate the case of unmodeled dynamics. Optimal scheduling in the presence of modeling uncertainties may be conducted through extension of the presented method to online learning.

III. PROPOSED SOLUTION

The idea behind the proposed solution is approximating the so-called value function, which outputs the cost-to-go (i.e., the cost incurred by evaluating Eq. (3) along the resulting trajectory) given the current state and the current reference signal, assuming optimal decisions are made from the current time to infinity. Denoting the value function by V* : R^n × R^m → R_+ and considering the selected cost function, i.e., Eq. (3), one has

V*(x_k, r_k) := Q(x_k, r_k) + Σ_{j=k+1}^{∞} Q(x*_j, r_j),   (4)

in which the optimal (future) states, denoted by x*_j, j ∈ {k+1, k+2, ...}, are calculated using dynamics (1) and the optimal decisions i*_j ∈ I, j ∈ {k, k+1, ...}. Eq. (4) can be written as the recursive equation

V*(x, r) = Q(x, r) + V*( f_{i*(x,r)}(x), F(r) ), ∀x ∈ R^n, ∀r ∈ R^m,   (5)
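As a concrete instance of formulation (1)-(3), the sketch below defines a two-mode scalar switching system, a geometrically decaying reference, and a quadratic tracking penalty, then evaluates a truncated version of cost (3) for a candidate schedule. All of the dynamics and numbers here are illustrative assumptions, not systems from this paper:

```python
# Hypothetical two-mode scalar example of (1)-(3): both modes are
# stable, but mode 2 flips the sign of the state; the reference decays
# geometrically. All of these choices are illustrative assumptions.
def f1(x): return 0.5 * x          # mode 1 dynamics, x_{k+1} = f_1(x_k)
def f2(x): return -0.8 * x         # mode 2 dynamics, x_{k+1} = f_2(x_k)
modes = [f1, f2]

def F(r): return 0.5 * r           # reference dynamics, r_{k+1} = F(r_k)

def Q(x, r): return (x - r) ** 2   # tracking penalty of cost (3)

def cost_of_schedule(x0, r0, schedule):
    """Truncated evaluation of cost (3) along the trajectory generated
    by a given switching schedule (a sequence of mode indices)."""
    x, r, J = x0, r0, 0.0
    for i in schedule:
        J += Q(x, r)
        x, r = modes[i](x), F(r)
    return J
```

For this toy system, starting from x_0 = r_0 = 1, holding mode 1 tracks the reference exactly (zero cost), while holding mode 2 does not; in general a good schedule mixes the modes depending on the sign of the tracking error, which is exactly the discrete decision the paper optimizes.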

where i*(x, r) denotes the optimal mode given the current x and r. By the Bellman principle of optimality [44], one has

V*(x, r) = min_{i∈I} ( Q(x, r) + V*( f_i(x), F(r) ) ) = Q(x, r) + min_{i∈I} V*( f_i(x), F(r) ), ∀x ∈ R^n, ∀r ∈ R^m.   (6)

Moreover, the optimal mode i* at each instant, which is also a function of the current x and r, is given by

i*(x, r) = argmin_{i∈I} V*( f_i(x), F(r) ), ∀x ∈ R^n, ∀r ∈ R^m.   (7)

In other words, i* at each instant is selected such that the cost-to-go at the next time step is smallest. The key to the solution of the problem is the fact that if the value function V*(.,.) is obtained versus its inputs, then one can find the optimal mode in feedback form in online operation, as seen in (7). Motivated by the developments in the RL and ADP literature for optimal control problems [2]–[32], [43], a reinforcement learning scheme is selected in this study for learning the desired function for all x ∈ Ω_x ⊂ R^n and r ∈ Ω_r ⊂ R^m. The domains Ω_x and Ω_r are selected to be closed and bounded, i.e., compact, representing the domains of interest for the respective variables. They need to be selected based on the physics of the problem and its operation envelope. The learning process starts with selecting an initial guess of V*(.,.), denoted by V^0(.,.), e.g., V^0(x, r) = 0, ∀x ∈ Ω_x, ∀r ∈ Ω_r. Afterwards, one updates the guess using

V^{j+1}(x, r) = Q(x, r) + min_{i∈I} V^j( f_i(x), F(r) ), ∀x ∈ Ω_x, ∀r ∈ Ω_r,   (8)

where superscript j denotes the iteration index. This selection leads to the standard value iteration approach of reinforcement learning [23] for solving conventional problems. However, considering the switching nature of the problem at hand, the following challenging questions arise.
1) Does iterative equation (8) converge as j → ∞, i.e., is the sequence {V^0(x, r), V^1(x, r), V^2(x, r), ...}, denoted by {V^j(x, r)}_{j=0}^{∞}, convergent, ∀x ∈ Ω_x, ∀r ∈ Ω_r?
2) Which initial guesses V^0(.,.) guarantee the convergence?
3) If the sequence is convergent, does it converge to the optimal solution, i.e., do we have

lim_{j→∞} V^j(x, r) = V*(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r?   (9)

4) Since {V^j(x, r)}_{j=0}^{∞} is a sequence of functions, is its convergence pointwise, for every given x and r, or uniform throughout the domains Ω_x and Ω_r, [45]?
5) Assuming the sequence converges to the optimal solution, is the limit function continuous, so that one can use NNs for approximating it?

Before proceeding to the next section, it should be noted that one eventually uses look-up tables or function approximators for approximating V^{j+1}(.,.), generated from Eq. (8). Since the exact reconstruction of the right-hand side of the equation is not possible in the general case, approximation errors will be introduced into the process. This study, however, assumes the function approximators are rich enough that the approximation errors are negligible.

IV. THEORETICAL ANALYSES

In this section the theoretical questions raised at the end of the previous section are investigated. The idea presented in this study for answering the questions is different from the standard approaches proposed in the RL and ADP literature for proving convergence of the respective iterative equations in optimal control problems, e.g., [2], [27]. The idea, motivated by [42], is establishing an analogy between the iterative learning scheme given by Eq. (8), which is proposed for solving infinite-horizon problems, and the solution to finite-horizon optimal control problems with fixed final time. The latter was investigated in [39] for the non-tracking case. Once this analogy is established, the answers to the questions follow in straightforward and easy-to-follow forms. Let the respective optimal tracking problem with a finite-horizon cost function be given by minimizing the cost function

J_N = ψ(x_N, r_N) + Σ_{k=0}^{N−1} Q(x_k, r_k),   (10)
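On a coarse grid, the value-iteration update (8) can be sketched in tabular form. The two-mode system, the grids, and the nearest-neighbor interpolation below are all illustrative assumptions introduced only to make the recursion concrete:

```python
import numpy as np

# Tabular sketch of value iteration (8) for an assumed two-mode scalar
# system; grids stand in for the compact domains Omega_x and Omega_r.
xs = np.linspace(-2.0, 2.0, 81)     # grid over Omega_x
rs = np.linspace(-1.0, 1.0, 41)     # grid over Omega_r
modes = [lambda x: 0.5 * x, lambda x: -0.8 * x]
F = lambda r: 0.5 * r
Q = lambda x, r: (x - r) ** 2

def nearest(grid, v):
    """Nearest-neighbor interpolation back onto the grid."""
    return int(np.abs(grid - v).argmin())

def value_iteration(n_iter):
    V = np.zeros((len(xs), len(rs)))   # V^0 = 0 satisfies 0 <= V^0 <= Q
    for _ in range(n_iter):
        Vnew = np.empty_like(V)
        for a, x in enumerate(xs):
            for b, r in enumerate(rs):
                nb = nearest(rs, F(r))
                # Eq. (8): V^{j+1}(x,r) = Q(x,r) + min_i V^j(f_i(x), F(r))
                Vnew[a, b] = Q(x, r) + min(
                    V[nearest(xs, f(x)), nb] for f in modes)
        V = Vnew
    return V
```

Because the backup operator is monotone and V^0 = 0 is below Q, the iterates produced by this sketch are pointwise non-decreasing, which is the behavior the questions above (and the analyses of Section IV) are concerned with.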

subject to system dynamics (1) and reference signal dynamics (2), where the convex, continuous, and positive (semi-)definite function ψ : R^n × R^m → R_+ penalizes the state error at the final time and Q(.,.) is the same as in (3). As seen, the only difference between this problem and the problem subject to this study is the fact that the horizon is fixed and finite. There is an important difference between infinite-horizon and finite-horizon problems: in infinite-horizon problems the objective is directing the states in certain directions without incorporating any time limitation, while in finite-horizon problems the time is limited, hence the objective should be fulfilled within a given time. For example, in the selected cost function given by Eq. (10), one may select a large ψ(.,.) compared with Q(.,.) to emphasize minimizing the tracking error at the final time over the tracking error during the horizon. Interested readers are referred to [39] for more details and several examples. Denoting the value function of the finite-horizon problem at time step k by V^{*,N−k}(.,.), cost function (10) leads to

V^{*,0}(x, r) = ψ(x, r), ∀x ∈ R^n, ∀r ∈ R^m,   (11)

and

V^{*,N−k}(x, r) = Q(x, r) + V^{*,N−(k+1)}( f_{i^{*,N−k}(x,r)}(x), F(r) ), ∀x ∈ R^n, ∀r ∈ R^m, ∀k ∈ K,   (12)

where i^{*,N−k}(x, r) denotes the optimal mode at time step k and K := {0, 1, 2, ..., N−1}. Note that in finite-horizon problems (with fixed final time N), the value function, and hence the solution, depend on the remaining time, or time-to-go, i.e., N − k. In other words, having the same x_k and r_k but a different time-to-go may lead to a different solution [44], [36]. The time dependencies of the value function and the optimal decision are incorporated by the superscript N − k in V^{*,N−k}(.,.) and i^{*,N−k}(.,.).
The Bellman principle of optimality [44] leads to the solution, that is, (11) along with

V^{*,N−k}(x, r) = Q(x, r) + min_{i∈I} V^{*,N−(k+1)}( f_i(x), F(r) ), ∀x ∈ R^n, ∀r ∈ R^m, ∀k ∈ K,   (13)

and

i^{*,N−k}(x, r) = argmin_{i∈I} V^{*,N−(k+1)}( f_i(x), F(r) ), ∀x ∈ R^n, ∀r ∈ R^m, ∀k ∈ K.   (14)

The important point in the finite-horizon problem is the fact that the final time is fixed and finite. Considering the value functions at different times-to-go as separate functions, one can start from the final time and form V^{*,0}(.,.) using (11). Afterwards, each V^{*,N−k}(.,.) can be found using (13) step by step from k = N−1 to k = 0, i.e., in a backward fashion. Ref. [39] presents the training algorithm and its analysis in detail for finite-horizon (non-tracking) problems. In infinite-horizon problems, however, this approach is not possible, as there is no final time to start from. Considering the finite-horizon problems, however, the following results can be obtained.

Lemma 1. If the continuous positive semi-definite initial guess in iterative relation (8) is given by V^0(.,.) and ψ(.,.) is selected as ψ(.,.) = V^0(.,.), then one has

V^j(x, r) = V^{*,j}(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r, ∀j ∈ N.   (15)

Proof: Since ψ(.,.) = V^0(.,.) is given, (15) holds for j = 0, considering (11). Assume that Eq. (15) holds for a given j ∈ N. Selecting N > j and k = N − (j + 1), Eq. (13) leads to

V^{*,j+1}(x, r) = Q(x, r) + min_{i∈I} V^{*,j}( f_i(x), F(r) ), ∀x ∈ R^n, ∀r ∈ R^m.   (16)

Comparing (16) with (8) and considering (15) for the given j leads to

V^{j+1}(x, r) = V^{*,j+1}(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r.   (17)

Therefore, Eq. (15) holds for all j ∈ N, by mathematical induction.

Lemma 1 presents an interesting result: the immature value function (with respect to the infinite-horizon problem at hand) subject to iteration at the jth iteration of (8) is exactly the (optimal) value function of a finite-horizon problem with a time-to-go of j. Considering this analogy between the iteration index and the time-to-go, the answer to Question 1 is within our reach.
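The backward recursion (11)-(13) and the mode selection (14) can be exercised numerically. Everything below (the two modes, the grids, the horizon N) is an illustrative assumption; note that, in line with Lemma 1, applying this backup j times starting from ψ = V^0 produces exactly the jth value-iteration iterate of (8):

```python
import numpy as np

# Backward finite-horizon DP of Eqs. (11)-(14) on a gridded toy problem.
xs = np.linspace(-2.0, 2.0, 81)
rs = np.linspace(-1.0, 1.0, 41)
modes = [lambda x: 0.5 * x, lambda x: -0.8 * x]
F = lambda r: 0.5 * r
Q = lambda x, r: (x - r) ** 2
near = lambda g, v: int(np.abs(g - v).argmin())

N = 10
# V[t] approximates the value function with time-to-go t; psi = 0 (Eq. 11).
V = [np.zeros((len(xs), len(rs)))]
for t in range(1, N + 1):
    Vt = np.empty_like(V[0])
    for a, x in enumerate(xs):
        for b, r in enumerate(rs):
            nb = near(rs, F(r))
            # Eq. (13): time-to-go t on the left, t-1 on the right.
            Vt[a, b] = Q(x, r) + min(V[t - 1][near(xs, f(x)), nb]
                                     for f in modes)
    V.append(Vt)

def dp_schedule_cost(x0, r0):
    """Simulate the system while choosing modes per Eq. (14)."""
    x, r, J = x0, r0, 0.0
    for k in range(N):
        t = N - k                    # remaining time-to-go
        i = min(range(2), key=lambda i: V[t - 1][
            near(xs, modes[i](x)), near(rs, F(r))])
        J += Q(x, r)
        x, r = modes[i](x), F(r)
    return J

def fixed_mode_cost(x0, r0, i):
    """Baseline: hold a single mode for the whole horizon."""
    x, r, J = x0, r0, 0.0
    for _ in range(N):
        J += Q(x, r)
        x, r = modes[i](x), F(r)
    return J
```

Starting from a state whose sign disagrees with the reference (e.g., x_0 = −1, r_0 = 1), the backward-DP schedule uses the sign-flipping mode once and then tracks, beating both single-mode schedules, which illustrates why the order of the modes must be free.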
The following lemma helps in answering the question.

Lemma 2. Let the value function of the finite-horizon problem of minimizing (10) subject to (1) and (2) with a time-to-go of j be given by V^{*,j}(x, r). If ψ(.,.) is selected such that

0 ≤ ψ(x, r) ≤ Q(x, r), ∀x ∈ R^n, ∀r ∈ R^m,   (18)

then the sequence {V^{*,j}(x, r)}_{j=0}^{∞} is convergent for every given x and r.

Proof: The first step is showing that {V^{*,j}(x, r)}_{j=0}^{∞} is a non-decreasing sequence for every given x and r. The proof is done by induction. Considering (11), (13) evaluated at k = N − 1, and (18), one has

V^{*,0}(x, r) ≤ V^{*,1}(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r,   (19)

because one of the non-negative terms forming V^{*,1}(x, r) is Q(x, r), and this term alone is greater than or equal to V^{*,0}(x, r) = ψ(x, r), per (18). Now, assume that for some j one has

V^{*,j−1}(x, r) ≤ V^{*,j}(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r.   (20)

Let us define Ṽ(.,.) as

Ṽ(x, r) := Q(x, r) + V^{*,j−1}( f_{i^{*,j+1}(x,r)}(x), F(r) ), ∀x ∈ Ω_x, ∀r ∈ Ω_r,   (21)

where

i^{*,j+1}(x, r) = argmin_{i∈I} V^{*,j}( f_i(x), F(r) ),   (22)

per (14). Comparing (21) with (13), where the latter is evaluated at k = N − j, one has

V^{*,j}(x, r) ≤ Ṽ(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r,   (23)

because V^{*,j}(.,.) is the result of the minimization of the right-hand side of (13), while Ṽ(.,.) uses the possibly suboptimal mode i^{*,j+1}(x, r). Moreover, evaluating (13) at k = N − (j + 1) and considering (14), one has

V^{*,j+1}(x, r) = Q(x, r) + V^{*,j}( f_{i^{*,j+1}(x,r)}(x), F(r) ), ∀x ∈ Ω_x, ∀r ∈ Ω_r.   (24)

Comparing (24) with (21) and using (20), one has

Ṽ(x, r) ≤ V^{*,j+1}(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r.   (25)

Finally, inequalities (23) and (25) lead to

V^{*,j}(x, r) ≤ V^{*,j+1}(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r,   (26)

which, together with (19) and (20), proves the pointwise non-decreasing feature of {V^{*,j}(x, r)}_{j=0}^{∞}. On the other hand, there exists some switching schedule using which cost function (3), and hence finite-horizon cost function (10) as N → ∞, is bounded, per Assumption 1. The existence of such a switching schedule leads to the upper boundedness of lim_{j→∞} V^{*,j}(x, r), because otherwise the switching schedule utilized in generating V^{*,j}(x, r) would not be optimal compared with the existing switching schedule.
Finally, the upper boundedness of {V^{*,j}(x, r)}_{j=0}^{∞} and its non-decreasing feature lead to its convergence, [46].

Theorem 1. Iterative relation (8) converges to the optimal solution of the infinite-horizon optimal control problem of minimizing cost function (3) subject to (1) and (2), i.e.,

lim_{j→∞} V^j(x, r) = V*(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r,   (27)

if the initial guess V^0(.,.) is a continuous function such that 0 ≤ V^0(x, r) ≤ Q(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r.

Proof: Considering the analogy between the iterations of (8) and the solution to a finite-horizon problem with ψ(.,.) = V^0(.,.), as shown in Lemma 1, and the convergence result given in Lemma 2, the sequence {V^j(x, r)}_{j=0}^{∞} converges. Denoting the limit by V^∞(x, r), what remains to show is V^∞(x, r) = V*(x, r). Note that V^∞(x, r) is the value function corresponding to cost function (10) as N → ∞, while V*(x, r) corresponds to cost function (3). Considering Assumption 1, one has

lim_{k→∞} Q(x_k, r_k) = 0, ∀x_0 ∈ Ω_x, ∀r_0 ∈ Ω_r,   (28)

once the optimal modes are selected during the horizon, because otherwise the cost function becomes unbounded, [46]. Eq. (28) leads to lim_{N→∞} J_N = J, by their definitions given in (3) and (10), considering 0 ≤ V^0(x, r) = ψ(x, r) ≤ Q(x, r). Therefore, V^∞(x, r) = V*(x, r), since otherwise the smaller of the two values would be both the optimal solution to the infinite-horizon optimal control problem and the least upper bound of the sequence {V^j(x, r)}_{j=0}^{∞}.

Theorem 1 answers Questions 1, 2, and 3, raised in the previous section. The answer to Question 4, however, is particularly important, because it helps investigate certain features of the limit function of the sequence, which is the desired value function, including its continuity, asked about in Question 5. Note that NNs with continuous neurons are proved to provide uniform approximation if the function subject to approximation is continuous, [47], [48]. Lemma 3, which is based on a lemma developed in [39], proves the continuity of the value functions of the respective finite-horizon problems. Afterwards, Lemma 4, based on an idea developed in [2] and adapted in [22] for a similar purpose, is presented, which answers Question 4. Then we proceed to Theorem 2, which answers Question 5 using Lemmas 3 and 4.

Lemma 3. If the functions F(.), ψ(.,.), Q(.,.), and f_i(.), ∀i, are continuous with respect to their inputs, then the finite-horizon value functions defined by (11) and (13) are continuous in the inputs x and r.

Proof: The proof is done by induction. Starting from V^{*,0}(.,.), it is continuous because of (11) and the fact that ψ(.,.) is a continuous function of its inputs. Now assume that V^{*,j}(.,.) is continuous; if it can be shown that this assumption leads to V^{*,j+1}(.,.) being continuous, the proof is complete. Note that, due to the switching between different i's as x and r change, this continuity is not obvious from Eq. (13). Considering (16), which is Eq.
(13) written in terms of j instead of N − k, the continuity problem can be rephrased as follows. If the function V̂ : R^n × R^m × I → R_+ is defined as

V̂(x, r, i) := Q(x, r) + V^{*,j}( f_i(x), F(r) ),   (29)

and the piecewise constant function i* : R^n × R^m → I is given by

i*(x, r) = argmin_{i∈I} V̂(x, r, i) = argmin_{i∈I} V^{*,j}( f_i(x), F(r) ),   (30)

where V^{*,j}(.,.) is continuous in its inputs, then prove that the function V̂(.,., i*(.,.)) is continuous in x and r at every x ∈ R^n and r ∈ R^m. Note that V̂(x, r, i*(x, r)) = V^{*,j+1}(x, r), ∀x and ∀r. Therefore, the proof of continuity of V̂(.,., i*(.,.)) completes the proof of the lemma. Let x̄ be any selected point in R^n and, for any given r ∈ R^m, set

ī = i*(x̄, r).   (31)

Select an open set α ⊂ R^n such that x̄ belongs to the boundary of α and the limit

î = lim_{||x−x̄||→0, x∈α} i*(x, r)   (32)

exists, where ||.|| denotes the vector norm. If ī = î for every such α, then there exists some open set β ⊂ R^n containing x̄ such that i*(x, r) is constant for all x ∈ β, because i*(x, r) only assumes integer values. In this case the continuity of V̂(., r, i*(., r)) at x = x̄ follows, by composition, from the fact that V̂(., r, i) is continuous at x = x̄ for every fixed i ∈ I and given r, since Q(., r), f_i(.), and V^{*,j}(., r) are continuous functions. Finally, the continuity of the function subject to investigation at every x̄ ∈ R^n leads to the continuity of the function in R^n. Now assume ī ≠ î for some α. From the continuity of V̂(., r, î) for the given r and î, one has

V̂(x̄, r, î) = lim_{δx→0} V̂(x̄ + δx, r, î).   (33)

If it can be shown that, for every selected α, one has

V̂(x̄, r, ī) = V̂(x̄, r, î),   (34)

then the continuity of V̂(., r, i*(., r)) in x follows, because from (33) and (34) one has

V̂(x̄, r, ī) = lim_{δx→0} V̂(x̄ + δx, r, î),   (35)

and (35) leads to the continuity by definition, [45]. The proof that (34) holds is done by contradiction. First, assume that for some x̄ and some α one has

V̂(x̄, r, ī) < V̂(x̄, r, î);   (36)

then, due to the continuity of both sides of (36) at x̄ for the fixed r, ī, and î, there exists an open set γ containing x̄ such that

V̂(x, r, ī) < V̂(x, r, î), ∀x ∈ γ.   (37)

Inequality (37) implies that at points close enough to x̄ one has i*(x, r) ≠ î. But this contradicts Eq. (32), which implies that there always exists a point x arbitrarily close to x̄ at which i*(x, r) = î. Therefore, inequality (36) cannot hold. Now assume that

V̂(x̄, r, ī) > V̂(x̄, r, î).   (38)

Inequality (38) leads to i*(x̄, r) ≠ ī. But this is against (31), hence (38) also cannot hold. Therefore, (34) holds, and hence V̂(., r, i*(., r)) is continuous at every x̄ ∈ R^n for every fixed r. Repeating the entire process with a fixed x and varying r, the continuity of the function with respect to r can be proved similarly. This completes the induction and the proof of the lemma.

Lemma 4. If there exists a constant c such that V*(x, r) ≤ cQ(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r, then, selecting ψ(.,.) = 0, the sequence of finite-horizon value functions converges uniformly to the optimal value function of the respective infinite-horizon problem on the compact sets Ω_x and Ω_r.

Proof: The proof is based on an idea developed in [2] and utilized in [22], by showing that

V^{*,k}(x, r) ≥ ( 1 − (1 + c^{−1})^{−k} ) V*(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r, ∀k ∈ N.   (39)

Considering V^{*,0}(.,.) = 0, Eq. (39) holds for k = 0. Assume it holds for some k. Then

V^{*,k+1}(x, r) = min_{i∈I} ( Q(x, r) + V^{*,k}( f_i(x), F(r) ) )
≥ min_{i∈I} ( Q(x, r) + ( 1 − (1 + c^{−1})^{−k} ) V*( f_i(x), F(r) ) )
= min_{i∈I} ( ( 1 − (1 + c^{−1})^{−(k+1)} ) ( Q(x, r) + V*( f_i(x), F(r) ) ) + (1 + c^{−1})^{−(k+1)} ( Q(x, r) − c^{−1} V*( f_i(x), F(r) ) ) )
≥ ( 1 − (1 + c^{−1})^{−(k+1)} ) min_{i∈I} ( Q(x, r) + V*( f_i(x), F(r) ) )
= ( 1 − (1 + c^{−1})^{−(k+1)} ) V*(x, r), ∀x ∈ R^n, ∀r ∈ R^m,

where the first inequality uses the induction hypothesis, the equality in the third line is an algebraic regrouping, and the last inequality follows from dropping the second parenthesized term, which is non-negative per the assumed bound relating V*(.,.) and Q(.,.). Therefore, inequality (39) holds for all k.
On the other hand, by Lemmas 1 and 2 and Theorem 1, the non-decreasing feature of {V^{*,k}(x, r)}_{k=0}^{∞} and its convergence to V*(x, r) imply that each V^{*,k}(x, r) is upper bounded by V*(x, r) for any given x and r. Utilizing this upper bound and the lower bound given in (39), one has

0 ≤ V*(x, r) − V^{*,k}(x, r)   (40)
≤ (1 + c^{−1})^{−k} V*(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r, ∀k ∈ N.   (41)

Replacing V*(x, r) on the right-hand side with V̄ := sup_{x∈Ω_x, r∈Ω_r} V*(x, r), which is a bounded constant per Assumption 1, the foregoing inequality leads to the uniform convergence of the sequence of finite-horizon value functions to the respective infinite-horizon optimal value function as the horizon extends to infinity, [45].

Theorem 2. If there exists a constant c such that V*(x, r) ≤ cQ(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r, and the functions F(.), Q(.,.), and f_i(.), ∀i, are continuous with respect to their inputs, then the value function of the infinite-horizon optimal control problem, V*(.,.), is continuous with respect to both of its inputs.

Proof: As seen in Lemma 2 and Theorem 1, selecting, for example, ψ(.,.) = 0, the sequence of finite-horizon value functions {V^{*,j}(.,.)}_{j=0}^{∞} converges to the infinite-horizon value function V*(.,.). Moreover, Lemma 3 shows that the elements of the sequence of finite-horizon value functions are continuous with respect to both inputs. Since the convergence of the finite-horizon value functions is uniform (Lemma 4), continuity is preserved in the limit, i.e., the limit function, which is the infinite-horizon value function, is also continuous with respect to both inputs, [45].

V. IMPLEMENTATION OF THE PROPOSED SOLUTION

A. Offline Learning Process

For the implementation of the proposed method, one can use NNs as global function approximators. Selecting linear-in-weight NNs, the function is approximated within the compact sets Ω_x ⊂ R^n and Ω_r ⊂ R^m using

W^T φ(x_k, r_k) ≈ V*(x_k, r_k), ∀x_k ∈ Ω_x, ∀r_k ∈ Ω_r,   (42)

where the selected smooth basis functions are given by φ : R^n × R^m → R^l, with l being a positive integer denoting the number of neurons. The unknown weight vector W ∈ R^l is to be found using learning algorithms. Note that the inputs of the basis functions correspond to the dependency of the function subject to approximation on the current state and reference signal values. Once the NN structure is selected, the next step is developing the learning algorithm. Denoting the NN weight vector at the jth iteration by W^j, the function (W^j)^T φ(.,.) is supposed to approximate V^j(.,.). The learning starts by selecting an initial guess W^0. Afterwards, one needs to update the weights through Eq. (8). Rewriting Eq. (8) in terms of the NN leads to

(W^{j+1})^T φ(x, r) = Q(x, r) + min_{i∈I} (W^j)^T φ( f_i(x), F(r) ), ∀x ∈ Ω_x, ∀r ∈ Ω_r,   (43)

hence W^{j+1} is calculated based on W^j using Eq. (43) until the weights converge. This learning process can be conducted either in a batch or in a sequential form, as detailed in Algorithms 1 and 2, respectively.

Algorithm 1 - Batch Learning
Step 1: Randomly select p different x^[q] ∈ Ω_x and r^[q] ∈ Ω_r, q ∈ {1, 2, ..., p}, for p being a large positive integer, where Ω_x ⊂ R^n and Ω_r ⊂ R^m represent the domains of interest.
Step 2: Select an initial guess W^0 ∈ R^l, e.g., W^0 = 0.
Step 3: Set j = 0.
Step 4: Find W^{j+1} such that
(W^{j+1})^T φ(x^[q], r^[q]) = Q(x^[q], r^[q]) + min_{i∈I} (W^j)^T φ( f_i(x^[q]), F(r^[q]) ), ∀q ∈ {1, 2, ..., p}.   (44)
Step 5: If ||W^{j+1} − W^j|| ≤ β, where β is a small positive real number selected as the tolerance, then proceed to Step 6; otherwise, set j = j + 1 and go back to Step 4.
Step 6: Set W = W^{j+1} and stop the training.

Algorithm 2 - Sequential Learning
Step 1: Select an initial guess W^0 ∈ R^l, e.g., W^0 = 0.
Step 2: Set j = 0.
Step 3: Randomly select x ∈ Ω_x and r ∈ Ω_r, where Ω_x ⊂ R^n and Ω_r ⊂ R^m represent the domains of interest.
Step 4: Train weight W^{j+1} of neural network W^{(j+1)T} ϕ(·,·) using inputs x and r and target Q(x, r) + min_{i∈I} W^{jT} ϕ(f_i(x), F(r)).
Step 5: If ‖W^{j+1} − W^j‖ ≤ β for several consecutive runs of Steps 3 and 4, where β is a small positive real number selected as the tolerance, then proceed to Step 6. Otherwise, set j = j + 1 and go back to Step 3.
Step 6: Set W = W^{j+1} and stop the training.

If Algorithm 1 is selected, one can use the method of least squares for solving Eq. (44) in one shot and updating the weight matrix. Interested readers are referred to [39] for details on forming the least-squares problem. Another option for updating the weights in both algorithms is using gradient-descent-based training laws. It should be noted that at each iteration of the learning algorithm, only one set of weights is stored to be used in the next iteration. Also, the set I, among whose elements the minimization in Eq. (43) is carried out, has a constant number of elements. These points lead to a lower storage and computational load compared with the schemes proposed in [13], [14], where the number of weight matrices and the number of elements in the respective set in the minimization grow exponentially with the iteration index.
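As a concrete illustration of Algorithm 1, the batch least-squares value iteration can be sketched on a hypothetical scalar switching system. The two modes, the reference dynamics, the cost, the degree-2 basis, and all numerical choices below are illustrative assumptions, not the paper's benchmark:

```python
import numpy as np

# Hypothetical two-mode scalar plant and a decaying reference (assumptions).
modes = [lambda x: 0.9 * x, lambda x: 0.5 * x + 0.1]   # f_i(x), i in I
F = lambda r: 0.95 * r                                 # reference dynamics F(r)
Q = lambda x, r: (x - r) ** 2                          # stage cost Q(x, r)

def phi(x, r):
    # smooth basis functions phi(x, r): monomials up to total degree 2
    return np.array([1.0, x, r, x * x, x * r, r * r])

rng = np.random.default_rng(0)
p = 500                                    # Step 1: p random sample pairs
xs = rng.uniform(0.0, 1.0, p)
rs = rng.uniform(0.0, 1.0, p)
Phi = np.array([phi(x, r) for x, r in zip(xs, rs)])

W = np.zeros(6)                            # Step 2: initial guess W^0 = 0
for j in range(100):                       # Steps 3-5: value-iteration sweeps
    # target of Eq. (44): Q(x, r) + min_i W^j . phi(f_i(x), F(r))
    targets = np.array([Q(x, r) + min(W @ phi(f(x), F(r)) for f in modes)
                        for x, r in zip(xs, rs)])
    W_new = np.linalg.lstsq(Phi, targets, rcond=None)[0]  # one-shot least squares
    if np.linalg.norm(W_new - W) <= 1e-9:  # Step 5: tolerance check (beta)
        W = W_new
        break
    W = W_new
```

After the loop, W plays the role of the converged weight matrix handed to the online switching law; only the current W is stored between iterations, matching the storage remark above.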

Finally, before concluding this section, it should be noted that the selection of the linear-in-weight form for the NN, as done in (42), is not required for the theory developed in this study to be valid. One can utilize multi-layer perceptrons for improving the approximation capability of the NN. In this case, Eq. (42) changes to

N(W, x, r) ≈ V_∞(x, r), ∀x ∈ Ω_x, ∀r ∈ Ω_r, (45)

where function N : R^l × R^n × R^m → R denotes the NN mapping, with the first argument being the tunable weights of the NN with l elements, and the next two arguments being its inputs.

B. Online Control

Once the NN weight matrix is learned through Algorithm 1 or 2 in offline training, the resulting final weights can be used for online control/switching of the system. This is done in real time by feeding the current x and r to the following equation, which calculates i*(x, r) in a feedback form:

i*(x, r) = argmin_{i∈I} W^T ϕ(f_i(x), F(r)), ∀x ∈ Ω_x, ∀r ∈ Ω_r. (46)

Note that Eq. (46) is the same as Eq. (7), except that it is rephrased in terms of the NN approximation of the value function. Since I is a discrete set with a finite number of elements, the minimization in Eq. (46) can be carried out easily in real time. As a matter of fact, the computational burden is as low as evaluating M scalar-valued functions and selecting the i corresponding to the least value.

Finally, it should be noted that the NN produces an approximation of the optimal solution as long as x and r are within the domains using which the NN is trained. These domains need to be selected carefully to cover the entire operating envelope of the specific problem at hand. The validity of the results within those domains leads to an interesting characteristic of the proposed solution: it provides solutions for different initial conditions x_0 and r_0, as long as the resulting trajectory stays within the domains.
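The feedback law of Eq. (46) reduces online to M scalar evaluations and an argmin. A minimal sketch, reusing the same hypothetical basis and dynamics as before and a stand-in weight vector (all assumptions purely to make the example self-contained; in practice W comes from Algorithm 1 or 2):

```python
import numpy as np

def phi(x, r):
    # same degree-2 monomial basis as in the offline sketch (an assumption)
    return np.array([1.0, x, r, x * x, x * r, r * r])

modes = [lambda x: 0.9 * x, lambda x: 0.5 * x + 0.1]   # f_i(x), i in I
F = lambda r: 0.95 * r                                 # reference dynamics F(r)

# Stand-in "trained" weights encoding W.phi(x, r) = 2(x - r)^2, for
# illustration only; a real W is produced by the offline learning phase.
W = np.array([0.0, 0.0, 0.0, 2.0, -4.0, 2.0])

def i_star(x, r):
    # Eq. (46): evaluate W.phi at each mode's successor state and pick the
    # minimizer -- M scalar-valued evaluations, cheap enough for real time
    costs = [W @ phi(f(x), F(r)) for f in modes]
    return int(np.argmin(costs))
```

For these stand-in weights, `i_star(1.0, 1.0)` selects the first mode (its successor stays closest to the propagated reference), while `i_star(0.8, 0.2)` selects the second; the whole decision is a handful of dot products per time step.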
Therefore, no retraining is needed each time the initial conditions of the system, or of the reference signal, change, and the same trained NN can be used for optimal control/switching of the system.

VI. EXTENSION OF THE RESULTS TO REGULATION

The proposed method solves optimal regulation problems as well, i.e., minimizing

J = Σ_{k=0}^∞ Q(x_k), (47)

subject to dynamics (1), where Q : R^n → R_+. This is because regulation of the states is a particular case of tracking, in which the reference signal is zero. It can be seen that in regulation problems the value function is only a function of x, i.e., V_∞(x). Therefore, the NN given by

W^T ϕ(x_k) ≈ V_∞(x_k), ∀x_k ∈ Ω_x, (48)

is suitable for approximating the solution, and there is no need to feed the zero reference signal to the network. The rest of the process is the same as discussed for the tracking problem. As for the theoretical analyses, since regulation is a particular case of tracking, all the obtained results are valid for regulation as well.

VII. NUMERICAL ANALYSES

A nonlinear second-order system with three modes, simulated in [5] and [39], is selected. The source codes for the simulations are available at [49]. The objective of this problem is controlling the fluid level in a two-tank setup. The fluid flow into the upper tank can be adjusted through a valve which has three positions: fully open, half open, and fully closed. Each tank leaks fluid at a rate proportional to the square root of the height of the fluid in the respective tank. The upper tank leaks into the lower tank, and the lower tank leaks to the outside of the setup. Representing the fluid height in the upper tank with scalar y and in the lower tank with scalar z, the dynamics of the state vector x = [y, z]^T are given by the following three modes, corresponding to the three positions of the valve:

ẋ = f_1(x) := [1 − √y; √y − √z],  ẋ = f_2(x) := [0.5 − √y; √y − √z],  ẋ = f_3(x) := [−√y; √y − √z]. (49)
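The three tank modes just described can be simulated with forward Euler. In the sketch below, the inflow constants 1 (fully open), 0.5 (half open), and 0 (closed) are assumed values consistent with the valve description, and the 0.05 s step matches the sampling time used later for discretization; heights are clipped at zero so the square-root leak terms stay real:

```python
import math

DT = 0.05  # sampling time used for the forward Euler discretization

def step(x, i):
    """One Euler step of mode i: 1 = fully open, 2 = half open, 3 = closed."""
    y, z = x
    inflow = {1: 1.0, 2: 0.5, 3: 0.0}[i]          # assumed inflow rates
    sy = math.sqrt(max(y, 0.0))                    # leak rate ~ sqrt of height
    sz = math.sqrt(max(z, 0.0))
    return (max(y + DT * (inflow - sy), 0.0),      # upper tank height
            max(z + DT * (sy - sz), 0.0))          # lower tank height

# e.g. hold the valve fully open from empty tanks for 10 s (200 steps):
x = (0.0, 0.0)
for _ in range(200):
    x = step(x, 1)
```

Under mode 1 both heights rise monotonically toward the equilibrium y = z = 1 without overshooting it, which is consistent with the training domains covering [0, 1); modes 2 and 3 correspondingly settle lower or drain the tanks.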

The selected objective is forcing the fluid level in the lower tank, i.e., z, to track the reference signal r(t) ∈ R with the dynamics

ṙ(t) = −r^3(t). (50)

Since the problem is in continuous time, a sampling time of 0.05 s was used for discretizing the problem using forward Euler integration. Then, cost function (3) was selected for evaluating the performance of the method with Q(x, r) = 10(z − r)^2. The basis functions for this example were selected as polynomials y^{n_1} z^{n_2} r^{n_3}, where the non-negative integers n_i, i = 1, 2, 3, are such that 0 ≤ n_1 + n_2 + n_3 ≤ 4. This selection led to 75 neurons. Domains Ω_x = {[y, z]^T ∈ R^2 : 0 ≤ y, z < 1} and Ω_r = {r ∈ R : 0 ≤ r < 1} were used for the training.

The batch training scheme was conducted using least squares [39], such that p = 2000 random states were selected in implementing Algorithm 1. It was observed that the training converged after almost 60 iterations, as seen in Fig. 1, which shows the evolution of the weight elements during the training iterations. The training process took almost 60 seconds on a desktop computer with an Intel Core i7-3770, 3.40 GHz processor and 8 GB of memory, running Windows 7 and MATLAB 2013 (single threading).

Once the network was trained, initial conditions x_0 = [0, 1]^T and r_0 = 1 were used to simulate the problem. The results are given in Fig. 2. As seen in the figure, the method successfully controlled the fluid level of the lower tank to track the desired reference signal.

Next, the capability of the neurocontroller in handling different initial conditions within Ω_x is investigated. Note that if the initial state x_0 is within the selected Ω_x, the state trajectory stays within Ω_x regardless of the applied switching schedule, due to the dynamics of the system. A new initial condition, namely x_0 = [1, 1]^T, is utilized for the next simulation, and the trained network (without retraining) is used for controlling it. The results, given in Fig.
3, show the capability of the controller in handling different initial states without any need for retraining.

Finally, a new initial condition for the reference signal is selected, namely r_0 = 0.2. The dynamics of the reference signal are also such that whenever the initial condition is within Ω_r, the whole trajectory stays in Ω_r. Assuming the initial state x_0 = [0, 0]^T, the NN is used for tracking the new reference signal generated through the new r_0. The results, presented in Fig. 4, show that the controller has been successful in this scenario as well. In other words, the same trained NN can be used for tracking a family of reference signals which share the same dynamics but are generated using different initial conditions. It should be noted, however, that as seen in Eq. (8), the network is trained based on the assumed F(·). Therefore, even though the current reference signal value is fed to the network, it will only provide an approximate optimal tracking solution if the fed reference signal has the dynamics modeled by F(·), as given in (2). Otherwise, the results will not be reliable.

Fig. 1. Evolution of the NN weights during the training/learning process.

VIII. CONCLUSIONS

A value iteration based scheme was presented for infinite-horizon optimal tracking/regulation of nonlinear switching systems. The iterative nature of the solution, along with the need for using a function approximator for learning the input-output mapping, led to several fundamental questions, including the convergence of the scheme. The raised questions were addressed analytically and rigorous answers were obtained. After providing the training algorithms and the process for online control, the performance was evaluated on a benchmark nonlinear switching system. It was shown that the controller provides an approximate optimal solution for different initial conditions and different reference signals, as long as certain conditions hold.
The low real-time computational burden of the proposed method makes it attractive for implementation in embedded systems for different real-world problems.

Fig. 2. Simulation result for x_0 = [0, 1]^T and r_0 = 1.

Fig. 3. Simulation result for x_0 = [1, 1]^T and r_0 = 1.

Fig. 4. Simulation result for x_0 = [0, 0]^T and r_0 = 0.2.

REFERENCES

[1] X. Xu and P. J. Antsaklis, Optimal control of switched systems via non-linear optimization based on direct differentiations of value functions, International Journal of Control, vol. 75, no. 16-17, pp. 1406-1426, 2002.
[2] X. Xu and P. Antsaklis, Optimal control of switched systems based on parameterization of the switching instants, IEEE Transactions on Automatic Control, vol. 49, pp. 2-16, Jan. 2004.
[3] H. Axelsson, M. Boccadoro, M. Egerstedt, P. Valigi, and Y. Wardi, Optimal mode-switching for hybrid systems with varying initial states, Nonlinear Analysis: Hybrid Systems, vol. 2, no. 3, 2008.
[4] X. Ding, A. Schild, M. Egerstedt, and J. Lunze, Real-time optimal feedback control of switched autonomous systems, IFAC Proceedings Volumes (IFAC-PapersOnline), pp. 108-113, 2009.
[5] H. Axelsson, M. Egerstedt, Y. Wardi, and G. Vachtsevanos, Algorithm for switching-time optimization in hybrid dynamical systems, in Proceedings of the IEEE International Symposium on Intelligent Control, June 2005.
[6] Y. Wardi and M. Egerstedt, Algorithm for optimal mode scheduling in switched systems, in Proceedings of the American Control Conference, 2012.
[7] M. Kamgarpour and C. Tomlin, On optimal control of non-autonomous switched systems with a fixed mode sequence, Automatica, vol. 48, no. 6, pp. 1177-1181, 2012.
[8] M. Rungger and O. Stursberg, A numerical method for hybrid optimal control based on dynamic programming, Nonlinear Analysis: Hybrid Systems, vol. 5, no. 2, 2011.
[9] M. Sakly, A. Sakly, N. Majdoub, and M. Benrejeb, Optimization of switching instants for optimal control of linear switched systems based on genetic algorithms, in IFAC Proceedings Volumes (IFAC-PapersOnline), vol. 2.
[10] C.-H. Lien, K.-W. Yu, H.-C. Chang, L.-Y. Chung, and J.-D. Chen, Switching signal design for exponential stability of discrete switched systems with interval time-varying delay, Journal of the Franklin Institute, vol. 349, no. 6, 2012.
[11] S. Zhai and X.-S.
Yang, Exponential stability of time-delay feedback switched systems in the presence of asynchronous switching, Journal of the Franklin Institute, vol. 350, no. 1, 2013.
[12] A. Heydari and S. Balakrishnan, Optimal multi-therapeutic HIV treatment using a global optimal switching scheme, Applied Mathematics and Computation, vol. 219, no. 14, 2013.
[13] C. Qin, H. Zhang, Y. Luo, and B. Wang, Finite horizon optimal control of non-linear discrete-time switched systems using adaptive dynamic programming with epsilon-error bound, International Journal of Systems Science, 2013.
[14] W. Lu and S. Ferrari, An approximate dynamic programming approach for model-free control of switched systems, in Proceedings of the IEEE Conference on Decision and Control, 2013.
[15] M. Rinehart, M. Dahleh, D. Reed, and I. Kolmanovsky, Suboptimal control of switched systems with an application to the disc engine, IEEE Transactions on Control Systems Technology, vol. 16, no. 2, 2008.
[16] A. Heydari and S. N. Balakrishnan, Optimal orbit transfer with on-off actuators using a closed form optimal switching scheme, in AIAA Guidance, Navigation, and Control Conference, 2013.
[17] K. Benmansour, A. Benalia, M. Djemai, and J. de Leon, Hybrid control of a multicellular converter, Nonlinear Analysis: Hybrid Systems, vol. 1, no. 1, pp. 16-29, 2007.
[18] C. Liu and Z. Gong, Modelling and optimal control of a time-delayed switched system in fed-batch process, Journal of the Franklin Institute, vol. 351, no. 2, 2014.
[19] E. Hernandez-Vargas, P. Colaneri, R. Middleton, and F. Blanchini, Discrete-time control for switched positive systems with application to mitigating viral escape, International Journal of Robust and Nonlinear Control, 2011.
[20] J. Zhai, B. Shen, J. Gao, E. Feng, and H. Yin, Optimal control of switched systems and its parallel optimization algorithm, Journal of Computational and Applied Mathematics, vol. 261, 2014.
[21] B. Lincoln and A.
Rantzer, Relaxing dynamic programming, IEEE Transactions on Automatic Control, vol. 51, pp. 1249-1260, Aug. 2006.
[22] M. Rinehart, M. Dahleh, and I. Kolmanovsky, Value iteration for (switched) homogeneous systems, IEEE Transactions on Automatic Control, vol. 54, no. 6, 2009.
[23] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2nd ed., 2012.
[24] P. J. Werbos, Approximate dynamic programming for real-time control and neural modeling, in Handbook of Intelligent Control (D. A. White and D. A. Sofge, eds.), Multiscience Press, 1992.
[25] S. N. Balakrishnan and V. Biega, Adaptive-critic based neural networks for aircraft optimal control, Journal of Guidance, Control and Dynamics, vol. 19, 1996.
[26] D. Prokhorov and D. Wunsch, Adaptive critic designs, IEEE Transactions on Neural Networks, vol. 8, 1997.
[27] A. Al-Tamimi, F. Lewis, and M. Abu-Khalaf, Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, Aug. 2008.
[28] G. Venayagamoorthy, R. Harley, and D. Wunsch, Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator, IEEE Transactions on Neural Networks, vol. 13, May 2002.
[29] P. He and S. Jagannathan, Reinforcement learning-based output feedback control of nonlinear systems with input constraints, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 35, no. 1, 2005.
[30] H. Zhang, Q. Wei, and Y. Luo, A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, 2008.
[31] T. Dierks, B. T. Thumati, and S. Jagannathan, Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence, Neural Networks, vol.
22, no. 5-6, 2009.
[32] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming, Automatica, vol. 48, no. 8, 2012.

[33] F. Lewis, D. Vrabie, and K. Vamvoudakis, Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers, IEEE Control Systems, vol. 32, Dec. 2012.
[34] M. Fairbank, E. Alonso, and D. Prokhorov, An equivalence between adaptive dynamic programming with a critic and backpropagation through time, IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, 2013.
[35] X. Chen, Y. Gao, and R. Wang, Online selective kernel-based temporal difference learning, IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, 2013.
[36] A. Heydari and S. N. Balakrishnan, Fixed-final-time optimal control of nonlinear systems with terminal constraints, Neural Networks, vol. 48, pp. 61-71, 2013.
[37] Q. Zhao, H. Xu, and S. Jagannathan, Optimal control of uncertain quantized linear discrete-time systems, International Journal of Adaptive Control and Signal Processing, 2014.
[38] A. Heydari and S. Balakrishnan, Optimal switching and control of nonlinear switching systems using approximate dynamic programming, IEEE Transactions on Neural Networks and Learning Systems, vol. 25, 2014.
[39] A. Heydari and S. Balakrishnan, Optimal switching between autonomous subsystems, Journal of the Franklin Institute, vol. 351, 2014.
[40] A. Heydari and S. Balakrishnan, Optimal switching between controlled subsystems with free mode sequence, Neurocomputing, vol. 149, 2015.
[41] C. Qin, H. Zhang, and Y. Luo, Optimal tracking control of a class of nonlinear discrete-time switched systems using adaptive dynamic programming, Neural Computing and Applications, vol. 24, no. 3-4, 2014.
[42] A. Heydari, Revisiting approximate dynamic programming and its convergence, IEEE Transactions on Cybernetics, vol. 44, no. 12, 2014.
[43] A. Heydari and S. N. Balakrishnan, Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics, IEEE Trans. Neural Netw. Learning Syst., vol.
24, no. 1, 2013.
[44] D. E. Kirk, Optimal Control Theory: An Introduction. Prentice-Hall, 1970.
[45] W. F. Trench, Introduction to Real Analysis. Available online.
[46] W. Rudin, Principles of Mathematical Analysis. McGraw-Hill, 3rd ed., 1976, pp. 55, 60.
[47] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.
[48] H. Jeffreys and B. S. Jeffreys, Weierstrass's theorem on approximation by polynomials, in Methods of Mathematical Physics, Cambridge University Press, 3rd ed., 1988.
[49] Source codes for the simulations, available online.


More information

An homotopy method for exact tracking of nonlinear nonminimum phase systems: the example of the spherical inverted pendulum

An homotopy method for exact tracking of nonlinear nonminimum phase systems: the example of the spherical inverted pendulum 9 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June -, 9 FrA.5 An homotopy method for exact tracking of nonlinear nonminimum phase systems: the example of the spherical inverted

More information

Lyapunov Stability of Linear Predictor Feedback for Distributed Input Delays

Lyapunov Stability of Linear Predictor Feedback for Distributed Input Delays IEEE TRANSACTIONS ON AUTOMATIC CONTROL VOL. 56 NO. 3 MARCH 2011 655 Lyapunov Stability of Linear Predictor Feedback for Distributed Input Delays Nikolaos Bekiaris-Liberis Miroslav Krstic In this case system

More information

OPTIMAL CONTROL OF SWITCHING SURFACES IN HYBRID DYNAMIC SYSTEMS. Mauro Boccadoro Magnus Egerstedt,1 Yorai Wardi,1

OPTIMAL CONTROL OF SWITCHING SURFACES IN HYBRID DYNAMIC SYSTEMS. Mauro Boccadoro Magnus Egerstedt,1 Yorai Wardi,1 OPTIMAL CONTROL OF SWITCHING SURFACES IN HYBRID DYNAMIC SYSTEMS Mauro Boccadoro Magnus Egerstedt,1 Yorai Wardi,1 boccadoro@diei.unipg.it Dipartimento di Ingegneria Elettronica e dell Informazione Università

More information

Approximate optimal control for a class of nonlinear discrete-time systems with saturating actuators

Approximate optimal control for a class of nonlinear discrete-time systems with saturating actuators Available online at www.sciencedirect.com Progress in Natural Science 18 (28) 123 129 www.elsevier.com/locate/pnsc Approximate optimal control for a class of nonlinear discrete-time systems with saturating

More information

CHATTERING-FREE SMC WITH UNIDIRECTIONAL AUXILIARY SURFACES FOR NONLINEAR SYSTEM WITH STATE CONSTRAINTS. Jian Fu, Qing-Xian Wu and Ze-Hui Mao

CHATTERING-FREE SMC WITH UNIDIRECTIONAL AUXILIARY SURFACES FOR NONLINEAR SYSTEM WITH STATE CONSTRAINTS. Jian Fu, Qing-Xian Wu and Ze-Hui Mao International Journal of Innovative Computing, Information and Control ICIC International c 2013 ISSN 1349-4198 Volume 9, Number 12, December 2013 pp. 4793 4809 CHATTERING-FREE SMC WITH UNIDIRECTIONAL

More information

Distributed Receding Horizon Control of Cost Coupled Systems

Distributed Receding Horizon Control of Cost Coupled Systems Distributed Receding Horizon Control of Cost Coupled Systems William B. Dunbar Abstract This paper considers the problem of distributed control of dynamically decoupled systems that are subject to decoupled

More information

Noncausal Optimal Tracking of Linear Switched Systems

Noncausal Optimal Tracking of Linear Switched Systems Noncausal Optimal Tracking of Linear Switched Systems Gou Nakura Osaka University, Department of Engineering 2-1, Yamadaoka, Suita, Osaka, 565-0871, Japan nakura@watt.mech.eng.osaka-u.ac.jp Abstract. In

More information

L p Approximation of Sigma Pi Neural Networks

L p Approximation of Sigma Pi Neural Networks IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 6, NOVEMBER 2000 1485 L p Approximation of Sigma Pi Neural Networks Yue-hu Luo and Shi-yi Shen Abstract A feedforward Sigma Pi neural networks with a

More information

Packet-loss Dependent Controller Design for Networked Control Systems via Switched System Approach

Packet-loss Dependent Controller Design for Networked Control Systems via Switched System Approach Proceedings of the 47th IEEE Conference on Decision and Control Cancun, Mexico, Dec. 9-11, 8 WeC6.3 Packet-loss Dependent Controller Design for Networked Control Systems via Switched System Approach Junyan

More information

Distributed and Real-time Predictive Control

Distributed and Real-time Predictive Control Distributed and Real-time Predictive Control Melanie Zeilinger Christian Conte (ETH) Alexander Domahidi (ETH) Ye Pu (EPFL) Colin Jones (EPFL) Challenges in modern control systems Power system: - Frequency

More information

Elements of Reinforcement Learning

Elements of Reinforcement Learning Elements of Reinforcement Learning Policy: way learning algorithm behaves (mapping from state to action) Reward function: Mapping of state action pair to reward or cost Value function: long term reward,

More information

IN recent years, controller design for systems having complex

IN recent years, controller design for systems having complex 818 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART B: CYBERNETICS, VOL 29, NO 6, DECEMBER 1999 Adaptive Neural Network Control of Nonlinear Systems by State and Output Feedback S S Ge, Member,

More information

NEURAL NETWORKS (NNs) play an important role in

NEURAL NETWORKS (NNs) play an important role in 1630 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART B: CYBERNETICS, VOL 34, NO 4, AUGUST 2004 Adaptive Neural Network Control for a Class of MIMO Nonlinear Systems With Disturbances in Discrete-Time

More information

Global stabilization of feedforward systems with exponentially unstable Jacobian linearization

Global stabilization of feedforward systems with exponentially unstable Jacobian linearization Global stabilization of feedforward systems with exponentially unstable Jacobian linearization F Grognard, R Sepulchre, G Bastin Center for Systems Engineering and Applied Mechanics Université catholique

More information

The ϵ-capacity of a gain matrix and tolerable disturbances: Discrete-time perturbed linear systems

The ϵ-capacity of a gain matrix and tolerable disturbances: Discrete-time perturbed linear systems IOSR Journal of Mathematics (IOSR-JM) e-issn: 2278-5728, p-issn: 2319-765X. Volume 11, Issue 3 Ver. IV (May - Jun. 2015), PP 52-62 www.iosrjournals.org The ϵ-capacity of a gain matrix and tolerable disturbances:

More information

Stabilization of Discrete-Time Switched Linear Systems: A Control-Lyapunov Function Approach

Stabilization of Discrete-Time Switched Linear Systems: A Control-Lyapunov Function Approach Stabilization of Discrete-Time Switched Linear Systems: A Control-Lyapunov Function Approach Wei Zhang 1, Alessandro Abate 2 and Jianghai Hu 1 1 School of Electrical and Computer Engineering, Purdue University,

More information

Tube Model Predictive Control Using Homothety & Invariance

Tube Model Predictive Control Using Homothety & Invariance Tube Model Predictive Control Using Homothety & Invariance Saša V. Raković rakovic@control.ee.ethz.ch http://control.ee.ethz.ch/~srakovic Collaboration in parts with Mr. Mirko Fiacchini Automatic Control

More information

Robustness of the nonlinear PI control method to ignored actuator dynamics

Robustness of the nonlinear PI control method to ignored actuator dynamics arxiv:148.3229v1 [cs.sy] 14 Aug 214 Robustness of the nonlinear PI control method to ignored actuator dynamics Haris E. Psillakis Hellenic Electricity Network Operator S.A. psilakish@hotmail.com Abstract

More information

Navigation and Obstacle Avoidance via Backstepping for Mechanical Systems with Drift in the Closed Loop

Navigation and Obstacle Avoidance via Backstepping for Mechanical Systems with Drift in the Closed Loop Navigation and Obstacle Avoidance via Backstepping for Mechanical Systems with Drift in the Closed Loop Jan Maximilian Montenbruck, Mathias Bürger, Frank Allgöwer Abstract We study backstepping controllers

More information

Gain Scheduling Control with Multi-loop PID for 2-DOF Arm Robot Trajectory Control

Gain Scheduling Control with Multi-loop PID for 2-DOF Arm Robot Trajectory Control Gain Scheduling Control with Multi-loop PID for 2-DOF Arm Robot Trajectory Control Khaled M. Helal, 2 Mostafa R.A. Atia, 3 Mohamed I. Abu El-Sebah, 2 Mechanical Engineering Department ARAB ACADEMY FOR

More information

4. Multilayer Perceptrons

4. Multilayer Perceptrons 4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

More information

Nonlinear Control Design for Linear Differential Inclusions via Convex Hull Quadratic Lyapunov Functions

Nonlinear Control Design for Linear Differential Inclusions via Convex Hull Quadratic Lyapunov Functions Nonlinear Control Design for Linear Differential Inclusions via Convex Hull Quadratic Lyapunov Functions Tingshu Hu Abstract This paper presents a nonlinear control design method for robust stabilization

More information

Multiple-mode switched observer-based unknown input estimation for a class of switched systems

Multiple-mode switched observer-based unknown input estimation for a class of switched systems Multiple-mode switched observer-based unknown input estimation for a class of switched systems Yantao Chen 1, Junqi Yang 1 *, Donglei Xie 1, Wei Zhang 2 1. College of Electrical Engineering and Automation,

More information

Event-based Stabilization of Nonlinear Time-Delay Systems

Event-based Stabilization of Nonlinear Time-Delay Systems Preprints of the 19th World Congress The International Federation of Automatic Control Event-based Stabilization of Nonlinear Time-Delay Systems Sylvain Durand Nicolas Marchand J. Fermi Guerrero-Castellanos

More information

Delay-dependent Stability Analysis for Markovian Jump Systems with Interval Time-varying-delays

Delay-dependent Stability Analysis for Markovian Jump Systems with Interval Time-varying-delays International Journal of Automation and Computing 7(2), May 2010, 224-229 DOI: 10.1007/s11633-010-0224-2 Delay-dependent Stability Analysis for Markovian Jump Systems with Interval Time-varying-delays

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Temporal Backpropagation for FIR Neural Networks

Temporal Backpropagation for FIR Neural Networks Temporal Backpropagation for FIR Neural Networks Eric A. Wan Stanford University Department of Electrical Engineering, Stanford, CA 94305-4055 Abstract The traditional feedforward neural network is a static

More information

Book review for Stability and Control of Dynamical Systems with Applications: A tribute to Anthony M. Michel

Book review for Stability and Control of Dynamical Systems with Applications: A tribute to Anthony M. Michel To appear in International Journal of Hybrid Systems c 2004 Nonpareil Publishers Book review for Stability and Control of Dynamical Systems with Applications: A tribute to Anthony M. Michel João Hespanha

More information

Theory in Model Predictive Control :" Constraint Satisfaction and Stability!

Theory in Model Predictive Control : Constraint Satisfaction and Stability! Theory in Model Predictive Control :" Constraint Satisfaction and Stability Colin Jones, Melanie Zeilinger Automatic Control Laboratory, EPFL Example: Cessna Citation Aircraft Linearized continuous-time

More information

A recursive algorithm based on the extended Kalman filter for the training of feedforward neural models. Isabelle Rivals and Léon Personnaz

A recursive algorithm based on the extended Kalman filter for the training of feedforward neural models. Isabelle Rivals and Léon Personnaz In Neurocomputing 2(-3): 279-294 (998). A recursive algorithm based on the extended Kalman filter for the training of feedforward neural models Isabelle Rivals and Léon Personnaz Laboratoire d'électronique,

More information

Indirect Model Reference Adaptive Control System Based on Dynamic Certainty Equivalence Principle and Recursive Identifier Scheme

Indirect Model Reference Adaptive Control System Based on Dynamic Certainty Equivalence Principle and Recursive Identifier Scheme Indirect Model Reference Adaptive Control System Based on Dynamic Certainty Equivalence Principle and Recursive Identifier Scheme Itamiya, K. *1, Sawada, M. 2 1 Dept. of Electrical and Electronic Eng.,

More information

AFAULT diagnosis procedure is typically divided into three

AFAULT diagnosis procedure is typically divided into three 576 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 47, NO. 4, APRIL 2002 A Robust Detection and Isolation Scheme for Abrupt and Incipient Faults in Nonlinear Systems Xiaodong Zhang, Marios M. Polycarpou,

More information

Backstepping Control of Linear Time-Varying Systems With Known and Unknown Parameters

Backstepping Control of Linear Time-Varying Systems With Known and Unknown Parameters 1908 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 48, NO 11, NOVEMBER 2003 Backstepping Control of Linear Time-Varying Systems With Known and Unknown Parameters Youping Zhang, Member, IEEE, Barış Fidan,

More information

ADAPTIVE FILTER THEORY

ADAPTIVE FILTER THEORY ADAPTIVE FILTER THEORY Fourth Edition Simon Haykin Communications Research Laboratory McMaster University Hamilton, Ontario, Canada Front ice Hall PRENTICE HALL Upper Saddle River, New Jersey 07458 Preface

More information

Optimal Control of Switching Surfaces in Hybrid Dynamical Systems

Optimal Control of Switching Surfaces in Hybrid Dynamical Systems Optimal Control of Switching Surfaces in Hybrid Dynamical Systems M. Boccadoro, Y. Wardi, M. Egerstedt, and E. Verriest boccadoro@diei.unipg.it Dipartimento di Ingegneria Elettronica e dell Informazione

More information

1 The Observability Canonical Form

1 The Observability Canonical Form NONLINEAR OBSERVERS AND SEPARATION PRINCIPLE 1 The Observability Canonical Form In this Chapter we discuss the design of observers for nonlinear systems modelled by equations of the form ẋ = f(x, u) (1)

More information

Nonlinear Tracking Control of Underactuated Surface Vessel

Nonlinear Tracking Control of Underactuated Surface Vessel American Control Conference June -. Portland OR USA FrB. Nonlinear Tracking Control of Underactuated Surface Vessel Wenjie Dong and Yi Guo Abstract We consider in this paper the tracking control problem

More information

Direct Method for Training Feed-forward Neural Networks using Batch Extended Kalman Filter for Multi- Step-Ahead Predictions

Direct Method for Training Feed-forward Neural Networks using Batch Extended Kalman Filter for Multi- Step-Ahead Predictions Direct Method for Training Feed-forward Neural Networks using Batch Extended Kalman Filter for Multi- Step-Ahead Predictions Artem Chernodub, Institute of Mathematical Machines and Systems NASU, Neurotechnologies

More information

arxiv: v1 [cs.lg] 23 Oct 2017

arxiv: v1 [cs.lg] 23 Oct 2017 Accelerated Reinforcement Learning K. Lakshmanan Department of Computer Science and Engineering Indian Institute of Technology (BHU), Varanasi, India Email: lakshmanank.cse@itbhu.ac.in arxiv:1710.08070v1

More information

Neural Dynamic Optimization for Control Systems Part II: Theory

Neural Dynamic Optimization for Control Systems Part II: Theory 490 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART B: CYBERNETICS, VOL. 31, NO. 4, AUGUST 2001 Neural Dynamic Optimization for Control Systems Part II: Theory Chang-Yun Seong, Member, IEEE, and

More information

Global Stability and Asymptotic Gain Imply Input-to-State Stability for State-Dependent Switched Systems

Global Stability and Asymptotic Gain Imply Input-to-State Stability for State-Dependent Switched Systems 2018 IEEE Conference on Decision and Control (CDC) Miami Beach, FL, USA, Dec. 17-19, 2018 Global Stability and Asymptotic Gain Imply Input-to-State Stability for State-Dependent Switched Systems Shenyu

More information

I. MAIN NOTATION LIST

I. MAIN NOTATION LIST IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 5, MAY 2008 817 Robust Neural Network Tracking Controller Using Simultaneous Perturbation Stochastic Approximation Qing Song, Member, IEEE, James C. Spall,

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

A Robust Controller for Scalar Autonomous Optimal Control Problems

A Robust Controller for Scalar Autonomous Optimal Control Problems A Robust Controller for Scalar Autonomous Optimal Control Problems S. H. Lam 1 Department of Mechanical and Aerospace Engineering Princeton University, Princeton, NJ 08544 lam@princeton.edu Abstract Is

More information

Applications of Controlled Invariance to the l 1 Optimal Control Problem

Applications of Controlled Invariance to the l 1 Optimal Control Problem Applications of Controlled Invariance to the l 1 Optimal Control Problem Carlos E.T. Dórea and Jean-Claude Hennet LAAS-CNRS 7, Ave. du Colonel Roche, 31077 Toulouse Cédex 4, FRANCE Phone : (+33) 61 33

More information

On Design of Reduced-Order H Filters for Discrete-Time Systems from Incomplete Measurements

On Design of Reduced-Order H Filters for Discrete-Time Systems from Incomplete Measurements Proceedings of the 47th IEEE Conference on Decision and Control Cancun, Mexico, Dec. 9-11, 2008 On Design of Reduced-Order H Filters for Discrete-Time Systems from Incomplete Measurements Shaosheng Zhou

More information

THE nonholonomic systems, that is Lagrange systems

THE nonholonomic systems, that is Lagrange systems Finite-Time Control Design for Nonholonomic Mobile Robots Subject to Spatial Constraint Yanling Shang, Jiacai Huang, Hongsheng Li and Xiulan Wen Abstract This paper studies the problem of finite-time stabilizing

More information

Characterizing Uniformly Ultimately Bounded Switching Signals for Uncertain Switched Linear Systems

Characterizing Uniformly Ultimately Bounded Switching Signals for Uncertain Switched Linear Systems Proceedings of the 46th IEEE Conference on Decision and Control New Orleans, LA, USA, Dec. 12-14, 2007 Characterizing Uniformly Ultimately Bounded Switching Signals for Uncertain Switched Linear Systems

More information

An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints

An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints Klaus Schittkowski Department of Computer Science, University of Bayreuth 95440 Bayreuth, Germany e-mail:

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Optimization Methods for Machine Learning Decomposition methods for FFN

Optimization Methods for Machine Learning Decomposition methods for FFN Optimization Methods for Machine Learning Laura Palagi http://www.dis.uniroma1.it/ palagi Dipartimento di Ingegneria informatica automatica e gestionale A. Ruberti Sapienza Università di Roma Via Ariosto

More information

90 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 1, JANUARY /$ IEEE

90 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 1, JANUARY /$ IEEE 90 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 1, JANUARY 2008 Generalized Hamilton Jacobi Bellman Formulation -Based Neural Network Control of Affine Nonlinear Discrete-Time Systems Zheng Chen,

More information

Neural-network-observer-based optimal control for unknown nonlinear systems using adaptive dynamic programming

Neural-network-observer-based optimal control for unknown nonlinear systems using adaptive dynamic programming International Journal of Control, 013 Vol. 86, No. 9, 1554 1566, http://dx.doi.org/10.1080/0007179.013.79056 Neural-network-observer-based optimal control for unknown nonlinear systems using adaptive dynamic

More information

Hybrid particle swarm algorithm for solving nonlinear constraint. optimization problem [5].

Hybrid particle swarm algorithm for solving nonlinear constraint. optimization problem [5]. Hybrid particle swarm algorithm for solving nonlinear constraint optimization problems BINGQIN QIAO, XIAOMING CHANG Computers and Software College Taiyuan University of Technology Department of Economic

More information

Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games

Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games Alberto Bressan ) and Khai T. Nguyen ) *) Department of Mathematics, Penn State University **) Department of Mathematics,

More information

On Computing the Worst-case Performance of Lur'e Systems with Uncertain Time-invariant Delays

On Computing the Worst-case Performance of Lur'e Systems with Uncertain Time-invariant Delays Article On Computing the Worst-case Performance of Lur'e Systems with Uncertain Time-invariant Delays Thapana Nampradit and David Banjerdpongchai* Department of Electrical Engineering, Faculty of Engineering,

More information

ECE Introduction to Artificial Neural Network and Fuzzy Systems

ECE Introduction to Artificial Neural Network and Fuzzy Systems ECE 39 - Introduction to Artificial Neural Network and Fuzzy Systems Wavelet Neural Network control of two Continuous Stirred Tank Reactors in Series using MATLAB Tariq Ahamed Abstract. With the rapid

More information

Adaptive Control of a Class of Nonlinear Systems with Nonlinearly Parameterized Fuzzy Approximators

Adaptive Control of a Class of Nonlinear Systems with Nonlinearly Parameterized Fuzzy Approximators IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 9, NO. 2, APRIL 2001 315 Adaptive Control of a Class of Nonlinear Systems with Nonlinearly Parameterized Fuzzy Approximators Hugang Han, Chun-Yi Su, Yury Stepanenko

More information

An asymptotic ratio characterization of input-to-state stability

An asymptotic ratio characterization of input-to-state stability 1 An asymptotic ratio characterization of input-to-state stability Daniel Liberzon and Hyungbo Shim Abstract For continuous-time nonlinear systems with inputs, we introduce the notion of an asymptotic

More information