Stability Analysis of Optimal Adaptive Control under Value Iteration using a Stabilizing Initial Policy

Ali Heydari, Member, IEEE

Abstract—Adaptive optimal control using value iteration initiated from a stabilizing control policy is theoretically analyzed. The analysis concerns stability of the system during the learning stage, both when the system is controlled by any fixed control policy and when it is controlled by an evolving policy. A feature of the presented results is finding subsets of the region of attraction, selected so that if the initial condition belongs to such a subset, the entire state trajectory remains within the training region. Therefore, the function approximation results remain reliable, as no extrapolation will be conducted.

Index Terms—Value iteration; approximate/adaptive dynamic programming; adaptive optimal control; stabilizing value iteration.

(A. Heydari is an Assistant Professor of Mechanical Engineering with Southern Methodist University, Dallas, TX, aheydari@smu.edu. This material is based upon work supported by the National Science Foundation under Grant No. Initial results of this research were presented at the 2016 American Control Conference through Ref. [1].)

I. INTRODUCTION

Approximate/adaptive dynamic programming (ADP) as a framework for learning optimal control has received enormous attention in the last two decades, [2]–[4]. ADP-based learning algorithms are typically classified as policy iteration (PI) and value iteration (VI) algorithms, [3], [4]. These methods are used for machine learning [3] and also for feedback control of dynamical systems [2], [4]. The control policy during the learning stage remains stabilizing in PI, [5]; therefore, PI is naturally more attractive for online learning. However, the learning then needs to start with a stabilizing initial control policy. VI, on the other hand, can be initiated using an arbitrary policy, but the control policy during the learning stage (i.e., the immature control policy) may not stabilize the system. Stability under VI-based results after the conclusion of (a finite number of) iterations was investigated in [6], [7].

In this study it is proved that any immature control policy generated using VI will also stabilize the system if the iteration is started using an initial stabilizing control policy, as in PI. Afterwards, the important concern that learning-based results are valid only if the future states stay within the region on which the controller is tuned is addressed. It should be noted that, in the general case, it is not guaranteed that a state trajectory initiated from this region will remain inside it. If the trajectory exits the region, the trained controller becomes invalid, as the controller is not reliable for extrapolation. In this study, this problem is solved by obtaining a subset of the region of attraction (SROA) [8, Section 8.2] for the closed loop system. Once done, if the initial condition of the system is inside this subset, the entire trajectory is guaranteed to remain in the region over which the controller is valid. Then, it is discussed that the provided stability proof, which is not substantially different from [9], is based on operating the system using a fixed control policy. However, in online learning the control policy evolves, i.e., the policy changes versus time, and once a time-varying control policy is applied, the previous stability result is no longer applicable.
Therefore, another set of stability results for the evolving/changing control policy is developed, with an idea for establishing its respective SROA, as another contribution of this work. In Ref. [7], stability of VI-based algorithms with arbitrary initial guesses and after the training stage was investigated. The current study, however, investigates VI initiated using a stabilizing guess and during the training phase. Compared with [9] and also with existing results for PI, this study establishes SROAs and the closed loop stability under evolving policies. Another relatively similar result is Ref. [10]. A difference between this work and [10] is that no termination assumption is made here. Under the termination assumption, starting from a non-zero initial state, there exists a finite time after which the cost-to-go becomes zero. Finally, it may be mentioned that initial results of this research were presented in Ref. [1]. The main differences compared with that conference paper are listed next. a) The rigor of the analyses is improved. b) The proof of Lemma 1 is included. c) The result in Theorem 2 is extended to the case of applying each policy for more than one time step, while Ref. [1] required each policy to be applied for exactly one step. d) An idea for establishing an SROA for the case of applying evolving policies is presented. e) Numerical analyses are included in the current study.

As for the organization of the study, the problem formulation is given in Section II and the value iteration based solution is reviewed in Section III. The main results, i.e., the stability analyses, are presented in Section IV, followed by numerical simulations and conclusions in Sections V and VI, respectively.

II. PROBLEM FORMULATION

Let the discrete-time nonlinear dynamics

$x_{k+1} = f(x_k, u_k), \quad k \in \mathbb{N}, \qquad (1)$

be considered, where $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is a continuous function in $x \in \mathbb{R}^n$, the state vector, and in $u \in \mathbb{R}^m$, the control vector, with $f(0, 0) = 0$. The sets of real numbers and non-negative integers are denoted with $\mathbb{R}$ and $\mathbb{N}$, respectively, and subscript $k$ denotes the discrete time index. Cost function

$J = \sum_{k=0}^{\infty} U(x_k, u_k) \qquad (2)$

is selected, where the utility function $U(x_k, u_k) := Q(x_k) + R(u_k)$ is assumed. Let continuous functions $Q : \mathbb{R}^n \to \mathbb{R}_+$ and $R : \mathbb{R}^m \to \mathbb{R}_+$ be positive semi-definite and positive definite, respectively. Set $\mathbb{R}_+$ denotes the non-negative reals. Let a control policy be given by $\pi : \mathbb{R}^n \to \mathbb{R}^m$ for feedback control calculation, i.e., $u_k = \pi(x_k)$. The objective is finding the optimal control policy, denoted with $\pi^*(\cdot)$, that is, the policy using which cost function (2) is minimized, subject to dynamics (1). In online learning, this process is done through selecting an initial control policy and updating it until it converges to the optimal control policy.

Definition 1. Let $\Omega \subset \mathbb{R}^n$ be a compact and connected set containing the origin as an interior point. Also, let $V_\pi : \mathbb{R}^n \to \mathbb{R}_+$ denote the value function of policy $\pi(\cdot)$, i.e.,

$V_\pi(x_0) = \sum_{k=0}^{\infty} U\big(x_k^\pi, \pi(x_k^\pi)\big), \qquad (3)$

where $x_k^\pi$ denotes the $k$th element of the state history started from $x_0$ and generated using control policy $\pi(\cdot)$. Then, control policy $\pi(\cdot)$ is called admissible in $\Omega$ if the following two conditions hold. 1) The policy is a continuous function in $\mathbb{R}^n$ satisfying $\pi(0) = 0$. 2) There exists a continuous positive definite function $W : \mathbb{R}^n \to \mathbb{R}_+$ such that $V_\pi(x) \le W(x), \forall x \in \Omega$.

The defined admissibility is slightly different from the typical definitions, as in [11]. While the continuity of the value function is a requirement for its uniform approximation [12] and also for using it as a candidate Lyapunov function, the milder condition of the value function being bounded by a continuous function is selected. It will be shown that this upper boundedness leads to the desired continuity, which in turn leads to the value function's boundedness on compact sets, [13, Theorem 4.15]. The following two assumptions apply to the results presented in the rest of this study.

Assumption 1. There exists an admissible policy in $\Omega$.

Assumption 2. The intersection of the set of $n$-vectors $x$ at which $U(x, 0) = 0$ with the invariant set of $f(\cdot, 0)$ only contains the origin, i.e., no solution of $x_{k+1} = f(x_k, 0)$ can remain in $\{x \in \mathbb{R}^n : U(x, 0) = 0\}$, other than $x_k = 0, \forall k$.

By Assumption 1, there is no state vector in $\Omega$ whose optimal value function, defined in the next section, is infinite. It may be mentioned that feedback linearization, [14], [15], is an example approach for finding the initial admissible policy. Another approach is using a control Lyapunov function (CLF), as done in [16]. Such a CLF guarantees 1) continuity of the resulting policy, 2) continuity of the upper bound of the value function, and 3) the feature of the upper bound vanishing at the origin, [16, Section III.B]. Finally, Assumption 2 guarantees that there is no set of states in which the state trajectory can hide without convergence to the origin. If, for example, $U(\cdot, \cdot)$ is positive definite, this assumption is trivially satisfied.

III. REVISITING VALUE ITERATION-BASED SOLUTION

The value function of control policy $\pi(\cdot)$ satisfies

$V_\pi(x) = U\big(x, \pi(x)\big) + V_\pi\big(f(x, \pi(x))\big), \quad \forall x \in \mathbb{R}^n, \qquad (4)$

per Eq. (3). The optimal value function may be defined as the value function of the optimal control policy. Denoting it with $V^*(\cdot)$, the Bellman equation [17], [10] provides the solution to the problem:

$\pi^*(x) \in \arg\min_{u \in \mathbb{R}^m} \Big( U(x, u) + V^*\big(f(x, u)\big) \Big), \qquad (5)$

$V^*(x) = \min_{u \in \mathbb{R}^m} \Big( U(x, u) + V^*\big(f(x, u)\big) \Big). \qquad (6)$

It is worth mentioning that the minimizing $u$ in (5) may not be unique. Motivated by [10], the notation $\in$ is used here to allow selecting any of the minimizers. Solving the Bellman equation is computationally intractable for general nonlinear systems (the curse of dimensionality, [17, p. 78]).
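To make the backup in (5) and (6) concrete, the following minimal sketch performs one Bellman backup by brute force over a finite control grid. All names here (f, Q, R, u_grid, and the toy system in the usage lines) are illustrative assumptions of the sketch, not quantities fixed by the paper.

import numpy as np

# Minimal sketch of one Bellman backup, Eqs. (5)-(6): given a current
# value-function estimate V, return a minimizing control and the backed-up
# value at state x. All callables are hypothetical placeholders.
def bellman_backup(x, V, f, Q, R, u_grid):
    costs = [Q(x) + R(u) + V(f(x, u)) for u in u_grid]
    j = int(np.argmin(costs))
    return u_grid[j], costs[j]

# Example usage on a toy scalar system x_{k+1} = 0.5 x + u:
if __name__ == "__main__":
    f = lambda x, u: 0.5 * x + u
    Q = lambda x: 0.25 * x**2        # positive semi-definite state cost
    R = lambda u: 0.5 * u**2         # positive definite control cost
    V = lambda x: x**2               # stand-in value-function estimate
    u_star, v_new = bellman_backup(1.0, V, f, Q, R, np.linspace(-2, 2, 401))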
The idea of approximating the optimal value function is pursued in ADP. This approximation is done using function approximators, e.g., neural networks (NNs), or look-up tables. The approximation/tuning is conducted over a connected and compact set with the origin as an interior point, namely, the region of interest, denoted by $\Omega$. This region needs to be selected based on the expected operation envelope of the system, i.e., the states expected to be visited during online control. If the states exit this region, the tuned controller becomes invalid.

Approximation of the optimal value function can be done using VI. Starting with a selected $V^0(\cdot)$, one iterates through the policy update equation

$\pi^i(x) \in \arg\min_{u \in \mathbb{R}^m} \Big( U(x, u) + V^i\big(f(x, u)\big) \Big), \quad \forall x \in \Omega, \qquad (7)$

and the value update equation

$V^{i+1}(x) = U\big(x, \pi^i(x)\big) + V^i\big(f(x, \pi^i(x))\big), \quad \forall x \in \Omega, \qquad (8)$

in VI. Equivalently, the iterations may be given by

$V^{i+1}(x) = \min_{u \in \mathbb{R}^m} \Big( U(x, u) + V^i\big(f(x, u)\big) \Big), \quad \forall x \in \Omega. \qquad (9)$

The iterations are done for $i = 0, 1, \ldots$ until they converge. If the iterations converge to the optimal value function, i.e., if $V^i(\cdot) \to V^*(\cdot)$ as $i \to \infty$, the resulting $V^*(\cdot)$ can be used in (5) for finding the (approximate) optimal policy.

IV. STABILITY ANALYSIS UNDER VALUE ITERATION

Let the initial $V^0(\cdot)$ be selected as the value function of an admissible control policy. For brevity, the resulting VI is called stabilizing VI, as defined next.

Definition 2. Stabilizing value iteration is defined as the value iteration algorithm (9) initiated by the value function of an admissible control policy.

Selecting the initial admissible policy $\pi(\cdot)$, its value function, $V_\pi(\cdot)$, can be obtained using (4). One way of solving (4) for $V_\pi(\cdot)$ is using the successive approximation given by

$V_\pi^{j+1}(x) = U\big(x, \pi(x)\big) + V_\pi^j\big(f(x, \pi(x))\big), \quad \forall x \in \Omega, \qquad (10)$

where the superscript on $V_\pi^j(\cdot)$ is the index of iteration. Starting with the initial guess of $V_\pi^0(x) = 0, \forall x$, the iterations converge to $V_\pi(\cdot)$, [1], [4]. Selecting $V^0(\cdot) = V_\pi(\cdot)$ as the initial guess in VI, the stability of the system using the $\pi^i(\cdot)$s can be established. Before that, some theoretical results are needed. Let $V(\cdot) \in C(\Omega)$ denote that function $V(\cdot)$ is continuous in $\Omega$.
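A minimal tabular sketch of stabilizing VI (Definition 2) follows: the admissible policy is first evaluated via the successive approximation (10) to produce V^0 = V_pi, after which the recursion (9) is iterated on a sampled state set. The quantized dictionary stands in for the paper's function approximator, and the crude treatment of states falling off the grid is an assumption of the sketch.

import numpy as np

# Sketch of stabilizing VI (Definition 2) in tabular form; everything here
# (grids, quantization step, defaults) is an illustrative assumption.
def quantize(x, h=0.1):
    return tuple(np.round(np.asarray(x) / h) * h)

def lookup(V, x, h=0.1):
    return V.get(quantize(x, h), 0.0)   # crude: unseen states read as 0

def evaluate_policy(pi, f, U, states, sweeps=200):
    V = {quantize(x): 0.0 for x in states}           # V_pi^0 = 0
    for _ in range(sweeps):                          # Eq. (10)
        V = {quantize(x): U(x, pi(x)) + lookup(V, f(x, pi(x)))
             for x in states}
    return V

def stabilizing_vi(pi, f, U, states, u_grid, iters=50):
    V = evaluate_policy(pi, f, U, states)            # V^0 = V_pi
    for _ in range(iters):                           # Eq. (9)
        V = {quantize(x): min(U(x, u) + lookup(V, f(x, u)) for u in u_grid)
             for x in states}
    return V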

3 Lemma. Let π( ) be an admissible control policy in Ω and Assumption 2 hold. Then, V( ) C(Ω). Proof : The proof is by contradiction. Assume that V π ( ) is discontinuous at some y Ω. Then ɛ >, δ >, Ω : () V π ( ) V π (y ) > ɛ while y < δ, where. denotes a vector norm,. represents absolute value, and : denotes such that. The idea is showing that () is not possible. To this end, initially it may be noted that at jth iteration of (), one has j Vπ(x j ) = Vπ( π j ) + U ( x π k, π(x π k) ). (2) Selecting V π( ) =, from (3) and (2) one has Therefore, V π ( ) = V j π( ) + V π (x π j ). (3) V π ( ) V π (y ) = V j π( )+V π (x π j ) V j π(y ) V π (y π j ), (4) which leads to V π ( ) V π (y ) V j π( ) V j π(y ) + V π (x π j ) + V π (y π j ), (5) by triangle inequality of absolute values. Inequality (5) is the key to the solution, as it will be shown that the right hand side of the inequality can be made arbitrarily small if is close enough to y and j is large enough. By the admissibility of π( ), one has V π (x) W (x), x, therefore, V π ( ) V π (y ) Vπ(x j ) Vπ(y j ) + W (x π j ) + W (yj π ). (6) Also, by admissibility of π( ), the sequence of partial sums in the right hand side of (3), i.e., { i U( x π k, π(xπ k )) } i= is upper bounded by a continuous function. Therefore, the partial sums are finite in a compact set, given finiteness of continuous functions in such sets, [3, Theorem 4.5]. Moreover, the sequence is non-decreasing, given the nonnegative summands. Hence, it converges, [3, Theorem 3.4]. Therefore, U ( x π k, π(xπ k )) as k, [3, Theorem 3.23]. This leads to x π j as j, by Assumption 2. By W ( ) C(Ω) and W () =, which follows from positivedefiniteness of W ( ), one has y Ω, ɛ >, j = j (y, ɛ) : Moreover, by W ( ) C(Ω) y π j Ω, ɛ >, δ = δ (y π j, ɛ) : Hence, j j W (y π j ) < ɛ/4. x π j y π j < δ W (x π j ) W (y π j ) < ɛ/4. (7) (8) x π j y π j < δ W (x π j ) < ɛ/4 + W (y π j ). (9) On the other hand, due to the continuity of the closed loop system f (, π( ) ) in R n, the state trajectory at each finite time, for example j, continuously depends on the initial conditions. This may be seen through noting that the state at any finite time j is the result of composition of continuous function f (, π( ) ) for j times. Composition of a finite number of continuous functions is a continuous function, [3, Theorem 4.7]. Therefore, the trajectory yk π, k =,,..., j changes continuously as y changes. Hence, y Ω, δ >, δ 2 = δ 2 (y, δ, j) : y < δ 2 x π j y π j < δ. (2) Moreover, Vπ( ) j C(Ω), for j <, as it is a finite sum of continuous functions, per (2), evaluated along a trajectory which is a continuous function of the argument of Vπ( ). j Therefore, y Ω, ɛ >, δ 3 = δ 3 (y, ɛ, j) : y < δ 3 Vπ(x j ) Vπ(y j (2) ) < ɛ/4. Enough inequalities are now found for contradicting (). For any point of discontinuity y and ɛ whose existence is guaranteed by (), let us find j = j (y, ɛ) which leads to W (y π j ) < ɛ/4, (22) per (7). Then, let δ = δ (y π j, ɛ). By (9) and (22), x π j y π j < δ W (x π j ) < ɛ/4 + W (y π j ) < ɛ/2. (23) Let us select δ 2 = δ 2 (y, δ, j ) to have y < δ 2 x π j y π j < δ, (24) per (2). Finally, set δ 3 = δ 3 (y, ɛ, j ) to have y < δ 3 Vπ j ( ) Vπ j (y ) < ɛ/4, (25) per (2). Let δ = min(δ 2, δ 3 ). Using (22), (23), (24), and (25) in (6) one has y < δ V π ( ) V π (y ) Vπ j ( ) Vπ j (y ) + W (x π j ) + W (yj π ) < ɛ, which contradicts (). Therefore, V π ( ) C(Ω). (26) Lemma 2. Let Assumption hold. 
The sequence of functions $\{V^j(x)\}_{j=0}^{\infty} := \{V^0(x), V^1(x), \ldots\}$ generated through stabilizing value iteration is pointwise non-increasing in $\Omega$.

Proof: This monotonicity is a well known feature of VI, [6], [10]. For the specific case here, in which the initial guess is a value function, it is established as follows. Considering (4), which provides $V^0(\cdot) = V_\pi(\cdot)$, and (9), which for $i = 0$ provides $V^1(\cdot)$, one has

$V^1(x) \le V^0(x), \quad \forall x \in \Omega, \qquad (27)$

because $V^1(\cdot)$ is the minimum of the right hand side of (9) for $i = 0$, while $V^0(\cdot)$ is based on using the selected $\pi(\cdot)$. Now, assume that

$V^i(x) \le V^{i-1}(x), \quad \forall x \in \Omega, \qquad (28)$

for some $i$. By (9) one has

$V^i(x) = \min_{u \in \mathbb{R}^m} \Big( U(x, u) + V^{i-1}\big(f(x, u)\big) \Big), \quad \forall x \in \Omega. \qquad (29)$

Comparing (29) with (9) and considering (28), one has

$V^{i+1}(x) \le V^i(x), \quad \forall x \in \Omega. \qquad (30)$

This completes the induction and proves the lemma.

The next step is showing that each $V^i(\cdot)$ is continuous in $\Omega$. While functions $f(\cdot, \cdot)$ and $U(\cdot, \cdot)$ are continuous with respect to their inputs, the presence of the $\arg\min$ operator in Eq. (7) may result in a discontinuous $\pi^i(\cdot)$, which may then lead to a discontinuous $V^{i+1}(\cdot)$ in Eq. (8). Therefore, this continuity is not obvious.
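As a quick numerical illustration of the pointwise non-increase claimed above, the toy scalar setup from the earlier sketches can be reused to verify that VI iterates initialized at a policy's value function never increase at sampled states. The policy, grids, and costs below are assumptions of this sketch.

import numpy as np

# Illustrative check of Lemma 2 on a toy scalar system; np.interp clamps
# outside the state grid (a crude but monotone treatment of escapes).
f = lambda x, u: 0.5 * x + u
U = lambda x, u: 0.25 * x**2 + 0.5 * u**2
pi = lambda x: -0.5 * x                        # an admissible-like policy
xs = np.linspace(-1.0, 1.0, 21)
us = np.linspace(-1.0, 1.0, 41)

V = np.zeros_like(xs)
for _ in range(200):                           # policy evaluation, Eq. (10)
    V = np.array([U(x, pi(x)) + np.interp(f(x, pi(x)), xs, V) for x in xs])

for i in range(10):                            # VI sweeps, Eq. (9)
    V_next = np.array([min(U(x, u) + np.interp(f(x, u), xs, V) for u in us)
                       for x in xs])
    assert np.all(V_next <= V + 1e-9), "monotonicity violated"
    V = V_next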

Lemma 3. Let $\pi(\cdot)$ be an initial admissible policy in $\Omega$ used for stabilizing value iteration. Then, $V^i(\cdot) \in C(\Omega), \forall i \in \mathbb{N}$.

Proof: It was shown in [7] that if $V^i(\cdot) \in C(\Omega)$, then $V^{i+1}(\cdot) \in C(\Omega)$. Given this result and the point that $V^0(\cdot) \in C(\Omega)$ (by Lemma 1), it follows by induction that $V^i(\cdot) \in C(\Omega)$ for any given (finite) $i$.

Remark 1. Unlike Lemma 1, where the continuity of $V^0(\cdot)$ was established using the continuity of the initial policy, the continuity of the rest of the value functions is established in Lemma 3 without assuming continuity of the $\pi^i(\cdot)$s.

Continuity of the value functions, established in Lemmas 1 and 3, is desired because it provides uniform approximation capability, [12], which is particularly suitable in generalization/interpolation, i.e., approximating the function at states not visited in the training stage. Moreover, this continuity provides the possibility of utilizing the value functions as candidate Lyapunov functions in establishing stability, as done in Theorem 1. Before that, the term SROA needs to be formally defined, motivated by a similar concept in continuous-time systems, [8, Section 8.2].

Definition 3. A subset of the region of attraction (SROA) for the closed loop system is a region in the state space such that any trajectory initiated inside this region is defined and converges to the origin as time goes to infinity.

Theorem 1. Let Assumptions 1 and 2 hold. For every fixed $i \in \mathbb{N}$, the control policy $\pi^i(\cdot)$ generated using stabilizing value iteration renders the origin an asymptotically stable point. Moreover, the compact set $\beta_r^i := \{x \in \mathbb{R}^n : V^i(x) \le r\}$, for any $r > 0$ for which $\beta_r^i \subset \Omega$, will be a subset of the region of attraction for the closed loop system.

Proof: The claim is proved by using $V^i(\cdot)$ as a candidate Lyapunov function for policy $\pi^i(\cdot)$. Function $V^0(\cdot)$ is continuous (by Lemma 1). It is also positive definite, by the positive semi-definiteness of $U(\cdot, \cdot)$ and Assumption 2, as there is no non-zero $x$ whose value function is zero. For any positive definite $V^i(\cdot)$, it follows from (8) that $V^{i+1}(\cdot)$ also is positive definite. The reason is that if $U(x, \pi^i(x)) = 0$ for some non-zero $x$, then $f(x, \pi^i(x)) \neq 0$ by Assumption 2, hence $V^i(f(x, \pi^i(x))) > 0$. Therefore, $V^{i+1}(\cdot)$ is positive definite, $\forall i \in \mathbb{N}$, by induction. Also, $V^i(\cdot) \in C(\Omega)$ by Lemma 3. Given Eq. (8), one has

$V^i\big(f(x, \pi^i(x))\big) - V^{i+1}(x) = -U\big(x, \pi^i(x)\big), \quad \forall x \in \Omega. \qquad (31)$

On the other hand, by Lemma 2, $V^{i+1}(x) \le V^i(x), \forall i, \forall x \in \Omega$. Therefore, replacing $V^{i+1}(x)$ in (31) with $V^i(x)$ leads to

$V^i\big(f(x, \pi^i(x))\big) - V^i(x) \le -U\big(x, \pi^i(x)\big), \quad \forall x \in \Omega. \qquad (32)$

Let $S := \{x \in \mathbb{R}^n : U(x, 0) = 0\}$. The right hand side of (32) vanishes only if $x \in S$. Since no non-zero state history can stay in $S$, by Assumption 2, the asymptotic stability of the origin under $\pi^i(\cdot)$ follows from (32), [18, Corollary 1.3].

Set $\beta_r^i$ is an SROA for the closed loop system because $V^i(x_{k+1}) \le V^i(x_k)$ by (32); hence, $x_k \in \beta_r^i$ leads to $x_{k+1} \in \beta_r^i, \forall k \in \mathbb{N}$. In other words, a state trajectory initiated within $\beta_r^i$ will stay inside the region and hence inside $\Omega$. Given this feature, along with the asymptotic stability result established in the previous paragraph, the state trajectory converges to the origin as $k \to \infty$. Therefore, $\beta_r^i$ will be an SROA. Finally, since $\beta_r^i \subset \Omega$, it is bounded. Also, $\beta_r^i$ is closed, as it is the inverse image of the closed set $[0, r]$ under a continuous mapping (Lemma 3), [13, p. 87]. Therefore, it is compact. The origin is an interior point of the set because $V^i(0) = 0$, $r > 0$, and $V^i(\cdot)$ is continuous in $\Omega$. This completes the proof.
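Numerically, a level $r$ as in Theorem 1 can be estimated along the following lines: for a box-shaped $\Omega$, any $r$ below the minimum of $V^i$ over the boundary of $\Omega$ yields a sublevel set that cannot touch that boundary, so the connected component containing the origin stays inside $\Omega$. The sketch below implements this heuristic by sampling; V_i, the box bounds, and the sample count are assumptions of the sketch rather than part of the paper's method.

import numpy as np

# Heuristic sketch for estimating the SROA level r of Theorem 1 on a box
# Omega = [lo, hi]: sample points, push each onto a random face of the box,
# and return the smallest V_i value found on the boundary.
def estimate_sroa_level(V_i, lo, hi, n_samples=4000, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    pts = rng.uniform(lo, hi, size=(n_samples, lo.size))
    for p in pts:                       # project onto a random face
        k = rng.integers(lo.size)
        p[k] = lo[k] if rng.random() < 0.5 else hi[k]
    return min(V_i(p) for p in pts)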
Comparing the result given by Theorem 1 with the existing literature, the closest one is [9], in which a novel VI algorithm, called θ-ADP, was introduced. The point that θ-ADP needs to be initiated from a function which acts similarly to a control Lyapunov function (CLF) of the respective system (in order for the control under iterations to remain stabilizing) corresponds to the required initial admissible guess for VI in this study. However, admitting a positive semi-definite utility function, as opposed to the positive definite one in that work, and, more importantly, establishing an SROA are the main differences of Theorem 1 compared with [9]. Finally, it is worth noting that, considering the analogy between value iteration and finite-horizon optimal control, [19], the stability result given by Theorem 1 resembles a method of stability proof in the receding horizon control (RHC) literature, [16]. In that study, a CLF is utilized as the terminal cost in the respective finite-horizon problems to maintain stability.

It was shown in Theorem 1 that each selected/fixed $\pi^i(\cdot)$ will steer the states toward the origin. But, in online learning, the policy will be subject to change. More specifically, if $\pi^i(\cdot)$ is applied at the current time, policy $\pi^{i+1}(\cdot)$ may be applied next. Even though Theorem 1 established asymptotic stability of the origin for the autonomous system $x_{k+1} = F(x_k) := f(x_k, \pi^i(x_k))$ for any selected $i$, it does not cover the non-autonomous system $x_{k+1} = F(x_k, k) := f(x_k, \pi^k(x_k))$. Hence, another stability analysis is required to show that the states under the evolving policies also converge to the origin. This is done next, for the general case of applying each policy $\pi^i(\cdot)$ for $M_i \in \mathbb{N}$ steps before switching to the next policy, i.e., $\pi^{i+1}(\cdot)$ (and applying it for $M_{i+1} \in \mathbb{N}$ steps).

Theorem 2. Let Assumptions 1 and 2 hold, and also let the sequence of control policies $\{\pi^i(\cdot)\}_{i=0}^{\infty}$ resulting from stabilizing value iteration be used for operating the system, such that each $\pi^i(\cdot)$ is applied for $M_i \in \mathbb{N}$ time steps. Then, every trajectory which stays in $\Omega$ will converge to the origin.

Proof: Let the state vector at time $k$, generated through the scenario of applying each $\pi^i(\cdot)$ for $M_i$ steps, be denoted with $x_k^+$, and let $x_0^+ = x_0$. Eq. (8) and the monotonicity of the value functions (Lemma 2) lead to

$V^1(x_0^+) = U\big(x_0^+, \pi^0(x_0^+)\big) + V^0\big(f(x_0^+, \pi^0(x_0^+))\big) \le V^0(x_0^+), \quad \forall x_0^+ \in \Omega. \qquad (33)$

Therefore,

$U\big(x_0^+, \pi^0(x_0^+)\big) + V^0(x_1^+) \le V^0(x_0^+), \quad \forall x_0^+ \in \Omega. \qquad (34)$

The idea is using (34) in itself $M_0$ times to get

$\sum_{k=0}^{M_0-1} U\big(x_k^+, \pi^0(x_k^+)\big) + V^0(x_{M_0}^+) \le V^0(x_0^+), \quad \forall x_0^+ \in \Omega. \qquad (35)$

This may be done by evaluating (34) at $x_1^+$ to get

$U\big(x_1^+, \pi^0(x_1^+)\big) + V^0(x_2^+) \le V^0(x_1^+), \quad \forall x_1^+ \in \Omega, \qquad (36)$

and replacing the $V^0(x_1^+)$ in (34) with the left hand side of (36), which is not greater than $V^0(x_1^+)$ per (36), to get

$U\big(x_0^+, \pi^0(x_0^+)\big) + U\big(x_1^+, \pi^0(x_1^+)\big) + V^0(x_2^+) \le V^0(x_0^+), \quad \forall x_0^+ \in \Omega. \qquad (37)$

Repeating this process $M_0 - 2$ more times gives (35). Similarly, using Eq. (8) and the monotonicity, one has

$V^2(x) = U\big(x, \pi^1(x)\big) + V^1\big(f(x, \pi^1(x))\big) \le V^1(x) \le V^0(x), \quad \forall x \in \Omega; \qquad (38)$

hence,

$U\big(x_{M_0}^+, \pi^1(x_{M_0}^+)\big) + V^1(x_{M_0+1}^+) \le V^1(x_{M_0}^+), \quad \forall x_{M_0}^+ \in \Omega, \qquad (39)$

which, once similarly repeated in itself $M_1$ times, leads to

$\sum_{k=0}^{M_1-1} U\big(x_{M_0+k}^+, \pi^1(x_{M_0+k}^+)\big) + V^1(x_{M_0+M_1}^+) \le V^1(x_{M_0}^+), \quad \forall x_{M_0}^+ \in \Omega. \qquad (40)$

The right hand side of (40) is not greater than $V^0(x_{M_0}^+)$, per the monotonicity of the value functions; hence, one can replace $V^0(x_{M_0}^+)$ in (35) with the left hand side of (40), leading to

$\sum_{k=0}^{M_0-1} U\big(x_k^+, \pi^0(x_k^+)\big) + \sum_{k=0}^{M_1-1} U\big(x_{M_0+k}^+, \pi^1(x_{M_0+k}^+)\big) + V^1(x_{M_0+M_1}^+) \le V^0(x_0^+), \quad \forall x_0^+ \in \Omega. \qquad (41)$

So far, two generations of control policies, namely $\pi^0(\cdot)$ and $\pi^1(\cdot)$, were applied and the foregoing inequality was obtained. Repeating this process $N - 2$ more times, the following inequality can be obtained, which handles applying $N$ generations of control policies, with not necessarily identical utilization periods $M_i$:

$\sum_{i=0}^{N-1} \sum_{k=0}^{M_i-1} U\Big(x^+_{\sum_{j=0}^{i-1} M_j + k}, \pi^i\big(x^+_{\sum_{j=0}^{i-1} M_j + k}\big)\Big) + V^N\big(x^+_{\sum_{j=0}^{N-1} M_j}\big) \le V^0(x_0^+), \quad \forall x_0^+ \in \Omega. \qquad (42)$

Since $V^N(x) \ge 0, \forall x$, the foregoing equation leads to

$\sum_{i=0}^{\infty} \sum_{k=0}^{M_i-1} U\Big(x^+_{\sum_{j=0}^{i-1} M_j + k}, \pi^i\big(x^+_{\sum_{j=0}^{i-1} M_j + k}\big)\Big) \le V^0(x_0^+), \quad \forall x_0^+ \in \Omega. \qquad (43)$

Therefore, the sequence of partial sums in the left hand side is upper bounded by the constant term $V^0(x_0^+)$ and, because of being non-decreasing, it converges as $N \to \infty$, [13, Theorem 3.14]. Therefore, $U(x_k^+, \pi^i(x_k^+)) \to 0$ as $k \to \infty$, [13, Theorem 3.23]. This leads to $x_k^+ \to 0$, by Assumption 2, if the state trajectory remains in $\Omega$. Finally, it may be noted that the summation in the left hand side of (43) is evaluated along the trajectory of interest, i.e., $x_k^+, k = 0, 1, \ldots$. This concludes the proof.

The left hand side of (43), as $N \to \infty$, is actually the cost-to-go or value function of applying the evolving policy. Therefore, it can be used as a candidate Lyapunov function (which is time-dependent, as the dynamics of the system under evolving policies are time-dependent). Denoting the cost-to-go at time $j$ with $V(x_j, j)$, where $j$ corresponds to an instant during the period of applying $\pi^N(\cdot)$ for any given $N \in \mathbb{N}$, one has

$V(x_j^+, j) = \sum_{k=0}^{l-1} U\big(x_{j+k}^+, \pi^N(x_{j+k}^+)\big) + \sum_{i=N+1}^{\infty} \sum_{k=0}^{M_i-1} U\Big(x^+_{\sum_{j'=0}^{i-1} M_{j'} + k}, \pi^i\big(x^+_{\sum_{j'=0}^{i-1} M_{j'} + k}\big)\Big), \quad \forall x_0^+ \in \Omega, \qquad (44)$

where $l$ is the remaining number of time steps for applying policy $\pi^N(\cdot)$, i.e., $l = \sum_{k=0}^{N} M_k - j$. Function $V(x_j, j)$ satisfies

$V(x_{j+1}^+, j+1) - V(x_j^+, j) = -U\big(x_j^+, \pi^N(x_j^+)\big). \qquad (45)$

Equality (45), along with the fact that its right hand side does not vanish along any non-zero trajectory (per Assumption 2), leads to the desired stability, [18, Theorem 1]. However, before making this conclusion, given the time-dependency of the Lyapunov function, one needs to show that it is lower and upper bounded by some time-independent positive-definite functions, [18, Theorem 1]. A lower bound is given by $U(x_j^+, 0) + U(x_{j+1}^+, 0)$, which is positive definite, i.e., does not vanish for any non-zero $x_j^+$, per Assumption 2. The upper boundedness is given by $V^0(x_j^+)$, per (43).
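The operating scenario of Theorem 2, in which each $\pi^i(\cdot)$ is applied for $M_i$ steps before switching to $\pi^{i+1}(\cdot)$, can be simulated with a short rollout routine such as the following sketch; the list of policies, the dwell times, and the dynamics are assumed to be available from the VI iterations.

# Sketch of the evolving-policy rollout of Theorem 2. `policies` is a list
# of callables (the VI-generated policies), M a list of dwell times, and f
# the dynamics; all are assumptions of this sketch.
def rollout_evolving(x0, f, policies, M, n_steps):
    x, traj = x0, [x0]
    i, dwell = 0, 0
    for _ in range(n_steps):
        idx = min(i, len(policies) - 1)      # hold the last policy at the end
        x = f(x, policies[idx](x))
        traj.append(x)
        dwell += 1
        if dwell >= M[min(i, len(M) - 1)]:   # switch after M_i steps
            i, dwell = i + 1, 0
    return traj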
Note that the Lyapunov function for time-varying systems is not required to be continuous; the continuity of 1) the difference given by (45), 2) the lower bound of the candidate Lyapunov function, and 3) the upper bound of the function suffices, [18, Theorem 1]. Continuity of these three functions follows from the continuity of the utility function $U(\cdot, \cdot)$ and that of $V^0(\cdot)$, given by Lemma 1.

As seen in the statement of Theorem 2, the stability is conditional on the state trajectory staying inside $\Omega$, but the theorem does not provide an SROA to guarantee this for the case of applying an evolving policy. An idea for establishing an SROA is given next. As discussed above, function (44) is a candidate Lyapunov function for the proof of stability of the evolving policy. Therefore, for establishing the SROA for the evolving policy, this candidate Lyapunov function can be used. Defining $\hat{\Omega}_j := \{x \in \mathbb{R}^n : V(x, j) \le r\}$, from $x_k \in \hat{\Omega}_k$ one has $x_{k+1} \in \hat{\Omega}_{k+1}$, $k = j, j+1, \ldots$, because of (45). The rest is similar to the last paragraph of the proof of Theorem 1.

Finally, the convergence of the stabilizing value iteration is established. While the convergence is not used for the stability results in this study, it is of interest for the implementation of stabilizing VI.

Lemma 4. Let Assumption 1 hold. The stabilizing value iteration, given by Eq. (9), converges to the optimal solution

in the selected compact and connected region $\Omega$ containing the origin as an interior point.

Proof: The sequence of value functions under the stabilizing VI is non-increasing (Lemma 2) and lower bounded (more specifically, non-negative, per the proof of Theorem 1). Therefore, it converges, [13, Theorem 3.14]. The limit function, i.e., the function to which the sequence of value functions converges, denoted with $V^\infty(\cdot)$, can be shown to be the same as $V^*(\cdot)$, either by resorting to the uniqueness of the solution to the Bellman equation, [20] (as both $V^\infty(\cdot)$ and $V^*(\cdot)$ satisfy it), or through the analogy between value iteration and finite-horizon optimal control problems detailed in [19].

V. NUMERICAL EXAMPLE

Some of the results presented in this study are numerically illustrated through an example. The Van der Pol oscillator, with continuous-time dynamics $\ddot{z} = (1 - z^2)\dot{z} - z + u$, is selected. The problem was taken into state space by defining $x = [X, Y]^T := [z, \dot{z}]^T$ and discretized with sampling time $\Delta t = 0.05$ s using Euler forward integration. Moreover, the cost function terms $Q(x) = 0.25\, x^T x$, $R(u) = 0.5\, u^2$, and $U(x_k, u_k) := Q(x_k) + R(u_k)$ were selected in (2). For implementation of the stabilizing VI, the initial admissible policy was selected as the (feedback linearization based) policy $\pi(x) = -(1 - X^2)Y - X - 5Y$. The function approximator was selected in a polynomial form made of elements of $x$ up to the fourth order. The region of interest was selected as $\Omega := [-1.5, 1.5] \times [-1.5, 1.5] \subset \mathbb{R}^2$. Two hundred random $x$s were selected from $\Omega$ in each evaluation of Eq. (9), and the least squares method was utilized for finding the parameters (coefficients of the polynomial terms). The minimizer in (7) can be found by setting the gradient of the term subject to minimization to zero, leading to

$u = -\tfrac{1}{2} R^{-1} g^T \nabla V^i\big(f(x, u)\big), \qquad (46)$

where $\nabla V^i(x) := (\partial V^i(x)/\partial x)^T$, $g := \Delta t\, [0, 1]^T$, and $R$ denotes the weight of the quadratic control penalty, i.e., $R(u) = u^T R u$. Given the point that the unknown $u$ exists on both sides of Eq. (46), the following successive approximation may be used for finding the unknown, [19]:

$u^{j+1} = -\tfrac{1}{2} R^{-1} g^T \nabla V^i\big(f(x, u^j)\big). \qquad (47)$

The learning iterations were observed to converge in 48 iterations, as shown in Fig. 1, where the histories of the parameters of the value function approximator are plotted. To evaluate the optimality of the converged parameters, given by Lemma 4, the optimal trajectory was numerically found for the selected initial condition of $x_0 = [0.5, 0]^T$ and compared with the VI-based result in Fig. 2. Given the similarity of the resulting trajectories, it is concluded that, at least for the selected initial state, the VI-based result is (near) optimal.

Fig. 1. History of weights/parameters of the value function during learning iterations.

Fig. 2. State trajectories for initial condition $x_0 = [0.5, 0]^T$ for 1) using $h^*(\cdot) = h^{48}(\cdot)$ generated using VI and 2) using the open loop numerical solution.

Selecting the iteration index of $i = 2$, calculation of the SROA, denoted with $\beta_r^i$ in Theorem 1, is the next step. Numerically, it was found that $r = 8.9$ is the greatest $r$ for which $\beta_r^2 \subset \Omega$. Given this value for $r$, region $\beta_r^2$ is plotted in Fig. 3. Also, different initial conditions were selected, and the respective state trajectories under the control policy $h^2(\cdot)$ are plotted in the same figure. It can be observed that the state trajectories did not leave the SROA and hence stayed in $\Omega$ and converged to the origin, as expected. These results confirm the ones given by Theorem 1.
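A sketch of the example's online control computation is given below: the Euler-discretized Van der Pol dynamics together with the fixed-point iteration (47). The gradient gradV of the current value function approximation, the iteration count, and the quadratic stand-in used in the usage line are assumptions of this sketch, not the paper's tuned polynomial.

import numpy as np

# Sketch of the example's control computation via Eq. (47); dt is the
# sampling time assumed here, and gradV is a caller-supplied gradient of
# the current value-function approximation.
dt = 0.05                              # sampling time (assumed)
g = dt * np.array([0.0, 1.0])          # input vector of the discretization
R_weight = 0.5                         # weight in R(u) = 0.5 u^2

def f(x, u):
    X, Y = x
    return np.array([X + dt * Y,
                     Y + dt * ((1.0 - X**2) * Y - X + u)])

def control(x, gradV, iters=20):
    # Successive approximation (47): u <- -(1/2) R^-1 g^T gradV(f(x, u))
    u = 0.0
    for _ in range(iters):
        u = -0.5 / R_weight * float(g @ gradV(f(x, u)))
    return u

# Example usage with a quadratic stand-in V(x) = x^T x, so gradV(x) = 2x:
if __name__ == "__main__":
    u0 = control(np.array([0.5, 0.0]), lambda x: 2.0 * x)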
To emphasize the importance of finding the SROA, the trajectory initiated from $x_0 = [1.45, 1.45]^T$ is also plotted in Fig. 3, where it is shown that, while $x_0 \in \Omega$, the trajectory has exited $\Omega$ at some time steps. This has led to some extrapolations by the function approximator, as the approximator was tuned only for $\Omega$. It may be noted that while the trajectory has returned to $\Omega$, this was not guaranteed. However, if $x_0 \in \beta_r^2$, the trajectory is guaranteed to remain inside $\Omega$.

Finally, the monotonicity of the sequence of value functions resulting from stabilizing VI, given by Lemma 2, is numerically illustrated. To this end, regions $\beta_r^0, \beta_r^1, \ldots, \beta_r^{48}$ for $r = 8.88$ are plotted in Fig. 4. The monotonicity of the value functions leads to $\beta_r^0 \subseteq \beta_r^1 \subseteq \cdots \subseteq \beta_r^{48}$, per the definition of these domains. This feature of the domains is observed to hold in Fig. 4. Also, an initial condition was used for control under the two cases of using the fixed policy $h^2(\cdot)$ and the evolving policy with $M_i = 4, \forall i$, and the results are shown in this figure. The idea is showing that the trajectories can be considerably different, and therefore, the analysis of one may not directly apply to the other. This difference can be seen through the trajectories in Fig. 4.

VI. CONCLUSIONS

Stability of the system under value iteration initiated using an admissible guess was established. Afterwards, the results were extended to the case of applying an evolving control policy. Finally, subsets of the region of attraction were established, such that if the initial condition is within such a subset, the entire trajectory stays inside the region over which the controller is tuned. This study, however, is mainly a theoretical result, as it does not include the effects of the approximation errors prevalent in practice. Future work is on the incorporation of these errors.

VII. ACKNOWLEDGMENT

The author is thankful for the constructive comments of the anonymous reviewers and the associate editor.

Fig. 3. State trajectories for different initial conditions generated using the fixed policy $h^2(\cdot)$ and the subset of the region of attraction with $r = 8.9$.

Fig. 4. State trajectories generated using the fixed policy $h^2(\cdot)$ and the evolving policy with $M_i = 4, \forall i$, with $x_0 = [.54, .24]^T$, and the regions $\beta_r^i$ with $r = 8.88$.

REFERENCES

[1] A. Heydari, Analysis of stabilizing value iteration for adaptive optimal control, in Proceedings of the American Control Conference, 2016.
[2] P. J. Werbos, Approximate dynamic programming for real-time control and neural modeling, in Handbook of Intelligent Control (D. A. White and D. A. Sofge, eds.), Multiscience Press, 1992.
[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[4] F. Lewis, D. Vrabie, and K. Vamvoudakis, Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers, IEEE Control Systems, vol. 32, pp. 76–105, 2012.
[5] D. Liu and Q. Wei, Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems, IEEE Transactions on Neural Networks and Learning Systems, vol. 25, 2014.
[6] Q. Wei, D. Liu, and H. Lin, Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems, IEEE Transactions on Cybernetics, vol. 46, no. 3, 2016.
[7] A. Heydari, Theoretical and numerical analysis of approximate dynamic programming with approximation errors, Journal of Guidance, Control, and Dynamics, vol. 39, pp. 301–311, 2016.
[8] H. Khalil, Nonlinear Systems. Prentice-Hall, 2002.
[9] Q. Wei and D. Liu, A novel iterative θ-adaptive dynamic programming for discrete-time nonlinear systems, IEEE Transactions on Automation Science and Engineering, vol. 11, no. 4, pp. 1176–1190, 2014.
[10] D. P. Bertsekas, Value and policy iterations in optimal control and adaptive dynamic programming, IEEE Transactions on Neural Networks and Learning Systems, vol. 28, pp. 500–509, 2017.
[11] A. Al-Tamimi, F. Lewis, and M. Abu-Khalaf, Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, 2008.
[12] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[13] W. Rudin, Principles of Mathematical Analysis. McGraw-Hill, 3rd ed., 1976.
[14] B. Jakubczyk, Feedback linearization of discrete-time systems, Systems & Control Letters, vol. 9, no. 5, pp. 411–416, 1987.
[15] E. Aranda-Bricaire, Ü. Kotta, and C. Moog, Linearization of discrete-time systems, SIAM Journal on Control and Optimization, vol. 34, no. 6, 1996.
[16] A. Jadbabaie, J. Yu, and J. Hauser, Unconstrained receding-horizon control of nonlinear systems, IEEE Transactions on Automatic Control, vol. 46, no. 5, 2001.
[17] D. E. Kirk, Optimal Control Theory: An Introduction. Prentice-Hall, 1970.
[18] R. Kalman and J. Bertram, Control system analysis and design via the second method of Lyapunov, Trans. ASME, 1960.
[19] A. Heydari, Revisiting approximate dynamic programming and its convergence, IEEE Transactions on Cybernetics, vol. 44, no. 12, 2014.
[20] A. Heydari, Analyzing policy iteration in optimal control, in Proceedings of the American Control Conference.


More information

Lecture 4. Chapter 4: Lyapunov Stability. Eugenio Schuster. Mechanical Engineering and Mechanics Lehigh University.

Lecture 4. Chapter 4: Lyapunov Stability. Eugenio Schuster. Mechanical Engineering and Mechanics Lehigh University. Lecture 4 Chapter 4: Lyapunov Stability Eugenio Schuster schuster@lehigh.edu Mechanical Engineering and Mechanics Lehigh University Lecture 4 p. 1/86 Autonomous Systems Consider the autonomous system ẋ

More information

Lecture Note 7: Switching Stabilization via Control-Lyapunov Function

Lecture Note 7: Switching Stabilization via Control-Lyapunov Function ECE7850: Hybrid Systems:Theory and Applications Lecture Note 7: Switching Stabilization via Control-Lyapunov Function Wei Zhang Assistant Professor Department of Electrical and Computer Engineering Ohio

More information

Adaptive Dynamic Inversion Control of a Linear Scalar Plant with Constrained Control Inputs

Adaptive Dynamic Inversion Control of a Linear Scalar Plant with Constrained Control Inputs 5 American Control Conference June 8-, 5. Portland, OR, USA ThA. Adaptive Dynamic Inversion Control of a Linear Scalar Plant with Constrained Control Inputs Monish D. Tandale and John Valasek Abstract

More information

Observations on the Stability Properties of Cooperative Systems

Observations on the Stability Properties of Cooperative Systems 1 Observations on the Stability Properties of Cooperative Systems Oliver Mason and Mark Verwoerd Abstract We extend two fundamental properties of positive linear time-invariant (LTI) systems to homogeneous

More information

Pattern generation, topology, and non-holonomic systems

Pattern generation, topology, and non-holonomic systems Systems & Control Letters ( www.elsevier.com/locate/sysconle Pattern generation, topology, and non-holonomic systems Abdol-Reza Mansouri Division of Engineering and Applied Sciences, Harvard University,

More information

Passivity-based Stabilization of Non-Compact Sets

Passivity-based Stabilization of Non-Compact Sets Passivity-based Stabilization of Non-Compact Sets Mohamed I. El-Hawwary and Manfredi Maggiore Abstract We investigate the stabilization of closed sets for passive nonlinear systems which are contained

More information

Global stabilization of feedforward systems with exponentially unstable Jacobian linearization

Global stabilization of feedforward systems with exponentially unstable Jacobian linearization Global stabilization of feedforward systems with exponentially unstable Jacobian linearization F Grognard, R Sepulchre, G Bastin Center for Systems Engineering and Applied Mechanics Université catholique

More information

Adaptive and Robust Controls of Uncertain Systems With Nonlinear Parameterization

Adaptive and Robust Controls of Uncertain Systems With Nonlinear Parameterization IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 48, NO. 0, OCTOBER 003 87 Adaptive and Robust Controls of Uncertain Systems With Nonlinear Parameterization Zhihua Qu Abstract Two classes of partially known

More information

FINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES. Danlei Chu, Tongwen Chen, Horacio J. Marquez

FINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES. Danlei Chu, Tongwen Chen, Horacio J. Marquez FINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES Danlei Chu Tongwen Chen Horacio J Marquez Department of Electrical and Computer Engineering University of Alberta Edmonton

More information

On the Stabilization of Neutrally Stable Linear Discrete Time Systems

On the Stabilization of Neutrally Stable Linear Discrete Time Systems TWCCC Texas Wisconsin California Control Consortium Technical report number 2017 01 On the Stabilization of Neutrally Stable Linear Discrete Time Systems Travis J. Arnold and James B. Rawlings Department

More information

arxiv: v2 [cs.sy] 29 Mar 2016

arxiv: v2 [cs.sy] 29 Mar 2016 Approximate Dynamic Programming: a Q-Function Approach Paul Beuchat, Angelos Georghiou and John Lygeros 1 ariv:1602.07273v2 [cs.sy] 29 Mar 2016 Abstract In this paper we study both the value function and

More information

Robotics. Control Theory. Marc Toussaint U Stuttgart

Robotics. Control Theory. Marc Toussaint U Stuttgart Robotics Control Theory Topics in control theory, optimal control, HJB equation, infinite horizon case, Linear-Quadratic optimal control, Riccati equations (differential, algebraic, discrete-time), controllability,

More information

A Complete Stability Analysis of Planar Discrete-Time Linear Systems Under Saturation

A Complete Stability Analysis of Planar Discrete-Time Linear Systems Under Saturation 710 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: FUNDAMENTAL THEORY AND APPLICATIONS, VOL 48, NO 6, JUNE 2001 A Complete Stability Analysis of Planar Discrete-Time Linear Systems Under Saturation Tingshu

More information

An asymptotic ratio characterization of input-to-state stability

An asymptotic ratio characterization of input-to-state stability 1 An asymptotic ratio characterization of input-to-state stability Daniel Liberzon and Hyungbo Shim Abstract For continuous-time nonlinear systems with inputs, we introduce the notion of an asymptotic

More information

Topic # /31 Feedback Control Systems. Analysis of Nonlinear Systems Lyapunov Stability Analysis

Topic # /31 Feedback Control Systems. Analysis of Nonlinear Systems Lyapunov Stability Analysis Topic # 16.30/31 Feedback Control Systems Analysis of Nonlinear Systems Lyapunov Stability Analysis Fall 010 16.30/31 Lyapunov Stability Analysis Very general method to prove (or disprove) stability of

More information

ESC794: Special Topics: Model Predictive Control

ESC794: Special Topics: Model Predictive Control ESC794: Special Topics: Model Predictive Control Nonlinear MPC Analysis : Part 1 Reference: Nonlinear Model Predictive Control (Ch.3), Grüne and Pannek Hanz Richter, Professor Mechanical Engineering Department

More information

LYAPUNOV theory plays a major role in stability analysis.

LYAPUNOV theory plays a major role in stability analysis. 1090 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 49, NO. 7, JULY 2004 Satisficing: A New Approach to Constructive Nonlinear Control J. Willard Curtis, Member, IEEE, and Randal W. Beard, Senior Member,

More information

1030 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 56, NO. 5, MAY 2011

1030 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 56, NO. 5, MAY 2011 1030 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 56, NO 5, MAY 2011 L L 2 Low-Gain Feedback: Their Properties, Characterizations Applications in Constrained Control Bin Zhou, Member, IEEE, Zongli Lin,

More information

VISCOSITY SOLUTIONS. We follow Han and Lin, Elliptic Partial Differential Equations, 5.

VISCOSITY SOLUTIONS. We follow Han and Lin, Elliptic Partial Differential Equations, 5. VISCOSITY SOLUTIONS PETER HINTZ We follow Han and Lin, Elliptic Partial Differential Equations, 5. 1. Motivation Throughout, we will assume that Ω R n is a bounded and connected domain and that a ij C(Ω)

More information

Speed Profile Optimization for Optimal Path Tracking

Speed Profile Optimization for Optimal Path Tracking Speed Profile Optimization for Optimal Path Tracking Yiming Zhao and Panagiotis Tsiotras Abstract In this paper, we study the problem of minimumtime, and minimum-energy speed profile optimization along

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

EN Nonlinear Control and Planning in Robotics Lecture 3: Stability February 4, 2015

EN Nonlinear Control and Planning in Robotics Lecture 3: Stability February 4, 2015 EN530.678 Nonlinear Control and Planning in Robotics Lecture 3: Stability February 4, 2015 Prof: Marin Kobilarov 0.1 Model prerequisites Consider ẋ = f(t, x). We will make the following basic assumptions

More information

Optimization-based Modeling and Analysis Techniques for Safety-Critical Software Verification

Optimization-based Modeling and Analysis Techniques for Safety-Critical Software Verification Optimization-based Modeling and Analysis Techniques for Safety-Critical Software Verification Mardavij Roozbehani Eric Feron Laboratory for Information and Decision Systems Department of Aeronautics and

More information

Optimal Control. McGill COMP 765 Oct 3 rd, 2017

Optimal Control. McGill COMP 765 Oct 3 rd, 2017 Optimal Control McGill COMP 765 Oct 3 rd, 2017 Classical Control Quiz Question 1: Can a PID controller be used to balance an inverted pendulum: A) That starts upright? B) That must be swung-up (perhaps

More information

Chapter III. Stability of Linear Systems

Chapter III. Stability of Linear Systems 1 Chapter III Stability of Linear Systems 1. Stability and state transition matrix 2. Time-varying (non-autonomous) systems 3. Time-invariant systems 1 STABILITY AND STATE TRANSITION MATRIX 2 In this chapter,

More information

Economic MPC using a Cyclic Horizon with Application to Networked Control Systems

Economic MPC using a Cyclic Horizon with Application to Networked Control Systems Economic MPC using a Cyclic Horizon with Application to Networked Control Systems Stefan Wildhagen 1, Matthias A. Müller 1, and Frank Allgöwer 1 arxiv:1902.08132v1 [cs.sy] 21 Feb 2019 1 Institute for Systems

More information

Feedback stabilisation with positive control of dissipative compartmental systems

Feedback stabilisation with positive control of dissipative compartmental systems Feedback stabilisation with positive control of dissipative compartmental systems G. Bastin and A. Provost Centre for Systems Engineering and Applied Mechanics (CESAME Université Catholique de Louvain

More information

Stabilization of Discrete-Time Switched Linear Systems: A Control-Lyapunov Function Approach

Stabilization of Discrete-Time Switched Linear Systems: A Control-Lyapunov Function Approach Stabilization of Discrete-Time Switched Linear Systems: A Control-Lyapunov Function Approach Wei Zhang 1, Alessandro Abate 2 and Jianghai Hu 1 1 School of Electrical and Computer Engineering, Purdue University,

More information

Navigation and Obstacle Avoidance via Backstepping for Mechanical Systems with Drift in the Closed Loop

Navigation and Obstacle Avoidance via Backstepping for Mechanical Systems with Drift in the Closed Loop Navigation and Obstacle Avoidance via Backstepping for Mechanical Systems with Drift in the Closed Loop Jan Maximilian Montenbruck, Mathias Bürger, Frank Allgöwer Abstract We study backstepping controllers

More information

STABILITY OF PLANAR NONLINEAR SWITCHED SYSTEMS

STABILITY OF PLANAR NONLINEAR SWITCHED SYSTEMS LABORATOIRE INORMATIQUE, SINAUX ET SYSTÈMES DE SOPHIA ANTIPOLIS UMR 6070 STABILITY O PLANAR NONLINEAR SWITCHED SYSTEMS Ugo Boscain, régoire Charlot Projet TOpModel Rapport de recherche ISRN I3S/RR 2004-07

More information