Online Adaptive Approximate Optimal Tracking Control with Simplified Dual Approximation Structure for Continuous-time Unknown Nonlinear Systems


1 4 IEEE/CAA JOURNAL OF AUTOMATICA SINICA, VOL., NO. 4, OCTOBER 04 Online Adaptive Approximate Optimal Tracking Control with Simplified Dual Approximation Structure for Continuous-time Unknown Nonlinear Systems Jing Na Guido Herrmann Abstract This paper proposes an online adaptive approximate solution for the infinite-horizon optimal tracking control problem of continuous-time nonlinear systems with unknown dynamics. The requirement of the complete knowledge of system dynamics is avoided by employing an adaptive identifier in conjunction with a novel adaptive law, such that the estimated identifier weights converge to a small neighborhood of their ideal values. An adaptive steady-state controller is developed to maintain the desired tracking performance at the steady-state, and an adaptive optimal controller is designed to stabilize the tracking error dynamics in an optimal manner. For this purpose, a critic neural network NN) is utilized to approximate the optimal value function of the Hamilton-Jacobi-Bellman HJB) equation, which is used in the construction of the optimal controller. The learning of two NNs, i.e., the identifier NN and the critic NN, is continuous and simultaneous by means of a novel adaptive law design methodology based on the parameter estimation error. Stability of the whole system consisting of the identifier NN, the critic NN and the optimal tracking control is guaranteed using Lyapunov theory; convergence to a near-optimal control law is proved. Simulation results exemplify the effectiveness of the proposed method. Index Terms Adaptive control, optimal control, approximate dynamic programming, system identification. I. INTRODUCTION AMONG various modern control methodologies, optimal control has been well-recognized and successfully verified in some real-world applications, which is concerned with finding a control policy that drives a dynamical system to a desired reference in an optimal way, i.e., a prescribed cost function is minimized. In general, the optimal control Manuscript received July 7, 03; accepted March 4, 04. This work was supported by National Natural Science Foundation of China ). Recommended by Associate Editor Zhongsheng Hou Citation: Jing Na, Guido Herrmann. Online adaptive approximate optimal tracking control with simplified dual approximation structure for continuoustime unknown nonlinear systems. IEEE/CAA Journal of Automatica Sinica, 04, 4): 4 4 Jing Na is with the Faculty of Mechanical and Electrical Engineering, Kunming University of Science and Technology, , China najing5@63.com). Guido Herrmann is with the Department of Mechanical Engineering, University of Bristol, BS8 TR, UK g.herrmann@bristol.ac.uk). can be derived by using Pontryagin s minimum principle, or by solving the Hamilton-Jacobi-Bellman HJB) equation. Although mathematically elegant, traditional optimal control designs are obtained offline and impose the assumption on the complete knowledge of system dynamics. To allow for uncertainties in system dynamics, adaptive control 3 4 has been developed, where the unknown system parameters are online updated/estimated by using the tracking error, such that the tracking error convergence and the boundedness of the parameter estimates can be guaranteed. However, classical adaptive control methods are generally far from optimal. With the wish to achieve adaptive optimal control, one may add optimality features to an adaptive controller, i.e., to drive the adaptation by an optimality criterion. 
An alternative solution is to incorporate adaptive features into an optimal control design, e.g., improve the optimal control policy by means of the updated system parameters. Recently, a bioinspired method, reinforcement learning RL) 5 7, that was developed in the computational intelligence and machine learning societies, has provided a means to design adaptive controllers in an optimal manner. Considering the similarities between optimal control and RL, Werbos 8 introduced an RL-based actor-critic framework, called approximate dynamic programming ADP), where neural networks NNs) are trained to approximately solve the optimal control problem based on the named value iteration VI) method. A survey of ADPbased feedback control designs can be found in 9. The discrete/iterative nature of the ADP formulation lends itself naturally to the design of discrete-time DT) optimal control 3 5. However, the extension of the RL-based controllers to continuous-time CT) systems entails challenges in proving stability and convergence for a model-free algorithm that can be solved online. Some of the existing ADP algorithms for CT nonlinear systems lacked a rigorous stability analysis 6, 6. By incorporating NNs into the actorcritic structure, an offline method was proposed in 7 to find approximate solutions of optimal control for CT nonlinear systems. In, 8, an online integral RL technique was developed to find the optimal control for CT systems

2 NA AND HERRMANN: ONLINE ADAPTIVE APPROXIMATE OPTIMAL TRACKING CONTROL WITH 43 without using the system drift dynamics, which led to a hybrid continuous-time/discrete-time sampled data controller based on policy iteration PI) with a two time-scale actor critic learning process. This learning procedure was based on sequential updates of the critic policy evaluation) NN and actor policy improvement) NN. Thus, while one NN was tuned, the other one remained constant. Vamvoudakis and Lewis 9 further extended this idea by designing an improved online ADP algorithm called synchronous PI, which involved simultaneous tuning of both actor and critic NNs by minimizing the Bellman error, i.e., both NNs were tuned at the same time by using the proposed adaptive laws to approximately solve the CT infinite horizon optimal control problem. To avoid the need for the complete knowledge of system dynamics in 9, a novel actor-critic-identifier architecture was proposed in 0, where an extra NN of the identifier was employed in conjunction with an actor-critic controller to identify the unknown system dynamics. Although the states of the identifier converged to their true values, the identifier NN weight convergence was not guaranteed. Moreover, the knowledge of the input dynamics was still required. On the other hand, most of the ADP based optimal control methods have been developed to address the stabilization or regulation problem, and only a few results have been reported for optimal tracking control 3. For these results, the key idea is to superimpose an optimal control that stabilizes the error dynamics at the transient stage in an optimal way under the assumption of a traditional steady-state tracking controller e.g., feedback linearization control, adaptive control). In 3, an observer was adopted to reconstruct unknown system states, while in an adaptive NN identifier was used to online estimate unknown system dynamics. Although the obtained control input was ensured to be close to the optimal control within a small bound, it was not guaranteed that the NN identifier weights stayed bounded in a compact neighborhood of their ideal values. In this paper, we will provide a solution where the convergence of the identifier weights is guaranteed and the convergence of the critic NN weights to a nearly optimal control solution is shown. To the best of our knowledge, ADP-based optimal tracking control has rarely been designed for CT systems with unknown nonlinear dynamics and guaranteed parameter estimation convergence. In this paper, we propose a new ADP algorithm for solving the optimal tracking control problem of nonlinear systems with unknown dynamics. Inspired by the work of 0, the requirement of the complete or at least partial knowledge of system dynamics in the existing ADP algorithms for CT systems is eliminated. This is achieved by constructing an adaptive NN for the identifier of system dynamics; a novel adaptive law based on the parameter estimation error 4 is utilized such that, even in the presence of an NN approximation error, the identifier NN weights are guaranteed to converge to a small region around their true values under a standard persistent excitation PE) condition or a slightly more relaxed singular value condition for a filtered regressor matrix. To achieve optimal tracking control, an adaptive steady-state control for maintaining desired tracking at the steady-state is augmented with an adaptive optimal control for stabilizing the tracking error dynamics in an optimal manner. 
To design such an optimal control, a critic NN is employed to online approximate the solution to the HJB equation. Thus, the optimal value function is obtained, which is then used to calculate the control action. The identifier parameters and critic NN weights are online updated continuously and simultaneously. In particular, a direct parameter estimation scheme is used to estimate NN weights; this is in contrast to the minimization of the Bellman error or the residual approximation error in the HJB equation by using least-squares 0 or the modified Levenberg- Marquardt algorithms 9. We will also show that the identifier weight estimation error affects the critic NN convergence; the conventional PE condition or again a relaxed condition on a filtered regressor matrix is sufficient to guarantee parameter estimation convergence. To this end, a novel adaptation scheme based on the parameter estimation error that was originally proposed in our previous work 4 is employed for updating both identifier weights and critic NN weights; this may lead to fast convergence and provides an easy online-check of the required convergence condition 4. Finally, the stability of the overall system and the uniform ultimate boundedness UUB) of the identifier and critic weights are proved by using Lyapunov theory, and the obtained control guarantees the tracking of a desired trajectory, while also asymptotically converging to a small bound around the optimal policy. The main contributions can be summarized as follows. ) The optimal tracking control problem of nonlinear CT systems is studied by proposing a new critic-identifier based ADP control configuration. The actor NN is not necessary to prove the overall stability. Thus, instead of the tripleapproximation structure, this introduces a simplified dualapproximation method. To achieve tracking control, a steadystate control is used in conjunction with an adaptive optimal control such that the overall control converges to the optimal solution within a small bound. ) A novel adaptation design methodology based on the parameter estimation error is proposed such that the weights of both the identifier NN and critic NN are online updated simultaneously. With this framework, all these weights are directly estimated with guaranteed convergence rather than updated to minimize the identifier error and Bellman error by using the gradient-based schemes e.g., least-squares in 0). It is shown that the convergence of the identifier weights to their true values in a bounded sense is achieved, which is also important for the convergence of the optimal control. The paper is organized as follows. Section II provides the formulation of the optimal control problem. Section III discusses the design of the identifier to accommodate unknown system dynamics. Section IV presents the adaptive tracking control design and the closed-loop stability analysis. Section V presents simulation examples that show the effectiveness of the proposed method, and Section VI gives some conclusions.

II. PROBLEM FORMULATION

Consider a continuous-time nonlinear system

ẋ = F(x, u),   (1)

where x ∈ Rⁿ and u ∈ Rᵐ are the state and input of the studied system, and F(x, u): Rⁿ × Rᵐ → Rⁿ is a Lipschitz continuous nonlinear function on a compact set Ω ⊂ Rⁿ × Rᵐ that contains the origin, such that the solution x of system (1) is unique for any finite initial condition x₀ and control u. This paper addresses the optimal tracking control problem for system (1), i.e., finding an adaptive controller u which ensures that the state x tracks a given trajectory x_d and minimizes (in a sub-optimal sense) the infinite-horizon cost function

V(e(t)) = min_{u(τ)∈Ψ(Ω)} ∫_t^∞ r(e(τ), u(τ)) dτ,   (2)

where Ψ(Ω) is the set of admissible control policies [17], e = x − x_d is the tracking error, and r(·,·): Rⁿ × Rᵐ → R with r(e(τ), u(τ)) ≥ 0 is the utility function, to be defined later. Throughout, the command reference x_d and its derivative ẋ_d are continuous and bounded. Note that the tracking error e rather than the system state x is used in the cost function (2), because the tracking control rather than the regulation problem is studied in this paper.

Remark 1. Many industrial processes can be modeled as system (1), such as missile systems [25], robotic manipulators [24] and biochemical processes [26]. Although several recent results (e.g., [19−20]) address the optimal regulation problem of the (partially unknown) system (1) by means of ADP, only a few results concern the tracking control of system (1), and there the plant dynamics are usually assumed to be precisely known.

To facilitate the control design, the following assumption is made about system (1).

Assumption 1 [17]. The function F(x, u) in (1) is continuous and satisfies a local Lipschitz condition such that (1) has a unique solution on the set Ω that contains the origin. The control action u enters in control-affine form, as in [19−20], with constant input gain B.

Since the dynamics of (1) are unknown, the optimal tracking control design presented in this paper is divided into two steps, as in [20]: 1) propose an adaptive identifier that reconstructs the unknown dynamics of (1) from input-output data; 2) design an adaptive optimal tracking controller based on the identified dynamics and ADP methods.

III. ADAPTIVE IDENTIFIER BASED ON PARAMETER ESTIMATION ERROR

In this section, an adaptive identifier is established to reconstruct the unknown system dynamics using available input-output measurements. From Assumption 1, system (1) can be rewritten in the form of a recursive neural network (RNN) [27−28]:

ẋ = Ax + Bu + Cᵀf(x) + ε,   (3)

where A ∈ Rⁿˣⁿ and B ∈ Rⁿˣᵐ are known matrices, C ∈ Rᵖˣⁿ is the unknown weight matrix, ε ∈ Rⁿ is a bounded approximation error of the RNN, and f(x) ∈ Rᵖ is a nonlinear regressor function vector, which is Lipschitz continuous such that ‖f(x) − f(y)‖ ≤ κ‖x − y‖ holds for some positive constant κ > 0.

To determine the unknown parameters C, we define the filtered variables x_f, u_f, f_f of x, u, f as

k ẋ_f + x_f = x,  x_f(0) = 0,
k u̇_f + u_f = u,  u_f(0) = 0,
k ḟ_f + f_f = f,  f_f(0) = 0,   (4)

where k > 0 is a scalar constant filter parameter. Then, for any positive scalar constant l > 0, we define the filtered and integrated regressor matrices P ∈ Rᵖˣᵖ and Q ∈ Rᵖˣⁿ as

Ṗ = −lP + f_f f_fᵀ,  P(0) = 0,
Q̇ = −lQ + f_f [(x − x_f)/k − Ax_f − Bu_f]ᵀ,  Q(0) = 0,   (5)

and another auxiliary matrix M ∈ Rᵖˣⁿ calculated from P and Q as

M = PĈ − Q,   (6)

where Ĉ is the estimate of C.
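To make the construction (4)-(6) concrete, the following sketch propagates the filters and regressor matrices with a simple forward-Euler loop (it also applies the estimation update introduced next in the text). It is only an illustration: the two-state plant, the regressor f(x), and all gains below are hypothetical placeholders rather than the paper's settings, and the signs follow the reconstruction used here.

```python
import numpy as np

# Hypothetical 2-state example used only to exercise the filters (4) and the
# regressor matrices (5)-(6); A, B, C_true, f and all gains are placeholders.
n, m, p = 2, 1, 3
A = np.array([[0.0, 1.0], [-1.0, -0.5]])                    # known matrix A in (3)
B = np.array([[0.0], [1.0]])                                # known input matrix B in (3)
C_true = np.array([[-0.2, 0.1], [0.1, -0.3], [0.05, -0.05]])  # unknown C (data generation only)
f = lambda x: np.array([x[0], np.sin(x[1]), x[0] * x[1]])   # regressor vector f(x)

k, l, Gamma1 = 0.01, 1.0, 50.0      # filter constant k, forgetting rate l, learning gain
dt, T = 1e-3, 10.0
x = np.array([0.1, -0.2])
xf, uf, ff = np.zeros(n), np.zeros(m), np.zeros(p)          # filtered x, u, f from (4)
P, Q = np.zeros((p, p)), np.zeros((p, n))                   # filtered regressors from (5)
C_hat = np.zeros((p, n))                                    # estimate of C

for step in range(int(T / dt)):
    u = np.array([np.sin(0.5 * step * dt)])                 # probing input for excitation
    x = x + dt * (A @ x + B @ u + C_true.T @ f(x))          # plant (3), noise-free here
    # first-order filters (4)
    xf += dt * (x - xf) / k
    uf += dt * (u - uf) / k
    ff += dt * (f(x) - ff) / k
    # filtered/integrated regressor matrices (5)
    P += dt * (-l * P + np.outer(ff, ff))
    Q += dt * (-l * Q + np.outer(ff, (x - xf) / k - A @ xf - B @ uf))
    M = P @ C_hat - Q                                       # auxiliary matrix (6)
    C_hat += dt * (-Gamma1 * M)                             # adaptive law introduced next in the text

# the convergence condition discussed below can be monitored online through lambda_min(P)
print("lambda_min(P) =", np.linalg.eigvalsh(P).min())
print("||C - C_hat|| =", np.linalg.norm(C_true - C_hat))
```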
Then the adaptive law for estimating Ĉ is given by

dĈ/dt = −Γ₁M,   (7)

with Γ₁ > 0 being a constant, positive definite learning gain matrix.

Lemma 1 [24]. Under the assumption that the variables x and u in (3) are bounded, the matrix M in (6) can be reformulated as M = −PC̃ + ψ for a bounded ψ(t) = −∫₀ᵗ e^(−l(t−r)) f_f(r) ε_fᵀ(r) dr, where C̃ = C − Ĉ is the estimation error.

Proof. The ordinary matrix differential equations (5) have the solutions

P(t) = ∫₀ᵗ e^(−l(t−r)) f_f(r) f_fᵀ(r) dr,
Q(t) = ∫₀ᵗ e^(−l(t−r)) f_f(r) [(x(r) − x_f(r))/k − Ax_f(r) − Bu_f(r)]ᵀ dr.   (8)

On the other hand, by applying the linear filter operation (4) to both sides of (3), it can be obtained that

ẋ_f = Ax_f + Bu_f + Cᵀf_f + ε_f,   (9)

where ε_f is the filtered version of the bounded error ε, defined by k ε̇_f + ε_f = ε (the vector ε_f is used only for analysis). Then, from the first equation of (4), it is found that

ẋ_f = (x − x_f)/k.   (10)

Consequently, we can obtain from (9) and (10) that

(x − x_f)/k = Ax_f + Bu_f + Cᵀf_f + ε_f.   (11)

By substituting (11) into (8), we have Q = PC − ψ with ψ(t) = −∫₀ᵗ e^(−l(t−r)) f_f(r) ε_fᵀ(r) dr being a bounded variable, i.e., ‖ψ‖ ≤ ε_ψ for a constant ε_ψ > 0, because the NN regressor function f(x) and the approximation error ε are bounded for bounded x and u. Then (6) can be rewritten as

M = PĈ − Q = −PC̃ + ψ,   (12)

where C̃ = C − Ĉ is the estimation error.

Moreover, to prove the parameter estimation convergence, we need to analyze the positive definiteness of P. Denote by λ_max(·) and λ_min(·) the maximum and minimum eigenvalues of the corresponding matrices; then we have the following lemma.

Lemma 2 [24]. If the regressor function vector f(x) defined in (3) is persistently excited [3], then the matrix P defined in (5) is positive definite, i.e., its minimum eigenvalue satisfies λ_min(P) > σ₁ > 0 with σ₁ being a positive constant.

We refer to [24] for the detailed proof of Lemma 2. Now, we have the following result.

Theorem 1. If x and u in (3) are bounded and the minimum eigenvalue of P satisfies λ_min(P) > σ₁ > 0 for system (3) with the parameter estimation (7), then:
1) For ε = 0 (i.e., no reconstruction error), the estimation error C̃ converges to zero exponentially;
2) For ε ≠ 0 (i.e., with bounded approximation error), the estimation error C̃ converges to a compact set around zero.

Proof. Consider the Lyapunov function candidate V₁ = (1/2)tr(C̃ᵀΓ₁⁻¹C̃); its derivative along (7) and (12) is

V̇₁ = tr(C̃ᵀΓ₁⁻¹ dC̃/dt) = −tr(C̃ᵀPC̃) + tr(C̃ᵀψ).   (13)

1) In case ε = 0, and thus ψ = 0, (13) reduces to

V̇₁ = −tr(C̃ᵀPC̃) ≤ −σ₁‖C̃‖² ≤ −µ₁V₁,   (14)

where µ₁ = 2σ₁/λ_max(Γ₁⁻¹) is a positive constant. Then, according to Lyapunov's theorem (see [4]), the parameter estimation error C̃ converges to zero exponentially, where the convergence rate depends on the excitation level σ₁ and the learning gain Γ₁.

2) In case there is a bounded approximation error ε ≠ 0, (13) can be further bounded as

V̇₁ = −tr(C̃ᵀPC̃) + tr(C̃ᵀψ) ≤ −‖C̃‖(σ̄₁√V₁ − ε_ψ)   (15)

for σ̄₁ = σ₁√(2/λ_max(Γ₁⁻¹)) being a positive constant. Then, according to the extended Lyapunov theorem (see [4]), the parameter estimation error C̃ is uniformly ultimately bounded and converges to the compact set Ω_C := {C̃ | √V₁ ≤ ε_ψ/σ̄₁}, whose size depends on the bound ε_ψ of the approximation error and the excitation level σ₁. This completes the proof.

Remark 2. For adaptive law (7), the variable M of (6), obtained from P and Q in (5), contains the information of the weight estimation error through the term −PC̃, as shown in (12), where the residual error ψ vanishes for vanishing NN approximation error ε → 0. It is well known that ε → 0 holds for a sufficiently large number of hidden-layer nodes in identifier (3), i.e., p → +∞. Thus M can be used to drive the parameter estimation (7). Consequently, the parameter estimate Ĉ is obtained directly, without using an observer/predictor error, in contrast to [20].

Remark 3. Lemma 2 shows that the required condition for parameter estimation convergence in this paper (i.e., λ_min(P) > σ₁ > 0) is fulfilled under a conventional PE condition [3]. In general, the direct online validation of the PE condition is difficult, in particular for a nonlinear system. To this end, Lemma 2 provides a numerically verifiable way to validate online the convergence condition of the novel adaptation law (7), i.e., by calculating the minimum eigenvalue of the matrix P and testing λ_min(P) > σ₁ > 0. This condition does not necessarily imply the PE condition of f(x).
It is also to be noticed that the PE condition on f(x) can be suitably weakened when a well-designed control is imposed [24, 29], e.g., transformed into an a priori verifiable sufficient richness (SR) requirement on the command reference.

IV. ADAPTIVE APPROXIMATE OPTIMAL TRACKING CONTROL

As shown in Section III, the unknown weight matrix C can be estimated online. Without loss of generality, we assume that there is an unavoidable approximation error ε in (3), so that the estimated weight matrix Ĉ converges to a compact set around its true value C. In this case, system (1) can be rewritten as

ẋ = Ax + Bu + Ĉᵀf(x) + ε + ε_N,   (16)

where ε_N = C̃ᵀf(x) can be taken as an adaptation error, whose boundedness will be shown in the later analysis, i.e., ‖ε_N‖ ≤ φ_N for a constant φ_N > 0 on the compact set Ω. Then the optimal control of (1) is transformed into the optimal control of (16). In this section, the optimal controller design for (16) is provided in detail.

To achieve optimal tracking control, the overall control u is composed of two parts as u = u_s + u_e, where u_s is the adaptive steady-state control used to keep the tracking error close to zero at the steady state, and u_e is the adaptive optimal control designed to stabilize the tracking error dynamics during the transient in an optimal manner [21, 25]. Consider the tracking error e = x − x_d, so that

ė = Ax + Bu + Ĉᵀf(x) − ẋ_d + ε + ε_N.   (17)

Since the adaptive steady-state control u_s is used to guarantee a zero steady state for the tracking error, it should be designed to retain the steady-state dynamics ė = ẋ − ẋ_d = 0 in (17), i.e., u = u_s needs to guarantee x = x_d when ε + ε_N = 0.

Thus, the steady-state control signal u_s can be selected as

u_s = B⁺(ẋ_d − Ax_d − Ĉᵀf(x_d)),   (18)

where B⁺ denotes the generalized inverse of B. Note that the input gain B in (3) is assumed to be known but may not be invertible (i.e., its rank may be lower than n). Clearly, u_s depends only on the available variables x_d, ẋ_d, A and Ĉ, and can thus be implemented based on identifier (3) with adaptive law (7).

Substituting (18) into (17), the tracking error dynamics can be rewritten as

ė = Ae + Ĉᵀ[f(x) − f(x_d)] + Bu_e + ε + ε_N + ε_φ,   (19)

where ε_φ = (BB⁺ − I)(ẋ_d − Ax_d − Ĉᵀf(x_d)) denotes the residual error due to the generalized inverse of B, which is bounded because x_d and ẋ_d are bounded and f is Lipschitz continuous, i.e., ‖ε_φ‖ ≤ φ_p for a constant φ_p > 0. Note that ε_φ vanishes under the so-called matching condition, e.g., (BB⁺ − I)A = 0 or (BB⁺ − I)Ĉᵀ = 0, which is a standard condition in nonlinear control for counteracting disturbances.

As shown above, with the adaptive steady-state control u_s of (18), the error dynamics take the form (19), which is not necessarily stable, in particular in the presence of the identifier error ε + ε_N. In this sense, the tracking problem for system (16) is reduced to the regulation problem for (19). Hence, the adaptive optimal control u_e will be designed to stabilize the tracking error dynamics (19) in an approximately optimal manner. In this case, the optimal value function (2) for system (1) can be reformulated using u_e from system (19), giving the value function

V(e(t)) = ∫_t^∞ r(e(τ), u_e(e(τ))) dτ,   (20)

where the utility function is chosen as r(e(τ), u_e(e(τ))) = eᵀQe + u_eᵀRu_e with Q ∈ Rⁿˣⁿ and R ∈ Rᵐˣᵐ being symmetric positive definite matrices. Thus, the tracking problem is optimized by the control u_e, which optimally stabilizes e. It will be shown below that u_e is a function of the tracking error e.

Definition 1 [17]. A control policy µ(e) is said to be admissible with respect to (20) on a compact set Ω, denoted by µ(e) ∈ Ψ(Ω), if µ(e) is continuous on Ω, µ(0) = 0, u_e(e) = µ(e) stabilizes (19) on Ω, and V(e) is finite for every e ∈ Ω.

The remaining problem can be formulated as: given the CT error system (19) with the admissible control set Ψ(Ω) and the infinite-horizon cost function (20), find an admissible control policy u_e(e) ∈ Ψ(Ω) such that the cost (20) associated with system (19) is minimized. For this purpose, we define the Hamiltonian of system (19) as

H(e, u_e, V_e) = V_eᵀ[Ae + Ĉᵀ(f(x) − f(x_d)) + Bu_e + ε + ε_N + ε_φ] + eᵀQe + u_eᵀRu_e,   (21)

where V_e := ∂V/∂e denotes the partial derivative of the value function V with respect to e. The optimal value function V*(e) is defined as

V*(e) = min_{u_e∈Ψ(Ω)} ∫_t^∞ r(e(τ), u_e(e(τ))) dτ,   (22)

which satisfies the HJB equation

0 = min_{u_e∈Ψ(Ω)} H(e, u_e, V_e*).   (23)

Then the optimal control u_e* can be obtained by solving ∂H(e, u_e, V_e*)/∂u_e = 0 as

u_e* = −(1/2)R⁻¹Bᵀ ∂V*(e)/∂e,   (24)

where V* is the solution to the HJB equation (23).

Remark 4. In order to find the optimal control (24), one needs to solve the HJB equation (23) for the value function V*(e) and then substitute the solution into (24) to obtain the optimal control u_e*. For linear systems with a quadratic cost functional, the equivalent of the HJB equation is the well-known Riccati equation. However, for nonlinear systems, the HJB equation (23) is a nonlinear partial differential equation (PDE), which is difficult to solve.
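To make the linear-quadratic statement in Remark 4 concrete: if the error dynamics were linear, ė = Ae + Bu_e, with the quadratic utility eᵀQe + u_eᵀRu_e, then V*(e) = eᵀPe with P solving the algebraic Riccati equation, and (24) reduces to the familiar LQR law. A minimal check with scipy follows; the matrices are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Linear-quadratic special case of (20)-(24): e_dot = A e + B u_e,
# cost integrand e^T Q e + u_e^T R u_e.  (Illustrative matrices only.)
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

# For V*(e) = e^T P e, the HJB reduces to the Riccati equation
#   A^T P + P A - P B R^{-1} B^T P + Q = 0,
# and (24) gives u_e* = -R^{-1} B^T P e (dV*/de = 2 P e, so the 1/2 cancels).
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)        # optimal state-feedback gain

e = np.array([1.0, -0.5])
print("Riccati solution P:\n", P)
print("optimal control at e =", e, ":", -K @ e)
```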
In the literature, there are a number of results concerning the optimal control for 9) in terms of critic-actor based ADP schemes, where two NNs, i.e., a critic NN and an actor NN, are employed to approximate the value function and its corresponding policy. However, some of them run in an offline manner 7 and/or require at least partial knowledge of system dynamics,8 0,. In the following, an online adaptive algorithm will be proposed to derive the optimal control solution for system 9) using the NN identifier introduced in the previous section and another critic NN for approximating the value function of the HJB equation 3). Instead of sequentially updating the critic and actor NNs, 8, both networks are updated simultaneously in real time, and thus lead to the synchronous online implementation. A. Value Function Approximation via NN Assuming the optimal value function is continuous and defined on compact sets, then a single-layer NN can be used to approximate it 9, such that the solution V e) and its derivative V e)/ e with respect to e can be uniformly approximated by and V e) = W T Φe) + ε, 5) V e) e = Φ T W + ε, 6) where W R l are the unknown ideal weights and Φe) = Φ,, Φ l T R l is the NN activation function vector, l is the number of neurons in the hidden layer, and ε is the NN approximation error. Φ := Φ/ e and ε := ε / e denote the partial derivative of Φe) and ε with regard to e, respectively.
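For a concrete picture of the approximation (25)-(26), the sketch below uses a quadratic polynomial basis Φ(e) = [e₁², e₁e₂, e₂²]ᵀ (the same form chosen later in the simulation section), forms its gradient, and evaluates the control suggested by (24) when ∂V*/∂e is replaced by ∇Φ(e)ᵀŴ. The weight vector Ŵ, input matrix B and weighting R are placeholders for illustration only.

```python
import numpy as np

# Quadratic critic basis Phi(e) = [e1^2, e1*e2, e2^2]^T and its Jacobian dPhi/de.
def phi(e):
    return np.array([e[0]**2, e[0]*e[1], e[1]**2])

def dphi(e):
    # rows are gradients of the individual basis functions
    return np.array([[2*e[0], 0.0],
                     [e[1],   e[0]],
                     [0.0,    2*e[1]]])

B = np.array([[0.0], [1.0]])        # input matrix of the error dynamics (placeholder)
R = np.eye(1)                       # control weighting in the utility function
W_hat = np.array([0.4, 0.1, 0.9])   # current critic weight estimate (placeholder values)

def V_hat(e):
    # approximated value function W_hat^T Phi(e), cf. the critic NN in the next subsection
    return W_hat @ phi(e)

def u_e_hat(e):
    # approximate optimal control from (24) with dV/de replaced by dPhi^T W_hat
    grad_V = dphi(e).T @ W_hat
    return -0.5 * np.linalg.solve(R, B.T @ grad_V)

e = np.array([0.3, -0.1])
print("V_hat(e) =", V_hat(e), " u_e_hat(e) =", u_e_hat(e))
```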

6 NA AND HERRMANN: ONLINE ADAPTIVE APPROXIMATE OPTIMAL TRACKING CONTROL WITH 47 Some standard NN assumptions that will be used throughout the remainder of this paper are summarized here. Assumption 7, 9. The ideal NN weights W are bounded by a positive constant W N, i.e., W W N ; the NN activation function Φ ) and its derivative Φ ) with respect to argument e are bounded, e.g., Φ φ M ; and the function approximation error ε and its derivative ε with respect to e are bounded, e.g., ε φ ε. In a practical application, the NN activation functions {Φ i e) : i =,, l} can be selected so that as l +, Φe) provides a complete independent basis for V e). Then using Assumption and the Weierstrass higher-order approximation theorem, both V e) and V e)/ e can be uniformly approximated by NNs in 5) and 6), i.e., as l +, the approximation errors ε 0, ε 0 as shown in 7, 9. Then the critic NN ˆV e) that approximates the optimal value function V e) is given by ˆV e) = Ŵ T Φe), 7) where Ŵ is the estimation of the unknown weights W in critic NN 5), which will be specified by adaptive law 38). In this case, one may obtain the approximated optimal control as û e = R B T ˆV e) = e R B T Φ T Ŵ, 8) such that the overall optimal tracking control for system 6) can be given as u = u s + û e 9) with u s being the steady-state control given in 8). Consider that the ideal optimal control u e can be determined based on 4) and 6) as u e = R B T V e) = e R B T Φ T W + ε ), 30) and then substitute the estimated optimal control û e of 8) into the error dynamics 9), we have ė =Ae + ĈT fx) fx d ) + Bu e + ε + ε N + ε ϕ = Ae + ĈT fx) fx d ) + B R B T Φ T Ŵ + R B T Φ T W + ε ) + Bu e + ε + ε N + ε ϕ = Ae + ĈT fx) fx d ) + BR B T Φ T W + Bu e + BR B T ε + ε + ε N + ε ϕ. 3) Remark 5. Since the overall control 9) is derived using the steady-state control 8) and the approximate optimal control 8) that depends on the estimated optimal value function Φ T Ŵ, the critic NN in 7) can be used to determine the control action without using another NN as the actor in 9. This can reduce the computational cost and improve the learning process. However, alternatively, a separate actor NN, e.g., Φ T Ŵ a, may be used in a similar way for producing the approximated optimal control action û e = R B T Φ T Ŵ a as that shown in 9. B. Adaptive Law for Critic NN The problem now is to update the critic NN weights Ŵ, such that Ŵ converge to a small bounded region around the ideal values W. To derive the adaptive law, we denote f d = fx d ), substitute 6) into Hamiltonian function ), and thus rewrite the HJB equation 3) as 0 = He, u e, V ) = W T ΦAe + ĈT f f d ) + Bu e + e T Qe + u T e Ru e + ε HJB, 3) where ε HJB = ε Ae + ĈT f f d ) + Bu e + ε + ε N + ε ϕ + W T Φε N + ε ϕ + ε) is the residual error due to the NN approximation errors, which can be made arbitrarily small by using a sufficiently large number of NN nodes 7, 9, i.e., ε 0 as p + and ε 0 as l +. Equally, Theorem implies that estimation error ε N = Cfx) converges to zero as p + for bounded control and states. In contrast, ε ϕ = 0 when B is of rank n. To facilitate the design of the adaptive law, we denote the known terms in 3) as Ξ = Φ Ae + ĈT f f d ) + Bu e and Θ = e T Qe + u T e Ru e, and then represent 3) as Θ = W T Ξ ε HJB. 33) In 33), the unknown critic NN weights W appear in a linearly parameterized form, and will be directly estimated in the following development by utilizing the parameter estimation error method proposed in Section III. Remark 6. 
It is shown in 3) that the residual HJB equation error ε HJB is due to the critic NN approximation error ε in 6), the identifier error ε + ε N in 6) and the matching condition error ε ϕ. As claimed in 7, 9, the critic NN approximation error ε converges uniformly to zero as the number of hidden layer nodes increases, i.e., ε 0 as long as l +. That is, µ > 0, Nµ) : sup ε µ. Moreover, in case that there is no NN approximation error in 3), i.e., ε = 0, the effect of the identifier error ε N in 6) will vanish i.e., ε N 0 as p + ) because C 0 holds for ε = 0 as proved in Theorem for bounded state x and control input u. Finally, the fact ε ϕ = 0 is also true under the matching condition in 9). Consequently, if there are no approximation errors ε in identifier 6) and ε in critic NN 6), and the matching condition holds, the residual error in 33) is null, i.e., ε HJB = 0. Remark 7. Some available ADP based optimal controls are designed to online update the critic NN weights Ŵ by minimizing the squared residual Bellman error in the approximated HJB equation 9, 3, where the Least-squares 0 or modified Levenberg-Marquardt algorithms 9 are employed. In the following, we will extend our previous results 4 to design the adaptive law to directly estimate unknown critic NN weights W based on 33) rather than to reduce the Bellman error.
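Ahead of the formal definitions in the next subsection, the following sketch illustrates the idea of Remark 7: since (33) is linear in W, the critic weights can be estimated directly by filtering the regressor Ξ and the known term Θ exactly as the identifier filters f_f in Section III, instead of minimizing the Bellman error. All signals below are synthetic placeholders, and the signs follow the filtered-regressor construction of Section III.

```python
import numpy as np

# Sketch of estimating the critic weights W directly from the linear-in-parameters
# relation Theta = -W^T Xi - eps_HJB (eq. (33)), by filtering Xi and Theta the same
# way the identifier filters f_f in Section III.
l2, Gamma2 = 1.0, 50.0
dt, T = 1e-3, 20.0
lcrit = 3                            # number of critic basis functions

W_true = np.array([0.5, 0.0, 1.0])   # "ideal" weights used only to generate data
W_hat = np.zeros(lcrit)
P2 = np.zeros((lcrit, lcrit))
Q2 = np.zeros(lcrit)

for step in range(int(T / dt)):
    t = step * dt
    # synthetic regressor Xi(t) standing in for grad(Phi) * (Ae + C_hat^T(f - f_d) + B u_e);
    # it only needs to be sufficiently exciting for this illustration
    Xi = np.array([np.sin(t), np.cos(2.0 * t), np.sin(0.7 * t + 1.0)])
    Theta = -W_true @ Xi                 # eq. (33) with eps_HJB = 0
    # filtered regressor matrix/vector, paralleling (5) in Section III
    P2 += dt * (-l2 * P2 + np.outer(Xi, Xi))
    Q2 += dt * (-l2 * Q2 + Xi * Theta)
    M2 = P2 @ W_hat + Q2                 # auxiliary vector built from P2, Q2
    W_hat += dt * (-Gamma2 * M2)         # direct estimation of W (no Bellman-error gradient)

print("W_hat =", W_hat, " error =", np.linalg.norm(W_true - W_hat))
```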

Similar to Section III, we define the auxiliary filtered regressor matrix P₂ ∈ Rˡˣˡ and vector Q₂ ∈ Rˡ as

Ṗ₂ = −lP₂ + ΞΞᵀ,  P₂(0) = 0,
Q̇₂ = −lQ₂ + ΞΘ,  Q₂(0) = 0,   (34)

where l > 0 is a design parameter. Then one obtains

P₂(t) = ∫₀ᵗ e^(−l(t−r)) Ξ(r)Ξᵀ(r) dr,
Q₂(t) = ∫₀ᵗ e^(−l(t−r)) Ξ(r)Θ(r) dr.   (35)

Define another auxiliary vector M₂ ∈ Rˡ based on P₂ and Q₂ as

M₂ = P₂Ŵ + Q₂,   (36)

where Ŵ is the estimate of W, updated by the adaptive law (38) below. By substituting (33) into (35), we have Q₂ = −P₂W − ψ₂ with ψ₂ = ∫₀ᵗ e^(−l(t−r)) ε_HJB(r)Ξ(r) dr being bounded for bounded state x and control u, i.e., ‖ψ₂‖ ≤ ε_ψ₂ for some ε_ψ₂ > 0. In this case, (36) can be rewritten as

M₂ = P₂Ŵ + Q₂ = −P₂W̃ − ψ₂,   (37)

where W̃ = W − Ŵ is the NN weight estimation error. Then the adaptive law for estimating Ŵ is given by

dŴ/dt = −Γ₂M₂,   (38)

with Γ₂ > 0 being a constant gain matrix.

Similar to Section III, the condition λ_min(P₂) > σ₂ > 0 is needed if one wishes to estimate the unknown critic NN weights W precisely, so that the approximated value function (27) converges to its true value (25). As shown in [19−20], a small probing noise can be added to the control input to retain the PE condition if the excitation is insufficient; this in turn implies λ_min(P₂) > σ₂ > 0, as stated in Lemma 3.

Lemma 3 [24]. If the regressor vector Ξ defined in (33) is persistently excited, then the matrix P₂ defined in (35) is positive definite, i.e., its minimum eigenvalue satisfies λ_min(P₂) > σ₂ > 0.

Then we have the following theorem.

Theorem 2. For the critic NN adaptive law (38) with regressor vector Ξ satisfying λ_min(P₂) > σ₂ > 0 and for bounded state x and control u, one has:
1) For ε_HJB = 0 (i.e., no NN approximation errors), the estimation error W̃ converges to zero exponentially;
2) For ε_HJB ≠ 0 (i.e., with bounded approximation errors), the estimation error W̃ converges to a bounded set around zero.

Proof. Consider the Lyapunov function candidate V₂ = (1/2)W̃ᵀΓ₂⁻¹W̃; its derivative along (38) and (37) is

V̇₂ = W̃ᵀΓ₂⁻¹ dW̃/dt = −W̃ᵀP₂W̃ − W̃ᵀψ₂.   (39)

1) In case ε_HJB = 0, and thus ψ₂ = 0, (39) reduces to

V̇₂ = −W̃ᵀP₂W̃ ≤ −σ₂‖W̃‖² ≤ −µ₂V₂,   (40)

where µ₂ = 2σ₂/λ_max(Γ₂⁻¹) is a positive constant. Then, according to Lyapunov's theorem (see [4]), the weight estimation error W̃ converges to zero exponentially, where the convergence rate depends on the excitation level σ₂ and the learning gain Γ₂.

2) In case there are bounded approximation errors, i.e., ε_HJB ≠ 0, (39) can be bounded as

V̇₂ = −W̃ᵀP₂W̃ − W̃ᵀψ₂ ≤ −‖W̃‖(σ̄₂√V₂ − ε_ψ₂)   (41)

for σ̄₂ = σ₂√(2/λ_max(Γ₂⁻¹)) being a positive constant. Then, according to the extended Lyapunov theorem (see [4]), the weight estimation error W̃ is uniformly ultimately bounded and converges to the compact set Ω_W := {W̃ | √V₂ ≤ ε_ψ₂/σ̄₂}, whose size depends on the bound ε_ψ₂ of the approximation error and the excitation level σ₂. This completes the proof.

C. Stability Analysis

Now, we summarize the main results of this paper as follows.

Theorem 3. For system (3) with controls (18) and (28) and adaptive laws (7) and (38), if the initial control action is chosen to be admissible and the regressor vectors f and Ξ satisfy λ_min(P) > σ₁ > 0 and λ_min(P₂) > σ₂ > 0, then the following semi-global results hold:
1) In the absence of approximation errors, the tracking error e and the parameter estimation errors C̃ and W̃ converge to zero, and the adaptive control û_e in (28) converges to its optimal solution u_e* in (24), i.e., û_e → u_e* if ε = 0;
2
) In the presence of approximation errors, the tracking error e and the parameter estimation errors C and W are uniformly ultimately bounded, and the adaptive control û e in 8) converges to a small bound around its optimal solution u e in 4), i.e., û e u e ε u for a small positive constant ε u. Please refer to Appendix for the detailed proof of Theorem 3. V. SIMULATIONS In this section, a numerical example is provided to demonstrate the effectiveness of the proposed approach. Consider the following nonlinear continuous-time system ẋ = x + x, ẋ = 0.5x 0.5x cosx ) + ) )+ cosx ) + u. 4) The results are to be compared with the exact results in 3. Then weight matrices Q and R of cost ) are chosen as identity matrices of appropriate dimensions. The control objective is to make system states x track the desired trajectory x d = sint) and x d = cost) + sint). It is assumed that system dynamics are partially unknown, and we first use identifier 3) to reconstruct system dynamics with A = 0.5 0, B= 0 being known

8 NA AND HERRMANN: ONLINE ADAPTIVE APPROXIMATE OPTIMAL TRACKING CONTROL WITH 49 T matrices, C= is the unknown identifier weights to be estimated, and the activation function is chosen as fx)= x, x cosx ) + ), cosx ) T. The parameters for simulation are set as k = 0.00, l =, Γ = 350. The initial weight parameter is set as Ĉ0)=0. Two different scenarios are investigated, the adaptive algorithm without and with injection of additional noise. The noise has a uniform distribution and a maximal amplitude of 0. induced at the measurements for x and x. It is removed after a duration of 4 seconds. Fig. shows the profile of the estimated identifier weights Ĉ with adaptive law 7), where one may find that the identifier weight estimation converges to their true value C after a 3.5 second transient without noise. It is evident that the algorithm with noise injection converges slightly faster. for the online estimation of the critic NN weights Ŵ is shown in Fig. ; this indicates that Ŵ converges in about.5 seconds. In particular, Ŵ and Ŵ3 converge close to its optimal value of and 0, while Ŵ for the noise induced case carries a larger error. Ŵ does not affect the closed loop behavior, but has influence on the value function estimate. This means that the designed adaptive optimal control 8) converges close to its optimal control action in 44). An error in the weights is to be expected as ε ϕ 0. The novel identifier and critic NN weight update laws 7) and 38), based on the information of the parameter estimation error, lead to faster convergence of weights compared to 9. Moreover, for the noise-free case, the system states for tracking the given external command are shown in Fig. 4, the tracking error profile is given in Fig. 5, and the associated control action is provided in Fig. 6. The noise induced case provides again very similar trajectories, which are not displayed here for space reasons. Fig.. Convergence of identifier parameters Ĉ. In the following, the control performance will be verified. For this purpose, the adaptive steady-state control 8) for system 4) to maintain the steady-state performance can be written as u s = 0, cost) Ĉ T cost) sint) sint) cost) + sint) cost) + sint) cost) sint))cos sint)) + ) cos sint)) ). 43) As input matrix B is of rank, it is evident that ε ϕ = BB I)ẋ d Ax d ĈT fx d )) is not zero. Thus, the computation of optimal control 8) using adaptive law 38) for the critic NNs may be subjected to a small error. To this end, following 9, 3, the optimal value function and the associated optimal control for system 4) are V e) = e + e and u e = R B T V e) e = e. 44) Similar to 9 0, we select the activation function for the critic NN as Φe) = e, e e, e T, then the optimal weights W = 0.5, 0, T can be derived. Note that only the last nonzero coefficient W 3 = affects the closed loop. The time trace Fig.. Convergence of critic NN weights Ŵ. Fig. 3. Excitation conditions λ minp ) and λ minp ). A critical issue in using the proposed adaptive laws 7) and 38) is to ensure sufficient excitation of regressor vectors fx) and Ξ. This condition can be fulfilled in the studied system as shown in Fig. 3, where the online evolutions of λ min P ) and λ min P ) are provided. The scalar λ min P ) remains positive at all time. The value of λ min P ) is sufficiently large till the

9 40 IEEE/CAA JOURNAL OF AUTOMATICA SINICA, VOL., NO. 4, OCTOBER 04 time instant of 4 second when the noise is removed for the noise induced case; λ min P ) for the noise-free case remains sufficiently large until the time instant of second, i.e., after the NN weight convergence is obtained See Figs. and ). Fig. 6. Control action profile u. Fig. 4. Evaluation of tracking performance. Fig. 5. Convergence of tracking error e = x x d. VI. CONCLUSIONS An adaptive optimal tracking control is proposed for a class of continuous-time nonlinear systems with unknown dynamics. To achieve the optimal tracking control, an adaptive steady-state control for maintaining the desired steadystate tracking performance is accomplished with an adaptive optimal control for stabilizing the tracking error dynamics at transient stage in an optimal manner. To eliminate the need for precisely known system dynamics, an adaptive identifier is used to estimate the unknown system dynamics. A critic NN is used to online learn the approximate solution of the HJB equation, which is then used to provide the approximately optimal control action. Novel adaptive laws based on the parameter estimation error are developed for updating the unknown weights in both identifier and critic NN, such that the online learning of identifier and the optimal policy is achieved simultaneously. The PE conditions or more relaxed filtered regressor matrix conditions are required to ensure the error convergence to a bounded region around the optimal control and stability of the closed-loop system. Simulation results demonstrate the improved performance of the proposed method. APPENDIX Proof of Theorem 3. Consider the Lyapunov function as V = V + V + V 3 + V 4 = tr C T Γ C) + W T Γ W + Γe T e + KV e) + Σψ T ψ, A) where V e) is the optimal value function 0), and K, Γ and Σ are positive constants. This Lyapunov function is investigated in a compact set Ω R p n R l R n R R n R n in tuple C, W, e, ψ, x d, ẋ d ), which contains the element 0, 0, 0, 0, 0, 0) in its interior, and C, W, e, ψ, x d, ẋ d ) Ω implies e + x d, u s x d, ẋ d ) + u e e)) Ω. Ω and Ω should be both chosen to be sufficiently large but of fixed size. In particular, any temporal initial value of C, W, e, ψ, x d, ẋ d ) is assumed to be within the interior Ω, while in particular x d and ẋ d are chosen to remain within Ω. Thus, for any initial trajectory, state x and control u remain bounded for at least finite time t 0, T, which again implies in particular ψ to be bounded in this time interval. Thus, consider inequality ab a η/ + b /η for η > 0, then derivative V along 7) is derived as V = tr C T Γ C) = tr C T P C) + tr CT ψ ) σ /η) C + ηε ψ/, A) and derivative V along 38) is derived as V = W T Γ W = W T P W + W T ψ σ /η) W + η ψ /. Moreover, one may deduce V 3 from 0) and 3) as V 3 = Γe T ė + K e T Qe u T e Ru e) = e T Γ Ae + ĈT f f d ) + BR B T Φ T W + Bu e + A) BR B T ε + ε + ε N + ε ϕ + K e T Qe u T e Ru e) Kλ minq) Γ4 + A + Ĉ κ ) e + Γ BR B T Φ T W Kλ minr) Γ B ) u e + Γ BR B T ε ) T ε + Γε T ε + Γε T N ε N + Γε T ϕε ϕ. A4)

10 NA AND HERRMANN: ONLINE ADAPTIVE APPROXIMATE OPTIMAL TRACKING CONTROL WITH 4 It is evident that ψ = lψ + Ξε HJB. Hence, similar to the parameter η > 0, parameter µ > 0 is introduced to compute an upper bound of the derivative of V 4 = Σψ T ψ as V 4 = Σψ T ψ = { Σψ T lψ + Ξ W T Φ ε N + ε ϕ + ε) + ε Ae + ĈT f f d ) + Bu e + ε + ε N + ε ϕ ) } Σl 5µ) ψ + µ Σ ΞW T Φ + ε ) ε ϕ + ε)) + µ Σ Ξ εĉt f f d )) + µ Σ Ξ ε BR B T Φ T Ŵ + µ Σ Ξ εae + µ Σ ΞW T Φ + ε ) ε N. Considering that ε N = Cfx), we have A5) V = V + V + V 3 + V 4 σ η Γ + µ Σ ΞW T Φ + ε ) ) f C σ η Γφ M BR B T µ Σ Ξ ε BR B T Φ T ) W Kλ minq) Γ 4 + A + Ĉ κ ) µ Σ Ξ ε Ĉ T κ + Ξ ε A ) e KλminR) Γ B ) u e Σl 5µ) η ψ + BR Γ B T ε) T ε + Γε T ε + Γε T ϕε ϕ + ηε ψ+ ΞW µ Σ T Φ + ε ) ε ϕ + ε)) + µ Σ Ξ ε BR B T Φ T W. A6) The design parameters η, µ, Γ, Σ and K are appropriately chosen such that Kλ min R) Γ B ) > 0 and the scalars a, a, a 3 and a 4 are positive and larger than certain positive constant a > 0, where a = σ η Γ + ΞW µ Σ T Φ + ε ) ) f, a = σ η Γφ M BR B T µ Σ Ξ ε BR B T Φ T, 4 + A + Ĉ κ ) a 3 = Kλ minq) Γ µ Σ Ξ ε Ĉ T κ + Ξ ε A ), a 4 = Σl 5µ) η. This can be achieved by selecting η > 0 and K > 0 large enough, while a > 0, µ > 0, Γ > 0 and Σ > 0 are chosen to be small enough to satisfy in particular minσ, σ ) > a > 0 and a 4 > a > 0. Note also that Lipschitz continuity of f ) and smoothness of Φ ) and V ) imply that f ), Ξ and Φ ) are bounded on Ω. Thus, A6) can be further presented as V a C a W a 3 e a 4 ψ + γ, A7) where γ = Γ BR B T ε ) T ε + Γε T ε + Γε T ϕε ϕ + ηε ψ + µ Σ ΞW T Φ + ε ) ε ϕ + ε)) + µ Σ Ξ ε BR B T Φ T W defines the effect of the identifier errors ε, ψ, the critic NN approximation error ε and the matching error ε ϕ. ) In case that there are no approximation errors in both identifier and critic NN, i.e., ε N = ε = ψ = ψ = ε ϕ = 0, then we have γ = 0, such that A7) can be deduced as V a C a W a 3 e a 4 ψ 0. A8) Thus, there is a compact set ˆΩ Ω, in C, W, e, ψ ) with 0, 0, 0, 0) in its interior, which is a set of attraction. Then within ˆΩ according to Lyapunov s theorem, V 0 holds as t + such that the estimation errors C, W and e all converge to zero. In this case, by assuming the critic NN approximation error ε = 0, we have û e u e = R B T Φ T Ŵ + R B T Φ T W = such that R B T Φ T W, A9) lim t + û e u e φ M R B T lim t + W = 0. A0) ) In case that there are bounded approximation errors in both identifier and critic NN, then we have γ 0. Consequently, according to A7), it can be shown that V is negative if C > γ/a, W > γ/a, e > γ/a 3, ψ > γ/a 4. A) Then again for some set ˆΩ Ω, the estimation errors C, W, ψ and e are all uniformly ultimately bounded according to Lyapunov s theorem within the set of attraction ˆΩ. Next we will prove û e u e ε u. Recalling the expressions of u e from 4) or 30) and û e from 8), we have û e u e = R B T Φ T Ŵ + R B T Φ T W + ε ) = R B T Φ T W + R B T ε. A) When t, the upper bound of A) is û e u e R B T Φ T W + R B T ε ε u. A3) Clearly, the upper bound ε u depends on the critic NN approximation error W and the NN estimation error ε.
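Relating the convergence statement above back to the simulation example of Section V: assuming the reconstruction of (44) as V*(e) = 0.5e₁² + e₂² and u_e* = −e₂, with the critic basis Φ(e) = [e₁², e₁e₂, e₂²]ᵀ, B = [0, 1]ᵀ and R = 1, the stated ideal weights W* = [0.5, 0, 1]ᵀ can be checked numerically: substituting W* into the approximate control law of (28) reproduces u_e* exactly.

```python
import numpy as np

# Numerical sanity check of the reconstructed optimal solution (44) for the
# simulation example: V*(e) = 0.5*e1^2 + e2^2, u_e* = -e2, W* = [0.5, 0, 1]^T.
B = np.array([[0.0], [1.0]])
R = np.eye(1)
W_star = np.array([0.5, 0.0, 1.0])

phi  = lambda e: np.array([e[0]**2, e[0]*e[1], e[1]**2])
dphi = lambda e: np.array([[2*e[0], 0.0], [e[1], e[0]], [0.0, 2*e[1]]])

rng = np.random.default_rng(0)
for e in rng.uniform(-2.0, 2.0, size=(5, 2)):
    V_from_weights = W_star @ phi(e)                                   # W*^T Phi(e)
    u_from_weights = -0.5 * np.linalg.solve(R, B.T @ (dphi(e).T @ W_star))
    assert np.isclose(V_from_weights, 0.5*e[0]**2 + e[1]**2)
    assert np.isclose(u_from_weights[0], -e[1])
print("W* = [0.5, 0, 1] reproduces V*(e) and u_e*(e) = -e2 at the sampled points.")
```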

11 4 IEEE/CAA JOURNAL OF AUTOMATICA SINICA, VOL., NO. 4, OCTOBER 04 REFERENCES Lewis F L, Vrabie D, Syrmos V L. Optimal Control. Wiley. com, 0. Vrabie D, Lewis F L. Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Networks, 009, 3): Sastry S, Bodson M. Adaptive Control: Stability, Convergence, and Robustness. New Jersey: Prentice Hall, Ioannou P A, Sun J. Robust Adaptive Control. New Jersey: Prentice Hall, Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Cambridge: Cambridge University Press, Doya K J. Reinforcement learning in continuous time and space. Neural computation, 000, ): Sutton R S, Barto A G, Williams R J. Reinforcement learning is direct adaptive optimal control. IEEE Control Systems Magazine, 99, ): 9 8 Werbos P J. A menu of designs for reinforcement learning over time. Neural Networks for Control. MA, USA: MIT Press Cambridge, Si J, Barto A G, Powell W B, Wunsch D C. Handbook of Learning and Approximate Dynamic Programming. Los Alamitos: IEEE Press, Wang F Y, Zhang H G, Liu D R. Adaptive dynamic programming: an introduction. IEEE Computational Intelligence Magazine, 009, 4): Lewis F L, Vrabie D. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine, ): 3 50 Zhang H G, Zhang X, Luo Y H, Yang J. An overview of research on adaptive dynamic programming. Acata Automatica Sinica, 03, 394): Dierks T, Thumati B T, Jagannathan S. Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence. Neural Networks, 009, 5): Al-Tamimi A, Lewis F L, Abu-Khalaf M. Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 008, 384): Wang D, Liu D R, Wei Q L, Zhao D B, Jin N. Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming. Automatica, 0, 488): Hanselmann T, Noakes L, Zaknich A. Continuous-time adaptive critics. IEEE Transactions on Neural Networks, 007, 83): Abu-Khalaf M, Lewis F L. Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica, 005, 45): Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis F L. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 009, 45): Vamvoudakis K G, Lewis F L. Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica, 00, 465): Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis K G, Lewis F L, Dixon W E. A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica, 03, 49): 8 9 Zhang H G, Cui L, Zhang X, Luo Y. Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method. IEEE Transactions on Neural Networks, 0, ): 6 36 Mannava A, Balakrishnan S N, Tang L, Landers R G. Optimal tracking control of motion systems. IEEE Transactions on Control Systems Technology, 0, 06): Nodland D, Zargarzadeh H, Jagannathan S. Neural network-based optimal adaptive output feedback control of a helicopter UAV. IEEE Transactions on Neural Networks and Learning Systems, 03, 47): Na J, Herrmann G, Ren X M, Mahyuddin M N, Barber P. Robust adaptive finite-time parameter estimation and control of nonlinear systems. 
In: Proceedings of IEEE International Symposium on Intelligent Control ISIC). Denver, CO: IEEE, Uang H J, Chen B S. Robust adaptive optimal tracking design for uncertain missile systems: a fuzzy approach. Fuzzy Sets and Systems, 00, 6): Krstic M, Kokotovic P V, Kanellakopoulos I. Nonlinear and Adaptive Control Design. New York: Wiley, Kosmatopoulos E B, Polycarpou M M, Christodoulou M A, Ioannou P A. High-order neural network structures for identification of dynamical systems. IEEE Transactions on Neural Networks, 995, 6): Abdollahi F, Talebi H A, Patel R V. A stable neural network-based observer with application to flexible-joint manipulators. IEEE Transactions on Neural Networks, 006, 7): Lin J S, Kanellakopoulos I. Nonlinearities enhance parameter convergence in strict feedback systems. IEEE Transactions on Automatic Control, 999, 44): Edwards C, Spurgeon S K. Sliding Mode Control: Theory and Applications. Boca Raton: CRC Press, Sira-Ramirez H. Differential geometric methods in variable-structure control. International Journal of Control, 988, 48 4): Nevistic V, Primbs J A. Constrained Nonlinear Optimal Control: A Converse HJB Approach, Technical Report CIT-CDS 96-0, California Institute of Technology, Pasadena, CA, 996. Jing Na Professor in Kunming University of Science and Technology. He received his Ph. D. degree from Beijing Institute of Technology in 00. From 0 to 0, he was a Postdoctoral Fellow with the ITER Organization. His research interest covers intelligent control, adaptive parameter estimation, neural networks, repetitive control, and nonlinear control & applications. Corresponding author of this paper. Guido Herrmann Received his Ph. D. degree from University of Leicester, UK, in 00. From 00 to 003, he was a Senior Research Fellow in the Data Storage Institute in Singapore. From 003 until 007, he was a research associate, fellow, and lecturer in University of Leicester. He joined University of Bristol, UK, as a lecturer in March 007. He was promoted to a Senior Lecturer in 009 and a Reader in Control and Dynamics in 0. He is a Senior Member of the IEEE. His research interest covers the development and application of novel, robust and nonlinear control systems.


More information

Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games

Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games Alberto Bressan ) and Khai T. Nguyen ) *) Department of Mathematics, Penn State University **) Department of Mathematics,

More information

Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate

Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate www.scichina.com info.scichina.com www.springerlin.com Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate WEI Chen & CHEN ZongJi School of Automation

More information

Optimal Control. McGill COMP 765 Oct 3 rd, 2017

Optimal Control. McGill COMP 765 Oct 3 rd, 2017 Optimal Control McGill COMP 765 Oct 3 rd, 2017 Classical Control Quiz Question 1: Can a PID controller be used to balance an inverted pendulum: A) That starts upright? B) That must be swung-up (perhaps

More information

Adaptive Dynamic Inversion Control of a Linear Scalar Plant with Constrained Control Inputs

Adaptive Dynamic Inversion Control of a Linear Scalar Plant with Constrained Control Inputs 5 American Control Conference June 8-, 5. Portland, OR, USA ThA. Adaptive Dynamic Inversion Control of a Linear Scalar Plant with Constrained Control Inputs Monish D. Tandale and John Valasek Abstract

More information

A Globally Stabilizing Receding Horizon Controller for Neutrally Stable Linear Systems with Input Constraints 1

A Globally Stabilizing Receding Horizon Controller for Neutrally Stable Linear Systems with Input Constraints 1 A Globally Stabilizing Receding Horizon Controller for Neutrally Stable Linear Systems with Input Constraints 1 Ali Jadbabaie, Claudio De Persis, and Tae-Woong Yoon 2 Department of Electrical Engineering

More information

A NONLINEAR TRANSFORMATION APPROACH TO GLOBAL ADAPTIVE OUTPUT FEEDBACK CONTROL OF 3RD-ORDER UNCERTAIN NONLINEAR SYSTEMS

A NONLINEAR TRANSFORMATION APPROACH TO GLOBAL ADAPTIVE OUTPUT FEEDBACK CONTROL OF 3RD-ORDER UNCERTAIN NONLINEAR SYSTEMS Copyright 00 IFAC 15th Triennial World Congress, Barcelona, Spain A NONLINEAR TRANSFORMATION APPROACH TO GLOBAL ADAPTIVE OUTPUT FEEDBACK CONTROL OF RD-ORDER UNCERTAIN NONLINEAR SYSTEMS Choon-Ki Ahn, Beom-Soo

More information

A Sliding Mode Control based on Nonlinear Disturbance Observer for the Mobile Manipulator

A Sliding Mode Control based on Nonlinear Disturbance Observer for the Mobile Manipulator International Core Journal of Engineering Vol.3 No.6 7 ISSN: 44-895 A Sliding Mode Control based on Nonlinear Disturbance Observer for the Mobile Manipulator Yanna Si Information Engineering College Henan

More information

Concurrent Learning for Convergence in Adaptive Control without Persistency of Excitation

Concurrent Learning for Convergence in Adaptive Control without Persistency of Excitation Concurrent Learning for Convergence in Adaptive Control without Persistency of Excitation Girish Chowdhary and Eric Johnson Abstract We show that for an adaptive controller that uses recorded and instantaneous

More information

Nonlinear Tracking Control of Underactuated Surface Vessel

Nonlinear Tracking Control of Underactuated Surface Vessel American Control Conference June -. Portland OR USA FrB. Nonlinear Tracking Control of Underactuated Surface Vessel Wenjie Dong and Yi Guo Abstract We consider in this paper the tracking control problem

More information

Target Localization and Circumnavigation Using Bearing Measurements in 2D

Target Localization and Circumnavigation Using Bearing Measurements in 2D Target Localization and Circumnavigation Using Bearing Measurements in D Mohammad Deghat, Iman Shames, Brian D. O. Anderson and Changbin Yu Abstract This paper considers the problem of localization and

More information

AFAULT diagnosis procedure is typically divided into three

AFAULT diagnosis procedure is typically divided into three 576 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 47, NO. 4, APRIL 2002 A Robust Detection and Isolation Scheme for Abrupt and Incipient Faults in Nonlinear Systems Xiaodong Zhang, Marios M. Polycarpou,

More information

Several Extensions in Methods for Adaptive Output Feedback Control

Several Extensions in Methods for Adaptive Output Feedback Control Several Extensions in Methods for Adaptive Output Feedback Control Nakwan Kim Postdoctoral Fellow School of Aerospace Engineering Georgia Institute of Technology Atlanta, GA 333 5 Anthony J. Calise Professor

More information

Observer-based Adaptive Optimal Control for Unknown Singularly Perturbed Nonlinear Systems With Input Constraints

Observer-based Adaptive Optimal Control for Unknown Singularly Perturbed Nonlinear Systems With Input Constraints 48 IEEE/CAA JOURNAL OF AUTOMATICA SINICA, VOL. 4, NO. 1, JANUARY 17 Observer-based Adaptive Optimal Control for Unknown Singularly Perturbed Nonlinear Systems With Input Constraints Zhijun Fu, Wenfang

More information

Robust Observer for Uncertain T S model of a Synchronous Machine

Robust Observer for Uncertain T S model of a Synchronous Machine Recent Advances in Circuits Communications Signal Processing Robust Observer for Uncertain T S model of a Synchronous Machine OUAALINE Najat ELALAMI Noureddine Laboratory of Automation Computer Engineering

More information

FINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES. Danlei Chu, Tongwen Chen, Horacio J. Marquez

FINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES. Danlei Chu, Tongwen Chen, Horacio J. Marquez FINITE HORIZON ROBUST MODEL PREDICTIVE CONTROL USING LINEAR MATRIX INEQUALITIES Danlei Chu Tongwen Chen Horacio J Marquez Department of Electrical and Computer Engineering University of Alberta Edmonton

More information

CHATTERING-FREE SMC WITH UNIDIRECTIONAL AUXILIARY SURFACES FOR NONLINEAR SYSTEM WITH STATE CONSTRAINTS. Jian Fu, Qing-Xian Wu and Ze-Hui Mao

CHATTERING-FREE SMC WITH UNIDIRECTIONAL AUXILIARY SURFACES FOR NONLINEAR SYSTEM WITH STATE CONSTRAINTS. Jian Fu, Qing-Xian Wu and Ze-Hui Mao International Journal of Innovative Computing, Information and Control ICIC International c 2013 ISSN 1349-4198 Volume 9, Number 12, December 2013 pp. 4793 4809 CHATTERING-FREE SMC WITH UNIDIRECTIONAL

More information

MODEL-BASED REINFORCEMENT LEARNING FOR ONLINE APPROXIMATE OPTIMAL CONTROL

MODEL-BASED REINFORCEMENT LEARNING FOR ONLINE APPROXIMATE OPTIMAL CONTROL MODEL-BASED REINFORCEMENT LEARNING FOR ONLINE APPROXIMATE OPTIMAL CONTROL By RUSHIKESH LAMBODAR KAMALAPURKAR A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

More information

90 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 1, JANUARY /$ IEEE

90 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 1, JANUARY /$ IEEE 90 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 1, JANUARY 2008 Generalized Hamilton Jacobi Bellman Formulation -Based Neural Network Control of Affine Nonlinear Discrete-Time Systems Zheng Chen,

More information

The ϵ-capacity of a gain matrix and tolerable disturbances: Discrete-time perturbed linear systems

The ϵ-capacity of a gain matrix and tolerable disturbances: Discrete-time perturbed linear systems IOSR Journal of Mathematics (IOSR-JM) e-issn: 2278-5728, p-issn: 2319-765X. Volume 11, Issue 3 Ver. IV (May - Jun. 2015), PP 52-62 www.iosrjournals.org The ϵ-capacity of a gain matrix and tolerable disturbances:

More information

Robust Adaptive Attitude Control of a Spacecraft

Robust Adaptive Attitude Control of a Spacecraft Robust Adaptive Attitude Control of a Spacecraft AER1503 Spacecraft Dynamics and Controls II April 24, 2015 Christopher Au Agenda Introduction Model Formulation Controller Designs Simulation Results 2

More information

Set-based adaptive estimation for a class of nonlinear systems with time-varying parameters

Set-based adaptive estimation for a class of nonlinear systems with time-varying parameters Preprints of the 8th IFAC Symposium on Advanced Control of Chemical Processes The International Federation of Automatic Control Furama Riverfront, Singapore, July -3, Set-based adaptive estimation for

More information

Neural-network-observer-based optimal control for unknown nonlinear systems using adaptive dynamic programming

Neural-network-observer-based optimal control for unknown nonlinear systems using adaptive dynamic programming International Journal of Control, 013 Vol. 86, No. 9, 1554 1566, http://dx.doi.org/10.1080/0007179.013.79056 Neural-network-observer-based optimal control for unknown nonlinear systems using adaptive dynamic

More information

A Neuron-Network-Based Optimal Control of Ultra-Capacitors with System Uncertainties

A Neuron-Network-Based Optimal Control of Ultra-Capacitors with System Uncertainties THIS PAPER HAS BEEN ACCEPTED BY IEEE ISGT NA 219. 1 A Neuron-Network-Based Optimal Control of Ultra-Capacitors with System Uncertainties Jiajun Duan, Zhehan Yi, Di Shi, Hao Xu, and Zhiwei Wang GEIRI North

More information

Approximation-Free Prescribed Performance Control

Approximation-Free Prescribed Performance Control Preprints of the 8th IFAC World Congress Milano Italy August 28 - September 2 2 Approximation-Free Prescribed Performance Control Charalampos P. Bechlioulis and George A. Rovithakis Department of Electrical

More information

ONLINE LEARNING ALGORITHM FOR ZERO-SUM GAMES WITH INTEGRAL REINFORCEMENT LEARNING

ONLINE LEARNING ALGORITHM FOR ZERO-SUM GAMES WITH INTEGRAL REINFORCEMENT LEARNING JAISCR,, Vol., No.4, pp. 35 33 ONLINE LEARNING ALGORIHM FOR ZERO-SUM GAMES WIH INEGRAL REINFORCEMEN LEARNING Kyriakos G. Vamvoudakis, Draguna Vrabie, Frank L. Lewis Automation and Robotics Research Institute,

More information

MCE/EEC 647/747: Robot Dynamics and Control. Lecture 12: Multivariable Control of Robotic Manipulators Part II

MCE/EEC 647/747: Robot Dynamics and Control. Lecture 12: Multivariable Control of Robotic Manipulators Part II MCE/EEC 647/747: Robot Dynamics and Control Lecture 12: Multivariable Control of Robotic Manipulators Part II Reading: SHV Ch.8 Mechanical Engineering Hanz Richter, PhD MCE647 p.1/14 Robust vs. Adaptive

More information

Contraction Based Adaptive Control of a Class of Nonlinear Systems

Contraction Based Adaptive Control of a Class of Nonlinear Systems 9 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June -, 9 WeB4.5 Contraction Based Adaptive Control of a Class of Nonlinear Systems B. B. Sharma and I. N. Kar, Member IEEE Abstract

More information

Adaptive NN Control of Dynamic Systems with Unknown Dynamic Friction

Adaptive NN Control of Dynamic Systems with Unknown Dynamic Friction Adaptive NN Control of Dynamic Systems with Unknown Dynamic Friction S. S. Ge 1,T.H.LeeandJ.Wang Department of Electrical and Computer Engineering National University of Singapore Singapore 117576 Abstract

More information

Robust Adaptive MPC for Systems with Exogeneous Disturbances

Robust Adaptive MPC for Systems with Exogeneous Disturbances Robust Adaptive MPC for Systems with Exogeneous Disturbances V. Adetola M. Guay Department of Chemical Engineering, Queen s University, Kingston, Ontario, Canada (e-mail: martin.guay@chee.queensu.ca) Abstract:

More information

EE C128 / ME C134 Feedback Control Systems

EE C128 / ME C134 Feedback Control Systems EE C128 / ME C134 Feedback Control Systems Lecture Additional Material Introduction to Model Predictive Control Maximilian Balandat Department of Electrical Engineering & Computer Science University of

More information

Adaptive Robust Tracking Control of Robot Manipulators in the Task-space under Uncertainties

Adaptive Robust Tracking Control of Robot Manipulators in the Task-space under Uncertainties Australian Journal of Basic and Applied Sciences, 3(1): 308-322, 2009 ISSN 1991-8178 Adaptive Robust Tracking Control of Robot Manipulators in the Task-space under Uncertainties M.R.Soltanpour, M.M.Fateh

More information

Event-sampled direct adaptive neural network control of uncertain strict-feedback system with application to quadrotor unmanned aerial vehicle

Event-sampled direct adaptive neural network control of uncertain strict-feedback system with application to quadrotor unmanned aerial vehicle Scholars' Mine Masters Theses Student Research & Creative Works Fall 2016 Event-sampled direct adaptive neural network control of uncertain strict-feedback system with application to quadrotor unmanned

More information

An Approach of Robust Iterative Learning Control for Uncertain Systems

An Approach of Robust Iterative Learning Control for Uncertain Systems ,,, 323 E-mail: mxsun@zjut.edu.cn :, Lyapunov( ),,.,,,.,,. :,,, An Approach of Robust Iterative Learning Control for Uncertain Systems Mingxuan Sun, Chaonan Jiang, Yanwei Li College of Information Engineering,

More information

CHATTERING REDUCTION OF SLIDING MODE CONTROL BY LOW-PASS FILTERING THE CONTROL SIGNAL

CHATTERING REDUCTION OF SLIDING MODE CONTROL BY LOW-PASS FILTERING THE CONTROL SIGNAL Asian Journal of Control, Vol. 12, No. 3, pp. 392 398, May 2010 Published online 25 February 2010 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/asjc.195 CHATTERING REDUCTION OF SLIDING

More information

Event-Triggered Decentralized Dynamic Output Feedback Control for LTI Systems

Event-Triggered Decentralized Dynamic Output Feedback Control for LTI Systems Event-Triggered Decentralized Dynamic Output Feedback Control for LTI Systems Pavankumar Tallapragada Nikhil Chopra Department of Mechanical Engineering, University of Maryland, College Park, 2742 MD,

More information

ADAPTIVE FILTER THEORY

ADAPTIVE FILTER THEORY ADAPTIVE FILTER THEORY Fourth Edition Simon Haykin Communications Research Laboratory McMaster University Hamilton, Ontario, Canada Front ice Hall PRENTICE HALL Upper Saddle River, New Jersey 07458 Preface

More information

Robust Stabilization of Non-Minimum Phase Nonlinear Systems Using Extended High Gain Observers

Robust Stabilization of Non-Minimum Phase Nonlinear Systems Using Extended High Gain Observers 28 American Control Conference Westin Seattle Hotel, Seattle, Washington, USA June 11-13, 28 WeC15.1 Robust Stabilization of Non-Minimum Phase Nonlinear Systems Using Extended High Gain Observers Shahid

More information

RESEARCH ON TRACKING AND SYNCHRONIZATION OF UNCERTAIN CHAOTIC SYSTEMS

RESEARCH ON TRACKING AND SYNCHRONIZATION OF UNCERTAIN CHAOTIC SYSTEMS Computing and Informatics, Vol. 3, 13, 193 1311 RESEARCH ON TRACKING AND SYNCHRONIZATION OF UNCERTAIN CHAOTIC SYSTEMS Junwei Lei, Hongchao Zhao, Jinyong Yu Zuoe Fan, Heng Li, Kehua Li Naval Aeronautical

More information

Multi-Robotic Systems

Multi-Robotic Systems CHAPTER 9 Multi-Robotic Systems The topic of multi-robotic systems is quite popular now. It is believed that such systems can have the following benefits: Improved performance ( winning by numbers ) Distributed

More information

Variable Learning Rate LMS Based Linear Adaptive Inverse Control *

Variable Learning Rate LMS Based Linear Adaptive Inverse Control * ISSN 746-7659, England, UK Journal of Information and Computing Science Vol., No. 3, 6, pp. 39-48 Variable Learning Rate LMS Based Linear Adaptive Inverse Control * Shuying ie, Chengjin Zhang School of

More information

Lyapunov Stability of Linear Predictor Feedback for Distributed Input Delays

Lyapunov Stability of Linear Predictor Feedback for Distributed Input Delays IEEE TRANSACTIONS ON AUTOMATIC CONTROL VOL. 56 NO. 3 MARCH 2011 655 Lyapunov Stability of Linear Predictor Feedback for Distributed Input Delays Nikolaos Bekiaris-Liberis Miroslav Krstic In this case system

More information

Control of industrial robots. Centralized control

Control of industrial robots. Centralized control Control of industrial robots Centralized control Prof. Paolo Rocco (paolo.rocco@polimi.it) Politecnico di Milano ipartimento di Elettronica, Informazione e Bioingegneria Introduction Centralized control

More information

Stochastic and Adaptive Optimal Control

Stochastic and Adaptive Optimal Control Stochastic and Adaptive Optimal Control Robert Stengel Optimal Control and Estimation, MAE 546 Princeton University, 2018! Nonlinear systems with random inputs and perfect measurements! Stochastic neighboring-optimal

More information

Adaptive Robust Control for Servo Mechanisms With Partially Unknown States via Dynamic Surface Control Approach

Adaptive Robust Control for Servo Mechanisms With Partially Unknown States via Dynamic Surface Control Approach IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, VOL. 18, NO. 3, MAY 2010 723 Adaptive Robust Control for Servo Mechanisms With Partially Unknown States via Dynamic Surface Control Approach Guozhu Zhang,

More information

arxiv: v1 [math.oc] 30 May 2014

arxiv: v1 [math.oc] 30 May 2014 When is a Parameterized Controller Suitable for Adaptive Control? arxiv:1405.7921v1 [math.oc] 30 May 2014 Romeo Ortega and Elena Panteley Laboratoire des Signaux et Systèmes, CNRS SUPELEC, 91192 Gif sur

More information

Risk-Sensitive Control with HARA Utility

Risk-Sensitive Control with HARA Utility IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 46, NO. 4, APRIL 2001 563 Risk-Sensitive Control with HARA Utility Andrew E. B. Lim Xun Yu Zhou, Senior Member, IEEE Abstract In this paper, a control methodology

More information

Adaptive estimation in nonlinearly parameterized nonlinear dynamical systems

Adaptive estimation in nonlinearly parameterized nonlinear dynamical systems 2 American Control Conference on O'Farrell Street, San Francisco, CA, USA June 29 - July, 2 Adaptive estimation in nonlinearly parameterized nonlinear dynamical systems Veronica Adetola, Devon Lehrer and

More information

Learning Model Predictive Control for Iterative Tasks: A Computationally Efficient Approach for Linear System

Learning Model Predictive Control for Iterative Tasks: A Computationally Efficient Approach for Linear System Learning Model Predictive Control for Iterative Tasks: A Computationally Efficient Approach for Linear System Ugo Rosolia Francesco Borrelli University of California at Berkeley, Berkeley, CA 94701, USA

More information

ADAPTIVE EXTREMUM SEEKING CONTROL OF CONTINUOUS STIRRED TANK BIOREACTORS 1

ADAPTIVE EXTREMUM SEEKING CONTROL OF CONTINUOUS STIRRED TANK BIOREACTORS 1 ADAPTIVE EXTREMUM SEEKING CONTROL OF CONTINUOUS STIRRED TANK BIOREACTORS M. Guay, D. Dochain M. Perrier Department of Chemical Engineering, Queen s University, Kingston, Ontario, Canada K7L 3N6 CESAME,

More information

Nonlinear Model Predictive Control Tools (NMPC Tools)

Nonlinear Model Predictive Control Tools (NMPC Tools) Nonlinear Model Predictive Control Tools (NMPC Tools) Rishi Amrit, James B. Rawlings April 5, 2008 1 Formulation We consider a control system composed of three parts([2]). Estimator Target calculator Regulator

More information

Output Regulation of Uncertain Nonlinear Systems with Nonlinear Exosystems

Output Regulation of Uncertain Nonlinear Systems with Nonlinear Exosystems Output Regulation of Uncertain Nonlinear Systems with Nonlinear Exosystems Zhengtao Ding Manchester School of Engineering, University of Manchester Oxford Road, Manchester M3 9PL, United Kingdom zhengtaoding@manacuk

More information

A Recurrent Neural Network for Solving Sylvester Equation With Time-Varying Coefficients

A Recurrent Neural Network for Solving Sylvester Equation With Time-Varying Coefficients IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL 13, NO 5, SEPTEMBER 2002 1053 A Recurrent Neural Network for Solving Sylvester Equation With Time-Varying Coefficients Yunong Zhang, Danchi Jiang, Jun Wang, Senior

More information

State Regulator. Advanced Control. design of controllers using pole placement and LQ design rules

State Regulator. Advanced Control. design of controllers using pole placement and LQ design rules Advanced Control State Regulator Scope design of controllers using pole placement and LQ design rules Keywords pole placement, optimal control, LQ regulator, weighting matrixes Prerequisites Contact state

More information

ADAPTIVE FEEDBACK LINEARIZING CONTROL OF CHUA S CIRCUIT

ADAPTIVE FEEDBACK LINEARIZING CONTROL OF CHUA S CIRCUIT International Journal of Bifurcation and Chaos, Vol. 12, No. 7 (2002) 1599 1604 c World Scientific Publishing Company ADAPTIVE FEEDBACK LINEARIZING CONTROL OF CHUA S CIRCUIT KEVIN BARONE and SAHJENDRA

More information

Neural Network-Based Adaptive Control of Robotic Manipulator: Application to a Three Links Cylindrical Robot

Neural Network-Based Adaptive Control of Robotic Manipulator: Application to a Three Links Cylindrical Robot Vol.3 No., 27 مجلد 3 العدد 27 Neural Network-Based Adaptive Control of Robotic Manipulator: Application to a Three Links Cylindrical Robot Abdul-Basset A. AL-Hussein Electrical Engineering Department Basrah

More information

Output Feedback Stabilization with Prescribed Performance for Uncertain Nonlinear Systems in Canonical Form

Output Feedback Stabilization with Prescribed Performance for Uncertain Nonlinear Systems in Canonical Form Output Feedback Stabilization with Prescribed Performance for Uncertain Nonlinear Systems in Canonical Form Charalampos P. Bechlioulis, Achilles Theodorakopoulos 2 and George A. Rovithakis 2 Abstract The

More information

Adaptive State Feedback Nash Strategies for Linear Quadratic Discrete-Time Games

Adaptive State Feedback Nash Strategies for Linear Quadratic Discrete-Time Games Adaptive State Feedbac Nash Strategies for Linear Quadratic Discrete-Time Games Dan Shen and Jose B. Cruz, Jr. Intelligent Automation Inc., Rocville, MD 2858 USA (email: dshen@i-a-i.com). The Ohio State

More information

Locally optimal controllers and application to orbital transfer (long version)

Locally optimal controllers and application to orbital transfer (long version) 9th IFAC Symposium on Nonlinear Control Systems Toulouse, France, September 4-6, 13 FrA1.4 Locally optimal controllers and application to orbital transfer (long version) S. Benachour V. Andrieu Université

More information

1 The Observability Canonical Form

1 The Observability Canonical Form NONLINEAR OBSERVERS AND SEPARATION PRINCIPLE 1 The Observability Canonical Form In this Chapter we discuss the design of observers for nonlinear systems modelled by equations of the form ẋ = f(x, u) (1)

More information

IEOR 265 Lecture 14 (Robust) Linear Tube MPC

IEOR 265 Lecture 14 (Robust) Linear Tube MPC IEOR 265 Lecture 14 (Robust) Linear Tube MPC 1 LTI System with Uncertainty Suppose we have an LTI system in discrete time with disturbance: x n+1 = Ax n + Bu n + d n, where d n W for a bounded polytope

More information

Adaptive backstepping for trajectory tracking of nonlinearly parameterized class of nonlinear systems

Adaptive backstepping for trajectory tracking of nonlinearly parameterized class of nonlinear systems Adaptive backstepping for trajectory tracking of nonlinearly parameterized class of nonlinear systems Hakim Bouadi, Felix Antonio Claudio Mora-Camino To cite this version: Hakim Bouadi, Felix Antonio Claudio

More information

AS A POPULAR approach for compensating external

AS A POPULAR approach for compensating external IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, VOL. 16, NO. 1, JANUARY 2008 137 A Novel Robust Nonlinear Motion Controller With Disturbance Observer Zi-Jiang Yang, Hiroshi Tsubakihara, Shunshoku Kanae,

More information

Adaptive linear quadratic control using policy. iteration. Steven J. Bradtke. University of Massachusetts.

Adaptive linear quadratic control using policy. iteration. Steven J. Bradtke. University of Massachusetts. Adaptive linear quadratic control using policy iteration Steven J. Bradtke Computer Science Department University of Massachusetts Amherst, MA 01003 bradtke@cs.umass.edu B. Erik Ydstie Department of Chemical

More information

Introduction to Nonlinear Control Lecture # 3 Time-Varying and Perturbed Systems

Introduction to Nonlinear Control Lecture # 3 Time-Varying and Perturbed Systems p. 1/5 Introduction to Nonlinear Control Lecture # 3 Time-Varying and Perturbed Systems p. 2/5 Time-varying Systems ẋ = f(t, x) f(t, x) is piecewise continuous in t and locally Lipschitz in x for all t

More information

Prashant Mhaskar, Nael H. El-Farra & Panagiotis D. Christofides. Department of Chemical Engineering University of California, Los Angeles

Prashant Mhaskar, Nael H. El-Farra & Panagiotis D. Christofides. Department of Chemical Engineering University of California, Los Angeles HYBRID PREDICTIVE OUTPUT FEEDBACK STABILIZATION OF CONSTRAINED LINEAR SYSTEMS Prashant Mhaskar, Nael H. El-Farra & Panagiotis D. Christofides Department of Chemical Engineering University of California,

More information

A Novel Integral-Based Event Triggering Control for Linear Time-Invariant Systems

A Novel Integral-Based Event Triggering Control for Linear Time-Invariant Systems 53rd IEEE Conference on Decision and Control December 15-17, 2014. Los Angeles, California, USA A Novel Integral-Based Event Triggering Control for Linear Time-Invariant Systems Seyed Hossein Mousavi 1,

More information

Output-feedback Dynamic Surface Control for a Class of Nonlinear Non-minimum Phase Systems

Output-feedback Dynamic Surface Control for a Class of Nonlinear Non-minimum Phase Systems 96 IEEE/CAA JOURNAL OF AUTOMATICA SINICA, VOL. 3, NO., JANUARY 06 Output-feedback Dynamic Surface Control for a Class of Nonlinear Non-minimum Phase Systems Shanwei Su Abstract In this paper, an output-feedback

More information

An Adaptive LQG Combined With the MRAS Based LFFC for Motion Control Systems

An Adaptive LQG Combined With the MRAS Based LFFC for Motion Control Systems Journal of Automation Control Engineering Vol 3 No 2 April 2015 An Adaptive LQG Combined With the MRAS Based LFFC for Motion Control Systems Nguyen Duy Cuong Nguyen Van Lanh Gia Thi Dinh Electronics Faculty

More information

Author's Accepted Manuscript

Author's Accepted Manuscript Author's Accepted Manuscript Dual Heuristic Dynamic Programming for Nonlinear Discrete-Time Uncertain Systems with State Delay Bin Wang, Dongbin Zhao, Cesare Alippi, Derong Liu www.elsevier.com/locate/neucom

More information

SOLVING the Hamilton Jacobi Bellman (HJB) equation

SOLVING the Hamilton Jacobi Bellman (HJB) equation 15 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 9, NO. 6, JUNE 018 Approximate Dynamic Programming: Combining Regional and Local State Following Approximations Patryk Deptula,JoelA.Rosenfeld,

More information

Unifying Behavior-Based Control Design and Hybrid Stability Theory

Unifying Behavior-Based Control Design and Hybrid Stability Theory 9 American Control Conference Hyatt Regency Riverfront St. Louis MO USA June - 9 ThC.6 Unifying Behavior-Based Control Design and Hybrid Stability Theory Vladimir Djapic 3 Jay Farrell 3 and Wenjie Dong

More information

UNCERTAIN CHAOTIC SYSTEM CONTROL VIA ADAPTIVE NEURAL DESIGN

UNCERTAIN CHAOTIC SYSTEM CONTROL VIA ADAPTIVE NEURAL DESIGN International Journal of Bifurcation and Chaos, Vol., No. 5 (00) 097 09 c World Scientific Publishing Company UNCERTAIN CHAOTIC SYSTEM CONTROL VIA ADAPTIVE NEURAL DESIGN S. S. GE and C. WANG Department

More information

Optimal Control. Lecture 18. Hamilton-Jacobi-Bellman Equation, Cont. John T. Wen. March 29, Ref: Bryson & Ho Chapter 4.

Optimal Control. Lecture 18. Hamilton-Jacobi-Bellman Equation, Cont. John T. Wen. March 29, Ref: Bryson & Ho Chapter 4. Optimal Control Lecture 18 Hamilton-Jacobi-Bellman Equation, Cont. John T. Wen Ref: Bryson & Ho Chapter 4. March 29, 2004 Outline Hamilton-Jacobi-Bellman (HJB) Equation Iterative solution of HJB Equation

More information

On the stability of receding horizon control with a general terminal cost

On the stability of receding horizon control with a general terminal cost On the stability of receding horizon control with a general terminal cost Ali Jadbabaie and John Hauser Abstract We study the stability and region of attraction properties of a family of receding horizon

More information