
418 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 25, NO. 2, FEBRUARY 2014

Decentralized Stabilization for a Class of Continuous-Time Nonlinear Interconnected Systems Using Online Learning Optimal Control Approach

Derong Liu, Fellow, IEEE, Ding Wang, and Hongliang Li

Abstract: In this paper, using a neural-network-based online learning optimal control approach, a novel decentralized control strategy is developed to stabilize a class of continuous-time nonlinear interconnected large-scale systems. First, optimal controllers of the isolated subsystems are designed with cost functions reflecting the bounds of interconnections. Then, it is proven that the decentralized control strategy of the overall system can be established by adding appropriate feedback gains to the optimal control policies of the isolated subsystems. Next, an online policy iteration algorithm is presented to solve the Hamilton–Jacobi–Bellman equations related to the optimal control problem. Through constructing a set of critic neural networks, the cost functions can be obtained approximately, followed by the control policies. Furthermore, the dynamics of the estimation errors of the critic networks are verified to be uniformly and ultimately bounded. Finally, a simulation example is provided to illustrate the effectiveness of the present decentralized control scheme.

Index Terms: Adaptive dynamic programming, decentralized control, large-scale systems, neural networks, nonlinear interconnected systems, optimal control, policy iteration, reinforcement learning.

I. INTRODUCTION

VARIOUS complex systems in social and engineering areas, such as ecosystems, transportation systems, and power systems, are considered as large-scale systems. Generally speaking, a large-scale system is comprised of several subsystems with obvious interconnections, which makes analysis and synthesis increasingly difficult when using classical centralized control techniques. Bakule [1] pointed out, with similar results, that it is therefore necessary to partition the design issue of the overall system into manageable subproblems.
Then, the overall plant is no longer controlled by a single controller but by an array of independent controllers that together represent a decentralized controller. Therefore, decentralized control has been a control of choice for large-scale systems because it is computationally efficient to formulate control laws that use only locally available subsystem states or outputs [2]. Actually, considerable attention has been paid to the decentralized stabilization of large-scale systems during the last several decades [3]–[7]. As previously mentioned, a decentralized strategy consists of some noninteracting local controllers corresponding to the isolated subsystems, not the overall system. Thus, in many situations, the design of the isolated subsystems is a matter of great significance. In [8], it was shown that the decentralized control of the interconnected system was related to the optimal control of the isolated subsystems. Therefore, the optimal control method can be employed to facilitate the design process of the decentralized control strategy. However, in [8], the cost functions of the isolated subsystems were not chosen in general form, and the detailed design procedure was not given.

Manuscript received January 5, 2013; revised May 6, 2013; accepted July 25. Date of publication September 16, 2013; date of current version January 10. This work was supported in part by the National Natural Science Foundation of China and in part by the Early Career Development Award of SKLMCCS. The acting Editor-in-Chief who handled the review of this paper was Danil Prokhorov. The authors are with The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China (e-mail: derong.liu@ia.ac.cn; ding.wang@ia.ac.cn; hongliang.li@ia.ac.cn). Color versions of one or more of the figures in this paper are available online. Digital Object Identifier /TNNLS
© 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

For this reason, in this paper, by employing the online policy iteration algorithm, we investigate the decentralized stabilization problem using a neural-network-based learning optimal control approach.

The optimal control of nonlinear systems often leads to solving the Hamilton–Jacobi–Bellman (HJB) equation instead of the Riccati equation of the linear case. Though dynamic programming is a useful technique to solve optimization and optimal control problems, in many cases it is computationally difficult to apply because of the curse of dimensionality. Fortunately, based on function approximators such as neural networks, adaptive (or approximate) dynamic programming (ADP) was proposed by Werbos [9], [10] as an alternative method to solve optimal control problems forward in time. There are several synonyms used for ADP, including adaptive dynamic programming [11]–[15], approximate dynamic programming [16]–[18], neuro-dynamic programming [19], neural dynamic programming [20], adaptive critic designs [21], and reinforcement learning [22]. In recent years, great efforts have been devoted to ADP and related research in theory and applications, and numerous excellent results have been obtained that greatly promote the development of relevant disciplines [23]–[40]. In light of [13], the ADP technique is closely related to reinforcement learning when engaging in the research of feedback control. In general, value and policy iterations are fundamental algorithms for reinforcement-learning-based ADP in optimal control. Policy iteration starts with a stabilizing control, whereas value iteration cannot always guarantee the stability of the control during the implementation process. Al-Tamimi et al. [18], Zhang et al. [23], and Liu et al. [26] studied the optimal control problem of discrete-time nonlinear systems using the value iteration algorithm. Specifically, policy iteration represents a class of algorithms containing two basic iterations, i.e., policy evaluation and policy improvement [41]–[46]. Abu-Khalaf and Lewis [41] derived an offline optimal control scheme for nonlinear systems with saturating actuators. Then, Vrabie and Lewis [42] and Vamvoudakis and Lewis [43] each used an online policy iteration algorithm to study the infinite horizon optimal control of continuous-time nonlinear systems. The former was performed based on the sequential updates of two neural networks, namely, the critic network and the action network, whereas in the latter, the two networks were trained simultaneously. Recently, Liu et al. [44] extended the policy iteration algorithm to the nonlinear optimal control problem with unknown internal dynamics and discounted cost function. Besides, Bhasin et al. [46] constructed an actor-critic-identifier architecture to deal with the infinite horizon optimal control of uncertain nonlinear systems, characterized by the introduction of a robust dynamic neural network.

In this paper, we employ the online policy iteration algorithm to tackle the decentralized control of a class of nonlinear interconnected systems. To design the decentralized control scheme of the overall system, the optimal controllers of the isolated subsystems are designed first, with the cost functions modified to account for the interconnections. Then, the decentralized control strategy can be established by adding appropriate feedback gains to the local optimal control policies. Next, the online policy iteration algorithm is developed to solve the HJB equations related to the optimal control by constructing and training some critic networks. It is shown that approximate closed-form expressions of the optimal control policies are available. Hence, there is no need to build action networks.
Additionally, the uniform ultimate boundedness (UUB) of the dynamics of the weight estimation errors is analyzed using the Lyapunov approach. Remarkably, considering the effectiveness of ADP and reinforcement learning techniques in solving the nonlinear optimal control problem, the decentralized control approach established here is natural and convenient. More importantly, it can be employed to stabilize a broad class of nonlinear large-scale systems.

This paper is organized as follows. In Section II, the decentralized control problem of the large-scale system is described. In Section III, the optimal control of isolated subsystems is presented in the framework of HJB equations, based on which the decentralized control strategy can be developed. In Section IV, the online policy iteration algorithm is introduced to solve the HJB equations with convergence analysis. In addition, critic networks are constructed to facilitate the implementation of the online algorithm. The UUB of the dynamics of the weight estimation errors is proved as well. In Section V, an example is given to demonstrate the effectiveness of the established approach. In Section VI, concluding remarks are provided.

II. PROBLEM STATEMENT

In this paper, we study a class of continuous-time nonlinear large-scale systems composed of $N$ interconnected subsystems described by

$$\dot{x}_i(t) = f_i(x_i(t)) + g_i(x_i(t))\big(\bar{u}_i(x_i(t)) + \bar{Z}_i(x(t))\big), \quad i = 1, 2, \ldots, N \tag{1}$$

where $x_i(t) \in \mathbb{R}^{n_i}$ and $\bar{u}_i(x_i(t)) \in \mathbb{R}^{m_i}$ are the state and control vectors of the $i$th subsystem, respectively. In large-scale system (1), $x = [x_1^T\ x_2^T\ \cdots\ x_N^T]^T \in \mathbb{R}^{n}$ denotes the overall state, where $n = \sum_{i=1}^{N} n_i$. Correspondingly, $x_1, x_2, \ldots, x_N$ are called local states, whereas $\bar{u}_1(x_1), \bar{u}_2(x_2), \ldots, \bar{u}_N(x_N)$ are local controls. Note that for subsystem $i$, $f_i(x_i)$, $g_i(x_i)$, and $g_i(x_i)\bar{Z}_i(x)$ represent the nonlinear internal dynamics, input gain matrix, and interconnected term, respectively. Let $x_i(0) = x_{i0}$ be the initial state of the $i$th subsystem, $i = 1, 2, \ldots, N$.

Fig. 1. Structural diagram of the decentralized control problem of the interconnected system.
Additionally, we let the following assumptions hold throughout this paper.

Assumption 1: The state vector $x_i = 0$ is the equilibrium of the $i$th subsystem, $i = 1, 2, \ldots, N$.

Assumption 2: The functions $f_i(\cdot)$ and $g_i(\cdot)$ are differentiable in their arguments with $f_i(0) = 0$, where $i = 1, 2, \ldots, N$.

Assumption 3: The feedback control vector $\bar{u}_i(x_i) = 0$ when $x_i = 0$, where $i = 1, 2, \ldots, N$.

Let $R_i \in \mathbb{R}^{m_i \times m_i}$, $i = 1, 2, \ldots, N$, be symmetric positive definite matrices. Then, we denote $Z_i(x) = R_i^{1/2}\bar{Z}_i(x)$, where $Z_i(x) \in \mathbb{R}^{m_i}$, $i = 1, 2, \ldots, N$, are bounded as follows:

$$\|Z_i(x)\| \le \sum_{j=1}^{N} \rho_{ij} h_{ij}(x_j), \quad i = 1, 2, \ldots, N. \tag{2}$$

In (2), $\rho_{ij}$ are nonnegative constants and $h_{ij}(x_j)$ are positive semidefinite functions with $i, j = 1, 2, \ldots, N$. If we define $h_i(x_i) = \max\{h_{1i}(x_i), h_{2i}(x_i), \ldots, h_{Ni}(x_i)\}$, $i = 1, 2, \ldots, N$, then (2) can be formulated as

$$\|Z_i(x)\| \le \sum_{j=1}^{N} \lambda_{ij} h_j(x_j), \quad i = 1, 2, \ldots, N \tag{3}$$

where $\lambda_{ij} \ge \rho_{ij} h_{ij}(x_j)/h_j(x_j)$, $i, j = 1, 2, \ldots, N$, are also nonnegative constants.

When dealing with the decentralized control problem, we aim at finding $N$ control policies $\bar{u}_1(x_1), \bar{u}_2(x_2), \ldots, \bar{u}_N(x_N)$ to stabilize the large-scale system (1). It is important to note that in the control pair $(\bar{u}_1(x_1), \bar{u}_2(x_2), \ldots, \bar{u}_N(x_N))$, $\bar{u}_i(x_i)$ is only a function of the corresponding local state, namely $x_i$, where $i = 1, 2, \ldots, N$. The schematic diagram of the decentralized control problem is shown in Fig. 1.
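The step from (2) to (3) rests on one observation about the maximum. It is spelled out below as a short derivation; the particular choice $\lambda_{ij} = \rho_{ij}$ is only one admissible choice and is stated here as an illustration, not as the authors' selection.

```latex
% Because h_j(x_j) = \max_k h_{kj}(x_j), we have 0 \le h_{ij}(x_j) \le h_j(x_j)
% for every i, so each term of (2) can be bounded separately:
\begin{align*}
\|Z_i(x)\|
  \;\le\; \sum_{j=1}^{N} \rho_{ij}\, h_{ij}(x_j)
  \;\le\; \sum_{j=1}^{N} \rho_{ij}\, h_{j}(x_j),
  \qquad i = 1, 2, \ldots, N .
\end{align*}
% Hence (3) holds with \lambda_{ij} = \rho_{ij}; any constant
% \lambda_{ij} \ge \rho_{ij} h_{ij}(x_j)/h_j(x_j) works equally well.
```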

III. DECENTRALIZED CONTROLLER DESIGN VIA OPTIMAL CONTROL SCHEME

In this section, we investigate the methodology for decentralized controller design. Two subsections are included in this part. In the first, the optimal control of the isolated subsystems is described under the framework of HJB equations, whereas in the second, the decentralized control strategy is constructed based on the optimal control policies.

A. Optimal Control and the HJB Equations

Now, we consider the $N$ isolated subsystems corresponding to (1) that are given by

$$\dot{x}_i(t) = f_i(x_i(t)) + g_i(x_i(t))u_i(x_i(t)), \quad i = 1, 2, \ldots, N. \tag{4}$$

For the $i$th isolated subsystem, we further assume that $f_i + g_i u_i$ is Lipschitz continuous on a set $\Omega_i$ in $\mathbb{R}^{n_i}$ containing the origin, and that the subsystem is controllable in the sense that there exists a continuous control policy on $\Omega_i$ that asymptotically stabilizes the subsystem.

In this paper, to deal with the infinite horizon optimal control problem, we have to find the control policies $u_i(x_i)$, $i = 1, 2, \ldots, N$, which minimize the local cost functions

$$J_i(x_{i0}) = \int_0^{\infty} \big\{ Q_i^2(x_i(\tau)) + u_i^T(x_i(\tau)) R_i u_i(x_i(\tau)) \big\}\, d\tau, \quad i = 1, 2, \ldots, N \tag{5}$$

where $Q_i(x_i)$, $i = 1, 2, \ldots, N$, are positive definite functions satisfying

$$h_i(x_i) \le Q_i(x_i), \quad i = 1, 2, \ldots, N. \tag{6}$$

Based on optimal control theory, the designed feedback controls must not only stabilize the subsystems on $\Omega_i$, $i = 1, 2, \ldots, N$, but also guarantee that the cost functions (5) are finite. In other words, the control policies must be admissible. Below is the definition of admissible control.

Definition 1: Consider the isolated subsystem $i$. A control policy $\mu_i(x_i)$ is defined as admissible with respect to (5) on $\Omega_i$, denoted by $\mu_i \in \Psi_i(\Omega_i)$, if $\mu_i(x_i)$ is continuous on $\Omega_i$, $\mu_i(0) = 0$, $u_i(x_i) = \mu_i(x_i)$ stabilizes (4) on $\Omega_i$, and $J_i(x_{i0})$ is finite for all $x_{i0} \in \Omega_i$.
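To make Definition 1 concrete, the following minimal sketch evaluates the cost (5) for a hypothetical scalar linear subsystem $\dot{x} = ax + bu$ under a linear feedback $u = -kx$, with $Q^2(x) = qx^2$ and $R = r$. None of these names or values come from the paper; they are assumptions chosen so that the integral has a closed form. Under these assumptions the policy is admissible exactly when $a - bk < 0$, in which case the cost is finite.

```python
import numpy as np


def closed_loop_cost(a, b, q, r, k, x0, t_end=40.0, dt=1e-3):
    """Numerically integrate the cost (5) for x' = a x + b u with u = -k x.

    Hypothetical scalar subsystem: Q(x)^2 = q x^2 and R = r (assumptions).
    """
    n_steps = int(round(t_end / dt))
    t = np.arange(n_steps + 1) * dt
    x = x0 * np.exp((a - b * k) * t)      # exact closed-loop trajectory
    integrand = (q + r * k**2) * x**2     # Q^2 + u^T R u along the trajectory
    # Trapezoidal rule on the truncated horizon [0, t_end].
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1])) * dt)


def closed_form_cost(a, b, q, r, k, x0):
    """Closed-form value of the same integral; infinite if the loop is not stable."""
    rate = a - b * k
    if rate >= 0:
        return float("inf")               # policy is not admissible
    return (q + r * k**2) * x0**2 / (-2.0 * rate)
```

For example, with `a = b = q = r = 1` and `x0 = 1`, the gain `k = 3` is admissible and both functions agree closely, whereas `k = 0.5` leaves the closed loop unstable and the closed-form cost is infinite.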
For any set of admissible control policies $\mu_i \in \Psi_i(\Omega_i)$, $i = 1, 2, \ldots, N$, if the associated cost functions

$$V_i(x_{i0}) = \int_0^{\infty} \big\{ Q_i^2(x_i(\tau)) + \mu_i^T(x_i(\tau)) R_i \mu_i(x_i(\tau)) \big\}\, d\tau, \quad i = 1, 2, \ldots, N \tag{7}$$

are continuously differentiable, then the infinitesimal versions of (7) are the so-called nonlinear Lyapunov equations

$$0 = Q_i^2(x_i) + \mu_i^T(x_i) R_i \mu_i(x_i) + (\nabla V_i(x_i))^T \big( f_i(x_i) + g_i(x_i)\mu_i(x_i) \big), \quad i = 1, 2, \ldots, N \tag{8}$$

with $V_i(0) = 0$, $i = 1, 2, \ldots, N$. In (8), the terms $\nabla V_i(x_i)$, $i = 1, 2, \ldots, N$, denote the partial derivatives of the local cost functions $V_i(x_i)$ with respect to the local states $x_i$, i.e., $\nabla V_i(x_i) = \partial V_i(x_i)/\partial x_i$, where $i = 1, 2, \ldots, N$.

Define the Hamiltonian functions of the $N$ isolated subsystems as follows:

$$H_i(x_i, \mu_i, \nabla V_i(x_i)) = Q_i^2(x_i) + \mu_i^T(x_i) R_i \mu_i(x_i) + (\nabla V_i(x_i))^T \big( f_i(x_i) + g_i(x_i)\mu_i(x_i) \big) \tag{9}$$

where $i = 1, 2, \ldots, N$.

The optimal cost functions of the $N$ isolated subsystems can be formulated as

$$J_i^*(x_{i0}) = \min_{\mu_i \in \Psi_i(\Omega_i)} \int_0^{\infty} \big\{ Q_i^2(x_i(\tau)) + \mu_i^T(x_i(\tau)) R_i \mu_i(x_i(\tau)) \big\}\, d\tau, \quad i = 1, 2, \ldots, N. \tag{10}$$

In view of optimal control theory, the optimal cost functions $J_i^*(x_i)$, $i = 1, 2, \ldots, N$, satisfy the HJB equations

$$0 = \min_{\mu_i \in \Psi_i(\Omega_i)} H_i(x_i, \mu_i, \nabla J_i^*(x_i)), \quad i = 1, 2, \ldots, N \tag{11}$$

where $\nabla J_i^*(x_i) = \partial J_i^*(x_i)/\partial x_i$, $i = 1, 2, \ldots, N$. Assume that the minima on the right-hand side of (11) exist and are unique. Then, the optimal control policies for the $N$ isolated subsystems are

$$u_i^*(x_i) = \arg\min_{\mu_i \in \Psi_i(\Omega_i)} H_i(x_i, \mu_i, \nabla J_i^*(x_i)) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i), \quad i = 1, 2, \ldots, N. \tag{12}$$

Substituting the optimal control policies (12) into the nonlinear Lyapunov equations (8), we obtain the formulation of the HJB equations in terms of $\nabla J_i^*(x_i)$, $i = 1, 2, \ldots, N$, as follows:

$$0 = Q_i^2(x_i) + (\nabla J_i^*(x_i))^T f_i(x_i) - \frac{1}{4} (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i) \tag{13}$$

with $J_i^*(0) = 0$ and $i = 1, 2, \ldots, N$.

Remark 1: The formulas developed in (12) display an array of closed-form expressions of the optimal control policies, which obviates the need to search for the optimal control policies via an optimization process. However, the knowledge of $J_i^*(x_i)$, $i = 1, 2, \ldots, N$, is required, which implies the importance of the solutions of the HJB equations.

B.
Decentralized Control Strategy

According to (12), we have expressed the optimal control policies, i.e., $u_1^*(x_1), u_2^*(x_2), \ldots, u_N^*(x_N)$, for the $N$ isolated subsystems (4). In the following, we show that by proportionally increasing some local feedback gains, a stabilizing decentralized control scheme can be established for the interconnected system (1). Now, we give the following lemma, indicating how the feedback gains can be added to guarantee the asymptotic stability of the isolated subsystems.

Lemma 1: Consider the isolated subsystems (4). The feedback controls

$$\bar{u}_i(x_i) = \pi_i u_i^*(x_i) = -\frac{1}{2} \pi_i R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i), \quad i = 1, 2, \ldots, N \tag{14}$$

ensure that the $N$ closed-loop isolated subsystems are asymptotically stable for all $\pi_i \ge 1/2$, where $i = 1, 2, \ldots, N$.

Proof: The lemma can be proved by showing that $J_i^*(x_i)$, $i = 1, 2, \ldots, N$, are Lyapunov functions. First of all, in light of (10), we find that $J_i^*(x_i) > 0$ for any $x_i \ne 0$ and $J_i^*(x_i) = 0$ when $x_i = 0$, which implies that $J_i^*(x_i)$, $i = 1, 2, \ldots, N$, are positive definite functions. Next, the derivatives of $J_i^*(x_i)$, $i = 1, 2, \ldots, N$, along the corresponding trajectories of the closed-loop isolated subsystems are given by

$$\dot{J}_i^*(x_i) = (\nabla J_i^*(x_i))^T \dot{x}_i = (\nabla J_i^*(x_i))^T \big( f_i(x_i) + g_i(x_i)\bar{u}_i(x_i) \big) \tag{15}$$

where $i = 1, 2, \ldots, N$. Then, by adding and subtracting $(1/2)(\nabla J_i^*(x_i))^T g_i(x_i) u_i^*(x_i)$ to (15) and considering (12)–(14), we have

$$\begin{aligned} \dot{J}_i^*(x_i) &= (\nabla J_i^*(x_i))^T f_i(x_i) - \frac{1}{4} (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i) \\ &\quad - \frac{1}{2}\Big(\pi_i - \frac{1}{2}\Big) (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1} g_i^T(x_i) \nabla J_i^*(x_i) \\ &= -Q_i^2(x_i) - \frac{1}{2}\Big(\pi_i - \frac{1}{2}\Big) \big\| R_i^{-1/2} g_i^T(x_i) \nabla J_i^*(x_i) \big\|^2 \end{aligned} \tag{16}$$

where $i = 1, 2, \ldots, N$. Observing (16), we obtain that $\dot{J}_i^*(x_i) < 0$ for all $\pi_i \ge 1/2$ and $x_i \ne 0$, where $i = 1, 2, \ldots, N$. Therefore, the conditions for Lyapunov local stability theory are satisfied and the proof is completed.

Remark 2: Lemma 1 reveals that any feedback controls $\bar{u}_i(x_i)$, $i = 1, 2, \ldots, N$, can ensure the asymptotic stability of the closed-loop isolated subsystems as long as $\pi_i \ge 1/2$, $i = 1, 2, \ldots, N$. However, the feedback controls are optimal only when $\pi_i = 1$, $i = 1, 2, \ldots, N$. In fact, similar results have been given in [47]–[49], showing that the optimal controls $u_i^*(x_i)$, $i = 1, 2, \ldots, N$, are robust in the sense that they have infinite gain margins.

Now, we present the main theorem of this paper, based on which the decentralized control strategy can be established.

Theorem 1: For interconnected system (1), there exist $N$ positive numbers $\pi_i^* > 0$, $i = 1, 2, \ldots, N$, such that for any $\pi_i \ge \pi_i^*$, $i = 1, 2, \ldots, N$, the feedback controls developed by (14) ensure that the closed-loop interconnected system is asymptotically stable.
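In other words, the control pair $(\bar{u}_1(x_1), \bar{u}_2(x_2), \ldots, \bar{u}_N(x_N))$ is the decentralized control strategy of large-scale system (1).

Proof: In accordance with Lemma 1, we observe that $J_i^*(x_i)$, $i = 1, 2, \ldots, N$, are all Lyapunov functions. Here, we select a composite Lyapunov function given by

$$L(x) = \sum_{i=1}^{N} \theta_i J_i^*(x_i) \tag{17}$$

where $\theta_i$, $i = 1, 2, \ldots, N$, are arbitrary positive constants. Taking the time derivative of $L(x)$ along the trajectories of the closed-loop interconnected system, we obtain

$$\dot{L}(x) = \sum_{i=1}^{N} \theta_i \dot{J}_i^*(x_i) = \sum_{i=1}^{N} \theta_i \Big\{ (\nabla J_i^*(x_i))^T \big( f_i(x_i) + g_i(x_i)\bar{u}_i(x_i) \big) + (\nabla J_i^*(x_i))^T g_i(x_i) \bar{Z}_i(x) \Big\}. \tag{18}$$

Then, taking (3), (6), and (16) into consideration, (18) can be turned into the following form:

$$\begin{aligned} \dot{L}(x) &\le -\sum_{i=1}^{N} \theta_i \Big\{ Q_i^2(x_i) + \frac{1}{2}\Big(\pi_i - \frac{1}{2}\Big) \big\| R_i^{-1/2} g_i^T(x_i) \nabla J_i^*(x_i) \big\|^2 - (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1/2} Z_i(x) \Big\} \\ &\le -\sum_{i=1}^{N} \theta_i \Big\{ Q_i^2(x_i) + \frac{1}{2}\Big(\pi_i - \frac{1}{2}\Big) \big\| (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1/2} \big\|^2 - \big\| (\nabla J_i^*(x_i))^T g_i(x_i) R_i^{-1/2} \big\| \sum_{j=1}^{N} \lambda_{ij} Q_j(x_j) \Big\}. \end{aligned} \tag{19}$$

Here, we denote

$$\Theta = \mathrm{diag}\{\theta_1, \theta_2, \ldots, \theta_N\} \tag{20}$$

$$\Lambda = \begin{bmatrix} \lambda_{11} & \lambda_{12} & \cdots & \lambda_{1N} \\ \lambda_{21} & \lambda_{22} & \cdots & \lambda_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_{N1} & \lambda_{N2} & \cdots & \lambda_{NN} \end{bmatrix} \tag{21}$$

and

$$\Pi = \mathrm{diag}\Big\{ \frac{1}{2}\Big(\pi_1 - \frac{1}{2}\Big), \frac{1}{2}\Big(\pi_2 - \frac{1}{2}\Big), \ldots, \frac{1}{2}\Big(\pi_N - \frac{1}{2}\Big) \Big\}. \tag{22}$$

Therefore, through introducing a $2N$-dimensional vector

$$\xi = \begin{bmatrix} Q_1(x_1) \\ Q_2(x_2) \\ \vdots \\ Q_N(x_N) \\ \big\| (\nabla J_1^*(x_1))^T g_1(x_1) R_1^{-1/2} \big\| \\ \big\| (\nabla J_2^*(x_2))^T g_2(x_2) R_2^{-1/2} \big\| \\ \vdots \\ \big\| (\nabla J_N^*(x_N))^T g_N(x_N) R_N^{-1/2} \big\| \end{bmatrix} \tag{23}$$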

we can transform (19) to the following compact form:

$$\dot{L}(x) \le -\xi^T \begin{bmatrix} \Theta & -\frac{1}{2}\Lambda^T\Theta \\ -\frac{1}{2}\Theta\Lambda & \Theta\Pi \end{bmatrix} \xi \triangleq -\xi^T A\, \xi. \tag{24}$$

According to (24), sufficiently large $\pi_i$, $i = 1, 2, \ldots, N$, can be chosen such that the matrix $A$ is positive definite. That is to say, there exist $\pi_i^*$, $i = 1, 2, \ldots, N$, so that any $\pi_i \ge \pi_i^*$, $i = 1, 2, \ldots, N$, are large enough to guarantee the positive definiteness of $A$. Then, we have $\dot{L}(x) < 0$. Therefore, the closed-loop interconnected system is asymptotically stable under the action of the control pair $(\bar{u}_1(x_1), \bar{u}_2(x_2), \ldots, \bar{u}_N(x_N))$. The proof is completed.

Clearly, on the basis of Theorem 1, the focal point of designing the decentralized control strategy becomes deriving the optimal controllers for the $N$ isolated subsystems. We should therefore put our emphasis on solving the HJB equations, which is regarded as a difficult task [12], [13]. Hence, in the following, we employ a more pragmatic approach to obtain approximate solutions based on the online policy iteration algorithm and neural network techniques.

IV. NEURAL-NETWORK-BASED LEARNING OPTIMAL CONTROL OF THE ISOLATED SUBSYSTEMS USING ONLINE POLICY ITERATION ALGORITHM

Three subsections are included here. In the first, the online policy iteration algorithm is introduced to tackle the optimal control problem of the isolated subsystems, whereas the neural network implementation process is given in the second. The stability proof of the dynamics of the estimation errors is developed in the last.

A. Online Policy Iteration Algorithm and Its Convergence

Here, the online policy iteration algorithm is introduced to solve the HJB equations. The policy iteration algorithm consists of policy evaluation based on (8) and policy improvement based on (12), as shown in [22]. Specifically, its iteration procedure can be described as follows.

Step 1: Choose a small positive number $\epsilon$. Let $p = 0$ and $V_i^{(0)}(x_i) = 0$, where $i = 1, 2, \ldots, N$. Then, start with $N$ initial admissible control policies $\mu_1^{(0)}(x_1), \mu_2^{(0)}(x_2), \ldots, \mu_N^{(0)}(x_N)$.
Step 2: Based on the control policies $\mu_i^{(p)}(x_i)$, $i = 1, 2, \ldots, N$, solve the following nonlinear Lyapunov equations

$$0 = Q_i^2(x_i) + \big(\mu_i^{(p)}(x_i)\big)^T R_i\, \mu_i^{(p)}(x_i) + \big(\nabla V_i^{(p+1)}(x_i)\big)^T \big( f_i(x_i) + g_i(x_i)\mu_i^{(p)}(x_i) \big) \tag{25}$$

with $V_i^{(p+1)}(0) = 0$ and $i = 1, 2, \ldots, N$.

Step 3: Update the control policies via

$$\mu_i^{(p+1)}(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \nabla V_i^{(p+1)}(x_i) \tag{26}$$

where $i = 1, 2, \ldots, N$.

Step 4: If $\| V_i^{(p+1)}(x_i) - V_i^{(p)}(x_i) \| \le \epsilon$, $i = 1, 2, \ldots, N$, stop and obtain the approximate optimal controls of the $N$ isolated subsystems; else, let $p = p + 1$ and go back to Step 2.

Note that $N$ initial admissible control policies are required in the above algorithm. In the following, we present the convergence analysis of the online policy iteration algorithm for the isolated subsystems.

Theorem 2: Consider the $N$ isolated subsystems (4), given $N$ initial admissible control policies $\mu_1^{(0)}(x_1), \mu_2^{(0)}(x_2), \ldots, \mu_N^{(0)}(x_N)$. Then, using the policy iteration algorithm established in (25) and (26), the cost functions and control policies converge to the optimal ones as $p \to \infty$, i.e., $V_i^{(p)}(x_i) \to J_i^*(x_i)$ and $\mu_i^{(p)}(x_i) \to u_i^*(x_i)$ as $p \to \infty$, where $i = 1, 2, \ldots, N$.

Proof: First, we consider the subsystem $i$. According to [41] and [44], when given an initial admissible control policy $\mu_i^{(0)}(x_i)$, we have $\mu_i^{(p)}(x_i) \in \Psi_i(\Omega_i)$ for any $p \ge 0$. Additionally, for any $\zeta > 0$, there exists an integer $p_{0i}$ such that for any $p \ge p_{0i}$, the formulas

$$\sup_{x_i \in \Omega_i} \big| V_i^{(p)}(x_i) - J_i^*(x_i) \big| < \zeta \tag{27}$$

and

$$\sup_{x_i \in \Omega_i} \big\| \mu_i^{(p)}(x_i) - u_i^*(x_i) \big\| < \zeta \tag{28}$$

hold simultaneously.

Next, we consider the $N$ isolated subsystems. When given $\mu_1^{(0)}(x_1), \mu_2^{(0)}(x_2), \ldots, \mu_N^{(0)}(x_N)$, where $\mu_i^{(0)}(x_i)$ is the initial admissible control policy corresponding to the $i$th subsystem, we have $\mu_i^{(p)}(x_i) \in \Psi_i(\Omega_i)$ for any $p \ge 0$, where $i = 1, 2, \ldots, N$. In addition, we denote $p_0 = \max\{p_{01}, p_{02}, \ldots, p_{0N}\}$. Thus, we can conclude that for any $\zeta > 0$, there exists an integer $p_0$ such that for any $p \ge p_0$, (27) and (28) are true with $i = 1, 2, \ldots, N$. In other words, the algorithm converges to the optimal cost functions and optimal control policies of the $N$ isolated subsystems. The proof is completed.

B.
Implementation Procedure via Neural Networks

For the $N$ isolated subsystems, assume that the cost functions $V_i(x_i)$, $i = 1, 2, \ldots, N$, are continuously differentiable. Then, according to the universal approximation property of neural networks, $V_i(x_i)$ can be reconstructed by a single-layer neural network on a compact set $\Omega_i$ as

$$V_i(x_i) = \omega_{ci}^T \sigma_{ci}(x_i) + \varepsilon_{ci}(x_i), \quad i = 1, 2, \ldots, N \tag{29}$$

where $\omega_{ci} \in \mathbb{R}^{l_i}$ is the ideal weight, $\sigma_{ci}(x_i) \in \mathbb{R}^{l_i}$ is the activation function, $l_i$ is the number of neurons in the hidden layer, and $\varepsilon_{ci}(x_i)$ is the approximation error of the $i$th neural network, $i = 1, 2, \ldots, N$.

The derivatives of the cost functions with respect to their state vectors are formulated as

$$\nabla V_i(x_i) = (\nabla \sigma_{ci}(x_i))^T \omega_{ci} + \nabla \varepsilon_{ci}(x_i), \quad i = 1, 2, \ldots, N \tag{30}$$

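where $\nabla \sigma_{ci}(x_i) = \partial \sigma_{ci}(x_i)/\partial x_i \in \mathbb{R}^{l_i \times n_i}$ and $\nabla \varepsilon_{ci}(x_i) = \partial \varepsilon_{ci}(x_i)/\partial x_i \in \mathbb{R}^{n_i}$ are the gradients of the activation function and the approximation error of the $i$th neural network, respectively, $i = 1, 2, \ldots, N$. Based on (30), the Lyapunov equations (8) become

$$0 = Q_i^2(x_i) + \mu_i^T R_i \mu_i + \big( \omega_{ci}^T \nabla \sigma_{ci}(x_i) + (\nabla \varepsilon_{ci}(x_i))^T \big) \dot{x}_i \tag{31}$$

where $i = 1, 2, \ldots, N$. For the $i$th neural network, $i = 1, 2, \ldots, N$, assume that the neural network weight vector $\omega_{ci}$, the gradient $\nabla \sigma_{ci}(x_i)$, and the approximation error $\varepsilon_{ci}(x_i)$ and its derivative $\nabla \varepsilon_{ci}(x_i)$ are all bounded on the compact set $\Omega_i$. In addition, according to [43], we have $\varepsilon_{ci}(x_i) \to 0$ and $\nabla \varepsilon_{ci}(x_i) \to 0$ as $l_i \to \infty$, where $i = 1, 2, \ldots, N$.

Because the ideal weights are unknown, $N$ critic neural networks can be built to approximate the cost functions as follows:

$$\hat{V}_i(x_i) = \hat{\omega}_{ci}^T \sigma_{ci}(x_i), \quad i = 1, 2, \ldots, N \tag{32}$$

where $\hat{\omega}_{ci}$, $i = 1, 2, \ldots, N$, are the estimated weights. Here, $\sigma_{ci}(x_i)$, $i = 1, 2, \ldots, N$, are selected such that $\hat{V}_i(x_i) > 0$ for any $x_i \ne 0$ and $\hat{V}_i(x_i) = 0$ when $x_i = 0$.

Similarly, the derivatives of the approximate cost functions with respect to the state vectors can be expressed by

$$\nabla \hat{V}_i(x_i) = (\nabla \sigma_{ci}(x_i))^T \hat{\omega}_{ci}, \quad i = 1, 2, \ldots, N \tag{33}$$

where $\nabla \hat{V}_i(x_i) = \partial \hat{V}_i(x_i)/\partial x_i$, $i = 1, 2, \ldots, N$. Then, the approximate Hamiltonian functions can be expressed as

$$H_i(x_i, \mu_i, \hat{\omega}_{ci}) = Q_i^2(x_i) + \mu_i^T R_i \mu_i + \hat{\omega}_{ci}^T \nabla \sigma_{ci}(x_i) \dot{x}_i \triangleq e_{ci}, \quad i = 1, 2, \ldots, N. \tag{34}$$

For the purpose of training the critic networks of the isolated subsystems, it is desired to design $\hat{\omega}_{ci}$, $i = 1, 2, \ldots, N$, to minimize the following objective functions:

$$E_{ci} = \frac{1}{2} e_{ci}^T e_{ci}, \quad i = 1, 2, \ldots, N. \tag{35}$$

The standard steepest descent algorithm is introduced to tune the critic networks, so that their weights are updated through

$$\dot{\hat{\omega}}_{ci} = -\alpha_{ci} \left[ \frac{\partial E_{ci}}{\partial \hat{\omega}_{ci}} \right], \quad i = 1, 2, \ldots, N \tag{36}$$

where $\alpha_{ci} > 0$, $i = 1, 2, \ldots, N$, are the learning rates of the critic networks. On the other hand, based on (30), the Hamiltonian functions take the following forms:

$$H_i(x_i, \mu_i, \omega_{ci}) = Q_i^2(x_i) + \mu_i^T R_i \mu_i + \omega_{ci}^T \nabla \sigma_{ci}(x_i) \dot{x}_i \triangleq e_{cHi}, \quad i = 1, 2, \ldots, N \tag{37}$$

where $e_{cHi} = -(\nabla \varepsilon_{ci}(x_i))^T \dot{x}_i$, $i = 1, 2, \ldots, N$, are the residual errors because of the neural network approximation.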
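The tuning law (36) can be sketched in discrete time. The following is a minimal illustration, assuming an explicit-Euler step of size `dt` (the paper states the law in continuous time), a quadratic basis $\sigma(x) = [x_1^2,\ x_1x_2,\ x_2^2]^T$ for a two-state subsystem, and a scalar running cost $Q^2 + \mu^T R \mu$ supplied by the caller. The function names are hypothetical and are not taken from the paper.

```python
import numpy as np


def grad_sigma(x):
    """Jacobian of the assumed basis sigma(x) = [x1^2, x1*x2, x2^2]; shape (3, 2)."""
    x1, x2 = x
    return np.array([[2.0 * x1, 0.0],
                     [x2, x1],
                     [0.0, 2.0 * x2]])


def critic_step(w_hat, x, xdot, running_cost, alpha, dt):
    """One explicit-Euler step of the steepest-descent law (36).

    running_cost is Q^2(x) + mu^T R mu evaluated at the current sample.
    Returns the updated weights and the residual e_c of (34) before the update.
    """
    delta = grad_sigma(x) @ xdot             # regressor grad(sigma) * xdot
    e_c = running_cost + w_hat @ delta       # approximate Hamiltonian (34)
    # Gradient of E_c = 0.5 * e_c^2 with respect to w_hat is e_c * delta.
    w_next = w_hat - dt * alpha * e_c * delta
    return w_next, e_c
```

On a single fixed sample, one step scales the residual by $1 - \alpha\,\mathrm{d}t\,\|\nabla\sigma\,\dot{x}\|^2$, so in this discretized sketch the step size has to keep that factor inside $(-1, 1)$.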
Denote $\delta_i = \nabla \sigma_{ci}(x_i)\dot{x}_i$, $i = 1, 2, \ldots, N$. We assume that there exist $N$ positive constants $\delta_{Mi}$, $i = 1, 2, \ldots, N$, such that

$$\|\delta_i\| \le \delta_{Mi}, \quad i = 1, 2, \ldots, N. \tag{38}$$

In addition, we define the weight estimation errors of the critic networks as $\tilde{\omega}_{ci} = \omega_{ci} - \hat{\omega}_{ci}$, where $i = 1, 2, \ldots, N$. Then, combining (34) with (37) yields

$$e_{cHi} - e_{ci} = \tilde{\omega}_{ci}^T \delta_i, \quad i = 1, 2, \ldots, N. \tag{39}$$

Therefore, the dynamics of the weight estimation errors can be given as follows:

$$\dot{\tilde{\omega}}_{ci} = \alpha_{ci} \big( e_{cHi} - \tilde{\omega}_{ci}^T \delta_i \big) \delta_i, \quad i = 1, 2, \ldots, N. \tag{40}$$

Incidentally, the persistency of excitation condition is required to tune the $i$th critic network to guarantee that $\|\delta_i\| \ge \delta_{mi}$, where $\delta_{mi}$, $i = 1, 2, \ldots, N$, are positive constants. Thus, a set of probing noises is added to the isolated subsystems to satisfy the condition in practice.

When implementing the online policy iteration algorithm, to accomplish the policy improvement, we should obtain the control policies that minimize the current cost functions. Hence, according to (12) and (30), we have

$$\mu_i(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \nabla V_i(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \big( (\nabla \sigma_{ci}(x_i))^T \omega_{ci} + \nabla \varepsilon_{ci}(x_i) \big) \tag{41}$$

where $i = 1, 2, \ldots, N$. Correspondingly, the approximate control policies can be obtained by

$$\hat{\mu}_i(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) \nabla \hat{V}_i(x_i) = -\frac{1}{2} R_i^{-1} g_i^T(x_i) (\nabla \sigma_{ci}(x_i))^T \hat{\omega}_{ci} \tag{42}$$

where $i = 1, 2, \ldots, N$.

Remark 3: According to (42), the approximate control policies of the $N$ isolated subsystems can be derived directly from the trained critic networks. Therefore, unlike the traditional actor-critic architecture, the action neural networks are no longer required.

C. Stability Analysis

When considering the critic networks, the weight estimation dynamics are UUB as described in the following theorem.

Theorem 3: For the $N$ isolated subsystems (4), let the weight update laws for tuning the critic networks be given by (36). Then, the dynamics of the weight estimation errors of the critic networks are UUB.

Proof: Choose $N$ Lyapunov function candidates described as follows:

$$L_i(t) = \frac{1}{\alpha_{ci}} \mathrm{tr}\big( \tilde{\omega}_{ci}^T \tilde{\omega}_{ci} \big), \quad i = 1, 2, \ldots, N. \tag{43}$$

The time derivatives of the Lyapunov functions $L_i(t)$, $i = 1, 2, \ldots, N$, along the trajectories of the error dynamics (40) are

$$\dot{L}_i(t) = \frac{2}{\alpha_{ci}} \mathrm{tr}\big( \tilde{\omega}_{ci}^T \dot{\tilde{\omega}}_{ci} \big) = \frac{2}{\alpha_{ci}} \mathrm{tr}\Big( \tilde{\omega}_{ci}^T \alpha_{ci} \big( e_{cHi} - \tilde{\omega}_{ci}^T \delta_i \big) \delta_i \Big) \tag{44}$$

where $i = 1, 2, \ldots, N$. After some basic manipulations, it yields

$$\dot{L}_i(t) \le -(2 - \alpha_{ci}) \big\| \tilde{\omega}_{ci}^T \delta_i \big\|^2 + \frac{1}{\alpha_{ci}} e_{cHi}^2 \tag{45}$$

where $i = 1, 2, \ldots, N$. In view of the Cauchy–Schwarz inequality and (38), we can conclude that $\dot{L}_i(t) < 0$ as long as

$$0 < \alpha_{ci} < 2, \qquad \|\tilde{\omega}_{ci}\| > \sqrt{\frac{e_{cHi}^2}{\alpha_{ci}(2 - \alpha_{ci})\delta_{Mi}^2}} \tag{46}$$

where $i = 1, 2, \ldots, N$. In accordance with the Lyapunov stability theory, we obtain that the dynamics of the weight estimation errors of the critic networks are all UUB. Meanwhile, the norms of the weight estimation errors are bounded as well. The proof is completed.

Fig. 2. Convergence of the weight vector of the critic network 1 ($\omega_{ac11}$, $\omega_{ac12}$, and $\omega_{ac13}$ represent $\hat{\omega}_{c11}$, $\hat{\omega}_{c12}$, and $\hat{\omega}_{c13}$, respectively).

Fig. 3. Convergence of the weight vector of the critic network 2 ($\omega_{ac21}$, $\omega_{ac22}$, and $\omega_{ac23}$ represent $\hat{\omega}_{c21}$, $\hat{\omega}_{c22}$, and $\hat{\omega}_{c23}$, respectively).

Fig. 4. 3-D plot of the approximation error of the cost function of isolated subsystem 1, i.e., $J_1^*(x_1) - \hat{V}_1(x_1)$.

Fig. 5. 3-D plot of the approximation error of the control policy of isolated subsystem 1, i.e., $u_1^*(x_1) - \hat{\mu}_1(x_1)$.

Remark 4: Let $\hat{\bar{u}}_i(x_i) = \pi_i \hat{\mu}_i(x_i)$, where $\hat{\mu}_i(x_i)$, $i = 1, 2, \ldots, N$, are obtained by (42). According to the selections of the activation functions of the critic networks, we can easily find that the approximate optimal cost functions $\hat{V}_i(x_i)$, $i = 1, 2, \ldots, N$, are also Lyapunov functions. Furthermore, similar to the proof of Theorem 1, we have $\dot{L}(x) \le -\xi^T A\, \xi + \Sigma_e$, where $\Sigma_e$ is the sum of the approximation errors. Hence, we can conclude that, based on the approximate optimal control policies $\hat{\mu}_i(x_i)$, $i = 1, 2, \ldots, N$, the developed control pair $(\hat{\bar{u}}_1(x_1), \hat{\bar{u}}_2(x_2), \ldots, \hat{\bar{u}}_N(x_N))$ can ensure the UUB of the state trajectories of the closed-loop interconnected system. It is in this sense that we accomplish the design of the decentralized control scheme by adopting the learning optimal control approach based on the online policy iteration algorithm.

Remark 5: Note that the controller presented here is a decentralized stabilizing one. Though the optimal decentralized controller of interconnected systems has been studied before [50], in this paper we aim at developing a novel decentralized control strategy based on ADP. How to extend the present results to the design of optimal decentralized control for nonlinear interconnected systems is part of our future research.
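The policy iteration of Section IV-A (Steps 1 to 4) can be illustrated on a case where every step has a closed form. The sketch below assumes a hypothetical scalar linear subsystem $\dot{x} = ax + bu$ with $Q^2(x) = qx^2$ and $R = r$, linear policies $\mu(x) = -kx$, and quadratic cost functions $V(x) = px^2$; it is not the neural-network implementation of Section IV-B. Under these assumptions (25) reduces to a scalar algebraic equation, (26) becomes $k = bp/r$, and the HJB equation (13) becomes $q + 2ap - b^2p^2/r = 0$.

```python
def policy_iteration_scalar(a, b, q, r, k0, eps=1e-10, max_iter=100):
    """Policy iteration for x' = a x + b u with running cost q x^2 + r u^2.

    Policies are mu(x) = -k x and cost functions are V(x) = p x^2 (assumed forms).
    k0 must be admissible, i.e. a - b*k0 < 0.
    """
    if a - b * k0 >= 0:
        raise ValueError("initial policy must be admissible (a - b*k0 < 0)")
    k, p_prev = k0, 0.0                      # Step 1: V^(0) = 0
    for _ in range(max_iter):
        # Step 2, policy evaluation (25): 0 = q + r k^2 + 2 p (a - b k)
        p = (q + r * k**2) / (2.0 * (b * k - a))
        # Step 3, policy improvement (26): mu = -(1/2) r^-1 b dV/dx = -(b p / r) x
        k = b * p / r
        # Step 4, stopping test on successive cost functions
        if abs(p - p_prev) <= eps:
            break
        p_prev = p
    return p, k


def hjb_residual(a, b, q, r, p):
    """Left-hand side of (13) divided by x^2 for V(x) = p x^2."""
    return q + 2.0 * a * p - (b * p) ** 2 / r
```

For instance, `policy_iteration_scalar(1.0, 1.0, 1.0, 1.0, k0=3.0)` starts from a stabilizing gain and returns a value of `p` for which `hjb_residual` is numerically zero.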

Fig. 6. 3-D plot of the approximation error of the cost function of isolated subsystem 2, i.e., $J_2^*(x_2) - \hat{V}_2(x_2)$.

Fig. 7. 3-D plot of the approximation error of the control policy of isolated subsystem 2, i.e., $u_2^*(x_2) - \hat{\mu}_2(x_2)$.

Fig. 8. State trajectory of subsystem 1 under the action of the decentralized control strategy $(\pi_1\hat{\mu}_1(x_1), \pi_2\hat{\mu}_2(x_2))$.

V. SIMULATION STUDY

A simulation example is provided in this section to show the applicability of the decentralized control strategy established in this paper.

Consider the following continuous-time nonlinear large-scale system consisting of two interconnected subsystems:

$$\begin{aligned} \dot{x}_1 &= \begin{bmatrix} -x_{11} + x_{12} \\ -0.5x_{11} - 0.5x_{12} + 0.5x_{12}(\cos(2x_{11}) + 2)^2 \end{bmatrix} + \begin{bmatrix} 0 \\ \cos(2x_{11}) + 2 \end{bmatrix} \Big( \bar{u}_1(x_1) + (x_{11} + x_{22}) \sin x_{12}^2 \cos(0.5x_{21}) \Big) \\ \dot{x}_2 &= \begin{bmatrix} x_{22} \\ -x_{21} - 0.5x_{22} + 0.5x_{21}^2 x_{22} \end{bmatrix} + \begin{bmatrix} 0 \\ x_{21} \end{bmatrix} \Big( \bar{u}_2(x_2) + 0.5(x_{12} + x_{22}) \cos\big(e^{x_{21}^2}\big) \Big) \end{aligned} \tag{47}$$

where $x_1 = [x_{11}\ x_{12}]^T \in \mathbb{R}^2$ and $\bar{u}_1(x_1) \in \mathbb{R}$ are the state and control variables of subsystem 1, and $x_2 = [x_{21}\ x_{22}]^T \in \mathbb{R}^2$ and $\bar{u}_2(x_2) \in \mathbb{R}$ are the state and control variables of subsystem 2. Let $R_1 = R_2 = I$, where $I$ denotes the identity matrix with suitable dimension. Additionally, let $h_1(x_1) = \|x_1\|$ and $h_2(x_2) = |x_{22}|$. Then, we find that $Z_1(x)$ and $Z_2(x)$ with $x = [x_1^T\ x_2^T]^T$ are upper bounded as in (3). For example, we can select $\lambda_{11} = \lambda_{12} = 1$ and $\lambda_{21} = \lambda_{22} = 1/2$.

To design the decentralized controller of interconnected system (47), we first deal with the optimal control problem of the two isolated subsystems. Here, we choose $Q_1(x_1) = \|x_1\|$ and $Q_2(x_2) = |x_{22}|$. Hence, the cost functions of the optimal control problem are

$$J_1(x_{10}) = \int_0^{\infty} \big\{ x_{11}^2 + x_{12}^2 + u_1^T u_1 \big\}\, d\tau \tag{48}$$

and

$$J_2(x_{20}) = \int_0^{\infty} \big\{ x_{22}^2 + u_2^T u_2 \big\}\, d\tau. \tag{49}$$

We adopt the online policy iteration algorithm to tackle the optimal control problem, where two critic networks are constructed to approximate the cost functions. We denote the weight vectors of the two critic networks as $\hat{\omega}_{c1} = [\hat{\omega}_{c11}\ \hat{\omega}_{c12}\ \hat{\omega}_{c13}]^T$ and $\hat{\omega}_{c2} = [\hat{\omega}_{c21}\ \hat{\omega}_{c22}\ \hat{\omega}_{c23}]^T$.
During the simulation process, the initial weights of the critic networks are chosen randomly in $[0, 2]$. In addition, the activation functions of the two critic networks are chosen as $\sigma_{c1}(x_1) = [x_{11}^2\ \ x_{11}x_{12}\ \ x_{12}^2]^T$ and $\sigma_{c2}(x_2) = [x_{21}^2\ \ x_{21}x_{22}\ \ x_{22}^2]^T$. Besides, let the learning rates of the critic networks be $\alpha_{c1} = \alpha_{c2} = 0.1$ and the initial states of the two isolated subsystems be $x_{10} = x_{20} = [1\ 1]^T$. During the implementation of the online policy iteration algorithm, for each isolated subsystem, we add a probing noise to satisfy the persistency of excitation condition. We observe that the weights have converged after 750 and 180 s, respectively. Then, the probing signals are turned off. The weights of the critic networks converge to

$$\hat{\omega}_{c1} = [\ \ ]^T \tag{50}$$

and

$$\hat{\omega}_{c2} = [\ \ ]^T \tag{51}$$

as shown in Figs. 2 and 3.
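Given critic weights, the approximate policy (42) is a direct matrix product. The sketch below evaluates it for isolated subsystem 1 with the quadratic activation functions chosen above, $R_1 = 1$, and the input gain $g_1(x_1) = [0,\ \cos(2x_{11}) + 2]^T$ as read from (47). The weight vector in the usage note is an assumption: it is the ideal value implied by a cost function of the form $0.5x_{11}^2 + x_{12}^2$, not the learned values of (50). Function names are hypothetical.

```python
import numpy as np


def grad_sigma(x):
    """Jacobian of sigma(x) = [x1^2, x1*x2, x2^2]; shape (3, 2)."""
    x1, x2 = x
    return np.array([[2.0 * x1, 0.0],
                     [x2, x1],
                     [0.0, 2.0 * x2]])


def g1(x):
    """Assumed input gain of subsystem 1, g_1(x_1) = [0, cos(2 x_11) + 2]^T; shape (2, 1)."""
    return np.array([[0.0], [np.cos(2.0 * x[0]) + 2.0]])


def approx_policy(w_hat, x, g, R_inv):
    """Approximate policy (42): -1/2 R^-1 g(x)^T grad_sigma(x)^T w_hat."""
    return -0.5 * R_inv @ g(x).T @ grad_sigma(x).T @ w_hat
```

With `w_hat = np.array([0.5, 0.0, 1.0])` and `R_inv = np.eye(1)`, the returned control equals $-(\cos(2x_{11}) + 2)\,x_{12}$, so no separate action network is needed to evaluate the policy once the critic weights are available.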

Based on the convergent weights $\hat{\omega}_{c1}$ and $\hat{\omega}_{c2}$, we can obtain the approximate optimal cost function and control policy for each isolated subsystem, namely, $\hat{V}_1(x_1)$, $\hat{\mu}_1(x_1)$, $\hat{V}_2(x_2)$, and $\hat{\mu}_2(x_2)$. In comparison, for the method proposed in [43], the optimal cost function and control policy of isolated subsystem 1 are $J_1^*(x_1) = 0.5x_{11}^2 + x_{12}^2$ and $u_1^*(x_1) = -(\cos(2x_{11}) + 2)x_{12}$, respectively. Similarly, the optimal cost function and control policy of isolated subsystem 2 are $J_2^*(x_2) = x_{21}^2 + x_{22}^2$ and $u_2^*(x_2) = -x_{21}x_{22}$. Therefore, for isolated subsystem 1, the error between the optimal cost function and the approximate one is shown in Fig. 4. In addition, the error between the optimal control policy and the approximate version is shown in Fig. 5. Both approximation errors are clearly close to zero, which verifies the good performance of the online learning algorithm. For isolated subsystem 2, we obtain analogous simulation results, shown in Figs. 6 and 7.

Next, by choosing $\theta_1 = \theta_2 = 1$ and $\pi_1 = \pi_2 = 2$, we can guarantee the positive definiteness of the matrix $A$. Thus, $(\pi_1 \hat{\mu}_1(x_1), \pi_2 \hat{\mu}_2(x_2))$ is the decentralized control strategy of the original interconnected system (47). Here, we apply the decentralized control scheme to controlled plant (47) for 40 s and obtain the evolution processes of the state trajectories shown in Figs. 8 and 9. Zooming in on the state trajectories near zero shows that the state trajectories of the closed-loop system are UUB. These simulation results support the validity of the decentralized control approach developed in this paper.

Fig. 9. State trajectory of subsystem 2 under the action of the decentralized control strategy $(\pi_1 \hat{\mu}_1(x_1), \pi_2 \hat{\mu}_2(x_2))$.

VI. CONCLUSION

In this paper, a novel decentralized control strategy is developed to deal with the stabilization problem of a class of continuous-time nonlinear large-scale systems using an online policy iteration algorithm. Initially, the optimal controllers of the isolated subsystems are designed. Then, it is shown that the decentralized control strategy of the overall system can be established by adding feedback gains to the obtained optimal control policies. In addition, the online policy iteration algorithm is introduced to solve the HJB equations iteratively. The cost functions can be approximated by constructing several critic networks, and the expressions of the control policies can be obtained directly. In addition, the dynamics of the estimation errors are proved to be UUB. Finally, a simulation study is presented to demonstrate the validity of the decentralized control strategy.

REFERENCES

[1] L. Bakule, "Decentralized control: An overview," Annu. Rev. Control, vol. 32, no. 1, pp , Apr
[2] D. D. Siljak and A. I. Zecevic, "Control of large-scale systems: Beyond decentralized feedback," Annu. Rev. Control, vol. 29, no. 2, pp , Dec
[3] J. Lavaei, "Decentralized implementation of centralized controllers for interconnected systems," IEEE Trans. Autom. Control, vol. 57, no. 7, pp , Jul
[4] H. F. Grip, A. Saberi, and T. A. Johansen, "Observers for interconnected nonlinear and linear systems," Automatica, vol. 48, no. 7, pp , Jul
[5] S. Mehraeen, S. Jagannathan, and M. L. Crow, "Power system stabilization using adaptive neural network-based dynamic surface control," IEEE Trans. Power Syst., vol. 26, no. 2, pp , May
[6] K. Kalsi, J. Lian, and S. H. Zak, "Decentralized dynamic output feedback control of nonlinear interconnected systems," IEEE Trans. Autom. Control, vol. 55, no. 8, pp , Aug
[7] Z. G. Hou, M. M. Gupta, P. N. Nikiforuk, M. Tan, and L. Cheng, "A recurrent neural network for hierarchical control of interconnected dynamic systems," IEEE Trans. Neural Netw., vol. 18, no. 2, pp , Mar
[8] A. Saberi, "On optimality of decentralized control for a class of nonlinear interconnected systems," Automatica, vol. 24, no. 1, pp , Jan
[9] P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," in Proc. General Syst., Jun. 1977, pp
[10] P. J. Werbos, "Approximate dynamic programming for real-time control and neural modeling," in Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, D. A. White and D. A. Sofge, Eds. New York, NY, USA: Van Nostrand Reinhold, 1992, ch. 13.
[11] H. Zhang, D. Liu, Y. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability. London, U.K.: Springer-Verlag,
[12] F. Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: An introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, pp , May
[13] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits Syst. Mag., vol. 9, no. 3, pp , Jul
[14] J. Fu, H. He, and X. Zhou, "Adaptive learning and control for MIMO system based on adaptive dynamic programming," IEEE Trans. Neural Netw., vol. 22, no. 7, pp , Jul
[15] F. Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Trans. Neural Netw., vol. 22, no. 1, pp , Jan
[16] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. New York, NY, USA: Wiley,
[17] S. N. Balakrishnan, J. Ding, and F. L. Lewis, "Issues on stability of ADP feedback controllers for dynamic systems," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp , Aug
[18] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp , Aug
[19] D. P. Bertsekas, M. L. Homer, D. A. Logan, S. D. Patek, and N. R. Sandell, "Missile defense and interceptor allocation by neurodynamic programming," IEEE Trans. Syst., Man, Cybern. A, Syst. Humans, vol. 30, no. 1, pp , Jan
[20] J. Si and Y. T. Wang, "On-line learning control by association and reinforcement," IEEE Trans. Neural Netw., vol. 12, no. 2, pp , Mar
[21] D. V. Prokhorov and D. C. Wunsch, "Adaptive critic designs," IEEE Trans. Neural Netw., vol. 8, no. 5, pp , Sep

[22] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[23] H. Zhang, Y. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Trans. Neural Netw., vol. 20, no. 9, pp , Sep
[24] H. Zhang, Q. Wei, and D. Liu, "An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games," Automatica, vol. 47, no. 1, pp , Jan
[25] D. Wang, D. Liu, and Q. Wei, "Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach," Neurocomputing, vol. 78, no. 1, pp , Feb
[26] D. Liu, D. Wang, D. Zhao, Q. Wei, and N. Jin, "Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming," IEEE Trans. Autom. Sci. Eng., vol. 9, no. 3, pp , Jul
[27] D. Liu, H. Li, and D. Wang, "H∞ control of unknown discrete-time nonlinear systems with control constraints using adaptive dynamic programming," in Proc. Int. Joint Conf. Neural Netw., Jun. 2012, pp
[28] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Inf. Sci., vol. 220, no. 20, pp , Jan
[29] T. Dierks and S. Jagannathan, "Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp , Jul
[30] Y. Jiang and Z. P. Jiang, "Robust adaptive dynamic programming for large-scale systems with an application to multimachine power systems," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 10, pp , Oct
[31] H. Xu, S. Jagannathan, and F. L. Lewis, "Stochastic optimal control of unknown linear networked control system in the presence of random delays and packet losses," Automatica, vol. 48, no. 6, pp , Jun
[32] S. Mehraeen and S. Jagannathan, "Decentralized optimal control of a class of interconnected nonlinear discrete-time systems by using online Hamilton-Jacobi-Bellman formulation," IEEE Trans. Neural Netw., vol. 22, no. 11, pp , Nov
[33] J. W. Park, R. G. Harley, and G. K. Venayagamoorthy, "Decentralized optimal neuro-controllers for generation and transmission devices in an electric power network," Eng. Appl. Artif. Intell., vol. 18, no. 1, pp , Feb
[34] J. Liang, G. K. Venayagamoorthy, and R. G. Harley, "Wide-area measurement based dynamic stochastic optimal power flow control for smart grids with high variability and uncertainty," IEEE Trans. Smart Grid, vol. 3, no. 1, pp , Mar
[35] H. N. Wu and B. Luo, "Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear H∞ control," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 12, pp , Dec
[36] S. G. Khan, G. Herrmann, F. L. Lewis, T. Pipe, and C. Melhuish, "Reinforcement learning and optimal adaptive control: An overview and implementation examples," Annu. Rev. Control, vol. 36, no. 1, pp , Apr
[37] Z. Ni, H. He, and J. Wen, "Adaptive learning in tracking control based on the dual critic network design," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp , Jun
[38] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp , May
[39] Y. Jiang and Z. P. Jiang, "Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics," Automatica, vol. 48, no. 10, pp , Oct
[40] A. Heydari and S. N. Balakrishnan, "Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 1, pp , Jan
[41] M. Abu-Khalaf and F. L. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp , May
[42] D. Vrabie and F. L. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Netw., vol. 22, no. 3, pp , Apr
[43] K. G. Vamvoudakis and F. L. Lewis, "Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem," Automatica, vol. 46, no. 5, pp , May
[44] D. Liu, X. Yang, and H. Li, "Adaptive optimal control for a class of continuous-time affine nonlinear systems with unknown internal dynamics," in Neural Computing and Applications. New York, NY, USA: Springer-Verlag, Nov
[45] S. Bhasin, M. Johnson, and W. E. Dixon, "A model-free robust policy iteration algorithm for optimal control of nonlinear systems," in Proc. 49th IEEE Conf. Decision Control, Dec. 2010, pp
[46] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp , Jan
[47] S. T. Glad, "On the gain margin of nonlinear and optimal regulators," IEEE Trans. Autom. Control, vol. 29, no. 7, pp , Jul
[48] J. N. Tsitsiklis and M. Athans, "Guaranteed robustness properties of multivariable nonlinear stochastic optimal regulators," IEEE Trans. Autom. Control, vol. 29, no. 8, pp , Aug
[49] R. W. Beard, G. N. Saridis, and J. T. Wen, "Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation," Automatica, vol. 33, no. 12, pp , Dec
[50] D. D. Siljak, Decentralized Control of Complex Systems. Boston, MA, USA: Academic,

Derong Liu (S'91-M'94-SM'96-F'05) received the B.S. degree in mechanical engineering from the East China Institute of Technology (now Nanjing University of Science and Technology), Nanjing, China, in 1982, the M.S. degree in automatic control theory and applications from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1987, and the Ph.D. degree in electrical engineering from the University of Notre Dame, Notre Dame, IN, USA, in He was a Product Design Engineer with China North Industries Corporation, Jilin, China, from 1982 to He was an Instructor with the Graduate School of the Chinese Academy of Sciences, Beijing, from 1987 to He was a Staff Fellow with the General Motors Research and Development Center, Warren, MI, USA, from 1993 to He was an Assistant Professor with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, USA, from 1995 to He joined the University of Illinois at Chicago, Chicago, IL, USA, in 1999, and became a Full Professor of electrical and computer engineering and computer science in He was selected for the 100 Talents Program by the Chinese Academy of Sciences in He has published 14 books (six research monographs and eight edited volumes). Dr. Liu was an Associate Editor of Automatica from 2006 to He serves as an Associate Editor of Neurocomputing, International Journal of Neural Systems, Soft Computing, Neural Computing and Applications, Journal of Control Science and Engineering, and Science in China Series F: Information Sciences. He was an elected member of the Board of Governors of the International Neural Network Society from 2010 to He is a Governing Board Member of Asia Pacific Neural Network Assembly.
He was a member of the Conference Editorial Board of the IEEE Control Systems Society from 1995 to 2000, an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: FUNDAMENTAL THEORY AND APPLICATIONS from 1997 to 1999, the IEEE TRANSACTIONS ON SIGNAL PROCESSING from 2001 to 2003, the IEEE TRANSACTIONS ON NEURAL NETWORKS from 2004 to 2009, the IEEE Computational Intelligence Magazine from 2006 to 2009, and the IEEE Circuits and Systems Magazine from 2008 to 2009, and the Letters Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS from 2006 to He was the Founding Editor of the IEEE Computational Intelligence Society's Electronic Letter from 2004 to Currently, he is the Editor-in-Chief of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS and an Associate Editor of the IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY and the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS. He is the General Chair of the 2014 IEEE World Congress on Computational Intelligence, Beijing, China. He was an elected AdCom member of the IEEE Computational Intelligence Society from 2006 to He is the Chair of IEEE CIS Beijing Chapter. He received the Michael J. Birck Fellowship from the University of Notre Dame in 1990, the Harvey N. Davis Distinguished Teaching Award from Stevens Institute of Technology in 1997, the Faculty Early Career Development (CAREER) Award from the National Science Foundation in 1999, the University Scholar Award from the University of Illinois in 2006, and the Overseas Outstanding Young Scholar Award from the National Natural Science Foundation of China in He is a member of Eta Kappa Nu and a fellow of the INNS.


Ding Wang received the B.S. degree in mathematics from the Zhengzhou University of Light Industry, Zhengzhou, China, the M.S. degree in operations research and cybernetics from Northeastern University, Shenyang, China, and the Ph.D. degree in control theory and control engineering from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2007, 2009, and 2012, respectively. He is currently an Assistant Professor with The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences. His current research interests include adaptive dynamic programming, neural networks and learning systems, and complex systems and intelligent control.

Hongliang Li received the B.S. degree in mechanical engineering and automation from the Beijing University of Posts and Telecommunications, Beijing, China, in He is currently pursuing the Ph.D. degree with The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing. He is with the University of Chinese Academy of Sciences, Beijing. His current research interests include machine learning, neural networks, reinforcement learning, adaptive dynamic programming, and game theory.
