Online Partially Model-Free Solution of Two-Player Zero Sum Differential Games

Size: px

Start display at page:

Download "Online Partially Model-Free Solution of Two-Player Zero Sum Differential Games"

Randolph Gordon
6 years ago
Views:

1 Preprins of he 10h IFAC Inernaional Symposium on Dynamics and Conrol of Process Sysems The Inernaional Federaion of Auomaic Conrol Online Parially Model-Free Soluion of Two-Player Zero Sum Differenial Games Praveen P Shubhendu Bhasin Deparmen of Elecrical Engineering, Indian Insiue of Technology Delhi, India ( eez138263@ee.iid.ac.in) Deparmen of Elecrical Engineering, Indian Insiue of Technology Delhi, India ( sbhasin@ee.iid.ac.in) Absrac: An online adapive dynamic programming based ieraive algorihm is proposed for a wo-player zero sum linear differenial game problem arising in he conrol of process sysems affeced by disurbances. The objecive in such a scenario is o obain an opimal conrol policy ha minimizes he specified performance index or cos funcion in presence of wors case disurbance. Convenional algorihms for he soluion of such problems require full knowledge of sysem dynamics. The algorihm proposed in his paper is parially model-free and solves he wo-player zero sum linear differenial game problem wihou knowledge of sae and conrol inpu marices. Keywords: Two-player zero sum differenial game, Adapive dynamic programming, Approximae Dynamic Programming 1. INTRODUCTION In a wo-player zero sum game, each player s gain is exacly balanced by he loss of his compeior and he ne gain of he players a any poin in ime is zero [Isaacs (1965)]. In scenarios of perfec compeiion such as ha exis in a zero sum game, each player ries o make he bes possible decision aking ino accoun he fac ha his opponen also ries o do he same. The heory of zero sum differenial game finds applicaions in a wide variey of disciplines including ha of conrol heory [Tomlin e al. (2000); Wei and Liu (2012)]. The problem of designing opimal conroller for any process sysem subjec o wors possible disurbance is a minimax opimizaion problem where he conroller ries o minimize and he disurbance ries o maximize he infinie horizon quadraic cos. Using game heoreic framework, he above minimax opimizaion problem can be viewed as a zero sum differenial game where he conroller acs as he minimizer or minimizing player, while he disurbance acs as he maximizer or maximizing player [Basar and Bernhard (1995); Basar and Olsder (1995)]. The opimal conrol in such a scenario is equivalen o finding he Nash equilibrium or saddle poin equilibrium of he corresponding wo-player zero sum differenial game [Basar and Bernhard (1995)]. To obain he saddle poin equilibrium sraegy of each player in he wo-player zero sum linear differenial game, one needs o solve an Algebraic Riccai Equaion (ARE) wih a sign indefinie quadraic erm known as he Game Algebraic Riccai Equaion (GARE). [Kleinman (1968)] proposed an ieraive mehod for solving he ARE wih a sign definie quadraic erm. A series of Lyapunov equaions are consruced a each ieraion and he posiive semi-definie soluions of he Lyapunov equaions are shown o converge o he soluion of ARE. However he algorihm proposed in [Kleinman (1968)] for solving he ARE could no be exended o he GARE due o he presence of a sign indefinie quadraic erm. Subsequenly, several Newon-ype algorihms were proposed for he soluion he GARE [Arnold and Laub (1984); Damm and Hinrichsen (2001); Mehrmann and Tan (1988)]. [Mehrmann (1991)] and [Sima (1996)] proposed a marix sign funcion mehod for solving GARE. A more robus ieraive algorihm for solving GARE was proposed by [Lanzon e al. (2008)], where he GARE wih a sign indefinie quadraic erm is replaced by a sequence of AREs wih sign definie quadraic erms. Each of hese AREs can hen be sequenially solved using Kleinman s algorihm or any oher exising algorihm and he recursive soluion of hese AREs converge o he soluion of GARE. However, all he above resuls require knowledge of full sysem dynamics, which is a severe resricion owing o he uncerainies in sysem modelling. The concep of Adapive Dynamic Programming (ADP) was proposed by [Werbos (1992)] for solving he dynamic programming problems relaed o classical opimal conrol in a forward in ime fashion. ADP is based on he conceps of Dynamic Programming [Bellman (2003)] and Reinforcemen Learning [Suon and Baro (1998)] and has been widely used o reach approximae soluions of opimal conrol problems [Abu-Khalaf and Lewis (2005); Lewis and Liu (2013)]. The classical opimal conrol problem is a single player linear differenial game problem [Isaacs (1965)] and a deailed analysis of he applicaion of ADP for he soluion of single player linear differenial game problems (opimal conrol problems) is provided in [Wang e al. (2009)] and [Lewis e al. (2012)]. Copyrigh 2013 IFAC 696

2 Recenly, [Vrabie e al. (2009)] proposed a parially modelfree, online algorihm for finding he soluion of single player zero sum differenial game in coninuous ime. However he proposed algorihm required parial knowledge of sysem dynamics, namely he conrol inpu marix. Subsequenly, a compleely model-free algorihm was proposed by [Jiang and Jiang (2012)] in order o arrive a he opimal conrol of linear sysem wihou requiring knowledge of he conrol inpu and he sysem drif marices. Exending he principle of ADP o differenial games, a model-free algorihm was proposed by [Al-Tamimi e al. (2007)] for solving he wo-player linear differenial zero sum game appearing in discree ime H conrol. [Abu- Khalaf e al. (2008)] proposed an algorihm o solve coninuous ime wo-player nonlinear differenial game problem wih full knowledge of sysem dynamics. In [Vrabie and Lewis (2011)], he saddle poin soluion of a wo player zero sum differenial game was obained using a parially model-free algorihm. The algorihm required no knowledge of sysem drif dynamics, while he knowledge of conrol inpu and disurbance marices was sill needed. In his paper, an online parially model free ieraive algorihm is proposed o solve he wo-player zero sum differenial game problem. The conribuion of his resul over exising resuls in he lieraure is ha he saddle poin soluion is arrived a wihou he knowledge of sysem drif and conrol inpu marices. Only knowledge of he disurbance inpu marix is required and hence, he proposed algorihm is comparaively more robus han he one proposed in [Vrabie and Lewis (2011)]. The proposed algorihm hereby reduces he need of rigorous sysem idenificaion requiremens, which oherwise is necessary for obaining he sysem drif and conrol inpu marices. The algorihm is daa-driven and he saddle poin soluion of he game problem can be achieved online. The algorihm makes use of wo ieraive seps namely, he ouer level ieraion and he inner level ieraion. In he ouer level ieraion, he wo-player zero sum differenial game problem is convered o a sequence of opimal conrol problems. The inner level ieraion moivaed by he work of [Jiang and Jiang (2012)], is used o find he modelfree soluion of each of hese opimal conrol problems. I is shown ha he soluion of opimal conrol problems converge o he saddle poin equilibrium soluion of he underlying game problem. The algorihm is implemened on a linearized power sysem model [Wang e al. (1963)] o arrive a he opimal conroller in presence of wors case disurbance. The algorihm can be exended o any process sysem wih parially unknown dynamics. 2.1 Problem Formulaion 2. PRELIMINARIES The H conroller design was formulaed by [Basar and Bernhard (1995)] as a wo-player zero sum differenial game. Consider he differenial game formulaion given below, ẋ = Ax + B 1 w + B 2 u (1) y = Cx, (2) where x R n is he sae vecor assumed o be measurable, u R m is he conrol inpu and w R d is he disurbance inpu. The conroller player s objecive is o minimize he infinie horizon cos funcion given by, J(x(0), u(), w()) = (x T C T Cx + u T u w T w)d. (3) A he same ime, he disurbance player who is in perfec compeiion wih he conroller player ries o maximize he cos funcion (3). The opimal sraegies o be adoped by he players in he zero sum game is ermed as he saddle-poin equilibrium. The saddle poin equilibrium sraegy for each player is defined as follows:- Definiion 1. A pair of sraegies {u, w } is in saddle poin equilibrium, if he following se of inequaliies are saisfied for all permissible u and w [Basar and Olsder (1995)]: J(x, u, w) J(x, u, w ) J(x, u, w ) (4) The quaniy J(x, u, w ) is called he saddle poin value of he zero sum game. The wo-player zero sum differenial game described above is a wo-player opimizaion problem and he saddle poin sraegies can be found by using he Ponryagin s Maximum Principle. Hamilonian for he cos funcion in (1) and he sysem in (3) can be defined as, H(x, u, v, λ) = 1 2 (xt C T Cx + u T u w T w) + λ T (Ax + B 1 w + B 2 u). (5) Applying he necessary condiion for opimaliy of H u = 0 and H w = 0, u = B T 2 Π x() (6) w = B T 1 Π x(), (7) where Π is he posiive definie symmeric soluion o he ARE given by, 0 = A T Π + Π A + C T C Π (B 2 B T 2 B 1 B T 1 )Π. (8) The saddle poin equilibrium value is given by, J(x(), u, w ) = x() T Π x(). (9) Hence, o obain he saddle poin equilibrium sraegy of each player in he wo-player zero sum differenial game, one needs o solve he ARE (8) wih he sign indefinie quadraic erm, Π(B 2 B T 2 B 1 B T 1 )Π. 2.2 Ieraive Soluion Using Knowledge of Full Sysem Dynamics The ieraive soluion of he GARE (8) wih full knowledge on sysem dynamics was given by [Lanzon e al. (2008)]. Consider real marices A, B 1, B 2 and C, where (C,A) is observable and (A, B 2 ) is sabilizable. Then, he ieraive mehod for solving he ARE (8) is given as [Lanzon e al. (2008)]: 697

3 1. Sar wih Π 0 = 0. (10) 2. Solve, for he unique posiive definie Z k where k 0, 0 = (A + B 1 B T 1 Π k 1 B 2 B T 2 Π k 1 ) T Z k where, + Z k (A + B 1 B T 1 Π k 1 B 2 B T 2 Π k 1 ) F (Π k 1 ) = A T Π k 1 + Π k 1 A + C T C 3. Updae Z k B 2 B T 2 Z k + F (Π k 1 ), (11) Π k 1 (B 2 B T 2 B 1 B T 1 )Π k 1. (12) Π k = Π k 1 + Z k. (13) where he Π k and Z k possess he following properies: (1) (A + B 1 B T 1 Π k, B 2 ) is sabilizable k (2) F (Π k+1 ) = Z k B 1 B T 1 Z k k (3) A + B 1 B T 1 Π k B 2 B T 2 Π k+1 is Hurwiz k (4) Π Π k+1 Π k 0 k Then, lim k Πk = Π. (14) The convergence proof of he algorihm o he unique posiive definie soluion, Π of he GARE is provided by [Lanzon e al. (2008)]. 3.1 Soluion Approach 3. MAIN RESULTS In he wo-player zero sum differenial game defined hrough (1) and (3), he conroller and he disurbance are wo compeing players. If any of he players adops a consan sraegy, i becomes a single player opimizaion problem or a single player differenial game problem (opimal conrol problem). In he proposed game sraegy, he conroller acs as he acive, opimising player and disurbance acs a passive player, whose policy remains consan during a ime frame. Wih disurbance policy remaining consan, he wo-player zero sum game akes he shape of a classical opimal conrol problem. The soluion approach proposed is o break down he original wo-player zero sum differenial game ino a sequence of opimal conrol problems. The acive conroller player seeks he opimal policy which minimizes he cos funcion of he associaed opimal conrol problem. Afer he conroller player converges o his opimal policy, he disurbance player updaes his policy accordingly. Wih he changed disurbance policy, a new opimal conrol problem is consruced and he conroller player hen seeks he new opimal policy. The aim of his paper is o arrive a he saddle poin soluion of he game wih less dependency on he knowledge of sysem dynamics. This objecive ranslaes o solving each of he opimal conrol problem wihou knowledge of A and B 2 marices. The proposed algorihm only requires knowledge of B 1 for implemenaion. The game soluion is achieved using wo ieraive seps namely ouer level ieraion and he inner level ieraion. In each ouer level ieraion, an opimal conrol problem is consruced from he underlying game problem. The inner level ieraion process is embedded wihin each ouer level ieraion and finds he online soluion of each of hose opimal conrol problems. I is shown ha he inner level ieraions converge o he soluion of he corresponding opimal conrol problem and he ouer level ieraions converge o he saddle poin soluion of he game problem. Ouer Level Ieraion The ouer level ieraion is moivaed by [Lanzon e al. (2008)], where an opimal conrol problem is consruced considering ha he disurbance player acs as a passive player whose policy remains consan during each ieraion. The conroller player, afer considering he disurbance player s policy, hen minimizes he cos funcion by solving he corresponding opimal conrol problem. Le a he k h ouer level ieraion, he disurbance player chooses he sraegy, w = B T 1 Π k 1 x. Considering ha he sraegy of he passive disurbance player remains consan, he game cos funcion (3) can be ransformed ino he following opimal conrol form, J(x(0), u()) = (x T (C T C Π k 1 B 1 B T 1 Π k 1 )x + u T u)d. (15) subjec o sysem dynamics given by,. x = (A + B 1 B T 1 Π k 1 )x + B 2 u. (16) Lemma 2. The soluion of opimal conrol problem defined by (15) and (16) is equivalen o he ieraion seps defined by (11) and (13). Proof. The soluion of he opimal conrol problem defined in (15) and (16) can be found by solving he ARE, 0 = (A + B 1 B T 1 Π k 1 ) T Π k + Π k (A + B 1 B T 1 Π k 1 ) Π k B 2 B T 2 Π k + (C T C Π k 1 B 1 B T 1 Π k 1 ). (17) In he equaion (11), subsiuing for F (Π k 1 ) from (12) and Z k from (13) resul in he ARE (17), which proves he lemma. A he k h ouer level ieraion, he opimal conrol problem o minimize he cos funcion (15) will be found subjec o he modified sysem dynamical equaion (16) resuling in he opimal values, Π k and K k. The conroller sraegy hen becomes, u k = K k x. The disurbance being a passive player, adops he policy w k = B1 T Π k x. Wih he new disurbance policy, he ouer level ieraion is carried ou where he opimal conrol problem defined by (15) and (16) are modified and he opimal conrol soluion is sough again o obain K k+1 and Π k+1. The process is coninued ill convergence where K k+1 K k ɛ 1, where ɛ 1 is a specified hreshold. A convergence, he policies u k = K k+1 x and w k = B T 1 Π k+1 x converge o he saddle poin equilibrium values u = K x and w = B T 1 Π x, for conroller and disurbance respecively. 698

4 Inner Level Ieraion The opimal conrol problem consruced in each ouer level ieraion is solved using he inner level ieraion process. The inner level ieraion is an ADP algorihm based on he work by [Jiang and Jiang (2012)]. I comprises of a policy evaluaion sage and a policy improvemen sage. A he policy evaluaion sage, he curren conrol policy is evaluaed and he policy is updaed in he policy improvemen sage. Lemma 3. The opimal conrol problem defined by (15) and (16) can be solved by sequenially using he following Lyapunov equaions [Kleinman (1968)], (A k i ) T Π i + Π i A k i + C T C Π k 1 B 1 B T 1 Π k 1 + (K i ) T K i = 0. (18) and performing policy updae as, K i+1 = B T 2 Π i, (19) where A k i = A + B 1B T 1 Π k 1 B 2 K i and i is he ieraion. Then, lim K i = K k. i Proof. The proof is provided by [Kleinman (1968)]. The Kleinman s algorihm guaranees he convergence of K i o he opimal value of k h ouer level ieraion, K k. Wih an inenion o arrive a he soluion of opimal conrol problem wihou knowledge on sysem marices A and B 2, he equaion (16) is reformulaed as [Jiang and Jiang (2012)],. x = A k i x + B 2 (u + K i x), (20) where A k i = A + B 1B1 T Π k 1 B 2 K i, k is he ouer level ieraion sage and i is he curren inner level ieraion sage. Wih quadraic paramerizaion, he infinie horizon cos (15) a i h inner level ieraion of k h ouer level ieraion can be wrien as, J i (x()) = x() T Π i x() where Π i is a posiive definie symmeric marix. Then using (18), (19) and (20), dj i d = ẋ()t Π i x() + x() T Π i ẋ() = (A k i x + B 2 (u + K i x)) T Π i x() + x() T Π i (A k i x + B 2 (u + K i x)) = x() T ((A k i ) T Π i + Π i A k i )x() + 2(u + K i x) T B T 2 Π i x() = x() T (C T C Π k 1 B 1 B T 1 Π k 1 + K i T K i )x() + 2(u + K i x) T B T 2 Π i x() (21) Then using K i+1 = B T 2 Π i from (19), i follows from (21) ha, x() T Π i x() x( + T ) T Π i x( + T ) = x T (C T C Π k 1 B 1 B T 1 Π k 1 + K i T K i )dτ 2 where i is he inner ieraion sage. (u + K i x) T K i+1 xdτ, (22) The LHS of he equaion (22) can be reformulaed as x() T Π i x() x(+t ) T Π i x(+t ) = ( x() x(+t )) T ˆΠi. (23) where x() is a modified Kronecker produc vecor wih elemens {x p ()x q ()} p=1:n;q=1:n and ˆΠ i is a vecor formed by sacking he diagonal and upper riangular par of a symmeric marix ino a vecor whose off diagonal erms are muliplied by wo. The inegrand in he second par of RHS can hen be modified as (u + K i x) T K i+1 x = [(x T u T ) + (x T x T )(I n (K i ) T )] ˆK i+1. (24) where denoes Kronecker produc and ˆK i+1 is he vecor form of marix K i+1 sacking columns one over anoher. Now using (23) and (24), he enire equaion (22) can be rewrien as, ( x() x( + T )) T ˆΠi = + 2[(x T u T ) + (x T x T )(I n K T i )] ˆK i+1 x T (C T C Π k 1 B 1 B T 1 Π k 1 + (K i ) T K i )dτ. (25) The equaion (25) can be ransformed ino linear regression form given by Xθ = Y (26) where X = [( x() x(+t )) T 2((x T u T )+(x T x T )(I n Ki T ))], he regression vecor θ = [(Π ˆ i ) T ( ˆK i+1 ) T ] T, and Y = x T (C T C Π k 1 B 1 B T 1 Π k 1 + (K i ) T K i )dτ. Assuming C and B 1 are known, he cos X can be measured along he sae rajecory. Afer geing enough measuremens in Y and X, θ can be obained as a leas square esimae wihou knowledge of A and B 2. ˆθ = (X T X) 1 X T Y. (27) The above esimaion sage is he policy evaluaion sage of ADP, where Π i and K i+1 are esimaed from measuremens of he sysem saes. These measuremens can be considered as reinforcemens from he sysem, analogous o he one described in he heoreical framework of reinforcemen learning. The minimum number of measuremens required for he soluion of he leas square problem (27) equals he number of unknowns in he regression vecor, θ. Afer he policy evaluaion sage, policy updaion sage is performed where he curren policy is improved as u i+1 = K i+1 x + e, where e is he exploraion signal. Remark 4. The algorihm which is based on he ADP conceps of policy evaluaion and policy updaion requires persisence of exciaion (PE) so ha (27) is a well posed leas square problem. In cases wherein he sysem saes reach a saionary phase, persisence of exciaion condiion fails and i may affec he numerical sabiliy [Lee e al. (2012)] of he algorihm. Hence, exploraion signals 699

5 are added o he conrol inpu as i ensures ha sysem saes never become saionary and hence saisfy he PE condiion. Any sufficienly exciing signal like random noise or sinusoidal signal can be used as exploraion signal. The inner level ieraion defined by (26) is done again wih measuremens obained wih new conroller policy, u i+1 and he process is coninued ill convergence. A hreshold ɛ 2 is specified and in each ouer level ieraion, he inner level ieraions are carried ou ill K i+1 K i ɛ 2. The opimal conrol problem specified by (15) and (16) is solved once inner level ieraion is compleed. The opimal values Ki and Π i are hen be passed on for updaion in nex ouer level ieraion as K k and Π k. Remark 5. As K k is direcly esimaed hrough he inner level ieraion process, he opimal conrol can be arrived a wih ou he knowledge of A and B 2 marices. 3.2 Algorihm The proposed algorihm consiss of ouer level and inner level ieraive sages deailed as follows:- (1) Sar he algorihm wih an iniial sabilizing policy K 0 and Π 0 = 0. (2) Le k be he curren ouer ieraion sage wih known values for K k 1 and Π k 1. The disurbance policy is defined as w k = B 1 Π k 1 x. (3) Ouer level ieraion :- Wih he disurbance policy remaining consan, solve he corresponding opimal conrol problem defined by (15) and (16) for Π k and K k using he following inner level ieraion. (4) Inner level ieraion :- (a) Solve online for Π i and K i+1 from sae measuremens where i denoes he inner level ieraion sage, = ( x() x( + T )) T Π i + 2[(x T u T ) + (x T x T )(I n K T i )]K i+1 x T (C T C Π k 1 B 1 B T 1 Π k 1 +K T i K i )dτ. (b) Updae u = K i+1 x + e. (c) Coninue inner level ieraion ill K i+1 K i ɛ 2. (5) Se K k = K i and Π k = Π i. (6) Updae disurbance policy as w k+1 = B T 1 Π k x. Modify he opimal conrol problem described by (15) and (16) using updaed disurbance policy. (7) Coninue ouer level ieraion(seps (3) o (6)) ill K k K k 1 ɛ 1. (8) Se saddle poin policy, K = K k. Remark 6. In he Sep (1) of he algorihm, he iniial sabilizing gain K 0 can be obained by using some nominal knowledge abou he sysem, e.g for sable sysems K 0 can be chosen. 3.3 Convergence Analysis. Theorem 7. The proposed algorihm defined by he ouer level and inner level ieraive sages (Seps (1) o (6) in Secion 3.2) converge o he saddle poin soluion of he (28) wo-player zero sum game. The proposed algorihm creaes a sequence of conrol policies, {K k, k= 1,2,3...} which converges o he saddle poin conrol policy, K so ha lim k Kk K = 0. Proof. The proposed algorihm is based on ouer and inner level ieraive sages. Convergence of proposed algorihm is proved by proving convergence of inner level and ouer level ieraions. Since he inner level ieraion is embedded in he ouer level ieraion, he convergence of inner level ieraion is analyzed firs. The inner level ieraion based on (28) is solved using leas square esimae of K i+1 and Π i using (27). Since (28) is derived from (18) and (19) using (20), (21), (22) and (23), he inner level ieraion sage is equivalen o Kleinman s ieraion (18) and (19). Hence convergence of inner level ieraion is proved using Lemma 3. The ouer level ieraions are based on he soluion of (15) and (16), which is equivalen o he soluion of (11) and (13) using Lemma 2. Then using [Lanzon e al. (2008)], convergence of ouer level ieraion is proved. The convergence resuls of inner level and he ouer level ieraions guaranee convergence of he proposed algorihm o he saddle poin soluion of he wo-player zero sum differenial game. Remark 8. The proposed algorihm assumes knowledge of he disurbance inpu marix while no requiring he knowledge of drif dynamics and conrol inpu marix. 4. SIMULATION RESULTS The algorihm is implemened on a linearized power sysem model proposed in [Wang e al. (1963)]. The sysem marices of he nominal model are A = B 2 = [ ] T, [B 1 = [ ] T,, C = I 4. (29) Wih complee knowledge of sysem dynamics, he feedback gain of he conroller player corresponding o he saddle poin equilibrium can be calculaed using he algorihm pu forward by [Lanzon e al. (2008)] as, K saddle = [ ] (30) The adapive algorihm proposed in his paper does no require he knowledge of A and B 2 and only he knowledge of B 1 is required. Using he proposed algorihm wih random noise as exploraion signal, he following saddle poin policy is obained afer four ouer level ieraions and eleven inner level ieraions. K alg = [ ] (31) 700

6 Fig. 1. Evoluion of Feedback Gain 5. CONCLUSION An adapive dynamic programming based algorihm for finding an online saddle poin soluion of wo-player zero sum linear differenial games is presened in his paper. The proposed algorihm is capable of arriving a he saddle poin soluion of he game problem wih parial knowledge of sysem dynamics, i.e only he knowledge of disurbance inpu marix is required. The algorihm is used o find he opimal conrol policy in a power sysem under wors case disurbance. Fuure effors will be focussed on making he algorihm compleely model-free so ha saddle poin soluion can be achieved wihou any knowledge sysem dynamics. REFERENCES Abu-Khalaf, M., Lewis, F., and Huang, J. (2008). Neurodynamic programming and zero-sum games for consrained conrol sysems. IEEE Transacions on Neural Neworks, 19(7), Abu-Khalaf, M. and Lewis, F.L. (2005). Nearly opimal conrol laws for nonlinear sysems wih sauraing acuaors using a neural nework HJB approach. Auomaica, 41(5), Al-Tamimi, A., Lewis, F.L., and Abu-Khalaf, M. (2007). Model-free Q-learning designs for linear discree-ime zero-sum games wih applicaion o H-infiniy conrol. Auomaica, 43(3), Arnold, W.F., I. and Laub, A. (1984). Generalized eigenproblem algorihms and sofware for Algebraic Riccai Equaions. Proceedings of he IEEE, 72(12), Basar, T. and Bernhard, P. (1995). H-infiniy Opimal Conrol and Relaed Minimax Design Problems. Birkhauser. Basar, T. and Olsder, G.J. (1995). Dynamic Non- Cooperaive Game Theory. SIAM. Bellman, R.E. (2003). Dynamic Programming. Dover Publicaions. Damm, T. and Hinrichsen, D. (2001). Newon s mehod for a raional marix equaion occurring in sochasic conrol. Linear Algebra and is Applicaions, 332(0), Isaacs, R. (1965). Differenial Games: A Mahemaical Theory wih Applicaions o Warfare and Pursui, Conrol and Opimizaion. Dover Publicaions. Jiang, Y. and Jiang, Z. (2012). Compuaional adapive opimal conrol for coninuous-ime linear sysems wih compleely unknown dynamics. Auomaica, 48(10), Kleinman, D. (1968). On an ieraive echnique for Riccai equaion compuaions. IEEE Transacions on Auomaic Conrol, 13(1), Lanzon, A., Feng, Y., Anderson, B., and Rokowiz, M. (2008). Compuing he posiive sabilizing soluion o Algebraic Riccai Equaions wih an indefinie quadraic erm via a recursive mehod. IEEE Transacions on Auomaic Conrol, 53(10), Lee, J.Y., Park, J.B., and Choi, Y.H. (2012). Inegral Q-learning and explorized policy ieraion for adapive opimal conrol of coninuous-ime linear sysems. Auomaica, 48(11), Lewis, F. and Liu, D. (2013). Reinforcemen Learning and Approximae Dynamic Programming for Feedback Conrol. Wiley-IEEE Press. Lewis, F., Vrabie, D., and Vamvoudakis, K. (2012). Reinforcemen learning and feedback conrol: Using naural decision mehods o design opimal adapive conrollers. IEEE Conrol Sysems Magazine, 32(6), Mehrmann, V. and Tan, E. (1988). Defec correcion mehod for he soluion of Algebraicl Riccai Equaions. IEEE Transacions on Auomaic Conrol, 33(7), Mehrmann, V. (1991). The Auonomous Linear Quadraic Conrol Problem: Theory and Numerical Soluion. Lecure Noes in Conrol and Informaion Sciences. Springer. Sima, V. (1996). Algorihms for Linear Quadraic Opimizaion. Marcel Dekker Incorporaed. Suon, R.S. and Baro, A.G. (1998). Reinforcemen Learning: An Inroducion (Adapive Compuaion and Machine Learning). A Bradford Book. Tomlin, C., Lygeros, J., and Sasry, S. (2000). A game heoreic approach o conroller design for hybrid sysems. Proceedings of he IEEE, 88(7), Vrabie, D., Pasravanu, O., Abu-Khalaf, M., and Lewis, F. (2009). Adapive opimal conrol for coninuous-ime linear sysems based on policy ieraion. Auomaica, 45(2), Vrabie, D. and Lewis, F. (2011). Adapive dynamic programming for online soluion of a zero-sum differenial game. Journal of Conrol Theory and Applicaions, 9(3), Wang, F.Y., Zhang, H., and Liu, D. (2009). Adapive dynamic programming: An inroducion. IEEE Compuaional Inelligence Magazine, 4(2), Wang, Y., Zhou, R., and Wen, C. (1963). Robus loadfrequency conroller design for power sysems. IEE Proceedings C Generaion, Transmission and Disribuion, 140(1), Wei, Q. and Liu, D. (2012). Self-learning conrol schemes for wo-person zero-sum differenial games of coninuous-ime nonlinear sysems wih sauraing conrollers. In Inernaional conference on Advances in Neural Neworks, Werbos, P. (1992). Approximae dynamic programming for real-ime conrol and neural modeling. Handbook of Inelligen Conrol: Neural, Fuzzy and Adapive Approaches. 701

Sliding Mode Extremum Seeking Control for Linear Quadratic Dynamic Game

Sliding Mode Extremum Seeking Control for Linear Quadratic Dynamic Game Sliding Mode Exremum Seeking Conrol for Linear Quadraic Dynamic Game Yaodong Pan and Ümi Özgüner ITS Research Group, AIST Tsukuba Eas Namiki --, Tsukuba-shi,Ibaraki-ken 5-856, Japan e-mail: pan.yaodong@ais.go.jp