CLASSICAL optimal control problems are formulated in

Size: px

Start display at page:

Download "CLASSICAL optimal control problems are formulated in"

Jody Lindsey
5 years ago
Views:

1 IEEE/CAA JOURNAL OF AUOMAICA SINICA, VOL., NO. 3, JULY 39 Concurrent Learnng-based Approxmate Feedbac-Nash Equlbrum Soluton of N-player Nonzero-sum Dfferental Games Rushesh Kamalapurar Justn R. Klotz arren E. Dxon Abstract hs paper presents a concurrent learnng-based actor-crtc-dentfer archtecture to obtan an approxmate feedbac-nash equlbrum soluton to an nfnte horzon N- player nonzero-sum dfferental game. he soluton s obtaned onlne for a nonlnear control-affne system wth uncertan lnearly parameterzed drft dynamcs. It s shown that under a condton mlder than persstence of exctaton PE, unformly ultmately bounded convergence of the developed control polces to the feedbac-nash equlbrum polces can be establshed. Smulaton results are presented to demonstrate the performance of the developed technque wthout an added exctaton sgnal. Index erms Nonlnear system, optmal adaptve control, dynamc programmng, data drven control. I. INRODUCION CLASSICAL optmal control problems are formulated n Bernoull form as the need to fnd a sngle control nput that mnmzes a sngle cost functonal under boundary constrants and dynamcal constrants mposed by the system [ ]. A multtude of relevant control problems can be modeled as mult-nput systems, where each nput s computed by a player, and each player attempts to nfluence the system state to mnmze ts own cost functon. In ths case, the optmzaton problem for each player s coupled wth the optmzaton problem for other players, and hence, n general, an optmal soluton n the usual sense does not exst, motvatng the formulaton of alternatve optmalty crtera. Dfferental game theory provdes soluton concepts for many mult-player, mult-obectve optmzaton problems [3 5]. For example, a set of polces s called a Nash equlbrum soluton to a mult-obectve optmzaton problem f none of the players can mprove ther outcomes by changng ther polces whle all the other players abde by the Nash equlbrum polces [6]. hus, Nash equlbrum solutons provde a secure set of strateges, n the sense that none of the players have an ncentve to dverge from ther equlbrum polces. Hence, Nash equlbrum has been Manuscrpt receved September, 3; accepted January 7,. hs wor was supported by Natonal Scence Foundaton Award 66, 798, Offce of Naval Research N-3--5, and a contract wth the Ar Force Research Laboratory Mathematcal Modelng and Optmzaton Insttute. Recommended by Assocate Edtor Zhongsheng Hou Ctaton: Rushesh Kamalapurar, Justn R. Klotz, arren E. Dxon. Concurrent learnng-based approxmate feedbac-nash equlbrum soluton of N-player nonzero-sum dfferental games. IEEE/CAA Journal of Automatca Snca,, 3: 39 7 Rushesh Kamalapurar s wth the Department of Mechancal and Aerospace Engneerng, Unversty of Florda, Ganesvlle 368, USA emal: ramalapurar@ufl.edu. Justn R. Klotz s wth the Department of Mechancal and Aerospace Engneerng, Unversty of Florda, Ganesvlle 368, USA e-mal: lotz@ufl.edu. arren E. Dxon s wth the Department of Mechancal and Aerospace Engneerng, Unversty of Florda, Ganesvlle 368, USA e-mal: wdxon@ufl.edu. a wdely used soluton concept n dfferental game-based control technques. In general, Nash equlbra are not unque. For a closedloop dfferental game.e., the control s a functon of the state and tme wth perfect nformaton.e., all the players now the complete state hstory, there can be nfntely many Nash equlbra. If the polces are constraned to be feedbac polces, the resultng equlbra are called subgame perfect Nash equlbra or feedbac-nash equlbra. he value functons correspondng to feedbac-nash equlbra satsfy a coupled system of Hamlton-Jacob HJ equatons [7 ]. If the system dynamcs are nonlnear and uncertan, an analytcal soluton of the coupled HJ equatons s generally nfeasble; and hence, dynamc programmng-based approxmate solutons are sought [ 8]. In [6], an ntegral renforcement learnng algorthm s presented to solve nonzero-sum dfferental games n lnear systems wthout the nowledge of the drft matrx. In [7], a dynamc programmng-based technque s developed to fnd an approxmate feedbac-nash equlbrum soluton to an nfnte horzon N-player nonzerosum dfferental game onlne for nonlnear control-affne systems wth nown dynamcs. In [9], a polcy teraton-based method s used to solve a two-player zero-sum game onlne for nonlnear control-affne systems wthout the nowledge of drft dynamcs. he methods n [7] and [9] solve the dfferental game onlne usng a parametrc functon approxmator such as a neural networ NN to approxmate the value functons. Snce the approxmate value functons do not satsfy the coupled HJ equatons, a set of resdual errors the so-called Bellman errors BEs s computed along the state traectores and s used to update the estmates of the unnown parameters n the functon approxmator usng least-squares or gradent-based technques. Smlar to adaptve control, a restrctve persstence of exctaton PE condton s requred to ensure boundedness and convergence of the value functon weghts. Smlar to renforcement learnng, an ad-hoc exploraton sgnal s added to the control sgnal durng the learnng phase to satsfy the PE condton along the system traectores [ ]. It s unclear how to analytcally determne an exploraton sgnal that ensures PE for nonlnear systems; and hence, the exploraton sgnal s typcally computed va a smulaton-based tral and error approach. Furthermore, the exstng onlne approxmate optmal control technques such as [6 7, 9, 3] do not consder the ad-hoc sgnal n the Lyapunov-based analyss. Hence, the stablty of the overall closed-loop mplementaton s not establshed. hese stablty concerns, along wth concerns that the added probng sgnal can result n ncreased control effort and oscllatory transents, provde motvaton for the subsequent development. Based on the deas n recent concurrent learnng-based adap-

2 IEEE/CAA JOURNAL OF AUOMAICA SINICA, VOL., NO. 3, JULY tve control results such as [] and [5] that show a concurrent learnng-based adaptve update law can explot recorded data to augment the adaptve update laws to establsh parameter convergence under condtons mlder than PE, ths paper extends the wor n [7] and [9] to relax the PE condton. In ths paper, a concurrent learnng-based actor-crtc-dentfer archtecture [3] s used to obtan an approxmate feedbac- Nash equlbrum soluton to an nfnte horzon N-player nonzero-sum dfferental game onlne, wthout requrng PE, for a nonlnear control-affne system wth uncertan lnearly parameterzed drft dynamcs. A system dentfer s used to estmate the unnown parameters n the drft dynamcs. he solutons to the coupled HJ equatons and the correspondng feedbac-nash equlbrum polces are approxmated usng parametrc unversal functon approxmators. Based on estmates of the unnown drft parameters, estmates for the BEs are evaluated at a set of preselected ponts n the state-space. he value functon and the polcy weghts are updated usng a concurrent learnng-based least-squares approach to mnmze the nstantaneous BEs and the BEs evaluated at pre-selected ponts. Smultaneously, the unnown parameters n the drft dynamcs are updated usng a hstory stac of recorded data va a concurrent learnngbased gradent descent approach. It s shown that under a condton mlder than PE, unformly ultmately bounded UUB convergence of the unnown drft parameters, the value functon weghts and the polcy weghts to ther true values can be establshed. Smulaton results are presented to demonstrate the performance of the developed technque wthout an added exctaton sgnal. II. PROBLEM FORMULAION AND EXAC SOLUION Consder a class of control-affne mult-nput systems ẋ = f x+ g xû, where x R n s the state and û R m are the control nputs.e., the players. In, the unnown functon f : R n R n s lnearly parameterzable, the functon g : R n R n m s nown and unformly bounded, f and g are locally Lpschtz, and f =. Let U := {{u : R n R m,,,n}, such that the tuple {u,,u N } s admssble w.r.t. } be the set of all admssble tuples of feedbac polces. Let V {u,,u N } : R n R denote the value functon of the th player w.r.t. the tuple of feedbac polces {u,,u N } U, defned as V {u,,u N } x = r φ τ,x,u φ τ,x,,u N φ τ,x dτ, t where φ τ,x for τ [t, denotes the traectory of obtaned usng the feedbac polces û τ = u φ τ,x and the ntal condton φ t, x =x. In, r : R n R m R m N R denotes the nstantaneous cost defned as r x, u,,u N :=x Q x + N u R u, where Q R n n s a postve defnte matrx. he control obectve s to fnd an approxmate feedbac-nash equlbrum soluton to the nfnte horzon regulaton dfferental game onlne,.e., to fnd a tuple {u,,u N } U such that for all {,,N}, for all x R n, the correspondng value functons satsfy V x:=v {u,u,,u,,u N } x V {u,u,,u,,u N } x for all u such that {u,u,,u,,u N } U. he exact closed-loop feedbac-nash equlbrum soluton {u,,u N } can be expressed n terms of the value functons [5, 8 9, 7] as u = R g x V, 3 where x := x and the value functons {V,,VN } are the solutons to the coupled HJ equatons x Q x + xv xv G x V + x V f G x V =. In, G := g R g and G := g R R R g. he HJ equatons n are n the so-called closed-loop form; they can be expressed n an open-loop form as x Q x + u R u + x V f + x V g u =. III. APPROXIMAE SOLUION Computaton of an analytcal soluton to the coupled nonlnear HJ equatons n { s, n general, nfeasble. Hence, an approxmate soluton ˆV,, ˆV } N s sought. Based on { ˆV,, ˆV } N, an approxmaton {û,, û N } to the closedloop feedbac-nash equlbrum soluton s computed. Snce the approxmate soluton, n general, does not satsfy the HJ equatons, a set of resdual errors the so-called Bellman errors BEs s computed as δ = x Q x + û R û + x ˆV f + x ˆV N 5 g û, 6 and the approxmate soluton s recursvely mproved to drve the BEs to zero. he computaton of the BEs n 6 requres nowledge of the drft dynamcs f. o elmnate ths requrement, a concurrent learnng-based system dentfer s developed n the followng secton. A. System Identfcaton Let f x =Y x θ be the lnear parametrzaton of the drft dynamcs, where Y : R n R n p θ denotes the locally Lpschtz regresson matrx, and θ R p θ denotes the vector of constant, unnown drft parameters. he system dentfer s desgned as ˆx = Y x ˆθ + g xû + x x, 7 where the measurable state estmaton error x s defned as x := x ˆx, x R n n s a postve defnte, constant dagonal observer gan matrx, and ˆθ R p θ denotes the

3 KAMALAPURKAR et al.: CONCURREN LEARNING-BASED APPROXIMAE FEEDBACK-NASH EQUILIBRIUM SOLUION OF N-PLAYER vector of estmates of the unnown drft parameters. In tradtonal adaptve systems, the estmates are updated to mnmze the nstantaneous state estmaton error, and convergence of parameter estmates to ther true values can be establshed under a restrctve PE condton. In ths result, a concurrent learnng-based data-drven approach s developed to relax the PE condton to a weaer, verfable ran condton as follows. Assumpton [ 5] {. A hstory stac H d contanng state-acton tuples x, û,, û N, } =,,M θ recorded along the traectores of s avalable a pror, such that M θ ran Y = p θ, 8 Y where Y = Y x, and p θ denotes the number of unnown parameters n the drft dynamcs. o facltate the concurrent learnng-based parameter update, numercal methods are used to compute the state dervatve ẋ correspondng to x, û. he update law for the drft parameter estmates s desgned as ˆθ =Γ θ Y x M θ +Γ θ θ Y ẋ g û Y ˆθ, 9 where g := g x, Γ θ R p p s a constant postve defnte adaptaton gan matrx, and θ R s a constant postve concurrent learnng gan. he update law n 9 requres the unmeasurable state dervatve ẋ. Snce the state dervatve at a past recorded pont on the state traectory s requred, past and future recorded values of the state can be used along wth accurate noncausal smoothng technques to obtan good estmates of ẋ. In the presence of dervatve estmaton errors, the parameter estmaton errors can be shown to be UUB, where the sze of the ultmate bound depends on the error n the dervatve estmate []. o ncorporate new nformaton, the hstory stac s updated wth new data. hus, the resultng closed-loop system s a swtched system. o ensure the stablty of the swtched system, the hstory stac s updated usng a sngular value maxmzng algorthm []. Usng, the state dervatve can be expressed as ẋ g û = Y θ, and hence, the update law n 9 can be expressed n the advantageous form θ = Γ θ Y x M θ Γ θ θ Y θ, Y where θ := θ ˆθ denotes the drft parameter estmaton error. he closed-loop dynamcs of the state estmaton error are gven by x = Y θ x x. B. Value Functon Approxmaton he value functons,.e., the solutons to the HJ equatons n, are contnuously dfferentable functons of the state. Usng the unversal approxmaton property of NNs, the value functons can be represented as V x = x+ɛ x, where R p denotes the constant vector of unnown NN weghts, : R n R p denotes the nown NN actvaton functon, p N denotes the number of hdden layer neurons, and ɛ : R n R denotes the unnown functon reconstructon error. he unversal functon approxmaton property guarantees that over any compact doman C R n, for all constants ɛ, ɛ >, there exst a set of weghts and bass functons such that, sup x C x, sup x C x, sup x C ɛ x ɛ and sup x C ɛ x ɛ, where,,, ɛ, ɛ R are postve constants. Based on 3 and, the feedbac-nash equlbrum solutons are gven by u x = R g x x + ɛ x. 3 he NN-based approxmatons to the value functons and the controllers are defned as ˆV := Ŵ c, û := R g Ŵ a, where Ŵc R p,.e., the value functon weghts, and Ŵ a R p,.e., the polcy weghts, are the estmates of the deal weghts. he use of two dfferent sets of estmates to approxmate the same set of deal weghts s motvated by the subsequent stablty analyss and the fact that t facltates an approxmate formulaton of the BEs whch s affne n the value functon weghts, enablng least squaresbased adaptaton. Based on, measurable approxmatons to the BEs n 6 are developed as ˆδ = ω Ŵ c + x Q x + Ŵ a G Ŵ a, 5 where ω := Y ˆθ N G Ŵ a. he followng assumpton, whch s generally weaer than the PE assumpton, s requred for convergence of the concurrent learnng-based value functon weght estmates. Assumpton. For each {,,N}, there exst a fnte set of M x ponts {x R n =,,M x } such that for all t R, ran Mx = ω t ω t ρ t = p, { Mx λ mn nf t R = } ω tω t ρ t c x := >, 6 M x where λ mn denotes the mnmum egenvalue, and c x R s a postve constant. In 6, ω t := Y ˆθ t N G Ŵ a t, where the superscrpt ndcates that the term s evaluated at x = x, and ρ := +ν ω Γ ω, where ν R > s the normalzaton gan and Γ R P P s the adaptaton gan matrx.

4 IEEE/CAA JOURNAL OF AUOMAICA SINICA, VOL., NO. 3, JULY he concurrent learnng-based least-squares update law for the value functon weghts s desgned as M x ω Ŵ c = η c Γ ˆδ η cγ ˆδ ρ M x ρ, = ω ω Γ = β Γ η c Γ ρ Γ { Γ Γ }, Γ t Γ, 7 where ρ := +ν ω Γ ω, { } denotes the ndcator functon, Γ > R s the saturaton constant, β R s the constant postve forgettng factor, η c,η c R are constant postve adaptaton gans, and the approxmate BE ˆδ s defned as ˆδ := ω Ŵ c + x Q x + Ŵ a ω G Ŵ a he polcy weght update laws are desgned based on the subsequent stablty analyss as Ŵ a = η a Ŵa Ŵc η a Ŵ a + η c G Ŵa ω Ŵ ρ c+ M x = η c M x G Ŵa ω Ŵc, 8 where η a,η a R are postve constant adaptaton gans. he forgettng factor β along wth the saturaton n the update law for the least-squares gan matrx n 7 ensure that the least-squares gan matrx Γ and ts nverse s postve defnte and bounded for all {,,N} as [6] Γ Γ t Γ, t R, 9 where Γ R s a postve constant, and the normalzed regressor s bounded as ω ρ. ν Γ IV. SABILIY ANALYSIS ρ. Substtutng for V and u from and 3 and usng f = Yθ, the approxmate BE can be expressed as ˆδ = ω Ŵ c + Ŵ a G Ŵ a Yθ ɛ Yθ G +ɛ G + ɛ G ɛ + G + ɛ G + G ɛ + ɛ G ɛ. Addng and subtractng Ŵ a G + ω yelds ˆδ = ω c + a G a Y θ G G a ɛ Yθ+Δ, N where Δ := G G ɛ + N G ɛ + N ɛ G ɛ N ɛ G ɛ. Smlarly, the approxmate BE evaluated at the selected ponts can be expressed n an unmeasurable form as ˆδ = ω c + G a G G a +Δ a Y θ, where the constant Δ R s defned as Δ := ɛ Y θ + Δ. o facltate the stablty analyss, a canddate Lyapunov functon s defned as V L = V + cγ c + a a + x x + θ Γ θ θ. Subtractng from 5, the approxmate BE can be Snce V are postve defnte, the bound n 9 and Lemma expressed n an unmeasurable form as.3 n [7] can be used to bound the canddate Lyapunov functon as ˆδ = ω Ŵ c + x Q x + Ŵ a G Ŵ a v Z V L Z, t v Z, 3 [ x Q x + u R u + x V f + x V g u. where Z = x, c,, cn, a,, θ] an, x, R n+n p +p θ and v, v : R R are class K

5 KAMALAPURKAR et al.: CONCURREN LEARNING-BASED APPROXIMAE FEEDBACK-NASH EQUILIBRIUM SOLUION OF N-PLAYER 3 functons. For any compact set Z R n+n p +p θ, defne ι := max sup, Z Z G + ɛ G, ι := max sup η cω 3, Z Z ρ G G + M x 3 G G, = ι 3 := max, η c ω M x ρ sup, Z Z, + ɛ G ɛ + ɛ G ɛ, ι := max sup G,ι 5 := η cl Y ɛ θ, Z Z, ν Γ ι 6 := η cl Y η c max Y,ι 7 := ν Γ, ν Γ η c + η c ι ι 8 := 8,ι 9 := ι N +η a + ι 8, ν Γ ι := η c sup Z Z v l := mn ι := N Δ Δ + η c max, ν Γ q, η cc x ι 9 η a + η a +, x, η a + η a, θy, 8 ι η c c x + ι 3, where y denotes the mnmum egenvalue of M θ Y, q denotes the mnmum egenvalue of Q, x denotes the mnmum egenvalue of x, and the suprema exst snce ω ρ s unformly bounded for all Z, and the functons G, G,, and ɛ are contnuous. In, L Y R denotes the Lpschtz constant such that Y ϖ L Y ϖ for all ϖ Z R n. he suffcent condtons for UUB convergence are derved based on the subsequent stablty analyss as Y q > ι 5, η c c x > ι 5 +ζ ι 7 + ι ζ N + η a +ζ 3 ι 6 Z, η a + η a > ι 8 + ι N, ζ θ y > ι 7 + ι 6 Z, 5 ζ ζ 3 where Z := v v max Z t, ιvl and ζ,ζ,ζ 3 R are nown postve adustable constants. Snce the NN functon approxmaton error and the Lpschtz constant L Y depend on the compact set that contans the state traectores, the compact set needs to be establshed before the gans can be selected usng 5. Based on the subsequent stablty analyss, an algorthm s developed to compute the requred compact set denoted by Z based on the ntal condtons. In Algorthm, the notaton {ϖ} for any parameter ϖ denotes the value of ϖ computed n the - th teraton. Snce the constants ι and v l depend on L Y only through the products L Y ɛ and L Y ζ 3, Algorthm ensures that ι dam Z, 6 v l where damz denotes the dameter of set Z. heorem. Provded Assumptons hold and the control gans satsfy the suffcent condtons n 5, where the constants n are computed based on the compact set Z selected usng Algorthm, the system dentfer n 7 along wth the adaptve update law n 9, and the controllers n along wth the adaptve update laws n 7 and 8 ensure that the state x, the state estmaton error x, the value functon weght estmaton errors c and the polcy weght estmaton errors a are UUB, resultng n UUB convergence of the polces û to the feedbac-nash equlbrum polces u. Proof. he dervatve of the canddate Lyapunov functon n along the traectores of,,, 7, and 8 s gven by V L = x V f + g u + x Y θ x x + c η c ω c ρ ˆδ + η M x c M x β Γ ω ω η c θ Y x M θ θ ω ˆδ ρ ρ Y Y c + θ

6 IEEE/CAA JOURNAL OF AUOMAICA SINICA, VOL., NO. 3, JULY M x a = η a Ŵa Ŵ c η a Ŵa+ η c Ŵc ω Ŵ ρ a G + η c Ŵ ω c M x ρ Ŵ a G. 7 Substtutng the unmeasurable forms of the BEs from and nto 7, and usng the trangle nequalty, the Cauchy- Schwarz nequalty and Young s nequalty, the Lyapunov dervatve n 7 can be bounded as q η c c V x x θ y θ ηa + η a N N ι 9 a + ι c c x x a + q ι 5 x η c c x ι 5 ζ ι 7 ι ζ N η a ζ 3 ι 6 x c + ηa + η a ι 8 ι N ζ θ y ι 7 ζ ι 6 x θ + ζ 3 a + ι 3. 8 Provded the suffcent condtons n 5 hold and the condtons η c c x >ι 5 + ζ ι 7 + ι ζ N + η a + ζ 3 ι 6 x, θ y > ι 7 + ι 6 x 9 ζ ζ 3 hold for all t R. Completng the squares n 8, the bound on the Lyapunov dervatve can be expressed as q η c c V x x c x x ηa + η a a θy θ + ι 8 v l Z, Z > ι,z Z. 3 v l Usng 3, 6, and 3, heorem.8 n [7] can be nvoed to conclude that lm t sup Z t v v ι v l. Furthermore, the system traectores are bounded as Z t Z for all t R. Hence, the condtons n 5 are suffcent for the condtons n 9 to hold for all t R. he error between the feedbac-nash equlbrum polcy and the approxmate polcy can be expressed as u û R g a + ɛ, for all =,,N, where g := sup x g x. Snce the weghts a are UUB, UUB convergence of the approxmate polces to the feedbac-nash equlbrum polces s obtaned. Remar. he closed-loop system analyzed usng the canddate Lyapunov functon n s a swtched system. he swtchng happens when the hstory stac s updated and when the least-squares regresson matrces Γ reach ther saturaton bound. Smlar to least squares-based adaptve control [6], can be shown to be a common Lyapunov functon for the regresson matrx saturaton, and the use of a sngular value maxmzng algorthm to update the hstory stac ensures that s a common Lyapunov functon for the hstory stac updates []. Snce s a common Lyapunov functon, 3, 6 and 3 establsh UUB convergence of the swtched system. V. SIMULAION A. Problem Setup o portray the performance of the developed approach, the concurrent learnng-based adaptve technque s appled to the nonlnear control-affne system [7] ẋ = f x+g x u + g x u, 3 where x R, u,u R, and x x f = x x + x cos x + + x, sn x + [ ] [ ] g =,g cos x + = sn x. + he value functon has the structure shown n wth the weghts Q = Q = I and R = R = R = R =, where I s a dentty matrx. he system dentfcaton protocol gven n Secton III-A and the concurrent learnng-based scheme gven n Secton III-B are mplemented smultaneously to provde an approxmate onlne feedbac-nash equlbrum soluton to the gven nonzero-sum two-player game. B. Analytcal Soluton he control affne system n 3 s selected for ths smulaton because t s constructed usng the converse HJ approach [8] such that the analytcal feedbac-nash equlbrum soluton of the nonzero-sum game s V = V = [.5 [.5.5 ] x x x x ] x x x x,, and the feedbac-nash equlbrum control polces for Player and Player are [ ] [ ] u = x.5 R g x x, x [ ] u = x R g. x x x ] [.5.5

7 KAMALAPURKAR et al.: CONCURREN LEARNING-BASED APPROXIMAE FEEDBACK-NASH EQUILIBRIUM SOLUION OF N-PLAYER 5 Snce the analytcal soluton s avalable, the performance of the developed method can be evaluated by comparng the obtaned approxmate soluton aganst the analytcal soluton. C. Smulaton Parameters he dynamcs are lnearly parameterzed as f = Y x θ, where x x x Y x = x x cos x + x sn x + s nown and the constant vector of parameters θ = [,,,,, ] s assumed to be unnown. he ntal guess for θ s selected as ˆθ t =.5 [,,,,, ]. he system dentfcaton gans are chosen as x = 5, Γ θ = dag {,,,, 6, 6}, θ =.5. A hstory stac of 3 ponts s selected usng a sngular value maxmzng algorthm [] for the concurrent learnng-based update law n 9, and the state dervatves are estmated usng a ffth order Savtzy-Golay flter [9]. Based on the structure of the feedbac-nash equlbrum value functons, the bass functon for value functon approxmaton s selected as =[x,x x,x ], and the adaptve learnng parameters and ntal condtons are shown for both players n ables I and II. wenty-fve ponts lyng on a 5 5 grd around the orgn are selected for the concurrent learnng-based update laws n 7 and 8. ABLE I SIMAULAION PARMAMERERS Fg. 5 demonstrates that wthout the necton of a PE sgnal the system dentfcaton parameters also approxmately converge to the correct values. he state and control sgnal traectores are dsplayed n Fgs. 6 and 7. Fg.. ABLE II SIMULAION INIIAL CONDIIONS Player Player Ŵ c t [3, 3, 3] [3, 3, 3] Ŵ a t [3, 3, 3] [3, 3, 3] Γt I 3 I 3 x t [, ] [, ] ˆx t [, ] [, ] Player crtc weghts convergence. Player Player ν.5.5 η c η c.5 η a η a.. β 3 3 Γ D. Smulaton Results Fgs. show the rapd convergence of the actor and crtc weghts to the approxmate feedbac-nash equlbrum values for both players, resultng n the value functons and control polces [.5 ] [.5 ] ˆV =.59.99, ˆV =.7.968, û = R g û = R g [ x x x x [ x x x x ] [ ] [ ], ]. Fg.. Player actor weghts convergence. VI. CONCLUSION A concurrent learnng-based adaptve approach s developed to determne the feedbac-nash equlbrum soluton to an N- player nonzero-sum game onlne. he solutons to the assocated coupled HJ equatons and the correspondng feedbac- Nash equlbrum polces are approxmated usng parametrc unversal functon approxmators. Based on estmates of the unnown drft parameters, estmates for the BEs are evaluated at a set of preselected ponts n the state-space. he value

8 6 IEEE/CAA JOURNAL OF AUOMAICA SINICA, VOL., NO. 3, JULY Fg. 3. Player crtc weghts convergence. condton for convergence, UUB convergence of the drft parameters, the value functon and polcy weghts to ther true values, and hence, UUB convergence of the polces to the feedbac-nash equlbrum polces, are establshed under weaer ran condtons usng a Lyapunov-based analyss. Smulatons are performed to demonstrate the performance of the developed technque. he developed result reles on a suffcent condton on the mnmum egenvalue of a tme-varyng regresson matrx. hle ths condton can be heurstcally satsfed by choosng enough ponts, and can be easly verfed onlne, t cannot, n general, be guaranteed a pror. Furthermore, fndng a suffcently good bass for value functon approxmaton s, n general, nontrval and can be acheved only through pror nowledge or tral and error. Future research wll focus on extendng the applcablty of the developed technque by relevng the aforementoned shortcomngs. Fg.. Player actor weghts convergence. Fg. 6. State traectory convergence to the orgn. Fg. 5. System dentfcaton parameters convergence. functon and the polcy weghts are updated usng a concurrent learnng-based least-squares approach to mnmze the nstantaneous BEs and the BEs evaluated at the preselected ponts. Smultaneously, the unnown parameters n the drft dynamcs are updated usng a hstory stac of recorded data va a concurrent learnng-based gradent descent approach. Unle tradtonal approaches that requre a restrctve PE Fg. 7. Control polces of Players and. REFERENCES [] Kr D E. Optmal Control heory: An Introducton. New Yor: Dover Publcatons,. [] Lews F L, Vrabe D, Syrmos V L. Optmal Control hrd edton. New Yor: John ley & Sons,. [3] Isaacs R. Dfferental Games: A Mathematcal heory wth Applcatons to arfare and Pursut, Control and Optmzaton. New Yor: Dover Publcatons, 999. [] s S. Introducton to Game heory. Hndustan Boo Agency, 3.

KAMALAPURKAR et al.: CONCURREN LEARNING-BASED APPROXIMAE FEEDBACK-NASH EQUILIBRIUM SOLUION OF N-PLAYER 7 [5] Basar, Olsder G. Dynamc Noncooperatve Game heory Second edton. Phladelpha, PA: SIAM, 999.

Nonzero-sum dfferental games. Journal of Optmzaton heory and Applcatons, 969, 33: 8 6 [9] Starr A, Ho C Y. Further propertes of nonzero-sum dfferental games.

Cogntve Systems Research,, : 55 66 [] e Q L, Zhang H G. A new approach to solve a class of contnuoustme nonlnear quadratc zero-sum game usng ADP.

9 KAMALAPURKAR et al.: CONCURREN LEARNING-BASED APPROXIMAE FEEDBACK-NASH EQUILIBRIUM SOLUION OF N-PLAYER 7 [5] Basar, Olsder G. Dynamc Noncooperatve Game heory Second edton. Phladelpha, PA: SIAM, 999. [6] Nash J. Non-cooperatve games. he Annals of Mathematcs, 95, 5: [7] Case J H. oward a theory of many player dfferental games. SIAM Journal on Control, 969, 7: [8] Starr A, Ho C Y. Nonzero-sum dfferental games. Journal of Optmzaton heory and Applcatons, 969, 33: 8 6 [9] Starr A, Ho C Y. Further propertes of nonzero-sum dfferental games. Journal of Optmzaton heory and Applcatons, 969, 3: 7 9 [] Fredman A. Dfferental Games. New Yor: John ley and Sons, 97. [] Lttman M. Value-functon renforcement learnng n Marov games. Cogntve Systems Research,, : [] e Q L, Zhang H G. A new approach to solve a class of contnuoustme nonlnear quadratc zero-sum game usng ADP. In: Proceedngs of the IEEE Internatonal Conference on Networng Sensng and Control. Sanya, Chna: IEEE, [3] Vamvoudas K, Lews F. Onlne synchronous polcy teraton method for optmal control. In: Proceedngs of the Recent Advances n Intellgent Control Systems. London: Sprnger, [] Zhang H G, e Q L, Lu D R. An teratve adaptve dynamc programmng method for solvng a class of nonlnear zero-sum dfferental games. Automatca,, 7: 7 [5] Zhang X, Zhang H G, Luo Y H, Dong M. Iteraton algorthm for solvng the optmal strateges of a class of nonaffne nonlnear quadratc zero-sum games. In: Proceedngs of the IEEE Conference Decson and Control. Xuzhou, Chna: IEEE, [6] Vrabe D, Lews F. Integral renforcement learnng for onlne computaton of feedbac Nash strateges of nonzero-sum dfferental games. In: Proceedngs of the 9th IEEE Conference Decson and Control. Atlanta, GA: IEEE, [7] Vamvoudas K G, Lews F L. Mult-player non-zero-sum games: onlne adaptve learnng soluton of coupled Hamlton-Jacob equatons. Automatca,, 78: [8] Zhang H, Lu D, Luo Y, ang D. Adaptve Dynamc Programmng for Control-Algorthms and Stablty Communcatons and Control Engneerng. London: Sprnger-Verlag, 3 [9] Johnson M, Bhasn S, Dxon E. Nonlnear two-player zero-sum game approxmate soluton usng a polcy teraton algorthm. In: Proceedngs of the 5th IEEE Conference on Decson and Control and European Control Conference. Orlando, FL, USA: IEEE,. 7 [] Mehta P, Meyn S. Q-learnng and pontryagn s mnmum prncple. In: Proceedngs of the IEEE Conference on Decson and Control. Shangha, Chna: IEEE, [] Vrabe D, Abu-Khalaf M, Lews F L, ang Y Y. Contnuous-tme ADP for lnear systems wth partally unnown dynamcs. In: Proceedngs of the IEEE Internatonal Symposum on Approxmate Dynamc Programmng and Renforcement Learnng. Honolulu, HI: IEEE, [] Sutton R S, Barto A G. Renforcement Learnng: An Introducton. Cambrdge, MA, USA: MI Press, 998. [3] Bhasn S, Kamalapurar R, Johnson M, Vamvoudas K, Lews F L, Dxon. A novel actor-crtc-dentfer archtecture for approxmate optmal control of uncertan nonlnear systems. Automatca, 3, 9: 89 9 [] Chowdhary G, Yucelen, Mühlegg M, Johnson E N. Concurrent learnng adaptve control of lnear systems wth exponentally convergent bounds. Internatonal Journal of Adaptve Control and Sgnal Processng,, 7: 8 3 [5] Chowdhary G V, Johnson E N. heory and flght-test valdaton of a concurrent-learnng adaptve controller. Journal of Gudance, Control, and Dynamcs,, 3: [6] Ioannou P, Sun J. Robust Adaptve Control. New Jersey: Prentce Hall, [7] Khall H K. Nonlnear Systems hrd edton. New Jersey: Prentce Hall,. 7 [8] Nevstć V, Prmbs J A. Constraned nonlnear optmal control: a converse HJB approach. Calforna Insttute of echnology, Pasadena, CA 95, echcal Report CI-CDS 96-, [9] Savtzy A, Golay M J E. Smoothng and dfferentaton of data by smplfed least squares procedures. Analytcal Chemstry, 96, 368: author of ths paper. Rushesh Kamalapurar Receved hs bachelor degree n mechancal engneerng from Vsvesvaraya Natonal Insttute of echnology, Nagpur, Inda. He wored for two years as a desgn engneer at Larsen and oubro Ltd., Mumba, Inda. He s currently pursung the Ph. D. degree n the Department of Mechancal and Aerospace Engneerng n Unversty of Florda under the supervson of Dr. arren E. Dxon. Hs research nterest covers learnng based control, dynamc programmng, optmal control, renforcement learnng and adaptve control for uncertan nonlnear dynamcal systems. Correspondng Justn R. Klotz Receved the B. S. and M. S. degrees n mechancal engneerng from the Unversty of Florda n and 3, respectvely. He s currently worng towards the Ph. D. degree wth a concentraton on control theory at the Unversty of Florda under the SMAR Scholarshp. He was a research assstant for the Ar Force Research Laboratory at Egln Ar Force Base for the summers of and 3. Hs research nterest covers cooperatve networ control and optmal control of uncertan nonlnear systems. arren E. Dxon Receved hs Ph. D. degree n from the Department of Electrcal and Computer Engneerng, Clemson Unversty. After completng hs doctoral studes he was selected as an Eugene P. gner Fellow at Oa Rdge Natonal Laboratory ORNL. In, Dr. Dxon oned the Unversty of Florda n the Department of Mechancal and Aerospace Engneerng. Hs research nterest covers the development and applcaton of Lyapunov-based control technques for uncertan nonlnear systems. He has publshed 3 boos, an edted collecton, 9 chapters, and over 5 referred ournal and conference papers. Hs wor has been recognzed by the 3 Fred Ellersc Award for Best Overall MILCOM Paper, -3 Unversty of Florda College of Engneerng Doctoral Dssertaton Mentorng Award, Amercan Socety of Mechancal Engneers ASME Dynamcs Systems and Control Dvson Outstandng Young Investgator Award, 9 Amercan Automatc Control Councl AACC O. Hugo Schuc Best Paper Award, 6 IEEE Robotcs and Automaton Socety RAS Early Academc Career Award, an NSF CAREER Award 6, DOE Outstandng Mentor Award, and the ORNL Early Career Award for Engneerng Achevement. He s an IEEE Control Systems Socety CSS Dstngushed Lecturer. He currently serves as the drector of operatons for the Executve Commttee of the IEEE CSS Board of Governors. He s currently or was formerly an assocate edtor for ASME Journal of Dynamc Systems, Measurement and Control, Automatca, IEEE ransactons on Systems Man and Cybernetcs: Part B Cybernetcs, and the Internatonal Journal of Robust and Nonlnear Control.

Concurrent learning-based online approximate feedback-nash equilibrium solution of N-player nonzero-sum differential games

Concurrent learning-based online approximate feedback-nash equilibrium solution of N-player nonzero-sum differential games Concurrent learnng-based onlne approxmate feedbac-nash equlbrum soluton of N-player nonzero-sum dfferental games Rushesh Kamalapurar, Justn Klotz, and arren E. Dxon arxv:30.38v [cs.sy] Oct 03 Abstract