A Continuous-time Markov Decision Process Based Method on Pursuit-Evasion Problem


Preprints of the IFAC World Congress, The International Federation of Automatic Control, Cape Town, South Africa.

A Continuous-time Markov Decision Process Based Method on Pursuit-Evasion Problem

Jia Shengde, Wang Xiangke, Jia Xiaoting, Zhu Huayong

College of Mechatronic Engineering and Automation, National University of Defense Technology, Changsha, China (e-mail: jia.shde@gmail.com, xkwang@nudt.edu.cn, xiaotjia@nudt.edu.cn).

Abstract: This paper presents a method to address the pursuit-evasion problem which incorporates the behaviors of the opponent. A continuous-time Markov decision process (CTMDP) model is introduced, whose significant difference from the Markov decision process (MDP) is that the influence of the transition time between states is taken into account. By introducing the concept of situation, probabilities addressing average behaviors are obtained. These probabilities are then used to construct the transition matrix of the CTMDP. A policy iteration method for solving the CTMDP is also given. To demonstrate the CTMDP method for pursuit-evasion, examples in a grid environment are computed. The CTMDP-based method presented in this paper offers a new approach to pursuit-evasion modeling and may be extended to similar problems in sequential decision processes.

Keywords: Pursuit-Evasion, Continuous-time Markov Decision Process, Transition Rates Matrix, Dynamic Programming, Policy Iteration.

1. INTRODUCTION

Pursuit-evasion is a family of problems in control theory and computer science, in which one group attempts to track down the members of another group in an environment. Due to the various applications of ground mobile robots, unmanned aerial vehicles and so on, the existing literature on pursuit-evasion is vast in volume (Isler et al.; Virtanen et al.). The problem is difficult and is usually considered as a dynamic, stochastic, continuous-space, continuous-time or discrete-time discrete-space game (Shedied).

The optimal control technique provides an efficient tool for the analysis of the pursuer's decisions when the dynamics of the pursuer are known. The main advantage of this method is that if a real-time solution exists, then it can be obtained by a set of difference equations (LaValle). The optimal control technique is usually applied to the continuous-time, continuous-space version of the problem, but the differential equations used may not actually represent the complex behaviors of the players, and the solution will not be feasible when uncertainty exists in the opponent and the environment. As the discretization of time and space offers another perspective, dynamic programming theories (Bertsekas and Tsitsiklis) provide a common architecture, and this discrete form is better suited to calculation on computers. Through constructing a value function to be optimized and a space of states, two major solution approaches arise, well known as value iteration and policy iteration. In order to deal with the uncertainties as well as the dynamics in pursuit-evasion, the dynamic programming-based techniques need to be extended, and methods that emphasize probabilities addressing average behaviors are a good choice. The description of the evader's behavior allows incorporating probabilities on opponent locations, intents and/or sensor observations. If the probability between the states has been defined, and assuming it obeys the Markov property, the dynamic programming technique becomes a Markov decision process. The MDP is now widely accepted as a preferred frame for decision-theoretic planning (LaValle; Russell and Norvig). Some research using fuzzy reinforcement learning is mentioned in (Faiya et al.).
It is usually considered that, in the context of an MDP, only the states have an impact on the transition probabilities and the expected reward function, while the transition time between states is not considered. However, in a pursuit-evasion process, it is obvious that the combat result is sensitive to time, i.e., a shorter transition time span and a longer one might lead to different results. To address this issue, the CTMDP provides a more natural model. A brief description of the CTMDP method is presented in Section 2, as it will be used to model an air-combat-style strategy process.

The rest of the paper is organized as follows. Section 2 gives a brief description of the CTMDP, in which the definition of the CTMDP model and a policy iteration algorithm for its solution are presented. The dynamics of the pursuer and the evader are presented in Section 3. Section 4 describes the main contributions of this paper, in which the behavior of the opponent is modeled and a novel method for obtaining the transition rates matrix is presented. The presented method is demonstrated by some numerical examples in Section 5. Finally, concluding remarks are given in Section 6.

2. THE CONTINUOUS-TIME MARKOV DECISION PROCESS

A CTMDP model can be presented as a five-tuple

    {S, S_0, (A(i), i ∈ S), q(j|i, a), r(i, a)}

where the state space S is a finite set of fully observable states of the system, S_0 ∈ S is the initial state, A(i) denotes a family of measurable subsets of actions applicable in i ∈ S, q(j|i, a) is the transition rate to State j after performing Action a ∈ A(i) in State i, which satisfies q(j|i, a) ≥ 0 for i ≠ j, and r(i, a) ∈ R is a reward function such that r(i, a) is the immediate reward for being in State i with Action a.

In addition, two important assumptions shall be kept in mind for this case. Firstly, the transition rate is conservative, i.e.

    Σ_{j∈S} q(j|i, a) = 0.                                                    (1)

Secondly, there is a stability condition that can be expressed as

    sup_{a∈A(i)} |q(i|i, a)| < ∞,  i ∈ S.

With these assumptions, the unique solution of the probability matrix for the CTMDP can be generated (Guo). Furthermore, the transition probability matrix and the transition rates matrix satisfy the Kolmogorov forward and backward equations (Stroock):

    dP(t; µ)/dt = P(t; µ) Q(µ)   and   dP(t; µ)/dt = Q(µ) P(t; µ).            (2)

The policy µ consists of the actions selected in every State i, i.e. µ = {a_1, a_2, ..., a_|S|}, where a_i ∈ A(i) and |S| is the total number of states in S. The reward function r(i, a) ∈ R is the immediate reward obtained when selecting Action a in State i. Then, the expected reward function under policy µ, which is required to be maximized, is defined as the sum of the future rewards over an infinite horizon, discounted by a parameter α:

    J_α(i, µ) := ∫_0^∞ e^{−αt} Σ_{j∈S} p_{i,j}(t, µ(i)) r(j, µ(j)) dt.         (3)

According to a theorem in Guo, there exists an optimal policy µ* that maximizes J_α. Based on the definition of the value function J_α, Guo offers a policy iteration algorithm that can be used to obtain the optimal policy, described as Algorithm 1.

Algorithm 1 Policy Iteration Algorithm
1: Pick an arbitrary policy µ. Let k = 0 and take µ_k = µ.
2: (Policy evaluation.)
3: Obtain R(µ_k) = [r(1, µ_k(1)), ..., r(|S|, µ_k(|S|))]^T.
4: Obtain J_k = [αI − Q(µ_k)]^{−1} R(µ_k).
5: (Policy improvement.) Obtain a policy µ_{k+1} = {a_1, a_2, ..., a_|S|} that provides
   ( r(i, a_i) + Σ_{j≠i} q(j|i, a_i) J_k(j) ) / ( α − q(i|i, a_i) ) > J_k(i),  for i ∈ S.
6: if µ_{k+1} = µ_k then
7:   stop
8: else
9:   go to Step 2
10: end if

3. PROBLEM STATEMENT

A given pursuit-evasion case can be modeled as a continuous-time Markov process. In this section, a simple discrete pursuit-evasion example is introduced and then the details of constructing the CTMDP are described.

3.1 Dynamics

Consider the environment as a discrete grid plane, in which two robots try to track and evade each other. For simplicity, the robots move simultaneously, step by step. Each step of the movement is called an action. In this context, the robots are considered to have three available actions {L, R, S}, namely turn left, turn right and go straight (Fig. 1). What the present paper focuses on is how to find successive actions for the blue robot which finally make it arrive in the back region of the red robot.

Fig. 1. The environment and the actions of the robots.

The environment is represented as Ω = {(x, y) | x, y ∈ N}. The state of a robot's dynamics consists of the position and the velocity information. As an example, the state vector of the blue robot is represented as x_b = (x_b, y_b, v^b_x, v^b_y), where the entries of x_b denote the horizontal and vertical positions and velocities respectively. Assuming that (x_b, y_b) ∈ Ω and that (v^b_x, v^b_y) is a unit step along one of the grid axes, the dynamics of the robots can be written as Algorithm 2.
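To make Algorithm 1 concrete before the pursuit-evasion model is built on top of it, the following sketch implements the evaluation and improvement steps of the discounted-CTMDP policy iteration for a generic finite model. The arrays q and r and the function name policy_iteration are illustrative assumptions, not part of the paper; in the paper, q comes from the construction of Section 4 and r from Algorithm 3.

```python
# A minimal sketch of Algorithm 1 (policy iteration for a discounted CTMDP).
# q[a][i, j] and r[i, a] are hypothetical placeholders for the paper's model.
import numpy as np

def policy_iteration(q, r, alpha, max_iter=100):
    """q: (nA, nS, nS) transition-rate arrays, rows summing to zero (conservative).
    r: (nS, nA) immediate rewards.  alpha: discount rate > 0."""
    n_actions, n_states, _ = q.shape
    mu = np.zeros(n_states, dtype=int)           # arbitrary initial policy
    for _ in range(max_iter):
        # Policy evaluation: J = [alpha*I - Q(mu)]^{-1} R(mu)
        Q_mu = q[mu, np.arange(n_states), :]     # row i taken from action mu[i]
        R_mu = r[np.arange(n_states), mu]
        J = np.linalg.solve(alpha * np.eye(n_states) - Q_mu, R_mu)
        # Policy improvement: maximize (r(i,a) + sum_{j!=i} q(j|i,a) J(j)) / (alpha - q(i|i,a))
        mu_next = np.empty_like(mu)
        for i in range(n_states):
            off_diag = q[:, i, :] @ J - q[:, i, i] * J[i]   # sum over j != i
            scores = (r[i, :] + off_diag) / (alpha - q[:, i, i])
            mu_next[i] = int(np.argmax(scores))
        if np.array_equal(mu_next, mu):          # policy stable: stop
            return mu, J
        mu = mu_next
    return mu, J
```

The stopping rule mirrors Steps 6-10 of Algorithm 1: once the improvement step no longer changes any action, the current policy is returned together with its value vector.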
3.2 Combat-States, Goal and Rewards

The pursuit-evasion process can be considered as a combat process. In the previous subsection, the dynamics of a single robot were introduced, in which only one side is considered while the opponent is ignored. In order to depict the combat process, Combat-States relating the dynamics of both robots are defined. A Combat-State is represented as x_c = [d, AA, ATA], where d is the Manhattan distance (more suitable for the discrete environment in this paper) and ATA and AA are the angles described in Fig. 2. Given dynamic states x_b and x_r of the blue and red robots, these elements of the Combat-State can be obtained as shown in (4).

Algorithm 2 Robots Dynamics
1: if action = turn left (a = L) then
2:   update x(k+1) according to the current heading [v_x(k), v_y(k)], with one case for each of the four possible headings (the position moves by one cell and the heading is rotated)
3: else if action = turn right (a = R) then
4:   update x(k+1) analogously, with one case for each of the four possible headings
5: else if action = go straight (a = S) then
6:   x(k+1) = x(k) + [v_x(k), v_y(k), 0, 0]
7: end if

    d = |x_b − x_r| + |y_b − y_r|
    ATA = ± arccos[ (v^b_x, v^b_y)·(x_r − x_b, y_r − y_b) / ( ‖(v^b_x, v^b_y)‖ ‖(x_r − x_b, y_r − y_b)‖ ) ]          (4)
    AA  =   arccos[ (v^r_x, v^r_y)·(x_r − x_b, y_r − y_b) / ( ‖(v^r_x, v^r_y)‖ ‖(x_r − x_b, y_r − y_b)‖ ) ]

This group of equations offers a description of the Combat-States as represented by the dynamic states of the two robots. The size of the problem is limited by an integer N, which provides that |x_r − x_b| ≤ N and |y_r − y_b| ≤ N. It is obvious that AA mainly depends on the red robot and ATA on the blue one. The range of d belongs to [0, N], AA is set to [0, π] and ATA is set to [−π, π), because if the range of ATA were [0, π], it would be impossible to distinguish on which side of the LOS v_b is located. Going from the direction of the LOS to v_b, if the rotation is counter-clockwise then ATA > 0; otherwise ATA < 0. This sign is used to calculate the next position under the selected action in the algorithm of the reward function (see Alg. 3). However, in the calculation of the likelihood functions and transition rates, the sign of ATA is ignorable and its range is then taken as [0, π].

Fig. 2. The geometrical relationship of the robots (LOS, AA and ATA).

To achieve the goal, reward functions are necessary for obtaining a solution in the mathematical model. The outcome in State i is represented as g(i), which consists of two parts, namely the goal zone reward function g_p(i) and the scoring function S(i) (McGrew et al.). The outcome with weight ω_g ∈ [0, 1] is defined as

    g(i) = ω_g g_p(i) + (1 − ω_g) S(i).                                        (5)

Firstly, we define a goal zone, rewarding the states that make the red robot drop into the goal zone and punishing the failing ones. The goal zone of the blue robot is assumed to be the four blocks in front of the robot, see Fig. 3. If the opponent is in the goal zone, one unit of reward is obtained (g_p(i) = 1); otherwise zero (g_p(i) = 0).

Fig. 3. The goal zone and g_p.

Furthermore, it is also helpful for obtaining a better solution to define a scoring function. The scoring function is an expert heuristic which reasonably captures the relative merits of every possible state in the adversarial game (McGrew et al.). The scoring function is evaluated from the elements of Combat-State i, as shown in (6), where d̄ denotes the expected distance between the robots and the constant K is a factor adjusting the relative effect of the range and the angles; in this paper it is chosen proportional to 1/π.

    S(i) = [ (1 − ATA/π) + (1 − AA/π) ] / 2 · exp( −|d − d̄| / (πK) )           (6)

The reward functions r(i, a) can then be obtained through Algorithm 3.
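As a concrete illustration of (4)-(6), the sketch below computes a Combat-State from the two state vectors and evaluates the blended reward. The helper names (combat_state, score, outcome) and the numeric defaults for d̄, K and ω_g are hypothetical placeholders, and the goal-zone membership test is passed in rather than re-derived from Fig. 3; it assumes the two robots do not occupy the same cell.

```python
# A minimal sketch of the Combat-State geometry (4) and the reward shaping (5)-(6).
import numpy as np

def combat_state(x_b, x_r):
    """x_b, x_r = (x, y, vx, vy). Returns [d, AA, ATA] as in (4), magnitude of ATA only."""
    los = np.array([x_r[0] - x_b[0], x_r[1] - x_b[1]], dtype=float)   # blue -> red
    d = abs(los[0]) + abs(los[1])                                     # Manhattan distance
    v_b = np.array(x_b[2:], dtype=float)
    v_r = np.array(x_r[2:], dtype=float)
    def angle(v):
        c = np.dot(v, los) / (np.linalg.norm(v) * np.linalg.norm(los))
        return np.arccos(np.clip(c, -1.0, 1.0))
    ATA = angle(v_b)          # the sign convention of (4) is omitted in this sketch
    AA = angle(v_r)
    return np.array([d, AA, ATA])

def score(xc, d_expected=2.0, K=1.0 / np.pi):
    """Scoring function (6); d_expected and K are placeholder values."""
    d, AA, ATA = xc
    return 0.5 * ((1 - ATA / np.pi) + (1 - AA / np.pi)) * np.exp(-abs(d - d_expected) / (np.pi * K))

def outcome(xc, in_goal_zone, w_g=0.5):
    """Outcome (5): goal-zone reward g_p blended with the heuristic score."""
    g_p = 1.0 if in_goal_zone else 0.0
    return w_g * g_p + (1 - w_g) * score(xc)
```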
4. CONSTRUCTION OF TRANSITION RATES MATRIX

Now that the Combat-States, actions, and rewards have been depicted, this section describes how to model the pursuit-evasion process as a CTMDP, with special attention to the construction of the transition rates.

4.1 Combating Situations

Different from the MDP, in a pursuit-evasion process it is obvious that the time spent on a transition affects the combat situation to a great extent. Thus it is more reasonable to consider the connection between the transition probability and time. The transition probability p(j|i, a) is therefore replaced by p(j|i, a; t), which mirrors the probability of transferring from State i to State j after a time period t.

Algorithm 3 The reward function r(i, a)
1: Given the action a of the blue robot in the current Combat-State i = [d, AA, ATA].
2: Assuming the red robot stays immobile, obtain the successor Combat-State ī = [d̄, ĀA, ĀTA].
3: d̄ = d and ĀA = AA.
4: if a = L then update ĀTA by rotating ATA to the left, with two cases depending on the current ATA so that ĀTA stays within its range
5: end if
6: if a = R then update ĀTA by rotating ATA to the right, again with two cases to keep ĀTA within its range
7: end if
8: if a = S then ĀTA = ATA
9: end if

There are so many Combat-States that the problem is hard to solve directly, so classifying these states is not a bad choice. Similar to the cases in Virtanen et al., and based on the relative geometry of the states of the two robots, the concept of a combating situation is introduced here. The possible outcomes of a combating situation at Step k are divided into four classes: neutral, defined by θ_k = Θ_1; advantage, defined by θ_k = Θ_2; disadvantage, defined by θ_k = Θ_3; and mutual disadvantage, defined by θ_k = Θ_4. The situation θ_k is a random variable that satisfies

    Σ_{l=1}^{4} p(θ_k = Θ_l) = 1.

The four combating situations are illustrated in Fig. 4. To be more specific: in the neutral situation, the two robots are located far from each other, in the near future the influence of the actions selected by both sides is negligible, and no obvious intention can be observed (Fig. 4(a)). Secondly, in the advantage situation, the blue robot occupies a dominant place, or in other words, the red robot is located in front of the blue one and the distance is at a middle level (Fig. 4(b)). Thirdly, the disadvantage situation is just the contrary, in which the roles of the two sides are exchanged (Fig. 4(c)). Finally, in the mutual disadvantage situation, the two robots move against each other, and both of them are in disadvantageous locations (Fig. 4(d)).

Fig. 4. The four different situations in the combat process: (a) neutral, (b) advantage, (c) disadvantage, (d) mutual disadvantage.

4.2 Likelihood Functions Depending on the Situation

Once the probability of the situation node is depicted, the transition probability p(j|i, a; t) can be deduced by the Bayesian theorem, which provides a method for calculating the probability using the likelihood function of the situation:

    p(j|i, a; t) := p(j(t)|i, a)
                  = Σ_{k=1}^{4} p(θ_i = Θ_k | i) p(j | θ_i = Θ_k, a)
                  = Σ_{k=1}^{4} [ p(θ_i = Θ_k) p(i | θ_i = Θ_k) / Σ_{l=1}^{4} p(θ_i = Θ_l) p(i | θ_i = Θ_l) ] p(j | θ_i = Θ_k, a).      (7)

It is assumed that the elements of the Combat-State are independent given the situation, which can be represented as

    p(j | θ_i) = p(d | θ_i) p(ATA | θ_i) p(AA | θ_i).                          (8)

Except for the neutral situation, the items d, AA, and ATA are considered to obey Gaussian distributions. The details of these distributions are listed in Table 1. In the neutral situation Θ_1, there is no special preference on the distribution of the variables d, AA, and ATA; they can take any value within their intervals, so their likelihood functions are uniform distributions. In the advantageous situation Θ_2, the blue robot is dominant, and a reasonable assumption is that the greater the advantage of the blue robot, the smaller the variable d will be, and the closer AA and ATA will be to the line of sight (LOS); it is assumed that these random variables obey zero-mean Gaussian distributions with different variances. Contrastingly, in the disadvantageous situation Θ_3, the variables also take the Gaussian form but the mean values of AA and ATA are both set to π. Finally, in the mutually disadvantageous situation Θ_4, the distribution of the variable d stays the same and the mean values of AA and ATA are set to π and 0 respectively.
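Before tabulating these densities, a small sketch of how the Bayes ratio in (7) turns the conditional densities of Table 1 (below) into situation weights for a given Combat-State. The prior over situations, the deviations delta_d and delta_a, and the window size N used here are placeholder values, not the paper's.

```python
# A minimal sketch of the situation posterior in (7): priors p(theta) and the
# Table 1 style densities give the weights w_{i,l}, l = 1..4.
import numpy as np

def gauss(x, mean, delta):
    return np.exp(-(x - mean) ** 2 / (2 * delta ** 2)) / (np.sqrt(2 * np.pi) * delta)

def situation_weights(xc, N=10, delta_d=2.0, delta_a=np.pi / 4, prior=None):
    """xc = [d, AA, ATA] with AA, ATA folded into [0, pi]."""
    d, AA, ATA = xc
    prior = np.full(4, 0.25) if prior is None else np.asarray(prior)
    lik = np.array([
        (1.0 / N) * (1.0 / np.pi) * (1.0 / np.pi),                                      # Theta_1: uniform
        gauss(d, 0, delta_d) * gauss(AA, 0, delta_a) * gauss(ATA, 0, delta_a),          # Theta_2: advantage
        gauss(d, 0, delta_d) * gauss(AA, np.pi, delta_a) * gauss(ATA, np.pi, delta_a),  # Theta_3: disadvantage
        gauss(d, 0, delta_d) * gauss(AA, np.pi, delta_a) * gauss(ATA, 0, delta_a),      # Theta_4: mutual
    ])
    post = prior * lik
    return post / post.sum()     # w_{i,l}, the Bayes ratio appearing in (7)
```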
Table 1. The conditional distributions of the Combat-State entries (d ∈ [0, N], AA ∈ [0, π], ATA ∈ [0, π]):
  Θ_1:  p(d) = 1/N,                              p(AA) = 1/π,                                    p(ATA) = 1/π
  Θ_2:  p(d) = e^{−d²/(2δ_d²)} / (√(2π) δ_d),    p(AA) = e^{−AA²/(2δ_a²)} / (√(2π) δ_a),         p(ATA) = e^{−ATA²/(2δ_a²)} / (√(2π) δ_a)
  Θ_3:  p(d) = e^{−d²/(2δ_d²)} / (√(2π) δ_d),    p(AA) = e^{−(π−AA)²/(2δ_a²)} / (√(2π) δ_a),     p(ATA) = e^{−(π−ATA)²/(2δ_a²)} / (√(2π) δ_a)
  Θ_4:  p(d) = e^{−d²/(2δ_d²)} / (√(2π) δ_d),    p(AA) = e^{−(π−AA)²/(2δ_a²)} / (√(2π) δ_a),     p(ATA) = e^{−ATA²/(2δ_a²)} / (√(2π) δ_a)

4.3 Construction of Q

According to Stroock, the solution of the Kolmogorov equation can be represented as

    P(t) = e^{tQ} = Σ_{m=0}^{∞} (t^m Q^m) / m!,   t ∈ [0, ∞).                   (9)

Considering a small neighborhood of t = 0, the first-order Taylor series expansion P(t) ≈ I + tQ is obtained, from which it can be inferred that Q = dP(t)/dt |_{t=0}.
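A small numerical illustration of (9) and of the first-order relation P(t) ≈ I + tQ: the generator of a short-horizon transition matrix can be recovered by finite differencing. The 2×2 generator below is an arbitrary example, not taken from the paper.

```python
import numpy as np
from scipy.linalg import expm

Q = np.array([[-0.8, 0.8],
              [ 0.3, -0.3]])          # conservative: each row sums to zero
dt = 1e-3
P = expm(dt * Q)                      # P(dt) = e^{dt Q}, equation (9)
Q_approx = (P - np.eye(2)) / dt       # finite-difference estimate of dP/dt at t = 0
print(np.max(np.abs(Q_approx - Q)))   # small: the first-order Taylor error is O(dt)
```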

On the other hand, from (7), the probability can be constructed by the method depicted in the previous section. Thus, given i and j with i ≠ j, the probability p(j|i, a; t) can be written as the weighted sum of the conditional probabilities

    p(j|i, a; t) = Σ_{l=1}^{4} w_{i,l} p(j | θ_i = Θ_l, a)                      (10)

in which

    w_{i,l} = p(θ_i = Θ_l) p(i | θ_i = Θ_l) / Σ_{ι=1}^{4} p(θ_i = Θ_ι) p(i | θ_i = Θ_ι).    (11)

From (10) and (11), it can be obtained that

    q(j|i, a) = dp(j|i, a; t)/dt |_{t=0} = d/dt [ Σ_{l=1}^{4} w_{i,l} p(j | θ_i = Θ_l, a) ] |_{t=0}.    (12)

Writing (12) via the first-order Taylor series expansion, it can be found that the value of q(j|i, a) is equal to the sum of the coefficients of the monomial (first-order) terms in the expansions of p(j | θ_i = Θ_l, a). It is easy to note that the monomial term of p(j | θ_i = Θ_1, a) is zero. p(j | θ_i = Θ_3, a) is used as an example to show how to get Q. Based on Table 1, it can be obtained as follows:

    p(j | θ_i = Θ_3, a) = 1/((2π)^{3/2} δ_d δ_a²) · exp(−(d + k_3^d(a) t)² / (2δ_d²)) · exp(−(π − AA − k_3^{AA}(a) t)² / (2δ_a²)) · exp(−(π − ATA − k_3^{ATA}(a) t)² / (2δ_a²)).    (13)

If we define Λ_l(a) = [k_l^d(a)/δ_d², k_l^{AA}(a)/δ_a², k_l^{ATA}(a)/δ_a²] for l = 2, 3, 4, C = 1/((2π)^{3/2} δ_d δ_a²), and x_c(j) = [d, AA, ATA], then we have

    p(j | θ_i = Θ_2, a) = C (1 − Λ_2(a) x_c(j)^T t) exp(−( d²/(2δ_d²) + AA²/(2δ_a²) + ATA²/(2δ_a²) )),                          (14)
    p(j | θ_i = Θ_3, a) = C (1 − Λ_3(a) (x_c(j) − [0, π, π])^T t) exp(−( d²/(2δ_d²) + (π − AA)²/(2δ_a²) + (π − ATA)²/(2δ_a²) )),  (15)
    p(j | θ_i = Θ_4, a) = C (1 − Λ_4(a) (x_c(j) − [0, π, 0])^T t) exp(−( d²/(2δ_d²) + (π − AA)²/(2δ_a²) + ATA²/(2δ_a²) )).        (16)

Here K_l(a) = [k_l^d(a), k_l^{AA}(a), k_l^{ATA}(a)], for l = 2, 3, 4, are the change rates of d, AA, ATA, which depend on the maneuvering capabilities. The transition rate can then be represented as

    q(j|i, a) = −C { w_{i,2} λ_2 Λ_2(a) x_c(j)^T + w_{i,3} λ_3 Λ_3(a) (x_c(j) − [0, π, π])^T + w_{i,4} λ_4 Λ_4(a) (x_c(j) − [0, π, 0])^T }    (17)

where

    λ_2 = exp(−( d²/(2δ_d²) + AA²/(2δ_a²) + ATA²/(2δ_a²) )),
    λ_3 = exp(−( d²/(2δ_d²) + (π − AA)²/(2δ_a²) + (π − ATA)²/(2δ_a²) )),
    λ_4 = exp(−( d²/(2δ_d²) + (π − AA)²/(2δ_a²) + ATA²/(2δ_a²) )).

To provide the CTMDP solution, the conservativeness assumption (1) must be sustained, i.e.

    q(i|i, a) = − Σ_{j≠i} q(j|i, a).

5. NUMERICAL EXAMPLES

To verify the effectiveness of the proposed CTMDP-based method on the pursuit-evasion problem, some numerical experiments are recorded in this section. The size of the combat window is set to a fixed N, and the parameters selected in the numerical examples are listed in Table 2.

5.1 Keep-Straight Situation

In the first numerical example, the predetermined path of the red robot is linear: its position and direction are initialized to a fixed x_r(0), with the direction pointing left. For the blue robot, four different initial states x_b(0) are considered. The CTMDP problem is solved by the method given in the previous sections, and an efficient policy can thus be obtained that makes the blue robot respond to the red one's actions (see Fig. 5). The policy helps the blue robot catch up with the red one along a short path under each of the four different initial states.
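For completeness, the sketch below shows how the rate formula (17) behind these examples might be evaluated for a single successor Combat-State, with the conservativeness condition (1) used afterwards to set the diagonal of the generator row. The change-rate vectors K_l(a), the deviations and the situation weights w are placeholders (the weights could come from the earlier Bayes sketch), and the resulting rows would be fed to the policy iteration of Algorithm 1.

```python
# A minimal sketch of evaluating q(j|i, a) as in (17), under assumed parameters.
import numpy as np

def rate(xc_j, w, K, delta_d=2.0, delta_a=np.pi / 4):
    """xc_j = [d, AA, ATA] of the successor state j; w = [w2, w3, w4] situation weights;
    K = {l: [k_d, k_AA, k_ATA]} change rates for l = 2, 3, 4 under the chosen action."""
    C = 1.0 / ((2 * np.pi) ** 1.5 * delta_d * delta_a ** 2)
    means = {2: np.zeros(3), 3: np.array([0, np.pi, np.pi]), 4: np.array([0, np.pi, 0.0])}
    scale = np.array([delta_d ** 2, delta_a ** 2, delta_a ** 2])
    total = 0.0
    for l in (2, 3, 4):
        diff = np.asarray(xc_j, float) - means[l]
        lam = np.exp(-np.sum(diff ** 2 / (2 * scale)))   # Gaussian factor lambda_l
        Lam = np.asarray(K[l], float) / scale            # vector Lambda_l(a)
        total += w[l - 2] * lam * float(Lam @ diff)
    return -C * total                                    # q(j|i, a), cf. (17)

# Conservativeness (1): for each row, q(i|i, a) = -sum of the off-diagonal entries.
```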

Fig. 5. The numerical results in the keep-straight situation: panels (a)-(d) correspond to the four initial states.

5.2 Keep-Turning Situation

In this example, the red robot moves in a circle, as shown in Fig. 6. The initial state of the red robot is again set to the same x_r(0). Two typical initial states x_b(0) are selected for the blue robot: one inside the circle and the other outside. Fig. 6 shows that the blue robot tries to track the red one as soon as possible, or in other words, tries to place the red one in its goal zone.

Fig. 6. The numerical results in the keep-turning situation.

Table 2. The parameters used in the numerical examples: the situation weights [w_{i,2}, w_{i,3}, w_{i,4}], the reward weight, expected distance and discount [ω_g, d̄, α], the deviations [δ_d, δ_a], and the change-rate vectors [K_2^T(a), K_3^T(a), K_4^T(a)] for each action a ∈ {L, R, S}.

6. CONCLUSION

In this paper, the pursuit-evasion process is modeled as a continuous-time Markov decision process. Based on the situations, a Bayesian approach is used to describe the transition probability matrix, and an efficient method is presented for constructing the transition rates in this context. In both numerical examples, the successive control decisions of the blue robot are made appropriately. The results indicate that this method provides a feasible solution. Future work will focus on extending it to a 3-D form. A potential difficulty in achieving this goal is that the size of the pursuit-evasion problem will certainly increase, accompanied by an increase in the number of state variables and control actions. Consequently, the method for developing a policy in the CTMDP should be extended to a more complicated one.

REFERENCES

Bertsekas, D. and Tsitsiklis, J. Neuro-Dynamic Programming. Athena Scientific.
Faiya, B. Learning in Pursuit-Evasion Differential Games Using Reinforcement Fuzzy Learning. Dissertation, Carleton University.
Guo, X.P. Continuous-Time Markov Decision Processes. Heidelberg: Springer.
Isler, V. et al. Roadmap based pursuit-evasion and collision avoidance.
LaValle, S. Planning Algorithms. Cambridge University Press.
McGrew, J.S., How, J.P., Williams, B., and Roy, N. Air combat strategy using approximate dynamic programming. AIAA Journal on Guidance, Control, and Dynamics.
Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach. Prentice Hall Series in Artificial Intelligence. Prentice Hall.
Shedied, S.A. Optimal Control for Two Player Dynamic Pursuit Evasion Game; The Herding Problem. Ph.D. thesis, Virginia Polytechnic Institute and State University, Virginia, USA.
Stroock, D.W. An Introduction to Markov Processes. Heidelberg: Springer.
Virtanen, K., Karelahti, J., and Raivio, T. Modeling air combat by moving horizon influence diagram game. Journal of Guidance, Control, and Dynamics.
Virtanen, K., Raivio, T., and Hämäläinen, R.P. Modeling pilot's sequential maneuvering decisions by a multistage influence diagram. Journal of Guidance, Control, and Dynamics.