Nash and Correlated Equilibria for Pursuit-Evasion Games Under Lack of Common Knowledge

Daniel T. Larsson, Georgios Kotsalis, Panagiotis Tsiotras

D. Larsson is a PhD student with the D. Guggenheim School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA, USA. daniel.larsson@gatech.edu. G. Kotsalis is with the H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA. gkotsalis3@gatech.edu. P. Tsiotras is a Professor with the D. Guggenheim School of Aerospace Engineering and the Institute for Robotics and Intelligent Machines, Georgia Institute of Technology, Atlanta, GA, USA. tsiotras@gatech.edu.

Abstract: The majority of work in pursuit-evasion games assumes perfectly rational players who are omnipotent and have complete knowledge of the environment and the capabilities of other agents and, consequently, are correct in their assumption of the game that is played. This is rarely the case in practice. More often than not, the players have different knowledge about the environment, either because of sensing limitations or because of prior experience. In this paper, we wish to relax this assumption and consider pursuit-evasion games in a stochastic setting, where the players involved in the game have different perspectives regarding the transition probabilities that govern the world dynamics. We show the existence of a (Nash) equilibrium in this setting and discuss the computational aspects of obtaining such an equilibrium. We also investigate a relaxation of this problem employing the notion of correlated equilibria. Finally, we demonstrate the approach using a gridworld example with two players in the presence of obstacles.

I. INTRODUCTION

Game theory studies multi-agent decision problems in which the payoff of each player depends not only on its own actions, but also on the actions of the other player(s). Traditionally, pursuit-evasion games have been studied in the context of differential games [1]-[3]. In these formulations, it is typically assumed that both players are in agreement in regards to the environment in which their interactions take place. Moreover, it is also assumed that the two players not only mutually agree about the environment they operate in, but are also aware of this agreement, a situation that is referred to in the literature as common knowledge [4]. In pursuit-evasion problems this leads to modeling the strategic interaction between the two players as a zero-sum differential game.

We would like to progressively relax this assumption. We start by investigating the case when the two players have potentially different perceptions of the evolution of the overall system. This implies that the equilibrium calculations for each player may be performed under potentially erroneous assumptions about the environment. In order to illustrate the peculiarities of the proposed problem formulation, we investigate a pursuit-evasion problem where the two players have different perceptions of the transition probabilities of the overall system, in the context of a stochastic game [5]. Stochastic games provide a discrete analog of differential games and offer a natural framework to study pursuit-evasion problems. The term stochastic game refers to the scenario where multiple players interact in a dynamic probabilistic environment comprising a finite number of states, where each of the players has finitely many actions at its disposal.
The motivation for studying this problem stems from situations of asymmetric and imprecise information on the underlying characteristics of the game as a result of deception [6], insufficient measurement capabilities [7], or erroneous modeling assumptions about the environment. Consider, for example, a pursuit-evasion scenario between two small UAVs in the presence of an external wind field. The presence of the wind field may have a large impact on the ensuing vehicle trajectories [8]-[11]. For a pursuit-evasion problem, accurate knowledge of the wind field (or lack thereof) will thereby impact the outcome of the game. Depending on the on-board sensors and the individual players' modeling assumptions, each player is faced with a problem where its perception about the environment (and, subsequently, the other player's dynamics) may differ. This discrepancy in each player's belief about the true state of the world is called lack of common knowledge. In our work, we investigate the impact of lack of common knowledge on a pursuit-evasion game in a stochastic setting. We show that lack of common knowledge leads naturally to a non-zero-sum setting, even if the reward structure for both players initially leads to a zero-sum game (under common knowledge). We show the existence of an equilibrium in such a case, and provide a discussion regarding the numerical aspects of computing equilibria for such general (that is, non-zero-sum) games in Section V. Finally, we present a simple two-player pursuit-evasion game where the players do not share the same understanding of the world dynamics.

II. NOTATION AND PRELIMINARIES

The set of non-negative integers is denoted by $\mathbb{Z}_{\geq 0}$, the set of positive integers by $\mathbb{Z}_+$, the set of real numbers by $\mathbb{R}$, and the set of non-negative reals by $\mathbb{R}_+$. For $n \in \mathbb{Z}_+$, let $\mathbb{R}^n$ denote the Euclidean $n$-space and $\mathbb{R}^n_+$ the non-negative orthant in $\mathbb{R}^n$. For $n, m \in \mathbb{Z}_+$, let $\mathbb{R}^{n \times m}$ denote the space of $n$-by-$m$ real matrices. Typically, vectors and matrices will be written with boldface letters. The transpose of the column vector $x \in \mathbb{R}^n$ is denoted by $x^T$ and, for $i \in \mathbb{Z}_+$, $[x]_i$ refers to the $i$-th entry of $x$. Similarly, for $i, j \in \mathbb{Z}_+$, $[A]_{ij}$ refers to the $(i,j)$-th entry of the matrix $A \in \mathbb{R}^{n \times m}$. The vector of 1's in $\mathbb{R}^n$ is denoted by $\mathbf{1}_n = [1 \cdots 1]^T$. The set $\Delta_n = \{x \in \mathbb{R}^n_+ : \sum_{i=1}^n [x]_i = 1\}$ denotes the simplex in $\mathbb{R}^n$.

Given $x \in \mathbb{R}^n$, $\|x\|_\infty = \max_{i \in \{1,\dots,n\}} |[x]_i|$ and $\|x\|_1 = \sum_{i=1}^n |[x]_i|$. Similarly, for $f : \{1,\dots,n\} \to \mathbb{R}$, $\|f\|_\infty = \max_{i \in \{1,\dots,n\}} |f(i)|$ and, for $x, y \in \mathbb{R}^n$, $\langle x, y \rangle$ denotes the inner product of the two vectors. For vectors representing joint probability distributions, $x_{j,k} = \Pr\{J = j \,\&\, K = k\}$ for random variables $J, K$. For payoff matrices $A^i \in \mathbb{R}^{n \times m}$, $A^i_{j,k}$ is the entry corresponding to Player $i$ playing action $k$ when all others play action $j$. For a given set $A$, the set of probability measures on $A$ will be denoted by $\mathcal{P}[A]$. The power set (the set of all subsets) of $A$ will be denoted by $2^A$.

III. PROBLEM FORMULATION

The notation and terminology used in this work is taken mainly from [5]. We consider a discrete-time, infinite-horizon differential game with two players, indexed by $i \in I = \{1, 2\}$. The game evolves on the finite state space $S = \{1, \dots, N\}$, where $N \in \mathbb{Z}_+$, $N \geq 2$. We will refer to discrete time instances as stages. At each stage $t \in \mathbb{Z}_{\geq 0}$ the system occupies a state, say, $s_t \in S$, and player $i$ chooses an action $a^i_t$ from the set $A^i = \{1, \dots, m_i\}$. For notational simplicity, we assume that the action sets are not state dependent. The sample space is $\Omega = (S \times A^1 \times A^2)^\infty$, and its sigma-algebra is denoted by $\Sigma = 2^\Omega$. For $t \in \mathbb{Z}_{\geq 0}$, $i \in I$, we define the maps $S_t : \Omega \to S$ and $A^i_t : \Omega \to A^i$ such that, for $\omega = (\omega_1, \omega_2, \dots) = ((s_0, a^1_0, a^2_0), (s_1, a^1_1, a^2_1), \dots)$, the relationships $S_t(\omega) = s_t$ and $A^i_t(\omega) = a^i_t$ hold. We assume that at each stage $t \in \mathbb{Z}_{\geq 0}$, and for all $\omega \in \Omega$, both players have access to the full state $S_t(\omega) \in S$ of the system.

We restrict ourselves to randomized Markovian strategies [12]. This means that when the system occupies the state $s \in S$, each player $i \in I$ will select a probability distribution on $A^i$ that depends on previous system states and actions only through the current state. Formally, the strategy of player $i \in I$ is given by the map $f^i : S \times A^i \to \mathbb{R}_+$ where, for all $s \in S$, one has

\[ \sum_{a^i \in A^i} f^i(s, a^i) = 1. \quad (1) \]

The set of all strategies of player $i \in I$ will be denoted by $F^i$. For each $i \in I$ and every $f^i \in F^i$, let us write $f^i = [f^{iT}_1, \dots, f^{iT}_N]^T$ where, for $s \in S$, $f^i_s = [f^i(s,1), \dots, f^i(s, m_i)]^T$. Each player $i \in I$ has its own perception about the evolution of the system, encoded in the transition map $p^i : S \times S \times A^1 \times A^2 \to \mathbb{R}_+$. In particular, player $i \in I$ believes that, for each $(s, a^1, a^2) \in S \times A^1 \times A^2$, the system will make a transition to the state $s' \in S$ with probability $p^i(s', s, a^1, a^2)$. Given a pair of strategies

\[ f = (f^1, f^2) \in F = F^1 \times F^2, \quad (2) \]

the evolution of the system on $S$ is Markovian. For a given initial condition $s_0 \in S$ and a fixed strategy pair $f \in F$, the measurable space $(\Omega, \Sigma)$ will be equipped with two measures, namely $\mu^i_{f,s_0}$, $i = 1, 2$, as follows. Given $t \in \mathbb{Z}_{\geq 0}$, consider the path $((s_0, a^1_0, a^2_0), (s_1, a^1_1, a^2_1), \dots, (s_t, a^1_t, a^2_t)) \in (S \times A^1 \times A^2)^{t+1}$ and let $O \in \Sigma$ denote the corresponding cylinder set, $O = \{\omega \in \Omega : S_k(\omega) = s_k, \ A^i_k(\omega) = a^i_k, \ i \in I, \ 0 \leq k \leq t\}$. Then, for $i \in I$ and $f \in F$, one has

\[ \mu^i_{f,s_0}[O] = \Big( \prod_{k=1}^{t} f^1(s_{k-1}, a^1_{k-1}) \, f^2(s_{k-1}, a^2_{k-1}) \, p^i(s_k, s_{k-1}, a^1_{k-1}, a^2_{k-1}) \Big) f^1(s_t, a^1_t) \, f^2(s_t, a^2_t). \]

Let $r^i : S \times A^1 \times A^2 \to \mathbb{R}$ denote the reward map of player $i \in I$. For each choice $(a^1, a^2) \in A^1 \times A^2$ at state $s \in S$, player $i$ will collect an immediate reward $r^i(s, a^1, a^2)$. For the class of differential games we have in mind (e.g., pursuit-evasion games) the players are in a purely antagonistic situation. We will therefore assume that $r^1(s, a^1, a^2) = -r^2(s, a^1, a^2)$ for all $(s, a^1, a^2) \in S \times A^1 \times A^2$. For $t \in \mathbb{Z}_{\geq 0}$, $i \in I$, define $R^i_t : \Omega \to \mathbb{R}$ as follows:

\[ R^i_t(\omega) = r^i(S_t(\omega), A^1_t(\omega), A^2_t(\omega)). \quad (3) \]
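To make the objects of the formulation concrete, the following minimal sketch (an illustration of this note, not the authors' implementation) stores a finite two-player discounted stochastic game with player-specific transition models as NumPy arrays; the class name, array shapes, and helper method are assumptions of the sketch.

    import numpy as np

    class StochasticGame:
        """Container for G = {beta, I, S, A1, A2, r1, r2, p1, p2} (illustrative).

        p[i] has shape (N, m1, m2, N): p[i][s, a1, a2, s'] is the probability,
        under player i's model, of moving to s' from s when (a1, a2) is played.
        r[i] has shape (N, m1, m2).
        """

        def __init__(self, beta, p1, p2, r1):
            assert 0.0 <= beta < 1.0
            self.beta = beta
            self.p = [np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)]
            # Purely antagonistic rewards: r2(s, a1, a2) = -r1(s, a1, a2).
            self.r = [np.asarray(r1, dtype=float), -np.asarray(r1, dtype=float)]
            self.N, self.m1, self.m2 = self.r[0].shape
            for pi in self.p:
                # Each player's model is a proper transition kernel.
                assert pi.shape == (self.N, self.m1, self.m2, self.N)
                assert np.allclose(pi.sum(axis=-1), 1.0)

        def random_markov_strategy(self, player):
            """A randomized Markovian strategy f^i: one row of probabilities per state."""
            m = self.m1 if player == 0 else self.m2
            f = np.random.rand(self.N, m)
            return f / f.sum(axis=1, keepdims=True)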
We may write $E^i_{f,s_0}[R^i_t] = E^i_f[R^i_t \mid S_0 = s_0]$ to denote the expected value of the reward $R^i_t$ of player $i \in I$ at stage $t \in \mathbb{Z}_{\geq 0}$ for a given strategy pair $f \in F$ and initial state $s_0 \in S$ under the measure $\mu^i_{f,s_0}$. Let $\beta \in [0, 1)$ denote a discount factor. For a fixed strategy $f \in F$, we denote the discounted value map of player $i$ by $v^i_f : S \to \mathbb{R}$, where, for $s_0 \in S$,

\[ v^i_f(s_0) = \sum_{t=0}^{\infty} \beta^t E^i_{f,s_0}[R^i_t]. \quad (4) \]

The vector of discounted values for player $i \in I$ of the streams of expected rewards resulting from the use of the strategy pair $f \in F$ for every initial state $s_0 \in S$ is then given by $v^i_f = [v^i_f(1), \dots, v^i_f(N)]^T$, with joint Q-values for each player defined by [5], [13]-[19]

\[ [Q^i(s, v^i_f)]_{a^1, a^2} = r^i(s, a^1, a^2) + \beta \sum_{s' \in S} p^i(s', s, a^1, a^2) \, v^i_f(s'). \quad (5) \]

The joint Q-values defined in (5) will serve as single-stage payoff matrices when formulating optimization problems for computing equilibrium strategies, as discussed in Section V. We will refer to the collection of objects $G = \{\beta, I, S, A^1, A^2, r^1, r^2, p^1, p^2\}$ as a discounted stochastic game. Note that if we assume, as we do in this paper, that $p^1 \neq p^2$, then $G$ is a stochastic game with lack of common knowledge. At this point we need to establish which pair $f = (f^1, f^2) \in F$ is a solution to the stochastic game $G$.

Definition 3.1: A pair of strategies $f^o = (f^{1,o}, f^{2,o}) \in F$ is a Nash equilibrium of the stochastic game $G$ if

\[ v^1_{(f^1, f^{2,o})} \leq v^1_{(f^{1,o}, f^{2,o})}, \quad \forall f^1 \in F^1, \quad (6a) \]
\[ v^2_{(f^{1,o}, f^2)} \leq v^2_{(f^{1,o}, f^{2,o})}, \quad \forall f^2 \in F^2. \quad (6b) \]

The above inequalities are understood component-wise with respect to the partial order induced by the non-negative orthant $\mathbb{R}^N_+$. The solution concept considered is that of a Nash equilibrium for a non-zero-sum stochastic game.
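Continuing the sketch above, the joint Q-values in (5) can be assembled state by state as $m_1 \times m_2$ payoff matrices under each player's own transition model; the function name and the reuse of the StochasticGame container are assumptions of the sketch.

    import numpy as np

    def joint_q_matrix(game, player, s, v):
        """Q^i(s, v) from Eq. (5): an m1-by-m2 single-stage payoff matrix.

        [Q]_{a1,a2} = r^i(s, a1, a2) + beta * sum_{s'} p^i(s', s, a1, a2) v(s'),
        where v is a length-N value vector and `game` is the sketch above.
        """
        r = game.r[player][s]              # shape (m1, m2)
        p = game.p[player][s]              # shape (m1, m2, N)
        return r + game.beta * p @ np.asarray(v, dtype=float)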

When $p^1 = p^2$, the stochastic game $G$ reduces to a traditional zero-sum stochastic game and the two sets of inequalities (6) reduce to a single set of saddle-point inequalities

\[ v_{(f^1, f^{2,o})} \leq v_{(f^{1,o}, f^{2,o})} \leq v_{(f^{1,o}, f^2)}, \quad \forall f^1 \in F^1, \ f^2 \in F^2, \]

where, for $f \in F$, it holds that $v_f = v^1_f = -v^2_f$. We first turn our attention to showing the existence of an equilibrium in the sense of (6).

Remark: For the two-player zero-sum case, it can be shown [5], [18] that

\[ v^i_{f^o}(s) = \mathrm{Nash}^i(Q^1_{f^o}(s, v^1_{f^o}), Q^2_{f^o}(s, v^2_{f^o})), \quad i \in I, \quad (7) \]

where $\mathrm{Nash}^i(\cdot)$ is the expected payoff to player $i$ when the players use $f^o \in F$. The computation of this value and other aspects of the problem are discussed in Section V.

IV. EXISTENCE OF EQUILIBRIA

The equilibria and the corresponding strategies are computed via the calculation of the best response maps for each agent. To this end, and for each $i \in I$, let $-i$ stand for the opponent of player $i$, i.e., $-i \in I \setminus \{i\}$.

Definition 4.1: The best response map for each player $i \in I$ is the set-valued map $B^i : F^{-i} \rightrightarrows F^i$ where, for $f^{-i} \in F^{-i}$,

\[ B^i(f^{-i}) = \arg\max_{f^i \in F^i} v^i_{(f^i, f^{-i})}. \quad (8) \]

The best response map for each player can be easily computed by noticing that, given $f^{-i} \in F^{-i}$, agent $i$ is faced with a one-sided optimization problem, which is a Markov decision process (MDP). To see this, let us take the perspective of Player 1, keeping in mind that the analysis for Player 2 is completely analogous. Accordingly, let us fix $f^2 \in F^2$ and define the transition map $p^1_{f^2} : S \times S \times A^1 \to \mathbb{R}_+$, where, for $(s', s, a^1) \in S \times S \times A^1$,

\[ p^1_{f^2}(s', s, a^1) = \sum_{a^2 \in A^2} p^1(s', s, a^1, a^2) \, f^2(s, a^2). \quad (9) \]

The immediate expected reward map for Player 1 as a function of the randomized strategy of Player 2 is $r^1_{f^2} : S \times A^1 \to \mathbb{R}$, where, for $(s, a^1) \in S \times A^1$,

\[ r^1_{f^2}(s, a^1) = \sum_{a^2 \in A^2} r^1(s, a^1, a^2) \, f^2(s, a^2). \quad (10) \]

The collection of objects $M^1_{f^2} = \{\beta, S, A^1, p^1_{f^2}, r^1_{f^2}\}$ is referred to as the MDP faced by Player 1 when Player 2 employs the randomized strategy $f^2 \in F^2$. Given $M^1_{f^2}$, let

\[ v^{1,o}_{f^2} = \max_{f^1 \in F^1} v^1_{(f^1, f^2)}. \quad (11) \]

Each element $f^1 \in B^1(f^2)$ then satisfies the relation $v^1_{(f^1, f^2)} = v^{1,o}_{f^2}$. The above optimization problem is standard in the theory of Markov decision processes and it is well posed for every $\beta \in [0, 1)$. Define now the composite response map for all players as $B : F \rightrightarrows F$ where, for $f \in F$,

\[ B(f) = \begin{bmatrix} B^1(f^2) \\ B^2(f^1) \end{bmatrix}. \quad (12) \]

The map $B$ denotes the best response map for the game. In view of (6), the existence of a fixed point of the map $B$ is equivalent to the existence of an equilibrium for the given stochastic game $G$. The existence of a fixed point of $B$ can be shown using Kakutani's fixed point theorem; see, for instance, [4].

Theorem 4.2 ([4]): Let $X \subset \mathbb{R}^n$ be a compact and convex set and let $F : X \rightrightarrows X$ be a set-valued map such that: i) for all $x \in X$ the set $F(x)$ is non-empty and convex; ii) the graph of $F$ is closed, i.e., for all sequences $\{x_k\}$ and $\{y_k\}$ in $X$ such that $y_k \in F(x_k)$ for all $k \in \mathbb{Z}_+$, $x_k \to x$ and $y_k \to y$ imply $y \in F(x)$. Then there exists $x^* \in X$ such that $x^* \in F(x^*)$.

Next, we show that the set-valued map $B$ satisfies the assumptions of Theorem 4.2.

Theorem 4.3: The stochastic game $G$ has at least one equilibrium.

Proof: The convexity of $F^i$, for $i \in I$, follows directly from its definition. For compactness, consider, for each $i \in I$, the map $H^i : \prod_{k=1}^{N} \Delta_{m_i} \to F^i$ where, for $u = [u_1^T, \dots, u_N^T]^T \in \prod_{k=1}^{N} \Delta_{m_i}$, $H^i[u] = f^i$ with $u_{kl} = f^i(k, l)$, $k \in S$, $l \in A^i$, and notice that $H^i$ is a continuous bijection with a compact domain. Since the $F^i$ are convex and compact for all $i \in I$, the same holds for $F$. Inspection of (8) suggests that, for $i \in I$ and $f^{-i} \in F^{-i}$ fixed, player $i$, in the process of calculating the elements of the set that constitutes $B^i(f^{-i})$, is faced with a one-sided optimization problem which is a Markov decision process (MDP). Since the solution of this problem always exists, an optimal best response to $f^2 \in F^2$ exists and thus $B^1(f^2)$ is not empty.
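The construction of $M^1_{f^2}$ in (9)-(10) amounts to averaging Player 1's transition and reward models over Player 2's randomized strategy. A minimal sketch, reusing the illustrative container above (the function name is an assumption of this note):

    import numpy as np

    def induced_mdp(game, f2):
        """Player 1's MDP M^1_{f2} from Eqs. (9)-(10) for a fixed randomized f^2.

        f2 has shape (N, m2), one distribution over A^2 per state.  Returns the
        transition array p1_f2 with shape (N, m1, N) and rewards r1_f2 with
        shape (N, m1).
        """
        f2 = np.asarray(f2, dtype=float)
        # (9): p^1_{f2}(s'| s, a1) = sum_{a2} p^1(s'| s, a1, a2) f^2(s, a2)
        p1_f2 = np.einsum('sabn,sb->san', game.p[0], f2)
        # (10): r^1_{f2}(s, a1) = sum_{a2} r^1(s, a1, a2) f^2(s, a2)
        r1_f2 = np.einsum('sab,sb->sa', game.r[0], f2)
        return p1_f2, r1_f2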
For $s \in S$ and $f^2 \in F^2$, let $P^1_{f^2,s} \in \mathbb{R}^{m_1 \times N}$ be defined by

\[ [P^1_{f^2,s}]_{ij} = p^1_{f^2}(j, s, i), \quad (13) \]

and let $r^1_{f^2,s} \in \mathbb{R}^{m_1}$ be defined by $[r^1_{f^2,s}]_i = r^1_{f^2}(s, i)$. Every element $f^1 \in B^1(f^2)$ satisfies, for each state $s \in S$, Bellman's equations of optimality [5], [12]

\[ [v^{1,o}_{f^2}]_s = \max_{f^1_s \in \Delta_{m_1}} \big\langle f^1_s, \, r^1_{f^2,s} + \beta P^1_{f^2,s} v^{1,o}_{f^2} \big\rangle. \quad (14) \]

From the linear programming formulation of MDPs (see, for instance, [5]), the optimal policies are obtained as the solution of a linear program and, as such, the convex combination of two optimal policies is optimal as well. Thus, we have shown that, for $f^2 \in F^2$, $B^1(f^2)$ is non-empty and convex and, by a symmetric argument, so is $B^2(f^1)$ for $f^1 \in F^1$. It follows that for every $f \in F$ the set $B(f)$ is non-empty and convex.
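Computationally, since the maximum in (14) of a linear function of $f^1_s$ over the simplex is attained at a vertex, $v^{1,o}_{f^2}$ can be obtained by standard value iteration on $M^1_{f^2}$, which also yields a deterministic element of $B^1(f^2)$ (anticipating the single-agent approach discussed in Section V). A hedged sketch, reusing the induced_mdp helper above:

    import numpy as np

    def best_response_value(p1_f2, r1_f2, beta, tol=1e-8, max_iter=10000):
        """Solve Bellman's equation (14) for M^1_{f2} by value iteration.

        Returns the optimal value v^{1,o}_{f2} and a greedy (deterministic)
        best-response policy, one action index per state.
        """
        N, m1 = r1_f2.shape
        v = np.zeros(N)
        for _ in range(max_iter):
            # Q(s, a1) = r^1_{f2}(s, a1) + beta * sum_{s'} p^1_{f2}(s'| s, a1) v(s')
            q = r1_f2 + beta * p1_f2 @ v          # shape (N, m1)
            v_new = q.max(axis=1)
            if np.max(np.abs(v_new - v)) < tol:   # contraction => convergence
                v = v_new
                break
            v = v_new
        return v, q.argmax(axis=1)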

It remains to show that the graph of $B$ is closed. To this end, consider a sequence $\{f^2_n\} \subset F^2$ such that $f^2_n \to \bar f^2 \in F^2$. Furthermore, consider the sequence $\{f^1_n\} \subset F^1$ such that, for every $n \in \mathbb{Z}_+$, $f^1_n \in B^1(f^2_n)$, and suppose that $f^1_n \to \bar f^1 \in F^1$. It must be shown that $\bar f^1 \in B^1(\bar f^2)$, which involves sensitivity arguments regarding the variation of the optimal value and strategy of Player 1 as a function of the strategy of Player 2. To this end, consider the map $V^1 : F^1 \times F^2 \times \mathbb{R}^N \to \mathbb{R}^N$ where, for any $(f^1, f^2, v) \in F^1 \times F^2 \times \mathbb{R}^N$ and $s \in S$,

\[ [V^1(f^1, f^2, v)]_s = \big\langle f^1_s, \, r^1_{f^2,s} + \beta P^1_{f^2,s} v \big\rangle. \]

The map $V^1$ is continuous, since each of its coordinates is polynomial in its arguments and linear in $f^1$. For a fixed $f^2 \in F^2$, define the Bellman operator $U^1_{f^2} : \mathbb{R}^N \to \mathbb{R}^N$ of the corresponding MDP of Player 1 ($M^1_{f^2}$) as

\[ [U^1_{f^2}(v)]_s = \max_{f^1_s \in \Delta_{m_1}} [V^1(f^1, f^2, v)]_s, \]

where $v \in \mathbb{R}^N$ and $s \in S$. It is known from standard theory of MDPs [12] that, for every $f^2 \in F^2$, the mapping $U^1_{f^2}$ is a contraction with respect to the max-norm and, as such, its fixed point $v^{1,o}_{f^2}$ is unique. Furthermore, the family of maps $\{U^1_{f^2} : f^2 \in F^2\}$ is equicontinuous. For $v \in \mathbb{R}^N$, consider the map $W^1_v : F^2 \to \mathbb{R}^N$ where, for $f^2 \in F^2$, $W^1_v(f^2) = U^1_{f^2}(v)$. The map $W^1_v$ is continuous on its domain and, for any closed and bounded set $B \subset \mathbb{R}^N$, the family of maps $\{W^1_v : v \in B\}$ is equicontinuous. It follows that if $\|v^{1,o}_{f^2_n} - v_0\|_\infty \to 0$, then $v^{1,o}_{\bar f^2} = v_0$. Employing the triangle inequality, one has

\[ \|V^1(\bar f^1, \bar f^2, v_0) - v_0\|_\infty \leq \|V^1(\bar f^1, \bar f^2, v_0) - V^1(f^1_n, f^2_n, v^{1,o}_{f^2_n})\|_\infty + \|V^1(f^1_n, f^2_n, v^{1,o}_{f^2_n}) - v_0\|_\infty = \|V^1(\bar f^1, \bar f^2, v_0) - V^1(f^1_n, f^2_n, v^{1,o}_{f^2_n})\|_\infty + \|v^{1,o}_{f^2_n} - v_0\|_\infty, \]

where the last equality uses the fact that $f^1_n \in B^1(f^2_n)$, so that $V^1(f^1_n, f^2_n, v^{1,o}_{f^2_n}) = v^{1,o}_{f^2_n}$. By taking the limit $n \to \infty$, it follows that $V^1(\bar f^1, \bar f^2, v_0) = v_0$, showing that $\bar f^1 \in B^1(\bar f^2)$. In other words, the best response map $B^1$ has a closed graph. By a similar argument, so does the best response map $B^2$, and therefore $B$ has a closed graph.

V. COMPUTATIONAL ASPECTS

We have shown that if two players have different perceptions of the environment, then the associated discounted stochastic game $G$ has an equilibrium. We now consider the computational aspects of the problem, discussing both Nash and correlated equilibria and how such solutions can be found in games under lack of common knowledge. An inherently attractive approach is to solve the single-sided MDPs $M^1_{f^2}$ and $M^2_{f^1}$. Solutions to these optimization problems are well known and can be found through linear programming or value iteration [5], [20]. The resulting optimal policy in $M^i_{f^{-i}}$ is stationary as well as deterministic and is constructed as follows. With knowledge of the optimal value of player $i$, $v^{i,o}_{f^{-i}}$, form the single-agent Q-values

\[ Q^{i,o}_{f^{-i}}(s, a^i) = r^i_{f^{-i}}(s, a^i) + \beta \sum_{s' \in S} p^i_{f^{-i}}(s', s, a^i) \, v^{i,o}_{f^{-i}}(s'), \quad (15) \]

where the optimal policy in $s \in S$ is then $f^{i,o}(a^i, s) = \delta(a^i - a^{i,o}(s))$, with $a^{i,o}(s) = \arg\max_{a^i \in A^i} Q^{i,o}_{f^{-i}}(s, a^i)$ [5], [15], [12]. While existence of equilibria in stochastic games is guaranteed, they need not be deterministic, and thus the above method will not generally yield a NE [15]. Secondly, the presence of other players that do not have fixed behavior (e.g., due to learning) induces non-stationarity in the MDP transitions defined by (9), violating assumptions of the reinforcement learning techniques that have been suggested to solve this problem [19], [21]-[23]. Instead, one can view zero-sum games as a collection of matrix games, replacing the single-stage payoff matrices with the joint Q-values given by (5) [5], [14]-[17], [23]. However, for the game to be zero-sum it must be of common knowledge ($p^1 = p^2$), so that the condition $v_f = v^1_f = -v^2_f$ is satisfied. In this case, finding a NE is done by recursively solving a linear program, as the traditional zero-sum structure guarantees the existence of uniquely-valued NE [5], [14]-[16], [24]. However, in the case addressed in this paper, this structure is lost since $p^1 \neq p^2$.
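For the common-knowledge (zero-sum) case described above, the stage game $Q(s, v)$ at each state is a matrix game whose value and maximin strategy follow from the classical linear program [24]. The sketch below uses scipy.optimize.linprog; the implementation choice is this note's, not the paper's.

    import numpy as np
    from scipy.optimize import linprog

    def zero_sum_value(Q):
        """Maximin strategy and value of a zero-sum matrix game (row player maximizes).

        LP: max t  s.t.  Q^T x >= t * 1,  x in the simplex.  Intended for the
        stage games Q(s, v) arising in zero-sum value iteration.
        """
        m1, m2 = Q.shape
        c = np.zeros(m1 + 1)
        c[-1] = -1.0                                 # minimize -t  <=>  maximize t
        A_ub = np.hstack([-Q.T, np.ones((m2, 1))])   # t - (Q^T x)_j <= 0 for all j
        b_ub = np.zeros(m2)
        A_eq = np.hstack([np.ones((1, m1)), np.zeros((1, 1))])
        b_eq = np.ones(1)                            # x sums to one
        bounds = [(0, None)] * m1 + [(None, None)]   # x >= 0, t free
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[:m1], res.x[-1]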
Unfortunately, such non-zero-sum games may have multiple NE, and finding these solutions amounts to solving a non-concave quadratic program, as follows [5], [25]-[27]. For each $s \in S$, let $Q^1(s, v^1)$, $Q^2(s, v^2)$ be the payoff matrices and take $f^1_s$ and $f^2_s$ to be the strategies for Players 1 and 2, respectively. The NE strategy and value for each player can be found by solving

\[ \max_{f^1_s, f^2_s, \alpha, \gamma} \ f^{1T}_s \big[ Q^1(s, v^1) + Q^2(s, v^2) \big] f^2_s - \alpha - \gamma, \quad (16) \]

subject to

\[ Q^1(s, v^1) f^2_s - \alpha \mathbf{1}_{m_1} \leq 0, \qquad Q^{2T}(s, v^2) f^1_s - \gamma \mathbf{1}_{m_2} \leq 0, \]
\[ \langle \mathbf{1}_{m_1}, f^1_s \rangle - 1 = 0, \qquad \langle \mathbf{1}_{m_2}, f^2_s \rangle - 1 = 0, \]
\[ [f^1_s]_j \geq 0, \ j \in \{1, \dots, m_1\}, \qquad [f^2_s]_k \geq 0, \ k \in \{1, \dots, m_2\}, \quad (17) \]

where $\alpha, \gamma \in \mathbb{R}$ are the expected payoffs to Players 1 and 2, respectively, when playing the strategy pair $(f^{1,o}_s, f^{2,o}_s)$ (i.e., $\mathrm{Nash}^1(\cdot) = \alpha$, $\mathrm{Nash}^2(\cdot) = \gamma$). Although the problem is non-concave, the objective function has a global maximum of zero corresponding to a NE [5], [26]. Thus, a feasible solution to (17) with objective function equal to zero is a NE [5]. Other approaches to the problem have been formulated that involve solving larger nonlinear programs, opting not to break the game into smaller state-wise components [28]-[30]. These approaches are high-dimensional and nonlinear, making the problem no easier to solve. While these programs provide means to find NE, they may converge to non-Nash solutions [22]. Thus, in order to arrive at quantifiable solutions, we consider the correlated equilibrium.

Definition 5.1: Consider the two-player game defined by payoff matrices $Q^1(s, v^1)$ and $Q^2(s, v^2)$, with $l = m_1 m_2$, and let $z^o \in \Delta_l$ be a joint strategy. The strategy $z^o$ is a correlated equilibrium if

\[ \sum_{a^{-i} \in A^{-i}} \big( Q^i(s, v^i)_{a^i_k, a^{-i}} - Q^i(s, v^i)_{a^i_h, a^{-i}} \big) \, z^o_{a^i_k, a^{-i}} \geq 0 \quad (18) \]

is satisfied for all $a^i_k, a^i_h \in A^i$, $i \in I$. The correlated equilibrium (CE) is defined on the joint strategy space and allows for correlation of player actions, thus relaxing the requirement that strategies be independent [13].
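The conditions of Definition 5.1 are linear in the joint strategy $z$, so a CE can be selected by a linear program (formalized as (19)-(20) in the next paragraph). The sketch below builds the constraint matrix and uses a utilitarian selection function (maximize the sum of expected payoffs), matching the choice made later in the numerical example; the flattening convention, function name, and use of scipy are assumptions of the sketch.

    import numpy as np
    from scipy.optimize import linprog

    def correlated_equilibrium(Q1, Q2):
        """Utilitarian correlated equilibrium of a bimatrix stage game (sketch).

        The variable z is a joint distribution over A^1 x A^2, flattened
        row-major (index a1 * m2 + a2).  Each row of L encodes Definition 5.1:
        switching from a recommended action to any alternative cannot raise a
        player's expected payoff.
        """
        m1, m2 = Q1.shape
        rows = []
        for k in range(m1):                  # Player 1: recommended a1 = k
            for h in range(m1):
                if h != k:
                    row = np.zeros((m1, m2))
                    row[k, :] = Q1[k, :] - Q1[h, :]
                    rows.append(row.ravel())
        for k in range(m2):                  # Player 2: recommended a2 = k
            for h in range(m2):
                if h != k:
                    row = np.zeros((m1, m2))
                    row[:, k] = Q2[:, k] - Q2[:, h]
                    rows.append(row.ravel())
        L = np.vstack(rows)                  # L z >= 0, n = sum_i m_i (m_i - 1) rows
        c = (Q1 + Q2).ravel()                # selection function: maximize joint payoff
        res = linprog(-c, A_ub=-L, b_ub=np.zeros(L.shape[0]),
                      A_eq=np.ones((1, m1 * m2)), b_eq=np.ones(1),
                      bounds=[(0, None)] * (m1 * m2))
        z = res.x.reshape(m1, m2)
        # Expected values CE^1, CE^2 of the players under the joint strategy z.
        return z, float((Q1 * z).sum()), float((Q2 * z).sum())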

Therefore, the set of CE contains the NE, since the NE are the cases where the joint strategy can be factored into independent distributions satisfying Definition 3.1 [13]. The CE assumes that a third party recommends actions to the players according to $z^o$ and, hence, agents are not aware of the actions given to their opponent(s). The attractiveness of CE is that, even in the general-sum case, they can be found by solving a linear program [13]. The optimization problem requires two components: (1) a selection function, defining the CE to seek, and (2) constraints encoding Definition 5.1. Given payoff matrices $Q^1(s, v^1)$, $Q^2(s, v^2)$, take $c \in \mathbb{R}^l$ as $c = \mathcal{F}(Q^1(s, v^1), Q^2(s, v^2))$, where $\mathcal{F}$ is a selection function. Then the solution to

\[ \max_{z \in \Delta_l} \ c^T z, \quad (19) \]

such that

\[ L z \geq 0, \quad (20) \]

is a CE, where $L \in \mathbb{R}^{n \times l}$, with $n = \sum_{i=1}^{2} m_i (m_i - 1)$, enforces Definition 5.1 [13], [31]. The expected value to player $i \in I$ is given by $\mathrm{CE}^i(\cdot)$ and is found by taking the expectation of the player's payoff under the strategy $z^o$ [13], [31]. Then, provided a model of the environment, finding a CE is reduced to a recursive linear program similar to value iteration [31]. The stochastic game is again viewed as a collection of static games, where the value of player $i \in I$ is updated according to

\[ [v^i_{k+1}]_s = \mathrm{CE}^i(Q^1_k(s, v^1_k), Q^2_k(s, v^2_k)). \quad (21) \]

Although the iterations suggested by (21) may not converge to a stationary strategy, the algorithm may give other meaningful solutions, such as a cyclic correlated equilibrium [31]. These concepts are next demonstrated with an example of a pursuit-evasion game under lack of common knowledge.
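Putting the pieces together, the recursion (21) can be sketched as a CE-based value iteration that reuses the illustrative joint_q_matrix and correlated_equilibrium helpers above; each player's stage-game payoffs are built with that player's own transition model, which is where the lack of common knowledge enters. As noted, the iterates may cycle rather than converge.

    import numpy as np

    def ce_value_iteration(game, num_iters=200):
        """CE-based value iteration, Eq. (21) (sketch; convergence not guaranteed)."""
        v = [np.zeros(game.N), np.zeros(game.N)]
        for _ in range(num_iters):
            v_next = [np.zeros(game.N), np.zeros(game.N)]
            for s in range(game.N):
                Q1 = joint_q_matrix(game, 0, s, v[0])   # Player 1's model and values
                Q2 = joint_q_matrix(game, 1, s, v[1])   # Player 2's model and values
                _, ce1, ce2 = correlated_equilibrium(Q1, Q2)
                v_next[0][s], v_next[1][s] = ce1, ce2
            v = v_next
        return v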
VI. NUMERICAL EXAMPLE

We consider a two-player pursuit-evasion game on a 6x6 grid world, as shown in Fig. 1. The evading player wins the game and is awarded +100 points for reaching either of the two evade states alone (it may not co-occupy the state with the pursuer). Capture occurs if the evader occupies the same cell as the pursuer, or any one of the neighboring cells in the four cardinal directions, thereby paying -100 points and losing the game. If both players navigate into obstacles on the same move, the game is over with 0 points given to each (draw). In the event of a crash, the game is awarded to the non-crashing player and classified as either evasion (pursuer crash) or capture (evader crash).

Fig. 1: Grid for the pursuit-evasion game: the start, evade (goal), and obstacle cells are indicated.

For moves within the game, the evading player is given rewards according to $r^2(s) = \kappa \|p(s)\|$, where $\kappa > 0$ and $p(s) \in \mathbb{R}^2$ is the relative x-y position of the players when in $s \in S$. Since the game is zero-sum, $r^1(s) = -r^2(s)$ for all $s \in S$. Each agent has the same action set and can select to move in one of the four cardinal directions of the grid by selecting up (U), down (D), left (L) or right (R); therefore $A^1 = A^2 = \{U, D, L, R\}$. The discount factor is $\beta = 0.7$. Transitions are generally stochastic and are created by providing each agent an action success probability and distributing the remaining probability uniformly over neighboring cells. We can view the stochastic transitions as a UAV navigating an environment in the presence of a disturbance. In what follows, we seek CE that maximize the joint rewards of the players, with the resulting strategy simulated over 5,000 games. To understand how differences in asymmetric environment knowledge change the game, we consider the following scenarios:

CKD: The game is common knowledge with deterministic world dynamics, with both agents able to deduce the absence of a disturbance.
CKS: The agents have the same perception of the world and are both able to sense a disturbance.
LOCK-EV: Lack-of-common-knowledge game in which the evader is aware of a disturbance and the pursuer is not.
LOCK-PS: Lack-of-common-knowledge game in which the pursuer is aware of the lack of a disturbance and the evader is not.

Fig. 2 shows sample trajectories for CKD and CKS, with Table I showing results for all cases. Fig. 3 displays player movements under lack of common knowledge.

Fig. 2: Sample trajectories for the stochastic game with common knowledge. Both cases show the game won by the evading player. (a) CKD; (b) CKS.

In CKD the evading player wins all games, electing to pass directly next to the obstacles while maintaining its 2-cell advantage. Contrast this with CKS, where both agents are aware of the disturbance, but the probabilistic nature of the environment makes evasion substantially more difficult, since a single mistake when executing a maneuver may lead to capture. Further, note that in the sample trajectory for this case (Fig. 2(b)) the evading player tends to pass further away from the obstacles. Interestingly, LOCK-EV does not differ greatly from CKS, albeit there is a greater number of games in which the evader wins, with an increased number of pursuer crashes, indicative of the pursuer's false information regarding the environment. In LOCK-PS, the evader again navigates away from obstacles, with the pursuer able to threaten interception, which forces the evader to back-track, leading to eventual capture.

Fig. 3: Sample trajectories for the stochastic game with lack of common knowledge. Both cases show the game won by the pursuing player. (a) LOCK-EV; (b) LOCK-PS.

TABLE I: Capture and evasion percentages (capture, capture due to evader crash, evasion, evasion due to pursuer crash, and average number of moves) for the CKD, CKS, LOCK-EV, and LOCK-PS scenarios. Results are for 5,000 simulated games.

In this case, the success of evasion is drastically lower than in CKD, in which the evader holds the correct understanding of the environment. Nonetheless, the evader manages to prolong capture by requiring a substantial number of moves, on average, until capture occurs.

VII. CONCLUSION

In this paper we considered the problem where two players engage in a pursuit-evasion scenario in a stochastic setting while having different perspectives about the probabilistic environment at their disposal. Under this assumption, we have established that there exists an equilibrium for the corresponding non-zero-sum stochastic game and provided an illustrative example utilizing the correlated equilibrium.

ACKNOWLEDGMENTS

Support for this work has been provided by ONR awards N and N and NSF award CMMI.

REFERENCES

[1] T. Basar and G. J. Olsder, Dynamic Noncooperative Game Theory. Philadelphia: SIAM.
[2] R. Isaacs, Differential Games: A Mathematical Theory with Applications to Warfare and Pursuit, Control and Optimization. Dover Publications.
[3] A. Balakrishnan, Stochastic Differential Systems, I. Springer-Verlag.
[4] M. J. Osborne and A. Rubinstein, A Course in Game Theory. Cambridge, MA: MIT Press.
[5] J. Filar and K. Vrieze, Competitive Markov Decision Processes. New York: Springer-Verlag.
[6] Y. Yavin, Pursuit-evasion differential games with deception or interrupted observation, Computers & Mathematics with Applications, vol. 13.
[7] W. Sun and P. Tsiotras, Pursuit evasion game of two players under an external flow field, in Proceedings of the American Control Conference, Chicago, IL, July 2015.
[8] R. P. Anderson, E. Bakolas, D. Milutinović, and P. Tsiotras, Optimal feedback guidance of a small aerial vehicle in a stochastic wind, Journal of Guidance, Control, and Dynamics, vol. 36, no. 4, July 2013.
[9] E. Bakolas and P. Tsiotras, Feedback navigation in an uncertain flow-field and connections with pursuit strategies, AIAA Journal of Guidance, Control, and Dynamics, vol. 35, no. 4, July-August 2012.
[10] M. Soulignac, Feasible and optimal path planning in strong current fields, IEEE Transactions on Robotics, vol. 27, no. 1, Feb.
[11] Y. Ketema and Y. J. Zhao, Controllability and reachability for micro-aerial-vehicle trajectory planning in winds, AIAA Journal of Guidance, Control and Dynamics, vol. 33, no. 3, 2010.
[12] M. L. Puterman, Markov Decision Processes. Wiley-Interscience.
[13] A. Greenwald and K. Hall, Correlated Q-learning, in Proceedings of the 20th International Conference on Machine Learning, 2003.
[14] M. L. Littman, Markov games as a framework for multi-agent reinforcement learning, in Eleventh International Conference on Machine Learning, 1994.
[15] M. L. Littman, Value-function reinforcement learning in Markov games, Journal of Cognitive Systems Research, vol. 1, 2001.
[16] M. L. Littman, Friend-or-Foe Q-learning in general-sum games, in Eighteenth International Conference on Machine Learning, 2001.
[17] E. Munoz de Cote and M. L. Littman, A polynomial-time Nash equilibrium algorithm for repeated stochastic games, in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, 2008.
[18] J. Hu and M. P. Wellman, Nash Q-learning for general-sum stochastic games, Journal of Machine Learning Research, vol. 4, 2003.
[19] M. Bowling and M. Veloso, An analysis of stochastic game theory for multiagent reinforcement learning, Carnegie Mellon University, Tech. Rep., October 2000.
[20] R. S. Sutton and A. G. Barto, Reinforcement Learning. MIT Press.
[21] P. Stone and M. Veloso, Multiagent systems: A survey from a machine learning perspective, Autonomous Robots, vol. 8, 2000.
[22] M. H. Bowling, Convergence problems of general-sum multiagent reinforcement learning, International Conference on Machine Learning, 2000.
[23] L. Buşoniu, R. Babuška, and B. De Schutter, A comprehensive survey of multiagent reinforcement learning, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 38, no. 2, March 2008.
[24] A. W. Tucker, Solving a matrix game by linear programming, IBM Journal of Research and Development, vol. 4, no. 5, November.
[25] O. Mangasarian and H. Stone, Two-person nonzero-sum games and quadratic programming, Journal of Mathematical Analysis and Applications, vol. 9.
[26] O. L. Mangasarian, Equilibrium points of bimatrix games, Journal of the Society for Industrial and Applied Mathematics, vol. 12, no. 4, December.
[27] C. Lemke and J. T. Howson Jr., Equilibrium points of bimatrix games, Journal of the Society for Industrial and Applied Mathematics, vol. 12, no. 2, June.
[28] J. A. Filar and T. A. Schultz, Nonlinear programming and stationary strategies in stochastic games, Mathematical Programming, vol. 35.
[29] T. E. S. Raghavan and J. A. Filar, Algorithms for stochastic games - a survey, Methods and Models of Operations Research, vol. 35.
[30] M. Breton, A. Haurie, and J. A. Filar, On the computation of equilibria in discounted stochastic dynamic games, Journal of Economic Dynamics and Control, vol. 10.
[31] M. Zinkevich, A. Greenwald, and M. Littman, Cyclic equilibria in Markov games, Advances in Neural Information Processing Systems 18, 2006.


More information

VALUATIVE CRITERIA BRIAN OSSERMAN

VALUATIVE CRITERIA BRIAN OSSERMAN VALUATIVE CRITERIA BRIAN OSSERMAN Intuitively, one can think o separatedness as (a relative version o) uniqueness o limits, and properness as (a relative version o) existence o (unique) limits. It is not

More information

Markov Decision Processes and Dynamic Programming

Markov Decision Processes and Dynamic Programming Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course How to model an RL problem The Markov Decision Process

More information

Markov Decision Processes

Markov Decision Processes Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour

More information

Circuit Complexity / Counting Problems

Circuit Complexity / Counting Problems Lecture 5 Circuit Complexity / Counting Problems April 2, 24 Lecturer: Paul eame Notes: William Pentney 5. Circuit Complexity and Uniorm Complexity We will conclude our look at the basic relationship between

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

Optimal Efficient Learning Equilibrium: Imperfect Monitoring in Symmetric Games

Optimal Efficient Learning Equilibrium: Imperfect Monitoring in Symmetric Games Optimal Efficient Learning Equilibrium: Imperfect Monitoring in Symmetric Games Ronen I. Brafman Department of Computer Science Stanford University Stanford, CA 94305 brafman@cs.stanford.edu Moshe Tennenholtz

More information

Markov Decision Processes and Dynamic Programming

Markov Decision Processes and Dynamic Programming Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course In This Lecture A. LAZARIC Markov Decision Processes

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo January 29, 2012 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

On the Girth of (3,L) Quasi-Cyclic LDPC Codes based on Complete Protographs

On the Girth of (3,L) Quasi-Cyclic LDPC Codes based on Complete Protographs On the Girth o (3,L) Quasi-Cyclic LDPC Codes based on Complete Protographs Sudarsan V S Ranganathan, Dariush Divsalar and Richard D Wesel Department o Electrical Engineering, University o Caliornia, Los

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo September 6, 2011 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

The Clifford algebra and the Chevalley map - a computational approach (detailed version 1 ) Darij Grinberg Version 0.6 (3 June 2016). Not proofread!

The Clifford algebra and the Chevalley map - a computational approach (detailed version 1 ) Darij Grinberg Version 0.6 (3 June 2016). Not proofread! The Cliord algebra and the Chevalley map - a computational approach detailed version 1 Darij Grinberg Version 0.6 3 June 2016. Not prooread! 1. Introduction: the Cliord algebra The theory o the Cliord

More information

Stochastic Game Approach for Replay Attack Detection

Stochastic Game Approach for Replay Attack Detection Stochastic Game Approach or Replay Attack Detection Fei Miao Miroslav Pajic George J. Pappas. Abstract The existing tradeo between control system perormance and the detection rate or replay attacks highlights

More information

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012 CSE 573: Artificial Intelligence Autumn 2012 Reasoning about Uncertainty & Hidden Markov Models Daniel Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer 1 Outline

More information

Reinforcement Learning and Deep Reinforcement Learning

Reinforcement Learning and Deep Reinforcement Learning Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q

More information

Solving Continuous Linear Least-Squares Problems by Iterated Projection

Solving Continuous Linear Least-Squares Problems by Iterated Projection Solving Continuous Linear Least-Squares Problems by Iterated Projection by Ral Juengling Department o Computer Science, Portland State University PO Box 75 Portland, OR 977 USA Email: juenglin@cs.pdx.edu

More information

CHAPTER 1: INTRODUCTION. 1.1 Inverse Theory: What It Is and What It Does

CHAPTER 1: INTRODUCTION. 1.1 Inverse Theory: What It Is and What It Does Geosciences 567: CHAPTER (RR/GZ) CHAPTER : INTRODUCTION Inverse Theory: What It Is and What It Does Inverse theory, at least as I choose to deine it, is the ine art o estimating model parameters rom data

More information