Zero-sum semi-Markov games in Borel spaces: discounted and average payoff


Zero-sum semi-Markov games in Borel spaces: discounted and average payoff

Fernando Luque-Vásquez
Departamento de Matemáticas
Universidad de Sonora, México

May 2002

Abstract

We study two-person zero-sum semi-Markov games in Borel spaces with a possibly unbounded payoff function, under discounted and average payoff criteria. Conditions are given for the existence of the value of the game, for the existence of optimal strategies for both players, and for a characterization of the discounted optimal stationary strategies.

Mathematics Subject Classification: 91A15, 91A25, 90C40.

Key Words: Zero-sum semi-Markov games, Borel spaces, discounted payoff, average payoff, Shapley equation.

1 Introduction

This paper deals with two-person zero-sum semi-Markov games with Borel state and action spaces and a possibly unbounded payoff function, under discounted and average payoff criteria. Under suitable conditions on the transition law, the payoff function and the distribution of the transition times, we show the existence of the value of the game and the existence of optimal pairs of stationary strategies for both the discounted and the average criteria. We also obtain a characterization of the pairs of stationary strategies that are discounted optimal.

In the discounted case, our approach is based on contraction conditions. Our assumptions ensure that the minimax operator is a contraction map with respect to a weighted norm, and we show that its (unique) fixed point is the value of the game. Fan's Minimax Theorem [3] and measurable selection theorems are used to prove the existence of optimal stationary strategies. This approach has been used by several authors for Markov games (for example, [1], [13], [16]) and for semi-Markov games in [10], which considers a countable state space and a bounded payoff function.

For the analysis of the average criterion we suppose, besides the usual continuity/compactness requirements, that the payoff function satisfies a growth condition, and also that the embedded
Markov chains have suitable stability properties. A key step is the use of a data transformation that associates a Markov game with our original semi-Markov game. This transformation has been used by several authors (for example, [4], [5], [10] and [11]).

Markov stochastic games with discounted and average payoff have been widely studied (for example, [1], [8], [12], [13], [15], [16] and [18]) but, to the best of our knowledge, the only paper that studies zero-sum semi-Markov stochastic games is [10]. Our main results generalize to the semi-Markov context some results in [10], [13], [15] and [16].

The remainder of the paper is organized as follows. In Section 2 the semi-Markov game model is described. Next, in Section 3, the discounted and the average criteria are introduced. In Section 4 we introduce the assumptions and present our main results, Theorems 4.3 and 4.4, which are proved in Sections 5 and 6, respectively.

Terminology and notation. Given a Borel space X, i.e. a Borel subset of a complete and separable metric space, we denote by B(X) its Borel σ-algebra. P(X) denotes the family of probability measures on X endowed with the weak topology. If X and Y are Borel spaces, we denote by P(X|Y) the family of transition probabilities (or stochastic kernels) from Y to X. For a transition probability f ∈ P(X|Y), we write its values as f(y)(B) or f(B|y) for B ∈ B(X) and y ∈ Y. If X = Y, then f is called a Markov transition probability on X.

2 The semi-Markov game

A semi-Markov game model is defined by the collection

    (X, A, B, K_A, K_B, Q, F, r),   (1)

where X is the state space, and A and B are the action spaces for players 1 and 2, respectively. These spaces are assumed to be Borel spaces, whereas K_A ∈ B(X × A) and K_B ∈ B(X × B) are the constraint sets. For each x ∈ X, the x-section

    A(x) := {a ∈ A : (x, a) ∈ K_A}

represents the set of admissible actions for player 1 in state x. Similarly, the x-section

    B(x) := {b ∈ B : (x, b) ∈ K_B}

denotes the set of admissible actions for player 2 in state x. Let

    K := {(x, a, b) : x ∈ X, a ∈ A(x), b ∈ B(x)},

which is a Borel subset of X × A × B (see [14]). Moreover, Q(· | x, a, b) is a stochastic kernel on X given K called the transition law, and F(· | x, a, b) is a probability distribution function on R_+ := [0, ∞) given K called the transition time distribution. Finally, r is a real-valued measurable function on K called the payoff function; it represents the reward for player 1 and the cost for player 2.

The game is played as follows: if x is the state of the game at some decision (or transition) epoch and the players independently choose actions a ∈ A(x) and b ∈ B(x), then player 1 receives an immediate reward r(x, a, b), player 2 incurs a cost r(x, a, b), and the system moves to a new state according to the probability measure Q(· | x, a, b). The time until the transition occurs is a random variable with distribution function F(· | x, a, b).
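To fix ideas, the dynamics just described are easy to simulate in a finite model. The sketch below is our illustration only (the paper works with general Borel spaces): the arrays r and Q and the sampler for F are hypothetical stand-ins for the model data in (1).

import numpy as np

rng = np.random.default_rng(0)

def play_step(x, a, b, r, Q, sample_sojourn):
    """One decision epoch: player 1 receives r(x,a,b) (player 2 pays it),
    the state moves according to Q(.|x,a,b), and the transition time delta
    is drawn from F(.|x,a,b), here represented by a sampler."""
    reward = r[x, a, b]                        # immediate payoff r(x, a, b)
    y = rng.choice(Q.shape[-1], p=Q[x, a, b])  # next state ~ Q(.|x, a, b)
    delta = sample_sojourn(x, a, b)            # sojourn time ~ F(.|x, a, b)
    return reward, y, delta

For instance, with exponential transition times one could take sample_sojourn = lambda x, a, b: rng.exponential(tau[x, a, b]) for an array tau of mean sojourn times.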

Let H_0 := X and H_n := (K × R_+) × H_{n−1} for n = 1, 2, .... For each n, an element

    h_n = (x_0, a_0, b_0, δ_1, ..., x_{n−1}, a_{n−1}, b_{n−1}, δ_n, x_n)

of H_n represents the history of the game up to the nth decision epoch. A strategy π for player 1 is a sequence π = {π_n : n = 0, 1, ...} of stochastic kernels π_n ∈ P(A|H_n) such that

    π_n(A(x_n) | h_n) = 1 for all h_n ∈ H_n.

We denote by Π the family of all strategies for player 1. A strategy π = {π_n} is called a Markov strategy if π_n ∈ P(A|X) for each n = 0, 1, ..., that is, if each π_n depends only on the current state x_n of the system. The set of all Markov strategies for player 1 is denoted by Π_M. Let Φ_1 denote the class of all transition probabilities f ∈ P(A|X) such that f(x) ∈ P(A(x)). A Markov strategy π = {π_n} is said to be a stationary strategy if there exists f ∈ Φ_1 such that π_n = f for each n = 0, 1, .... In this case the strategy is identified with f, and the set of all stationary strategies for player 1 with Φ_1. The sets Γ, Γ_M and Φ_2 of all strategies, all Markov strategies and all stationary strategies for player 2 are defined similarly.

Let (Ω, F) be the canonical measurable space consisting of the sample space Ω := (X × A × B × R_+)^∞ and its product σ-algebra F. For each pair of strategies (π, γ) ∈ Π × Γ and each initial state x, there exist a unique probability measure P_x^{πγ} and a stochastic process {(x_n, a_n, b_n, δ_{n+1}) : n = 0, 1, ...}, where x_n, a_n and b_n represent the state and the actions of players 1 and 2, respectively, at the nth decision epoch, whereas δ_n represents the time between the (n−1)th and the nth decision epochs. E_x^{πγ} denotes the expectation operator with respect to P_x^{πγ}.

3 Optimality criteria

Definition 3.1. For α > 0, x ∈ X and (π, γ) ∈ Π × Γ, the infinite-horizon total expected α-discounted payoff is

    V(x, π, γ) := E_x^{πγ} Σ_{k=0}^∞ e^{−αT_k} r(x_k, a_k, b_k),   (2)

where T_0 := 0 and T_n := T_{n−1} + δ_n for n = 1, 2, ....

Definition 3.2. For x ∈ X and (π, γ) ∈ Π × Γ, the long-run expected average payoff is defined as

    J(x, π, γ) := lim inf_{n→∞} [E_x^{πγ} Σ_{k=0}^{n−1} r(x_k, a_k, b_k)] / [E_x^{πγ}(T_n)].   (3)

To define our optimality criteria we need the following concepts. The functions on X given by

    L_α(x) := sup_{π∈Π} inf_{γ∈Γ} V(x, π, γ) and U_α(x) := inf_{γ∈Γ} sup_{π∈Π} V(x, π, γ)   (4)
are called the lower value and the upper value, respectively, of the (expected) α-discounted payoff game. Clearly, L_α(·) ≤ U_α(·) in general; if L_α(x) = U_α(x) for all x ∈ X, then the common value is called the value of the semi-Markov game and is denoted by V_α(x).

Definition 3.3. (a) A strategy π* ∈ Π is said to be α-optimal for player 1 if

    U_α(x) ≤ V(x, π*, γ) for all γ ∈ Γ, x ∈ X.

(b) A strategy γ* ∈ Γ is said to be α-optimal for player 2 if

    V(x, π, γ*) ≤ L_α(x) for all π ∈ Π, x ∈ X.

(c) A pair (π*, γ*) ∈ Π × Γ is said to be an α-optimal pair of strategies if, for all x ∈ X,

    U_α(x) = inf_{γ∈Γ} V(x, π*, γ) and L_α(x) = sup_{π∈Π} V(x, π, γ*).

For the expected average semi-Markov game, the lower value L(·), the upper value U(·), the value V(·) and optimal strategies are defined similarly. We note that the existence of an optimal strategy for either player 1 or player 2 implies that the game has a value.

Remark 3.4. Let

    β_α(x, a, b) := ∫_0^∞ e^{−αt} F(dt | x, a, b)   (5)

and

    τ(x, a, b) := ∫_0^∞ t F(dt | x, a, b).

Then, using properties of the conditional expectation, we can write

    V(x, π, γ) = E_x^{πγ} [r(x_0, a_0, b_0) + Σ_{n=1}^∞ (Π_{k=0}^{n−1} β_α(x_k, a_k, b_k)) r(x_n, a_n, b_n)]   (6)

and

    J(x, π, γ) = lim inf_{n→∞} [E_x^{πγ} Σ_{k=0}^{n−1} r(x_k, a_k, b_k)] / [E_x^{πγ} Σ_{k=0}^{n−1} τ(x_k, a_k, b_k)].

4 Assumptions and main results

The problem we are concerned with is to show the existence of optimal strategies for players 1 and 2, which requires imposing suitable assumptions on the semi-Markov game model. In particular, the first one is a regularity condition ensuring that infinitely many transitions cannot occur in a finite time interval.
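As a concrete illustration of the quantities in Remark 3.4 (our example, not part of the paper): if the transition times are exponentially distributed, F(t | x, a, b) = 1 − e^{−λ(x,a,b)t} with rates λ(x, a, b) > 0, then the integrals in (5) have closed forms:

    \beta_\alpha(x,a,b) \;=\; \int_0^\infty e^{-\alpha t}\,\lambda e^{-\lambda t}\,dt \;=\; \frac{\lambda}{\lambda+\alpha},
    \qquad
    \tau(x,a,b) \;=\; \int_0^\infty t\,\lambda e^{-\lambda t}\,dt \;=\; \frac{1}{\lambda},
    \qquad \lambda := \lambda(x,a,b).

If the rates are bounded above on K, say λ ≤ λ_max, then Assumption 1 below holds with any θ > 0 and ε = e^{−θλ_max}, and one gets sup_K β_α = λ_max/(λ_max + α) < 1 and inf_K τ = 1/λ_max > 0, in agreement with Lemma 4.1 below.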

Assumption 1 (A1). There exist θ > 0 and ε > 0 such that

    F(θ | x, a, b) ≤ 1 − ε for all (x, a, b) ∈ K.

An important consequence of this assumption is the following.

Lemma 4.1. If A1 holds, then

    ρ_α := sup_{(x,a,b)∈K} β_α(x, a, b) < 1   (7)

and

    inf_{(x,a,b)∈K} τ(x, a, b) > 0.   (8)

Proof. Let (x, a, b) ∈ K be fixed. Integrating by parts in (5), we have

    β_α(x, a, b) = α ∫_0^∞ e^{−αt} F(t | x, a, b) dt
                = α [∫_0^θ e^{−αt} F(t | x, a, b) dt + ∫_θ^∞ e^{−αt} F(t | x, a, b) dt]
                ≤ (1 − ε)(1 − e^{−αθ}) + e^{−αθ}
                = 1 − ε + ε e^{−αθ} < 1.

Since (x, a, b) ∈ K was arbitrary, we get (7). On the other hand,

    τ(x, a, b) ≥ ∫_θ^∞ t F(dt | x, a, b) ≥ θε,

which yields (8).

Our second assumption is a combination of standard continuity and compactness requirements, together with a growth condition on the payoff function.

Assumption 2 (A2). (a) For each x ∈ X, the sets A(x) and B(x) are compact.

(b) For each (x, a, b) ∈ K, r(x, ·, b) is upper semicontinuous on A(x), and r(x, a, ·) is lower semicontinuous on B(x).

(c) For each (x, a, b) ∈ K and each bounded measurable function v on X, the functions

    a ↦ ∫ v(y) Q(dy | x, a, b) and b ↦ ∫ v(y) Q(dy | x, a, b)

are continuous on A(x) and B(x), respectively.

(d) There exist a measurable function w : X → [1, ∞) and a positive constant m such that

    |r(x, a, b)| ≤ m w(x) for all (x, a, b) ∈ K.

In addition, part (c) holds when v is replaced with w.
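The bound 1 − ε + ε e^{−αθ} from the proof of Lemma 4.1 is easy to check numerically. The following sketch (our illustration, with hypothetical parameter values) uses the exponential example above, for which β_α = λ/(λ + α) in closed form and A1 holds with ε = e^{−λθ}:

import math

alpha, lam, theta = 0.5, 2.0, 0.3
eps = math.exp(-lam * theta)                      # A1: F(theta) = 1 - e^{-lam*theta} <= 1 - eps
beta = lam / (lam + alpha)                        # closed form of (5) for exponential F
bound = 1 - eps + eps * math.exp(-alpha * theta)  # bound from the proof of Lemma 4.1
assert beta <= bound < 1                          # beta_alpha <= 1 - eps + eps*e^{-alpha*theta} < 1
print(f"beta_alpha = {beta:.4f}, bound = {bound:.4f}")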

Remark 4.2. By Lemma 1.11 in [13], if Assumption 2(a) holds, then the multifunctions A : X → 2^{P(A)} and B : X → 2^{P(B)} defined by A(x) := P(A(x)) and B(x) := P(B(x)) are measurable compact-valued multifunctions.

Assumption 3 (A3). (a) There exists a positive constant η, with ηρ_α < 1, such that for all (x, a, b) ∈ K,

    ∫ w(y) Q(dy | x, a, b) ≤ η w(x),

where w is the function in A2(d).

(b) For each t ≥ 0, F(t | x, a, b) is continuous on K.

We now introduce the following notation: for any given function h : K → R, x ∈ X, and probability measures μ ∈ A(x) and λ ∈ B(x), we write

    h(x, μ, λ) := ∫_{B(x)} ∫_{A(x)} h(x, a, b) μ(da) λ(db),

whenever the integrals are well defined. In particular,

    r(x, μ, λ) := ∫_{B(x)} ∫_{A(x)} r(x, a, b) μ(da) λ(db),

    β_α(x, μ, λ) := ∫_{B(x)} ∫_{A(x)} β_α(x, a, b) μ(da) λ(db),

    τ(x, μ, λ) := ∫_{B(x)} ∫_{A(x)} τ(x, a, b) μ(da) λ(db),

    Q(D | x, μ, λ) := ∫_{B(x)} ∫_{A(x)} Q(D | x, a, b) μ(da) λ(db), D ∈ B(X).

B_w(X) denotes the linear space of measurable functions u on X with finite w-norm, which is defined as

    ||u||_w := sup_{x∈X} |u(x)| / w(x).

For u ∈ B_w(X) and (x, a, b) ∈ K, we write

    H(u, x, a, b) := r(x, a, b) + β_α(x, a, b) ∫_X u(y) Q(dy | x, a, b).

For each function u ∈ B_w(X), let

    T_α u(x) := sup_{μ∈A(x)} inf_{λ∈B(x)} H(u, x, μ, λ),   (9)

which defines a function T_α u in B_w(X) (see Lemma 5.1 below). We call T_α the Shapley operator, and a function v ∈ B_w(X) is said to be a solution to the
Shapley equation if T_α v(x) = v(x) for all x ∈ X. In the proof of Lemma 5.1 we show that if Assumptions 1, 2 and 3 hold, then for each μ ∈ A(x), H(u, x, μ, ·) is l.s.c. on B(x), and for each λ ∈ B(x), H(u, x, ·, λ) is u.s.c. on A(x). Thus, by Theorem A.2.3 in [2], the supremum and the infimum in (9) are indeed attained. Hence we can write

    T_α u(x) := max_{μ∈A(x)} min_{λ∈B(x)} H(u, x, μ, λ).   (10)

We are now ready to state our first main result.

Theorem 4.3. If A1-A3 hold, then:

(a) The α-discounted semi-Markov game has a value V_α(·), which is the unique function in B_w(X) that satisfies the Shapley equation V_α(x) = T_α V_α(x); furthermore, there exists a pair of α-optimal stationary strategies.

(b) A pair of stationary strategies (f, g) ∈ Φ_1 × Φ_2 is α-optimal if and only if V(·, f, g) is a solution to the Shapley equation.

To ensure the existence of the value and of optimal strategies for both players under the average criterion, we introduce the following conditions.

Assumption 4 (A4). (a) There exists a positive constant M such that τ(x, a, b) ≤ M for all (x, a, b) ∈ K.

(b) The function τ is continuous on K.

The next two assumptions are used to guarantee that the state process is ergodic in a suitable sense.

Assumption 5 (A5). There exist a probability measure ν ∈ P(X), a positive number α < 1 and a measurable function φ : K → [0, 1] for which the following holds for all (x, a, b) ∈ K and D ∈ B(X):

(a) Q(D | x, a, b) ≥ φ(x, a, b) ν(D).

(b) ∫ w(y) Q(dy | x, a, b) ≤ α w(x) + φ(x, a, b) ||ν||_w, where w is the function defined in A2(d) and ||ν||_w := ∫ w(y) ν(dy).

(c) ∫ φ(x, f(x), g(x)) ν(dx) > 0 for each f ∈ Φ_1, g ∈ Φ_2.

Assumption 6 (A6). There is a constant b > 0 such that for each f ∈ Φ_1 and g ∈ Φ_2,

    ∫ φ(x, f(x), g(x)) ν(dx) > b,

with ν and φ as in Assumption 5.

We are now ready to state our second main result.

Theorem 4.4. Suppose that A1, A2 and A4-A6 hold. Then the average semi-Markov game has a constant value V(x) = ξ for all x ∈ X, and there exists a pair of average optimal stationary strategies.
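Theorem 4.3 also suggests a computational scheme: since T_α turns out to be a contraction on B_w(X) (Lemma 5.2 below), successive approximations u_{n+1} = T_α u_n converge to V_α, with ||u_n − V_α||_w ≤ (ηρ_α)^n ||u_0 − V_α||_w by the standard Banach fixed-point estimate. The sketch below is our illustration for a finite model, not part of the paper: nS, nA, nB, r, Q and beta are hypothetical, w ≡ 1, and each matrix game in (10) is solved by linear programming.

import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of the zero-sum matrix game M (row player maximizes):
    maximize v subject to sum_i p_i M[i, j] >= v for all j, p in the simplex."""
    m, n = M.shape
    c = np.r_[np.zeros(m), -1.0]                  # variables (p, v); minimize -v
    A_ub = np.c_[-M.T, np.ones(n)]                # v - sum_i p_i M[i, j] <= 0
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)  # sum_i p_i = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[-1]

rng = np.random.default_rng(0)
nS, nA, nB = 4, 3, 2
r = rng.normal(size=(nS, nA, nB))                 # payoff r(x, a, b)
Q = rng.random(size=(nS, nA, nB, nS))
Q /= Q.sum(axis=-1, keepdims=True)                # transition law Q(.|x, a, b)
beta = 0.9 * np.ones((nS, nA, nB))                # beta_alpha(x, a, b); here rho_alpha = 0.9

u = np.zeros(nS)
for _ in range(500):                              # value iteration with the Shapley operator (10)
    H = r + beta * np.einsum('xaby,y->xab', Q, u)
    u_new = np.array([matrix_game_value(H[x]) for x in range(nS)])
    if np.max(np.abs(u_new - u)) < 1e-8:
        break
    u = u_new
print("V_alpha =", u)

In this toy example the contraction modulus is ηρ_α = 0.9 (η = 1 because w ≡ 1), so roughly log(1/ε)/log(1/0.9) iterations suffice for accuracy ε.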

5 Proof of Theorem 4.3

First we prove some preliminary results.

Lemma 5.1. If A1-A3 hold, then for each u ∈ B_w(X), the function T_α u is in B_w(X), and

    T_α u(x) = min_{λ∈B(x)} max_{μ∈A(x)} H(u, x, μ, λ).   (11)

Moreover, there exist stationary strategies f ∈ Φ_1 and g ∈ Φ_2 such that

    T_α u(x) = H(u, x, f(x), g(x)) = max_{μ∈A(x)} H(u, x, μ, g(x)) = min_{λ∈B(x)} H(u, x, f(x), λ).   (12)

Proof. By Lemma 4.1 and A3(a), for u ∈ B_w(X) and (x, a, b) ∈ K we have

    |H(u, x, a, b)| ≤ m w(x) + ρ_α ||u||_w η w(x),   (13)

which, since T_α u is measurable, implies that T_α u is in B_w(X). On the other hand, by A2 and (13), the function x ↦ H(u, x, a, b) is in B_w(X) and H(u, x, ·, b) is u.s.c. on A(x). Then, for fixed λ ∈ B(x), by Fatou's Lemma, the function

    a ↦ ∫_{B(x)} H(u, x, a, b) λ(db)

is u.s.c. and bounded on A(x). Thus, since convergence in A(x) is weak convergence of probability measures, by a standard result on weak convergence (see [2]), the function μ ↦ H(u, x, μ, λ) is u.s.c. on A(x). Similarly, H(u, x, μ, ·) is l.s.c. on B(x). Moreover, H(u, x, μ, λ) is concave in μ and convex in λ. Thus, by Fan's Minimax Theorem [3] we obtain (11). The existence of stationary strategies f ∈ Φ_1 and g ∈ Φ_2 satisfying (12) follows from (11) and well-known measurable selection theorems (see for instance Lemma 4.3 in [13]).

We use below the following notation: for a pair of stationary strategies (f, g) ∈ Φ_1 × Φ_2, we define the operator T_fg on B_w(X) by

    T_fg u(x) := H(u, x, f(x), g(x)).

It is clear (see the proof of Lemma 5.1) that T_fg u belongs to B_w(X) for each u ∈ B_w(X).

Lemma 5.2. If A1-A3 hold, then T_α and T_fg are both contraction operators on B_w(X) with modulus ηρ_α < 1.

Proof. First we note that both operators are monotone; that is, if u, v ∈ B_w(X) and u(·) ≤ v(·), then for all x ∈ X,

    T_fg u(x) ≤ T_fg v(x).

Similarly, T_α u(x) ≤ T_α v(x) for all x ∈ X. Also, using (7) and A3, it is easy to see that for k ≥ 0,

    T_fg(u + kw)(x) ≤ T_fg u(x) + ρ_α η k w(x) for all x ∈ X,

and

    T_α(u + kw)(x) ≤ T_α u(x) + ρ_α η k w(x) for all x ∈ X.   (14)

Now, for u, v ∈ B_w(X), by (14), the monotonicity of T_α and the fact that u ≤ v + w ||u − v||_w, it follows that

    T_α u(x) ≤ T_α v(x) + ρ_α η ||u − v||_w w(x) for all x ∈ X,

so that

    T_α u(x) − T_α v(x) ≤ ρ_α η ||u − v||_w w(x) for all x ∈ X.   (15)

Interchanging u and v we obtain

    T_α v(x) − T_α u(x) ≤ ρ_α η ||u − v||_w w(x) for all x ∈ X,   (16)

and combining (15) and (16) we get

    |T_α u(x) − T_α v(x)| ≤ ρ_α η ||u − v||_w w(x) for all x ∈ X,

i.e.,

    ||T_α u − T_α v||_w ≤ ρ_α η ||u − v||_w.

Hence, T_α is a contraction operator with modulus ρ_α η. The same arguments prove that T_fg is a contraction operator with the same modulus ρ_α η.

Remark 5.3. Since T_α and T_fg are contraction operators on B_w(X), by Banach's Fixed Point Theorem there exist unique functions v* and v_fg in B_w(X) such that T_α v*(x) = v*(x) and T_fg v_fg(x) = v_fg(x) for all x ∈ X.

Lemma 5.4. For a pair of stationary strategies (f, g) ∈ Φ_1 × Φ_2, the function V(·, f, g) is the unique fixed point of T_fg in B_w(X).

Proof. We have to show that V(x, f, g) = T_fg V(x, f, g) for all x ∈ X. By (6),

    V(x, f, g) = E_x^{fg} {r(x_0, a_0, b_0) + Σ_{n=1}^∞ Π_{k=0}^{n−1} β_α(x_k, a_k, b_k) r(x_n, a_n, b_n)}
        = r(x, f(x), g(x)) + E_x^{fg} {Σ_{n=1}^∞ Π_{k=0}^{n−1} β_α(x_k, a_k, b_k) r(x_n, a_n, b_n)}
        = r(x, f(x), g(x)) + E_x^{fg} {β_α(x_0, a_0, b_0) E_x^{fg} [r(x_1, a_1, b_1) + Σ_{n=2}^∞ Π_{k=1}^{n−1} β_α(x_k, a_k, b_k) r(x_n, a_n, b_n) | h_1]}
        = r(x, f(x), g(x)) + E_x^{fg} {β_α(x_0, a_0, b_0) V(x_1, f, g)}
        = T_fg V(x, f, g).

Thus, V(·, f, g) is the fixed point of T_fg.

Lemma 5.5. Suppose that A1-A3 hold, and let π and γ be arbitrary strategies for players 1 and 2, respectively. Then for each x ∈ X and n = 0, 1, ...:

(a) E_x^{πγ} w(x_n) ≤ η^n w(x);

(b) E_x^{πγ} |r(x_n, a_n, b_n)| ≤ m η^n w(x);

(c) lim_{n→∞} E_x^{πγ} [Π_{k=0}^{n−1} β_α(x_k, a_k, b_k) u(x_n)] = 0 for each u ∈ B_w(X).

Proof. For n = 0, (a) and (b) are trivially satisfied. Now, if n ≥ 1 then, by A3(a),

    E_x^{πγ} [w(x_n) | h_{n−1}, a_{n−1}, b_{n−1}] = ∫ w(y) Q(dy | x_{n−1}, a_{n−1}, b_{n−1}) ≤ η w(x_{n−1}).

Hence E_x^{πγ} w(x_n) ≤ η E_x^{πγ} w(x_{n−1}), which by iteration yields (a). Part (b) follows immediately from (a) and the growth condition A2(d). To prove (c), observe that Lemma 4.1 and (a) yield

    |E_x^{πγ} [Π_{k=0}^{n−1} β_α(x_k, a_k, b_k) u(x_n)]| ≤ ρ_α^n E_x^{πγ} |u(x_n)| ≤ ρ_α^n ||u||_w E_x^{πγ} w(x_n) ≤ (ρ_α η)^n ||u||_w w(x).

This yields (c), since ρ_α η < 1.

Proof of Theorem 4.3. (a) Let v* be the unique fixed point of T_α in B_w(X); that is, by (10),

    v*(x) = T_α v*(x) = max_{μ∈A(x)} min_{λ∈B(x)} H(v*, x, μ, λ).

By Lemma 5.1 there exists a pair of stationary strategies (f*, g*) ∈ Φ_1 × Φ_2 such that

    v*(x) = H(v*, x, f*(x), g*(x))   (17)
        = min_{λ∈B(x)} H(v*, x, f*(x), λ)
        = max_{μ∈A(x)} H(v*, x, μ, g*(x)).

We will next prove that v* is the value of the semi-Markov game and that (f*, g*) is an α-optimal pair of strategies. The first equality in (17) implies that v* is the fixed point of T_{f*g*} in B_w(X) (see Remark 5.3). Thus, by Lemma 5.4, v*(·) = V(·, f*, g*), so it is enough to show that for arbitrary π ∈ Π and γ ∈ Γ,

    V(x, f*, γ) ≥ V(x, f*, g*) ≥ V(x, π, g*) for all x ∈ X.   (18)

We prove the second inequality in (18); a similar argument applies to the first. By (6) we have

    V(x, π, g*) = E_x^{πg*} {r(x_0, a_0, b_0) + Σ_{n=1}^∞ Π_{k=0}^{n−1} β_α(x_k, a_k, b_k) r(x_n, a_n, b_n)}.

From properties of the conditional expectation we have, for n ≥ 1 and h_n ∈ H_n,

    E_x^{πg*} [Π_{k=0}^n β_α(x_k, a_k, b_k) V(x_{n+1}, f*, g*) | h_n]
        = Π_{k=0}^{n−1} β_α(x_k, a_k, b_k) E_x^{πg*} [β_α(x_n, a_n, b_n) V(x_{n+1}, f*, g*) | h_n]
        = Π_{k=0}^{n−1} β_α(x_k, a_k, b_k) {H(V(·, f*, g*), x_n, π_n(h_n), g*(x_n)) − r(x_n, π_n(h_n), g*(x_n))}
        ≤ Π_{k=0}^{n−1} β_α(x_k, a_k, b_k) [V(x_n, f*, g*) − r(x_n, π_n(h_n), g*(x_n))],

where the inequality holds because, by (17), H(V(·, f*, g*), x_n, μ, g*(x_n)) ≤ V(x_n, f*, g*) for every μ ∈ A(x_n). Equivalently, for n ≥ 1,

    Π_{k=0}^{n−1} β_α(x_k, a_k, b_k) V(x_n, f*, g*)
        ≥ E_x^{πg*} [Π_{k=0}^n β_α(x_k, a_k, b_k) V(x_{n+1}, f*, g*) | h_n] + Π_{k=0}^{n−1} β_α(x_k, a_k, b_k) r(x_n, π_n(h_n), g*(x_n)).

We also have

    V(x_0, f*, g*) − E_x^{πg*} [β_α(x_0, a_0, b_0) V(x_1, f*, g*)] ≥ r(x_0, π_0(x_0), g*(x_0)).

Now, taking expectations and summing over n = 0, 1, ..., N, we obtain

    V(x, f*, g*) ≥ E_x^{πg*} [Π_{k=0}^N β_α(x_k, a_k, b_k) V(x_{N+1}, f*, g*)]
        + E_x^{πg*} [r(x_0, a_0, b_0) + Σ_{n=1}^N Π_{k=0}^{n−1} β_α(x_k, a_k, b_k) r(x_n, a_n, b_n)].

Finally, letting N → ∞, by (6) and Lemma 5.5(c) we obtain the required result.

(b) (⇒) Suppose that (f, g) ∈ Φ_1 × Φ_2 is a pair of α-optimal stationary strategies. Then for all x ∈ X, π ∈ Π and γ ∈ Γ,

    V(x, f, γ) ≥ V(x, f, g) ≥ V(x, π, g).   (19)

Fix x ∈ X and, for an arbitrary λ ∈ B(x), define γ̂ = {γ̂_n} as follows: γ̂_0 := λ and γ̂_n := g for n = 1, 2, .... Then, by the first inequality in (19),

    V(x, f, g) ≤ V(x, f, γ̂) = ∫_{B(x)} ∫_{A(x)} [r(x, a, b) + β_α(x, a, b) ∫ V(y, f, g) Q(dy | x, a, b)] f(x)(da) λ(db).

It follows that

    V(x, f, g) ≤ H(V(·, f, g), x, f(x), λ),

from which we get

    V(x, f, g) ≤ T_α V(x, f, g).

Similarly, we can prove

    V(x, f, g) ≥ T_α V(x, f, g),

and combining the last two inequalities we get the desired result.

(⇐) The proof of this part is contained in the proof of part (a).

6 Proof of Theorem 4.4

The proof uses a data transformation introduced in [17] and applied by several authors to semi-Markov control models (see e.g. [4], [5], [11]) and to semi-Markov games [10].

Definition 6.1. Let τ̄ be a real number such that 0 < τ̄ < inf_K τ(x, a, b) (see Lemma 4.1). Define the function r̂ : K → R and the stochastic kernel Q̂ on X given K by

    r̂(x, a, b) := r(x, a, b) / τ(x, a, b)   (20)

and

    Q̂(B | x, a, b) := (τ̄ / τ(x, a, b)) Q(B | x, a, b) + (1 − τ̄ / τ(x, a, b)) δ_x(B),   (21)

where δ_x(·) is the Dirac measure concentrated at x. We refer to (X, A, B, K_A, K_B, Q̂, r̂) as the Markov game model associated to the semi-Markov game (1).

We now state some lemmas which essentially show that the associated game model satisfies Assumptions A2, A5 and A6. We omit the proofs of Lemmas 6.2, 6.3 and 6.4 because they are similar to the proofs of Lemmas 5.2, 5.3 and 5.4 in [11], respectively.

Lemma 6.2. If A1, A2 and A4 hold, then parts (b), (c) and (d) of A2 are satisfied when r and Q are replaced by r̂ and Q̂, respectively.

Lemma 6.3. Suppose that A1, A4 and A5 hold, and let φ̂(x, a, b) := [τ̄ / τ(x, a, b)] φ(x, a, b). Then for all (x, a, b) ∈ K and D ∈ B(X):

(a) Q̂(D | x, a, b) ≥ φ̂(x, a, b) ν(D).

(b) ∫ w(y) Q̂(dy | x, a, b) ≤ α̂ w(x) + φ̂(x, a, b) ||ν||_w, where 0 < α̂ < 1 is the constant α̂ := 1 − (τ̄/M)(1 − α).

(c) ∫ φ̂(x, f(x), g(x)) ν(dx) > 0 for each f ∈ Φ_1, g ∈ Φ_2.
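In a finite model the transformation (20)-(21) is a few lines of code. The sketch below (ours, with the same hypothetical array layout as in the earlier sketches) makes the construction explicit:

import numpy as np

def transform(r, Q, tau, tau_bar):
    """Data transformation (20)-(21): r_hat = r/tau and
    Q_hat = (tau_bar/tau) Q + (1 - tau_bar/tau) delta_x, where
    0 < tau_bar < tau.min() as in Definition 6.1."""
    assert 0 < tau_bar < tau.min()
    r_hat = r / tau                                  # (20)
    c = (tau_bar / tau)[..., None]                   # weight tau_bar/tau(x,a,b)
    eye = np.eye(Q.shape[-1])                        # Dirac kernels delta_x
    Q_hat = c * Q + (1 - c) * eye[:, None, None, :]  # (21)
    return r_hat, Q_hat

A quick sanity check: Q_hat.sum(axis=-1) is identically 1, so Q̂ is again a stochastic kernel, and states with long mean sojourn times τ(x, a, b) acquire a correspondingly large self-loop probability 1 − τ̄/τ(x, a, b).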

Lemma 6.4. If A1, A4 and A6 hold, then for each f ∈ Φ_1 and g ∈ Φ_2,

    ∫ φ̂(x, f(x), g(x)) ν(dx) > τ̄ b / M.

Remark 6.5. If A1, A4 and A5 hold, then for each pair (f, g) ∈ Φ_1 × Φ_2:

(a) The Markov chain with transition probability Q̂(· | x, f(x), g(x)) is ν-irreducible and positive Harris recurrent; in particular, it admits a unique invariant probability measure q̂(f, g)(·).

(b) If in addition A6 holds, then the Markov chain {x_n} is w-geometrically ergodic; that is, there exist positive constants ζ < 1 and L such that

    |∫ u(y) Q̂^n(dy | x, f(x), g(x)) − ∫ u(y) q̂(f, g)(dy)| ≤ w(x) ||u||_w L ζ^n   (22)

for every u ∈ B_w(X), x ∈ X and n = 1, 2, ..., where Q̂^n denotes the n-step Markov transition probability.

(c) For x ∈ X, π ∈ Π, γ ∈ Γ and u ∈ B_w(X),

    lim_{n→∞} (1/n) E_x^{πγ} u(x_n) = 0.   (23)

Part (a) follows from Lemma 3.3 in [19]. Part (b) comes from (a) and Lemmas 3.3 and 3.4 in [6], while part (c) follows from Lemma 4.3(c) in [11].

Lemma 6.6. If Assumptions A1, A2 and A4-A6 hold, then there exist a real number ξ, probability measures μ_x ∈ P(A(x)) and λ_x ∈ P(B(x)) for each x ∈ X, and functions h*, h_* ∈ B_w(X) such that

    ξ + h*(x) ≥ max_{μ∈A(x)} {r̂(x, μ, λ_x) + ∫ h*(y) Q̂(dy | x, μ, λ_x)}   (24)

and

    ξ + h_*(x) ≤ min_{λ∈B(x)} {r̂(x, μ_x, λ) + ∫ h_*(y) Q̂(dy | x, μ_x, λ)}.   (25)

The proof of Lemma 6.6 is contained in the proof of Theorem 3 in [15]. The hypotheses C5 and C6 used in [15] are somewhat different from our A5 and A6, but the key tool used to obtain (24) and (25) is the w-geometric ergodicity (22).

For u ∈ B_w(X) and (x, a, b) ∈ K, define

    G(u, x, a, b) := r(x, a, b) − ξ τ(x, a, b) + ∫ u(y) Q(dy | x, a, b),

where ξ is the constant in Lemma 6.6. For u ∈ B_w(X), let

    T u(x) := sup_{μ∈A(x)} inf_{λ∈B(x)} G(u, x, μ, λ).   (26)

Then, using arguments similar to those in the proof of Lemma 5.1, we can show the following.

Lemma 6.7. Suppose that the assumptions of Theorem 4.4 are satisfied. Then for each u ∈ B_w(X), the function Tu is in B_w(X), the supremum and the infimum in (26) are attained, and

    T u(x) = max_{μ∈A(x)} min_{λ∈B(x)} G(u, x, μ, λ) = min_{λ∈B(x)} max_{μ∈A(x)} G(u, x, μ, λ).

Moreover, there exist stationary strategies f ∈ Φ_1 and g ∈ Φ_2 such that for all x ∈ X,

    T u(x) = G(u, x, f(x), g(x)) = max_{μ∈A(x)} G(u, x, μ, g(x)) = min_{λ∈B(x)} G(u, x, f(x), λ).

Proof of Theorem 4.4. Fix x ∈ X. From (24) it follows that for each a ∈ A(x),

    ξ + h*(x) ≥ r̂(x, a, λ_x) + ∫ h*(y) Q̂(dy | x, a, λ_x) ≥ inf_{b∈B(x)} {r̂(x, a, b) + ∫ h*(y) Q̂(dy | x, a, b)}.

Then, by A2(a) and Lemma 6.2, for each a ∈ A(x) there exists b_a ∈ B(x) such that

    ξ + h*(x) ≥ r̂(x, a, b_a) + ∫ h*(y) Q̂(dy | x, a, b_a).

Applying the transformation (20)-(21) we obtain

    τ̄ h*(x) ≥ r(x, a, b_a) − ξ τ(x, a, b_a) + ∫ τ̄ h*(y) Q(dy | x, a, b_a).   (27)

Since (27) holds for each a ∈ A(x), setting ω*(·) := τ̄ h*(·) we get

    ω*(x) ≥ min_{b∈B(x)} max_{a∈A(x)} {r(x, a, b) − ξ τ(x, a, b) + ∫ ω*(y) Q(dy | x, a, b)},

which in turn yields

    ω*(x) ≥ min_{λ∈B(x)} max_{μ∈A(x)} {r(x, μ, λ) − ξ τ(x, μ, λ) + ∫ ω*(y) Q(dy | x, μ, λ)}.   (28)

Thus, by Lemma 6.7, there is g* ∈ Φ_2 such that

    ω*(x) ≥ max_{μ∈A(x)} {r(x, μ, g*(x)) − ξ τ(x, μ, g*(x)) + ∫ ω*(y) Q(dy | x, μ, g*(x))}.   (29)

Then, by standard dynamic programming arguments based on (29) and (23) (see [1], [4], [5], [7], [8], [9]), we have ξ ≥ J(x, π, g*) for each π ∈ Π. Thus we get

    ξ ≥ sup_{π∈Π} J(x, π, g*) ≥ U(x).   (30)

Similarly, from (25) we can show the existence of a stationary strategy f* ∈ Φ_1 such that

    ξ ≤ inf_{γ∈Γ} J(x, f*, γ) ≤ L(x).   (31)

Combining (30) and (31), we conclude that the game has a value ξ, which is independent of the initial state x, and that (f*, g*) ∈ Φ_1 × Φ_2 is an average optimal pair of stationary strategies.

References

[1] E. Altman, A. Hordijk and F.M. Spieksma, Contraction conditions for average and α-discounted optimality in countable state Markov games with unbounded rewards. Math. Oper. Res. 22 (1997).

[2] R.B. Ash and C.A. Doléans-Dade, Probability and Measure Theory. Academic Press, New York.

[3] K. Fan, Minimax theorems. Proc. Nat. Acad. Sci. USA 39 (1953).

[4] A. Federgruen, P.J. Schweitzer and H.C. Tijms, Denumerable undiscounted semi-Markov decision processes with unbounded rewards. Math. Oper. Res. 8 (1983).

[5] A. Federgruen and H.C. Tijms, The optimality equation in average cost denumerable state semi-Markov decision problems. Recurrence conditions and algorithms. J. Appl. Probab. 15 (1978).

[6] E. Gordienko and O. Hernández-Lerma, Average cost Markov control processes with weighted norms: existence of canonical policies. Appl. Math. (Warsaw) 23 (1995).

[7] O. Hernández-Lerma and J.B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer-Verlag, New York.

[8] O. Hernández-Lerma and J.B. Lasserre, Zero-sum stochastic games in Borel spaces: average payoff criteria. SIAM J. Control Optim. 39 (2001).

[9] O. Hernández-Lerma and O. Vega-Amaya, Infinite-horizon Markov control processes with undiscounted cost criteria: from average to overtaking optimality. Appl. Math. (Warsaw) 25 (1998).

[10] A.K. Lal and S. Sinha, Zero-sum two-person semi-Markov games. J. Appl. Prob. 29 (1992).

[11] F. Luque-Vásquez and O. Hernández-Lerma, Semi-Markov control models with average cost. Appl. Math. (Warsaw) 26 (1999).

[12] A. Maitra and T. Parthasarathy, On stochastic games. J. Optim. Theory Appl. 5 (1970).

[13] A.S. Nowak, On zero-sum stochastic games with general state space I. Probab. Math. Statist. 4 (1984).

[14] A.S. Nowak, Measurable selection theorems for minimax stochastic optimization problems. SIAM J. Control Optim. 23 (1985).

[15] A.S. Nowak, Optimal strategies in a class of zero-sum ergodic stochastic games. Math. Meth. Oper. Res. 50 (1999).

[16] F. Ramírez-Reyes, Existence of optimal strategies for zero-sum stochastic games with discounted payoff. Morfismos 5 (2001).

[17] P.J. Schweitzer, Iterative solution of the functional equations of undiscounted Markov renewal programming. J. Math. Anal. Appl. 34 (1971).

[18] L.I. Sennott, Zero-sum stochastic games with unbounded costs: discounted and average cost cases. Z. Oper. Res. 39 (1994).

[19] O. Vega-Amaya, The average cost optimality equation: a fixed point approach. Syst. Control Lett., to appear.


More information

Lecture 5: The Bellman Equation

Lecture 5: The Bellman Equation Lecture 5: The Bellman Equation Florian Scheuer 1 Plan Prove properties of the Bellman equation (In particular, existence and uniqueness of solution) Use this to prove properties of the solution Think

More information

Zdzis law Brzeźniak and Tomasz Zastawniak

Zdzis law Brzeźniak and Tomasz Zastawniak Basic Stochastic Processes by Zdzis law Brzeźniak and Tomasz Zastawniak Springer-Verlag, London 1999 Corrections in the 2nd printing Version: 21 May 2005 Page and line numbers refer to the 2nd printing

More information

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true 3 ohn Nirenberg inequality, Part I A function ϕ L () belongs to the space BMO() if sup ϕ(s) ϕ I I I < for all subintervals I If the same is true for the dyadic subintervals I D only, we will write ϕ BMO

More information

WHY SATURATED PROBABILITY SPACES ARE NECESSARY

WHY SATURATED PROBABILITY SPACES ARE NECESSARY WHY SATURATED PROBABILITY SPACES ARE NECESSARY H. JEROME KEISLER AND YENENG SUN Abstract. An atomless probability space (Ω, A, P ) is said to have the saturation property for a probability measure µ on

More information

Notes on Measure, Probability and Stochastic Processes. João Lopes Dias

Notes on Measure, Probability and Stochastic Processes. João Lopes Dias Notes on Measure, Probability and Stochastic Processes João Lopes Dias Departamento de Matemática, ISEG, Universidade de Lisboa, Rua do Quelhas 6, 1200-781 Lisboa, Portugal E-mail address: jldias@iseg.ulisboa.pt

More information

AN EFFECTIVE METRIC ON C(H, K) WITH NORMAL STRUCTURE. Mona Nabiei (Received 23 June, 2015)

AN EFFECTIVE METRIC ON C(H, K) WITH NORMAL STRUCTURE. Mona Nabiei (Received 23 June, 2015) NEW ZEALAND JOURNAL OF MATHEMATICS Volume 46 (2016), 53-64 AN EFFECTIVE METRIC ON C(H, K) WITH NORMAL STRUCTURE Mona Nabiei (Received 23 June, 2015) Abstract. This study first defines a new metric with

More information

FENCHEL DUALITY, FITZPATRICK FUNCTIONS AND MAXIMAL MONOTONICITY S. SIMONS AND C. ZĂLINESCU

FENCHEL DUALITY, FITZPATRICK FUNCTIONS AND MAXIMAL MONOTONICITY S. SIMONS AND C. ZĂLINESCU FENCHEL DUALITY, FITZPATRICK FUNCTIONS AND MAXIMAL MONOTONICITY S. SIMONS AND C. ZĂLINESCU This paper is dedicated to Simon Fitzpatrick, in recognition of his amazing insights ABSTRACT. We show in this

More information

Lecture notes for Analysis of Algorithms : Markov decision processes

Lecture notes for Analysis of Algorithms : Markov decision processes Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with

More information

Proofs for Large Sample Properties of Generalized Method of Moments Estimators

Proofs for Large Sample Properties of Generalized Method of Moments Estimators Proofs for Large Sample Properties of Generalized Method of Moments Estimators Lars Peter Hansen University of Chicago March 8, 2012 1 Introduction Econometrica did not publish many of the proofs in my

More information

HOPF S DECOMPOSITION AND RECURRENT SEMIGROUPS. Josef Teichmann

HOPF S DECOMPOSITION AND RECURRENT SEMIGROUPS. Josef Teichmann HOPF S DECOMPOSITION AND RECURRENT SEMIGROUPS Josef Teichmann Abstract. Some results of ergodic theory are generalized in the setting of Banach lattices, namely Hopf s maximal ergodic inequality and the

More information

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction J. Korean Math. Soc. 38 (2001), No. 3, pp. 683 695 ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE Sangho Kum and Gue Myung Lee Abstract. In this paper we are concerned with theoretical properties

More information

REAL VARIABLES: PROBLEM SET 1. = x limsup E k

REAL VARIABLES: PROBLEM SET 1. = x limsup E k REAL VARIABLES: PROBLEM SET 1 BEN ELDER 1. Problem 1.1a First let s prove that limsup E k consists of those points which belong to infinitely many E k. From equation 1.1: limsup E k = E k For limsup E

More information

Could Nash equilibria exist if the payoff functions are not quasi-concave?

Could Nash equilibria exist if the payoff functions are not quasi-concave? Could Nash equilibria exist if the payoff functions are not quasi-concave? (Very preliminary version) Bich philippe Abstract In a recent but well known paper (see [11]), Reny has proved the existence of

More information