Zero-Sum Average Semi-Markov Games: Fixed Point Solutions of the Shapley Equation
Oscar Vega-Amaya
Departamento de Matemáticas, Universidad de Sonora
May 2002

Abstract. This paper deals with zero-sum average semi-Markov games with Borel state and action spaces and with unbounded payoffs and mean holding times. A solution of the Shapley equation is obtained via the Banach Fixed Point Theorem, assuming that the model satisfies a Lyapunov-like condition and a growth hypothesis on the payoff function and the mean holding time, besides standard continuity and compactness requirements.

Key words. Zero-sum semi-Markov games, average payoff criterion, Lyapunov conditions, fixed-point approach.

AMS subject classification. 90D10, 90D20, 93E05.

1 Introduction

Several recent papers have used variants of a Lyapunov-like condition to solve average payoff optimization problems for Markovian systems with unbounded payoffs and Borel state and action spaces (see, e.g., [9], [13], [14] for Markov models; [15], [20], [28] for semi-Markov models; [11], [16], [23] for zero-sum Markov games; and [17] for zero-sum semi-Markov games). The key property used in all these papers is that the imposed Lyapunov condition yields the so-called weighted geometric ergodicity (WGE) property, a generalization of standard uniform geometric ergodicity in Markov chain theory (see [10], [12] and [21] for a detailed discussion of these concepts). Roughly speaking, in these papers the WGE property is combined, explicitly or implicitly, either with the vanishing discount factor approach or with some variant of the policy iteration algorithm to prove the main results. These facts mark the first main difference with the present paper: in spite of imposing a similar stability condition, we use instead a fixed-point approach which does not rely, at least explicitly, on the WGE property.

This research was supported by CONACyT (México) under Grant E.
The fixed-point approach allows us to obtain the Shapley equation directly, which in turn yields the existence of a stationary optimal strategy pair or saddle point (see Theorem 4.7(a) and (b)). In contrast, the approaches followed in [11], [16], [23] first show the existence of a stationary saddle point and then establish the Shapley equation. On the other hand, [20], [15] and [17] resort to auxiliary models related to the original one; more precisely, [20] uses the so-called Schweitzer data transformation [26], while the analysis in [15] and [17] relies on certain perturbed models.

A second key difference concerns the times between two consecutive decision epochs. In contrast with discrete-time Markov control processes and Markov games, the decision epochs in semi-Markov control processes are random; thus it is necessary to ensure that such processes experience only finitely many transitions in each finite time period. This is usually done by assuming that the mean holding time function is bounded below by a positive constant, even in the discrete state space case (see, e.g., [2], [5], [19], [24] and their references). In particular, this condition plays a crucial role in the approaches followed in [28], [15], [17] and [20]; in fact, the three latter references also assume that the mean holding time function is bounded above by a constant, while in the present paper it is only assumed that this function is positive.

It is important to mention that, as a by-product, the fixed-point approach yields a minimax characterization of a certain solution of the Shapley equation (Theorem 4.7(c)) which, seemingly, has not been previously discussed in the literature on zero-sum stochastic games. We should also mention that the fixed-point approach has been used in several early papers (see, e.g., [7], [12], [18], [25]), but under much stronger ergodicity conditions which, in particular, exclude the case of unbounded payoffs.
The variant of the Lyapunov condition we consider here was recently introduced in [27] for Markov control processes and used in [8] to study minimax problems. In fact, the present paper extends to zero-sum semi-Markov games the results of the two latter references. For brief surveys of the existing literature on stochastic games with finite or denumerable state space the reader can consult [1], [3], [6], [7] and [19].

The remainder of the paper is organized as follows. The semi-Markov game model and the (ratio) expected average payoff criterion are introduced in Sections 2 and 3, respectively. The assumptions and main results are stated in Section 4. The proofs of all results are given in Sections 5 and 6.

2 The Game Model

Throughout the paper we shall use the following notation. Given a Borel space $S$ (that is, a Borel subset of a complete separable metric space), $\mathcal{B}(S)$ denotes its Borel $\sigma$-algebra, and measurability always means measurability with respect to $\mathcal{B}(S)$. The class of all probability measures on $S$ is denoted by $\mathbb{P}(S)$. Given two Borel spaces $S$ and $S'$, a stochastic kernel $\varphi(\cdot|\cdot)$ on $S$ given $S'$ is a function such that $\varphi(\cdot|s)$ is in $\mathbb{P}(S)$ for each $s \in S'$, and $\varphi(B|\cdot)$ is a measurable
function on $S'$ for each $B \in \mathcal{B}(S)$. Moreover, $\mathbb{R}_+$ stands for the set of nonnegative real numbers and $\mathbb{N}$ (resp. $\mathbb{N}_0$) denotes the set of positive (resp. nonnegative) integers.

The semi-Markov game model. This paper is concerned with a zero-sum semi-Markov game modeled by

$$(X, A, B, K_A, K_B, Q, F, r),$$

where $X$ is the state space and the sets $A$ and $B$ are the control spaces for players 1 and 2, respectively. It is assumed that all these sets are Borel spaces. The constraint sets $K_A$ and $K_B$ are Borel subsets of $X \times A$ and $X \times B$, respectively. Thus, for each $x \in X$, the $x$-sections

$$A(x) := \{a \in A : (x,a) \in K_A\}, \qquad B(x) := \{b \in B : (x,b) \in K_B\}$$

stand for the sets of admissible actions or controls for players 1 and 2, respectively. Now let

$$K := \{(x,a,b) : x \in X,\ a \in A(x),\ b \in B(x)\},$$

which, by [22], is a Borel subset of $X \times A \times B$. The transition law $Q(\cdot|\cdot)$ of the system is a stochastic kernel on $X$ given $K$. For each $(x,a,b,y) \in K \times X$, $F(\cdot|x,a,b,y)$ is a distribution function on $\mathbb{R}_+ := [0,+\infty)$, and $F(t|\cdot)$ is a measurable function on $K \times X$ for each $t \in \mathbb{R}_+$. Finally, the payoff $r$ is a measurable function on $K \times \mathbb{R}_+$.

The game is played over an infinite horizon as follows: at time $t = 0$ the game is observed in some state $x_0 = x$ and the players independently choose controls $a_0 = a \in A(x_0)$ and $b_0 = b \in B(x_0)$. Then the system remains in state $x_0 = x$ for a nonnegative random time $\delta_1$ and player 1 receives the amount $r(x,a,b,\delta_1)$ from player 2. At time $\delta_1$ the system jumps to a new state $x_1 = x'$ according to the probability measure $Q(\cdot|x,a,b)$. The distribution of the random variable $\delta_1$, given that the system has jumped into state $x'$, is $F(\cdot|x,a,b,x')$; that is,

$$F(t|x,a,b,x') = \Pr\left[\delta_1 \le t \mid x_0 = x,\ a_0 = a,\ b_0 = b,\ x_1 = x'\right], \qquad t \in \mathbb{R}_+.$$

Thus, given that $x_0 = x$, $a_0 = a$ and $b_0 = b$, the distribution of $\delta_1$ is

$$G(t|x,a,b) := \int_X F(t|x,a,b,y)\,Q(dy|x,a,b), \qquad t \in \mathbb{R}_+,\ (x,a,b) \in K,$$

and it is called the holding time distribution.
Immediately after the transition occurs, the players again choose controls, say $a_1 = a' \in A(x')$ and $b_1 = b' \in B(x')$, and the above process is repeated over and over again.
This procedure yields a stochastic process $\{(x_n, a_n, b_n, \delta_{n+1})\}$ where, for each $n \in \mathbb{N}_0$, $x_n$ is the state of the system, $a_n$ and $b_n$ are the control variables for players 1 and 2, respectively, and $\delta_{n+1}$ is the holding time at state $x_n$. The goal of player 1 (resp. player 2) is to maximize (resp. minimize) his/her flow of rewards (resp. costs)

$$r(x_0,a_0,b_0,\delta_1),\ r(x_1,a_1,b_1,\delta_2),\ \dots$$

over an infinite horizon using the expected average reward (cost) criterion defined by (5) below. The functions on $K$ given by

$$\tau(x,a,b) := \int_0^{+\infty} t\,G(dt|x,a,b) \tag{1}$$

$$R(x,a,b) := \int_0^{+\infty} r(x,a,b,t)\,G(dt|x,a,b) \tag{2}$$

are called the mean holding time and the mean payoff, respectively.

Strategies. Let $H_0 := X$ and $H_n := K \times \mathbb{R}_+ \times H_{n-1}$ for $n \in \mathbb{N}$. Then, for each $n \in \mathbb{N}_0$, a generic element of $H_n$ is denoted by

$$h_n := (x_0, a_0, b_0, \delta_1, \dots, x_{n-1}, a_{n-1}, b_{n-1}, \delta_n, x_n),$$

which can be thought of as the history of the game up to the time of the $n$th transition

$$T_n := T_{n-1} + \delta_n, \qquad n \in \mathbb{N}, \tag{3}$$

where $T_0 := 0$. Thus a strategy for player 1 is a sequence $\pi^1 = \{\pi^1_n\}$ of stochastic kernels $\pi^1_n$ on $A$ given $H_n$ satisfying the constraint

$$\pi^1_n(A(x_n)|h_n) = 1 \qquad \forall\, h_n \in H_n,\ n \in \mathbb{N}_0.$$

The class of all strategies for player 1 is denoted by $\Pi^1$. For each $x \in X$, let $\mathbb{A}(x) := \mathbb{P}(A(x))$ and denote by $\Phi^1$ the class of all stochastic kernels $\varphi^1$ on $A$ given $X$ such that $\varphi^1(\cdot|x) \in \mathbb{A}(x)$ for all $x \in X$. A strategy $\pi^1$ is called stationary if

$$\pi^1_n(\cdot|h_n) = \varphi^1(\cdot|x_n) \qquad \forall\, h_n \in H_n,\ n \in \mathbb{N}_0,$$

for some stochastic kernel $\varphi^1 \in \Phi^1$. Following a standard convention, $\Phi^1$ is identified with the class of stationary strategies for player 1. The sets of strategies $\Pi^2$ and $\Phi^2$ for player 2 are defined in a similar way, writing $B(x)$ and $\mathbb{B}(x)$ instead of $A(x)$ and $\mathbb{A}(x)$, respectively.
Let $(\Omega, \mathcal{F})$ be the (canonical) measurable space consisting of the sample space $\Omega := (K \times \mathbb{R}_+)^{\infty}$ and its product $\sigma$-algebra. Then, for each strategy pair $(\pi^1,\pi^2) \in \Pi^1 \times \Pi^2$ and each initial state $x \in X$, there exists a probability measure $P^{\pi^1,\pi^2}_x$ defined on $(\Omega,\mathcal{F})$ which governs the evolution of the stochastic process $\{(x_n,a_n,b_n,\delta_{n+1})\}$. The expectation operator with respect to the probability measure $P^{\pi^1,\pi^2}_x$ is denoted by $E^{\pi^1,\pi^2}_x$.

Throughout the paper we shall use the following notation: for a measurable function $u$ on $K$ and a stationary strategy pair $(\varphi^1,\varphi^2) \in \Phi^1 \times \Phi^2$, let

$$u_{\varphi^1,\varphi^2}(x) := \int_{B(x)} \int_{A(x)} u(x,a,b)\,\varphi^1(da|x)\,\varphi^2(db|x), \qquad x \in X. \tag{4}$$

Thus, in particular, we shall write

$$R_{\varphi^1,\varphi^2}(x) := \int_{B(x)} \int_{A(x)} R(x,a,b)\,\varphi^1(da|x)\,\varphi^2(db|x),$$

$$\tau_{\varphi^1,\varphi^2}(x) := \int_{B(x)} \int_{A(x)} \tau(x,a,b)\,\varphi^1(da|x)\,\varphi^2(db|x),$$

and, similarly,

$$Q_{\varphi^1,\varphi^2}(\cdot|x) := \int_{B(x)} \int_{A(x)} Q(\cdot|x,a,b)\,\varphi^1(da|x)\,\varphi^2(db|x)$$

for all $x \in X$.

If the players use a stationary strategy pair, say $(\varphi^1,\varphi^2)$, then the state process $\{x_n\}$ is a Markov chain with transition probability $Q_{\varphi^1,\varphi^2}(\cdot|\cdot)$. In this case, the $n$-step transition probability is denoted by $Q^n_{\varphi^1,\varphi^2}(\cdot|\cdot)$ for each $n \in \mathbb{N}_0$, where $Q^0_{\varphi^1,\varphi^2}(\cdot|x)$ is the Dirac measure at $x$. Thus, for each $u \in B_W(X)$,

$$Q^n_{\varphi^1,\varphi^2}u(x) := \int_X u(y)\,Q^n_{\varphi^1,\varphi^2}(dy|x) = E^{\varphi^1,\varphi^2}_x u(x_n), \qquad x \in X,\ n \in \mathbb{N}_0.$$

3 The expected average payoff criterion

The (ratio) expected average payoff (EAP) for the strategy pair $(\pi^1,\pi^2) \in \Pi^1 \times \Pi^2$, given the initial state $x_0 = x$, is defined as

$$J(\pi^1,\pi^2,x) := \liminf_{n\to\infty} \frac{E^{\pi^1,\pi^2}_x \sum_{k=0}^{n-1} r(x_k,a_k,b_k,\delta_{k+1})}{E^{\pi^1,\pi^2}_x T_n}. \tag{5}$$

It is easy to verify, using properties of conditional expectation, that
$$E^{\pi^1,\pi^2}_x \delta_{k+1} = E^{\pi^1,\pi^2}_x \tau(x_k,a_k,b_k)$$

and also that

$$E^{\pi^1,\pi^2}_x r(x_k,a_k,b_k,\delta_{k+1}) = E^{\pi^1,\pi^2}_x R(x_k,a_k,b_k)$$

for all $x \in X$, $(\pi^1,\pi^2) \in \Pi^1\times\Pi^2$, $k \in \mathbb{N}_0$; both identities follow from the tower property of conditional expectation, conditioning on $(x_k,a_k,b_k)$ and using (1)-(2). Thus, (5) can be rewritten as

$$J(\pi^1,\pi^2,x) = \liminf_{n\to\infty} \frac{E^{\pi^1,\pi^2}_x \sum_{k=0}^{n-1} R(x_k,a_k,b_k)}{E^{\pi^1,\pi^2}_x \sum_{k=0}^{n-1} \tau(x_k,a_k,b_k)}. \tag{6}$$

Now consider the following functions on $X$ defined as

$$L(x) := \sup_{\pi^1\in\Pi^1}\ \inf_{\pi^2\in\Pi^2} J(\pi^1,\pi^2,x) \quad\text{and}\quad U(x) := \inf_{\pi^2\in\Pi^2}\ \sup_{\pi^1\in\Pi^1} J(\pi^1,\pi^2,x), \tag{7}$$

which are called the lower value and the upper value of the game, respectively, for the ratio EAP criterion. In general, $L(\cdot) \le U(\cdot)$; if $L(\cdot) = U(\cdot)$, the common function is called the value of the game and is denoted by $V(\cdot)$.

If the game has a value $V(\cdot)$, a strategy $\pi^1_* \in \Pi^1$ is said to be expected average payoff (EAP-) optimal for player 1 if

$$\inf_{\pi^2\in\Pi^2} J(\pi^1_*,\pi^2,x) = V(x) \qquad \forall\, x \in X.$$

Similarly, $\pi^2_* \in \Pi^2$ is said to be EAP-optimal for player 2 if

$$\sup_{\pi^1\in\Pi^1} J(\pi^1,\pi^2_*,x) = V(x) \qquad \forall\, x \in X.$$

If $\pi^i_*$ is EAP-optimal for player $i$ ($i = 1,2$), then $(\pi^1_*,\pi^2_*)$ is called an EAP-optimal pair or saddle point. Note that $(\pi^1_*,\pi^2_*)$ is EAP-optimal if and only if

$$J(\pi^1,\pi^2_*,x) \le J(\pi^1_*,\pi^2_*,x) \le J(\pi^1_*,\pi^2,x) \qquad \forall\, x \in X,\ (\pi^1,\pi^2) \in \Pi^1\times\Pi^2.$$

4 Assumptions and main results

The first condition imposed on the model, Assumption 4.1 below, ensures that the system is regular, which means that it experiences finitely many jumps or transitions over each finite period of time. Usually, the regularity property is obtained by assuming that the mean holding time $\tau$ is bounded below by a positive constant (see, e.g., [2], [5], [15], [17], [18], [19], [20], [24], [26], [28] and their references). In the present paper it is only assumed that the mean holding time is a positive function.
Assumption 4.1 (Regularity condition). $\tau(x,a,b) > 0$ for all $(x,a,b) \in K$.

The second hypothesis imposes a growth condition on both the mean holding time and the mean payoff.

Assumption 4.2. There exists a measurable function $W(\cdot)$ on $X$, bounded below by a constant $\theta > 0$, such that

$$\max\{\tau(x,a,b),\ |R(x,a,b)|\} \le K W(x) \qquad \forall\,(x,a,b) \in K,$$

for a fixed positive constant $K$.

To state the third set of hypotheses, as well as several of its consequences, some notation is required. For a measurable function $u(\cdot)$ on $X$, define the weighted norm with respect to $W$ ($W$-norm, for short) as

$$\|u\|_W := \sup_{x\in X} \frac{|u(x)|}{W(x)},$$

and denote by $B_W(X)$ the Banach space of all measurable functions with finite $W$-norm. Moreover, for a measure $\gamma(\cdot)$ on $X$, let

$$\gamma(u) := \int_X u(x)\,\gamma(dx),$$

whenever the integral is well defined.

Assumption 4.3 (Lyapunov condition). There exist a non-trivial measure $\nu(\cdot)$ on $X$, a nonnegative measurable function $S(\cdot)$ on $K$ and a positive constant $\lambda < 1$ such that:

(a) $\nu(W) < \infty$;

(b) $Q(B|x,a,b) \ge \nu(B)S(x,a,b)$ for all $B \in \mathcal{B}(X)$, $(x,a,b) \in K$;

(c) $\int_X W(y)\,Q(dy|x,a,b) \le \lambda W(x) + S(x,a,b)\nu(W)$ for all $(x,a,b) \in K$;

(d) $\nu(S_{\varphi^1,\varphi^2}) > 0$ for all $(\varphi^1,\varphi^2) \in \Phi^1\times\Phi^2$.

As we mentioned in the Introduction, Assumption 4.3 allows us to use a fixed-point approach. More precisely, we consider the kernel

$$\widetilde{Q}(\cdot|x,a,b) := Q(\cdot|x,a,b) - \nu(\cdot)S(x,a,b), \qquad (x,a,b) \in K, \tag{8}$$

which, under Assumption 4.3, is nonnegative. The point here is that Assumption 4.3(c) can be expressed equivalently as

$$\int_X W(y)\,\widetilde{Q}(dy|x,a,b) \le \lambda W(x) \qquad \forall\,(x,a,b) \in K, \tag{9}$$

which, roughly speaking, means that $\widetilde{Q}(\cdot|\cdot)$ satisfies a certain contraction property. This contraction property is precisely what we shall exploit to prove our main results (Theorems 4.5 and 4.7 below).
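Indeed, the equivalence of Assumption 4.3(c) and (9) is immediate from the definition (8); in sketch,

$$\int_X W(y)\,\widetilde{Q}(dy|x,a,b) = \int_X W(y)\,Q(dy|x,a,b) - \nu(W)S(x,a,b) \le \lambda W(x),$$

where the inequality is a rearrangement of 4.3(c), while the nonnegativity of $\widetilde{Q}$ is just 4.3(b).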
Assumption 4.3 was first used in [27], though it is actually a simplified version of the Lyapunov condition introduced in [9]. Specifically, besides the conditions in Assumption 4.3, [9] assumes the existence of a common irreducibility measure for the transition laws induced by the stationary strategies, and also that the inequality in Assumption 4.3(d) holds uniformly, that is,

$$\inf_{\varphi^1,\varphi^2} \nu(S_{\varphi^1,\varphi^2}) > 0.$$

However, as shown in [27, Thm. 3.3] (see Proposition 4.4 below), the latter condition is not required, while the irreducibility condition is redundant. On the other hand, several other papers have used Lyapunov conditions similar to Assumption 4.3 (see, e.g., [13], [14], [15], [16], [17], [23]) but with some important differences, which seemingly preclude the fixed-point approach. For instance, the latter four papers suppose, instead of the conditions in Assumption 4.3, that

$$\int_X W(y)\,Q(dy|x,a,b) \le \lambda W(x) + b\,I_C(x) \qquad \forall\,(x,a,b) \in K,$$

where $C$ is a Borel subset of $X$, $b$ is a positive constant, $\lambda \in (0,1)$ and $W(\cdot)$ is bounded on $C$, and also that

$$Q_{\varphi^1,\varphi^2}(B|x) \ge \delta\, I_C(x)\,\nu_{\varphi^1,\varphi^2}(B)$$

for all $x \in X$, $B \in \mathcal{B}(X)$, $(\varphi^1,\varphi^2) \in \Phi^1\times\Phi^2$, where each $\nu_{\varphi^1,\varphi^2}(\cdot)$ is a probability measure concentrated on $C$ and $\delta$ is a positive constant. A quick glance at the latter conditions shows that they do not lead to a contraction property as in (9), so the fixed-point approach is not applicable, at least in the way we use it here.

Finally, it is convenient to point out again that, in spite of imposing conditions similar to Assumption 4.3, the approaches followed in all the papers cited so far rely on the WGE property mentioned in the Introduction, with the only exceptions of [27] and [8]. The next proposition states some important consequences of Assumptions 4.2 and 4.3, which are proved in [27] using fixed-point arguments as well.

Proposition 4.4. Suppose that Assumption 4.3 holds.
Then, for each stationary strategy pair $(\varphi^1,\varphi^2) \in \Phi^1\times\Phi^2$, the following hold:

(a) The transition law $Q_{\varphi^1,\varphi^2}(\cdot|\cdot)$ is positive Harris recurrent. Thus, in particular, there exists a unique invariant probability measure $\mu_{\varphi^1,\varphi^2}(\cdot)$, that is,

$$\mu_{\varphi^1,\varphi^2}(\cdot) = \int_X Q_{\varphi^1,\varphi^2}(\cdot|x)\,\mu_{\varphi^1,\varphi^2}(dx).$$

Moreover, $\nu$ is an irreducibility measure for $Q_{\varphi^1,\varphi^2}(\cdot|\cdot)$.

(b) $\mu_{\varphi^1,\varphi^2}(W)$ is finite; in fact, the bounds

$$\theta \le \mu_{\varphi^1,\varphi^2}(W) \le \frac{\nu(W)}{(1-\lambda)\nu(X)} \tag{10}$$

hold.
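For orientation, the upper bound in (10) can be checked formally from Assumption 4.3 (a sketch, assuming $\mu := \mu_{\varphi^1,\varphi^2}(W) < \infty$ so that terms may be rearranged): integrating 4.3(c) against the invariant measure $\mu_{\varphi^1,\varphi^2}$ and using $1 = Q(X|x,a,b) \ge \nu(X)S(x,a,b)$ (from 4.3(b)) gives

$$\mu_{\varphi^1,\varphi^2}(W) \le \lambda\,\mu_{\varphi^1,\varphi^2}(W) + \nu(W)\,\mu_{\varphi^1,\varphi^2}(S_{\varphi^1,\varphi^2}) \le \lambda\,\mu_{\varphi^1,\varphi^2}(W) + \frac{\nu(W)}{\nu(X)},$$

so $\mu_{\varphi^1,\varphi^2}(W) \le \nu(W)/[(1-\lambda)\nu(X)]$; the lower bound holds simply because $W \ge \theta$.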
Next observe that, under Assumptions 4.1, 4.2 and 4.3, by Proposition 4.4 the constants

$$\rho(\varphi^1,\varphi^2) := \frac{\mu_{\varphi^1,\varphi^2}(R_{\varphi^1,\varphi^2})}{\mu_{\varphi^1,\varphi^2}(\tau_{\varphi^1,\varphi^2})}, \qquad (\varphi^1,\varphi^2) \in \Phi^1\times\Phi^2, \tag{11}$$

are finite. Then, for each $(\varphi^1,\varphi^2) \in \Phi^1\times\Phi^2$, define on $B_W(X)$ the operator

$$L_{\varphi^1,\varphi^2}u(x) := \overline{R}_{\varphi^1,\varphi^2}(x) + \int_X u(y)\,Q_{\varphi^1,\varphi^2}(dy|x), \qquad x \in X, \tag{12}$$

where

$$\overline{R}_{\varphi^1,\varphi^2}(\cdot) := R_{\varphi^1,\varphi^2}(\cdot) - \rho(\varphi^1,\varphi^2)\,\tau_{\varphi^1,\varphi^2}(\cdot). \tag{13}$$

Theorem 4.5. Suppose that Assumptions 4.1, 4.2 and 4.3 hold. Then for each stationary strategy pair $(\varphi^1,\varphi^2) \in \Phi^1\times\Phi^2$:

(a) There exists a unique function $h_{\varphi^1,\varphi^2} \in B_W(X)$, with $\nu(h_{\varphi^1,\varphi^2}) = 0$, that satisfies the (semi-Markov) Poisson equation

$$h_{\varphi^1,\varphi^2}(x) = L_{\varphi^1,\varphi^2}h_{\varphi^1,\varphi^2}(x) = \overline{R}_{\varphi^1,\varphi^2}(x) + \int_X h_{\varphi^1,\varphi^2}(y)\,Q_{\varphi^1,\varphi^2}(dy|x), \qquad x \in X;$$

(b) Moreover, $J(\varphi^1,\varphi^2,\cdot) = \rho(\varphi^1,\varphi^2)$.

Now we impose some compactness/continuity conditions on the model to ensure the existence of measurable minimizers/maximizers; notice that this can be done in several settings (see, e.g., [10, Thm. 3.5, p. 28] or [8, Lemma 3.5]). Here, for simplicity, we consider the following one.

Assumption 4.6 (Compactness/continuity conditions). For each $(x,a,b) \in K$:

(a) $A(x)$ and $B(x)$ are non-empty compact sets;

(b) $R(x,\cdot,b)$ is upper semicontinuous on $A(x)$, and $R(x,a,\cdot)$ is lower semicontinuous on $B(x)$;

(c) $\tau(x,\cdot,b)$ and $\tau(x,a,\cdot)$ are continuous on $A(x)$ and $B(x)$, respectively;

(d) $S(x,\cdot,b)$ and $S(x,a,\cdot)$ are continuous on $A(x)$ and $B(x)$, respectively;

(e) For each bounded measurable function $v$ on $X$, the functions $\int_X v(y)\,Q(dy|x,\cdot,b)$ and $\int_X v(y)\,Q(dy|x,a,\cdot)$ are continuous on $A(x)$ and $B(x)$, respectively;

(f) The functions
10 W (y)q(dy x,, b) and are continuous on A(x) and B(x), respectively. W (y)q(dy x, a, ) Theorem 4.7. Suppose that Assumptions 4.1, 4.2, 4.3 and 4.6 hold. Then: (a) There exists a unique function h B W () with ν(h ) = 0, a stationary strategy pair (ϕ 1, ϕ 2 ) Φ 1 Φ 2 and a constant ρ which satisfy the Shapley equation } h (x) = min {R ϕ 1,ϕ 2(x) ϕ 2 Φ ρ τ ϕ 1,ϕ 2(x) + h (y)q ϕ 1 2,ϕ 2(dy x) x, } = max {R ϕ 1,ϕ 2 (x) ϕ 1 Φ ρ τ ϕ 1,ϕ 2 (x) + h (y)q ϕ 1,ϕ 2(dy x) 1 = R ϕ 1,ϕ 2 (x) ρ τ ϕ 1,ϕ 2 (x) + h (y)q ϕ 1,ϕ 2 (dy x). (b) The constant ρ is the value of the game and (ϕ 1, ϕ 2 ) is an EAP-optimal stationary strategy pair. That is, J(ϕ 1, ϕ2, ) = ρ and Hence, by Theorem 4.5, J(π 1, ϕ 2, ) ρ J(ϕ 1, π2, ) (π 1, π 2 ) Π 1 Π 2. (c) Moreover, h ( ) = h ϕ 1,ϕ 2 ( ). ρ = ρ(ϕ 1, ϕ 2 ) = max min ϕ 2 Φ ρ(ϕ1, ϕ 2 ) = min 2 ϕ 1 Φ 1 ϕ 2 Φ 2 max ϕ 1 Φ ρ(ϕ1, ϕ 2 ), (14) 1 h ( ) = h ϕ 1,ϕ 2 ( ) = min h ( ) = max h ( ), (15) ϕ 2 F 2 ϕ 1,ϕ2 ϕ 1 Φ 1 ϕ 1,ϕ 2 where F i stands for the class of all stationary EAP-optimal strategies for player i (i = 1, 2). It is worth mentioning that, to the best of our knowledge, the minimax characterization of the solution h ( ) of the Shapley equation given in (15) has been discussed in any of the previous paper dealing with zero-sum stochastic games, even for the case of discrete state space. 10
5 Proof of Theorem 4.5

The proofs of the results in Section 4 require several preliminary results. The first ones are collected in the next lemma, which we state without proof because they follow directly from Assumptions 4.1, 4.2 and 4.3.

Lemma 5.1. Suppose that Assumption 4.3 holds. Then:

(a) For each function $u$ in $B_W(X)$,

$$\lim_{n\to\infty} \frac{1}{n}\,E^{\pi^1,\pi^2}_x u(x_n) = 0 \qquad \forall\, x \in X,\ (\pi^1,\pi^2) \in \Pi^1\times\Pi^2;$$

(b) For each stationary strategy pair $(\varphi^1,\varphi^2) \in \Phi^1\times\Phi^2$, it holds that

$$\mu_{\varphi^1,\varphi^2}(S_{\varphi^1,\varphi^2}) \ge \frac{(1-\lambda)\theta}{\nu(W)} > 0;$$

(c) If in addition Assumptions 4.1 and 4.2 hold, then

$$\frac{\mu_{\varphi^1,\varphi^2}(S_{\varphi^1,\varphi^2})}{\mu_{\varphi^1,\varphi^2}(\tau_{\varphi^1,\varphi^2})} \ge \frac{1-\lambda}{K\nu(W)} > 0.$$

The following lemma concerns the existence of solutions to the Poisson equation; in addition to being interesting in itself, it plays a key role in our development. In fact, its proof exhibits the way we take advantage of the contraction property (9).

Lemma 5.2. Suppose that Assumptions 4.2 and 4.3 hold and let $(\varphi^1,\varphi^2) \in \Phi^1\times\Phi^2$ be fixed but arbitrary. Then, for each function $v$ in $B_W(X)$ there exists a unique function $h_v$ in $B_W(X)$, with $\nu(h_v) = 0$, which satisfies the Poisson equation

$$h_v(x) = v(x) - \mu_{\varphi^1,\varphi^2}(v) + \int_X h_v(y)\,Q_{\varphi^1,\varphi^2}(dy|x), \qquad x \in X. \tag{16}$$

Thus, from Lemma 5.1(a),

$$\mu_{\varphi^1,\varphi^2}(v) = \lim_{n\to\infty} \frac{1}{n}\,E^{\varphi^1,\varphi^2}_x \sum_{k=0}^{n-1} v(x_k) \qquad \forall\, x \in X. \tag{17}$$

Proof of Lemma 5.2. Fix a function $v \in B_W(X)$, and write $\mu(\cdot) := \mu_{\varphi^1,\varphi^2}(\cdot)$, $S(\cdot) := S_{\varphi^1,\varphi^2}(\cdot)$, $Q(\cdot|\cdot) := Q_{\varphi^1,\varphi^2}(\cdot|\cdot)$ and $\widetilde{Q}(\cdot|\cdot) := Q(\cdot|\cdot) - \nu(\cdot)S(\cdot)$. Next, define

$$Tu(x) := v(x) - \mu(v) + \int_X u(y)\,\widetilde{Q}(dy|x), \qquad x \in X,\ u \in B_W(X).$$

By Assumption 4.3(c), it is clear that $T$ maps $B_W(X)$ into itself. Moreover, for any functions $u, w \in B_W(X)$, it holds that
$$|Tu(x) - Tw(x)| \le \int_X |u(y)-w(y)|\,\widetilde{Q}(dy|x) \le \|u-w\|_W \int_X W(y)\,\widetilde{Q}(dy|x) \le \|u-w\|_W\,\lambda W(x)$$

for all $x \in X$. Hence,

$$\|Tu - Tw\|_W \le \lambda\,\|u-w\|_W.$$

That is, $T$ is a contraction operator from $B_W(X)$ into itself with modulus $\lambda$. Then, by the Banach Fixed Point Theorem, there exists a unique function $h_v \in B_W(X)$ that satisfies the equation

$$h_v(x) = v(x) - \mu(v) + \int_X h_v(y)\,\widetilde{Q}(dy|x) = v(x) - \mu(v) + \int_X h_v(y)\,Q(dy|x) - \nu(h_v)S(x), \qquad x \in X.$$

Now, integrating both sides of the last equation with respect to the invariant probability measure $\mu(\cdot)$ yields $\nu(h_v)\mu(S) = 0$, which, by Lemma 5.1(b), implies that $\nu(h_v) = 0$. Therefore, $h_v$ satisfies the Poisson equation

$$h_v(x) = v(x) - \mu(v) + \int_X h_v(y)\,Q(dy|x), \qquad x \in X,$$

which proves (16). Finally, property (17) is obtained by iterating the Poisson equation and using Lemma 5.1(a).

Now we proceed to prove Theorem 4.5.

Proof of Theorem 4.5. Let $(\varphi^1,\varphi^2) \in \Phi^1\times\Phi^2$ be fixed but arbitrary. Since the function

$$v(\cdot) := \overline{R}_{\varphi^1,\varphi^2}(\cdot) = R_{\varphi^1,\varphi^2}(\cdot) - \rho(\varphi^1,\varphi^2)\,\tau_{\varphi^1,\varphi^2}(\cdot)$$

is in $B_W(X)$, by Lemma 5.2 there exists a unique function $h_{\varphi^1,\varphi^2} \in B_W(X)$ with $\nu(h_{\varphi^1,\varphi^2}) = 0$ that satisfies the Poisson equation

$$h_{\varphi^1,\varphi^2}(x) = \overline{R}_{\varphi^1,\varphi^2}(x) + \int_X h_{\varphi^1,\varphi^2}(y)\,Q_{\varphi^1,\varphi^2}(dy|x), \qquad x \in X.$$
This proves part (a) of the theorem. Next, to prove part (b), first note that iteration of the last equation yields

$$h_{\varphi^1,\varphi^2}(x) = E^{\varphi^1,\varphi^2}_x\Big[\sum_{k=0}^{n-1} R_{\varphi^1,\varphi^2}(x_k) - \rho(\varphi^1,\varphi^2)\sum_{k=0}^{n-1}\tau_{\varphi^1,\varphi^2}(x_k)\Big] + \int_X h_{\varphi^1,\varphi^2}(y)\,Q^n_{\varphi^1,\varphi^2}(dy|x) \tag{18}$$

for all $n \in \mathbb{N}$ and $x \in X$. Moreover, by Assumptions 4.1 and 4.2, applying Lemma 5.2 with $v(\cdot) := \tau_{\varphi^1,\varphi^2}(\cdot)$ we obtain

$$\mu_{\varphi^1,\varphi^2}(\tau_{\varphi^1,\varphi^2}) = \lim_{n\to\infty}\frac{1}{n}\,E^{\varphi^1,\varphi^2}_x\sum_{k=0}^{n-1}\tau_{\varphi^1,\varphi^2}(x_k) > 0 \qquad \forall\,x\in X,$$

which, combined with (18) and Lemma 5.1(a), implies that

$$\rho(\varphi^1,\varphi^2) = \lim_{n\to\infty}\frac{E^{\varphi^1,\varphi^2}_x\sum_{k=0}^{n-1}R_{\varphi^1,\varphi^2}(x_k)}{E^{\varphi^1,\varphi^2}_x\sum_{k=0}^{n-1}\tau_{\varphi^1,\varphi^2}(x_k)} \qquad \forall\,x\in X.$$

6 Proof of Theorem 4.7

Define the constants

$$\rho_l := \sup_{\varphi^1\in\Phi^1}\ \inf_{\varphi^2\in\Phi^2}\rho(\varphi^1,\varphi^2) \quad\text{and}\quad \rho_u := \inf_{\varphi^2\in\Phi^2}\ \sup_{\varphi^1\in\Phi^1}\rho(\varphi^1,\varphi^2).$$

We show in the next lemma that these constants are finite. Observe that this trivially holds if one assumes that the mean holding time function is bounded below by a positive constant.

Lemma 6.1. Suppose that Assumptions 4.1, 4.2, 4.3 and 4.6 hold. Then $\rho_l < \infty$ and $\rho_u < \infty$.

Proof of Lemma 6.1. Let $\varphi^1$ be a fixed but arbitrary stationary strategy for player 1 and consider the Markov (one-player) model

$$\mathcal{M} = (X, K_B, \widehat{Q}, \widehat{\tau}),$$

where $X$ and $K_B$ are as above, and the transition law and one-step cost function are defined as
$$\widehat{Q}(\cdot|x,b) := \int_{A(x)} Q(\cdot|x,a,b)\,\varphi^1(da|x), \qquad \widehat{\tau}(x,b) := \int_{A(x)} \tau(x,a,b)\,\varphi^1(da|x)$$

for all $(x,b) \in K_B$, respectively. Thus, following the notation (4), for all $x \in X$ and $\varphi^2 \in \Phi^2$ define

$$\widehat{Q}_{\varphi^2}(\cdot|x) := \int_{B(x)} \widehat{Q}(\cdot|x,b)\,\varphi^2(db|x), \qquad \widehat{\tau}_{\varphi^2}(x) := \int_{B(x)} \widehat{\tau}(x,b)\,\varphi^2(db|x).$$

Note that $\widehat{Q}_{\varphi^2}(\cdot|\cdot) = Q_{\varphi^1,\varphi^2}(\cdot|\cdot)$ and $\widehat{\tau}_{\varphi^2}(\cdot) = \tau_{\varphi^1,\varphi^2}(\cdot)$ for all $\varphi^2 \in \Phi^2$. The Markov model $\mathcal{M}$ satisfies all the conditions in [27, Thm. 3.6]; hence, in particular, there exists a stationary strategy $\varphi^2_+ \in \Phi^2$ such that

$$\mu_{\varphi^1,\varphi^2_+}(\tau_{\varphi^1,\varphi^2_+}) = \mu_{\varphi^1,\varphi^2_+}(\widehat{\tau}_{\varphi^2_+}) = \inf_{\varphi^2\in\Phi^2}\ \mu_{\varphi^1,\varphi^2}(\widehat{\tau}_{\varphi^2}).$$

Then, by Assumption 4.1, it holds that $\mu_{\varphi^1,\varphi^2_+}(\tau_{\varphi^1,\varphi^2_+}) > 0$. Next observe that

$$\rho(\varphi^1,\varphi^2) \le \frac{\mu_{\varphi^1,\varphi^2}(|R_{\varphi^1,\varphi^2}|)}{\mu_{\varphi^1,\varphi^2}(\tau_{\varphi^1,\varphi^2})} \le \frac{K\,\mu_{\varphi^1,\varphi^2}(W)}{\mu_{\varphi^1,\varphi^2_+}(\tau_{\varphi^1,\varphi^2_+})} \le \frac{Kk}{\mu_{\varphi^1,\varphi^2_+}(\tau_{\varphi^1,\varphi^2_+})},$$

where the last inequality follows from (10) with $k := \nu(W)[(1-\lambda)\nu(X)]^{-1}$. Hence,

$$\inf_{\varphi^2\in\Phi^2}\rho(\varphi^1,\varphi^2) \le \rho(\varphi^1,\varphi^2_+) \le \frac{Kk}{\mu_{\varphi^1,\varphi^2_+}(\tau_{\varphi^1,\varphi^2_+})} < +\infty \qquad \forall\,\varphi^1\in\Phi^1. \tag{19}$$

Now fix $\varphi^2 \in \Phi^2$ and proceed as above to get a stationary strategy $\varphi^1_+ \in \Phi^1$ such that

$$\mu_{\varphi^1_+,\varphi^2}(\tau_{\varphi^1_+,\varphi^2}) = \inf_{\varphi^1\in\Phi^1}\mu_{\varphi^1,\varphi^2}(\tau_{\varphi^1,\varphi^2}) > 0.$$
Then,

$$\rho(\varphi^1,\varphi^2) \le \frac{\mu_{\varphi^1,\varphi^2}(|R_{\varphi^1,\varphi^2}|)}{\mu_{\varphi^1,\varphi^2}(\tau_{\varphi^1,\varphi^2})} \le \frac{Kk}{\mu_{\varphi^1_+,\varphi^2}(\tau_{\varphi^1_+,\varphi^2})} < +\infty.$$

Hence,

$$\sup_{\varphi^1\in\Phi^1}\rho(\varphi^1,\varphi^2) \le \frac{Kk}{\mu_{\varphi^1_+,\varphi^2}(\tau_{\varphi^1_+,\varphi^2})}. \tag{20}$$

Therefore, by (19)-(20),

$$\rho_l = \sup_{\varphi^1\in\Phi^1}\ \inf_{\varphi^2\in\Phi^2}\rho(\varphi^1,\varphi^2) < +\infty \quad\text{and}\quad \rho_u = \inf_{\varphi^2\in\Phi^2}\ \sup_{\varphi^1\in\Phi^1}\rho(\varphi^1,\varphi^2) < +\infty,$$

which proves the desired result.

For the proof of Theorem 4.7 we introduce the following operators: for each $u \in B_W(X)$ define

$$L^l u(x,a,b) := R^l(x,a,b) + \int_X u(y)\,\widetilde{Q}(dy|x,a,b), \qquad (x,a,b) \in K, \tag{21}$$

where

$$R^l(x,a,b) := R(x,a,b) - \rho_l\,\tau(x,a,b), \qquad (x,a,b) \in K. \tag{22}$$

Thus, following the notation (4), for each strategy pair $(\varphi^1,\varphi^2) \in \Phi^1\times\Phi^2$ define the operators

$$L^l_{\varphi^1,\varphi^2}u(\cdot) := R^l_{\varphi^1,\varphi^2}(\cdot) + \int_X u(y)\,\widetilde{Q}_{\varphi^1,\varphi^2}(dy|\cdot), \tag{23}$$

$$Lu(x) := \sup_{\varphi^1\in\mathbb{A}(x)}\ \inf_{\varphi^2\in\mathbb{B}(x)} L^l_{\varphi^1,\varphi^2}u(x), \tag{24}$$

for each $u \in B_W(X)$.

The results in the next lemma are a combination of a well-known measurable selection theorem [22] and Fan's Minimax Theorem [4]. The proof is omitted since it is the same as the proofs of Lemma 6.5 in [11] and Lemmas 2, 3 and 4 in [23].

Lemma 6.2. Suppose that Assumptions 4.1, 4.2, 4.3 and 4.6 hold and let $u$ be a fixed function in $B_W(X)$. Then:

(a) For each $x \in X$, the sets $\mathbb{A}(x)$ and $\mathbb{B}(x)$ are compact with respect to the weak convergence of measures;

(b) For each $x \in X$, the mappings
$$\varphi^1 \mapsto L^l_{\varphi^1,\varphi^2}u(x), \qquad \varphi^2 \mapsto L^l_{\varphi^1,\varphi^2}u(x)$$

are upper semicontinuous and lower semicontinuous on $\mathbb{A}(x)$ and $\mathbb{B}(x)$, respectively, with respect to the weak convergence of measures;

(c) Moreover, there exists a stationary strategy pair $(\varphi^1_u,\varphi^2_u) \in \Phi^1\times\Phi^2$ such that

$$Lu(\cdot) = L^l_{\varphi^1_u,\varphi^2_u}u(\cdot) = \max_{\varphi^1\in\Phi^1} L^l_{\varphi^1,\varphi^2_u}u(\cdot) = \min_{\varphi^2\in\Phi^2} L^l_{\varphi^1_u,\varphi^2}u(\cdot).$$

Hence, $Lu(\cdot)$ is in $B_W(X)$.

The proof of Theorem 4.7 follows the same scheme as that of Lemma 5.2. We first show, in Lemma 6.3 below, that $L$ is a contraction operator from $B_W(X)$ into itself with modulus $\lambda$; hence, by the Banach Fixed Point Theorem, there exists a unique function $h^*$ in $B_W(X)$ such that

$$h^*(x) = Lh^*(x) = \sup_{\varphi^1\in\mathbb{A}(x)}\ \inf_{\varphi^2\in\mathbb{B}(x)} L^l_{\varphi^1,\varphi^2}h^*(x). \tag{25}$$

As a second step, in Lemma 6.4, we prove that $\rho^* := \rho_l = \rho_u$ and $\nu(h^*) \le 0$. Once the latter is done, we show in Lemma 6.5 that $\nu(h^*) = 0$. Then, (25) becomes

$$h^*(x) = \sup_{\varphi^1\in\mathbb{A}(x)}\ \inf_{\varphi^2\in\mathbb{B}(x)}\Big[R_{\varphi^1,\varphi^2}(x) - \rho^*\tau_{\varphi^1,\varphi^2}(x) + \int_X h^*(y)\,Q_{\varphi^1,\varphi^2}(dy|x)\Big]$$

for all $x \in X$. Hence, Lemma 6.2 yields a stationary strategy pair $(\varphi^1_*,\varphi^2_*) \in \Phi^1\times\Phi^2$ satisfying Theorem 4.7(a).

Lemma 6.3. Suppose that the assumptions of Theorem 4.7 hold. Then $L$ in (24) is a contraction operator from $B_W(X)$ into itself with modulus $\lambda$. Thus, by the Banach Fixed Point Theorem and Lemma 6.2, there exist a unique function $h^*$ in $B_W(X)$ and a stationary strategy pair $(\varphi^1_*,\varphi^2_*) \in \Phi^1\times\Phi^2$ such that
$$h^*(\cdot) = Lh^*(\cdot) = L^l_{\varphi^1_*,\varphi^2_*}h^*(\cdot) \tag{26}$$

$$= \min_{\varphi^2\in\mathbb{B}(x)} L^l_{\varphi^1_*,\varphi^2}h^*(\cdot) = \max_{\varphi^1\in\mathbb{A}(x)} L^l_{\varphi^1,\varphi^2_*}h^*(\cdot). \tag{27}$$

Proof of Lemma 6.3. By Lemma 6.2 it only remains to prove that $L$ is a contraction operator from $B_W(X)$ into itself with modulus $\lambda$. To prove this, consider arbitrary functions $u, v$ in $B_W(X)$ and observe, by Assumption 4.3(b) and (9), that

$$\big|L^l_{\varphi^1,\varphi^2}u(\cdot) - L^l_{\varphi^1,\varphi^2}v(\cdot)\big| \le \|u-v\|_W \int_X W(y)\,\widetilde{Q}_{\varphi^1,\varphi^2}(dy|\cdot) \le \|u-v\|_W\,\lambda W(\cdot)$$

for all $(\varphi^1,\varphi^2) \in \Phi^1\times\Phi^2$. This implies that

$$L^l_{\varphi^1,\varphi^2}u(\cdot) \le L^l_{\varphi^1,\varphi^2}v(\cdot) + \|u-v\|_W\,\lambda W(\cdot) \qquad \forall\,(\varphi^1,\varphi^2)\in\Phi^1\times\Phi^2.$$

Thus, the latter inequality together with Lemma 6.2 implies

$$\inf_{\varphi^2\in\mathbb{B}(x)} L^l_{\varphi^1,\varphi^2}u(\cdot) \le \inf_{\varphi^2\in\mathbb{B}(x)} L^l_{\varphi^1,\varphi^2}v(\cdot) + \|u-v\|_W\,\lambda W(\cdot) \qquad \forall\,\varphi^1\in\Phi^1,$$

which, using Lemma 6.2 again, yields

$$Lu(\cdot) \le Lv(\cdot) + \|u-v\|_W\,\lambda W(\cdot).$$

Similarly, interchanging the roles of $u$ and $v$, it also holds that

$$Lv(\cdot) \le Lu(\cdot) + \|u-v\|_W\,\lambda W(\cdot).$$

Therefore,

$$\|Lu - Lv\|_W \le \lambda\,\|u-v\|_W.$$

That is, $L$ is a contraction operator from $B_W(X)$ into itself with modulus $\lambda$. Now, the Banach Fixed Point Theorem together with Lemma 6.2 ensures the existence of a unique function $h^* \in B_W(X)$ and a stationary strategy pair $(\varphi^1_*,\varphi^2_*) \in \Phi^1\times\Phi^2$ satisfying (26)-(27).

Lemma 6.4. Suppose that the assumptions of Theorem 4.7 hold and let $h^*$ be as in Lemma 6.3. Then
$\nu(h^*) \le 0$ and $\rho_l = \rho_u$.

Proof of Lemma 6.4. Let $(\varphi^1_*,\varphi^2_*)$ be as in Lemma 6.3. Then,

$$h^*(x) = \min_{\varphi^2\in\mathbb{B}(x)}\Big[R^l_{\varphi^1_*,\varphi^2}(x) + \int_X h^*(y)\,\widetilde{Q}_{\varphi^1_*,\varphi^2}(dy|x)\Big] \tag{28}$$

$$\le R^l_{\varphi^1_*,\varphi^2}(x) + \int_X h^*(y)\,\widetilde{Q}_{\varphi^1_*,\varphi^2}(dy|x) = R^l_{\varphi^1_*,\varphi^2}(x) + \int_X h^*(y)\,Q_{\varphi^1_*,\varphi^2}(dy|x) - \nu(h^*)S_{\varphi^1_*,\varphi^2}(x)$$

for all $x \in X$, $\varphi^2 \in \Phi^2$. Then, an integration with respect to the invariant probability measure $\mu_{\varphi^1_*,\varphi^2}$ yields

$$0 \le \mu_{\varphi^1_*,\varphi^2}(R^l_{\varphi^1_*,\varphi^2}) - \nu(h^*)\mu_{\varphi^1_*,\varphi^2}(S_{\varphi^1_*,\varphi^2}) \qquad \forall\,\varphi^2\in\Phi^2,$$

which implies that

$$\nu(h^*)\mu_{\varphi^1_*,\varphi^2}(S_{\varphi^1_*,\varphi^2}) \le \mu_{\varphi^1_*,\varphi^2}(R_{\varphi^1_*,\varphi^2}) - \rho_l\,\mu_{\varphi^1_*,\varphi^2}(\tau_{\varphi^1_*,\varphi^2}) = \mu_{\varphi^1_*,\varphi^2}(\tau_{\varphi^1_*,\varphi^2})\big[\rho(\varphi^1_*,\varphi^2) - \rho_l\big]$$

for all $\varphi^2\in\Phi^2$. Now, taking the infimum over $\Phi^2$, we obtain

$$\nu(h^*)\,\inf_{\varphi^2\in\Phi^2}\frac{\mu_{\varphi^1_*,\varphi^2}(S_{\varphi^1_*,\varphi^2})}{\mu_{\varphi^1_*,\varphi^2}(\tau_{\varphi^1_*,\varphi^2})} \le \inf_{\varphi^2\in\Phi^2}\rho(\varphi^1_*,\varphi^2) - \rho_l \le 0,$$

which, by Assumption 4.1 and Lemma 5.1(c), implies that $\nu(h^*) \le 0$. This inequality combined with (27) implies

$$h^*(x) = \max_{\varphi^1\in\mathbb{A}(x)}\Big[R^l_{\varphi^1,\varphi^2_*}(x) + \int_X h^*(y)\,\widetilde{Q}_{\varphi^1,\varphi^2_*}(dy|x)\Big]$$

$$\ge \max_{\varphi^1\in\mathbb{A}(x)}\Big[R^l_{\varphi^1,\varphi^2_*}(x) + \int_X h^*(y)\,Q_{\varphi^1,\varphi^2_*}(dy|x)\Big] \ge R^l_{\varphi^1,\varphi^2_*}(x) + \int_X h^*(y)\,Q_{\varphi^1,\varphi^2_*}(dy|x)$$
for all $x \in X$, $\varphi^1 \in \Phi^1$. Now, integrating both sides of the latter inequality with respect to the invariant probability measure $\mu_{\varphi^1,\varphi^2_*}$, we see that

$$0 \ge \mu_{\varphi^1,\varphi^2_*}(R^l_{\varphi^1,\varphi^2_*}) = \mu_{\varphi^1,\varphi^2_*}(R_{\varphi^1,\varphi^2_*}) - \rho_l\,\mu_{\varphi^1,\varphi^2_*}(\tau_{\varphi^1,\varphi^2_*}) \qquad \forall\,\varphi^1\in\Phi^1,$$

which implies that

$$\rho_l \ge \rho(\varphi^1,\varphi^2_*) = \frac{\mu_{\varphi^1,\varphi^2_*}(R_{\varphi^1,\varphi^2_*})}{\mu_{\varphi^1,\varphi^2_*}(\tau_{\varphi^1,\varphi^2_*})} \qquad \forall\,\varphi^1\in\Phi^1.$$

Hence,

$$\rho_l \ge \sup_{\varphi^1\in\Phi^1}\rho(\varphi^1,\varphi^2_*) \ge \inf_{\varphi^2\in\Phi^2}\ \sup_{\varphi^1\in\Phi^1}\rho(\varphi^1,\varphi^2) = \rho_u.$$

Therefore, since $\rho_l \le \rho_u$ always holds, $\rho_l = \rho_u$.

Lemma 6.5. Suppose that the assumptions of Theorem 4.7 hold and let $h^*$ be as in Lemma 6.3. Then $\nu(h^*) = 0$.

Proof of Lemma 6.5. Let $(\varphi^1_*,\varphi^2_*)$ be as in Lemma 6.3 and put $\rho^* := \rho_l = \rho_u$. By (27), we have

$$h^*(x) = \max_{\varphi^1\in\mathbb{A}(x)}\Big[R_{\varphi^1,\varphi^2_*}(x) - \rho^*\tau_{\varphi^1,\varphi^2_*}(x) + \int_X h^*(y)\,\widetilde{Q}_{\varphi^1,\varphi^2_*}(dy|x)\Big]$$

$$\ge R_{\varphi^1,\varphi^2_*}(x) - \rho^*\tau_{\varphi^1,\varphi^2_*}(x) + \int_X h^*(y)\,\widetilde{Q}_{\varphi^1,\varphi^2_*}(dy|x)$$

for all $x \in X$, $\varphi^1 \in \Phi^1$. As above, integrating both sides of the latter inequality with respect to the invariant probability measure $\mu_{\varphi^1,\varphi^2_*}$, we obtain

$$\nu(h^*)\mu_{\varphi^1,\varphi^2_*}(S_{\varphi^1,\varphi^2_*}) \ge \mu_{\varphi^1,\varphi^2_*}(\tau_{\varphi^1,\varphi^2_*})\big[\rho(\varphi^1,\varphi^2_*) - \rho^*\big]$$

$$= \mu_{\varphi^1,\varphi^2_*}(\tau_{\varphi^1,\varphi^2_*})\Big[\rho(\varphi^1,\varphi^2_*) - \inf_{\varphi^2\in\Phi^2}\ \sup_{\varphi^1\in\Phi^1}\rho(\varphi^1,\varphi^2)\Big] \ge \mu_{\varphi^1,\varphi^2_*}(\tau_{\varphi^1,\varphi^2_*})\Big[\rho(\varphi^1,\varphi^2_*) - \sup_{\varphi^1\in\Phi^1}\rho(\varphi^1,\varphi^2_*)\Big],$$
Then,

$$\frac{\nu(h^*)\mu_{\varphi^1,\varphi^2_*}(S_{\varphi^1,\varphi^2_*})}{\mu_{\varphi^1,\varphi^2_*}(\tau_{\varphi^1,\varphi^2_*})} \ge \rho(\varphi^1,\varphi^2_*) - \sup_{\varphi^1\in\Phi^1}\rho(\varphi^1,\varphi^2_*) \qquad \forall\,\varphi^1\in\Phi^1,$$

and hence

$$\sup_{\varphi^1\in\Phi^1}\Big[\frac{\nu(h^*)\mu_{\varphi^1,\varphi^2_*}(S_{\varphi^1,\varphi^2_*})}{\mu_{\varphi^1,\varphi^2_*}(\tau_{\varphi^1,\varphi^2_*})}\Big] \ge 0.$$

This inequality implies that $\nu(h^*) \ge 0$. Hence, by Lemma 6.4, $\nu(h^*) = 0$.

Finally, we are ready for the proof of Theorem 4.7.

Proof of Theorem 4.7. Let $h^*$ and $(\varphi^1_*,\varphi^2_*)$ be as in Lemma 6.3. First note that the proof of part (a) is given through Lemmas 6.3, 6.4 and 6.5. Part (b) follows using standard dynamic programming arguments, while the first statement in part (c) is exactly Lemma 6.4. Thus, it only remains to prove the equalities in (15). To do this, first recall that $F^i$ denotes the class of all stationary EAP-optimal strategies for player $i$ ($i = 1,2$), which is nonempty because of part (b). Now, define the following operators on $B_W(X)$:

$$Mu(x) := \max_{\varphi^1\in\mathbb{A}(x)}\Big[R_{\varphi^1,\varphi^2_*}(x) - \rho^*\tau_{\varphi^1,\varphi^2_*}(x) + \int_X u(y)\,\widetilde{Q}_{\varphi^1,\varphi^2_*}(dy|x)\Big],$$

$$Nu(x) := \min_{\varphi^2\in\mathbb{B}(x)}\Big[R_{\varphi^1_*,\varphi^2}(x) - \rho^*\tau_{\varphi^1_*,\varphi^2}(x) + \int_X u(y)\,\widetilde{Q}_{\varphi^1_*,\varphi^2}(dy|x)\Big]$$

for all $x \in X$. Proceeding as above, it is easy to check that $M$ and $N$ are well defined and that they are $\lambda$-contraction operators from $B_W(X)$ into itself. In fact, from part (a), $h^*$ is the fixed point of both operators; that is,

$$h^*(\cdot) = Mh^*(\cdot) = Nh^*(\cdot).$$

Next choose an arbitrary strategy $\varphi^1_0$ in $F^1$ and note that $\rho^* = \rho(\varphi^1_0,\varphi^2_*)$. Then, by Theorem 4.5, there exists a unique function $h_{\varphi^1_0,\varphi^2_*} \in B_W(X)$, with $\nu(h_{\varphi^1_0,\varphi^2_*}) = 0$, which satisfies

$$h_{\varphi^1_0,\varphi^2_*}(x) = R_{\varphi^1_0,\varphi^2_*}(x) - \rho^*\tau_{\varphi^1_0,\varphi^2_*}(x) + \int_X h_{\varphi^1_0,\varphi^2_*}(y)\,Q_{\varphi^1_0,\varphi^2_*}(dy|x), \qquad x \in X.$$

Next, observe that (since $\nu(h_{\varphi^1_0,\varphi^2_*}) = 0$)

$$h_{\varphi^1_0,\varphi^2_*}(\cdot) \le Mh_{\varphi^1_0,\varphi^2_*}(\cdot),$$
which implies that

$$h_{\varphi^1_0,\varphi^2_*}(\cdot) \le M^n h_{\varphi^1_0,\varphi^2_*}(\cdot) \qquad \forall\, n \in \mathbb{N}.$$

Now, since $M$ is a contraction and $h^*$ is its fixed point, we have

$$h_{\varphi^1_0,\varphi^2_*}(\cdot) \le h^*(\cdot).$$

Hence, since $h^*(\cdot) = h_{\varphi^1_*,\varphi^2_*}(\cdot)$ and the strategy $\varphi^1_0$ was chosen arbitrarily in $F^1$, we have

$$\max_{\varphi^1\in F^1} h_{\varphi^1,\varphi^2_*}(\cdot) = h^*(\cdot).$$

Similar arguments, but using the operator $N$ instead of $M$, show that

$$h^*(\cdot) = \min_{\varphi^2\in F^2} h_{\varphi^1_*,\varphi^2}(\cdot).$$

Acknowledgment. The author thanks Prof. Onésimo Hernández-Lerma for his valuable comments on an early version of this work.

References

[1] E. Altman, A. Hordijk and F. M. Spieksma, Contraction conditions for average and α-discount optimality in countable state Markov games with unbounded rewards, Math. Oper. Res. 22 (1997).

[2] S. Bhatnagar and V. S. Borkar, A convex analytic framework for ergodic control of semi-Markov processes, Math. Oper. Res. 20 (1995).

[3] V. S. Borkar and M. K. Ghosh, Denumerable stochastic games with limiting average payoff, J. Optim. Theory Appl. 76 (1993).

[4] K. Fan, Minimax theorems, Proc. Nat. Acad. Sci. USA 39 (1953).

[5] A. Federgruen, P. J. Schweitzer and H. C. Tijms, Denumerable undiscounted semi-Markov decision processes with unbounded rewards, Math. Oper. Res. 8 (1983).

[6] J. Filar and K. Vrieze, Competitive Markov Decision Processes, Springer-Verlag, New York.

[7] M. K. Ghosh and A. Bagchi, Stochastic games with average payoff criterion, Appl. Math. Optim. 38 (1998).

[8] J. I. González-Trejo, O. Hernández-Lerma and L. F. Hoyos-Reyes, Minimax control of discrete-time stochastic systems, SIAM J. Control Optim., to appear.
[9] E. Gordienko and O. Hernández-Lerma, Average cost Markov control processes with weighted norms: existence of canonical policies, Appl. Math. (Warsaw) 23 (1995).

[10] O. Hernández-Lerma and J. B. Lasserre, Further Topics on Discrete-Time Markov Control Processes, Springer-Verlag, New York.

[11] O. Hernández-Lerma and J. B. Lasserre, Zero-sum stochastic games in Borel spaces: average payoff criteria, SIAM J. Control Optim. 39 (2001).

[12] O. Hernández-Lerma, R. Montes-de-Oca and R. Cavazos-Cadena, Recurrence conditions for MDPs with Borel state space, Ann. Oper. Res. 28 (1991).

[13] O. Hernández-Lerma and O. Vega-Amaya, Infinite-horizon Markov control processes with undiscounted cost criteria: from average to overtaking optimality, Appl. Math. (Warsaw) 25 (1998).

[14] O. Hernández-Lerma, O. Vega-Amaya and G. Carrasco, Sample-path optimality and variance-minimization of average cost Markov control processes, SIAM J. Control Optim. 38 (1999).

[15] A. Jaśkiewicz, An approximation approach to ergodic semi-Markov control processes, Math. Methods Oper. Res. 54 (2001).

[16] A. Jaśkiewicz and A. S. Nowak, On the optimality equation for zero-sum ergodic stochastic games, Math. Methods Oper. Res. 54 (2001).

[17] A. Jaśkiewicz, Zero-sum semi-Markov games, SIAM J. Control Optim., to appear.

[18] M. Kurano, Average optimal adaptive policies in semi-Markov decision processes including an unknown parameter, J. Oper. Res. Soc. Japan 28 (1985).

[19] A. K. Lal and S. Sinha, Zero-sum two-person semi-Markov games, J. Appl. Prob. 29 (1992).

[20] F. Luque-Vásquez and O. Hernández-Lerma, Semi-Markov models with average costs, Appl. Math. (Warsaw) 26 (1999).

[21] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, Springer-Verlag, London.

[22] A. S. Nowak, Measurable selection theorems for minimax stochastic optimization problems, SIAM J. Control Optim. 23 (1985).

[23] A. S. Nowak, Optimal strategies in a class of zero-sum ergodic stochastic games, Math. Methods Oper. Res. 50 (1999).
[24] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, New York.

[25] U. Rieder, Average optimality in Markov games with general state space, Proc. 3rd Conf. on Approx. Theory and Optim. (1995), Puebla, México.

[26] P. J. Schweitzer, Iterative solutions of functional equations of undiscounted Markov renewal programming, J. Math. Anal. Appl. (1971).

[27] O. Vega-Amaya, The average cost optimality equation: a fixed point approach, Reporte de Investigación No. 4 (2001), Departamento de Matemáticas, Universidad de Sonora, México. (Available in: tedi/reportes).

[28] O. Vega-Amaya and F. Luque-Vásquez, Sample-path average cost optimality for semi-Markov control processes on Borel spaces: unbounded costs and mean holding times, Appl. Math. (Warsaw) 27 (2000).
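As a numerical aside on the fixed-point machinery used in the proof of Theorem 4.7: the Banach contraction argument (iterate the operator, converge geometrically to the unique fixed point) can be illustrated on a toy problem. The sketch below is not the paper's model: the two-state game, the payoff matrices, the discounted (λ-contractive) Shapley-type operator, and the action-independent transition law are all invented for illustration. Making the transitions independent of the actions keeps the matrix-game value computable by a pure max–min (each payoff matrix is chosen to have a pure saddle point), which stands in for the general value operator.

```python
import numpy as np

# Toy zero-sum game: 2 states, 2 actions per player (all data invented).
# R[x] is the payoff matrix in state x; Q[x, y] is the probability of
# jumping from x to y, deliberately chosen independent of the actions.
R = [np.array([[3.0, 5.0], [2.0, 1.0]]),
     np.array([[0.0, 2.0], [-1.0, 4.0]])]
Q = np.array([[0.7, 0.3],
              [0.4, 0.6]])
lam = 0.9  # contraction modulus (a discount factor here)

def shapley_operator(u):
    """(Tu)(x) = val( R[x] + lam * sum_y Q[x, y] u(y) )."""
    cont = lam * Q @ u  # continuation value, one number per state
    # With action-independent transitions, val(R + c) = val(R) + c, and
    # these R's have pure saddle points, so max of row-minima suffices.
    return np.array([np.max(np.min(R[x], axis=1)) + cont[x] for x in range(2)])

# Banach fixed-point iteration: ||T^n u - h*|| <= lam**n * ||u - h*||.
u = np.zeros(2)
for _ in range(200):
    u = shapley_operator(u)

residual = np.max(np.abs(shapley_operator(u) - u))
print(u, residual)  # residual is numerically zero: u is the fixed point
```

Because the transition law is action-independent, the fixed point also solves the linear system $h = v + \lambda Q h$ with $v(x) = \operatorname{val}(R[x])$, which gives an independent check of the iteration. In the paper's setting the same convergence mechanism applies to $M$ and $N$, but the value operator and the contraction come from the weighted norm on $B_{W}(X)$ rather than from a discount factor.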