Existence of optimal strategies in Markov games with incomplete information


Int J Game Theory (2008) 37:581–596
ORIGINAL PAPER

Abraham Neyman

Accepted: 1 July 2008 / Published online: 1 August 2008
© Springer-Verlag 2008

A. Neyman
Institute of Mathematics and Center for the Study of Rationality, Hebrew University, Jerusalem, Israel
e-mail: aneyman@math.huji.ac.il

This research was supported in part by Israel Science Foundation grants 382/98, 263/03, and 1/06, and by the Zvi Hermann Shapira Research Fund.

Abstract  The existence of a value and optimal strategies is proved for the class of two-person repeated games where the state follows a Markov chain independently of players' actions and at the beginning of each stage only Player 1 is informed about the state. The results apply to the case of standard signaling where players' stage actions are observable, as well as to the model with general signals provided that Player 1 has a nonrevealing repeated game strategy. The proofs reduce the analysis of these repeated games to that of classical repeated games with incomplete information on one side.

Keywords  Repeated games · Repeated games with incomplete information · Markov chain games

1 Introduction

The class of two-person zero-sum repeated games where the state follows a Markov chain independently of players' actions, and at the beginning of each stage only Player 1 is informed about the state, and players' stage actions are observable, is termed in Renault (2006) Markov chain games with incomplete information on one side.

The play of a Markov chain game with incomplete information on one side proceeds as follows.

Nature chooses the initial state z_1 in the finite set of states M according to an initial probability q_0. At stage t Player 1 observes the current state z_t ∈ M and chooses an action i_t in the finite set of actions I, and (simultaneously) Player 2 (who does not observe the state z_t) chooses an action j_t in the finite set of actions J. Both players observe the action pair (i_t, j_t). The next state z_{t+1} depends stochastically on z_t only; i.e., it depends neither on t, nor on current or past actions, nor on past states. Thus the states follow a Markov chain with initial distribution q_0 and transition matrix Q on M. The payoff at stage t is a function g of the current state z_t and the actions i_t and j_t of the players.

Formally, the game Γ is defined by the 6-tuple ⟨M, Q, q_0, I, J, g⟩, where M is the finite set of states, Q is the transition matrix, q_0 is the initial probability of z_1 ∈ M, I and J are the state-independent action sets of Player 1 and Player 2, respectively, and g : M × I × J → R is the stage payoff function. The transition matrix Q and the initial probability q_0 define a stochastic process on sequences of states by P(z_1 = z) = q_0(z) and P(z_{t+1} = z' | z_1, ..., z_t) = Q_{z_t, z'}.

A pure, respectively behavioral, strategy σ of Player 1 in the game Γ defined by ⟨M, Q, q_0, I, J, g⟩ is a sequence of functions σ_t : (M × I × J)^{t−1} × M → I (σ_t : (z_1, i_1, j_1, ..., i_{t−1}, j_{t−1}, z_t) ↦ I), respectively → Δ(I) (where for a finite set D we denote by Δ(D) the set of all probability distributions on D). A pure, respectively behavioral, strategy τ of Player 2 is a sequence of functions τ_t : (I × J)^{t−1} → J, respectively → Δ(J). A pair σ, τ of pure (mixed, or behavioral) strategies (together with the initial distribution q_0) induces a stochastic process with values z_1, i_1, j_1, ..., z_t, i_t, j_t, ... in (M × I × J)^∞, and thus a stochastic stream of payoffs g_t := g(z_t, i_t, j_t).

A strategy σ (respectively, τ) of Player 1 (respectively, 2) guarantees v if E^{q_0}_{σ,τ} (1/n) Σ_{t=1}^{n} g_t ≥ v (respectively, E^{q_0}_{σ,τ} (1/n) Σ_{t=1}^{n} g_t ≤ v) for all sufficiently large n and every strategy τ (respectively, σ) of Player 2 (respectively, 1). We say that Player 1 (respectively, 2) can guarantee v in Γ(q_0) if for every ε > 0 there is a strategy of Player 1 (respectively, 2) that guarantees v − ε (respectively, v + ε). The game has a value v if each player can guarantee v. A strategy of Player 1 (respectively, 2) that guarantees v − ε (respectively, v + ε) is called an ε-optimal strategy, and a strategy that is ε-optimal for every ε > 0 is called an optimal strategy.

Renault (2006) proved that the Markov chain game Γ has a value v and Player 2 has an optimal strategy. The present paper (1) shows that Renault's result follows from the classical results on repeated games with incomplete information (Aumann and Maschler 1995); and (2) proves the existence of an optimal strategy for Player 1. Thus,

Theorem 1  The Markov chain game Γ has a value and both players have optimal strategies.

In addition, these results are extended in the present paper to the model with signals. Section 2 presents a proof of Renault's results (Renault 2006) that the Markov chain game Γ has a value and that Player 2 has an optimal strategy, and sketches the proof of the existence of an optimal strategy of Player 1. Section 3 introduces a class of auxiliary repeated games with incomplete information that serves in the proof of Theorem 1 as well as in approximating the value of Γ. Section 4 couples the Markov chain with stochastic processes that enable us to reduce the analysis of a Markov chain game to that of a classical repeated game with incomplete information on one side. Section 5 contains the proof of Theorem 1. Section 6 extends the model and the results to Markov games with incomplete information on one side and signals, where players' actions are unobservable and each player only observes a signal that depends stochastically on the current state and actions. The proof for the model with signals requires only minor modification. For simplicity of notation and exposition, albeit at the cost of some repetition, we introduce the games with signals only after completing the proof of Theorem 1.
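To make the dynamics just described concrete, here is a minimal Python sketch (an illustration of my own, not code from the paper) that simulates one play of a toy instance of Γ = ⟨M, Q, q_0, I, J, g⟩ under a pair of placeholder strategies; the numerical data and the strategies are made-up assumptions. The point is only that the state evolves independently of the chosen actions and is observed by Player 1 alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy instance of Gamma = <M, Q, q0, I, J, g>; all numbers are illustrative.
M = 3                                  # states 0, 1, 2
Q = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])        # transition matrix (rows sum to 1)
q0 = np.array([0.4, 0.4, 0.2])         # initial distribution of z_1
nI, nJ = 2, 2                          # sizes of the action sets I and J
g = rng.random((M, nI, nJ))            # stage payoff g(z, i, j)

def simulate(sigma, tau, n):
    """One n-stage play: sigma observes the state, tau observes only actions."""
    z = rng.choice(M, p=q0)
    hist1, hist2, total = [], [], 0.0
    for _ in range(n):
        i = sigma(hist1, z)            # Player 1 knows z_t and the past play
        j = tau(hist2)                 # Player 2 knows only past action pairs
        total += g[z, i, j]
        hist1.append((z, i, j))
        hist2.append((i, j))
        z = rng.choice(M, p=Q[z])      # z_{t+1} depends on z_t only, not on (i, j)
    return total / n

# Placeholder strategies, for illustration only.
sigma = lambda hist, z: 0 if z == 0 else 1
tau = lambda hist: int(rng.integers(nJ))
print(simulate(sigma, tau, n=10_000))
```

A single long run of simulate gives a Monte Carlo estimate of the expected average payoff E^{q_0}_{σ,τ} (1/n) Σ_{t=1}^{n} g_t for the chosen pair of strategies.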

2 Informal proofs

The proofs are based on the observation that if (z_t)_t is a Markov chain then for properly chosen sequences n_i < n̄_i < n_{i+1}, the Markov chain has with probability close to 1 entered by stage n_1 a communicating class C, and, conditional on the entered communicating class C, the processes z[i] = z_{n_i+1}, ..., z_{n̄_i}, i ≥ 1, are almost independent, and the distributions of the initial states in the i-th block of stages, z_{n_i+1}, i ≥ 1, are almost identical. Therefore, a slight alteration of the process (z_t) leads to a process (z̃_t)_t such that z̃_{n_1+1} is in one of the communicating classes C, and, conditional on z̃_{n_1+1} ∈ C, the blocks z̃[1], z̃[2], ... are independent and the states z̃_{n_i+1} identically distributed. The alteration is such that Player 1 can compute the state z̃_{n_i+1} as a function of z_1, ..., z_{n_i+1} and a private lottery X, and therefore can play in the Markov chain game as if the process of states follows the altered process (z̃_t)_t. Note that the altered process is not a Markov chain.

If the states z_t follow a general (not necessarily a Markov chain) stochastic process (z_t)_t with z_t ∈ M, we can define the game Γ((z_t)_t) as follows. Nature chooses an infinite sequence (z_1, z_2, ...) according to the law of the process. At the beginning of each stage only Player 1 is informed about the state, and players' stage actions are observable.

The proofs assign to the Markov chain process (z_t)_t (and ε > 0) a stochastic process z̃[1], z̃[2], ..., where each z̃[i] is a finite sequence of states, such that Γ(z̃[1], z̃[2], ...) has a value and both players have optimal strategies, and there are natural maps σ ↦ σ* and τ ↦ τ* from strategies in the game Γ(z̃[1], z̃[2], ...) to strategies in the Markov chain game Γ((z_t)_t) so that if the strategy σ of Player 1 (respectively, τ of Player 2) guarantees v in Γ(z̃[1], z̃[2], ...), then σ* (respectively, τ*) guarantees v − ε (respectively, v + ε) in Γ((z_t)_t).

For the proofs that Markov games have a value and that Player 2 has an optimal strategy, the finite sequences z̃[i] will all have the same length, and the game Γ(z̃[1], z̃[2], ...) will be a classical repeated game with incomplete information on one side. For the proof of the existence of an optimal strategy of (the informed) Player 1, the lengths of the finite sequences z̃[i] will converge to infinity, and the proof that Player 1 has an optimal strategy in Γ(z̃[1], z̃[2], ...) and in Γ((z_t)_t) will rely also on the structure of approximate optimal strategies of the informed player in repeated games with incomplete information on one side.

The finite sequence of states z̃[i] will be a minor (stochastic) alteration of the sequence z[i] = z_{n_i+1}, ..., z_{n̄_i} of states of the Markov chain in stages n_i < t ≤ n̄_i, where n_i < n̄_i < n_{i+1}.

For notational convenience we define the process (z̃_t)_t and set z̃[i] = z̃_{n_i+1}, ..., z̃_{n̄_i}. The stochastic process (z̃_t)_t is a function of the process (z_t)_t and an independent lottery X, with z̃_t being a function of X and z_1, ..., z_t. The random variable X can be viewed as a private lottery of Player 1 in the game Γ((z_t)_t). Therefore a (pure) strategy of Player 1 in the game Γ((z̃_t)_t) defines a (mixed) strategy of Player 1 in the Markov chain game Γ((z_t)_t). The sets of strategies of (the uninformed) Player 2 in both games are identical. In addition, the construction of (z̃_t)_t will be such that for most stages t the probability that z̃_t = z_t is close to one. Therefore a strategy of Player 1 (respectively, 2) that guarantees v in the game Γ((z̃_t)_t) guarantees v − ε (respectively, v + ε) in the Markov chain game Γ((z_t)_t). The natural lifting of a strategy in the game Γ(z̃[1], z̃[2], ...) to a strategy in Γ((z̃_t)_t) (and thus to a strategy in Γ((z_t)_t)) is obtained by considering stages t ≤ n_1 and n̄_i < t ≤ n_{i+1} redundant and playing nonrevealingly in these redundant stages.

Now we turn to the details of the construction. First, we recall basic terminology and facts regarding (stationary/homogeneous) Markov chains with a finite state space M and transition matrix Q. For a positive integer n, an M × M matrix Q, and z, z' ∈ M, we denote by Q^n_{z,z'} (or Q^n(z, z')) the (z, z')-th entry of the matrix Q^n; if Q is a transition matrix then Q^n_{z,z'} is the probability that we move from z to z' in n steps when the single-step transition probabilities are defined by Q.

Fix a finite transition matrix Q. A state z ∈ M is recurrent if Σ_n Q^n_{z,z} = ∞, equivalently, if the Markov chain that starts at z returns to z with probability 1. A state z communicates with a state z' if there are positive integers n and m such that Q^n_{z,z'} > 0 and Q^m_{z',z} > 0. A set of states C is a communicating class (or ergodic set) if every state in C communicates with any other state of C and no state of C communicates with a state outside C. Every state in a communicating class is recurrent. The period of a state z is the greatest common divisor of all n such that Q^n_{z,z} > 0. A state is aperiodic if its period is 1. Obviously, if Q_{z,z} > 0 then z is aperiodic. All states in the same communicating class have the same period.

A probability distribution k ∈ Δ(M) is Q-invariant if for every z ∈ M we have k(z) = Σ_{z'∈M} k(z') Q_{z',z}. Every communicating class C has a unique invariant distribution k_C ∈ Δ(C) that is Q-invariant, and if z ∈ C is aperiodic, then for every z, z' ∈ C the limit of Q^n_{z,z'} exists and equals k_C(z'). The set {k_C : C a communicating class} obviously depends on the transition matrix Q and is denoted K(Q). Equivalently, K(Q) is the (nonempty finite) set of the extreme points of the (polytope of) Q-invariant probability distributions.

Next, we state and prove a simple lemma regarding Markov chains.

Lemma 1  Let M be a finite state space. There is a positive integer m such that for every M × M transition matrix Q, (1) all recurrent states of the transition matrix Q^m are aperiodic, and, moreover, (2) for every two states z, z' in the support of a distribution k ∈ K(Q^m) we have Q^m_{z,z'} > 0.

Proof  Let R be the set of all recurrent states of the Markov chain with state space M and transition matrix Q. For every recurrent z ∈ R there is 0 < n(z) ≤ |M| such that Q^{n(z)}_{z,z} > 0.

Therefore, there is n > 0 (e.g., n = |M|! or the least common multiple of {n(z) : z ∈ R}) such that Q^n_{z,z} > 0 for every z ∈ R. Let C be a communicating class of the transition matrix Q^n. Note that if z ∈ C and Q^{mn}_{z,z'} > 0 for some 0 < m ≤ |M|, then Q^{m'n}_{z,z'} > 0 for every m' ≥ m, and there is m ≤ |M| such that Q^{mn}_{z,z'} > 0 (otherwise z' is not a recurrent state of the transition matrix Q^n). Therefore, there is m (e.g., m = |M|) such that Q^{mn}_{z,z'} > 0 for some z ∈ C implies that (z' ∈ C and) Q^{mn}_{z',z'} > 0. In particular, all recurrent states of the transition matrix Q^{mn} are aperiodic (with respect to the transition matrix Q^{mn}).

Let K = K(Q^m) where m is given by Lemma 1, and let p(k) be the limit (as l → ∞) of the probability that z_{lm+1} is in the support S(k) of the invariant probability k. (The limit exists because {z_{lm+1} ∈ S(k)} ⊂ {z_{(l+1)m+1} ∈ S(k)} and therefore P(z_{lm+1} ∈ S(k)) is monotonic nondecreasing.) For sequences (n_i) and (n̄_i) with n_i < n̄_i ≤ n_{i+1}, set z[i] = z_{n_i+1}, ..., z_{n̄_i}.

Fix ε > 0 and a uniform [0, 1]-valued random variable X (equivalently, a sequence X_1, X_2, ... of independent uniform [0, 1]-valued random variables) that is independent of the process (z_t)_t. If n_i and n̄_i are multiples of m, and n_1 and min_i(n_{i+1} − n̄_i) are sufficiently large, we can define a new process (z̃_t) such that (1) z̃_t is a function of z_1, ..., z_t and X, (2) on z̃_{n_1+1} ∈ S(k) (k ∈ K), (z̃[i] := z̃_{n_i+1}, ..., z̃_{n_i+lm}) is a sequence of independent Markov chains with initial probability k and transition matrix Q, and (3) the probability that z̃[i] = z[i] (where z[i] := z_{n_i+1}, ..., z_{n_i+lm}) is ≥ 1 − 2ε.

For example, for k ∈ K and z ∈ S(k), we denote by A^i_{kz} the event z_{n_i+1} = z and set A^i_k = ∪_{z∈S(k)} A^i_{kz}. Let Ā^1_{kz} denote the intersection of the events A^1_{kz} and X_1 ≤ (1 − ε)p(k)k(z)/P(z_{n_1+1} = z), and Ā^1_k = ∪_{z∈S(k)} Ā^1_{kz}. For i > 1 we denote by Ā^i_{kz} the intersection of the events z_{n_i+1} = z, Ā^1_k, and X_i ≤ (1 − ε)p(k)k(z)/Q^{n_i−n_{i−1}}_{z',z}, where z' = z_{n_{i−1}+1}. Note that for sufficiently large n_1 we have P(Ā^1_{kz}) = (1 − ε)p(k)k(z), and for sufficiently large min_i(n_i − n̄_{i−1}) we have P(Ā^i_{kz} | Ā^1_k) = (1 − ε)k(z). On Ā^i_{kz} we set z̃[i] = z[i]. It is now easy to complete the definition of the sequence z̃[1], z̃[2], ... (on those parts of the probability space where it is not defined by the above rules) so that the process (z̃_t) obeys (1)–(3). For example, if (ẑ_t)_t is a process that is independent of X and (z_t)_t, and where P(ẑ_{n_1+1} = z) = p(k)k(z) and, on ẑ_{n_1+1} ∈ S(k), (ẑ[i]) is a sequence of independent Markov chains with initial probability k and transition matrix Q, set z̃[i] = ẑ[i] on the complement of Ā^i := ∪_{k∈K, z∈S(k)} Ā^i_{kz}.

In order to prove that Γ has a value and that Player 2 has an optimal strategy we set n_i = i(lm + l̄m) and n̄_i = n_i + lm, where l ≥ l̄ are sufficiently large. Then, on z̃_{n_1+1} ∈ S(k), the blocks z̃[1], z̃[2], ... are in addition identically distributed. Therefore, the game Γ(z̃[1], z̃[2], ...) (defined formally in Sect. 3 and denoted Γ(p, lm)) that follows the states z̃[1], z̃[2], ..., and where at the beginning of each stage only Player 1 is informed about the state, is a classical repeated game with incomplete information (where each stage is an lm-stage game in extensive form), and thus has a value v(p, lm). Each player can follow his optimal strategy in this auxiliary repeated game in stages n_i < t ≤ n_i + lm of the Markov game (where Player 1 computes the state z̃_t as a function of the process and the private signal/lottery X and plays nonrevealingly in the other stages) to guarantee in the Markov game a payoff within O(ε + l̄/l) of v(p, lm). Therefore, the limit lim_{l→∞} v(p, lm) exists and equals the value of the Markov game.
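The objects K(Q) and p(k) introduced above can be computed explicitly for a concrete chain. The following sketch (my own illustration under simplifying assumptions, not code from the paper) finds the closed communicating classes, their invariant distributions k_C, and the limiting probability p(k) of being in each support S(k); for simplicity it works directly with an aperiodic toy chain rather than with the power Q^m supplied by Lemma 1, and the matrix is made up.

```python
import numpy as np

def recurrent_classes(Q, tol=1e-12):
    """Closed communicating classes of Q, via the reachability relation."""
    n = len(Q)
    reach = np.linalg.matrix_power(np.eye(n) + Q, n) > tol   # z can reach z'
    classes = []
    for z in range(n):
        C = {w for w in range(n) if reach[z, w] and reach[w, z]}
        closed = all(w in C for v in C for w in range(n) if reach[v, w])
        if closed and C not in classes:
            classes.append(C)
    return [sorted(C) for C in classes]

def invariant_distribution(Q, C):
    """The unique Q-invariant distribution k_C supported on the closed class C."""
    QC = Q[np.ix_(C, C)]
    A = np.vstack([QC.T - np.eye(len(C)), np.ones(len(C))])
    b = np.append(np.zeros(len(C)), 1.0)
    kC, *_ = np.linalg.lstsq(A, b, rcond=None)
    k = np.zeros(len(Q))
    k[C] = kC
    return k

# Toy chain: {0, 1} and {2} are the recurrent (communicating) classes; 3 is transient.
Q = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.2, 0.8, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.3, 0.1, 0.3, 0.3]])
q0 = np.array([0.0, 0.0, 0.0, 1.0])
q_far = q0 @ np.linalg.matrix_power(Q, 200)        # distribution of z_t for large t
for C in recurrent_classes(Q):
    k = invariant_distribution(Q, C)
    print("S(k) =", C, " k =", k.round(3), " p(k) ~", round(q_far[C].sum(), 3))
```

For a periodic chain one would first pass to Q^m with m as in Lemma 1, exactly as the construction above does.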

Note that Player 2 can start following his auxiliary repeated game strategy at any stage n_i + 1. Therefore Player 2 can paste his ε-optimal strategies in the games Γ(p, lm) into an optimal strategy in the Markov game. The sketched proof above provides an alternative proof of the results of Renault (2006) that a Markov game with standard signaling has a value and that Player 2 has an optimal strategy.

Patching ε-optimal strategies of Player 1 into an optimal strategy is more involved, and relies on a more detailed description and properties of approximate optimal strategies in Γ(p, lm). The additional needed care stems from the irreversibility of the revelation of information about the process, and the fact that the information about the aperiodic class that is revealed to Player 2 when Player 1 plays an optimal strategy in Γ(p, lm) depends on l. Therefore, the constructed optimal strategy of Player 1 has the following characteristics. First, the starting time n_T of Player 1 using/revealing his information about the Markov chain is a random time, which Player 1 can compute as a function of the past states of the chain and the auxiliary private lottery X, and with the property that for every z in the support of k ∈ K we have P(z_{n_T+1} = z) = k(z)p(k). Second, the length of the i-th stage of the auxiliary game is l_i^2 m, and whatever information is eventually revealed by the optimal strategy of Player 1 can be communicated to Player 2 before Player 1 makes his mixed action choice at stage n_i + 1.

Let us recall a few basic facts about repeated games with incomplete information. Let G^q_l be the l-stage repeated game where nature chooses a Markov chain z_1, ..., z_l with transition matrix Q and initial probability P(z_1 = z) = q(k)k(z) for z in the support of k ∈ K. Equivalently, nature chooses k ∈ K with probability q(k) and then a Markov chain with transition matrix Q and initial distribution k. The state z_t is revealed to Player 1 just before the play at stage t. At stage t the players choose an action i_t ∈ I and an action j_t ∈ J, and following the play at stage t Player 2 observes a stochastic signal s^2_t whose conditional distribution given the past is a function of the triple (z_t, i_t, j_t). (In the case of standard signaling s^2_t = (i_t, j_t).) The payoff to Player 1 of the play z_1, i_1, j_1, ..., z_l, i_l, j_l is the average of the stage payoffs g(z_t, i_t, j_t). Let u_l(q) denote the maxmin of G^q_l where Player 1 maximizes over his nonseparating strategies. It is known (Aumann and Maschler 1995) that the value of the game Γ(p, l) is the maximum over all convex combinations Σ_{i=0}^{|K|} α(i) u_l(p(i)) where p = Σ_{i=0}^{|K|} α(i) p(i).
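The Aumann–Maschler characterization just quoted expresses v(p, l) as the best splitting of the prior p into posteriors p(i) weighted by α(i). As an illustration only (not the paper's method, and assuming u_l is known at a finite set of candidate posteriors, e.g. on a grid), the optimal splitting can be found by a linear program; the sketch below uses scipy and a made-up one-dimensional u.

```python
import numpy as np
from scipy.optimize import linprog

def best_splitting(p, posteriors, u_values):
    """max  sum_i a_i u(p_i)   s.t.  sum_i a_i p_i = p,  sum_i a_i = 1,  a >= 0."""
    n = len(posteriors)
    A_eq = np.vstack([posteriors.T, np.ones(n)])    # splitting constraint + normalization
    b_eq = np.append(p, 1.0)
    res = linprog(c=-np.asarray(u_values), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n)
    return -res.fun, res.x                          # optimal value and weights alpha(i)

# Toy illustration with |K| = 2: u is evaluated on a grid of posteriors; the LP
# returns the concavification of u at the prior p = (0.5, 0.5).
grid = np.array([[q, 1.0 - q] for q in np.linspace(0.0, 1.0, 11)])
u = (grid[:, 0] - 0.5) ** 2      # a convex toy "nonrevealing value" function
print(best_splitting(np.array([0.5, 0.5]), grid, u))
```

The exact formula optimizes over all finite splittings in Δ(K); restricting to a fixed grid of posteriors, as above, is only a discretized approximation.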

We select a sequence l_j so that (1) the values v(p, l_j²m) of Γ(p, l_j²m) converge to v̄(p), and (2) the values v(p, l_j²m) are approximately a convex combination of the form Σ_{i=0}^{|K|} α(i) u_{l_j²m}(p(i)) with p = Σ_{i=0}^{|K|} α(i) p(i). Note that α(i) and p(i) are independent of j, and hence the need for an approximation. The fact that α(i) and p(i) are independent of j enables us to patch together approximate optimal strategies of the games Γ(p, l_j²m) and obtain an optimal strategy of Player 1 in the Markov game.

Define n̄_0 = 0 and for i ≥ 1 set n̄_i = n̄_{i−1} + l_i²m + l_i m and n_i = n̄_i − l_i²m. The optimal strategy of Player 1 will use the information of the states only in stages n_i < t ≤ n̄_i. Set z[i] = z_{n_i+1}, ..., z_{n̄_i}. We couple the process (z_t)_t with a process (z̃_t)_t so that for some positive-integer-valued function T, where the event T = i is a function of z_1, ..., z_{n_i+1} and a coupling-enabling [0, 1]-valued random variable X (which is independent of the process (z_t)), we have (1) z_{n_T+1} is a recurrent state of the Markov chain with transition Q^m, (2) conditional on z_{n_T+1} ∈ S(k), the process z̃[T + i] = z̃_{n_{T+i}+1}, ..., z̃_{n̄_{T+i}} is a Markov chain with initial probability k and transition Q, (3) conditional on z_{n_T+1} ∈ S(k), the processes z̃[T], ..., z̃[T + i], ... are independent, and (4) the probability that z̃[T + i] = z[T + i] converges to 1 as i → ∞. The properties of the auxiliary coupled process (z̃_t)_t enable us to patch the approximate optimal strategies of Player 1 in the games G^p_{l_j²m} into an optimal strategy in the Markov game.

3 The auxiliary repeated games Γ(p, l)

The analysis of the game Γ(q_0) is by means of auxiliary repeated games with incomplete information on one side, with a finite state space K, initial probability p, and stage game G^k. The stage game G^{k,l}, or G^k for short, is a game in extensive form. More explicitly, it is an l-stage game with incomplete information on one side. Nature chooses r = (z_1 = z, ..., z_l) ∈ M^l, where z ∈ M is chosen according to the probability k, and z_1 = z, ..., z_l follow the law of the Markov chain with transition matrix Q; before Player 1 takes his action at stage t ≤ l he is informed of z_t, but Player 2 is not informed of z_t. Stage actions are observable.¹

Note that G^k is a finite game with finite strategy sets A for Player 1 and B for Player 2. An element a ∈ A, respectively b ∈ B, is a sequence of functions a_t, respectively b_t, 1 ≤ t ≤ l, where a_t : (z_1, i_1, j_1, ..., i_{t−1}, j_{t−1}, z_t) ↦ I, respectively b_t : (i_1, j_1, ..., i_{t−1}, j_{t−1}) ↦ J. The triple (r, a, b) defines a play (z_1, i_1, j_1, ..., z_l, i_l, j_l). Therefore, the triple (k, a, b) defines a probability distribution on the plays (z_1, i_1, j_1, ..., z_l, i_l, j_l) where P(z_1 = z) = k(z), P(z_{t+1} = z' | z_1, ..., z_t, i_t, j_t) = Q_{z_t, z'}, i_t = a_t(z_1, i_1, j_1, ..., i_{t−1}, j_{t−1}, z_t), and j_t = b_t(i_1, j_1, ..., i_{t−1}, j_{t−1}). The payoff of the game G^k equals G^k(a, b) = E^k_{a,b} (1/l) Σ_{t=1}^{l} g(z_t, i_t, j_t), where the expectation is with respect to the probability defined by (k, a, b).

3.1 The game Γ(p, l)

Nature chooses k ∈ K with probability p(k). Player 1 is informed of k; Player 2 is not. The play proceeds in stages. In stage n, nature chooses r = (z_1, ..., z_l) ∈ M^l with probability k(z_1) Π_{1≤t<l} Q_{z_t, z_{t+1}}, Player 1 chooses a ∈ A, and Player 2 chooses b ∈ B. The payoff to Player 1 is G^k(a, b). The signal s^2 to Player 2 is the function s^2 that assigns to the triple (r, a, b) the sequence of realized stage actions i_1, j_1, ..., i_l, j_l. The signal s^1 to Player 1 is the function s^1 that assigns to the triple (r, a, b) the play (z_1, i_1, j_1, ..., z_l, i_l, j_l).

¹ The case of imperfect monitoring where each player observes a signal that depends stochastically on the current state and actions is covered in Sect. 6.
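Since G^k is a finite game, each stage game of Γ(p, l) has a value that could in principle be computed by the standard linear program for zero-sum matrix games once the extensive form is rolled out into a payoff matrix indexed by the pure-strategy sets A and B. The following sketch shows that standard LP on a tiny matrix; it is a generic illustration, not a computation from the paper, and the rollout of G^k itself (whose matrix grows exponentially in l) is not attempted.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(G):
    """Value and an optimal mixed strategy of the row player (maximizer) of G.

    Solves: max v  s.t.  sum_a x_a * G[a, b] >= v for all b,  x in Delta(A).
    """
    nA, nB = G.shape
    c = np.append(np.zeros(nA), -1.0)            # variables (x_1,...,x_nA, v); min -v
    A_ub = np.hstack([-G.T, np.ones((nB, 1))])   # v - sum_a x_a G[a, b] <= 0
    b_ub = np.zeros(nB)
    A_eq = np.append(np.ones(nA), 0.0).reshape(1, -1)
    b_eq = [1.0]
    bounds = [(0, None)] * nA + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return -res.fun, res.x[:nA]

# Matching pennies as a sanity check: value 0, optimal strategy (1/2, 1/2).
print(matrix_game_value(np.array([[1.0, -1.0], [-1.0, 1.0]])))
```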

The value of Γ(p, l) exists by Aumann and Maschler (1995, Theorem C, p. 191), and is denoted by v(p, l). Set v̄(p) := lim sup_{l→∞} v(p, lm) and v_(p) := lim inf_{l→∞} v(p, lm). Obviously v_(p) ≤ v̄(p). We will show in Lemma 4 (Sect. 5) that (in the Markov chain game Γ) Player 1 can guarantee v̄(p) and Player 2 can guarantee v_(p). Thus v̄(p) = v_(p) is the value of Γ (Corollary 2). Lemma 5, respectively Lemma 6, demonstrates the existence of an optimal strategy of Player 2, respectively Player 1.

4 Auxiliary coupled processes

Let m, K = K(Q^m), and p ∈ Δ(K) be as defined in Sect. 2. Recall that the support of a probability distribution k ∈ Δ(M) is denoted S(k). An admissible pair of sequences is a pair of increasing sequences, (n_i)_{i≥1} and (n̄_i)_{i≥1}, with n_i < n̄_i < n_{i+1} and such that n_i and n̄_i are multiples of m. For a given admissible pair of sequences and a stochastic process (x_t) we use the notation x[i] = (x_{n_i+1}, ..., x_{n̄_i}).

4.1 A coupling result

Let (n_i)_{i≥1} and (n̄_i)_{i≥1} be an admissible pair of sequences with (n_i − n̄_{i−1})_{i>1} nondecreasing and with n_1 sufficiently large so that for every k ∈ K and z ∈ S(k) we have P(z_{n_1+1} = z) ≥ p(k)k(z)/2 (and thus P(z_{n_1+1} ∈ S(k)) ≥ p(k)/2). Let X, X_1, Y_1, X_2, Y_2, ... be a sequence of i.i.d. random variables that are uniformly distributed on [0, 1] and such that the process (z_t)_t (which follows the Markov chain with initial distribution q_0 and transition matrix Q) and the random variables (X, X_1, Y_1, ...) are independent. Let F_i denote the σ-algebra of events generated by X_1, ..., X_i and z_1, ..., z_{n_i+1}.

For k ∈ K and z ∈ S(k) the event z_{n_i+1} = z is denoted A^i_{kz}. Let A^i_k be the event that z_{n_i+1} ∈ S(k), i.e., A^i_k = ∪_{z∈S(k)} A^i_{kz}, and A^i = ∪_{k∈K} A^i_k. As P(A^i_{kz}) → p(k)k(z) as i → ∞ and P(A^1_{kz}) > p(k)k(z)/2 by assumption, there exists a strictly decreasing sequence ε_i ↓ 0 such that P(A^i_{kz}) ≥ (1 − ε_i)p(k)k(z) for every k ∈ K and 2ε_1 < 1. Moreover, as each k ∈ K is invariant under Q^m, we can choose such a sequence for any ε_1 > 1 − inf_{k∈K, z∈S(k)} P(A^1_{kz})/(p(k)k(z)), and thus we can assume that ε_1 = ε_1(n_1) → 0 as n_1 → ∞.

A positive-integer-valued random variable T such that for every i ≥ 1 the event {T = i} is F_i-measurable is called an (F_i)_i-adapted stopping time. We will define an (F_i)_i-adapted stopping time T with T ≥ 1 such that conditional on {T = i} the process z[T + j] (j ≥ 0) is with probability p(k) a Markov chain with initial probability k and transition Q. Because the distribution k is invariant under Q^m and n_{i+1} − n_i is a multiple of m, it suffices to guarantee that for every k ∈ K and every z ∈ S(k) the probability that z_{n_i+1} = z, conditional on T = i, equals p(k)k(z). In addition, our construction is such that T is finite with probability 1 (equivalently, P(T ≤ i) → 1 as i → ∞). In fact, by requiring in addition that P(T ≤ i) = 1 − 2ε_i, the stopping time T is defined as follows.

Define the (F_i)_i-adapted stopping time T with T ≥ 1 by defining the event {T = i} recursively:²

T = 1 on the event that z_{n_1+1} = z ∈ S(k) and X_1 ≤ (1 − 2ε_1)p(k)k(z)/P(A^1_{kz});

T = i if T > i − 1, z_{n_i+1} = z ∈ S(k), and X_i ≤ (2ε_{i−1} − 2ε_i)p(k)k(z) / (P(A^i_{kz}) − (1 − 2ε_{i−1})p(k)k(z)).

Lemma 2  (i) For every k ∈ K and z ∈ S(k), Pr(z_{n_T+1} = z | T ≤ i) = p(k)k(z) (and thus Pr(z_{n_T+1} ∈ S(k) | T ≤ i) = p(k));
(ii) conditional on z_{n_T+1} ∈ S(k), for every fixed i ≥ 0 the process z[T + i] is a Markov chain with initial probability k and transition Q;
(iii) Pr(T ≤ i) = 1 − 2ε_i.

Proof  For k ∈ K and z ∈ S(k) let B^i_{kz} denote the event that T ≤ i and z_{n_i+1} = z ∈ S(k), and B^i_k := ∪_{z∈S(k)} B^i_{kz}. It follows that P(B^1_{kz}) = P(A^1_{kz})·(1 − 2ε_1)p(k)k(z)/P(A^1_{kz}) = (1 − 2ε_1)p(k)k(z), and thus P(B^1_k) = Σ_{z∈S(k)} (1 − 2ε_1)p(k)k(z) = (1 − 2ε_1)p(k) and P(T = 1) = Σ_{k∈K} (1 − 2ε_1)p(k) = 1 − 2ε_1. By induction on i it follows that P(B^i_{kz}) = (1 − 2ε_i)p(k)k(z) and P(T ≤ i) = 1 − 2ε_i; indeed, as the distribution k is invariant under Q^m we have P(A^i_{kz} ∩ B^{i−1}_k) = P(B^{i−1}_k)k(z) = (1 − 2ε_{i−1})p(k)k(z), and thus for i > 1 we have P(B^i_{kz}) = P(B^{i−1}_k)k(z) + P(A^i_{kz} \ B^{i−1}_k)·(2ε_{i−1} − 2ε_i)p(k)k(z)/(P(A^i_{kz}) − (1 − 2ε_{i−1})p(k)k(z)). As P(A^i_{kz} \ B^{i−1}_k) = P(A^i_{kz}) − (1 − 2ε_{i−1})p(k)k(z), we deduce that P(B^i_{kz}) = (1 − 2ε_i)p(k)k(z). In particular, P(z_{n_i+1} = z ∈ S(k) | T = i) = p(k)k(z). Set B^i = ∪_{k∈K} B^i_k and note that P(B^i) = 1 − 2ε_i. This completes the proof of (i) and (iii). Obviously, z[T + i] is a Markov chain with transition Q. As k is invariant under Q^m we deduce that for every i ≥ 0 we have Pr(z_{n_{T+i}+1} = z ∈ S(k) | z_{n_T+1} ∈ S(k)) = k(z), which proves (ii).

² Note that the event {T ≥ i} is the complement of the event {T < i}.
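The recursive acceptance rule defining T can be written out mechanically. The sketch below (my own illustration) applies the rule to one observed trajectory of block-start states; the probabilities P(A^i_{kz}), the sequence ε_i, and the data structures holding p(k) and k(z) are assumed to be supplied, and all argument names are hypothetical.

```python
def stopping_time(block_starts, X, p, supports, k_dist, P_A, eps):
    """Return T for one trajectory, following the recursive rule above.

    block_starts[i] : the observed state z_{n_{i+1}+1}   (0-based block index i)
    X[i]            : the private uniform lottery X_{i+1}
    p[k], k_dist[k][z] : p(k) and k(z); supports[k] is the support S(k)
    P_A[i][(k, z)]  : P(A^{i+1}_{kz});  eps[i] : epsilon_{i+1}
    All of these are assumed to be precomputed; indices are hypothetical.
    """
    for i, (z, x) in enumerate(zip(block_starts, X)):
        k = next((k for k in supports if z in supports[k]), None)
        if k is None:
            continue                       # z is not in the support of any k in K
        target = p[k] * k_dist[k][z]
        if i == 0:
            threshold = (1 - 2 * eps[0]) * target / P_A[0][(k, z)]
        else:
            free_mass = P_A[i][(k, z)] - (1 - 2 * eps[i - 1]) * target
            threshold = (2 * eps[i - 1] - 2 * eps[i]) * target / free_mass
        if x <= threshold:
            return i + 1                   # T = i + 1 (1-based, as in the text)
    return None                            # T exceeds the number of observed blocks
```

Lemma 2 above is exactly the statement that, with these thresholds, the entry state z_{n_T+1} has distribution p(k)k(z) conditional on T ≤ i; the sketch only mirrors the thresholds and does not verify this.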

The next lemma couples the process (z_t)_t with a process (z*_t)_t where the states z*_t are elements of M* = M ∪ {*} with * ∉ M. Given i ≥ 1 we denote by *[i] the sequence of *'s of length n̄_i − n_i. Let 0 < δ < 1 be such that for every k ∈ K and y, z ∈ S(k) we have Q^m(y, z) ≥ (1 − δ)k(z). As k is Q^m-invariant, it follows by induction on j ≥ 1 that Q^{jm}(y, z) ≥ (1 − δ^j)k(z). Indeed, Q^{jm}(y, z) = Σ_{z'} Q^{(j−1)m}(y, z') Q^m(z', z) = Σ_{z'} ((1 − δ^{j−1})k(z') + Q^{(j−1)m}(y, z') − (1 − δ^{j−1})k(z')) Q^m(z', z) ≥ (1 − δ^{j−1})k(z) + δ^{j−1}(1 − δ)k(z) = (1 − δ^j)k(z). Let l_i = (n_i − n_{i−1})/m, and let B_i be the event Y_i ≤ (1 − δ^{l_i})k(z)/Q^{l_i m}(y, z), where y = z_{n_{i−1}+1} ∈ S(k) and z = z_{n_i+1} ∈ S(k).

Lemma 3  There exists a stochastic process (z*_t)_t with values z*_t ∈ M* such that for n_i < t ≤ n̄_i the (auxiliary) state z*_t is a (deterministic) function of z_1, ..., z_t and X_1, Y_1, ..., X_i, Y_i, and such that
(i) for n_i < t ≤ n̄_i and t ≤ n_T, z*_t = *;
(ii) everywhere, either z*[i] = z[i] or z*[i] = *[i];
(iii) z*[T] = z[T], and thus Pr(z*_{n_T+1} = z | T ≤ i) = p(k)k(z);
(iv) Pr(z*[T + i] = z[T + i] | T) = 1 − δ^{l_{T+i}} ≥ 1 − δ^{l_i};
(v) for i ≥ 1, conditional on T, z*[T], ..., z*[T + i − 1], the process z*[T + i] on B_{T+i} (and thus with conditional probability 1 − δ^{l_{T+i}}) is a Markov chain with initial probability k and transition Q, and on the complement of B_{T+i} (and thus with conditional probability δ^{l_{T+i}}) it is *[T + i].

Proof  For n_i < t ≤ n̄_i and t ≤ n_T, set z*_t = *; in particular, z*[i] = *[i] for i < T. Define z*[T] = z[T], and thus (iii) holds, and for i > T set z*[i] = z[i] on B_i and z*[i] = *[i] on the complement B^c_i of B_i. It follows that everywhere either z*[i] = z[i] or z*[i] = *[i], and thus (i) and (ii) hold. For i ≥ 1 the conditional probability that z*_{n_{T+i}+1} = z, given T and z_{n_{T+i−1}+1} = y ∈ S(k), equals Q^{l_j m}(y, z)·(1 − δ^{l_j})k(z)/Q^{l_j m}(y, z) = (1 − δ^{l_j})k(z), where j = T + i. Note that this conditional probability is independent of y. Therefore, the conditional probability that z*[T + i] = z[T + i], given T and z_{n_T+1} ∈ S(k), equals 1 − δ^{l_j} (≥ 1 − δ^{l_i}), which proves (iv) and (v).

Corollary 1  There exists a stochastic process (z̃_t)_t with values z̃_t ∈ M such that for n_i < t ≤ n̄_i the (auxiliary) state z̃_t is a (deterministic) function of z_1, ..., z_t and X, X_1, Y_1, ..., X_i, Y_i, and such that
1.1 the probability that z̃_{n_T+1} = z equals p(k)k(z) for z ∈ S(k);
1.2 for i ≥ 1, conditional on T, z̃[T], ..., z̃[T + i − 1], the process z̃[T + i] is a Markov chain with initial probability k and transition Q;
1.3 Pr(z̃[T + i] = z[T + i]) ≥ 1 − δ^{l_{T+i}} ≥ 1 − δ^{l_i}.

Proof  Let k̂ and ẑ[k, i], k ∈ K and i ≥ 1, be independent random variables such that Pr(k̂ = k) = p(k) and each random variable ẑ[k, i] is a Markov chain of length n̄_i − n_i with initial distribution k and transition matrix Q. W.l.o.g. we assume that k̂ and ẑ[k, i], k ∈ K and i ≥ 1, are deterministic functions of X. Set z̃_t = z_t for t ≤ n_T and for n̄_i < t ≤ n_{i+1}. Define z̃[T + i] = z[T + i] on the event z*[T + i] = z[T + i], and z̃[T + i] = ẑ[k, T + i] on the event that z*[T + i] = *[T + i] and z_{n_T+1} ∈ S(k).

5 Existence of the value and optimal strategies in Γ(q_0)

Assume without loss of generality that all payoffs of the stage games g(z, i, j) are in [0, 1].

Lemma 4  Player 1 can guarantee v̄(p) and Player 2 can guarantee v_(p).

Proof  Note that for l < l' we have v(p, l') ≥ v(p, l)·l/l', and therefore v̄(p) = lim sup_{l→∞} v(p, l²m). Similarly, v_(p) = lim inf_{l→∞} v(p, l²m). Fix ε > 0. Let l be sufficiently large with v(p, l²m) > v̄(p) − ε, respectively v(p, l²m) < v_(p) + ε, 1/l < ε, and so that δ^{lm} < ε and Pr(z_{lm+1} = z) ≥ (1 − ε)p(k)k(z) for every k ∈ K and z ∈ S(k).

Set n̄_0 = 0 and, for i ≥ 1, n̄_i = i(l + l²)m + l̄m and n_i = n̄_i − l²m, where l̄ is a nonnegative integer.³ Let (z̃_t)_t be the auxiliary stochastic process obeying 1.1, 1.2, and 1.3 of Corollary 1. Define g̃_t = g(z̃_t, i_t, j_t) (and recall that g_t = g(z_t, i_t, j_t)).

Let σ be a 1/l-optimal (and thus an ε-optimal) strategy of Player 1 in Γ(p, l²m) and let σ* be the strategy in (the Markov game) Γ defined as follows. Set h[i, t] = z̃_{n_i+1}, i_{n_i+1}, j_{n_i+1}, ..., z̃_{n_i+t}, i_{n_i+t}, j_{n_i+t}, and h[i] = h[i, l²m]. In stages n̄_i < t ≤ n_{i+1} (i ≥ 0), and in all stages on T > 1, the strategy σ* plays a fixed action in I. On T = 1, in stage n_i + t with 1 ≤ t ≤ l²m the strategy σ* plays the mixed action σ(h[1], ..., h[i−1], h[i, t−1], z̃_{n_i+t}) (where h[i, 0] stands for the empty string).

The definition of σ*, together with the ε-optimality of σ and the properties of the stochastic process z̃[1], z̃[2], ..., implies that for all sufficiently large i > 1 and every strategy τ of Player 2 we have

E_{σ*,τ} Σ_{j=1}^{i} Σ_{n_j < t ≤ n̄_j} g̃_t ≥ i l²m (v̄(p) − 2ε − Pr(T > 1)).

On z̃[j] = z[j] we have Σ_{n_j < t ≤ n̄_j} g̃_t = Σ_{n_j < t ≤ n̄_j} g_t. Therefore,

E_{σ*,τ} Σ_{j=1}^{i} Σ_{n_j < t ≤ n̄_j} g_t ≥ i l²m (v̄(p) − 4ε),

and therefore, as the density of the set of stages {t : n̄_{i−1} < t ≤ n_i} is l/(l + l²) < ε, we deduce that σ* guarantees v̄(p) − 5ε, and therefore Player 1 can guarantee v̄(p).

Respectively, if τ is an ε-optimal strategy of Player 2 in the game Γ(p, l²m), we define the strategy τ* (= τ*[l, τ, l̄]) of Player 2 in Γ(q_0) as follows. Set h²[i, t] = i_{n_i+1}, j_{n_i+1}, ..., i_{n_i+t}, j_{n_i+t}, and h²[i] = h²[i, l²m]. In stages t ≤ n_1 and in stages n̄_i + t with 1 ≤ t ≤ lm the strategy τ* plays a fixed action in J. In stage n_i + t with i ≥ 1 and 1 ≤ t ≤ l²m the strategy τ* plays the action τ(h²[1], ..., h²[i−1], h²[i, t−1]) (where h²[i, 0] stands for the empty string). The definition of τ*, together with the ε-optimality of τ and the properties of the stochastic processes z̃[1], z̃[2], ... and z[1], z[2], ..., implies that τ* guarantees v_(p) + 5ε, and therefore Player 2 can guarantee v_(p).⁴

Corollary 2  The game Γ(q_0) has a value v(Γ(q_0)) = v̄(p) = v_(p).

Lemma 5  Player 2 has an optimal strategy.

Proof  Recall that the 5ε-optimal strategy τ* appearing in the proof of Lemma 4 depends on the positive integer l, the strategy τ of Player 2 in Γ(p, l²m), and the auxiliary nonnegative integer l̄.

³ The dependence on l̄ enables us to combine the constructed ε-optimal strategies of Player 2 into an optimal strategy of Player 2.
⁴ An alternative construction of a strategy σ* of Player 1 that guarantees v̄(p) − ε is provided later in this section, and an alternative construction of a strategy τ* that guarantees v_(p) + ε is given in Sect. 6.

Fix a sequence l_j → ∞ with v(p, l_j²m) < v(p) + 1/j and strategies τ_j of Player 2 that are 1/j-optimal in Γ(p, l_j²m). Let d_j ≥ j be a sequence of positive integers such that for every strategy σ_j of Player 1 in Γ(p, l_j²m) and every d ≥ d_j we have

E^p_{σ_j, τ_j} Σ_{s=1}^{d} G^k(a(s), b(s)) ≤ d v(p, l_j²m) + d/j.

Let N_0 = 0 and N_j − N_{j−1} = d̄_j(l_j² + l_j)m, where d̄_j > d_j is an integer and (j − 1) d̄_j l_j² m ≤ N_{j−1}; e.g., choose integers d̄_j ≥ d_j + j d̄_{j+1} m l_{j+1}²/l_j² and let N_0 = 0 and N_j = N_{j−1} + d̄_j(l_j² + l_j)m. By setting n̄^j_0 = 0, n̄^j_i = N_{j−1} + i(l_j + l_j²)m for i ≥ 1, n^j_1 = N_{j−1} + l_j m, and n^j_i = n̄^j_i − l_j²m, we construct strategies τ*[j] = τ*[l_j, τ_j, l̄_j] (with the auxiliary integer l̄_j chosen so that n^j_1 = N_{j−1} + l_j m) such that if τ* is the strategy of Player 2 that follows τ*[j] in stages N_{j−1} < t ≤ N_j, then for every positive integer T with N_{j−1} + d_j(l_j² + l_j)m < T ≤ N_j we have

E_{σ,τ*} Σ_{t=N_{j−1}+1}^{T} g_t ≤ (T − N_{j−1})(v(p) + 2/j),

and therefore for every positive integer T with N_{j−1} < T ≤ N_j we have

E_{σ,τ*} Σ_{t=1}^{T} g_t ≤ T v(p) + Σ_{j'<j} (N_{j'} − N_{j'−1})·2/j' + (T − N_{j−1})·2/j + d_j(l_j² + l_j)m.

For every ε > 0 there is j_0 such that for j ≥ j_0 we have (1/N_{j−1}) Σ_{j'<j} (N_{j'} − N_{j'−1})·2/j' < ε, 2/j < ε, and (1/N_{j−1}) d_j(l_j² + l_j)m < ε. Thus for T > N_{j_0} we have

E_{σ,τ*} (1/T) Σ_{t=1}^{T} g_t ≤ v(p) + 3ε,

and therefore τ* is an optimal strategy of Player 2.

Lemma 6  Player 1 has an optimal strategy.

Proof  By Aumann and Maschler (1995), for every l there exist p(0, l), ..., p(|K|, l) ∈ Δ(K) and a probability vector α(0, l), ..., α(|K|, l) (i.e., α(i, l) ≥ 0 and Σ_{i=0}^{|K|} α(i, l) = 1) such that Σ_{i=0}^{|K|} α(i, l) p(i, l) = p and v(p, l²m) = Σ_{i=0}^{|K|} α(i, l) u_l(p(i, l)), where u_l(q) is the maxmin of G^q_l := Γ_1(q, l²m) in which Player 1 maximizes over all nonseparating strategies in G^q_l and Player 2 minimizes over all strategies. Let l_j → ∞ be such that lim_j v(p, l_j²m) = lim sup_l v(p, l²m), and the limits lim_j α(i, l_j), lim_j p(i, l_j), and lim_j u_{l_j}(p(i, l_j)) exist and equal α(i), p(i), and u(i), respectively.

Then

lim sup_l v(p, l²m) = Σ_{i=0}^{|K|} α(i) u(i).

Let p̃(i, l_j)[k] = p(i, l_j)[k] / Σ_{k'∈S(p(i))} p(i, l_j)[k'] if k ∈ S(p(i)), and p̃(i, l_j)[k] = 0 if k ∉ S(p(i)). Note that p̃(i, l_j) → p(i) as j → ∞. By the definition of a nonseparating strategy it follows that a nonseparating strategy in Γ_1(q, l) is a nonseparating strategy in Γ_1(q', l) whenever the support of q' is a subset of the support of q. Therefore, u(i) ≥ lim inf_j u_{l_j}(p̃(i, l_j)) = lim inf_j u_{l_j}(p(i)). Let θ_j ↓ 0 with u_{l_j}(p(i)) > u(i) − θ_j. By possibly replacing the sequence l_j by another sequence in which the j-th element of the original sequence, l_j, repeats itself L_j (e.g., l_{j+1}²) times, we may assume in addition that l_{j+1}² / Σ_{r≤j} l_r² → 0 as j → ∞.

Let σ_{i,j} be a nonseparating optimal strategy of Player 1 in the game Γ_1(p(i), l_j²m). Set n̄_j = Σ_{r≤j} (l_r² + l_r)m and n_j = n̄_j − l_j²m. We couple the process (z_t)_t with a process (z*_t)_t that satisfies conditions (i)–(v) of Lemma 3. Player 1 can construct such a process (z*_t)_t, as z*_t is a function of the random variables X, X_1, Y_1, ... and z_1, ..., z_t. Define the strategy σ* of Player 1 as follows. Let β(k, i) := p(i)[k]α(i)/p(k) for k ∈ K with p(k) > 0. Note that Σ_i β(k, i) = 1 for every k, and α(i) = Σ_k p(k)β(k, i). Conditional on z_{n_T+1} ∈ S(k), choose i with probability β(k, i), and in stages n_j < t ≤ n̄_j with j ≥ T and z*_{n_j+1} = z_{n_j+1} (equivalently, z*[j] = z[j]) play according to σ_{i,j} using the states of the process z[j] (= z*[j]), i.e., by setting h[j, t] = z_{n_j+1}, i_{n_j+1}, j_{n_j+1}, ..., i_{n_j+t−1}, j_{n_j+t−1}, z_{n_j+t},

σ*(z_1, ..., z_{n_j+t}) = σ_{i,j}(h[j, t]).

In all other cases σ* plays a fixed⁵ action, i.e., in stages t ≤ n_T and in stages n̄_{j−1} < t ≤ n_j, as well as in stages n_j < t ≤ n̄_j with z*[j] = *[j], σ* plays a fixed⁶ action. The conditional probability that z*[j] = z[j], given T ≤ j, is at least 1 − δ^{l²_{j−1}}. For simplicity of the notation below, set δ_j = δ^{l²_{j−1}}. It follows from the definition of σ* that for every strategy τ of Player 2 and every j we have, on T ≤ j,

E_{σ*,τ} [ Σ_{t=1}^{l_j²m} g_{n_j+t} | T ] ≥ l_j²m Σ_i α(i) u_{l_j}(p(i)) − l_j²m δ_j ≥ l_j²m Σ_i α(i) u(i) − l_j²m (θ_j + δ_j).

⁵ In the model with signals this is replaced by the mixed action x*_{z_t}.
⁶ Same comment as in footnote 5.

As P(T > j) ≤ 2ε_j, we have

E_{σ*,τ} Σ_{t=1}^{l_j²m} g_{n_j+t} ≥ l_j²m v̄(p) − l_j²m (θ_j + 2ε_j + δ_j),

and thus for n̄_j < n ≤ n̄_{j+1} we have

E_{σ*,τ} Σ_{t=1}^{n} g_t ≥ n v̄(p) − Σ_{s≤j} l_s²m (θ_s + 2ε_{s−1} + δ_s + 1/l_s) − (n − n̄_j).

As (θ_s + ε_{s−1} + δ_s + 1/l_s) → 0 as s → ∞, we deduce that Σ_{s≤j} l_s²m (θ_s + ε_s + δ_s)/n̄_j → 0 as j → ∞. In addition, (n̄_{j+1} − n̄_j)/n̄_j → 0 as j → ∞. Thus for every ε > 0 there is N sufficiently large such that for every n ≥ N and for every strategy τ of Player 2 we have

E_{σ*,τ} (1/n) Σ_{t=1}^{n} g_t ≥ v̄(p) − ε.

6 Markov chain games with incomplete information on one side and signals

The game model Γ with signals is described by the 7-tuple ⟨M, Q, q_0, I, J, g, R⟩, where ⟨M, Q, q_0, I, J, g⟩ is as in the model without signals and with observable actions, and R = (R^z_{i,j})_{z,i,j} describes the distribution of signals as follows. For every (z, i, j) ∈ M × I × J, R^z_{i,j} is a probability distribution over S^1 × S^2. Following the play z_t, i_t, j_t at stage t, a signal s_t = (s^1_t, s^2_t) ∈ S^1 × S^2 is chosen by nature with conditional probability, given the past z_1, i_1, j_1, s_1, ..., z_t, i_t, j_t, that equals R^{z_t}_{i_t, j_t}; following the play at stage t, Player 1 observes s^1_t and z_{t+1} and Player 2 observes s^2_t.

Assume that for every z ∈ M Player 1 has a mixed action x*_z ∈ Δ(I) such that for every j ∈ J the distribution of the signal s^2 is independent of z; i.e., for every j ∈ J the marginal on S^2 of Σ_i x*_z(i) R^z_{i,j} is constant as a function of z.

Define m, K = K(Q^m), p ∈ Δ(K), and the games Γ(p, l) as in the basic model but with the natural addition of the signals. Let v(p, l) be the value of Γ(p, l). Set v̄ = lim sup_{l→∞} v(p, lm) and v_ = lim inf_{l→∞} v(p, lm).

Let A and B denote the pure strategies of Player 1 and Player 2, respectively, in Γ_1(p, lm). A pure strategy a ∈ A of Player 1 in Γ_1(p, lm) is a sequence of functions (a_t)_{1≤t≤lm} where a_t : (M × S^1)^{t−1} × M → I. A pure strategy b ∈ B of Player 2 in Γ_1(p, lm) is a sequence of functions (b_t)_{1≤t≤lm} where b_t : (S^2)^{t−1} → J.
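The assumption on x*_z can be checked mechanically for concrete data: for every j, the marginal on S^2 of Σ_i x*_z(i) R^z_{i,j} must not depend on z. Below is a small numpy sketch of that check, with made-up illustrative signal distributions; the function name and array layout are my own assumptions, not the paper's notation.

```python
import numpy as np

def is_nonrevealing(x_star, R):
    """Check that x_star makes Player 2's signal distribution state-independent.

    x_star[z, i]       : the mixed action x*_z
    R[z, i, j, s1, s2] : the signal distribution R^z_{i,j} on S^1 x S^2
    """
    # Marginal on S^2 of sum_i x*_z(i) R^z_{i,j}, for every state z and action j.
    marg = np.einsum('zi,zijab->zjb', x_star, R)
    return np.allclose(marg, marg[0])      # the same marginal for every state z

# Illustrative sizes: |M| = 2, |I| = 2, |J| = 2, |S^1| = 1, |S^2| = 2.
rng = np.random.default_rng(1)
R = rng.dirichlet(np.ones(2), size=(2, 2, 2)).reshape(2, 2, 2, 1, 2)
x_star = np.array([[0.5, 0.5], [0.5, 0.5]])
print(is_nonrevealing(x_star, R))          # typically False for random signal data
```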

A triple (x, k, b) ∈ Δ(A) × K × B induces a probability distribution, denoted s^2(x, k, b), on the signal in S_2^{lm} to Player 2 in Γ_1(p, lm). For every q ∈ Δ(K) we define NS(q) as the set of nonseparating strategies of Player 1 in Γ_1(p, lm), i.e., x ∈ NS(q) iff for every b ∈ B the distribution s^2(x, k, b) is the same for all k with q(k) > 0.

Theorem 2  The game Γ has a value and both players have optimal strategies. The limit of v(p, lm) as l → ∞ exists and equals the value of Γ.

Proof  The proof that Player 1 has a strategy σ* that guarantees v̄ − ε for every ε > 0 is identical to the proof (in the basic model) that Player 1 has an optimal strategy. Next, we prove that Player 2 can guarantee v_. Let γ_n, or ε for short,⁷ be a positive number with 0 < ε < 1/2, and let l_n, or l for short, be a sufficiently large positive integer such that (1) for every k ∈ K and z, z' ∈ S(k) we have Q^{lm}_{z,z'} > (1 − ε)k(z'), (2) v(p, lm) < v_ + ε, and (3) for every k ∈ K and z ∈ S(k), Pr(z_{lm+1} = z) ≥ (1 − ε)p(k)k(z).

Let τ be an optimal strategy of Player 2 in Γ(p, lm). Fix a positive integer j_n and construct the following strategy τ*[n], or τ* for short, of Player 2 in Γ. Set N_i = (i(i+1)/2) lm, n^j_i = N_i + (j − 1)lm, and n̄^j_i = n^j_i + lm. Let B^j_i be the block of lm consecutive stages n^j_i < t ≤ n̄^j_i. For every j ≥ j_n consider the sequence of blocks B^j_j, B^j_{j+1}, ..., as stages of the repeated game Γ(p, lm), and play in these blocks according to the strategy τ; formally, if ŝ^j_i is the sequence of signals to Player 2 in stages n^j_i < t ≤ n̄^j_i, then play in stages n^j_i < t ≤ n̄^j_i the stage strategy τ(ŝ^j_j, ..., ŝ^j_{i−1}). (In stages t that belong to no block B^j_i with j ≥ j_n, τ* plays a fixed action.)

Note that for every j, n^j_{i+1} − n̄^j_i = i·lm, and therefore there is an event C^j_i with probability greater than 1 − 3ε such that, on C^j_i, the stochastic process z[j, j], z[j+1, j], ..., z[i, j], ..., where z[i, j] := z_{n^j_i+1}, ..., z_{n̄^j_i}, is a mixture of i.i.d. stochastic processes of length lm: with probability p(k) the distribution of z[i, j] is the distribution of a Markov chain of length lm with initial distribution k and transition matrix Q. It follows that τ* (= τ*[n]) guarantees v_ + 2ε + 3ε + ε. Indeed, the definition of τ* implies that for every sufficiently large i ≥ j we have

E_{σ,τ*} [ Σ_{i'=j}^{i} Σ_{t∈B^j_{i'}} g_t | C^j_i ] ≤ (i − j + 1) lm (v_ + 2ε),

and therefore

E_{σ,τ*} Σ_{i'=j}^{i} Σ_{t∈B^j_{i'}} g_t ≤ (i − j + 1) lm (v_ + 2ε + 3ε).

⁷ The dependence on n enables us to combine the ε-optimal strategies into an optimal strategy.

Thus, if i(T) is the minimal i such that N_i ≥ T, then for a sufficiently large positive integer T we have

E_{σ,τ*} Σ_{i=j}^{i(T)} Σ_{t∈B^j_i} g_t ≤ (i(T) − j + 1) lm (v_ + 2ε + 3ε),

and therefore E_{σ,τ*} Σ_{t=1}^{T} g_t is at most E_{σ,τ*} Σ_{j=j_n}^{i(T)} Σ_{i=j}^{i(T)} Σ_{t∈B^j_i} g_t plus the number of stages t ≤ T that lie in no block B^j_i with j ≥ j_n, which is less than or equal to (i(T)(i(T)+1)/2) lm (v_ + 2ε + 3ε) + j_n i(T) lm. As i(T) lm = o(T) and (i(T)(i(T)+1)/2) lm − T < i(T) lm, the strategy τ* guarantees v_ + 6ε.

Choose a sequence 0 < γ_n → 0 and a corresponding sequence l_n → ∞. By properly choosing an increasing sequence T_n (T_0 = 0) and a sequence j_n with (j_n(j_n+1)/2) l_n m + (j_n − 1) l_n m ≤ T_{n−1}, and playing in stages T_{n−1} < t ≤ T_n the strategy τ*[n], we construct an optimal strategy of Player 2.

Remarks
1. The value is independent of the signaling to Player 1.
2. The existence⁸ of a nonrevealing mixed action x*_z enables Player 1 to play nonrevealingly in the prefaced play, before the process enters into a communicating class, as well as in the remixing stages t.
3. If the model is modified so that the state process is a mixture of Markov chains (namely, nature chooses a pair (z, Q) according to a commonly known probability α on the product of the set of states M and the set of transition matrices, and conditional on the choice of (z, Q) the stochastic process of states obeys z_1 = z and P(z_t = z' | z_1, ..., z_{t−1}) = Q_{z_{t−1}, z'} for t > 1), then the results about the existence of a value and optimal strategies for the uninformed Player 2 hold. However, since Player 1 is not informed of the choice of Q, the informed Player 1 need not have an optimal strategy.

⁸ This assumption is not needed in the classical results about repeated games with incomplete information on one side and signals.

References

Aumann RJ, Maschler M (1995) Repeated games with incomplete information. MIT Press, Cambridge, MA
Kemeny JG, Snell JL (1976) Finite Markov chains. Springer, New York
Renault J (2006) The value of Markov chain games with lack of information on one side. Math Oper Res 31


More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016 U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and

More information

Lecture 17 : Stochastic Processes II

Lecture 17 : Stochastic Processes II : Stochastc Processes II 1 Contnuous-tme stochastc process So far we have studed dscrete-tme stochastc processes. We studed the concept of Makov chans and martngales, tme seres analyss, and regresson analyss

More information

EXPONENTIAL ERGODICITY FOR SINGLE-BIRTH PROCESSES

EXPONENTIAL ERGODICITY FOR SINGLE-BIRTH PROCESSES J. Appl. Prob. 4, 022 032 (2004) Prnted n Israel Appled Probablty Trust 2004 EXPONENTIAL ERGODICITY FOR SINGLE-BIRTH PROCESSES YONG-HUA MAO and YU-HUI ZHANG, Beng Normal Unversty Abstract An explct, computable,

More information

ON SEPARATING SETS OF WORDS IV

ON SEPARATING SETS OF WORDS IV ON SEPARATING SETS OF WORDS IV V. FLAŠKA, T. KEPKA AND J. KORTELAINEN Abstract. Further propertes of transtve closures of specal replacement relatons n free monods are studed. 1. Introducton Ths artcle

More information

ON THE EXTENDED HAAGERUP TENSOR PRODUCT IN OPERATOR SPACES. 1. Introduction

ON THE EXTENDED HAAGERUP TENSOR PRODUCT IN OPERATOR SPACES. 1. Introduction ON THE EXTENDED HAAGERUP TENSOR PRODUCT IN OPERATOR SPACES TAKASHI ITOH AND MASARU NAGISA Abstract We descrbe the Haagerup tensor product l h l and the extended Haagerup tensor product l eh l n terms of

More information

Problem Set 9 - Solutions Due: April 27, 2005

Problem Set 9 - Solutions Due: April 27, 2005 Problem Set - Solutons Due: Aprl 27, 2005. (a) Frst note that spam messages, nvtatons and other e-mal are all ndependent Posson processes, at rates pλ, qλ, and ( p q)λ. The event of the tme T at whch you

More information

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0 MODULE 2 Topcs: Lnear ndependence, bass and dmenson We have seen that f n a set of vectors one vector s a lnear combnaton of the remanng vectors n the set then the span of the set s unchanged f that vector

More information

Representation theory and quantum mechanics tutorial Representation theory and quantum conservation laws

Representation theory and quantum mechanics tutorial Representation theory and quantum conservation laws Representaton theory and quantum mechancs tutoral Representaton theory and quantum conservaton laws Justn Campbell August 1, 2017 1 Generaltes on representaton theory 1.1 Let G GL m (R) be a real algebrac

More information

Equilibrium with Complete Markets. Instructor: Dmytro Hryshko

Equilibrium with Complete Markets. Instructor: Dmytro Hryshko Equlbrum wth Complete Markets Instructor: Dmytro Hryshko 1 / 33 Readngs Ljungqvst and Sargent. Recursve Macroeconomc Theory. MIT Press. Chapter 8. 2 / 33 Equlbrum n pure exchange, nfnte horzon economes,

More information

Min Cut, Fast Cut, Polynomial Identities

Min Cut, Fast Cut, Polynomial Identities Randomzed Algorthms, Summer 016 Mn Cut, Fast Cut, Polynomal Identtes Instructor: Thomas Kesselhem and Kurt Mehlhorn 1 Mn Cuts n Graphs Lecture (5 pages) Throughout ths secton, G = (V, E) s a mult-graph.

More information

Convexity preserving interpolation by splines of arbitrary degree

Convexity preserving interpolation by splines of arbitrary degree Computer Scence Journal of Moldova, vol.18, no.1(52), 2010 Convexty preservng nterpolaton by splnes of arbtrary degree Igor Verlan Abstract In the present paper an algorthm of C 2 nterpolaton of dscrete

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

MATH 281A: Homework #6

MATH 281A: Homework #6 MATH 28A: Homework #6 Jongha Ryu Due date: November 8, 206 Problem. (Problem 2..2. Soluton. If X,..., X n Bern(p, then T = X s a complete suffcent statstc. Our target s g(p = p, and the nave guess suggested

More information

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,

More information

Continuous Time Markov Chains

Continuous Time Markov Chains Contnuous Tme Markov Chans Brth and Death Processes,Transton Probablty Functon, Kolmogorov Equatons, Lmtng Probabltes, Unformzaton Chapter 6 1 Markovan Processes State Space Parameter Space (Tme) Dscrete

More information

Appendix for Causal Interaction in Factorial Experiments: Application to Conjoint Analysis

Appendix for Causal Interaction in Factorial Experiments: Application to Conjoint Analysis A Appendx for Causal Interacton n Factoral Experments: Applcaton to Conjont Analyss Mathematcal Appendx: Proofs of Theorems A. Lemmas Below, we descrbe all the lemmas, whch are used to prove the man theorems

More information

TAIL PROBABILITIES OF RANDOMLY WEIGHTED SUMS OF RANDOM VARIABLES WITH DOMINATED VARIATION

TAIL PROBABILITIES OF RANDOMLY WEIGHTED SUMS OF RANDOM VARIABLES WITH DOMINATED VARIATION Stochastc Models, :53 7, 006 Copyrght Taylor & Francs Group, LLC ISSN: 153-6349 prnt/153-414 onlne DOI: 10.1080/153634060064909 TAIL PROBABILITIES OF RANDOMLY WEIGHTED SUMS OF RANDOM VARIABLES WITH DOMINATED

More information

Entropy of Markov Information Sources and Capacity of Discrete Input Constrained Channels (from Immink, Coding Techniques for Digital Recorders)

Entropy of Markov Information Sources and Capacity of Discrete Input Constrained Channels (from Immink, Coding Techniques for Digital Recorders) Entropy of Marov Informaton Sources and Capacty of Dscrete Input Constraned Channels (from Immn, Codng Technques for Dgtal Recorders). Entropy of Marov Chans We have already ntroduced the noton of entropy

More information

Lecture Notes Introduction to Cluster Algebra

Lecture Notes Introduction to Cluster Algebra Lecture Notes Introducton to Cluster Algebra Ivan C.H. Ip Updated: Ma 7, 2017 3 Defnton and Examples of Cluster algebra 3.1 Quvers We frst revst the noton of a quver. Defnton 3.1. A quver s a fnte orented

More information

MMA and GCMMA two methods for nonlinear optimization

MMA and GCMMA two methods for nonlinear optimization MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons

More information

AN EXTENDED CLASS OF TIME-CONTINUOUS BRANCHING PROCESSES. Rong-Rong Chen. ( University of Illinois at Urbana-Champaign)

AN EXTENDED CLASS OF TIME-CONTINUOUS BRANCHING PROCESSES. Rong-Rong Chen. ( University of Illinois at Urbana-Champaign) AN EXTENDED CLASS OF TIME-CONTINUOUS BRANCHING PROCESSES Rong-Rong Chen ( Unversty of Illnos at Urbana-Champagn Abstract. Ths paper s devoted to studyng an extended class of tme-contnuous branchng processes,

More information

Introductory Cardinality Theory Alan Kaylor Cline

Introductory Cardinality Theory Alan Kaylor Cline Introductory Cardnalty Theory lan Kaylor Clne lthough by name the theory of set cardnalty may seem to be an offshoot of combnatorcs, the central nterest s actually nfnte sets. Combnatorcs deals wth fnte

More information

Strong Markov property: Same assertion holds for stopping times τ.

Strong Markov property: Same assertion holds for stopping times τ. Brownan moton Let X ={X t : t R + } be a real-valued stochastc process: a famlty of real random varables all defned on the same probablty space. Defne F t = nformaton avalable by observng the process up

More information

Market structure and Innovation

Market structure and Innovation Market structure and Innovaton Ths presentaton s based on the paper Market structure and Innovaton authored by Glenn C. Loury, publshed n The Quarterly Journal of Economcs, Vol. 93, No.3 ( Aug 1979) I.

More information

Deriving the X-Z Identity from Auxiliary Space Method

Deriving the X-Z Identity from Auxiliary Space Method Dervng the X-Z Identty from Auxlary Space Method Long Chen Department of Mathematcs, Unversty of Calforna at Irvne, Irvne, CA 92697 chenlong@math.uc.edu 1 Iteratve Methods In ths paper we dscuss teratve

More information

A SURVEY OF PROPERTIES OF FINITE HORIZON DIFFERENTIAL GAMES UNDER ISAACS CONDITION. Contents

A SURVEY OF PROPERTIES OF FINITE HORIZON DIFFERENTIAL GAMES UNDER ISAACS CONDITION. Contents A SURVEY OF PROPERTIES OF FINITE HORIZON DIFFERENTIAL GAMES UNDER ISAACS CONDITION BOTAO WU Abstract. In ths paper, we attempt to answer the followng questons about dfferental games: 1) when does a two-player,

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

Matrix Approximation via Sampling, Subspace Embedding. 1 Solving Linear Systems Using SVD

Matrix Approximation via Sampling, Subspace Embedding. 1 Solving Linear Systems Using SVD Matrx Approxmaton va Samplng, Subspace Embeddng Lecturer: Anup Rao Scrbe: Rashth Sharma, Peng Zhang 0/01/016 1 Solvng Lnear Systems Usng SVD Two applcatons of SVD have been covered so far. Today we loo

More information

Applied Stochastic Processes

Applied Stochastic Processes STAT455/855 Fall 23 Appled Stochastc Processes Fnal Exam, Bref Solutons 1. (15 marks) (a) (7 marks) The dstrbuton of Y s gven by ( ) ( ) y 2 1 5 P (Y y) for y 2, 3,... The above follows because each of

More information

REGULAR POSITIVE TERNARY QUADRATIC FORMS. 1. Introduction

REGULAR POSITIVE TERNARY QUADRATIC FORMS. 1. Introduction REGULAR POSITIVE TERNARY QUADRATIC FORMS BYEONG-KWEON OH Abstract. A postve defnte quadratc form f s sad to be regular f t globally represents all ntegers that are represented by the genus of f. In 997

More information

STEINHAUS PROPERTY IN BANACH LATTICES

STEINHAUS PROPERTY IN BANACH LATTICES DEPARTMENT OF MATHEMATICS TECHNICAL REPORT STEINHAUS PROPERTY IN BANACH LATTICES DAMIAN KUBIAK AND DAVID TIDWELL SPRING 2015 No. 2015-1 TENNESSEE TECHNOLOGICAL UNIVERSITY Cookevlle, TN 38505 STEINHAUS

More information

Supplementary material: Margin based PU Learning. Matrix Concentration Inequalities

Supplementary material: Margin based PU Learning. Matrix Concentration Inequalities Supplementary materal: Margn based PU Learnng We gve the complete proofs of Theorem and n Secton We frst ntroduce the well-known concentraton nequalty, so the covarance estmator can be bounded Then we

More information

Random Walks on Digraphs

Random Walks on Digraphs Random Walks on Dgraphs J. J. P. Veerman October 23, 27 Introducton Let V = {, n} be a vertex set and S a non-negatve row-stochastc matrx (.e. rows sum to ). V and S defne a dgraph G = G(V, S) and a drected

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

FINITELY-GENERATED MODULES OVER A PRINCIPAL IDEAL DOMAIN

FINITELY-GENERATED MODULES OVER A PRINCIPAL IDEAL DOMAIN FINITELY-GENERTED MODULES OVER PRINCIPL IDEL DOMIN EMMNUEL KOWLSKI Throughout ths note, s a prncpal deal doman. We recall the classfcaton theorem: Theorem 1. Let M be a fntely-generated -module. (1) There

More information

arxiv: v1 [math.co] 1 Mar 2014

arxiv: v1 [math.co] 1 Mar 2014 Unon-ntersectng set systems Gyula O.H. Katona and Dánel T. Nagy March 4, 014 arxv:1403.0088v1 [math.co] 1 Mar 014 Abstract Three ntersecton theorems are proved. Frst, we determne the sze of the largest

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

n ). This is tight for all admissible values of t, k and n. k t + + n t

n ). This is tight for all admissible values of t, k and n. k t + + n t MAXIMIZING THE NUMBER OF NONNEGATIVE SUBSETS NOGA ALON, HAROUT AYDINIAN, AND HAO HUANG Abstract. Gven a set of n real numbers, f the sum of elements of every subset of sze larger than k s negatve, what

More information

General viscosity iterative method for a sequence of quasi-nonexpansive mappings

General viscosity iterative method for a sequence of quasi-nonexpansive mappings Avalable onlne at www.tjnsa.com J. Nonlnear Sc. Appl. 9 (2016), 5672 5682 Research Artcle General vscosty teratve method for a sequence of quas-nonexpansve mappngs Cuje Zhang, Ynan Wang College of Scence,

More information

Dimensionality Reduction Notes 1

Dimensionality Reduction Notes 1 Dmensonalty Reducton Notes 1 Jelan Nelson mnlek@seas.harvard.edu August 10, 2015 1 Prelmnares Here we collect some notaton and basc lemmas used throughout ths note. Throughout, for a random varable X,

More information