Towards a Belief-Based Theory of Repeated Games with Private Monitoring: An Application of POMDP

Size: px

Start display at page:

Download "Towards a Belief-Based Theory of Repeated Games with Private Monitoring: An Application of POMDP"

Silvester Ryan
5 years ago
Views:

1 Towards a Belef-Based Theory of Repeated Games wth Prvate Montorng: An Applcaton of POMDP KANDORI, Mchhro Faculty of Economcs Unversty of Tokyo OBARA, Ichro Department of Economcs UCLA Ths Verson: June 3, 2010 Abstract An equlbrum n a repeated game wth mperfect prvate motorng s called a fnte state equlbrum, feachplayer sactonon the equlbrum path s gven by an automaton wth a fnte number of states. We provde a tractable general method to check the equlbrum condtons n ths class. Our method s based on the belef-based approach and employs the theory of POMDP (Partally Observable Markov Decson Processes). Ths encompasses the majorty of exstng works. 1 Introducton Repeated games wth mperfect prvate montorng represent long-term relatonshps, where each player receves a nosy prvate sgnal about others actons. Although ths class of games has a wde range of applcatons, the characterzaton of all equlbra has yet to be obtaned. Ths s n sharp contrast to the theory of repeated games wth perfect or mperfect publc montorng, where complete characterzatons of all equlbra have been obtaned. The present paper provdes valuable general methods to verfy equlbrum condtons n repeated games wth prvate montorng. In partcular, we focus on a fnte state equlbrum, where each player s acton on the equlbrum path s gven by an automaton wth a fnte state space. We provde a complete characterzaton of ths class of equlbra and provde a tractable computatonal method to determne f a gven profle of fnte automata (one for Prevous versons of ths paper has been crculated under the ttle "Fnte State Equlbra". e-mal: kandor@e.u-tokyo.ac.jp e-mal: obara@econ.ucla.edu 1

2 each player to determne hs acton on the path of play) can consttute a (fnte) equlbrum. The present paper provdes a unfyng general theory to encompass the majorty of the exstng works, because most of them are based on some form of fnte state equlbra. The belef-based approach by Sekguch (1997), Bhaskar and Obara (2002) consder trgger strateges on the path of play, hence fnte state equlbra. The present paper can be regarded as a generalzaton of those papers. The beleffree approach by Ely and Välmäk (2002) consders an equlbrum whch can be mplemented by a fnte state automata on and off the path of play. Proposton 4 n Ely, Horner and Olszewsk (2005) (bang-bang property) shows that most of belef-free equlbrum payoffs (ndeed all of them f the dscount factor s close to 1) can be obtaned by a fnte state equlbrum wth only two states. Matsushma s revew strategy equlbra (2004) and Hörner and Olszewsk (2006) also employ fnte equlbra. An excepton whch does not employ a fnte equlbrum s Pccone (2002), whose equlbrum path requres nfnte (countably many) states. However, the result by Ely, Horner and Olszewsk shows that Pccone s equlbrum payoff can be obtaned by a fnte state equlbrum. A more recent paper by Phelan and Skrzypacz (2009) also proposes an algorthm to compute a class of statonary fnte state equlbra. Ther approach focuses on dynamcs of belefs, whle we utlze the theory of POMDP (Partally Observable Markov Decson Processes) to solve the belef-based dynamc programmng problem. 2 Repeated Games wth Prvate Montorng We use repeated games wth prvate montorng as our base model. Let A be the (fnte) set of actons for player =1,...,N and A := A 1 A N. Wthn each perod, player observes her own acton a and prvate sgnal ω Ω. We denote ω =(ω 1,...,ω N ) Ω := Ω 1 Ω N and let q(ω a) be the probablty of prvate sgnal profle ω gven acton profle a (we assume that Ω s a fnte set). We denote the margnal dstrbuton of ω by q (ω a). It s also assumed that no player can nfer whch actons were taken (or not taken) for sure; to ths end, we assume that each ω Ω occurs wth a postve probablty for any a A ( full support assumpton). Player s realzed payoff s determned by her own acton and sgnal, and denoted π (a, ω ). Hence her expected payoff s gven by g (a) = X ω Ω π (a, ω )q(ω a). Ths formulaton ensures that the realzed payoff π conveys no more nformaton than a and ω do. Amxedactonforplayer s denoted by α (A ),where (A ) s the set of probablty dstrbutons over A. Wth an abuse of notaton, we denote the expected payoff and the sgnal dstrbuton under a mxed acton profle α =(α 1,...,α N ) by g (α) and q(ω α) respectvely. The stage game s to be played 2

3 repeatedly over an nfnte tme horzon t =1, 2,... Player s dscounted payoff from a sequence of acton profles a(t) A, t =0, 1, 2... s gven by P t=1 δt g (a(t)), where δ (0, 1) s the dscount factor. 3 Repeated Game Strateges and Path Automata We now explore several ways to represent repeated game strateges. We start wth the conventonal representaton of strateges n the repeated game defned above. 3.1 Repeated Game Strateges A prvate hstory for player at the begnnng of tme t s the record of player s past actons and sgnals, h t =(a (0), ω (0),...,a (t 1), ω (t 1)) H t := (A Ω ) t. To determne the ntal acton of each player, we ntroduce a dummy ntal hstory (or null hstory) h 0 and let H0 be a sngleton set {h0 }. A pure strategy s for player s a functon specfyng an acton after any hstory: formally, s : H A,where H = t 0 H t. Smlarly, a (behavorally) mxed strategy for player s denoted by σ : H (A ). 3.2 Path Automata and Fnte State Equlbra A path automaton M (Θ, b θ,f,t ) of player specfes the path of play (but not the behavor off the path of play) for player, by specfyng the followng: 1. a set of states Θ 2. the ntal state b θ Θ 3. (pure) acton choce for each state, f : Θ A (wthout loss of generalty, we can assume that a pure acton s played n each state 1.) 4. (possbly stochastc) state transton T : Θ Ω (Θ ). Specfcally, T (θ (t +1) θ (t), ω (t)) s the probablty of the next state beng θ (t +1) gven the current state θ (t), actona (t) =f (θ (t)), and prvate sgnal ω (t). A path automaton wthout the specfcaton of the ntal state, denoted by m (Θ,f,T ),sreferredtoasapath preautomaton. Ths concept turns out to be useful n our analyss. For any path preautomaton m,wedenotethe correspondng path automaton wth ntal state θ Θ by (m, θ ). 1 Stochastc transton functon can represent mxed acton. For example, suppose that acton C and D are played wth an equal probablty at state θ. Then, we can splt ths state θ nto two states θ C and θ D and assume that n state θ a, pure acton a s played (a = C, D). Furthermore we can specfy the stochastc state transson functon such that, n the event that θ s to be the next stae, state θ C or θ D realzes wth an equal probablty. 3

4 An mportant part of our defnton above s that the transton functon T presumes that the equlbrum acton a (t) =f (θ (t)) s played. Hence, our path automaton does not specfy the behavor after a dsequlbrum acton a (t) 6= f (θ (t)) (and therefore a path automaton only represents a part of a repeated game strategy). To represent a repeated game strategy, one need to extend the transton functon to specfy the state transton after a devatng acton a (t) 6= f (θ (t)). To ths end, let us defne an extended transton functon T : Θ Ω A (Θ ),where T (θ (t+1) θ (t), ω (t),a (t)) stheprobabltyofthenextstatebengθ (t+1) gven the current state θ (t), an arbtrary current acton a (t), and the current prvate sgnal ω (t). An extended automaton s denoted by M (Θ, b θ,f, T ). An extended automaton specfes an acton after any prvate hstory, so t nduces a full repeated game strategy. A message of the present paper s that, somewhat surprsngly, path automata often turn out to be more useful than extended automata n the belefbased analyss of repeated games wth prvate montorng. A fnte path automaton s a path automaton wth a fnte number of states. A fnte path preautomaton and a fnte extended automaton are defned smlarly. We are nterested n equlbra where each player s behavor on the equlbrum can be descrbed by a fnte path automaton. We call a profle of fnte path preautomata m =(m 1,...,m N ) compatble f, for every, there exsts some state θ Θ and some belef b (Θ ) such that (m, θ ) s the optmal plan gven hs subjectve belef b. Compatblty s necessary for any profle of fnte path preautomata to be played on the equlbrum path. Note that compatblty does not guarantee that a set of such state-belef pars across players are consstent: t may not be generated by some jont dstrbuton on Θ. Suppose that there s a common pror r (Θ) = (Θ 1 Θ N ) such that (m, θ ) s optmal aganst r( θ ) (Θ ) for every θ Θ n the support of r and =1,..,N. Snce we assume full support (of the margnal dstrbuton on Ω ), such profle of fnte path preautomata and a jont dstrbuton r on Θ consttutes a (correlated) sequental equlbrum once optmal strateges are assgned after every off the equlbrum hstory. 2 A fnte state equlbrum s such a correlated sequental equlbrum where on-path behavor can be represented by fnte state preautomata and a jont dstrbuton on the product state space. Defnton 1 A fnte state equlbrum s a (correlated) sequental equlbrum of a repeated game wth prvate montorng, where players behavor on the equlbrum path s gven by fnte path preautomata m (Θ,f,T ), =1,...,N and a jont probablty dstrbuton of the ntal states r (Θ). 2 Snce we assume full support, (1) there s no ambguty regardng the defnton of sequental equlbrum although our game s an nfnte horzon game and (2) every correlated sequental equlbrum s a correlated equlbrum.e. sequental ratonalty s not a restrcton. 4

5 The probablty dstrbuton r s the ntal correlaton devce. At the begnnng of t =0,aprofle of recommended ntal states b θ realzes wth probablty r( b θ), and player observes hs recommended ntal state b θ. In a fnte state equlbrum, each player fnds t optmal to follow path automaton (m, b θ ), gven that others obey ther recommended ntal states. In other words, (m, r) consttutes a correlated equlbrum. A standard argument shows that (m, r) can always be "extended" to obtan a sequental equlbrum: Remark 1 The path preautomata m 1,...,m N do not specfy how each player should behave after hs own devatons, but we can always fnd, for each player, arepeated game strategy σ (m ) such that () t specfes the same behavor as m on the equlbrum path and () t specfes an optmal contnuaton strategy for each nformaton set of player. 3 We wll refer both (σ 1 (m 1 ),...,σ N (m N ),r) and (m, r) as a fnte state equlbrum, f no confuson ensues. Note that, n a fnte state equlbrum, a player s behavor off the equlbrum path need not be descrbed by a fnte automata. In fact, there s an equlbrum dentfed n the exstng lterature whch has the on-path behavor descrbed by fnte path automata, whle specfyng ts off-path behavor requres an extended automaton wth nfntely many states: Example 1 The followng example s from Bhaskar and Obara (2002). Consder a standard repeated prsoner s dlemma game wth almost perfect and ndependent prvate montorng: A = {C, D}, Ω = {c, d}, Pr (ω = c a,c)=pr(ω = d a,d)= 1 ε for any a A. Suppose that player 2 s play s determned by the followng trgger strategy" preautomaton m, Θ = {R, P }, f(r) = C, f (P )= D, T (R R, c) = 1,T (P R, d) =T (P P, ω 2 )=1for any ω 2 Ω 2. Let b 1 be player 1 s belef that player 2 s at state R. For a certan range of dscount factor and small enough ε > 0, t can be shown that the optmal strategy aganst ths preautomaton can be represented as a smple cut-off strategy ( belef-based strategy") n terms of belef: play C f b 1 >b and play D f b 1 <b for some cut-off belef b (0, 1). It can be shown that ths optmal strategy can be descrbed by a fnte state automaton gven any belef. In fact, t s exactly gven by the above preautomaton tself: (m, P ) s the optmal strategy for b 1 [0,b ] and (m, R) s the optmal strategy for b 1 [b, b] 3 The strategy σ (m ) s obtaned by specfyng an optmal contnuaton strategy for each nformaton set of player that s not reached on the equlbrum path (such an nformaton set can bereachedonlybyplayer s own devatons). Ths does not change player s payoff, andt does not affect other playres best reples ether (because player s devaton s never detected by other players under the full support assumpton). It remans to check f σ (m ) specfes optmal contnuaton strateges on the nformaton sets that are reached on the equlbrum path. If ths were not true, player s payoff assocated wth m couldbemprovedbyreplacngthesupoptmal contnuaton strategy (specfed by m ) wth an optmal one n some nformaton set that s reached wth a postve probablty. Ths would mply that m were not a best reply, contradctng our premse that (m, r) s a correlated equlbrum. 5

6 ([0, b] s a belef-closed set,.e. player 1 s belef never leaves [0, b] when the ntal belef s n [0, b].. So we don t have to consder b 1 above b). However, player 1 s optmal strategy off the equlbrum path cannot be descrbed by a fnte state automaton. Consder a hstory where b 1 s close to 0, for example, a hstory such as(..., Dd, Dd, Dd, Dd). Suppose that player 1 devates repeatedly by playng C (because D s the unque optmal acton for b 1 [0,b ]) and observes c n many perods. In ths off the equlbrum path, player 1 s belef eventually exceeds b after many realzaton of (Cc), after whch the optmal contnuaton strategy must be (m, R). Tomplementsuchastrategyoff the equlbrum path by an extended fnte state automaton, one needs to use as many number of states as the requred number of Cc to go back to (m, R). But ths number can be arbtrarly large f b 1 s arbtrarly close to 0. Hence there s no fnte extended automaton to mplement suchastrategy.. 4 Prvate Montorng Problem as Partally Observable Markov Decson Process (POMDP). In a fnte state equlbrum, player opponents equlbrum behavor s gven by fnte path preautomata m j = (Θ j,f j,t j ), j 6=. Under the full support assumpton of prvate sgnals, player never receves an evdence that hs opponents j 6= devated from ther equlbrum behavor, so that he always beleves that hs opponents are usng ther path preautomata. Hence, after any prvate hstory, player s nformaton regardng the current and future behavor of player j 6= s summarzed by hs belef over hs opponents states b 4 (Θ ). Belef b compactly summarzes the relevant nformaton contaned n a prvate hstory h t =(a (1), ω (1),...,a (t), ω (t)) (a much more complcated object than b ). Furthermore, player s belef at tme t, denoted by b (t), can be calculated by hs prevous belef b (t 1), acton a (t 1) and sgnal ω (t 1) (the dynamcs of belefs s a controlled Markov process). Ths means that, n a fnte state equlbrum, each player faces a relatvely smple problem: a Markov decson problem (dynamc programmng problem) wth a fnte dmensonal state space 4 (Θ ). Ths pont has been emphaszed by, for example, Bhaskar and Obara (2002), Malath and Morrs (2002), and Phelan and Skrzypacz (2009). We go one step further and observe that ths problem corresponds to a reasonably tractable class of Markov decson problems, known as POMDP (Partally Observable Markov Decson Process). The present paper ntroduces a general technque to solve such a Markov decson problem, buldng heavly on a technque developed n operatons research and computer scence (see, for example, Kaelblng, Lttman, and Cassandra (1998)). A crucal observaton s that ths decson problem s easer to solve than t appears. An apparent dffculty n solvng ths problem s that the value functon W s defned on uncountable number of states b 4 (Θ ). Hence, computng value teraton W n+1 (b )=Γ W n(b ) or verfyng the fxed pont equaton W (b )= 6

7 Γ W (b ) for all b n prncple nvolves uncountably many calculatons (one for each b ). Ths s certanly true for a general non-lnear value functons. Fortunately, however, the theory of POMDP shows that we can confne attenton to pecewse lnear value functons, and for those partcular value functons the computaton can be exactly done wth a fnte number of calculatons. Let us explan how the theory of POMDP works to verfy that a gven fnte path preautomata m (Θ,f,T ), =1,...,N, consttute a correlated sequental equlbrum for some ntal correlaton devce r (Θ). We wll focus on player s decson problem, gven the opponents fnte state path preautomata m. The relevant state for player s the profle of hs opponents state θ,whchevolvesby the Markov transton functons T j (j 6= ). However, the true state s not drectly observable, and player only obtans partal nformaton through hs prvate sgnal ω. Ths s an nstance of a Markov decson process wth partally observable state varable. The jont dstrbuton of current sgnal ω and the next state θ 0 gven the current state and acton (θ,a ) s gven by r (ω, θ 0 θ,a ) X Y Tj (θ 0 j θ j, ω j )q(ω, ω a,f (θ )), (1) ω j6= Usng r thus defned, we can derve player s posteror belef χ [a, ω,b ] after current acton a and prvate sgnal ω gven her current belef b (Θ ): χ [a, ω,b ](θ 0 ) = Prb,a (ω, θ 0, ) Pr b,a (ω ) P θ = r (ω, θ 0 θ,a )b (θ ) Pθ q (ω a,f (θ ))b (θ ). Gven any value functon W : (Θ ) <, defne the Bellman operator Γ W by Γ W (b )= X max b (θ ) X a A θ j θ j Call W (b ) a belef-based value functon. (2) g (a,f (θ )) + δ X W (χ [a, ω,b ])q (ω a,f (θ )). ω The standard theory of dynamc programmng shows that the optmal value functon W (b ), whch represents the maxmum dscounted payoff to player under current belef b, s the unque fxed pont Γ W = W.Furthermore,W can be obtaned as the lmt of the sequence of value functons {W n n+1 },wherew = Γ W n (W 0 can be any functon). The theory of POMDP shows that the Bellman operator Γ W (2) has a much smpler expresson for some W. To explan how t works, we need to ntroduce a couple of concepts. Frst, for any fnte set of player s path automata M = {M 1,...,MK }, consder a path automaton such that 7

8 1. t starts wth a state where some pure acton a s played and 2. after ω s realzed, an automaton n M, denoted by M (ω ), s mplemented. Such a path automaton (a,m ( )) s called a one-shot extenson of path automata M. It s a path automaton constructed by attachng an ntal state to the set of path automata M. Let fm denote the set of all one-shot extensons of M. Note that M f has a fnte number (= A M Ω )ofelements. Second, for any path automaton M of player, letv M θ be player s payoff assocated wth M, when the opponents states are θ. Then the expected payoff assocated wth M under belef b 4 (Θ ) s gven by V M (b )= X v M θ b (θ ). (3) θ As an expected payoff, ths functon s lnear n belef b. Ths fact plays an mportant role n what follows. Gven those concepts, the essence of POMDP can be summarzed by the followng proposton. It shows that the Bellman operator Γ W (2) has a very smple representaton for some partcular value functons W : Proposton 1 (POMDP) Let M be a set of player s path automata and consder value functon W (b ) max M M V M (b ). Then, the Bellman operator for ths value functon s gven by Γ W (b )= max M M V M (b ), where f M s the set of one-shot extensons of M. The proof s gven by a drect calculaton. Here we provde an ntuton. The value functon W (b ) max M M V M (b ) s smply player s payoff under the constraned best reply to b (gven that player must chose a path automata n set M ). Hence, the maxmand on the rght-hand sde of the Bellman operator Γ W (2) s the payoff assocated wth some acton a today, followed by a best path automata n M. Hence, Γ W (b ) s the payoff assocated wth the best one-shot extenson of M, gven belef b (=max M M V M (b )). Why s ths proposton useful? In prncple, computng Γ W (b ) for all b 4 (Θ ) nvolves uncountably many calculatons (one for each b ). However, computaton of max M M V M (b ) s easy. Ths s smply the upper envelope of a fnte number of lnear functons V M (b ), M M f, and t can be exactly calculated n a fnte number of steps. Now let us formally explan how the theory of POMDP works. Recall that our goal s to verfy that the gven fnte path preautomaton profle m =(m 1,...,m N ) 8

9 can consttute a fnte state equlbrum. follows. The POMDP procedure s descrbes as P OMDP V alue Iteraton: Defne the ntal canddate path automata as M 0 {(m, θ )} θ Θ. In step n (n=1,2...), gven the set of ntal canddate path automata M n 1, compute the value functon by W n (b )= max M M n 1 V M (b ), where M fn 1 s the set of one-shot extensons of path automata M n 1. Ths can be computed n a fnte number of steps, because W n s the upper envelope of a fnte number of lnear functons V M (b ). Then, construct the set of canddate path automata for the nest step by "cream skmmng" the one-shot extensons: n M n M M f o n 1 W n (b )=V M (b ) for some b. Ths defnes an ncreasng sequence of value functons W 0 W 1,where, for each n<, W n s a pecewse lnear, convex and contnuous functon (because t s the upper envelope of a fnte number of lnear functons). The optmal value functon s obtaned as the lmt W =lm n W n. The monotoncty of the sequence of value functon s shown by a standard argument. It s easy to show W 0 W 1.4 By the defnton, operator Γ s monotone: W 1 W 0 mples W 2 = Γ W 1 Γ W 0 = W 1. Proceedng nductvely, we obtan W 0 W 1... Also the followng s a standard result n POMDP. Lemma 1 The optmal value functon W s convex and contnuous. 4 The proof W 0 W 1 : If the transton functons of canddate automata n M 0 are determnstc, the proof s trval (because M 0 M 0 ). Here we consder the case wth stochastc transton functons. W 1 (b ) s the payoff when () an optmal acton a s played today and () the contnuaton play after ω s gven by a best on-path automaton n M 0, gven the posteror belef b = χ [a, ω,b ]. On the other hand, W 0 (b ) s contnuaton payoff s asspcated wth some on-path automataon n M 0. Hence, t s the payoff when () some acton s played today and () the contnuaton play after ω s gven by a probablty dstrbuton over M 0 (ths s gven by the possbly stochastc transton functon T of preoutomaton m ). Snce the choce of contnuaton automata n () may not be optmal, W 0 (b ) s no greater than W 1 (b ), whch uses optmal contnuaton automata n M 0. Q.E.D. 9

10 Proof. Let M be the set of optmal path automata (.e., an element of M s a path automaton whch s optmal for some belef b ). Then, W (b ) s the upper envelope of a famly of lnear functons V M (b ), M M. Hence t s convex and contnuous. Remark 2 An mportant goal n the lterature on POMDP s to desgn a fast computer algorthm to mplement the value teraton descrbed above, and there are some free programs ("POMDP solvers") avalable over the nternet. The POMDP value teraton provdes a teratve procedure to fnd player s best reples aganst m j, whch represents varous contnuaton strateges the opponents mght use n the equlbrum. In each step n, afnte number of path automata M n s gven and the procedure search for better reples n a neghborhood of M n. Ths neghborhood (denoted M gn ) s the set of one-shot extensons of M n. The cream skmmng procedure s just to fnd undomnated strateges n M gn (n the game wth restrcted strategy spaces gven by M gn and m ). (Note that the cream skmmng procedure s to fnd strateges whch can be a best reply to some jont dstrbuton over the opponents states. A standard result n game theory shows that such a strategy s equvalent to a strategy that s not strctly domnated.) The POMDP teraton shows that repeated search for better reples by fndng undomnated oneshot extensons eventually leads to the best response. A partcularly useful mplcaton s that verfyng optmalty s easy. To show that a fnte set of path automata M provdes best reples to m, one only needs to show that there s no better reples n the one-shot extensons of M. Ths s a counterpart of the celebrated one-shot devaton prncple n the repeated games wth publc montorng. We elaborate on ths pont n Secton Fnte State Equlbrum as Multperson POMDP So when does a gven fnte state path automaton m =(m 1,...,m N ) consttute a fnte state equlbrum? Observe that (m, θ ) s optmal gven b 4 (Θ ) f and only f the dscounted value generated by path automaton (m, θ ) s exactly equal to the optmal value functon W (b ). Hence m s compatble f and only f W (b )=W 0 (b ) for some b for every. Proposton 2 Aprofle of path preautomata m =(m 1,...,m N ) s compatble f there exsts b (Θ ) such that W (b )=W 0 (b ) for =1,...,N. Once we verfy that m s compatble, then we need to fnd a jont dstrbuton r on (Θ) such that (m, r) consttutes a correlated sequental equlbrum. Thus our basc computatonal procedure works as follows. 10

11 Verfcaton Procedure: Gven fnte path preautomata m =(m 1,...,m N ) 1. Use the theory of POMDP to fnd the optmal value functon W. 2. Look for the regon of belefs where the optmal value concdes wth the payoff assocated wth a canddate path automaton (m, θ ): B θ {b W (b )=V (m,θ ) (b )}. Snce W s convex and contnuous and V (m,θ ) s lnear, B θ s a (possbly empty) closed and convex set. If ths s empty for all θ, m cannot consttute an equlbrum. Otherwse, proceed to Fnd an ntal correlaton devce r (Θ) such that r (θ ) > 0 mples r ( θ ) B θ 6= (r s the margnal dstrbuton, and r ( θ ) s the condtonal dstrbuton of θ ). If there s such r, then(m, r) s a fnte state equlbrum. If there s no such r, m cannot consttute an equlbrum. We frst dscuss the frst two steps, whch concerns wth the compatblty of path automata, then dscuss the thrd step n the next subsecton. To verfy compatblty, we frst need to fnd the optmal value functon W. One possblty s that we fnd a fxed pont W n = Γ W n = W n a fnte number of step. We have examples of ths case n Sectons It s sometmes useful to focus on a subset of belefs. A belef-closed set X (Θ ) for player s the set of player s belefs such that player s posteror belef wll never leave t f player s current belef s n t after any prvate hstory of player (ncludng off the equlbrum hstory). Formally, X s a belef-closed set f χ [a, ω,b ] X for any a A, ω Ω and b Θ. If we can fnd a fxed pont W = Γ W on X,thentmustbethecasethatW (b )=W (b ) on X. Ths s because player s dynamc programmng problem s unchanged even f we use a smaller state space X when the ntal belef b s n X. Examples of ths case are provded n Sectons The prevous case s a specal case where X = (Θ ), whch s obvously a belef-closed set. The followng proposton summarzes ths observaton. Proposton 3 Aprofle of path preautomata m =(m 1,...,m N ) s compatble f there exsts a belef-closed set X (Θ ) and W for each such that (1) W (b )= Γ W (b ) for every b X and (2) W (b 0 )=V (m,θ ) (b 0 ) for some b0 X and some θ. 11

12 Another possblty s that W n s strctly ncreasng n each step (at least for some belefs), hence t takes (or seems to take) nfnte teratons to reach the optmal value functon. Now suppose that, n such a case, there s a fnte n and some belef b where W n (b )=V (m,θ ) (b ) (= W 0 (b )). Even f ths holds for a farly large n, however, there s a possblty that n a succeedng step k>nthe value functon shfts upward W k (b ) >V (m,θ ) (b ) and the optmalty of automaton (m, θ ) at belef b s dsproved. How can we verfy the optmalty of a canddate automaton n a fnte step, whenw n does not (seem to) converge to W n any fnte step? The followng proposton addresses ths ssue. To state our proposton, we need to defne the followng concepts. Gven a profle of on-path preautomata m, wedefne a profle of on-path belef closed sets for player as X =(X (θ )) θ Θ, that satsfes θ X (θ ) 4 (Θ ) (X (θ ) can be an empty set) and the followng property: For any θ 0, θ 00,b, ω, f b X (θ 0 ) and T (θ θ 0, ω ) > 0, then χ [f (θ 0 ), ω,b ] X (θ ). Ths means that player s posteror belef never leaves the on-path belef closed sets as long as player does not devate from the specfed acton at each state. Let p (M,X,b ) be the probablty that the posteror belef n the next perod moves outsde of X when the current belef s b and a path-automaton M s played. For any belef-based value functons V and W,defne V W sup b (Θ ) V (b ) W (b ). Proposton 4 (Optmalty Verfcaton n A Fnte Step) Let X =(X (θ )) θ Θ be any profle of on-path belef closed sets of player wth respect to m. Suppose that, at the nth step of the POMDP value teraton, for any θ and b X (θ ), we have W n(b )=V (m,θ ) (b ) and V (m,θ ) (b ) max M M n ½ ¾ V M (b )+p(m,x,b ) δn+1 1 δ W 1 W 0. (4) Then, path automaton (m, θ ) s optmal at all b X (θ ): W (b )=V (m,θ ) (b ) for any b X (θ ) and any θ Θ. Remark 3 Ths proposton s useful when () W n = V (m,θ ) for some belefs but () for some other belefs the value teraton takes (or seems to take) nfnte steps to reach the optmal value functon W. The ntuton of ths proposton s qute smple. The standard theory of dynamc programmng shows that the value functon at the nth teraton s very close to the optmal one, for a large n. In partcular, the optmal value functon s bounded above by W n + 1 δ δn W 1 W 0, and the condton (4) bascally says that any one-shot devaton s unproftable, even when the player can receve ths upper bound of the optmal value off the equlbrum path. 12

13 Proof. Defne U 0 by U 0 (b ) = V (m,θ ) (b ) f b X (θ ) for some θ and U 0 (b )=W ) for every other belef (outsde of X ). (U 0 (b ) s well-defned even when b belongs to more than one sets X (θ ),X (θ 0 ),..., because our premse requres V (m,θ ) (b )=V (m,θ 0 ) (b )= = W n(b ).) We wll show that U 0 = W. We frst show that U 0 (b )=Γ U 0 (b ) f b X (θ ) for some θ. When we denote player s current expected stage payoff gven current acton a and belef b by u (a,b ) and the posteror belef n the next perod by b 0, Γ U 0(b ) s expressed as Γ U 0 (b ) = max u (a,b )+δe[u 0 (b 0 ) a,b ] ª a A = u (a 1,b )+δe[u 0 (b 0 ) a 1,b ]. Snce U 0 canbeexpressedas ( U 0 (b 0 W )= n(b0 ) ( = V (m,θ ) (b 0 ) )fb0 X (θ ) for some θ W (b0 ) otherwse, we have Γ U 0 (b ) = u (a 1,b )+δe[w n (b 0 )+ U 0 (b 0 ) W n (b 0 ) ª a 1,b ]. u (a 1,b )+δe[w n (b 0 ) a 1,b ]+p M,X 0,b δ W W n, where M 0 s any path automaton whose ntal acton s a1.recallthatp(m 0,X,b ) s the probablty that the posteror b 0 moves out of the profle of belef-closed sets ( b 0 / X (θ ) for any θ ) under automaton M 0, so that t only depends on the ntal acton of M 0. Now, by the standard theory of dynamc programmng ( W W n 1 δ δn W 1 W 0 ), we have Γ U 0 (b ) max u (a,b )+δe[w a n (b 0 ) a,b ] ª + p M,X 0 δ n+1,b 1 δ W 1 W 0. Snce the frst term on the rght hand sde s equal to Γ W n we have Γ U 0 (b ) =max M M n V M (b ), ½ ¾ max V M M M n (b )+p(m,x,b ) δn+1 1 δ W 1 W 0. (5) By the premse of the proposton (4), ths s no greater than V (m,θ ) (b ). Snce U 0(b )=V (m,θ ) (b ) for b X (θ ),wehaveestablshedthatγ U 0(b ) U 0(b ) f b X (θ ) for some θ. Now we show Γ U 0(b ) U 0(b ) f b X (θ ) for some θ. Ths follows from the fact that, f b X (θ ) for some θ, Γ U 0 (b ) u (f (θ ),b )+δe[u 0 (b 0 ) f (θ ),b ]=V (m,θ ) (b )=U 0 (b ). 13

14 The frst equalty holds because of the followng reasonng. Suppose that player plays acton f (θ ) today under current belef b X (θ ), and consder the case where the posteror belef n the next perod s b 0 X (θ 0 ). Ths means that tomorrow s state s θ 0 (for a moment, consder the case where (m, θ ) has determnstc transton functon). Therefore, the contnuaton payoff of automaton (m, θ ) s V (m,θ 0 ) (b 0 ). By defnton, ths s equal to U 0 (b0 ) (because b0 X (θ 0 )). Hence, u (f (θ ),b )+ δe[u 0 (b0 ) f (θ ),b ] represents the value assocated wth automaton (m, θ ),sothat t s equal to V (m,θ ) (b ). The case of stochastc transton functon s smlar. 5 Hence, we have shown U 0(b )=Γ U 0(b ) for b X (θ ),forsomeθ. Lastly, for b / X (θ ) for any θ,weshowγ U 0(b )=U 0(b ). ClearlyΓ U 0 (b ) W (b )=U 0 (b ) f b / X (θ ) for any θ,becauseu 0 (b ) W (b ) for every b. Hence, we have U 1 (b )=Γ U 0 (b ) U 0 (b ) for all b,andu n would be a decreasng sequence and lm n U n = W by the theory of dynamc programmng. Therefore, f U 1 (b0 )=Γ U 0 (b0 ) <W (b0 )(=U0 (b )) for some b 0 / X (θ ) for any θ,wewould have a contradcton W (b0 ) U 1 (b0 ) <W (b0 ). Hence,tmustbethecasethat Γ U 0 (b )=U 0 (b ) f b / X (θ ) for any θ. Those arguments have shown that U 0 s a fxed pont of Γ, so we obtan W = U 0 everywhere. In partcular, W (b )=U 0(b )=V (m,θ ) (b ) for any b X (θ ) and any θ Θ. An example of Proposton 4 (Example 4) s provded n Secton A Smpler Verfcaton Method When Off-Path Canddate Automata Are Also Gven Suppose that you have some guess about "comprehensve" path preautomata m and a belef-closed set X,=1,...,N that work. That s, m should nclude not only the automata to be played on the equlbrum path but also all automata used off the equlbrum path. You expect that W 0, whch s generated from m, s a fxed pont on X. We can use POMDP to verfy ths numercally as descrbed above. But there s a smpler, albet equvalent, way to verfy ths. When we have such a "comprehensve" canddate m, ourverfcaton procedure descrbed above bols down to: Belef-Based One-Shot Devaton Prncple: Canddate preautomaton m provdes the best reples aganst m on a belef-closed set X,ftcannotbemproved upon by one-shot extensons. That s, there s no one-shot extenson M of 5 When m has a stochastc transton functon, two staes θ 0 and θ may be possble for a gven posteror belef b 0 (because of the stochastc transton functon). Hence the same posteor belef b 0 may be contaned n X (θ 0 ) and X (θ ). However, the premse n ths proposton guarantees V (m,θ 0 ) (b 0 )=V (m,θ ) (b 0 )(=W n (b 0 )), sothatwecanemploythesameargumentasbefore. 14

15 {(m, θ ) θ Θ } and b X such that V M (b ) >V (m,θ ) (b ) for all θ. Ths statement s equvalent to Γ W 0(b )=W 0(b ) for all b X,wthW 0(b )= max θ V (m,θ ) (b ). In what follows, we show that, nstead of checkng Γ W 0(b )= W 0(b ) for all belefs, we only need to check ths at a fnte number of belefs (ths s essentally because the maxmzed value functon s the upper envelope of lnear functons). Let B θ,0 4 (Θ ) be the set of belefs where the followng holds W 0 (b )=V (m,θ ) (b ). Clearly [ B θ,0 covers 4 (Θ ). Snce each B θ,0 s an ntersecton of a fnte number θ of half spaces (and a hyperplane P θ b (θ )=1), t s a closed convex polyhedron. Take any convex polyhedron D such that X D. Defne a fnte subset of D as follows: n D θ = b D b s an extreme pont of D B θ,0 o, θ Θ. Remember that the objectve functon of the dynamc programmng problem s lnear n belefs. Hence, f we can verfy that Γ W 0(b )=W 0(b ) for every b D θ, whchsjustasetoffnte ponts, then we also obtan Γ W 0(b )=W 0(b ) every where n D B θ,0.e. W 0 s a fxed pont on D ( X ). Fgure 1 llustrates our pont. Ths corresponds to a two-player game, where the canddate path pre automaton of each player has three states, θ 1, θ 2,andθ 3. In the fgure, D B θk,0 s denoted by B k. D s the set of all belefs (= 4 (Θ )) n ths example, and player has canddate optmal plans m, θ 1, m, θ 2,and m, θ 3 [. conssts of seven ponts b 1,..., b 7, and each of them can be easly computed. θ D θ For example, b 3 s a soluton to the system of lnear equaltes V (m,θ 1 ) (b )=V (m,θ 2 ) (b ) V (m,θ 2 ) (b )=V (m,θ 3 ) P (b ). θ b(θ )=1 Let Θ b Θ be the set of states such that B θ,0 ncludes b 0 [ θ D θ. Our argument above can be summarzed as follows: Proposton 5 (Fnte Verfcaton of the Belef-Based One-Shot Devaton Prncple) To verfy Γ W 0 = W 0 =max θ V (m,θ ) on a belef-closed set X,t 15

16 b 1 b 2 b 3 B 1 b 4 B 2 B 3 b 5 b 6 b 7 Fgure 1: s suffcent to check the followng condtons ("no gan from one-shot devatons") at a fnte number of "extreme pont" belefs b [ D θ, θ (1) for each θ Θ b, path automaton (m, θ ) transts to an optmal path automata among m, θ 0 ª, θ 0 Θ after every realzaton of prvate sgnal ω (because (m, θ ) must be optmal gven b ),.e. χ [f(θ ), ω,b ] B θ0,0 for any ω and θ 0 wth T θ 0 θ, ω > 0. (2) for any acton a A that s not played by any θ Θ b, no one-shot extenson (a,m ( )) should generate a larger value than V (m,θ ) (b ) for any θ Θ b Ṡecton 6.2 provdes an example of ths proposton. 5.2 When Can We Fnd a Rght Intal Correlaton Devce? We show that, when a profle of path preautomata s compatble, the exstence of a rght ntal correlaton devce s guaranteed under a mld set of assumptons. Theorem 4 Let m (Θ,f,T ), =1,...,N be the canddate fnte path preautomata, and suppose that () Markov chan nduced by (m 1,...,m N ) on Θ has a unque recurrent communcaton class 6 X Θ () X contans an aperodc state 7 6 AsubsetX Θ s called a recurrent communcaton class f () any two ponts n X are mutually reachable and () no pont θ / X s reachable from any state wthn X. 7 Astatex Θ s aperodc f the greatest common dvsor of {t >0 Pr(θ(t) =x θ(0) = x) > 0} 16

17 and () for all, there s a belef b 0 such that V θ0 (b 0 )=W (b0 ) for some θ0 Θ, where W s the optmal value functon. Then, there s a probablty dstrbuton μ over X such that choosng the ntal states by μ and followng (m 1,...,m N ) s a correlated equlbrum. Example 2 Let us present an example of condtons () and (). Ths s the example n the next secton. Let m be the path preautomaton assocated wth the grm trgger strategy n our prsoner s dlemma game. The state space s Θ = {R, P }, andr and P are the cooperatve and non-cooperatve states respectvely. The Markov chan nduced by (m 1,m 2 ) has a unque recurrent communcaton class X = {(P, P)}, and (P, P) s an aperodc state (because {t >0 Pr(θ(t) =(R, R) θ(0) = (R, R)) > 0} = {1, 2,...}). Remark: A smlar observaton s made n Phelan and Skrzypacz (2009). It seems that, lke us, PS mpose some suffcent condtons on the transton of Markov chan so that there s the unque and globally stable ergodc dstrbuton. Snce the jont dstrbuton on Θ converges to the statonary dstrbuton for any ntal dstrbuton, they can use the statonary dstrbuton as the ntal jont dstrbuton to check f the fnte extended automata to be an equlbrum. Then they check whether one-shot devaton constrant s satsfed at every possble subjectve belef condtonal on each state (but the set of condtonal subjectve belefs s larger n ther settng because they keep track of belefs of on and off the equlbrum path). On the other hand, we frst fnd a belef on whch a gven path automaton s optmal. Then we mpose suffcent condtons for the exstence of unque and globally stable ergodc dstrbuton on Θ andverfythatpathpreautomatonsoptmalnthelmt,.e. when the ergodc dstrbuton s used as a correlaton devce. Proof. Let MC(Θ,m) be the Markov chan nduced by (m 1,...,m N ) on Θ. The theory of Markov chan shows that there s a unque nvarant dstrbuton μ of MC(Θ,m) and () ts support s X and () for any ntal dstrbuton of state profle μ 0 (Θ), the probablty dstrbuton of state profle at tme t, denotedby μ t, converges to μ. By condton () n the Theorem, each player has a belef b 0 under whch automaton (m, θ 0 ) s hs optmal strategy. Let μ 0 be the ntal dstrbuton of Markov chan MC(Θ,m) where player s n state θ 0 and the probablty of other players states s gven by b 0 (θ ). (More precsely, μ 0 (θ 0, θ ) = b 0 (θ ) and μ 0 (θ) =0f θ 6= θ 0.) Let μ t be the probablty dstrbuton of θ(t), gvenμ 0. Then we have lm t μ t = μ (μ s the nvarant dstrbuton wth support X). Snce μ assgns a strctly postve probablty for any θ X, sodoesμ t for all suffcently large t. Hence, for any θ X and any, the condtonal dstrbutons gven θ are well defned for all large t and we have s one. lm t μt (θ θ )=μ(θ θ ) for all θ X. 17

18 Fx any θ X and let H t(θ ) be the set of possble prvate hstores leadng to state θ at tme t: Formally, H t(θ ) s the set of prvate hstores h t such that () θ (0) = θ 0, θ (t) =θ,and()h t realzes wth a postve probablty under Markov chan MC(Θ,m) wth ntal dstrbuton μ 0. The above argument shows that H t(θ ) 6=, for all suffcently large t. Then, the condtonal probablty s well defned for all large t and expressed as μ t Pr(θ(t) =θ) (θ θ )= Pr(θ (t) =θ ) P h = t Ht (θ ) b (θ h t )Pr(ht ) Ph t Ht (θ ) Pr(ht ), (6) where b (θ h t ) s the condtonal probablty of θ (t) =θ gven prvate hstory h t (thssequaltoplayer s belef after ht ). Snce automaton (m, θ 0 ) s optmal for player at t =0, followng t after any prvate hstory whch realzes wth a postve probablty should also be optmal. In partcular, after prvate hstory h t Ht (θ ) player s n state θ and therefore followng automaton (m, θ ) must be optmal after h t. Note also that player has belef b ( h t ) after ht. Those facts, taken together, mply that automaton (m, θ ) s optmal under belef b ( h t ), for any ht Ht (θ ). Now let B θ be the set of belefs over θ under whch automaton (m, θ ) s optmal for player. Snce () B θ = {b W (b )=V θ (b )} () the optmal value functon W s a contnuous convex functon and () V θ ( ) s a lnear functon, B θ s a closed convex set (ths may be stated as a lemma elsewhere. Also note that a drect proof, based on the lnearty of expected payoff wth respect to belefs, s easy.). Equaton (6) and our argument n the prevous paragraph show that μ t ( θ ) s a convex combnaton of the belefs (.e., b (θ h t ), ht Ht (θ )) nb θ. By the convexty of B θ,wehave μ t ( θ ) B θ for all t t. By the closedness of B θ and lm t μ t ( θ )=μ( θ ) mples μ( θ ) B θ for all θ X. Ths means that the jont dstrbuton μ over ntal state profle θ X nduces a correlated equlbrum. 18

19 6 Examples 6.1 Example 1 We apply the POMDP technque to the prsoner s dlemma model analyzed by Sekguch (1997) and Bhaskar and Obara (2002). The stage payoff s gven by C D C 1, 1 1+g, l D l,1+g 0, 0 Each player s prvate sgnal s ω = c, d, whch s a nosy observaton of the opponent s acton. For example, when the opponent chose C, player s more lkely to receve the correct sgnal ω = c, but sometmes an observaton error provdes a wrong sgnal ω = d. More precsely, we assume that exactly one player receves a wrong sgnal s ε > 0 and both receve wrong sgnals wth probablty ξ > 0. For example, when acton profle s (C, C), the jont dstrbuton of prvate sgnals s c d c 1 2ε ξ ε d ε ξ Hence n ths model we have fve parameters g, l, ε, ξ, and δ. In Example 1, we assume that ε =(1 r)r, ξ = r 2 (ndependent observaton errors), r =1/6, g =1, l =1,andδ =0.9. The canddate path preautomaton m corresponds to the grm trgger strategy and s shown n the followng fgure.we show that ths preautomaton consttutes a correlated equlbrum, gven some ntal dstrbuton of states. Frst, we determne v θ θ j,player s payoff under state profle (θ, θ j ). Ths can be done by fndng a soluton to the system of equatons The soluton shows v RR =1+δ (1 2ε ξ)v RR + εv RP + εv PR + ξv PPª v RP = l + δ (ε + ξ)v RP +(1 ε ξ)v PPª v PR =(1+g)+δ (ε + ξ)v PR +(1 ε ξ)v PPª. (7) v PP =0+δv PP v RR = ,v RP = ,v PR = ,v PP =0. Let b denote player s belef that j s n state R, andletv θ (b) the expected payoff to player when hs state s θ = R, P. 8 Then, the graphs of V θ (b) = bv θ R +(1 b)v θ P, θ = R, P are gven n the followng fgure. The horzontal axs measures b. 8 In our general notaton, v θ θ j = v (m,θ )θ j and V θ = V (m,θ ). 19

20 = c,d = c P acton = D = d R acton = C Fgure 2: y x From top to bottom (on the y axs) V P (blue) and V R (black). The ntal value functon n the POMDP teraton W 0 s the upper envelope of those two lnear functons (W 0 (b) =max{v R (b),v P (b)}). In the frst step of POMDP, we consder one-shot extensons of path automata {(m,r), (m,p)}. Let M a zz 0 (a = C, D and z,z 0 = P, R) be the one-shot ex- 20

21 tenson that () starts wth a state wth acton a, () moves to state z of m after ω = c, and () moves to state z 0 of m after ω = d. For example, M CRP starts wth acton C and goes to state R (to play the grm trgger strategy) after ω = c and moves on to P (to play permanent defecton) after ω = d. It s easy to see that M CRP actually mplements the same strategy as (m,r) (the grm trgger strategy). To perform the cream skmmng of those one-shot extensons, lookng at the belef dynamcs s useful. Let χ a ω (b) denote the posteror probablty of θ j = R when the current acton, sgnal, belef of player are a, ω,andb. By Bayes rule we have b(1 2ε ξ) χ Cc (b) = b(1 ε ξ)+(1 b)(ε + ξ), χ Cd (b) = χ Dc (b) = bε b(ε + ξ)+(1 b)(1 ε ξ), bε b(1 ε ξ)+(1 b)(ε + ξ), and bξ χ Dd (b) = b(ε + ξ)+(1 b)(1 ε ξ). For our parameter values n ths example, t can be checked that χ ad (b) < χ ac (b) for all b and a = C, D (bad sgnal d always reduces the probablty that the opponent s stll cooperatng). Let b be the pont where V R and V P ntersect. Snce ½ V W 0 (b) f b b (b) = V P (b) f b b, W 0 (χ ac (b)) = V P (χ ac (b)) mples W 0 (χ ad (b)) = V P (χ ad (b)). Ths means that, for any b, the maxmum value to acheve ΓW 0 (b) n the Bellman equaton (2) s not acheved by M CPR or M DPR. Hence, for any b, ΓW 0 (b) s the assocated wth one of the followng sx automata: M CRR,M CRP,M CPP,M DRR,M DRP,M DPP. Note that M CRP and M DPP mplement the same strateges as M R (the grm trgger strategy) and M P (permanent defecton). Let V a zz 0 (b) be the expected payoff to player, when he plays ths automaton M a zz 0 under belef b. Ths s a lnear functon n belef b: V a zz 0 (b) =bv a zz 0,R +(1 b)v a zz 0,P, (8) where v a zz 0,R s the payoff under automaton M a zz 0 when the opponent plays (m j,r) (v a zz 0,P s defned smlarly). To compute v a zz 0,θ j,wecanjustusev R (b), V P (b) and belef functons χ a ω (b). For example, v Czz0,R (zz 0 = RR, P P ) s determned by ³ v Czz0,R = g 1 (C, C)+δ V z (χ Cc (1)) Pr(ω = c CC)+V z0 (χ Cd (1)) Pr(ω = d CC). 21

22 Computaton shows v CRR,R = ,v CPP,R = ,v CRR,P = ,v CPP,P = 1, v DRR,R = ,v DRP,R =1.7059,v DRR,P = ,v DRP,P = Then, the new value functon W 1 (b) s the upper envelope of sx lnear functons V a zz 0 (b), a zz 0 = CRR,CRP,CPP,DRR,DRP,DPP. It s gven by the followng fgure. y x -1-2 From top to bottom (on the y axs) V P = V DPP (blue), V DRP (green dotted), V CPP (pnk dotted), V DRR (brown dotted), V R = V CRP (black), and V CRR (red). POMDP to compute W 1 The upper envelope W 1 s almost the same as the orgnal value functon W 0 (= the upper envelope of the blue and black lnes). Note that the red lne (V CRR ) concdes wth W 1 only for very hgh belef b [ , 1], andonecancheck that χ a ω (b) does not le n ths regon for any a, ω and b. Ths mples that W 1 (χ a ω (b)) = W 0 (χ a ω (b)) for all a, ω,andb, whchshowsγw 1 = ΓW 0 = W 1. Hence POMDP converged to the optmal value functon W = W 1 on Θ =[0, 1] n two steps. Note that we can fnd a fxed pont one step earler f we restrct attenton to a belef-closed set X =[0, ( )], where s a fxed pont of χ Cc n (0, 1). 22

23 If the current belef s n X,then every posteror belef s n X gven any (a, ω ) because there s always a postve probablty that a bad sgnal s observed even when C s played. Snce V CRR matters only for b = , W 0 s the fxed pont on X. Concluson: The above fgure and our computaton show 1. the optmal value functon W concdes wth V P for b [0, 0.625] 2. the optmal value functon W concdes wth V R for b [0.625, ] Hence, when a correlaton devce can generate ntal belefs b R [0.625, ] and b P [0, 0.625], we get a correlated equlbrum where players use automata M R (the grm trgger strategy) and M p (permanent defecton). One such equlbrum s the mxed strategy equlbrum dentfed by Bhaskar and Obara, n whch the ntal belef s gven by (where the player s ndfferent between M R and M P ). 6.2 Example 1 - A Smpler Verfcaton The prevous secton llustrates how POMDP works, but for ths partcular example, there s a much smpler verfcaton of equlbrum condtons, based on Proposton 5(Fnte Verfcaton of the Belef-Based One-Shot Devaton Prncple). Let b c = ( ) be the fxed pont of χ Cc. Then, one can see that X =[0,b c ] s a belef-closed set. The reason s the followng. If we start wth b XÂ{0} and play C and observe ω = c repeatedly, the posteror belef ncreases and approaches the fxed pont b c. Ths s because the graph of posteror belef functon χ Cc s ncreasng and cross the 45 o degree lne at b c from above (.e., b c s a stable fxed pont). If current acton and sgnal s not (C, c), the posteror belef goes down. Fnally, b =0s absorbng. Hence X s belef-closed. We wll show that ΓW 0 = W 0 =max θ =R,P V θ on X. Proposton 5 shows that we only need check ths at three ponts, 0,b (the belef at whch V P and V R cross), and b c. 23

24 Checkng ΓW 0 (0) = W 0 (0) s trval, because b =0means that the opponent s playng D forever and the best reply s permanent defecton. To check ΓW 0 (b )= W 0 (b ), we check condton (1) n Proposton 5. If D s played today, the posteror belef s always less than b and the optmal contnuaton (accordng to W 0 )s permanent defecton. Ths s exactly what automaton (m,p) specfes. Now suppose C s played today. If the prvate sgnal s c, the posteror belef moves up (t s above b ), and the optmal contnuaton (accordng to W 0 ) s grm trgger. If the prvate sgnal s d, the posteror belef goes down (t s below b ), and the optmal contnuaton (accordng to W 0 ) s permanent defecton. Agan, ths s exactly what automaton (m,r) specfes. Hence condton (1) of Proposton 5 s satsfed. Lastly, let us check ΓW 0 (b c )=W 0 (b c ). We know χ Cc (b c )=b c and calculaton shows χ Cd (b c ) <b, so condton (1) s satsfed. When D s played today, χ Dc (b c ) and χ Dd (b c ) are both below b, and therefore the maxmum value of one-shot extenson whch plays D s gven by V P (b c ) (permanent defecton: the blue lne n the fgure). Snce ths s less than V R (b c ), the condton (2) of Proposton 5 s satsfed. Hence, the analyss n Sekguch (1997) and Bhaskar and Obara (2002) can be much smplfed as above, accordng to our general belef-based methodology. 6.3 Example 2 Modfy the stage game of the prevous example as follows: C C0 D C 1, 1 1, 1 ε l, 1+g C 0 1 ε, 1 1 ε, 1 ε l ε, 1+g D 1+g, l 1+g, l ε 0, 0 You can nterpret C 0 as C plus some montorng actvty that costs ε > 0. If a player chooses C 0, then he can observe the other player s acton perfectly. (Ths volates 24

25 our full support assumpton, but the analyss below remans essentally the same even when we assume that a player wth acton C 0 can observe the other player s acton almost perfectly.) From the other player, C and C 0 are ndstngushable,.e. the other player observes c wth probablty 1 r and d wth probablty r whether C or C 0 s played. We show that the grm trgger strategy stll consttutes a correlated sequental equlbrum for some ntal dstrbuton. Let s use the same parameter values for g, l, r and δ as before. Let V R and V P be the value functon assocated wth R and P as before. In the frst step of POMDP, we need to consder eght one-shot extensons ths tme: M a,z,z 0,a {C, C 0,D},z,z 0 {R, P }. Agan we don t have to consder a combnaton of (z,z 0)=(R, P ). Let V a,z,z 0 be the dscounted payoff functon assocated wth M a,z,z 0. We already know that V CPP,V DRR,V DRP are domnated. We have three new payoff functons because of C 0 : V C0RR,V C0RP,V C0PP. It s easy to show that we only need to consder V C0RP (remember that montorng s perfect gven C 0 ). How does ths new functon V C0RP affect the updated value functon? Suppose that ε =0for the tme beng. Then M C0RP domnates M CRR and M CRP because C 0 generates the same expected payoff as C n the current perod, but more nformatve than C. So V C0RP s above V CRR or V CRP. Only at b =1,V C0RP concdes wth V CRR because the player s sgnal does not convey any nformaton about the the other player s current state (whch he knows to be R). Thus V C0RP looks as follows. 9 9 Note that v C0 RP,R = v CRP,R and v C0 RP,P = l = 1. Hence (wth ε =0), V C0 RP (b) =bv C0 RP,R +(1 b)v C0 RP,P =3.1176b (1 b). 25

Finite State Equilibria in Dynamic Games

Finite State Equilibria in Dynamic Games Fnte State Equlbra n Dynamc Games Mchhro Kandor Faculty of Economcs Unversty of Tokyo Ichro Obara Department of Economcs UCLA June 21, 2007 Abstract An equlbrum n an nfnte horzon game s called a fnte state