Finding Correlated Equilibria in General Sum Stochastic Games

Size: px

Start display at page:

Download "Finding Correlated Equilibria in General Sum Stochastic Games"

Adela Nichols
5 years ago
Views:

1 Finding Correlted Equilibri in Generl Sum Stochstic Gmes Chris Murry nd Geoff Gordon June 2007 CMU-ML

2 Finding Correlted Equilibri in Generl Sum Stochstic Gmes Chris Murry nd Geoff Gordon June 2007 CMU-ML School of Computer Science Crnegie-Mellon University Pittsburgh, PA Abstrct Often problems rise where multiple self-interested gents with individul gols cn coordinte their ctions to improve their outcomes. We model these problems s generl sum stochstic gmes. We develop trctble pproximtion lgorithm for computing subgme-perfect correlted equilibri in these gmes. Our lgorithm is n extension of stndrd dynmic progrmming methods like vlue itertion nd Q-lerning. And, it is conservtive: while it is not gurnteed to find ll vlue vectors chievble in correlted equilibrium, ny policy which it does find is gurnteed to be n exct equilibrium of the stochstic gme (to within limits of ccurcy which depend on the number of bckups nd not on the pproximtion scheme). Our new lgorithm is bsed on the plnning lgorithm of [1]. Tht lgorithm computes subgme-perfect Nsh equilibri, but ssumes tht it is given set of punishment policies s input. Our new lgorithm requires only the description of the gme, n importnt improvement since suitble punishment policies my be difficult to come by.

3 Keywords: Multi-gent plnning, subgme perfect correlted equilibrium, stochstic gmes.

4 1 INTRODUCTION We model the multi-gent plnning problem, where self-interested rtionl gents interct with ech other nd with the world, s generl sum stochstic gme. The world stte nd the plyers joint ction determine the rewrds to ech plyer nd the world s next stte, nd the process repets. Since our gents re self-interested, simply finding policy tht chieves some vlue for ech gent will not suffice: we need to find policies where every gent hs n incentive to cooperte, tht is, equilibri of the gme. An equilibrium keeps every gent in line by promising rewrds for complince or thretening punishment for devition. More specificlly, in order to give gents the mximum flexibility in jointly choosing ctions, we will look for policies tht re subgme-perfect correlted equilibri. Unlike Nsh equilibri, correlted equilibri llow gents to correlte their ctions t ny given point in the gme. Tht is, gents smple from distribution over joint ctions, rther thn ech gent individully smpling her own ction nd hving the joint ction distribution be the product of the individul ction distributions. Correlted equilibri llow richer set of policies to be chieved, nd llow gents to void executing unintended joint ctions. In ddition, s we will see below, trgeting correlted equilibri llows us to develop clener lgorithm, since ech bckup opertion cn be viewed s pproximting convex set of pyoff vectors. Subgme perfection mens tht our equilibrium policies will contin no incredible threts: even fter devition by one gent (which might led to sitution which cn never be observed in equilibrium ply), no gent wishes to mke nother devition. In other words, if our policy deters devition with the thret of some punishment, the punishment policy is itself n equilibrium. (Of course, if we just wish to compute correlted equilibri without worrying bout subgme perfection, our lgorithm lso provides wy to do so.) Our focus here is on the plnning problem: given stochstic gme, we wnt to find the set of vlues tht cn be chieved in correlted equilibrium, nd policies to chieve these vlues. We won t focus on the problem of selecting n equilibrium from this set or executing policy once it is chosen: we will imgine tht modertor cn serve both of these functions. If modertor is unvilble, the negotition protocol in [1] cn be used to select n equilibrium 1 nd the cryptogrphic protocol in [2] cn be used to smple from distribution over joint ctions. 1 The negotition protocol requires disgreement policy s input, which we cn tke to be ny prespecified element of V(s strt). Good choices for disgreement policy re often domin-specific, but one resonble domin-independent choice might be tht, if the plyers disgree, they will pick vlue vector uniformly t rndom from the Preto frontier of V(s strt) nd use it s trget. 3

5 2 STOCHASTIC GAMES A stochstic gme represents multi-gent plnning problem in the sme wy tht Mrkov Decision Process [3] represents single-gent plnning problem. As in n MDP, trnsitions in stochstic gme depend on the current stte nd ction. Unlike MDPs, the current (joint) ction is vector of individul ctions, one for ech plyer. More formlly, generl sum stochstic gme G is tuple (S,s strt,p,a,t,r,γ). S is set of sttes, nd s strt S is the strt stte. P is the number of plyers. A = A 1 A 2... A P is the finite set of joint ctions. We write for joint ction, nd α for plyer s individul ction. When joint ction should be executed, but plyer p devites nd plys individul ction α insted of α, we write the resulting joint ction s p:α. We del with fully observble stochstic gmes with perfect monitoring, where ll plyers cn observe the true stte nd true joint ction. T : S A P(S) is the trnsition function, where P(S) is the set of probbility distributions over S. R : S A R P is the rewrd function. We will write R p (s,) for the pth component of R(s,). γ [0,1) is the discount fctor. Plyer p wnts to mximize the expecttion of her discounted totl vlue for the observed sequence of sttes nd joint ctions s 1, 1,s 2, 2,...: V p = γ t 1 R p (s t, t ) t=1 A (sttionry, joint) policy is function π p : S P(A) which tells the plyers how to pick their joint ction t ech stte. A nonsttionry policy is function π : ( t=0 (S A A) t S) P(A) which tkes history of sttes, recommended joint ctions, nd ctul joint ctions, nd produces distribution over recommended joint ctions for the next time step. For ny nonsttionry policy, there is sttionry policy tht chieves the sme vlue t every stte [4]; but, the sttionry policy my not be n equilibrium even if the originl nonsttionry policy is. For both kinds of policy, we imgine tht there is modertor who observes the current stte s (or the history of sttes nd ctions h), smples joint ction from π(s) (or π(h)), nd tells ech plyer her recommended ction p. We need such modertor (or n equivlent cryptogrphic protocol) to mke sure tht no plyer lerns nother plyer s recommended ction before choosing her own ction. The vlue function Vp π : S R gives expected vlues for plyer p under joint policy π t every stte. (For nonsttionry policy π we will define Vp π (h) to be the vlue fter observing history h.) And, the ction-vlue function Q π p : S A R gives expected vlues if we strt t stte s nd perform ction. (If π is nonsttionry, we will write Q π p(h,) for its vlue to p when strting t history h nd performing joint ction.) The vlue nd ction-vlue functions for π stisfy the liner equtions: V π p (h) = (π(h))()q π p(h,,) (1) 4

6 Q π p(h,, ) = R p (s(h), ) + γ s P(s s(h), )V π p ( h,,,s ) (2) Here s(h) is the stte corresponding to history h (i.e., the lst element of h). Note tht Vp π (h) in Eq. 1 depends only on Q-vlues for mtching recommended nd ctul ctions, but Eq. 2 defines Q-vlues for both mtching nd nonmtching cses. Also note tht we hve written Eqs. 1 2 for the more generl cse of nonsttionry policies; if π is sttionry, we cn simplify the equtions by replcing ech history with its finl stte. The vlue vector t stte s, V π (s), is the vector with components Vp π (s) (nd similrly for V π (h)). We will write V(s) to represent the set of vlue vectors which re chievble strting from stte s nd following ny correlted equilibrium policy (either sttionry or nonsttionry). This set is convex, since the modertor cn rndomize. 3 CORRELATED EQUILIBRIUM A joint policy π is correlted equilibrium if, when following π, no plyer ever hs incentive to devite. It s tempting to think of the correlted equilibrium condition s simply Vp π (s) Vp π (s) for ny policy π which is the sme s π except tht plyer p plys some individul ctions differently. However, this is not correct for two resons. First, if plyer p devites from π, the other plyers my rect nd chnge their ctions t subsequent sttes to punish p. So, the immedite benefit which p chieves by deviting must be weighed ginst her predicted future loss from being punished. Second, nd more subtly, p my consider not only unconditionl devitions of the form ignore wht I m supposed to do nd ply ction α insted, but lso conditionl devitions. During the execution of policy, t some time t nd stte s, p first lerns the individul ction α which she is recommended, nd then decides whether or not to follow tht ction. Lerning α tells plyer p conditionl distribution on wht joint ction will be followed; so, plyer p cn esily compute the conditionl expecttion of her future discounted vlue hving been recommended ction α. Crucilly, this vlue depends on the recommendtion α. Similrly, p s expected loss from being punished fter deviting cn lso depend on α. So, p my wish to devite fter some recommendtions nd not others. Putting these two requirements together, if policy wnts to recommend some individul ction α to plyer p fter history h, it must promise plyer p more by following α thn p would get from ny possible devition (nd the resulting punishment). More formlly, write Q π p(h,α,α ) for plyer p s expected vlue if she receives recommendtion α nd plys α insted, given tht the current history is h nd the policy is π: Q π p(h,α,α ) = P π ( α,h)q π p(h,, p:α ) (3) 5

7 For convenience, we will define Q π p(h,α,α ) to be zero if π never recommends ction α to plyer p given history h. Note tht Q π p(h,, p:α ) cn include penlty for plyer p if α α, since π cn prescribe tht the other plyers will chnge their behvior fter observing p:α insted of. With these definitions, the subgme-perfect correlted equilibrium condition is just ( h,p,α,α ) Q π p(h,α,α) Q π p(h,α,α ) (4) We cn relx the condition of subgme perfection by requiring Eq. 4 to hold only t histories h which re rechble during on-policy ply. 4 VALUE BACKUPS The lgorithm we use to find ll the chievble vlue vectors in stochstic gme is bsed on dynmic progrmming, nd is similr to (but more complicted thn) vlue itertion or Q-lerning for MDPs. The finl result of the lgorithm will be P-dimensionl convex set for ech stte, telling wht vectors of vlues re chievble in correlted equilibrium beginning in tht stte. The lgorithm will lso return informtion sufficient for us to reconstruct policy tht chieves ny one of those vlue vectors. The vlue bckups themselves ren t tht different from the norml MDP vlue bckups, s long s the multipliction nd ddition opertors re defined properly to work on convex sets of vectors. However, we must lso include pruning step which removes policies where some gent hs n incentive to devite. In this section we will describe the simpler bckup opertor tht doesn t enforce equilibrium constrints, nd thus finds ll vlue vectors chievble by ny policy in gme, rther thn only those chievble vi correlted equilibrium policy. In Section 5, we will show how to dd in the incentive constrints to rrive t the complete bckup opertor which does enforce equilibrium constrints. 4.1 MDP bckups To derive the set-vlued bckup opertor, we will strt from the ordinry Bellmn equtions for Mrkov decision processes: Q MDP (s,) = R(s,) + γ s P(s s,)v MDP (s ) (5) V MDP (s) = mx Q MDP (s,) (6) Here P(s s,) is the probbility of trnsitioning from stte s to stte s when tking ction. The MDP bckup works by treting Eqs. 5 nd 6 s ssignments: the opertor T MDP cn be written [ ] T MDP (V )(s) = mx R(s,) + γ s P(s s)v (s ) 6

8 or, in mtrix nottion, T MDP (V ) = mx [R + γp V ] (7) In this nottion, the mx opertion opertes componentwise. 4.2 Set-vlued bckups In the multi-plyer generliztion, V(s) R P is set of vlue vectors chievble strting from stte s: ech v V(s) hs one component for every plyer. But, the rules for bcking up the vlue for single plyer under fixed joint ction re exctly the sme s in Eq. 5. To pply Eq. 5 to n entire set of vlue vectors t once, we cn define ddition nd multipliction to work in the usul wy on sets of vectors: for two sets A nd B, sclr c, nd vector d, ca = {c A} d+a = {d+ A} A+B = {+b A,b B} If V is vector of sets nd M is mtrix of sclrs, the bove definitions of ddition nd sclr multipliction lso llow us to interpret the mtrix multipliction MV. With these definitions, ssuming tht V noprune (s) is the set of chievble vlue vectors t stte s, we cn write Q noprune (s,) = R(s,) + γ s P(s s,)v noprune (s ) (8) for the set of ll vlue vectors tht we cn chieve by strting t stte s nd executing ction. Eq. 8 is one hlf of the Bellmn equtions for stochstic gmes without pruning. The remining hlf defines V in terms of Q: V noprune (s) = conv A Q noprune (s,) (9) Here the conv opertor first tkes the union of its rguments, nd then finds the convex hull of the result. The reson we need conv in Eq. 9 (insted of the mx in Eq. 6) is tht we wnt ll vlue vectors tht cn be chieved from stte s, not just the one which mximizes some plyer s pyoff. (Convex combintions of chievble vlue vectors re chievble since the modertor cn rndomize mong joint ctions.) By combining Equtions (8) nd (9), we cn define the simplified trnsition opertor, which is the sme s one itertion of the exct vlue itertion lgorithm except tht it omits the pruning step. T noprune (V) = conv A [R + γp V] (10) If our gol were to find ll vlue vectors chievble vi ny policy, regrdless of equilibrium constrints, then repeted ppliction of T noprune would converge 7

9 to the correct nswer. 2 However, we wnt to prune out those vlues tht ren t chievble by correlted equilibrium policy. The following section detils how to do so. 5 INDIVIDUAL RATIONALITY In the full Bellmn equtions for discounted stochstic gmes (nd in the corresponding bckup opertor), we cn define Q exctly s we did for the no-pruning cse (cf. Eq. 8): Q(s,) = R(s,) + γ s P(s s,)v(s ) (11) But now, insted of finding V(s) by tking the convex hull of Q(s,) for ll, we need to define new pruning opertion which removes non-equilibrium ction distributions so tht V(s) = prune Q(s,) (12) The rest of this section defines the prune opertor; Appendix A shows tht our definition is correct. (Tht is, it shows tht the unique mximl solution of Eqs consists of exctly the vlue vectors chievble in subgme-perfect correlted equilibrium.) In the expression V = prune Q, the set V nd ll of the sets Q re subsets of R P. 5.1 Anlyzing V π ( s ) By definition, V(s) consists of ll vlue vectors V π ( s ) tht cn be chieved strting from stte s under ny subgme-perfect correlted equilibrium policy π. We cn brek π into two pieces: first, we hve n immedite distribution over recommended ctions, ω = π( s ). And second, we hve policy for the future: if we recommended n ction nd took possibly-different ction, then our future policy is π s,, (h) = π( s,,,h ). From Eqs. 1 2 nd the definitions of ω nd π s,,, we know tht [ ] V π ( s ) = ω() R(s,) + γ s P(s s,)v πs,, ( s ) (13) Eq. 13 shows tht the future policy π s,, influences V π ( s ) only through the vlue vectors V πs,, ( s ) t sttes s tht we might rech fter one step. Since we cn choose our future policy rbitrrily (our future ctions re not limited by how we rrived t s ), nd since V(s ) tells us wht vlue vectors we cn chieve t s, Eq. 13 mens tht we don t need to worry bout our exct future 2 Repeted ppliction will converge s long s we initilize V(s) to nonempty, bounded set for ech s. For the full lgorithm, we will in ddition need to initilize V(s) to set contining the correct nswer, such s the lrge cube suggested in Fig. 1. 8

10 policy: for purposes of computing V(s), we just need to keep trck of how much vlue our future policy will give us t ech stte s hving followed ech possible joint ction. In fct, by exmining Eq. 11, we cn see tht the term in brckets in Eq. 13 is n element of Q(s,). So, we hve V π p ( s ) = ω()q (14) where ω is probbility distribution over ctions nd where q Q(s,) for ech. If we llowed ll choices of ω nd q, we would rrive t the no-pruning Bellmn equtions described in Sec. 4. But, not every choice of ω nd q will correspond to n equilibrium policy. (So, prune Q(s,) will be subset of conv Q(s,).) Therefore, to compute V(s), we still need to enforce the individul rtionlity constrints, Eq Enforcing Eq. 4 Fixing h = s nd substituting the definition of Q π p(h,α,α ) into Eq. 4, we hve: P π ( α,s)q π p( s,,) P π ( α,s)q π p( s,, p:α ) (15) For π to be rtionl t s, Eq. 15 must hold for ll p, α, nd α with P π (α s) > 0. By Byes rule, P π ( α,s) = P π (α,s)p π ( s)/p π (α s). The first term, P π (α,s), is either 0 or 1 depending on whether α is consistent with. The second term, P π ( s), is given by our immedite ction distribution ω. And, since the lst term is positive nd doesn t depend on, we cn fctor it out nd cncel it from both sides of Eq. 15: ω()q π p( s,, p:α ) (16) α ω()q π p( s,,) α for ll p, α, nd α. (We cn drop the qulifiction P π (α s) > 0 since Eq. 16 is vcuous if P π (α s) = 0.) The sum is over ll ctions which re consistent with the recommendtion α. On the left-hnd side of Eq. 16, Q π p( s,,) is just the pth element of q from Eq. 14 bove. On the right-hnd side, Q π p( s,, p:α ) tells us how much vlue plyer p will get by deviting to α. Since the off-policy Q-vlue Q π p( s,, p:α ) does not directly influence V π (s), we don t need to be concerned with its exct vlue except to mke sure tht Eq. 16 is stisfied. Tht is, we only need to mke sure tht policy π punishes devitions sufficiently severely to deter them. By vrying the future policy π s,, over equilibrium policies, we cn mke the vector Q π ( s,, p:α ) be n rbitrry element of Q(s, p:α ). So, define Q p (s,) = min Q Q(s,) Q p 9

11 And, define Q(s,) to be the vector whose pth element is Q p (s,). Q p (s,) is the vlue of the hrshest punishment tht the other plyers cn impose on plyer p (within the bounds of equilibrium) given tht we strt with stte s nd ction. So, if Eq. 16 cn be stisfied t ll, it will be stisfied when Q π p( s,, p:α ) = Q p (s, p:α ). Tht mens tht Eq. 16 reduces to α ω()q α ω()q(s, p:α ) (17) for ll plyers p, recommendtions α, nd devitions α. Here ω is probbility distribution, nd q Q(s,) for ech. In Eq. 17, the opertion on length-p vectors is interpreted componentwise. If we wish to compute correlted equilibri without regrd to subgme perfection, we cn replce Q p (s,) by the miniml vlue tht ny fesible policy ssigns to plyer p; since this punishment will not be visited during equilibrium ply, it does not need to be n equilibrium unless we wnt subgme perfection. To compute the miniml fesible vlue for ech plyer t ech stte nd ction, we cn run the no-pruning version of our lgorithm to completion. 5.3 Putting it together Summrizing, we hve tht { } V(s) = ω()q (18) where the distribution ω nd the vlue vectors q Q(s,) re constrined to stisfy Eq. 17. Eqs constitute definition of the prune opertor from Eq. 12. However, to mke it esier to compute V(s), we will rerrnge this definition slightly: define q = ω()q And, ssume tht we re given system of inequlities defining Q(s,), Q(s,) = {q M q + b 0} (A mtrix M nd vector b for such system lwys exist, since Q(s,) is convex set; however, M nd b my be infinitely tll if Q(s,) hs curved boundry.) Then V(s) is chrcterized by the liner system of inequlities V = q (19) q ω()q(s, p:α ) p,α,α (20) α α M q + ω()b 0 (21) 10

12 ω() = 1 (22) ω() 0 (23) The constnts Q(s,) in Eq. 20 cn be precomputed from Q(s,). Inequlity 21 ensures tht q ω()q(s,), where ω()q(s,) is copy of Q(s,) scled down by ω(). 5.4 Policy execution If we hve solution to the Bellmn equtions (Eqs ), then we cn use Eqs to find, for ny trget vlue vector v V(s), probbility distribution ω() nd vlue vectors q Q(s,) such tht v = ω()q. (If q = 0, then q is not determined, but my be chosen rbitrrily since ω() = 0.) And, we know tht ech q stisfies q = R(s,) + γ s P(s s,)v (s ) (24) for some vectors v (s ) V(s ). The distribution ω nd vectors v (s ) for ll nd s tell us how to chieve the trget vlue vector v from stte s: 1. Drw joint ction ccording to the distribution ω(). 2. Attempt to execute tht joint ction. 3. If plyer p devited, switch to the policy corresponding to Q p (s,) Else, let s be the new stte. () Set the current stte s to be s. (b) Set the trget vlue vector v to be v (s ). (c) Recompute ω, q, nd v (s) for ll,s ccording to Eqs nd 24. (d) Goto 1. 6 ALGORITHM Using the Bellmn equtions (Eqs ) nd the liner-inequlity representtion of the prune opertor (Eqs ), we cn design dynmic progrmming lgorithm which computes n pproximtion to V(s) for ll s. We first present conceptul, exct lgorithm tht is intrctble to implement; below, in Sec. 6.1, we show how to modify the lgorithm so tht we cn implement it efficiently. 3 A subsequent devition by nother plyer p will cuse us to switch to some other punishment policy, corresponding to Q p (s, ) for the stte s nd ction involved in the devition. 11

13 Initiliztion for s S V(s) {V V Rmx 1 γ } end Repet until converged for itertion 1,2,... for s S Compute vlue vector set nd punishment vlue for ech joint ction for A Q(s,) {R(s,)} + γ s S P(s s,)v(s ) for p {1...P }, Q p (s,) min Q Q(s,) Q p end Do bckups for enforceble one-step joint-ction distributions V(s) {V (V,ω, q ) IRC} end end Figure 1: Dynmic progrmming using exct opertions on sets of vlue vectors. The set IRC is the intersection of the individul rtionlity constrints in Eqs In our lgorithm, the set V(s) for ech stte is initilized to lrge P- dimensionl hypercube, centered t the origin nd extending distnce Rmx 1 γ in ech direction, where R mx is the bsolute vlue of the lrgest rewrd vilble in the gme. This initiliztion mens tht V(s) strts out contining ll vlue vectors which plyers could ever hope to chieve. From this initil vlue of V, we compute Q ccording to Eq. 11. Then, for ech stte s, we intersect the liner inequlities to find vlue vectors V, probbility distributions ω, nd scled Q-vlues q tht re consistent with individul rtionlity for one step into the future. We updte V(s) to be the set of ll one-step individully rtionl vlue vectors V. We then continue in this mnner, lterntely recomputing Q nd V; ech dditionl such bckup extends the plnning horizon nd the individul rtionlity constrints nother step into the future. Finlly, once we hve computed V to the desired ccurcy, we cn select n element of V(s strt ) nd begin executing our policy s described in Sec Write V T(V) for the bckup opertion. We show in Appendix A tht T k (V) converges s k. Unfortuntely, we hve not been ble to show tht the convergence is liner s it is for MDPs: the problem is tht we could need to find very ccurte pproximtion to V(s ) before we relize tht some ction distribution ω is irrtionl t s. If tht ction distribution ws being used s punishment to support equilibri, such chnge could cuse rpid djustment to our vlue sets. 12

14 6.1 Approximte bckups The lgorithm given in the previous section is intrctble since it opertes on rbitrry convex sets. We cn mke trctble lgorithm by storing finite number of points to represent ech convex set V(s). Since the sets re convex, we need only store points on the exterior. Since the vlue vectors lie in R P, we use finite set of directions w 1...w K R P nd store only the points on the exterior of the convex sets frthest in ech direction w i. More precisely, t step k of the lgorithm we pproximte ech convex set T k (V 0 )(s) s V k (s) = conv {V i k(s) i = 1...K} where V i k (s) is the point in T(V k 1)(s) frthest in the w i direction, V i k(s) = rg mx V w i V T(V k 1 )(s) This pproximtion is conservtive, since our pproximte V k (s) is contined in the exct T k (V 0 )(s). So, while our pproximte lgorithm might miss equilibri, it will never erroneously clim tht non-equilibrium is n equilibrium. Using this representtion, the pproximte lgorithm is the sme s the exct lgorithm in Fig. 1, except tht we replce the line with V(s) {V (V,ω, q ) IRC} for i = 1...K, V i (s) rg mx V V w i s.t. (V,ω, q ) IRC (25) The constrints nd objectives for the mximiztions in Eq. 25 re liner, so we cn implement the pproximte lgorithm vi clls to stndrd LP solver. In prticulr, since the sets Q(s, ) re represented s the convex hull of finitely mny points, there re finitely mny liner constrints on the q vectors. We cn sve some time by not computing Q(s,) explicitly, but insted finding the frthest point in Q(s,) in the direction w i directly from the points V i (s): rg mx Q w i = R(s,) + γ P(s s,)rg mx V w i Q Q(s,) v V(s) s = R(s,) + γ s P(s s,)v i (s) 7 EXPERIMENTS In order to test our lgorithm nd our intuition, we creted simple repeted gme clled three-wy mtching pennies. 4 This is three-plyer symmetric gme where every plyer holds penny nd ech turn cn revel heds or tils. If plyers one nd two revel the sme side of the coin, plyer three gets 4 Repeted gmes re subset of generl sum stochstic gmes. 13

15 Tble 1: Pyoff mtrix to the three-wy mtching pennies gme. Plyer 1 Plyer 2 Plyer 3 Tils Heds Tils Tils (0,0,1) (0,0,1) Tils Heds (1,1,0) (0,0,1) Heds Tils (0,0,1) (1,1,0) Heds Heds (0,0,1) (0,0,1) point. If plyers one nd two revel different sides of their coins, they my ern point, depending on wht side plyer three shows. Thus the gme inherently fvors plyer three. The pyoffs re shown in tble 1. An importnt point is tht plyers one nd two will both be better off if they coordinte their joint ctions: ny time they both revel the sme side of the coin they will do poorly. Thus we expect tht solution policies which re correlted equilibri will llow plyers one nd two to do better thn solution policies which re merely Nsh equilibri. Figure 2 shows the chievble vlue vectors for this gme. By coordinting their ctions nd lwys plying either HT or TH, plyers one nd two cn do s well s plyer three nd score hlf the time (corresponding to the point (1,1) in the figure). This is the best plyers one nd two cn do: if they ply HT more thn hlf the time (or less thn hlf the time), plyer three will just ply T (or H) nd reduce their score. If plyers one nd two didn t coordinte but mde their ction choices independently, they would score qurter of the time (corresponding to the point (0.5,1.5)), since they would often end up plying the sme side of the coin nd gurnteeing plyer three point. And, if plyers one nd two nticoordinte their ction choices, they cn score none of the time (corresponding to the point (0,2)). This unfortunte outcome is still n equilibrium: plyers two nd three cting together cn keep plyer one from scoring ny points (nd likewise plyers one nd three cn keep plyer two from scoring), so with the thret of such punishment, there is no incentive for plyer one or two to devite. 8 CONCLUSION We presented trctble pproximtion lgorithm for finding subgme-perfect correlted equilibri in generl sum stochstic gmes. This plnning lgorithm is importnt since it llows self-interested gents to find policies where they cn jointly chieve higher pyoffs by cooperting, nd since subgme-perfect correlted equilibrium is strong condition. To use this lgorithm in prctice, the gents would need to coordinte on vlue vector in V(s strt ) nd would need to simulte modertor; for these purposes the gents cn use the negotition protocol in [1] nd the cryptogrphic protocol in [2], respectively. 14

16 2 1.5 Vlue to plyer Vlue to plyers 1 nd 2 Figure 2: Achievble correlted equilibrium vlue vectors for the three-wy mtching pennies gme with discount fctor of γ = 0.5. Though the true vlue vectors lie in three dimensions, the gme s pyoff structure trets plyers one nd two the sme, so the plot shows vlue to plyers one nd two on the X xis nd vlue to plyer three on the Y xis. Acknowledgements The uthors would like to thnk Ron Prr for helpful comments nd discussion t n erly stge of this work. This reserch ws supported in prt by grnt from DARPA s Computer Science Study Pnel progrm. References [1] Chris Murry nd Geoff Gordon. Multi-robot negotition: Approximting the set of subgme perfect equilibri in generl-sum stochstic gmes. In NIPS, [2] Yevgeniy Dodis, Shi Hlevi, nd Tl Rbin. A cryptogrphic solution to gme theoretic problem. In Lecture Notes in Computer Science, volume 1880, pge 112. Springer, Berlin, [3] D. P. Bertseks. Dynmic Progrmming nd Optiml Control. Athen Scientific, Msschusetts, [4] Prjit K. Dutt. A folk theorem for stochstic gmes. Journl of Economic Theory, 66:1 32,

17 A Proofs In this ppendix we will prove tht the exct lgorithm presented in Figure 1 is correct. We do so by mking the following rguments: Monotonicity The sets of vlue vectors stored by the lgorithm decrese monotoniclly s the lgorithm progresses, so if the lgorithm is llowed to run for long enough they will converge to finl sets of vlue vectors. (Since T is continuous, the finl sets form fixed point V of T, TV = V.) Achievbility After the lgorithm hs converged, ll vlue vectors tht it stores re chievble in subgme-perfect correlted equilibrium. Conservtive initiliztion The vlue vector sets re initilized to contin ll possible vlue vectors tht could be chieved in the gme. Conservtive bckups As the lgorithm runs, it never throws out vlue vector which is chievble in correlted equilibrium. These properties, together, ssure tht the lgorithm finds ll vlue vectors chievble in correlted equilibrium. It is interesting to note tht there my be more thn one solution to the Bellmn equtions (Eqs ). (In prticulr, setting V(s) = for ll s yields trivil fixed point.) The chievbility property mens tht ll of these fixed points contin only vlue vectors chievble in subgme-perfect correlted equilibrium. However, becuse of the conservtive initiliztion nd conservtive bckup properties, our lgorithm finds the (unique) lrgest fixed point, which includes ll equilibrium vlue vectors. The conservtive initiliztion property is esy to show: Figure 1 initilizes V(s) to the hypercube [ R mx /(1 γ),r mx /(1 γ)] P, nd no policy cn possibly chieve more thn this mount of rewrd. The following sections contin proofs of the remining properties. A.1 Monotonicity We will show, first, tht the bckup opertor T is monotone. Tht is, if V nd W re two set-vlue functions with V(s) W(s) for ll s (we will write this property s V W) then T(V) T(W) (26) We will then show tht, s long s we pick our initil set-vlue function V 0 ppropritely, T(V 0 ) V 0 (27) Using (27) s bse cse nd (26) s n inductive step, we will then hve tht, s climed, our sequence of vlue functions decreses monotoniclly s our lgorithm progresses. Lemm 1 T is monotone (Eqution (26)). 16

18 Proof: Write Then by definition Q (V) = R + γp V (28) T(V)(s) = prune Q (V)(s) (29) It is esy to see tht, if V(s) W(s) for ll s, then Q (V)(s) Q (W)(s) for ll s nd : liner opertions on sets preserve subset reltionships. So, if we cn show tht the pruning opertor lso preserves subset reltionships, we will hve the desired result. The sets Q (V)(s) pper in two plces in Eqs (the definition of the pruning opertor): first, they influence the fesible set for q, nd second, they influence the punishment vlues Q(s,). In the first cse, shrinking Q (V)(s) only leds to tighter constrint on q. And, in the second cse, shrinking Q (V)(s) only leds to higher vlue for Q(s,); since Q(s,) ppers with positive sign on the right-hnd side of constrint, rising Q(s,) lso results in tighter constrint. So, the fesible set described by Eqs when using Q (V)(s) is contined in the fesible set when using Q (W)(s), which is wht we wnted to prove. Lemm 2 With V 0 defined s in the initiliztion of Fig. 1, T(V 0 ) V 0. Proof: By stndrd MDP rgument, Q (V 0 ) V 0 : the stochstic mtrix P mps the origin-centered cube V 0 into itself, nd the discount fctor γ shrinks the cube enough tht the offset R cnnot plce the resulting set outside of V 0. But, s rgued in Sec. 5, prune Q conv Q ; the desired result follows. Given these two lemms, the inductive rgument outlined t the beginning of the section shows tht (T k+1 (V 0 ))(s) (T k (V 0 ))(s) for ech s nd k. So, the sequence (T k (V 0 ))(s) converges for ech s, since it is decresing nd bounded below by the empty set. (In fct, T k (V 0 ) is bounded below by ny fixed point V: becuse no policy cn chieve more thn R mx /(1 γ) or less thn R mx /(1 γ), we know V V 0. By monotonicity, T k V T k V 0, nd by the fixed point ssumption, T k V = V. So, T k V 0 contins V for ll k. Therefore, V = lim k T k V 0 contins ny fixed point V, mening tht V is the unique lrgest fixed point of T.) A.2 Achievbility Lemm 3 Let V be fixed set of T, tht is, V = T(V ). For ny v V (s), the policy π v,s described in Sec. 5.4 chieves v in expecttion strting from stte s. And, no gent hs n incentive to devite from π v,s t ny step. Proof: The proof will be by induction. Specificlly, we will show tht, for ll v nd s, following π v,s for k steps will yield n ctul expected discounted vlue vector A k (v,s) which stisfies A k (v,s) v γ k R mx 1 γ (30) 17

19 (The norm will lwys be the mx (infinity) norm, but we will leve off the subscript to void clutter.) So, lim A k(v,s) = v (31) k for ll s nd v V (s). Since the prune Q opertion enforces incentive constrints under the ssumption tht the Q sets correctly describe chievble vlues, Eq. 31 mens tht incentive constrints re correctly enforced. (More precisely, the incentive for ny plyer to devite is bounded by twice the mxnorm error in chieving trget vlue vector; since Eq. 31 shows tht this error is zero in the limit, there is no incentive to devite.) Bse cse: Following ny policy for 0 steps from stte s chieves A 0 (v,s) = 0, while v V (s) mens tht v Rmx 1 γ. So, A 0(v,s) v γ 0 Rmx 1 γ. Inductive cse: By the inductive hypothesis, we cn strt in ny stte s nd, for ny v V (s), chieve in k steps vlue vector A k (v,s) stisfying A k (v,s) v γ k Rmx 1 γ. Since V is fixed set, for every v V (s) there exists distribution ω nd vlues v,s for ll nd s stisfying one-step incentive constrints nd [ ] v = ω() R(s,) + γ P(s s,)v,s s (32) v,s V (s ) (33) Our k + 1-step policy will use the ction distribution ω on the first step, nd will trget v,s for the next k steps to chieve [ ] A k+1 (v,s) = ω() R(s,) + γ s P(s s,)a k (v,s,s ) (34) This lets us write the mx-norm distnce between the vlue chieved by the k + 1-step policy nd the trget vlue s A k+1 (v,s) v = A k+1(v,s) [ ω() R(s,) + γ P(s s,)v,s ] (35) s = ω()γ P(s s,)[a k (v,s,s ) v,s ] (36) s γ ω()p(s s,) A k (v,s,s ) v,s (37) s γ ω()p(s s,)γ K R mx (38) 1 γ s = γ K+1 R mx 1 γ 18 (39)

20 Here Eq. 35 expnds v using eqution 32. Eq. 36 expnds A k+1 (v,s) using eqution 34. This expnsion lets the rewrd terms ω()r(s,) cncel out. Eqution 37 uses the fct tht the norm of sum of terms is not greter thn the sum of the norms of the terms. Eqution 38 uses the inductive hypothesis, which sys tht K-step policy strting from ny stte s cn cn chieve vlue within γ K Rmx 1 γ of ny vlue v(s ) V (s ). The lst eqution uses the fct tht probbility distributions sum to one. Tking Eqs together, we hve verified the inductive cse nd hve therefore proven the lemm. A.3 Conservtive bckups Lemm 4 If V 0 contins the vlue vectors for ll subgme-perfect correlted equilibri t ll sttes, then T k (V 0 ) lso contins the vlue vectors for ll subgme-perfect correlted equilibri t ll sttes. Proof: The proof is by induction. The bse cse (k = 0) is ssumed in the lemm. For the inductive step, ssume the lemm is true for T k. The bckup opertor consists of two steps, defined in Eqs The first step throws out only vlue vectors tht re not chievble by n rbitrry initil ction followed by ny equilibrium policy. The second step throws out only vlue vectors for which the individul rtionlity constrints re violted t the first step, nd must therefore leve subgme-perfect correlted equilibri untouched. 19

21 Crnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA Crnegie Mellon University does not discriminte nd Crnegie Mellon University is required not to discriminte in dmission, employment, or dministrtion of its progrms or ctivities on the bsis of rce, color, ntionl origin, sex or hndicp in violtion of Title VI of the Civil Rights Act of 1964, Title IX of the Eductionl Amendments of 1972 nd Section 504 of the Rehbilittion Act of 1973 or other federl, stte, or locl lws or executive orders. In ddition, Crnegie Mellon University does not discriminte in dmission, employment or dministrtion of its progrms on the bsis of religion, creed, ncestry, belief, ge, vetern sttus, sexul orienttion or in violtion of federl, stte, or locl lws or executive orders. However, in the judgment of the Crnegie Mellon Humn Reltions Commission, the Deprtment of Defense policy of, "Don't sk, don't tell, don't pursue," excludes openly gy, lesbin nd bisexul students from receiving ROTC scholrships or serving in the militry. Nevertheless, ll ROTC clsses t Crnegie Mellon University re vilble to ll students. Inquiries concerning ppliction of these sttements should be directed to the Provost, Crnegie Mellon University, 5000 Forbes Avenue, Pittsburgh PA 15213, telephone (412) or the Vice President for Enrollment, Crnegie Mellon University, 5000 Forbes Avenue, Pittsburgh PA 15213, telephone (412) Obtin generl informtion bout Crnegie Mellon University by clling (412)

Reinforcement Learning

Reinforcement Learning Reinforcement Lerning Tom Mitchell, Mchine Lerning, chpter 13 Outline Introduction Comprison with inductive lerning Mrkov Decision Processes: the model Optiml policy: The tsk Q Lerning: Q function Algorithm