A Polynomial-time Nash Equilibrium Algorithm for Repeated Games

1 A Polynomial-time Nash Equilibrium Algorithm for Repeated Games. Michael L. Littman (Rutgers University) and Peter Stone (The University of Texas at Austin).

2 Main Result: Present a polynomial-time algorithm for computing a Nash equilibrium for a 2-player, average-payoff repeated game. Not: a polynomial-time Nash equilibrium algorithm for one-shot games. That is a well-known open problem, possibly unnecessarily hard. 7/22/04 Polytime Repeated Nash 2

3 Example: Grid Game 3 (Hu & Wellman 01). Players A and B. Actions: U, D, R, L, X. No move on collision. Semiwalls (50%). Rewards: -1 for step, -10 for collision, +100 for goal, 0 if back to initial config. Both can get goal.

4 Choices in Grid Game (see: Hawks/Doves, Traffic, Chicken). Average reward (A, B) by choices: (C, S): (32.3, 16.0); (S, C): (16.0, 32.3); (C, C): (-1.0, -1.0); (S, S): (15.8, 15.8); mix: (15.9, 15.9); (L, F): (25.7, 25.8); (F, L): (25.8, 25.7).

5 Grid Game 3: Matrix. [Table: A's payoffs, with rows and columns C, S, L, F.] B's matrix is the transpose of this.

6 One-Shot Strategy: We play 1 round of GG3 (a bimatrix game). A strategy is a probability distribution over choices. How do we choose?

7 Security Level Solution: A doesn't know what B will do. Maximize reward in the worst case. Defense: if A plays C (prob. 0.01) and S (prob. 0.99), A's worst cases are B playing C or F (15.85). Attack: if B plays C (prob. 0.49) and F (prob. 0.51), A's best choices are C and S (15.85). Computed efficiently via linear programming. Too pessimistic/paranoid?
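The defense computation can be illustrated for the special case where the defending player has only two actions. This is a minimal sketch, not the linear program the slide refers to: with two rows, the worst-case payoff is a piecewise-linear function of the mixing probability, so it is enough to test the endpoints and the points where two column lines cross. The function name `security_level_2xn` and the matching-pennies matrix are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def security_level_2xn(payoffs):
    """Maximin mix for a row player with exactly two actions.

    payoffs[i][j] is the row player's payoff for row action i and
    column action j. Returns (probability of row 0, guaranteed payoff).
    """
    top, bottom = payoffs
    cols = range(len(top))

    def worst_case(p):
        # Payoff guaranteed when row 0 is played with probability p.
        return min(p * top[j] + (1 - p) * bottom[j] for j in cols)

    # The optimum lies at an endpoint or where two column lines intersect.
    candidates = [0.0, 1.0]
    for j, k in combinations(cols, 2):
        slope_j = top[j] - bottom[j]
        slope_k = top[k] - bottom[k]
        if slope_j != slope_k:
            p = (bottom[k] - bottom[j]) / (slope_j - slope_k)
            if 0.0 < p < 1.0:
                candidates.append(p)

    best_p = max(candidates, key=worst_case)
    return best_p, worst_case(best_p)

# Hypothetical example: matching pennies from the row player's view.
MATCHING_PENNIES = [[1, -1], [-1, 1]]
p_star, level = security_level_2xn(MATCHING_PENNIES)
```

For games with more than two actions per player, such as GG3, a general linear-programming solver would be the natural tool; the breakpoint search here only keeps the example dependency-free.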

8 Nash Equilibrium: Pair of strategies such that neither player has incentive to deviate unilaterally. Always exists (Nash 51). Sometimes mixed. [Figure: GG3 payoff matrices for A and B over choices C, S, L, F.]

9 Nash Values for GG3: (C, S) = (32.3, 16.0), very imbalanced, 24.2 each; (S, C) = (16.0, 32.3), very imbalanced, 24.2 each; ~1/2 mix (C/S, C/S) = (15.9, 15.9), 15.9 each; (L, F) = (25.7, 25.8), not Nash, nearly balanced, 25.8 each. Computationally difficult to find in general.

10 Repeated Games: What if we face each other multiple times? Strategies can be a function of history and can be randomized. Nash equilibrium still exists, of course. Philosophical claim: equilibrium assumes games repeated; players choose best response. Computational observation: easier to find.

11 Equilibrium in Repeated GG3 A: B: B faces L or C. Achieves max via F. Average: A faces F or C. L gets C gets But best vs. C gets 16.0, bringing avg to

12 Observations: Can balance payoff by alternating roles. Like tit-for-tat from PD (Axelrod 84). Related to the folk theorem.

13 Repeated Games are Special. Folk Theorem (e.g., Osborne & Rubinstein 94): For any repeated game under the average-reward criterion, any achievable payoff profile that dominates the security-level payoffs is the payoff profile of a Nash equilibrium pair. Proof: Achievable payoff stabilized by each player threatening to reduce the other to its security level.

14 Algorithmic Application. Algorithmic Result (Littman & Stone 03): For any two-player repeated game under the average-reward criterion, a Nash equilibrium pair of controllers can be synthesized in polynomial time. Builds on the structural Folk Theorem. Computational and representational result. Proof: Two tricks.

15 Two-Player Plot: Mark payoff for each action combination. Mark security level. Subtract security level (advantage game).

17 Two Cases. Mutual advantage: there is one action combination, or a pair of action combinations that can be averaged, giving a point that dominates the security level. Otherwise: there isn't.

18 Noticing Mutual Advantage. Easy-to-state way: compute the convex hull. Easy-to-compute way: check all pairs of action combinations. Advantage payoffs: $x = (x_1, x_2)$, $y = (y_1, y_2)$. Compute $w_x = \frac{-y_2 (x_1 - y_1) - y_1 (x_2 - y_2)}{2 (x_2 - y_2)(x_1 - y_1)}$. If $0 \le w_x \le 1$, $z = w_x x + (1 - w_x) y$ dominates security iff any combination does. Natural choice: Nash bargaining solution (Nash 50).
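The pairwise check can be sketched as follows. The weight formula is the one on the slide; it is the stationary point of the product of the two advantages along the segment between x and y. Everything else is an assumption made for illustration: the function names, the choice to also test the two endpoints (needed when the denominator is zero or the weight falls outside [0, 1]), and the reading of "dominates security" as both advantages being strictly positive.

```python
def stationary_weight(x, y):
    """Weight w_x on x from the slide's formula, or None if undefined."""
    d1 = x[0] - y[0]
    d2 = x[1] - y[1]
    denom = 2 * d2 * d1
    if denom == 0:
        return None  # segment is parallel to an axis; formula does not apply
    return (-y[1] * d1 - y[0] * d2) / denom

def best_mix(x, y):
    """Best mutually advantageous mix of advantage payoffs x and y.

    Returns (w, z) with z = w*x + (1-w)*y maximizing the product of
    advantages among the tested candidates, or None when no candidate
    has both advantages strictly positive (an assumed criterion).
    """
    candidates = [0.0, 1.0]
    w = stationary_weight(x, y)
    if w is not None and 0.0 <= w <= 1.0:
        candidates.append(w)

    best = None
    for w in candidates:
        z = (w * x[0] + (1 - w) * y[0], w * x[1] + (1 - w) * y[1])
        if z[0] > 0 and z[1] > 0:
            if best is None or z[0] * z[1] > best[1][0] * best[1][1]:
                best = (w, z)
    return best

# Hypothetical advantage payoffs: each combination favors one player only.
result = best_mix((4, 0), (0, 4))
```

Running `best_mix` over every pair of action combinations, as the slide suggests, costs a number of evaluations quadratic in the number of combinations.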

19 Counting Node Representation. Nodes: probability distributions on actions. Edges: opponent actions. Counting nodes: repeat count, escape (trick 1). [Figure: counting-node diagram.]

20 Alternation: Repeat one, then the other. Repeat.

21 Mutual Advantage Strategies: Punish via attack strategy (α). Formulae for alternation $(r_i, r_j)$ and punishment $(a_1, a_2)$ counts are in the paper.

22 Otherwise... Check defense against defense. If Nash, done. If not, at most one player can improve unilaterally (since there is no mutual advantage). Defense against the improved strategy is Nash (trick 2). All steps polytime. Finds equilibrium.

23 Conclusion: Threats can help. Find repeated Nash in polynomial time. Very simple structure for symmetric games. Some ideas work for sequential games.

24 Future Work. Discounted reward: as hard as one shot? More than two players: feasible; need uncoordinated punishment. Graphical games: factored representation. Learning: sizing up the opponent? Generalize to stochastic games.

25 Examples from the paper: PD, battle of the sexes, unbalanced game, exponential game.

26 Symmetric Case: $R_1(a, a') = R_2(a', a)$. Value of game just maximum average! Alternate or accept security level.
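A minimal sketch of the symmetric-case rule, under stated assumptions: the matrix below uses only the C and S payoffs listed on slide 4 (the full four-action GG3 matrix is not available here), the security level 15.85 is the value quoted on slide 7, and the function name `best_alternation` is hypothetical. In a symmetric game, alternating between (a, a') and (a', a) gives both players the same average, so it is enough to search player A's matrix.

```python
def best_alternation(r1):
    """Find the action pair whose alternation maximizes the shared average.

    r1[a][b] is player A's payoff when A plays a and B plays b; by
    symmetry, B's payoff is r1[b][a]. Returns ((a, b), average).
    """
    n = len(r1)
    pairs = [(a, b) for a in range(n) for b in range(a, n)]

    def average(pair):
        a, b = pair
        # Each player gets r1[a][b] on one round and r1[b][a] on the next.
        return (r1[a][b] + r1[b][a]) / 2

    best = max(pairs, key=average)
    return best, average(best)

ACTIONS = ["C", "S"]                # restricted action set (assumption)
R1 = [[-1.0, 32.3], [16.0, 15.8]]   # A's average rewards from slide 4
SECURITY_LEVEL = 15.85              # value quoted on slide 7

pair, value = best_alternation(R1)
# Alternate if that beats the security level, otherwise accept the latter.
plan = "alternate" if value > SECURITY_LEVEL else "accept security level"
```

With these numbers the best pair is (C, S) alternated with (S, C), averaging about 24.15 per player, which matches the "24.2 each" figure on slide 9.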

27 Symmetric Markov Game: Episodic. Roles chosen randomly. Algorithm: maximize sum (MDP); security level (0-sum); choose max if better. Converges to Nash.

28 Discussion: Objectives in game theory for agents? Desiderata? How to learn the state space when repeated? Multiobjective negotiation? Learning: combine leading and following? Different unknown discount rates? Incomplete rationality? Incomplete information of rewards?

29 Markov Game. $S$: finite set of states. $A_1, A_2$: finite sets of action choices. $R_1(s, a_1, a_2)$: payoff to first player. $R_2(s, a_1, a_2)$: payoff to second player. $P(s' \mid s, a_1, a_2)$: transition function. $G \subseteq S$: goal (terminal) states. Objective: maximize expected total reward.

30 Markov Games: Overview. Combines Markov chain and matrix game: players jointly set transitions and rewards. One player: Markov decision processes. Two-player zero-sum is best studied. Also called sequential or stochastic games. In general, the equilibrium strategy is probabilistic (unlike MDPs and games of alternation).

31 Zero-sum Markov Games: How do we compute an equilibrium? Value iteration: Markov chain, except solve a mini zero-sum game at each stage. Work through example: Soccer showdown, two effective states.
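Value iteration with a small zero-sum game solved at each state can be sketched as below. This is an illustration under stated assumptions, not the soccer example from the talk: the one-state game, its rewards, and its transition probabilities are hypothetical, the names `matrix_game_value` and `value_iteration` are invented here, and the stage-game solver only handles a maximizing player with two actions (a linear program would replace it in general).

```python
from itertools import combinations

def matrix_game_value(q):
    """Maximin value of a zero-sum matrix game whose row player has 2 actions."""
    top, bottom = q
    cols = range(len(top))

    def worst_case(p):
        return min(p * top[j] + (1 - p) * bottom[j] for j in cols)

    # Optimum is at an endpoint or at a crossing of two column lines.
    candidates = [0.0, 1.0]
    for j, k in combinations(cols, 2):
        slope_j, slope_k = top[j] - bottom[j], top[k] - bottom[k]
        if slope_j != slope_k:
            p = (bottom[k] - bottom[j]) / (slope_j - slope_k)
            if 0.0 < p < 1.0:
                candidates.append(p)
    return max(worst_case(p) for p in candidates)

def value_iteration(rewards, transitions, goal_states, sweeps=100):
    """rewards[s][a1][a2]: payoff to player 1 (player 2 gets the negative).
    transitions[s][a1][a2]: dict mapping next state to probability.
    Goal states are terminal and keep value 0."""
    values = {s: 0.0 for s in list(rewards) + list(goal_states)}
    for _ in range(sweeps):
        new_values = dict(values)
        for s, r in rewards.items():
            # Stage game: immediate reward plus expected value of the successor.
            q = [[r[a1][a2] + sum(prob * values[nxt]
                                  for nxt, prob in transitions[s][a1][a2].items())
                  for a2 in range(len(r[a1]))]
                 for a1 in range(len(r))]
            new_values[s] = matrix_game_value(q)
        values = new_values
    return values

# Hypothetical game: one live state that reaches the goal with probability 0.5.
REWARDS = {"s0": [[2.0, 0.0], [0.0, 2.0]]}
HALF = {"s0": 0.5, "goal": 0.5}
TRANSITIONS = {"s0": [[HALF, HALF], [HALF, HALF]]}
values = value_iteration(REWARDS, TRANSITIONS, {"goal"})
```

In this toy game the stage value is 1 and the process continues with probability 0.5, so the value of `s0` satisfies V = 1 + 0.5 V and the iteration approaches 2.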

32 Complexity Results. One player controls each state, alternating: in NP ∩ co-NP; in P? Otherwise: optimal values can be irrational, even if transitions are deterministic; can approximate iteratively.

33 Collaborative Solution: Average total: (96, 96) (not Nash). A won't wait. B changes incentives. Can we formalize collaboration like this? Simpler setting: matrix games.

34 Repeated Matrix Game: One-state Markov game. $A_1 = A_2 = \{\text{cooperate}, \text{defect}\}$: PD. [Matrices: payoff matrices R1 and R2.] One (single-step) Nash.

35 Two Special Cases. Saddle-point equilibrium: deviation helps the other player; value is the unique solution to the zero-sum game. Coordination equilibrium: both players get the maximum reward possible; value is the unique max value. [Matrices: example payoff matrices R1 and R2.] Question: Can we check these properties efficiently?

36 Tit-for-Tat. [Matrices: payoff matrices R1 and R2.] Saddle point, not coordination. Consider: cooperate, defect iff defected on. Better (3) than with defect-defect (1). In fact, Pareto-optimal, although it requires a sequence of decisions.

37 Tit-For-Tat is Nash: Cooperation (TFT) is best response. Average payoff by response policy: (C: C, D: D) = 3; (C: C, D: C) = 3; (C: D, D: D) = 1; (C: D, D: C) = 2.5.
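The four averages can be reproduced with a short simulation, offered as a sketch under two assumptions. First, the payoff matrix did not survive in the slides: the values 3 (mutual cooperation) and 1 (mutual defection) appear on slide 36, while 5 for defecting against a cooperator and 0 for cooperating against a defector are assumed because they yield the 2.5 shown above. Second, each policy "C: x, D: y" is read as the action played when tit-for-tat is about to play C or D; tit-for-tat itself starts with C and then copies the opponent's previous action.

```python
# PAYOFF[(my_action, tft_action)] is my reward; 5 and 0 are assumed values.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def average_vs_tft(response, rounds=10_000):
    """Average reward of a fixed response policy against tit-for-tat.

    response maps tit-for-tat's current action to my action.
    """
    tft_action = "C"  # tit-for-tat opens by cooperating
    total = 0
    for _ in range(rounds):
        my_action = response[tft_action]
        total += PAYOFF[(my_action, tft_action)]
        tft_action = my_action  # tit-for-tat copies my last action
    return total / rounds

POLICIES = {
    "C: C, D: D": {"C": "C", "D": "D"},
    "C: C, D: C": {"C": "C", "D": "C"},
    "C: D, D: D": {"C": "D", "D": "D"},
    "C: D, D: C": {"C": "D", "D": "C"},
}
averages = {name: average_vs_tft(policy) for name, policy in POLICIES.items()}
```

No policy in this family beats the cooperative average of 3, which is the sense in which cooperating is a best response to tit-for-tat. The always-defect average is slightly above 1 in a finite run because of the single opening payoff of 5.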

38 Generalized TFT: TFT stabilizes a mutually beneficial outcome. General class of policies: play beneficial action; punish deviation to suppress temptation. Need to generalize both components.
