Fictitious Self-Play in Extensive-Form Games


1 Johannes Heinrich, Marc Lanctot, David Silver (University College London, Google DeepMind), July 9, 2015

2 Problem: Learn from self-play in games with imperfect information. Games: multi-agent decision-making domains, e.g. poker, politics, security. Self-play: agents learn from interaction with each other, without prior knowledge.

3 Extensive-Form Game: a game-theoretic model of sequential interaction between multiple players. Based on a game tree. Can represent imperfect (asymmetric, private) information.

4 Fictitious Play: a game-theoretic model of learning in games. Players repeatedly play a game; at each iteration each player chooses a best response to their opponents' average behaviour. The players' average strategies converge to a Nash equilibrium in some classes of games, e.g. potential games and two-player zero-sum games (Brown, 1951).

5 Fictitious Play: a game-theoretic model of learning in games. Players repeatedly play a game; at each iteration each player chooses a best response to their opponents' average behaviour. The players' average strategies converge to a Nash equilibrium in some classes of games, e.g. potential games and two-player zero-sum games (Brown, 1951). Problem: almost exclusively studied in normal-form games (tabular, no explicit sequential structure).

6 Two algorithms. 1. Fictitious play in extensive-form games: computation linear in time and space rather than exponential; preserves convergence guarantees. 2. Fictitious Self-Play: an experiential, sample-based approximation of fictitious play; leverages machine learning.

7 Two steps of fictitious play. At each iteration: compute a best response to the opponents' average strategies, then update own average strategy with the computed best response: $\Pi_{k+1} = \Pi_k + \frac{1}{k+1}\left(\mathrm{BR}(\Pi_k) - \Pi_k\right)$
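For intuition, here is a minimal sketch of this update in a normal-form (matrix) game, not the extensive-form setting the talk develops. The $\frac{1}{k+1}$ averaging step is implemented by keeping counts of how often each action was chosen as a best response; the payoff matrix and iteration budget are illustrative assumptions.

```python
import numpy as np

def fictitious_play(A, iterations=1000):
    """Vanilla fictitious play for a two-player zero-sum matrix game.

    A[i, j] is the payoff to the row player when row plays i and column plays j
    (the column player receives -A[i, j]).  Returns the players' average strategies.
    """
    n_rows, n_cols = A.shape
    row_counts = np.zeros(n_rows)   # how often each row action was chosen as a best response
    col_counts = np.zeros(n_cols)
    row_counts[0] += 1              # arbitrary initial actions
    col_counts[0] += 1

    for k in range(iterations):
        row_avg = row_counts / row_counts.sum()   # average strategies seen by the opponent
        col_avg = col_counts / col_counts.sum()
        # each player best-responds to the other's average strategy
        br_row = np.argmax(A @ col_avg)           # maximises expected payoff for row
        br_col = np.argmin(row_avg @ A)           # minimises row payoff, i.e. best for column
        # averaging with step 1/(k+1) is exactly "increment the count of the chosen action"
        row_counts[br_row] += 1
        col_counts[br_col] += 1

    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

if __name__ == "__main__":
    # Matching pennies: both average strategies approach (0.5, 0.5).
    A = np.array([[1.0, -1.0], [-1.0, 1.0]])
    print(fictitious_play(A))
```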

8 Information state tree. [Figure: information state tree rooted at a chance node C; rest of tree omitted.]

9 Computing a best response. [Figure: best-response computation on the information state tree, with expected values propagated up from the rest of the tree.]

10 Strategies. [Figure: behavioural, pure, and mixed strategies illustrated on a small tree, with action probabilities $p$, $1-p$, $q$, $1-q$ and realization weights $pq$, $p(1-q)$.]

11 Aggregating strategies. [Figure: mixed strategies $\Pi$ and $B$ combined with weights $1-\alpha$ and $\alpha$.]

12 Aggregating strategies. [Figure: the behavioural combination $(1-\alpha)\pi + \alpha\beta$ illustrated on a small tree, with the resulting action probabilities expressed in terms of $\alpha$.]

13 Realization-equivalence. [Figure: a small tree annotated with action probabilities $p$, $1-p$ and realization weights $pq$, $p(1-q)$.] Two strategies are realization-equivalent iff for any opponent strategies they define the same probability distribution over the states of the game. Realization plan: $x_\pi(\sigma_u) := \prod_{(u',a) \in \sigma_u} \pi(u', a) \quad \forall u \in U$ (Koller et al., 1994; Von Stengel, 1996)
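A realization plan is just the product of behavioural action probabilities along the sequence $\sigma_u$ that reaches an information state. A minimal sketch, assuming the strategy is stored as a dictionary from (information state, action) pairs to probabilities; the example states and numbers are invented for illustration.

```python
from typing import Dict, List, Tuple

Strategy = Dict[Tuple[str, str], float]   # (information state, action) -> probability

def realization_plan(pi: Strategy, sequence: List[Tuple[str, str]]) -> float:
    """Probability that `pi` takes every action in `sequence`,
    i.e. x_pi(sigma_u) = product over (u', a) in sigma_u of pi(u', a)."""
    x = 1.0
    for info_state, action in sequence:
        x *= pi.get((info_state, action), 0.0)
    return x

# Example: raise with prob. 0.7 at the root, then call with prob. 0.4.
pi = {("root", "raise"): 0.7, ("root", "fold"): 0.3,
      ("root/raise", "call"): 0.4, ("root/raise", "fold"): 0.6}
print(realization_plan(pi, [("root", "raise"), ("root/raise", "call")]))  # 0.28
```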

14 Aggregating behavioural strategies. Lemma: given (1) $\pi$ and $\beta$, two behavioural strategies, (2) $\Pi$ and $B$, two mixed strategies, (3) $\Pi$ and $B$ realization-equivalent to $\pi$ and $\beta$, then $\mu(u) = \pi(u) + \frac{\alpha\, x_\beta(\sigma_u)}{(1-\alpha)\, x_\pi(\sigma_u) + \alpha\, x_\beta(\sigma_u)} \bigl(\beta(u) - \pi(u)\bigr) \quad \forall u \in U$ is realization-equivalent to $M = \Pi + \alpha (B - \Pi)$.
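A sketch of this aggregation rule, reusing the `Strategy` type and `realization_plan` helper from the previous sketch. It assumes $\pi$ and $\beta$ are defined over the same information states and actions, and that `sequences` (a hypothetical lookup) maps each information state to the sequence $\sigma_u$ reaching it; the fallback weight for unreachable states is an assumption, since any value works there.

```python
def mixed_update(pi: Strategy, beta: Strategy,
                 sequences: Dict[str, List[Tuple[str, str]]],
                 alpha: float) -> Strategy:
    """Behavioural strategy realization-equivalent to (1 - alpha) * Pi + alpha * B."""
    mu: Strategy = {}
    for (u, a), p in pi.items():
        x_pi = realization_plan(pi, sequences[u])
        x_beta = realization_plan(beta, sequences[u])
        denom = (1 - alpha) * x_pi + alpha * x_beta
        # weight on beta at u; if u is unreachable under both strategies, any weight is fine
        w = alpha * x_beta / denom if denom > 0 else alpha
        mu[(u, a)] = p + w * (beta.get((u, a), 0.0) - p)
    return mu
```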

15 Fictitious play in extensive-form games. [Figure: learning curves in Leduc Hold'em, exploitability vs. iterations, comparing Extensive-Form Fictitious Play (Hendon et al., 1996) with Extensive-Form Fictitious Play (this paper).]

16 Motivation. Full-width fictitious play performs computation at all states of the game, no matter whether they are likely to occur; sampling can focus on relevant states. In big games even a single iteration of full-width fictitious play might be too costly; a function approximator could generalise between states. The state space is larger than the information state space; learning agents only operate on their information states.

17 Generalised weakened fictitious play. At each iteration: compute an approximate best response to the opponents' average strategies, then update own average strategy with the computed best response, allowing for some kinds of perturbations: $\Pi_{k+1} = \Pi_k + \alpha_{k+1}\left(\mathrm{BR}_{\epsilon_{k+1}}(\Pi_k) - \Pi_k + M_{k+1}\right)$ (Benaïm et al., 2005; Leslie & Collins, 2006)
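For intuition only, the matrix-game loop from the earlier sketch rewritten with the weakened ingredients: an explicit step size $\alpha_k$ rather than the implicit $\frac{1}{k+1}$ of the counting version, and a best response that is allowed to be slightly suboptimal. The perturbation term $M_{k+1}$ is omitted and the decay schedules below are illustrative assumptions, not the paper's choices.

```python
import numpy as np

def gwfp(A, iterations=10_000):
    """Generalised-weakened-style fictitious play on a zero-sum matrix game A (row payoffs)."""
    n_rows, n_cols = A.shape
    row_avg = np.full(n_rows, 1.0 / n_rows)
    col_avg = np.full(n_cols, 1.0 / n_cols)

    for k in range(1, iterations + 1):
        alpha = 1.0 / k          # step size: alpha_k -> 0, sum of alpha_k diverges
        eps = 1.0 / np.sqrt(k)   # approximation tolerance shrinks over time

        # approximate best responses: exact best response mixed with a vanishing amount of noise
        row_values = A @ col_avg
        col_values = -(row_avg @ A)
        row_br = np.full(n_rows, eps / n_rows)
        row_br[np.argmax(row_values)] += 1.0 - eps
        col_br = np.full(n_cols, eps / n_cols)
        col_br[np.argmax(col_values)] += 1.0 - eps

        # Pi_{k+1} = Pi_k + alpha_{k+1} * (BR_eps(Pi_k) - Pi_k)
        row_avg += alpha * (row_br - row_avg)
        col_avg += alpha * (col_br - col_avg)

    return row_avg, col_avg
```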

18 Learning a best response. [Figure: best-response learning on the information state tree; rest of tree omitted.]

19 Knowledge transfer. Opponent's update (in a two-player game): $\Pi^{-i}_{k} = (1-\alpha_k)\Pi^{-i}_{k-1} + \alpha_k B^{-i}_{k}$. Thus the MDP defined by $\Pi^{-i}_{k}$ has the following structure: [Figure: an initial chance node C leads to MDP$(\Pi^{-i}_{k-1})$ with probability $1-\alpha_k$ and to MDP$(B^{-i}_{k})$ with probability $\alpha_k$, where $\alpha_k \to 0$ as $k \to \infty$.]

20 Learning a strategy update. Learn $\Pi^{i}_{k} = (1-\alpha_k)\Pi^{i}_{k-1} + \alpha_k B^{i}_{k}$ by sampling data (state-action pairs) from $\pi^{i}_{k-1}$ with prob. $1-\alpha_k$ and $\beta^{i}_{k}$ with prob. $\alpha_k$, against some fixed, fully mixed opponent sampling policy, e.g. $\pi^{-i}_{k-1}$.
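A small sketch of this sampling scheme. It assumes a hypothetical helper `play_episode(own_policy, opponent_policy)` that plays one episode and returns the (information state, action) pairs taken by the learning player; those pairs then serve as supervised data whose empirical distribution approximates the mixture above.

```python
import random

def sample_strategy_data(pi_prev, beta, opponent_policy, alpha, n_episodes, play_episode):
    """Collect state-action pairs approximating (1 - alpha) * pi_prev + alpha * beta."""
    data = []
    for _ in range(n_episodes):
        # choose which of the two own policies generates this episode
        own_policy = beta if random.random() < alpha else pi_prev
        data.extend(play_episode(own_policy, opponent_policy))
    return data
```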

21 Fictitious Self-Play: 1. generate an approximate best response by reinforcement learning from experience; 2. learn a model of own average behaviour by supervised learning from experience; 3. generate experience from self-play, using combinations of $(\pi^i, \pi^{-i})$, $(\beta^i, \pi^{-i})$, $(\pi^i, \beta^{-i})$.
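A high-level sketch of how these three steps fit together; it is not the paper's pseudocode, and `rl_best_response`, `supervised_update`, and `play_episodes` are hypothetical components standing in for the reinforcement learner, the supervised average-strategy learner, and the self-play experience generator.

```python
def fictitious_self_play(n_iterations, rl_best_response, supervised_update,
                         play_episodes, initial_policies):
    """Skeleton FSP loop.

    rl_best_response(rl_memory)           -> approximate best response beta^i      (step 1)
    supervised_update(avg_policy, data)   -> updated average-strategy model pi^i   (step 2)
    play_episodes(player, pi, beta)       -> (RL transitions, own state-action pairs),
                                             sampled from the strategy combinations above (step 3)
    """
    pi = dict(initial_policies)          # average-strategy models, one per player
    beta = dict(initial_policies)        # best-response policies, one per player
    rl_memory = {p: [] for p in pi}      # transitions for reinforcement learning
    sl_memory = {p: [] for p in pi}      # own state-action pairs for supervised learning

    for _ in range(n_iterations):
        for player in pi:
            transitions, behaviour = play_episodes(player, pi, beta)
            rl_memory[player].extend(transitions)
            sl_memory[player].extend(behaviour)
            beta[player] = rl_best_response(rl_memory[player])
            pi[player] = supervised_update(pi[player], sl_memory[player])
    return pi
```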

22 Experiments. Fitted Q-Iteration (reinforcement learning); counting model (supervised learning): $\forall a \in A(u_t): N(u_t, a) \leftarrow N(u_t, a) + \mathbb{1}\{a_t = a\}$, and $\forall a \in A(u_t): \pi(u_t, a) \leftarrow N(u_t, a) / N(u_t)$.
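The counting model is simple enough to sketch directly: normalised visit counts of the player's own state-action pairs. Data structures and naming are assumptions.

```python
from collections import defaultdict

class CountingModel:
    """Average-strategy model: normalised visit counts of own state-action pairs."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))   # info state -> action -> N(u, a)

    def update(self, info_state, action):
        # N(u_t, a) <- N(u_t, a) + 1{a_t = a}
        self.counts[info_state][action] += 1

    def policy(self, info_state):
        # pi(u_t, a) = N(u_t, a) / N(u_t)
        actions = self.counts[info_state]
        total = sum(actions.values())
        return {a: n / total for a, n in actions.items()}
```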

23 Fictitious Self-Play in Leduc Hold'em. [Figure: learning curves in Leduc Hold'em, exploitability vs. time in seconds, comparing XFP and FSP:FQI on 6-card Leduc.]

24 Fictitious Self-Play in Leduc Hold'em. [Figure: learning curves in Leduc Hold'em, exploitability vs. time in seconds, comparing XFP and FSP:FQI on 6-card and 60-card Leduc.]

25 Fictitious Self-Play in River Poker. [Figure: learning curves in River Poker, exploitability vs. time in seconds, comparing XFP and FSP:FQI with uniform beliefs.]

26 Fictitious Self-Play in River Poker. [Figure: learning curves in River Poker, exploitability vs. time in seconds, comparing XFP and FSP:FQI with uniform and defined beliefs.]

27 Summary. A convergent, full-width fictitious play algorithm for extensive-form games; Fictitious Self-Play, an experiential, sample- and learning-based approach to fictitious play.
