Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games

1 Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games
Richard Mealing and Jonathan L. Shapiro
Machine Learning and Optimisation Group, School of Computer Science, University of Manchester, UK

2 The Problem
You play against an opponent.
The opponent's actions are based on previous actions.
How can you maximise your reward?
Applications: heads-up poker, auctions, P2P networking, path finding, etc.

3 Possible Approaches
You could use reinforcement learning to learn to take actions with high expected discounted rewards.
However, we propose to:
  Model the opponent using sequence prediction methods
  Look ahead and take actions which, according to the opponent model, probabilistically lead to the highest reward
Which approach gives us the highest rewards?

4 Opponent Modelling using Sequence Prediction
Observe the opponent's action and the player's action: $(a_{opp}, a)$
Form a sequence over time $t$ (memory size $n$):
  $(a^{t}_{opp}, a^{t}), (a^{t-1}_{opp}, a^{t-1}), \dots, (a^{t-n+1}_{opp}, a^{t-n+1})$
Predict the opponent's next action based on this sequence:
  $\Pr\left(a^{t+1}_{opp} \mid (a^{t}_{opp}, a^{t}), (a^{t-1}_{opp}, a^{t-1}), \dots, (a^{t-n+1}_{opp}, a^{t-n+1})\right)$
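The prediction step above can be sketched with a plain frequency count over fixed-length contexts, in the spirit of an N-Gram predictor. This is a minimal illustration, not the authors' implementation; the class name `NGramOpponentModel` and its `observe`/`predict` methods are hypothetical.

```python
from collections import Counter, defaultdict, deque


class NGramOpponentModel:
    """Frequency-count predictor over the last n (opponent, player) action pairs."""

    def __init__(self, n):
        self.n = n
        self.history = deque(maxlen=n)      # most recent n joint actions
        self.counts = defaultdict(Counter)  # context -> counts of next opponent action

    def observe(self, opp_action, player_action):
        """Record one round, crediting the opponent's action to the preceding context."""
        if len(self.history) == self.n:
            self.counts[tuple(self.history)][opp_action] += 1
        self.history.append((opp_action, player_action))

    def predict(self, context=None):
        """Return {action: probability} for the opponent's next action."""
        context = tuple(self.history) if context is None else tuple(context)
        seen = self.counts.get(context)
        if not seen:
            return {}                       # this context has never been observed
        total = sum(seen.values())
        return {action: count / total for action, count in seen.items()}
```

Passing an explicit `context` to `predict` allows querying contexts that have not actually occurred yet, which is what a lookahead over hypothesised contexts needs.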

5 Sequence Prediction Methods
We tested a variety of sequence prediction methods:
  Lempel-Ziv-1978 (LZ78) [1]
  Knuth-Morris-Pratt (KMP) [2]
  Prediction by Partial Matching (PPM) [3]
  ActiveLeZi [4]
  Transition Directed Acyclic Graph (TDAG) [5]
  Entropy Learned Pruned Hypothesis Space (ELPH) [6]
  N-Gram [7]
  Hierarchical N-Gram (H. N-Gram) [7]: collection of 1 to N-Grams
  Long Short Term Memory (LSTM) [8]: implicit blending & pruning
Properties noted for the remaining methods: unbounded contexts, context blending, context pruning

6 Sequence Prediction Method Lookahead
Predict with lookahead $k$ given a hypothesised context, i.e.
  $\Pr\left(a^{t+k}_{opp} \mid (a^{t+k-1}_{opp}, a^{t+k-1}), (a^{t+k-2}_{opp}, a^{t+k-2}), \dots, (a^{t+k-n}_{opp}, a^{t+k-n})\right)$
A hypothesised context may contain unobserved (predicted) symbols.

7 Reinforcement Learning: Q-Learning
Learns an action-value function that, given a state-action pair $(s, a)$, outputs the expected value of taking that action in that state and following a fixed strategy thereafter [9]:
  $Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right]$
Here $s_t$ is the state, $a_t$ the action, $r_t$ the reward, $\gamma$ the discount and $\alpha$ the learning rate. The first term is a fraction of the old value; the second is a fraction of the reward plus the next max-valued action.
Select actions with high Q-values, with some exploration.
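A minimal tabular sketch of this update and of exploratory action selection follows. The function names `q_update` and `epsilon_greedy`, the dictionary-based table and the default parameter values are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * (reward + gamma * best_next)
    return q[(state, action)]


def epsilon_greedy(q, state, actions, epsilon=0.1, rng=random):
    """Pick the highest-valued action, exploring uniformly with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


# The table maps (state, action) pairs to values; unseen pairs start at 0.
q_table = defaultdict(float)
```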

8 Need for Lookahead (Prisoner's Dilemma Example)
Payoff matrix (row player's payoff, column player's payoff):
       D     C
  D   1,1   4,0
  C   0,4   3,3

9 Need for Lookahead (Prisoner's Dilemma Example)
Defect is the dominant action.
Cooperate-Cooperate is socially optimal (highest sum of rewards).
Tit-for-tat (copy the opponent's last move) is good for iterated play.
Can we learn tit-for-tat?

10 Need for Lookahead (Prisoner's Dilemma Example)
[Figure: lookahead tree over the Prisoner's Dilemma payoffs (1,1; 4,0; 0,4; 3,3) with predicted opponent actions (Pred.)]

11 Need for Lookahead (Prisoner's Dilemma Example)
[Figure: lookahead tree over the Prisoner's Dilemma payoffs (1,1; 4,0; 0,4; 3,3) with predicted opponent actions (Pred.)]
Lookahead of 1 shows D has the highest reward.
With lookahead of 2, the sequence in which the player defects twice while the opponent cooperates twice has the highest total reward (unlikely).
Assume the opponent copies the player's last move (i.e. tit-for-tat).

12 Need for Lookahead (Prisoner's Dilemma Example)
[Figure: two-level lookahead tree over the Prisoner's Dilemma payoffs (1,1; 4,0; 0,4; 3,3) with predicted opponent actions (Pred.) at each level]

13 Need for Lookahead (Prisoner's Dilemma Example)
[Figure: two-level lookahead tree over the Prisoner's Dilemma payoffs (1,1; 4,0; 0,4; 3,3) with predicted opponent actions (Pred.) at each level]
Lookahead of 2 against tit-for-tat shows C has the highest reward.

14 Q-Learning's Implicit Lookahead
  $Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right]$
Here $s_t$ is the state, $a_t$ the action, $r_t$ the reward, $\gamma$ the discount and $\alpha$ the learning rate. The first term is a fraction of the old value; the second is a fraction of the reward plus the next max-valued action.
Assume each state is an opponent action, i.e. $s = a_{opp}$.
Learns (player action, opponent action) values as:
  $\gamma = 0$: payoff matrix ($\arg\max_a Q(a^{t+1}_{opp}, a)$ is the same as max lookahead 1)
  $0 < \gamma < 1$: payoff matrix + future rewards with exponential decay
  $\gamma = 1$: payoff matrix + future rewards
Increasing $\gamma$ increases lookahead.
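One way to see why the discount acts as an implicit lookahead horizon is to unroll the update target. The sketch below assumes $0 \le \gamma < 1$, that the updates have converged to a fixed point (written $Q^*$ here, a symbol not used on the slides), and that the agent acts greedily from step $t+1$ onwards.

```latex
\begin{align}
Q^*(s_t, a_t)
  &= \mathbb{E}\left[ r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \right] \\
  &= \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \right]
   = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \right]
\end{align}
```

With $\gamma = 0$ only the immediate reward $r_t$ survives, which is the payoff-matrix case; for larger $\gamma$ the reward $k$ steps ahead contributes with weight $\gamma^{k}$.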

15 Exhaustive Explicit Lookahead
We use exhaustive explicit lookahead with the opponent model and action values to greedily select actions (to limited depth) maximising total reward.
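A depth-limited exhaustive search of this kind might look like the sketch below, using the Prisoner's Dilemma payoffs from the earlier example slides. It is an illustration under stated assumptions rather than the authors' code: immediate payoffs stand in for the action values, the opponent model is a hand-written tit-for-tat function, and the names `PAYOFF`, `tit_for_tat_model` and `lookahead_value` are hypothetical.

```python
# Payoffs to the player, indexed by (player action, opponent action).
PAYOFF = {("D", "D"): 1, ("D", "C"): 4, ("C", "D"): 0, ("C", "C"): 3}
ACTIONS = ("C", "D")


def tit_for_tat_model(context):
    """Hypothetical opponent model: copy the player's last action with certainty."""
    _, last_player_action = context[-1]
    return {last_player_action: 1.0}


def lookahead_value(context, depth, model):
    """Return (best expected total reward over `depth` rounds, first action achieving it).

    `context` is a tuple of (opponent action, player action) pairs, oldest first.
    """
    if depth == 0:
        return 0.0, None
    prediction = model(context)
    best_value, best_action = float("-inf"), None
    for action in ACTIONS:
        value = 0.0
        for opp_action, prob in prediction.items():
            # Slide the context forward with the hypothesised joint action.
            next_context = context[1:] + ((opp_action, action),)
            future, _ = lookahead_value(next_context, depth - 1, model)
            value += prob * (PAYOFF[(action, opp_action)] + future)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action
```

Starting from the context `(("C", "C"),)`, a depth of 1 selects D (value 4) while a depth of 2 selects C (value 7, from cooperating and then defecting), which is the effect the Prisoner's Dilemma example illustrates.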

16 Experiments
Iterated Rock-Paper-Scissors: the opponent's actions depend on its own previous actions
        R      P      S
  R    0,0   -1,1   1,-1
  P   1,-1    0,0   -1,1
  S   -1,1   1,-1    0,0
Iterated Prisoner's Dilemma: the opponent's actions depend on both players' previous actions
       D     C
  D   1,1   4,0
  C   0,4   3,3
Littman's Soccer [10]: direct competition
Which approach has better performance?

17 Iterated Rock Paper Scissors
Opponent sequences: Order 1 {R,P,S}, Order 2 {R,R,P,P,S,S}, Order 3 {R,R,R,P,P,P,S,S,S}
Each cell lists Name, Avg Payoff, Avg Time.
Memory Size 1
  Order 1                          | Order 2                          | Order 3
  ELPH         1 ±       ± 0.6     | WoLF-PHC     ±         ± 5       | ELPH         ±         ± 4
  WoLF-PHC     1 ± 0     27 ± 2    | PGA-APP      ±         ± 5       | PGA-APP      ±         ± 4
  PGA-APP      ±         ± 2       | ε Q-Learner  ±         ± 3       | WoLF-PHC     ±         ± 4
  ε Q-Learner  0.97 ±    ± 2       | ELPH         ±         ± 0       | ε Q-Learner  ±         ± 6
  WPL          0.87 ±    ± 6       | WPL          ±         ± 7       | WPL          ±         ± 7
Memory Size 2
  Order 1                          | Order 2                          | Order 3
  ELPH         1 ± 0     10 ± 0    | ELPH         1 ± 0     10 ± 0    | WoLF-PHC     0.68 ±    ± 6
  WoLF-PHC     0.98 ±    ± 3       | ε Q-Learner  0.92 ±    ± 4       | ε Q-Learner  0.64 ±    ± 5
  ε Q-Learner  0.97 ±    ± 2       | WoLF-PHC     0.91 ±    ± 8       | PGA-APP      0.61 ±    ± 7
  PGA-APP      0.92 ±    ± 3       | PGA-APP      0.86 ±    ± 6       | ELPH         0.6 ±     ± 4
  WPL          0.65 ±    ± 7       | WPL          0.54 ±    ± 6       | WPL          ±         ± 7
Memory Size 3
  Order 1                          | Order 2                          | Order 3
  ELPH         1 ± 0     10 ± 0    | ELPH         1 ±       ± 0.7     | ELPH         1 ±       ± 0.7
  WoLF-PHC     0.95 ±    ± 6       | WoLF-PHC     0.89 ±    ± 3       | WoLF-PHC     0.85 ±    ± 0
  ε Q-Learner  0.94 ±    ± 4       | ε Q-Learner  0.87 ±    ± 5       | ε Q-Learner  0.84 ±    ± 6
  PGA-APP      0.9 ±     ± 6       | PGA-APP      0.87 ±    ± 6       | PGA-APP      0.77 ±    ± 3
  WPL          0.63 ±    ± 6       | WPL          0.69 ±    ± 2       | WPL          0.76 ±    ± 0
Legend: Good / Bad
Agents cannot learn a best response with memory size < model order.
Our approach gains the highest payoffs at generally the fastest rates.

18 Iterated Prisoner's Dilemma
Each cell lists Name, Avg Payoff, Avg Time, Position.
Memory Size 1
  Discount = 0 and Depth = 1             | Discount = 0.99 and Depth = 2
  PGA-APP      2.03 ±    ± 3    13       | ε Q-Learner        2.68 ±    ± 5    1
  ε Q-Learner  1.94 ±    ± 4    16       | TDAG + Q-Learner   2.63 ±    ± 4    1
  WPL          ±         ± 1    17       | TDAG               ±         ± 1    1
  TDAG         1.93 ±    ± 2    16       | WPL                2.31 ±    ± 4    12
  WoLF-PHC     1.89 ±    ± 2    18       | PGA-APP            2.17 ±    ± 3    13
                                         | WoLF-PHC           2.1 ±     ± 5    13
Memory Size 2
  Discount = 0 and Depth = 1             | Discount = 0.99 and Depth = 2
  PGA-APP      2.01 ±    ± 4    14       | TDAG + Q-Learner   ±         ± 6    1
  WPL          ±         ± 1    17       | ε Q-Learner        2.74 ±    ± 5    1
  WoLF-PHC     1.92 ±    ± 4    17       | TDAG               2.72 ±    ± 1    1
  TDAG         ±         ± 2    16       | WPL                2.34 ±    ± 4    12
  ε Q-Learner  ±         ± 2    18       | PGA-APP            2.18 ±    ± 5    13
                                         | WoLF-PHC           2.14 ±    ± 3    13
Memory Size 3
  Discount = 0 and Depth = 1             | Discount = 0.99 and Depth = 2
  ε Q-Learner  2.02 ±    ± 3    14       | TDAG + Q-Learner   ±         ± 5    1
  TDAG         ±         ± 3    17       | TDAG               2.74 ±    ± 3    1
  WPL          ±         ± 3    17       | ε Q-Learner        2.65 ±    ± 5    1
  PGA-APP      1.92 ±    ± 2    16       | WPL                2.32 ±    ± 4    12
  WoLF-PHC     ±         ± 1    18       | PGA-APP            2.18 ±    ± 4    12
                                         | WoLF-PHC           2.14 ±    ± 4    13
Legend: Good / Bad
Increasing lookahead (discounting, search depth) increases rewards.
Our approach + Q-Learning increases rewards but also increases time.
Our approach gains the highest payoffs at generally the fastest rates.

19 Soccer
Each cell lists Name, Avg Payoff.
  ε Q-Learner      | WoLF-PHC       | WPL            | PGA-APP
  PPM ±            | PPM ±          | PPM ±          | PPM ±
  LSTM ±           | LSTM ±         | H. N-Gram ±    | H. N-Gram ±
  TDAG 0.63 ±      | FP ±           | N-Gram ±       | ActiveLeZi ±
  H. N-Gram ±      | N-Gram ±       | LSTM ±         | FP ±
  LZ78 ±           | H. N-Gram ±    | TDAG ±         | TDAG ±
  N-Gram 0.62 ±    | ActiveLeZi ±   | FP ±           | LSTM ±
  ActiveLeZi ±     | TDAG ±         | LZ78 ±         | N-Gram ±
  ELPH ±           | LZ78 ±         | ActiveLeZi ±   | LZ78 ±
  FP ±             | ELPH ±         | ELPH ±         | ELPH ±
  KMP ±            | KMP ±          | KMP 0.62 ±     | KMP ±
Legend: Good / Bad
Our approach wins above 50% of the games using any predictor.
PPM has the highest performance.

20 Conclusions
We proposed sequence prediction and lookahead to accurately model and effectively respond to opponents with memory.
Empirical results show that, given sufficient memory and lookahead, our approach outperforms reinforcement learning algorithms.

21 Future Work
We will apply our approach to domains with:
  Larger state spaces
  Hidden information
where the challenges are:
  Deeper lookahead (e.g. sampling techniques)
  Sequence predictor configuration (e.g. 1 predictor per state)

22 References
[1] Lempel and Ziv. Compression of Individual Sequences via Variable-Rate Coding.
[2] Byron Knoll. Text Prediction and Classification Using String Matching.
[3] Alistair Moffat. Implementing the PPM Data Compression Scheme. In: IEEE Transactions on Communications 38 (1990).
[4] Karthik Gopalratnam and Diane J. Cook. ActiveLezi: An incremental parsing algorithm for sequential prediction. In: 16th Int. FLAIRS Conf. 2003.
[5] Philip Laird and Ronald Saul. Discrete Sequence Prediction and Its Applications. In: Machine Learning 15 (1994).
[6] Jensen et al. Non-stationary policy learning in 2-player zero sum games. In: Proc. of 20th Int. Conf. on AI. 2005.
[7] Ian Millington. Artificial Intelligence for Games. In: ed. by David H. Eberly. Morgan Kaufmann, Chap. Learning.
[8] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. Learning Precise Timing with LSTM Recurrent Networks. In: JMLR 3 (2002).
[9] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis. Cambridge.
[10] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In: 11th Proc. of ICML. Morgan Kaufmann, 1994.
