Learning an Effective Strategy in a Multi-Agent System with Hidden Information
1 Learning an Effective Strategy in a Multi-Agent System with Hidden Information
Richard Mealing
Supervisor: Jon Shapiro
Machine Learning and Optimisation Group, School of Computer Science, University of Manchester
2 Our Problem: Maximising Reward with an Opponent
We focus on the simplest case with just 2 agents.
Each agent is trying to maximise its own rewards.
But each agent's actions can affect the other agent's rewards.
3 Our Proposal: Predict and Adapt to the Future
Before maximising our rewards we learn:
- What our rewards are for actions - use reinforcement/no-regret learning
- How the opponent will act - use sequence prediction methods
To maximise our rewards:
- Lookahead - take the actions with the maximum expected reward
- Simulate - adapt our strategy to rewards against the opponent model
Hidden information - what did the opponent base their decision on?
Learn the hidden information using online expectation maximisation.
4 Why Games?
Games let you focus on the agent and worry less about the environment:
- Well-defined rules and clear goals
- Can allow easy agent comparisons
- Can allow complex strategies
- Game theory gives a foundation
5 Artificial Intelligence Success in Games
- Backgammon: BKG 9.8 beat world champion Luigi Villa [1]
- Checkers: Chinook beat world champion Marion Tinsley [2]
- Scrabble: Quackle beat former champion David Boys [3]
- Chess: Deep Blue beat world champion Garry Kasparov [4]
- Othello (Reversi): Logistello beat world champion Takeshi Murakami [5]
- Go: Crazy Stone beat various pros [6]
- Poker: Polaris beat various pros in heads-up limit Texas hold 'em [7]
- Jeopardy!: Watson beat former winners Brad Rutter and Ken Jennings [8]
6 Perfect and Imperfect Information
Perfect information - players always know the state, e.g. Tic Tac Toe, Checkers.
Imperfect information - at some point a player doesn't know the state, e.g. Rock Paper Scissors, Poker.
7 First Approach
1 Reinforcement learning (Q-Learning) to learn our own rewards
2 Sequence prediction to learn the opponent's strategy
3 Exhaustive explicit lookahead (to a limited depth) with 1 and 2 to take the actions with the maximum expected reward
Outperforms state-of-the-art reinforcement learning agents in:
- Rock Paper Scissors
- Prisoner's Dilemma
- Littman's Soccer [10]
Richard Mealing and Jonathan L. Shapiro. Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games. In: 12th International Conference on Artificial Intelligence and Soft Computing [9].
8 Reinforcement Learning
We use Q(uality)-Learning to learn the rewards for action sequences.
Comparison agents use Q-Learning or Q-Learning based methods.
Q-Learning learns the expected value of taking an action in a state and then following a fixed strategy [11]:

Q(s_t, a^pla_t) <- (1 - α) Q(s_t, a^pla_t) + α [ r_t + γ max_{a^pla_{t+1}} Q(s_{t+1}, a^pla_{t+1}) ]

where s_t = state at time t, a^pla_t = player's action at time t, r_t = reward at time t, α = learning rate, γ = discount factor.
We use Q(s_t, a^pla_t) with lookahead and some exploration.
Comparison agents select max_{a^pla_t} Q(s_t, a^pla_t) with some exploration.
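As an illustration, the update rule above can be written as a small tabular learner. This is a minimal sketch (class name, parameters and state/action types are arbitrary choices, not the agents compared in the talk):

```python
from collections import defaultdict

class QLearner:
    """Tabular Q-learning: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""

    def __init__(self, actions, alpha=0.1, gamma=0.9):
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.actions = actions
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor

    def update(self, s, a, r, s_next):
        # Bootstrap from the best action available in the next state.
        best_next = max(self.q[(s_next, b)] for b in self.actions)
        self.q[(s, a)] = (1 - self.alpha) * self.q[(s, a)] + \
            self.alpha * (r + self.gamma * best_next)

    def greedy_action(self, s):
        # What the comparison agents do: pick the argmax action.
        return max(self.actions, key=lambda a: self.q[(s, a)])
```

The proposed agent instead feeds these Q-values into the lookahead search described later, rather than acting greedily on them.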
9 Sequence Prediction
Markov model - the probability of the opponent's action a^opp_t depends only on the current state s_t: Pr(a^opp_t | s_t).
Sequence prediction - the probability of the opponent's action depends on a history H: Pr(a^opp_t | H), where H ⊆ {s_t, a_{t-1}, s_{t-1}, a_{t-2}, s_{t-2}, ..., a_1, s_1}.
10 Sequence Prediction Methods
Long-term memory L - a set of distributions, each one conditioned on a different history H:
L = {Pr(a^opp_t | H) : H ⊆ {s_t, a_{t-1}, s_{t-1}, a_{t-2}, s_{t-2}, ..., a_1, s_1}}
Short-term memory S - a list of recent observations (states/actions):
S = (o_t, o_{t-1}, o_{t-2}, ..., o_{t-n})
Observing a symbol o_t:
1 Generate a set of histories H = {H_1, H_2, ...} using S
2 For each H ∈ H create/update Pr(a^opp_t | H) using o_t
3 Add o_t to S (remove the oldest observation if needed)
Predicting an opponent action a^opp_t:
1 Generate a set of histories H = {H_1, H_2, ...} using S
2 Predict using {Pr(a^opp_t | H) : H ∈ H}
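The observe/predict loop above can be sketched as follows. One concrete choice of histories H is assumed here (all suffixes of the short-term memory, plus the empty history), and prediction falls back to shorter histories; both are illustrative assumptions, not the exact method from the talk:

```python
from collections import Counter, deque

class SequencePredictor:
    def __init__(self, n=3):
        self.short = deque(maxlen=n)   # short-term memory S (last n symbols)
        self.long = {}                 # long-term memory L: history -> Counter of next symbols

    def _histories(self):
        # One simple choice of H: every suffix of S, including the empty history.
        s = tuple(self.short)
        return [s[i:] for i in range(len(s) + 1)]

    def observe(self, opp_action):
        for h in self._histories():    # create/update Pr(a_opp | H) for each H
            self.long.setdefault(h, Counter())[opp_action] += 1
        self.short.append(opp_action)  # oldest symbol drops automatically (maxlen)

    def predict(self):
        # Prefer the longest history we have statistics for.
        for h in sorted(self._histories(), key=len, reverse=True):
            if h in self.long:
                return self.long[h].most_common(1)[0][0]
        return None
```

Against a cyclic opponent (R, P, S, R, P, S, ...) the longest-suffix table quickly learns that the context (R, P, S) is followed by R.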
11 Sequence Prediction Method Example
Entropy Learned Pruned Hypothesis Space (ELPH) [12]:
Inputs: memory size n and entropy threshold e.
Observing a symbol o_t:
1 Generate the powerset P(S) = H of short-term memory S, where
S = (o_t, o_{t-1}, o_{t-2}, ..., o_{t-n}) and
P(S) = {{}, {o_1}, ..., {o_n}, {o_1, o_2}, ..., {o_1, o_n}, ..., {o_1, o_2, ..., o_n}}
2 For each H ∈ H create/update Pr(a^opp_t | H) using o_t
3 For each H ∈ H, if Entropy(Pr(a^opp_t | H)) > e then discard it
4 Add o_t to S (remove the oldest observation if |S| > n)
Predicting an opponent action a^opp_t:
1 Generate the powerset P(S) = H of short-term memory S
2 Predict using arg min over H ∈ H of Entropy(Pr(a^opp_t | H))
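A minimal sketch of the ELPH procedure above. Details such as how the short-term-memory positions are encoded in each hypothesis are assumptions of this sketch, not taken from the talk or the ELPH paper:

```python
import math
from collections import Counter, deque
from itertools import combinations

class ELPH:
    def __init__(self, n=3, entropy_threshold=1.0):
        self.e = entropy_threshold
        self.short = deque(maxlen=n)   # short-term memory S
        self.long = {}                 # hypothesis -> Counter over next symbols

    def _powerset(self):
        # Hypotheses are subsets of S; positions are kept so (R,P) != (P,R).
        items = list(enumerate(self.short))
        return [frozenset(c) for r in range(len(items) + 1)
                for c in combinations(items, r)]

    @staticmethod
    def _entropy(counter):
        total = sum(counter.values())
        return -sum((c / total) * math.log2(c / total) for c in counter.values())

    def observe(self, symbol):
        for h in self._powerset():                 # create/update Pr(a_opp | H)
            self.long.setdefault(h, Counter())[symbol] += 1
        for h in self._powerset():                 # prune high-entropy hypotheses
            if h in self.long and self._entropy(self.long[h]) > self.e:
                del self.long[h]
        self.short.append(symbol)                  # oldest symbol drops (maxlen)

    def predict(self):
        known = [h for h in self._powerset() if h in self.long]
        if not known:
            return None
        best = min(known, key=lambda h: self._entropy(self.long[h]))
        return self.long[best].most_common(1)[0][0]
```

The pruning step is what keeps the powerset tractable: hypotheses whose prediction distribution stays too uncertain are repeatedly discarded.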
12 Lookahead Example
Prisoner's dilemma payoff matrix (row player's reward listed first):

        D      C
  D    1,1    4,0
  C    0,4    3,3
13 Lookahead Example
Defect is the dominant action (highest reward).
Cooperate-Cooperate is socially optimal (highest sum of rewards).
Tit-for-tat (copy opponent's last move) is good for repeated play.
Can we learn to play optimally against tit-for-tat?
14 Lookahead Example
[Figure: depth-1 lookahead tree alongside the payoff matrix. Under the prediction that the opponent plays C, the player's D and C branches yield rewards 4 and 3; under the prediction D, they yield 1 and 0.]
15 Lookahead Example
[Figure: depth-1 lookahead tree alongside the payoff matrix, as on the previous slide.]
With lookahead 1, D has the highest reward.
With lookahead 2, (D,C,D,C) has the highest total reward (unlikely).
Assume the opponent copies the player's last move (i.e. tit-for-tat).
16 Lookahead Example
[Figure: depth-2 lookahead tree: after each predicted opponent reply, the player's D and C branches expand again.]
17 Lookahead Example
[Figure: depth-2 lookahead tree against tit-for-tat, with cumulative rewards at the leaves.]
With lookahead of 2 against tit-for-tat, C has the highest reward.
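The conclusion of this example can be checked with a tiny exhaustive lookahead against a tit-for-tat opponent model. This is a sketch using the payoff matrix above (a plain enumeration, not the talk's implementation; the opponent's assumed first move is a parameter):

```python
from itertools import product

# Prisoner's dilemma payoffs from the slide's matrix (row player's reward).
PAYOFF = {('D', 'D'): 1, ('D', 'C'): 4, ('C', 'D'): 0, ('C', 'C'): 3}

def best_first_action(depth, opp_first='C'):
    """Exhaustive lookahead: first action of the action sequence with
    maximum total reward, assuming the opponent plays tit-for-tat."""
    best_seq, best_reward = None, float('-inf')
    for seq in product('DC', repeat=depth):
        opp, total = opp_first, 0
        for a in seq:
            total += PAYOFF[(a, opp)]
            opp = a            # tit-for-tat: the opponent copies our last move
        if total > best_reward:
            best_seq, best_reward = seq, total
    return best_seq[0], best_reward
```

At depth 1 the myopic choice is D (reward 4 vs 3), but at depth 2 the sequence (C, D) earns 3 + 4 = 7, so the first action taken is C, matching the slide.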
18 Results of First Approach
Converges to higher average payoffs per game at faster rates than reinforcement learning algorithms such that in...
- Iterated rock-paper-scissors: learns to best-respond against variable-order Markov models
- Iterated prisoner's dilemma: comes first in a tournament against finite automata
- Littman's soccer: wins 7% of games against reinforcement learning algorithms
19 Summary of First Approach
We associate a sequence predictor with each game state.
During a game we update our:
- Rewards for action sequences using Q-Learning
- Sequence predictors with observed opponent actions
At each decision point we look ahead and take the first action of an action sequence with the maximum expected cumulative reward.
20 Second Approach
1 Sequence prediction to learn the opponent's strategy
2 Online expectation maximisation [13, 14] to predict the opponent's hidden information (to know the history H with which to update our opponent model)
3 No-regret learning algorithm to adjust our strategy
4 Simulate games against our opponent model
Improves no-regret algorithm performance vs itself, a state-of-the-art reinforcement learning agent and a popular bandit algorithm in:
- Die-Roll Poker [15]
- Rhode Island Hold 'em [16]
21 Online Expectation Maximisation
A rational agent will act based on its hidden information.
At the end of a game, we have observed the opponent's (public) actions but not necessarily their hidden information (e.g. they folded).
Expectation step:
1 For each possible instance of hidden information the opponent could hold, calculate the probability of their actions
2 Normalise these probabilities
Each normalised probability corresponds to the expected number of opponent visits to the path associated with that hidden information.
Maximisation step: update the opponent's action probabilities along each path to account for their expected number of visits.
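A toy sketch of this E-step/M-step cycle for a single opponent decision point. The model (per-card pseudo-counts over actions) and all numbers are illustrative assumptions, not the talk's implementation:

```python
class OpponentModel:
    def __init__(self, cards, actions):
        # Visit counts per hidden card, with a uniform pseudo-count prior.
        self.counts = {c: {a: 1.0 for a in actions} for c in cards}

    def prob(self, card, action):
        total = sum(self.counts[card].values())
        return self.counts[card][action] / total

    def em_update(self, observed_action, card_prior):
        # E-step: probability of the observed action under each hidden card,
        # normalised into a posterior over the hidden information.
        joint = {c: p * self.prob(c, observed_action)
                 for c, p in card_prior.items()}
        z = sum(joint.values())
        posterior = {c: v / z for c, v in joint.items()}
        # M-step: credit each card's path with its expected number of visits.
        for c, w in posterior.items():
            self.counts[c][observed_action] += w
        return posterior
```

Each unobserved card's path receives a fractional visit equal to its posterior weight, so repeated observations gradually shift the modelled action probabilities.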
22 Online Expectation Maximisation
[Figure: a small poker game tree. Chance deals each player a Jack (J) or King (K) with probability .5 each; at information sets I1/I2 the opponent calls (C) or raises (R) with the probabilities shown (e.g. .6C/.4R at one information set, .3C/.7R at another), and at I3-I8 the players fold (F) or call (C) for the terminal payoffs.]
J = Jack, K = King, F = Fold, C = Call, R = Raise
23 Online Expectation Maximisation
[Figure: the same poker game tree as on the previous slide.]
Assume we are P1; we got a Jack; opponent P2 got a Jack or a King.
24 Online Expectation Maximisation
[Figure: the same poker game tree as on the previous slide.]
Compute Pr((J, J, R, F) | σ_{-i}) and Pr((J, K, R, F) | σ_{-i}), then update the visits to paths (J, J, R, F) and (J, K, R, F) by the corresponding normalised probabilities.
25 No-Regret Learning
Our no-regret method is based on counterfactual regret minimisation (CFR):
- State-of-the-art algorithm that provably minimises regret in two-player, zero-sum, imperfect information games [17]
- In self-play its average strategy profile approaches a Nash equilibrium
- Can handle games with 10^12 states (10^10 states was the previous limit, using Nesterov's excessive gap technique; limit poker has roughly 10^18 states)
- Needs the opponent's strategy; we use an online version that removes this requirement
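The per-information-set rule at the heart of CFR is regret matching: play each action with probability proportional to its positive cumulative regret (uniform when no regret is positive). A generic sketch, not the exact implementation behind the talk's results:

```python
def regret_matching(cum_regret):
    """Strategy proportional to positive cumulative regrets."""
    positive = [max(r, 0.0) for r in cum_regret]
    z = sum(positive)
    n = len(cum_regret)
    if z <= 0:
        return [1.0 / n] * n          # no positive regret: play uniformly
    return [p / z for p in positive]

def update_regrets(cum_regret, action_values, strategy):
    """Add each action's regret: its value minus the strategy's expected value."""
    ev = sum(p * v for p, v in zip(strategy, action_values))
    return [r + v - ev for r, v in zip(cum_regret, action_values)]
```

Iterating these two functions at every information set, with counterfactual values as the action values, is what drives the average strategy toward equilibrium in self-play.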
26 Results of Second Approach
Has higher average payoffs per game and a higher final performance than the no-regret algorithm on its own such that in...
- Die-roll poker and Rhode Island hold 'em: learns to win against all opponents (except near Nash, where it draws)
But online expectation maximisation seems less effective in Rhode Island hold 'em compared to die-roll poker - investigating why.
27 Summary of Second Approach
We associate a sequence predictor with each game state from the opponent's perspective (opponent information set).
At the end of a game we:
- Predict the opponent's hidden information by online expectation maximisation
- Update the sequence predictors along the path associated with the predicted hidden information and public actions
- Update our strategy with the reward from the actual game as well as the rewards from a number of simulated games
28 Summary
Maximise our rewards when an opponent's actions can affect them.
Use games to focus on the agent and worry less about the environment.
Approaches:
1 Reinforcement learning + sequence prediction + lookahead
2 Sequence prediction + online EM + no-regret + simulation
29 References I
[1] Backgammon Programming. Accessed: 2013.
[2] Chinook vs. the Checkers Champ - Top Man-vs.-Machine Moments - TIME. Accessed: 2013.
[3] Scrabble Showdown: Quackle vs. David Boys - Top Man-vs.-Machine Moments - TIME. Accessed: 2013.
[4] IBM - Deep Blue. Accessed: 2013.
[5] Othello match of the year. Accessed: 2013.
[6] CrazyStone at Sensei's Library. Accessed: 2013.
[7] Man vs Machine II - Polaris vs Online Poker's Best. Accessed: 2013.
[8] IBM Watson. Accessed: 2013.
[9] Richard Mealing and Jonathan L. Shapiro. Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games. In: 12th International Conference on Artificial Intelligence and Soft Computing. 2013.
[10] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In: Proc. of 11th ICML. Morgan Kaufmann, 1994.
[11] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis. Cambridge, 1989.
[12] Jensen et al. Non-stationary policy learning in 2-player zero sum games. In: Proc. of 20th Int. Conf. on AI. 2005.
30 References II
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. In: Journal of the Royal Statistical Society 39 (1977), pp. 1-38.
[14] Olivier Cappé and Eric Moulines. Online EM Algorithm for Latent Data Models. In: Journal of the Royal Statistical Society 71 (2008).
[15] Marc Lanctot et al. No-Regret Learning in Extensive-Form Games with Imperfect Recall. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12). 2012.
[16] Jiefu Shi and Michael L. Littman. Abstraction Methods for Game Theoretic Poker. In: Revised Papers from the Second International Conference on Computers and Games. 2001.
[17] Martin Zinkevich et al. Regret Minimization in Games with Incomplete Information. In: Advances in Neural Information Processing Systems 20. 2007.
[18] G. W. Brown. Activity Analysis of Production and Allocation. Ed. by T. J. Koopmans. New York: Wiley, 1951. Chap. Iterative Solutions of Games by Fictitious Play.
[19] Carmel and Markovitch. Learning Models of Intelligent Agents. In: Proc. of 13th Int. Conf. on AI. AAAI, 1996.
[20] John M. Butterworth. Stability of gradient-based learning dynamics in two-agent imperfect-information games. PhD thesis. The University of Manchester, 2010.
[21] Knoll and de Freitas. A Machine Learning Perspective on Predictive Coding with PAQ. arXiv:1108.3298. 2011.
31 Appendix: Future Work
- Change detection methods to discard outdated observations
- Use the opponent model more when it is more accurate
- More challenging domains, e.g. n-player, continuous values
- Real-world applications, e.g. peer-to-peer file sharing
- Use implicit as well as explicit opponent modelling
32 Appendix: Potential Applications
- Learning conditional and adaptive strategies
- Adapting to user interaction
- Adjusting the workload or relocating the system resources
- Responding to network traffic (p2p, spam filtering, virus detection)
Overlapping areas: speech recognition/synthesis/tagging, musical score, machine translation, gene prediction, DNA/protein sequence classification/identification, bioinformatics, handwriting, gesture recognition, partial discharges, cryptanalysis, protein folding, metamorphic virus detection, statistical process control, robotic teams, distributed control, resource management, collaborative decision support systems, economics, industrial manufacturing, complex simulations, combinatorial search, etc.
33 Appendix: What has been tried before?
- Fictitious play assumes a Markov model opponent strategy [18]
- Unsupervised L* infers deterministic finite automata models [19]
- ELPH defeated human and agent players in rock-paper-scissors [12]
- Stochastic gradient ascent with the lagging anchor algorithm [20]
- PAQ8L defeated human players in rock-paper-scissors [21]
34 Appendix: Counterfactual Regret Minimisation
Counterfactual Value:

v_i(I | σ) = Σ_{n ∈ I} Pr(n | σ_{-i}) u_i(n)
u_i(n) = ( Σ_{z ∈ Z[n]} Pr(z | σ) u_i(z) ) / Pr(n | σ)

v_i(I | σ) = player i's counterfactual value of information set I given strategy profile σ
Pr(n | σ_{-i}) = probability of reaching node n from the root given the opponent's strategy
u_i(n) = player i's expected reward at node n
Pr(n | σ) = probability of reaching node n from the root given all players' strategies
Z[n] = set of terminal nodes that can be reached from node n
u_i(z) = player i's reward at terminal node z
35 Appendix: Counterfactual Regret Minimisation
Counterfactual Regret:

r_i(I, a) = v_i(I | σ_{I→a}) - v_i(I | σ)

r_i(I, a) = player i's counterfactual regret of not playing action a at information set I
σ_{I→a} = same as σ except a is always played at I
v_i(I | σ_{I→a}) = player i's counterfactual value of playing action a at information set I
v_i(I | σ) = player i's counterfactual value of playing their strategy at information set I
36 Appendix: Counterfactual Regret Minimisation
Sampled Counterfactual Value:

ṽ_i(I | σ, Q_j) = Σ_{n ∈ I} Pr(n | σ_{-i}) ũ_i(n | Q_j)
ũ_i(n | Q_j) = ( Σ_{z ∈ Q_j ∩ Z[n]} Pr(z | σ) u_i(z) / q(z) ) / Pr(n | σ)
q(z) = Σ_{j : z ∈ Q_j} q_j

ṽ_i(I | σ, Q_j) = player i's sampled counterfactual value of I given strategy profile σ and Q_j
Q_j = set of sampled terminal nodes
Pr(n | σ_{-i}) = probability of reaching node n from the root given the opponent's strategy
ũ_i(n | Q_j) = player i's sampled expected reward at node n given Q_j
Pr(n | σ) = probability of reaching node n from the root given all players' strategies
Z[n] = set of terminal nodes that can be reached from node n
u_i(z) = player i's reward at terminal node z
q_j = probability of sampling Q_j
37 Appendix: Counterfactual Regret Minimisation
Outcome Sampling (|Q_j| = 1, so Q_j = {z} and q(z) = q_j). Let σ' be the sampling strategy profile, so q(z) = Pr(z | σ'), and let n be the node in I on the path to z:

ṽ_i(I | σ, {z}) = Pr(n | σ_{-i}) Pr(z | σ) u_i(z) / ( Pr(n | σ) q(z) )
              = Pr(n | σ_{-i}) Pr(z | σ) u_i(z) / ( Pr(n | σ) Pr(z | σ') )
              = Pr(n | σ_{-i}) Pr(z | σ_i) Pr(z | σ_{-i}) u_i(z) / ( Pr(n | σ_i) Pr(n | σ_{-i}) Pr(z | σ'_i) Pr(z | σ'_{-i}) )
              = Pr(z | σ_i) Pr(z | σ_{-i}) u_i(z) / ( Pr(n | σ_i) Pr(z | σ'_i) Pr(z | σ'_{-i}) )
              = Pr(z | σ_i) u_i(z) / ( Pr(n | σ_i) Pr(z | σ'_i) )

where the last step assumes σ'_{-i} = σ_{-i}.
38 Appendix: Zero-Determinant Strategies
- Unilaterally set an opponent's expected payoff in the iterated prisoner's dilemma, irrespective of the opponent's strategy
- Turns the prisoner's dilemma into an ultimatum game
- Works well against evolutionary players without an opponent model
- An opponent model could recognise the unfair offer and refuse
More informationUniversity of Alberta. Richard Gibson. Doctor of Philosophy. Department of Computing Science
University of Alberta REGRET MINIMIZATION IN GAMES AND THE DEVELOPMENT OF CHAMPION MULTIPLAYER COMPUTER POKER-PLAYING AGENTS by Richard Gibson A thesis submitted to the Faculty of Graduate Studies and
More informationLecture 8: Policy Gradient
Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve
More informationConvergence to Pareto Optimality in General Sum Games via Learning Opponent s Preference
Convergence to Pareto Optimality in General Sum Games via Learning Opponent s Preference Dipyaman Banerjee Department of Math & CS University of Tulsa Tulsa, OK, USA dipyaman@gmail.com Sandip Sen Department
More informationReinforcement Learning (1)
Reinforcement Learning 1 Reinforcement Learning (1) Machine Learning 64-360, Part II Norman Hendrich University of Hamburg, Dept. of Informatics Vogt-Kölln-Str. 30, D-22527 Hamburg hendrich@informatik.uni-hamburg.de
More informationCS 188: Artificial Intelligence
CS 188: Artificial Intelligence Adversarial Search II Instructor: Anca Dragan University of California, Berkeley [These slides adapted from Dan Klein and Pieter Abbeel] Minimax Example 3 12 8 2 4 6 14
More informationMachine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396
Machine Learning Reinforcement learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1396 1 / 32 Table of contents 1 Introduction
More informationLearning to Compete, Compromise, and Cooperate in Repeated General-Sum Games
Learning to Compete, Compromise, and Cooperate in Repeated General-Sum Games Jacob W. Crandall Michael A. Goodrich Computer Science Department, Brigham Young University, Provo, UT 84602 USA crandall@cs.byu.edu
More informationFictitious Self-Play in Extensive-Form Games
Johannes Heinrich, Marc Lanctot, David Silver University College London, Google DeepMind July 9, 05 Problem Learn from self-play in games with imperfect information. Games: Multi-agent decision making
More informationAn Adaptive Clustering Method for Model-free Reinforcement Learning
An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at
More informationCourse 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016
Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the
More informationARTIFICIAL INTELLIGENCE. Reinforcement learning
INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
More informationCS 188 Introduction to Fall 2007 Artificial Intelligence Midterm
NAME: SID#: Login: Sec: 1 CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm You have 80 minutes. The exam is closed book, closed notes except a one-page crib sheet, basic calculators only.
More informationOptimism in the Face of Uncertainty Should be Refutable
Optimism in the Face of Uncertainty Should be Refutable Ronald ORTNER Montanuniversität Leoben Department Mathematik und Informationstechnolgie Franz-Josef-Strasse 18, 8700 Leoben, Austria, Phone number:
More informationQ-Learning in Continuous State Action Spaces
Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental
More informationGeneralized Sampling and Variance in Counterfactual Regret Minimization
Generalized Sampling and Variance in Counterfactual Regret Minimization Richard Gison and Marc Lanctot and Neil Burch and Duane Szafron and Michael Bowling Department of Computing Science, University of
More informationBalancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm
Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu
More informationDecision Theory: Markov Decision Processes
Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies
More informationShort Course: Multiagent Systems. Multiagent Systems. Lecture 1: Basics Agents Environments. Reinforcement Learning. This course is about:
Short Course: Multiagent Systems Lecture 1: Basics Agents Environments Reinforcement Learning Multiagent Systems This course is about: Agents: Sensing, reasoning, acting Multiagent Systems: Distributed
More informationA Parameterized Family of Equilibrium Profiles for Three-Player Kuhn Poker
A Parameterized Family of Equilibrium Profiles for Three-Player Kuhn Poker Duane Szafron University of Alberta Edmonton, Alberta dszafron@ualberta.ca Richard Gibson University of Alberta Edmonton, Alberta
More informationToday s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning
CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides
More informationMachine Learning I Reinforcement Learning
Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:
More informationSolving Heads-up Limit Texas Hold em
Solving Heads-up Limit Texas Hold em Oskari Tammelin, 1 Neil Burch, 2 Michael Johanson 2 and Michael Bowling 2 1 http://jeskola.net, ot@iki.fi 2 Department of Computing Science, University of Alberta {nburch,johanson,mbowling}@ualberta.ca
More informationConvergence Rate of Expectation-Maximization
Convergence Rate of Expectation-Maximiation Raunak Kumar University of British Columbia Mark Schmidt University of British Columbia Abstract raunakkumar17@outlookcom schmidtm@csubcca Expectation-maximiation
More informationProf. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be
REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while
More informationCS 188: Artificial Intelligence Spring 2007
CS 188: Artificial Intelligence Spring 2007 Lecture 8: Logical Agents - I 2/8/2007 Srini Narayanan ICSI and UC Berkeley Many slides over the course adapted from Dan Klein, Stuart Russell or Andrew Moore
More informationCyclic Equilibria in Markov Games
Cyclic Equilibria in Markov Games Martin Zinkevich and Amy Greenwald Department of Computer Science Brown University Providence, RI 02912 {maz,amy}@cs.brown.edu Michael L. Littman Department of Computer
More informationAlgorithmic Strategy Complexity
Algorithmic Strategy Complexity Abraham Neyman aneyman@math.huji.ac.il Hebrew University of Jerusalem Jerusalem Israel Algorithmic Strategy Complexity, Northwesten 2003 p.1/52 General Introduction The
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationCOMP3702/7702 Artificial Intelligence Week1: Introduction Russell & Norvig ch.1-2.3, Hanna Kurniawati
COMP3702/7702 Artificial Intelligence Week1: Introduction Russell & Norvig ch.1-2.3, 3.1-3.3 Hanna Kurniawati Today } What is Artificial Intelligence? } Better know what it is first before committing the
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of
More informationDeep Reinforcement Learning: Policy Gradients and Q-Learning
Deep Reinforcement Learning: Policy Gradients and Q-Learning John Schulman Bay Area Deep Learning School September 24, 2016 Introduction and Overview Aim of This Talk What is deep RL, and should I use
More informationReduced Space and Faster Convergence in Imperfect-Information Games via Pruning
Reduced Space and Faster Convergence in Imperfect-Information Games via Pruning Noam Brown 1 uomas Sandholm 1 Abstract Iterative algorithms such as Counterfactual Regret Minimization CFR) are the most
More informationBasic Game Theory. Kate Larson. January 7, University of Waterloo. Kate Larson. What is Game Theory? Normal Form Games. Computing Equilibria
Basic Game Theory University of Waterloo January 7, 2013 Outline 1 2 3 What is game theory? The study of games! Bluffing in poker What move to make in chess How to play Rock-Scissors-Paper Also study of
More informationMarkov Decision Processes
Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour
More informationChapter 6: Temporal Difference Learning
Chapter 6: emporal Difference Learning Objectives of this chapter: Introduce emporal Difference (D) learning Focus first on policy evaluation, or prediction, methods hen extend to control methods R. S.
More informationUsing Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *
Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
More information