Learning an Effective Strategy in a Multi-Agent System with Hidden Information
1 Learning an Effective Strategy in a Multi-Agent System with Hidden Information
Richard Mealing
Supervisor: Jon Shapiro
Machine Learning and Optimisation Group, School of Computer Science, University of Manchester
2 Our Problem: Maximising Reward with an Opponent
We focus on the simplest case with just 2 agents.
Each agent is trying to maximise its own rewards.
But each agent's actions can affect the other agent's rewards.
3 Our Proposal: Predict and Adapt to the Future
Before maximising our rewards we learn:
- What our rewards are for actions - use reinforcement/no-regret learning
- How the opponent will act - use sequence prediction methods
To maximise our rewards:
- Lookahead - take the actions with the maximum expected reward
- Simulate - adapt our strategy to rewards against the opponent model
Hidden information - what did the opponent base their decision on?
Learn the hidden information using online expectation maximisation.
4 Why Games?
Games let you focus on the agent and worry less about the environment:
- Well-defined rules and clear goals
- Can allow easy agent comparisons
- Can allow complex strategies
- Game theory gives a foundation
5 Artificial Intelligence Success in Games
- Backgammon: BKG 9.8 beat world champion Luigi Villa [1]
- Checkers: Chinook beat world champion Marion Tinsley [2]
- Scrabble: Quackle beat former champion David Boys [3]
- Chess: Deep Blue beat world champion Garry Kasparov [4]
- Othello (Reversi): Logistello beat world champion Takeshi Murakami [5]
- Go: Crazy Stone beat various pros [6]
- Poker: Polaris beat various pros in heads-up limit Texas hold 'em [7]
- Jeopardy!: Watson beat former winners Brad Rutter and Ken Jennings [8]
6 Perfect and Imperfect Information
Perfect information - players always know the state, e.g. Tic Tac Toe, Checkers.
Imperfect information - at some point a player doesn't know the state, e.g. Rock Paper Scissors, Poker.
7 First Approach
1 Reinforcement learning (Q-Learning) to learn our own rewards
2 Sequence prediction to learn the opponent's strategy
3 Exhaustive explicit lookahead (to a limited depth) with 1 and 2 to take the actions with the maximum expected reward
Outperforms state-of-the-art reinforcement learning agents in:
- Rock Paper Scissors
- Prisoner's Dilemma
- Littman's Soccer [10]
Richard Mealing and Jonathan L. Shapiro. Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games. In: 12th International Conference on Artificial Intelligence and Soft Computing [9].
8 Reinforcement Learning
We use Q(uality)-Learning to learn the rewards for action sequences.
Comparison agents use Q-Learning or Q-Learning based methods.
Q-Learning learns the expected value of taking an action in a state and then following a fixed strategy [11]:

Q(s_t, a^pla_t) <- (1 - α) Q(s_t, a^pla_t) + α [ r_t + γ max_{a^pla_{t+1}} Q(s_{t+1}, a^pla_{t+1}) ]

where s_t = state at time t, a^pla_t = player's action at time t, r_t = reward at time t, α = learning rate, γ = discount factor.
We use Q(s_t, a^pla_t) with lookahead and some exploration.
Comparison agents select max_{a^pla_t} Q(s_t, a^pla_t) with some exploration.
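As an illustration, the update rule above can be written as a small tabular learner. This is a minimal sketch (class name, parameters and state/action types are arbitrary choices, not the agents compared in the talk):

```python
from collections import defaultdict

class QLearner:
    """Tabular Q-learning: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""

    def __init__(self, actions, alpha=0.1, gamma=0.9):
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.actions = actions
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor

    def update(self, s, a, r, s_next):
        # Bootstrap from the best action available in the next state.
        best_next = max(self.q[(s_next, b)] for b in self.actions)
        self.q[(s, a)] = (1 - self.alpha) * self.q[(s, a)] + \
            self.alpha * (r + self.gamma * best_next)

    def greedy_action(self, s):
        # What the comparison agents do: pick the argmax action.
        return max(self.actions, key=lambda a: self.q[(s, a)])
```

The proposed agent instead feeds these Q-values into the lookahead search described later, rather than acting greedily on them.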
9 Sequence Prediction
Markov model - the probability of the opponent's action a^opp_t depends only on the current state s_t: Pr(a^opp_t | s_t).
Sequence prediction - the probability of the opponent's action depends on a history H: Pr(a^opp_t | H), where H ⊆ {s_t, a_{t-1}, s_{t-1}, a_{t-2}, s_{t-2}, ..., a_1, s_1}.
10 Sequence Prediction Methods
Long-term memory L - a set of distributions, each one conditioned on a different history H:
L = {Pr(a^opp_t | H) : H ⊆ {s_t, a_{t-1}, s_{t-1}, a_{t-2}, s_{t-2}, ..., a_1, s_1}}
Short-term memory S - a list of recent observations (states/actions):
S = (o_t, o_{t-1}, o_{t-2}, ..., o_{t-n})
Observing a symbol o_t:
1 Generate a set of histories H = {H_1, H_2, ...} using S
2 For each H ∈ H create/update Pr(a^opp_t | H) using o_t
3 Add o_t to S (remove the oldest observation if needed)
Predicting an opponent action a^opp_t:
1 Generate a set of histories H = {H_1, H_2, ...} using S
2 Predict using {Pr(a^opp_t | H) : H ∈ H}
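The observe/predict loop above can be sketched as follows. One concrete choice of histories H is assumed here (all suffixes of the short-term memory, plus the empty history), and prediction falls back to shorter histories; both are illustrative assumptions, not the exact method from the talk:

```python
from collections import Counter, deque

class SequencePredictor:
    def __init__(self, n=3):
        self.short = deque(maxlen=n)   # short-term memory S (last n symbols)
        self.long = {}                 # long-term memory L: history -> Counter of next symbols

    def _histories(self):
        # One simple choice of H: every suffix of S, including the empty history.
        s = tuple(self.short)
        return [s[i:] for i in range(len(s) + 1)]

    def observe(self, opp_action):
        for h in self._histories():    # create/update Pr(a_opp | H) for each H
            self.long.setdefault(h, Counter())[opp_action] += 1
        self.short.append(opp_action)  # oldest symbol drops automatically (maxlen)

    def predict(self):
        # Prefer the longest history we have statistics for.
        for h in sorted(self._histories(), key=len, reverse=True):
            if h in self.long:
                return self.long[h].most_common(1)[0][0]
        return None
```

Against a cyclic opponent (R, P, S, R, P, S, ...) the longest-suffix table quickly learns that the context (R, P, S) is followed by R.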
11 Sequence Prediction Method Example
Entropy Learned Pruned Hypothesis Space (ELPH) [12]:
Inputs: memory size n and entropy threshold e.
Observing a symbol o_t:
1 Generate the powerset P(S) = H of short-term memory S, where
S = (o_t, o_{t-1}, o_{t-2}, ..., o_{t-n}) and
P(S) = {{}, {o_1}, ..., {o_n}, {o_1, o_2}, ..., {o_1, o_n}, ..., {o_1, o_2, ..., o_n}}
2 For each H ∈ H create/update Pr(a^opp_t | H) using o_t
3 For each H ∈ H, if Entropy(Pr(a^opp_t | H)) > e then discard it
4 Add o_t to S (remove the oldest observation if |S| > n)
Predicting an opponent action a^opp_t:
1 Generate the powerset P(S) = H of short-term memory S
2 Predict using arg min over H ∈ H of Entropy(Pr(a^opp_t | H))
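A minimal sketch of the ELPH procedure above. Details such as how the short-term-memory positions are encoded in each hypothesis are assumptions of this sketch, not taken from the talk or the ELPH paper:

```python
import math
from collections import Counter, deque
from itertools import combinations

class ELPH:
    def __init__(self, n=3, entropy_threshold=1.0):
        self.e = entropy_threshold
        self.short = deque(maxlen=n)   # short-term memory S
        self.long = {}                 # hypothesis -> Counter over next symbols

    def _powerset(self):
        # Hypotheses are subsets of S; positions are kept so (R,P) != (P,R).
        items = list(enumerate(self.short))
        return [frozenset(c) for r in range(len(items) + 1)
                for c in combinations(items, r)]

    @staticmethod
    def _entropy(counter):
        total = sum(counter.values())
        return -sum((c / total) * math.log2(c / total) for c in counter.values())

    def observe(self, symbol):
        for h in self._powerset():                 # create/update Pr(a_opp | H)
            self.long.setdefault(h, Counter())[symbol] += 1
        for h in self._powerset():                 # prune high-entropy hypotheses
            if h in self.long and self._entropy(self.long[h]) > self.e:
                del self.long[h]
        self.short.append(symbol)                  # oldest symbol drops (maxlen)

    def predict(self):
        known = [h for h in self._powerset() if h in self.long]
        if not known:
            return None
        best = min(known, key=lambda h: self._entropy(self.long[h]))
        return self.long[best].most_common(1)[0][0]
```

The pruning step is what keeps the powerset tractable: hypotheses whose prediction distribution stays too uncertain are repeatedly discarded.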
12 Lookahead Example
Prisoner's dilemma payoff matrix (row player's reward listed first):

        D      C
  D    1,1    4,0
  C    0,4    3,3
13 Lookahead Example
Defect is the dominant action (highest reward).
Cooperate-Cooperate is socially optimal (highest sum of rewards).
Tit-for-tat (copy opponent's last move) is good for repeated play.
Can we learn to play optimally against tit-for-tat?
14 Lookahead Example
[Figure: depth-1 lookahead tree alongside the payoff matrix. Under the prediction that the opponent plays C, the player's D and C branches yield rewards 4 and 3; under the prediction D, they yield 1 and 0.]
15 Lookahead Example
[Figure: depth-1 lookahead tree alongside the payoff matrix, as on the previous slide.]
With lookahead 1, D has the highest reward.
With lookahead 2, (D,C,D,C) has the highest total reward (unlikely).
Assume the opponent copies the player's last move (i.e. tit-for-tat).
16 Lookahead Example
[Figure: depth-2 lookahead tree: after each predicted opponent reply, the player's D and C branches expand again.]
17 Lookahead Example
[Figure: depth-2 lookahead tree against tit-for-tat, with cumulative rewards at the leaves.]
With lookahead of 2 against tit-for-tat, C has the highest reward.
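The conclusion of this example can be checked with a tiny exhaustive lookahead against a tit-for-tat opponent model. This is a sketch using the payoff matrix above (a plain enumeration, not the talk's implementation; the opponent's assumed first move is a parameter):

```python
from itertools import product

# Prisoner's dilemma payoffs from the slide's matrix (row player's reward).
PAYOFF = {('D', 'D'): 1, ('D', 'C'): 4, ('C', 'D'): 0, ('C', 'C'): 3}

def best_first_action(depth, opp_first='C'):
    """Exhaustive lookahead: first action of the action sequence with
    maximum total reward, assuming the opponent plays tit-for-tat."""
    best_seq, best_reward = None, float('-inf')
    for seq in product('DC', repeat=depth):
        opp, total = opp_first, 0
        for a in seq:
            total += PAYOFF[(a, opp)]
            opp = a            # tit-for-tat: the opponent copies our last move
        if total > best_reward:
            best_seq, best_reward = seq, total
    return best_seq[0], best_reward
```

At depth 1 the myopic choice is D (reward 4 vs 3), but at depth 2 the sequence (C, D) earns 3 + 4 = 7, so the first action taken is C, matching the slide.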
18 Results of First Approach
Converges to higher average payoffs per game at faster rates than reinforcement learning algorithms such that in...
- Iterated rock-paper-scissors: learns to best-respond against variable-order Markov models
- Iterated prisoner's dilemma: comes first in a tournament against finite automata
- Littman's soccer: wins 7% of games against reinforcement learning algorithms
19 Summary of First Approach
We associate a sequence predictor with each game state.
During a game we update our:
- Rewards for action sequences using Q-Learning
- Sequence predictors with observed opponent actions
At each decision point we look ahead and take the first action of an action sequence with the maximum expected cumulative reward.
20 Second Approach
1 Sequence prediction to learn the opponent's strategy
2 Online expectation maximisation [13, 14] to predict the opponent's hidden information (to know the history H with which to update our opponent model)
3 No-regret learning algorithm to adjust our strategy
4 Simulate games against our opponent model
Improves no-regret algorithm performance vs itself, a state-of-the-art reinforcement learning agent and a popular bandit algorithm in:
- Die-Roll Poker [15]
- Rhode Island Hold 'em [16]
21 Online Expectation Maximisation
A rational agent will act based on its hidden information.
At the end of a game, we have observed the opponent's (public) actions but not necessarily their hidden information (e.g. they folded).
Expectation step:
1 For each possible instance of hidden information the opponent could hold, calculate the probability of their actions
2 Normalise these probabilities
Each normalised probability corresponds to the expected number of opponent visits to the path associated with that hidden information.
Maximisation step: update the opponent's action probabilities along each path to account for their expected number of visits.
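A toy sketch of this E-step/M-step cycle for a single opponent decision point. The model (per-card pseudo-counts over actions) and all numbers are illustrative assumptions, not the talk's implementation:

```python
class OpponentModel:
    def __init__(self, cards, actions):
        # Visit counts per hidden card, with a uniform pseudo-count prior.
        self.counts = {c: {a: 1.0 for a in actions} for c in cards}

    def prob(self, card, action):
        total = sum(self.counts[card].values())
        return self.counts[card][action] / total

    def em_update(self, observed_action, card_prior):
        # E-step: probability of the observed action under each hidden card,
        # normalised into a posterior over the hidden information.
        joint = {c: p * self.prob(c, observed_action)
                 for c, p in card_prior.items()}
        z = sum(joint.values())
        posterior = {c: v / z for c, v in joint.items()}
        # M-step: credit each card's path with its expected number of visits.
        for c, w in posterior.items():
            self.counts[c][observed_action] += w
        return posterior
```

Each unobserved card's path receives a fractional visit equal to its posterior weight, so repeated observations gradually shift the modelled action probabilities.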
22 Online Expectation Maximisation
[Figure: a small poker game tree. Chance deals each player a Jack (J) or King (K) with probability .5 each; at information sets I1/I2 the opponent calls (C) or raises (R) with the probabilities shown (e.g. .6C/.4R at one information set, .3C/.7R at another), and at I3-I8 the players fold (F) or call (C) for the terminal payoffs.]
J = Jack, K = King, F = Fold, C = Call, R = Raise
23 Online Expectation Maximisation
[Figure: the same poker game tree as on the previous slide.]
Assume we are P1; we got a Jack; opponent P2 got a Jack or a King.
24 Online Expectation Maximisation
[Figure: the same poker game tree as on the previous slide.]
Compute Pr((J, J, R, F) | σ_{-i}) and Pr((J, K, R, F) | σ_{-i}), then update the visits to paths (J, J, R, F) and (J, K, R, F) by the corresponding normalised probabilities.
25 No-Regret Learning
Our no-regret method is based on counterfactual regret minimisation (CFR):
- State-of-the-art algorithm that provably minimises regret in two-player, zero-sum, imperfect information games [17]
- In self-play its average strategy profile approaches a Nash equilibrium
- Can handle games with 10^12 states (10^10 states was the previous limit, using Nesterov's excessive gap technique; limit poker has roughly 10^18 states)
- Needs the opponent's strategy; we use an online version that removes this requirement
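The per-information-set rule at the heart of CFR is regret matching: play each action with probability proportional to its positive cumulative regret (uniform when no regret is positive). A generic sketch, not the exact implementation behind the talk's results:

```python
def regret_matching(cum_regret):
    """Strategy proportional to positive cumulative regrets."""
    positive = [max(r, 0.0) for r in cum_regret]
    z = sum(positive)
    n = len(cum_regret)
    if z <= 0:
        return [1.0 / n] * n          # no positive regret: play uniformly
    return [p / z for p in positive]

def update_regrets(cum_regret, action_values, strategy):
    """Add each action's regret: its value minus the strategy's expected value."""
    ev = sum(p * v for p, v in zip(strategy, action_values))
    return [r + v - ev for r, v in zip(cum_regret, action_values)]
```

Iterating these two functions at every information set, with counterfactual values as the action values, is what drives the average strategy toward equilibrium in self-play.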
26 Results of Second Approach
Has higher average payoffs per game and a higher final performance than the no-regret algorithm on its own such that in...
- Die-roll poker and Rhode Island hold 'em: learns to win against all opponents (except near Nash, where it draws)
But online expectation maximisation seems less effective in Rhode Island hold 'em compared to die-roll poker - investigating why.
27 Summary of Second Approach
We associate a sequence predictor with each game state from the opponent's perspective (opponent information set).
At the end of a game we:
- Predict the opponent's hidden information by online expectation maximisation
- Update the sequence predictors along the path associated with the predicted hidden information and public actions
- Update our strategy with the reward from the actual game as well as the rewards from a number of simulated games
28 Summary
Maximise our rewards when an opponent's actions can affect them.
Use games to focus on the agent and worry less about the environment.
Approaches:
1 Reinforcement learning + sequence prediction + lookahead
2 Sequence prediction + online EM + no-regret + simulation
29 References I
[1] Backgammon Programming. Accessed: 2013.
[2] Chinook vs. the Checkers Champ - Top Man-vs.-Machine Moments - TIME. Accessed: 2013.
[3] Scrabble Showdown: Quackle vs. David Boys - Top Man-vs.-Machine Moments - TIME. Accessed: 2013.
[4] IBM - Deep Blue. Accessed: 2013.
[5] Othello match of the year. Accessed: 2013.
[6] CrazyStone at Sensei's Library. Accessed: 2013.
[7] Man vs Machine II - Polaris vs Online Poker's Best. Accessed: 2013.
[8] IBM Watson. Accessed: 2013.
[9] Richard Mealing and Jonathan L. Shapiro. Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games. In: 12th International Conference on Artificial Intelligence and Soft Computing. 2013.
[10] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In: Proc. of 11th ICML. Morgan Kaufmann, 1994.
[11] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis. Cambridge, 1989.
[12] Jensen et al. Non-stationary policy learning in 2-player zero sum games. In: Proc. of 20th Int. Conf. on AI. 2005.
30 References II
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. In: Journal of the Royal Statistical Society 39 (1977), pp. 1-38.
[14] Olivier Cappé and Eric Moulines. Online EM Algorithm for Latent Data Models. In: Journal of the Royal Statistical Society 71 (2008).
[15] Marc Lanctot et al. No-Regret Learning in Extensive-Form Games with Imperfect Recall. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12). 2012.
[16] Jiefu Shi and Michael L. Littman. Abstraction Methods for Game Theoretic Poker. In: Revised Papers from the Second International Conference on Computers and Games. 2001.
[17] Martin Zinkevich et al. Regret Minimization in Games with Incomplete Information. In: Advances in Neural Information Processing Systems 20. 2007.
[18] G. W. Brown. Activity Analysis of Production and Allocation. Ed. by T. J. Koopmans. New York: Wiley, 1951. Chap. Iterative Solutions of Games by Fictitious Play.
[19] Carmel and Markovitch. Learning Models of Intelligent Agents. In: Proc. of 13th Int. Conf. on AI. AAAI, 1996.
[20] John M. Butterworth. Stability of gradient-based learning dynamics in two-agent imperfect-information games. PhD thesis. The University of Manchester, 2010.
[21] Knoll and de Freitas. A Machine Learning Perspective on Predictive Coding with PAQ. arXiv:1108.3298. 2011.
31 Appendix: Future Work
- Change detection methods to discard outdated observations
- Use the opponent model more when it is more accurate
- More challenging domains, e.g. n-player, continuous values
- Real-world applications, e.g. peer-to-peer file sharing
- Use implicit as well as explicit opponent modelling
32 Appendix: Potential Applications
- Learning conditional and adaptive strategies
- Adapting to user interaction
- Adjusting the workload or relocating the system resources
- Responding to network traffic (p2p, spam filtering, virus detection)
Overlapping areas: speech recognition/synthesis/tagging, musical score, machine translation, gene prediction, DNA/protein sequence classification/identification, bioinformatics, handwriting, gesture recognition, partial discharges, cryptanalysis, protein folding, metamorphic virus detection, statistical process control, robotic teams, distributed control, resource management, collaborative decision support systems, economics, industrial manufacturing, complex simulations, combinatorial search, etc.
33 Appendix: What has been tried before?
- Fictitious play assumes a Markov model opponent strategy [18]
- Unsupervised L* infers deterministic finite automata models [19]
- ELPH defeated human and agent players in rock-paper-scissors [12]
- Stochastic gradient ascent with the lagging anchor algorithm [20]
- PAQ8L defeated human players in rock-paper-scissors [21]
34 Appendix: Counterfactual Regret Minimisation
Counterfactual Value:

v_i(I | σ) = Σ_{n ∈ I} Pr(n | σ_{-i}) u_i(n)
u_i(n) = ( Σ_{z ∈ Z[n]} Pr(z | σ) u_i(z) ) / Pr(n | σ)

v_i(I | σ) = player i's counterfactual value of information set I given strategy profile σ
Pr(n | σ_{-i}) = probability of reaching node n from the root given the opponent's strategy
u_i(n) = player i's expected reward at node n
Pr(n | σ) = probability of reaching node n from the root given all players' strategies
Z[n] = set of terminal nodes that can be reached from node n
u_i(z) = player i's reward at terminal node z
35 Appendix: Counterfactual Regret Minimisation
Counterfactual Regret:

r_i(I, a) = v_i(I | σ_{I→a}) - v_i(I | σ)

r_i(I, a) = player i's counterfactual regret of not playing action a at information set I
σ_{I→a} = same as σ except a is always played at I
v_i(I | σ_{I→a}) = player i's counterfactual value of playing action a at information set I
v_i(I | σ) = player i's counterfactual value of playing their strategy at information set I
36 Appendix: Counterfactual Regret Minimisation
Sampled Counterfactual Value:

ṽ_i(I | σ, Q_j) = Σ_{n ∈ I} Pr(n | σ_{-i}) ũ_i(n | Q_j)
ũ_i(n | Q_j) = ( Σ_{z ∈ Q_j ∩ Z[n]} Pr(z | σ) u_i(z) / q(z) ) / Pr(n | σ)
q(z) = Σ_{j : z ∈ Q_j} q_j

ṽ_i(I | σ, Q_j) = player i's sampled counterfactual value of I given strategy profile σ and Q_j
Q_j = set of sampled terminal nodes
Pr(n | σ_{-i}) = probability of reaching node n from the root given the opponent's strategy
ũ_i(n | Q_j) = player i's sampled expected reward at node n given Q_j
Pr(n | σ) = probability of reaching node n from the root given all players' strategies
Z[n] = set of terminal nodes that can be reached from node n
u_i(z) = player i's reward at terminal node z
q_j = probability of sampling Q_j
37 Appendix: Counterfactual Regret Minimisation
Outcome Sampling (|Q_j| = 1, so Q_j = {z} and q(z) = q_j). Let σ' be the sampling strategy profile, so q(z) = Pr(z | σ'), and let n be the node in I on the path to z:

ṽ_i(I | σ, {z}) = Pr(n | σ_{-i}) Pr(z | σ) u_i(z) / ( Pr(n | σ) q(z) )
              = Pr(n | σ_{-i}) Pr(z | σ) u_i(z) / ( Pr(n | σ) Pr(z | σ') )
              = Pr(n | σ_{-i}) Pr(z | σ_i) Pr(z | σ_{-i}) u_i(z) / ( Pr(n | σ_i) Pr(n | σ_{-i}) Pr(z | σ'_i) Pr(z | σ'_{-i}) )
              = Pr(z | σ_i) Pr(z | σ_{-i}) u_i(z) / ( Pr(n | σ_i) Pr(z | σ'_i) Pr(z | σ'_{-i}) )
              = Pr(z | σ_i) u_i(z) / ( Pr(n | σ_i) Pr(z | σ'_i) )

where the last step assumes σ'_{-i} = σ_{-i}.
38 Appendix: Zero-Determinant Strategies
- Unilaterally set an opponent's expected payoff in the iterated prisoner's dilemma, irrespective of the opponent's strategy
- Turns the prisoner's dilemma into an ultimatum game
- Works well against evolutionary players without an opponent model
- An opponent model could recognise the unfair offer and refuse
More informationUniversity of Alberta. Richard Gibson. Doctor of Philosophy. Department of Computing Science
University of Alberta REGRET MINIMIZATION IN GAMES AND THE DEVELOPMENT OF CHAMPION MULTIPLAYER COMPUTER POKER-PLAYING AGENTS by Richard Gibson A thesis submitted to the Faculty of Graduate Studies and
More informationLecture 8: Policy Gradient
Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve
More informationConvergence to Pareto Optimality in General Sum Games via Learning Opponent s Preference
Convergence to Pareto Optimality in General Sum Games via Learning Opponent s Preference Dipyaman Banerjee Department of Math & CS University of Tulsa Tulsa, OK, USA dipyaman@gmail.com Sandip Sen Department
More informationReinforcement Learning (1)
Reinforcement Learning 1 Reinforcement Learning (1) Machine Learning 64-360, Part II Norman Hendrich University of Hamburg, Dept. of Informatics Vogt-Kölln-Str. 30, D-22527 Hamburg hendrich@informatik.uni-hamburg.de
More informationCS 188: Artificial Intelligence
CS 188: Artificial Intelligence Adversarial Search II Instructor: Anca Dragan University of California, Berkeley [These slides adapted from Dan Klein and Pieter Abbeel] Minimax Example 3 12 8 2 4 6 14
More informationMachine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396
Machine Learning Reinforcement learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1396 1 / 32 Table of contents 1 Introduction
More informationLearning to Compete, Compromise, and Cooperate in Repeated General-Sum Games
Learning to Compete, Compromise, and Cooperate in Repeated General-Sum Games Jacob W. Crandall Michael A. Goodrich Computer Science Department, Brigham Young University, Provo, UT 84602 USA crandall@cs.byu.edu
More informationFictitious Self-Play in Extensive-Form Games
Johannes Heinrich, Marc Lanctot, David Silver University College London, Google DeepMind July 9, 05 Problem Learn from self-play in games with imperfect information. Games: Multi-agent decision making
More informationAn Adaptive Clustering Method for Model-free Reinforcement Learning
An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at
More informationCourse 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016
Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the
More informationARTIFICIAL INTELLIGENCE. Reinforcement learning
INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
More informationCS 188 Introduction to Fall 2007 Artificial Intelligence Midterm
NAME: SID#: Login: Sec: 1 CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm You have 80 minutes. The exam is closed book, closed notes except a one-page crib sheet, basic calculators only.
More informationOptimism in the Face of Uncertainty Should be Refutable
Optimism in the Face of Uncertainty Should be Refutable Ronald ORTNER Montanuniversität Leoben Department Mathematik und Informationstechnolgie Franz-Josef-Strasse 18, 8700 Leoben, Austria, Phone number:
More informationQ-Learning in Continuous State Action Spaces
Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental
More informationGeneralized Sampling and Variance in Counterfactual Regret Minimization
Generalized Sampling and Variance in Counterfactual Regret Minimization Richard Gison and Marc Lanctot and Neil Burch and Duane Szafron and Michael Bowling Department of Computing Science, University of
More informationBalancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm
Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu
More informationDecision Theory: Markov Decision Processes
Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies
More informationShort Course: Multiagent Systems. Multiagent Systems. Lecture 1: Basics Agents Environments. Reinforcement Learning. This course is about:
Short Course: Multiagent Systems Lecture 1: Basics Agents Environments Reinforcement Learning Multiagent Systems This course is about: Agents: Sensing, reasoning, acting Multiagent Systems: Distributed
More informationA Parameterized Family of Equilibrium Profiles for Three-Player Kuhn Poker
A Parameterized Family of Equilibrium Profiles for Three-Player Kuhn Poker Duane Szafron University of Alberta Edmonton, Alberta dszafron@ualberta.ca Richard Gibson University of Alberta Edmonton, Alberta
More informationToday s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning
CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides
More informationMachine Learning I Reinforcement Learning
Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:
More informationSolving Heads-up Limit Texas Hold em
Solving Heads-up Limit Texas Hold em Oskari Tammelin, 1 Neil Burch, 2 Michael Johanson 2 and Michael Bowling 2 1 http://jeskola.net, ot@iki.fi 2 Department of Computing Science, University of Alberta {nburch,johanson,mbowling}@ualberta.ca
More informationConvergence Rate of Expectation-Maximization
Convergence Rate of Expectation-Maximiation Raunak Kumar University of British Columbia Mark Schmidt University of British Columbia Abstract raunakkumar17@outlookcom schmidtm@csubcca Expectation-maximiation
More informationProf. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be
REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while
More informationCS 188: Artificial Intelligence Spring 2007
CS 188: Artificial Intelligence Spring 2007 Lecture 8: Logical Agents - I 2/8/2007 Srini Narayanan ICSI and UC Berkeley Many slides over the course adapted from Dan Klein, Stuart Russell or Andrew Moore
More informationCyclic Equilibria in Markov Games
Cyclic Equilibria in Markov Games Martin Zinkevich and Amy Greenwald Department of Computer Science Brown University Providence, RI 02912 {maz,amy}@cs.brown.edu Michael L. Littman Department of Computer
More informationAlgorithmic Strategy Complexity
Algorithmic Strategy Complexity Abraham Neyman aneyman@math.huji.ac.il Hebrew University of Jerusalem Jerusalem Israel Algorithmic Strategy Complexity, Northwesten 2003 p.1/52 General Introduction The
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationCOMP3702/7702 Artificial Intelligence Week1: Introduction Russell & Norvig ch.1-2.3, Hanna Kurniawati
COMP3702/7702 Artificial Intelligence Week1: Introduction Russell & Norvig ch.1-2.3, 3.1-3.3 Hanna Kurniawati Today } What is Artificial Intelligence? } Better know what it is first before committing the
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of
More informationDeep Reinforcement Learning: Policy Gradients and Q-Learning
Deep Reinforcement Learning: Policy Gradients and Q-Learning John Schulman Bay Area Deep Learning School September 24, 2016 Introduction and Overview Aim of This Talk What is deep RL, and should I use
More informationReduced Space and Faster Convergence in Imperfect-Information Games via Pruning
Reduced Space and Faster Convergence in Imperfect-Information Games via Pruning Noam Brown 1 uomas Sandholm 1 Abstract Iterative algorithms such as Counterfactual Regret Minimization CFR) are the most
More informationBasic Game Theory. Kate Larson. January 7, University of Waterloo. Kate Larson. What is Game Theory? Normal Form Games. Computing Equilibria
Basic Game Theory University of Waterloo January 7, 2013 Outline 1 2 3 What is game theory? The study of games! Bluffing in poker What move to make in chess How to play Rock-Scissors-Paper Also study of
More informationMarkov Decision Processes
Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour
More informationChapter 6: Temporal Difference Learning
Chapter 6: emporal Difference Learning Objectives of this chapter: Introduce emporal Difference (D) learning Focus first on policy evaluation, or prediction, methods hen extend to control methods R. S.
More informationUsing Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *
Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
More information