Learning an Effective Strategy in a Multi-Agent System with Hidden Information


1 Learning an Effective Strategy in a Multi-Agent System with Hidden Information
Richard Mealing
Supervisor: Jon Shapiro
Machine Learning and Optimisation Group, School of Computer Science, University of Manchester

2 Our Problem: Maximising Reward with an Opponent
We focus on the simplest case with just 2 agents.
Each agent is trying to maximise its own rewards, but each agent's actions can affect the other agent's rewards.

3 Our Proposal: Predict and Adapt to the Future
Before maximising our rewards we learn:
- What our rewards are for actions - use reinforcement/no-regret learning
- How the opponent will act - use sequence prediction methods
To maximise our rewards:
- Lookahead - take the actions with the maximum expected reward
- Simulate - adapt our strategy to rewards against the opponent model
- Hidden information - what did the opponent base their decision on? Learn the hidden information using online expectation maximisation

4 Why Games?
Games let you focus on the agent and worry less about the environment:
- Well-defined rules and clear goals
- Can allow easy agent comparisons
- Can allow complex strategies
- Game theory gives a foundation

5 Artificial Intelligence Success in Games
- Backgammon: BKG 9.8 beat world champion Luigi Villa [1]
- Checkers: Chinook beat world champion Marion Tinsley [2]
- Scrabble: Quackle beat former champion David Boys [3]
- Chess: Deep Blue beat world champion Garry Kasparov [4]
- Othello (Reversi): Logistello beat world champion Takeshi Murakami [5]
- Go: Crazy Stone beat various pros [6]
- Poker: Polaris beat various pros in heads-up limit Texas hold'em [7]
- Jeopardy!: Watson beat former winners Brad Rutter and Ken Jennings [8]

6 Perfect and Imperfect Information
Perfect information - players always know the state, e.g. Tic Tac Toe, Checkers
Imperfect information - at some point a player doesn't know the state, e.g. Rock Paper Scissors, Poker

7 First Approach
1. Reinforcement learning (Q-Learning) to learn our own rewards
2. Sequence prediction to learn the opponent's strategy
3. Exhaustive explicit lookahead (to a limited depth) using 1 and 2 to take the actions with the maximum expected reward
Outperforms state-of-the-art reinforcement learning agents in:
- Rock Paper Scissors
- Prisoner's Dilemma
- Littman's Soccer [10]
[9] Richard Mealing and Jonathan L. Shapiro. Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games. In: 12th International Conference on Artificial Intelligence and Soft Computing. 2013.

8 Reinforcement Learning
We use Q(uality)-Learning to learn the rewards for action sequences.
Comparison agents use Q-Learning or Q-Learning based methods.
Q-Learning learns the expected value of taking an action in a state and then following a fixed strategy [11]:

Q(s_t, a^{pla}_t) \leftarrow (1 - \alpha)\, Q(s_t, a^{pla}_t) + \alpha \left[ r_t + \gamma \max_{a^{pla}_{t+1}} Q(s_{t+1}, a^{pla}_{t+1}) \right]

where s_t = state at time t, a^{pla}_t = player's action at time t, r_t = reward at time t, \alpha = learning rate, \gamma = discount factor.

We use Q(s_t, a^{pla}_t) with lookahead and some exploration; comparison agents select \arg\max_{a^{pla}_t} Q(s_t, a^{pla}_t) with some exploration.
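A minimal sketch of the tabular Q-Learning update above in Python; the epsilon-greedy exploration, parameter values and class name are illustrative assumptions rather than the agents' actual settings.

```python
import random
from collections import defaultdict

class QLearner:
    """Tabular Q-Learning with epsilon-greedy exploration (illustrative sketch)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)            # (state, action) -> estimated value
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def update(self, s, a, r, s_next):
        # Q(s,a) <- (1 - alpha) Q(s,a) + alpha [ r + gamma * max_a' Q(s',a') ]
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] = (1 - self.alpha) * self.q[(s, a)] + \
                         self.alpha * (r + self.gamma * best_next)

    def act(self, s):
        # Epsilon-greedy: explore occasionally, otherwise take the greedy action
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])
```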

9 Sequence Prediction
Markov model - the probability of the opponent's action a^{opp}_t depends only on the current state s_t: \Pr(a^{opp}_t \mid s_t)
Sequence prediction - the probability of the opponent's action depends on a history H: \Pr(a^{opp}_t \mid H), where H \subseteq (s_t, a_{t-1}, s_{t-1}, a_{t-2}, s_{t-2}, \ldots, a_1, s_1)

10 Sequence Prediction Methods
Long-term memory L - a set of distributions, each one conditioned on a different history H:
L = \{ \Pr(a^{opp}_t \mid H) : H \subseteq (s_t, a_{t-1}, s_{t-1}, a_{t-2}, s_{t-2}, \ldots, a_1, s_1) \}
Short-term memory S - a list of recent observations (states/actions)
Observing a symbol o_t:
1. Generate a set of histories H = {H_1, H_2, ...} using S
2. For each H in H, create/update \Pr(a^{opp}_t \mid H) using o_t
3. Add o_t to S (remove the oldest observation if needed)
Predicting an opponent action a^{opp}_t (a sketch follows below):
1. Generate a set of histories H = {H_1, H_2, ...} using S
2. Predict using \{ \Pr(a^{opp}_t \mid H) : H \in H \}
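A minimal sketch of the observe/predict loop above, assuming the candidate histories are the suffixes of the short-term memory and that predictions are combined by summing counts; these choices and the names are illustrative, not the specific scheme used in the slides.

```python
from collections import defaultdict, deque, Counter

class SequencePredictor:
    """Short-term memory generates candidate histories; long-term memory keeps
    one empirical action distribution per history (illustrative sketch)."""

    def __init__(self, memory_size=4):
        self.short_term = deque(maxlen=memory_size)    # recent observations
        self.long_term = defaultdict(Counter)          # history -> action counts

    def _histories(self):
        # Assumption: candidate histories are all suffixes of the short-term memory
        s = tuple(self.short_term)
        return [s[i:] for i in range(len(s) + 1)]

    def observe(self, opponent_action):
        # Update every distribution conditioned on a current history,
        # then push the new observation into short-term memory
        for h in self._histories():
            self.long_term[h][opponent_action] += 1
        self.short_term.append(opponent_action)

    def predict(self):
        # Combine the distributions for the current histories by summing counts
        combined = Counter()
        for h in self._histories():
            combined.update(self.long_term[h])
        return combined.most_common(1)[0][0] if combined else None
```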

11 Sequence Prediction Method Example
Entropy Learned Pruned Hypothesis Space [12]:
Inputs: memory size n and entropy threshold e
Observing a symbol o_t:
1. Generate the powerset P(S) = H of the short-term memory S
   S = (o_{t-1}, o_{t-2}, \ldots, o_{t-n})
   P(S) = \{ \{\}, \{o_{t-1}\}, \ldots, \{o_{t-n}\}, \{o_{t-1}, o_{t-2}\}, \ldots, \{o_{t-1}, o_{t-2}, \ldots, o_{t-n}\} \}
2. For each H in H, create/update \Pr(a^{opp}_t \mid H) using o_t
3. For each H in H, if Entropy(\Pr(a^{opp}_t \mid H)) > e then discard it
4. Add o_t to S (remove the oldest observation if |S| > n)
Predicting an opponent action a^{opp}_t (a sketch follows below):
1. Generate the powerset P(S) = H of the short-term memory S
2. Predict using \Pr(a^{opp}_t \mid H^*) where H^* = \arg\min_{H \in H} Entropy(\Pr(a^{opp}_t \mid H))
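A minimal sketch of ELPH as described above, assuming the opponent's actions are the observed symbols; the memory size, entropy threshold and tie-breaking are illustrative assumptions.

```python
import math
from itertools import combinations
from collections import defaultdict, deque, Counter

class ELPH:
    """Entropy Learned Pruned Hypothesis Space (illustrative sketch)."""

    def __init__(self, memory_size=3, entropy_threshold=1.0):
        self.e = entropy_threshold
        self.short_term = deque(maxlen=memory_size)
        self.hypotheses = defaultdict(Counter)    # subset of memory -> action counts

    def _powerset(self):
        s = list(self.short_term)
        return {frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)}

    @staticmethod
    def _entropy(counts):
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def observe(self, o_t):
        for h in self._powerset():
            self.hypotheses[h][o_t] += 1
            # Prune hypotheses whose predictions have become too uncertain
            if self._entropy(self.hypotheses[h]) > self.e:
                del self.hypotheses[h]
        self.short_term.append(o_t)

    def predict(self):
        # Use the lowest-entropy (most reliable) hypothesis among current histories
        candidates = [h for h in self._powerset() if h in self.hypotheses]
        if not candidates:
            return None
        best = min(candidates, key=lambda h: self._entropy(self.hypotheses[h]))
        return self.hypotheses[best].most_common(1)[0][0]
```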

12 Lookahead Example
Prisoner's Dilemma payoff matrix (row player's payoff listed first):
      D    C
D    1,1  4,0
C    0,4  3,3

13 Lookahead Example
Defect is the dominant action (highest reward)
Cooperate-Cooperate is socially optimal (highest sum of rewards)
Tit-for-tat (copy opponent's last move) is good for repeated play
Can we learn to play optimally against tit-for-tat?

14 Lookahead Example
[Depth-1 lookahead tree: if the opponent is predicted to play C, playing D yields 4 and C yields 3; if the opponent is predicted to play D, playing D yields 1 and C yields 0. Payoff matrix as on slide 12.]

15 Lookahead Example
[Same depth-1 lookahead tree and payoff matrix as slide 14.]
With lookahead 1, D has the highest reward
With lookahead 2, (D, C, D, C) has the highest total reward (but is unlikely)
Assume the opponent copies the player's last move (i.e. tit-for-tat) - a sketch follows below
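A minimal sketch of exhaustive lookahead against an opponent model in the iterated Prisoner's Dilemma, assuming the payoffs from the matrix above and a tit-for-tat opponent model; the function names, starting condition and depths are illustrative.

```python
from itertools import product

# Row player's payoff: PAYOFF[(my_action, opponent_action)]
PAYOFF = {('D', 'D'): 1, ('D', 'C'): 4, ('C', 'D'): 0, ('C', 'C'): 3}

def tit_for_tat(my_last_action):
    """Opponent model: the opponent repeats our previous action."""
    return my_last_action

def best_sequence(depth, my_last_action='C'):
    """Exhaustively evaluate every action sequence of the given depth against
    the opponent model and return the one with the highest total reward."""
    best, best_total = None, float('-inf')
    for seq in product('DC', repeat=depth):
        total, last = 0, my_last_action
        for my_action in seq:
            opp_action = tit_for_tat(last)          # predicted opponent action
            total += PAYOFF[(my_action, opp_action)]
            last = my_action
        if total > best_total:
            best, best_total = seq, total
    return best, best_total

# Depth 1 picks D (4 > 3), but depth 2 picks a sequence starting with C, because
# defecting now costs us the opponent's cooperation on the next move
print(best_sequence(1))    # (('D',), 4)
print(best_sequence(2))    # (('C', 'D'), 7)
```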

16 Lookahead Example
[Depth-2 lookahead tree: after each of our actions the opponent's next action is predicted, expanding into the action sequences (D, D), (D, C), (C, D) and (C, C) with their cumulative rewards. Payoff matrix as on slide 12.]

17 Lookahead Example
[Same depth-2 lookahead tree and payoff matrix as slide 16.]
With a lookahead of 2 against tit-for-tat, starting with C gives the highest total reward

18 Results of First Approach
Converges to higher average payoffs per game at faster rates than reinforcement learning algorithms:
- Iterated rock-paper-scissors: learns to best-respond against variable Markov models
- Iterated prisoner's dilemma: comes first in a tournament against finite automata
- Littman's soccer: wins 7% of games against reinforcement learning algorithms

19 Summary of First Approach
We associate a sequence predictor with each game state
During a game we update our:
- Rewards for action sequences using Q-Learning
- Sequence predictors with observed opponent actions
At each decision point we look ahead and take the first action of an action sequence with the maximum expected cumulative reward

20 Second Approach
1. Sequence prediction to learn the opponent's strategy
2. Online expectation maximisation [13, 14] to predict the opponent's hidden information (so we know which history H to use when updating our opponent model)
3. No-regret learning algorithm to adjust our strategy
4. Simulate games against our opponent model
Improves no-regret algorithm performance vs itself, a state-of-the-art reinforcement learning agent and a popular bandit algorithm in:
- Die-Roll Poker [15]
- Rhode Island Hold'em [16]

21 Online Expectation Maximisation
A rational agent will act based on its hidden information
At the end of a game, we have observed the opponent's (public) actions but not necessarily their hidden information (e.g. they folded)
Expectation step:
1. For each possible instance of hidden information the opponent could hold, calculate the probability of their actions
2. Normalise these probabilities
Each normalised probability corresponds to the expected number of opponent visits to the path associated with that hidden information
Maximisation step: update the opponent's action probabilities along each path to account for their expected number of visits (a sketch follows below)
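A minimal sketch of one expectation/maximisation update for a single observed game, assuming the opponent model stores per-hand action counts and that the possible hands are equally likely; the data structures and function names are illustrative assumptions, not the slides' implementation.

```python
import math
from collections import defaultdict

# Opponent model: counts[hidden_hand][action] -> fractional visit count
counts = defaultdict(lambda: defaultdict(float))

def action_prob(hand, action):
    """Current estimate of Pr(action | hand) from the (possibly fractional) counts."""
    total = sum(counts[hand].values())
    return counts[hand][action] / total if total > 0 else 0.5   # default before any data

def em_update(possible_hands, observed_actions):
    """One online EM step after a game where we saw the opponent's public actions
    but not their hidden hand."""
    # E-step: likelihood of the observed action sequence under each possible hand
    likelihoods = {hand: math.prod(action_prob(hand, a) for a in observed_actions)
                   for hand in possible_hands}
    total = sum(likelihoods.values())
    # Normalise: each posterior is the expected number of visits to that hand's path
    posteriors = {h: (l / total if total > 0 else 1.0 / len(likelihoods))
                  for h, l in likelihoods.items()}
    # M-step: credit the actions along each path with its expected number of visits
    for hand, weight in posteriors.items():
        for a in observed_actions:
            counts[hand][a] += weight

# Example: the opponent held a Jack or a King and we observed them Raise then Fold
em_update(possible_hands=['J', 'K'], observed_actions=['R', 'F'])
```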

22 Online Expectation Maximisation
[Game tree for a simple two-player poker game: chance deals each player a Jack or a King with probability 0.5, the players act at information sets I1-I8 with estimated Fold/Call/Raise probabilities, and the leaves show the payoffs.]
J = Jack, K = King, F = Fold, C = Call, R = Raise

23 Online Expectation Maximisation
[Same game tree as slide 22.]
Assume we are P1, we got a Jack, and opponent P2 got a Jack or a King

24 Online Expectation Maximisation
[Same game tree as slide 22.]
Compute \Pr((J, J, R, F) \mid \sigma_i) and \Pr((J, K, R, F) \mid \sigma_i), normalise them, and update the visits to (J, J, R, F) and (J, K, R, F) by the normalised values

25 No-Regret Learning
Our no-regret method is based on counterfactual regret minimisation (a sketch of its regret-matching update follows below)
State-of-the-art algorithm that provably minimises regret in two-player, zero-sum, imperfect information games [17]
In self-play its average strategy profile approaches a Nash equilibrium
Can handle games with 10^12 states (10^10 states was the previous limit using Nesterov's excessive gap technique; limit poker has around 10^18 states)
Needs the opponent's strategy; we use an online version that removes this requirement
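Counterfactual regret minimisation turns accumulated regrets into a strategy with regret matching at each information set; below is a minimal sketch of that regret-matching update, with a simple rock-paper-scissors driver against a fixed opponent as an illustrative assumption rather than the full CFR algorithm.

```python
import random

def regret_matching(cumulative_regret):
    """Return a strategy proportional to positive cumulative regret
    (uniform if no action has positive regret)."""
    positives = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positives)
    n = len(cumulative_regret)
    return [p / total for p in positives] if total > 0 else [1.0 / n] * n

# Illustrative driver: regret matching against a fixed rock-paper-scissors opponent
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]   # rows: our R/P/S, cols: opponent's
regret = [0.0, 0.0, 0.0]
opponent = [0.4, 0.4, 0.2]                       # fixed (exploitable) opponent strategy

for _ in range(10000):
    strategy = regret_matching(regret)
    our_action = random.choices(range(3), weights=strategy)[0]
    opp_action = random.choices(range(3), weights=opponent)[0]
    reward = PAYOFF[our_action][opp_action]
    # Regret of not having played each alternative action this round
    for a in range(3):
        regret[a] += PAYOFF[a][opp_action] - reward

print(regret_matching(regret))   # concentrates on the best response (paper)
```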

26 Results of Second Approach
Has higher average payoffs per game and a higher final performance than the no-regret algorithm on its own:
- Die-roll poker and Rhode Island hold'em: learns to win against all opponents (except near Nash, where it draws)
But online expectation maximisation seems less effective in Rhode Island hold'em than in die-roll poker - investigating why

27 Summary of Second Approach
We associate a sequence predictor with each game state from the opponent's perspective (opponent information set)
At the end of a game we:
- Predict the opponent's hidden information by online expectation maximisation
- Update the sequence predictors along the path associated with the predicted hidden information and public actions
- Update our strategy with the reward from the actual game as well as the rewards from a number of simulated games

28 Summary
Maximise our rewards when an opponent's actions can affect them
Use games to focus on the agent and worry less about the environment
Approaches:
1. Reinforcement learning + sequence prediction + lookahead
2. Sequence prediction + online EM + no-regret + simulation

29 References I
[1] Backgammon Programming. Accessed 2013.
[2] Chinook vs. the Checkers Champ - Top Man-vs.-Machine Moments - TIME. Accessed 2013.
[3] Scrabble Showdown: Quackle vs. David Boys - Top Man-vs.-Machine Moments - TIME. Accessed 2013.
[4] IBM - Deep Blue. Accessed 2013.
[5] Othello match of the year. Accessed 2013.
[6] CrazyStone at Sensei's Library. Accessed 2013.
[7] Man vs Machine II - Polaris vs Online Poker's Best. Accessed 2013.
[8] IBM Watson. Accessed 2013.
[9] Richard Mealing and Jonathan L. Shapiro. Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games. In: 12th International Conference on Artificial Intelligence and Soft Computing. 2013.
[10] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In: Proc. of the 11th ICML. Morgan Kaufmann, 1994.
[11] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis. Cambridge, 1989.
[12] Jensen et al. Non-stationary policy learning in 2-player zero sum games. In: Proc. of the 20th National Conf. on AI (AAAI). 2005.

30 References II
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. In: Journal of the Royal Statistical Society, Series B 39 (1977), pp. 1-38.
[14] Olivier Cappé and Eric Moulines. Online EM Algorithm for Latent Data Models. In: Journal of the Royal Statistical Society, Series B 71 (2008).
[15] Marc Lanctot et al. No-Regret Learning in Extensive-Form Games with Imperfect Recall. In: Proceedings of the 29th International Conference on Machine Learning (ICML 2012). 2012.
[16] Jiefu Shi and Michael L. Littman. Abstraction Methods for Game Theoretic Poker. In: Revised Papers from the Second International Conference on Computers and Games. 2000.
[17] Martin Zinkevich et al. Regret Minimization in Games with Incomplete Information. In: Advances in Neural Information Processing Systems 20. 2007.
[18] G. W. Brown. Iterative Solutions of Games by Fictitious Play. In: Activity Analysis of Production and Allocation. Ed. by T. C. Koopmans. New York: Wiley, 1951.
[19] Carmel and Markovitch. Learning Models of Intelligent Agents. In: Proc. of the 13th National Conf. on AI (AAAI). 1996.
[20] John M. Butterworth. Stability of gradient-based learning dynamics in two-agent imperfect-information games. PhD thesis. The University of Manchester.
[21] Knoll and de Freitas. A Machine Learning Perspective on Predictive Coding with PAQ. arXiv preprint.

31 Appendix: Future Work
- Change detection methods to discard outdated observations
- Use the opponent model more when it is more accurate
- More challenging domains, e.g. n-player, continuous values
- Real-world applications, e.g. peer-to-peer file sharing
- Use implicit as well as explicit opponent modelling

32 Appendix: Potential Applications
- Learning conditional and adaptive strategies
- Adapting to user interaction
- Adjusting the workload or relocating the system resources
- Responding to network traffic (p2p, spam filtering, virus detection)
Overlapping areas: speech recognition/synthesis/tagging, musical score, machine translation, gene prediction, DNA/protein sequence classification/identification, bioinformatics, handwriting, gesture recognition, partial discharges, cryptanalysis, protein folding, metamorphic virus detection, statistical process control, robotic teams, distributed control, resource management, collaborative decision support systems, economics, industrial manufacturing, complex simulations, combinatorial search, etc.

33 Appendix: What has been tried before?
- Fictitious play assumes a Markov model opponent strategy [18]
- Unsupervised L* infers deterministic finite automata models [19]
- ELPH defeated human and agent players in rock-paper-scissors [12]
- Stochastic gradient ascent with the lagging anchor algorithm [20]
- PAQ8L defeated human players in rock-paper-scissors [21]

34 Appendix: Counterfactual Regret Minimisation
Counterfactual value:

v_i(I \mid \sigma) = \sum_{n \in I} \Pr(n \mid \sigma_{-i}) \, u_i(n)

u_i(n) = \frac{1}{\Pr(n \mid \sigma)} \sum_{z \in Z[n]} \Pr(z \mid \sigma) \, u_i(z)

where
v_i(I \mid \sigma) = player i's counterfactual value of information set I given strategy profile \sigma
\Pr(n \mid \sigma_{-i}) = probability of reaching node n from the root given the opponent's strategy
u_i(n) = player i's expected reward at node n
\Pr(n \mid \sigma) = probability of reaching node n from the root given all players' strategies
Z[n] = set of terminal nodes that can be reached from node n
u_i(z) = player i's reward at terminal node z

35 Appendix: Counterfactual Regret Minimisation
Counterfactual regret:

r_i(I, a) = v_i(I \mid \sigma_{I \to a}) - v_i(I \mid \sigma)

where
r_i(I, a) = player i's counterfactual regret of not playing action a at information set I
\sigma_{I \to a} = same as \sigma except a is always played at I
v_i(I \mid \sigma_{I \to a}) = player i's counterfactual value of playing action a at information set I
v_i(I \mid \sigma) = player i's counterfactual value of playing their strategy at information set I

36 Appendix: Counterfactual Regret Minimisation
Sampled counterfactual value:

\tilde{v}_i(I \mid \sigma, Q_j) = \sum_{n \in I} \Pr(n \mid \sigma_{-i}) \, \tilde{u}_i(n \mid Q_j)

\tilde{u}_i(n \mid Q_j) = \frac{1}{\Pr(n \mid \sigma)} \sum_{z \in Q_j \cap Z[n]} \frac{\Pr(z \mid \sigma) \, u_i(z)}{q(z)}

q(z) = \sum_{j : z \in Q_j} q_j

where
\tilde{v}_i(I \mid \sigma, Q_j) = player i's sampled counterfactual value of I given strategy profile \sigma and Q_j
Q_j = set of sampled terminal nodes
\Pr(n \mid \sigma_{-i}) = probability of reaching node n from the root given the opponent's strategy
\tilde{u}_i(n \mid Q_j) = player i's sampled expected reward at node n given Q_j
\Pr(n \mid \sigma) = probability of reaching node n from the root given all players' strategies
Z[n] = set of terminal nodes that can be reached from node n
u_i(z) = player i's reward at terminal node z
q_j = probability of sampling Q_j

37 Appendix: Counterfactual Regret Minimisation
Outcome sampling (\lvert Q_j \rvert = 1, q_j = q(z), and terminal node z sampled with profile \sigma'; only the node n \in I on the sampled path contributes):

\tilde{v}_i(I \mid \sigma, Q_j)
= \sum_{n \in I} \frac{\Pr(n \mid \sigma_{-i})}{\Pr(n \mid \sigma)} \sum_{z \in Q_j \cap Z[n]} \frac{\Pr(z \mid \sigma) \, u_i(z)}{q(z)}
= \frac{\Pr(n \mid \sigma_{-i}) \Pr(z \mid \sigma) \, u_i(z)}{\Pr(n \mid \sigma) \Pr(z \mid \sigma')}
= \frac{\Pr(n \mid \sigma_{-i}) \Pr(z \mid \sigma_i) \Pr(z \mid \sigma_{-i}) \, u_i(z)}{\Pr(n \mid \sigma_i) \Pr(n \mid \sigma_{-i}) \Pr(z \mid \sigma'_i) \Pr(z \mid \sigma'_{-i})}
= \frac{\Pr(z \mid \sigma_i) \Pr(z \mid \sigma_{-i}) \, u_i(z)}{\Pr(n \mid \sigma_i) \Pr(z \mid \sigma'_i) \Pr(z \mid \sigma'_{-i})}
= \frac{\Pr(z \mid \sigma_i) \, u_i(z)}{\Pr(n \mid \sigma_i) \Pr(z \mid \sigma'_i)}
assuming \sigma'_{-i} = \sigma_{-i}

38 Appendix: Zero-Determinant Strategies
- Unilaterally set an opponent's expected payoff in the iterated prisoner's dilemma irrespective of the opponent's strategy
- Turns the prisoner's dilemma into an ultimatum game
- Works well against evolutionary players without an opponent model
- An opponent model could recognise the unfair offer and refuse
