Solving Zero-Sum Extensive-Form Games. Branislav Bošanský AE4M36MAS, Fall 2013, Lecture 6



Imperfect Information EFGs. Components: states, players 1 and 2, information sets, actions, utilities.

Solving II Zero-Sum EFGs with perfect recall.
Exact algorithms: transformation to the normal form; using the sequence form; (iterative improvements of the sequence form).
Approximate algorithms: Counterfactual Regret Minimization; Excessive Gap Technique; (variants of Monte-Carlo Tree Search).

Imperfect Information Zero-Sum EFG.
[Example game tree: Max (Circle) chooses A or B at the root; Min (Triangle) responds with X or Y after A and with Z or W after B; Max then chooses between C and D (after A) or between E and F (after B) without observing Min's move; the eight leaf utilities for Max are 3, -2, 1, 3, 2, 1, 0, 3.]

Imperfect Information Zero-Sum EFG. Induced normal form of the example (rows: Max's pure strategies, columns: Min's pure strategies, entries: Max's utility):

           XZ   XW   YZ   YW
    ACE     3    3    1    1
    ACF     3    3    1    1
    ADE    -2   -2    3    3
    ADF    -2   -2    3    3
    BCE     2    0    2    0
    BCF     1    3    1    3
    BDE     2    0    2    0
    BDF     1    3    1    3

II EFGs - Sequences. Sequences are an alternative representation of strategies: a sequence of player i lists player i's own actions along a path from the root. In the example, the Circle player has sequences Σ_1 = {∅, A, B, AC, AD, BE, BF} and the Triangle player has sequences Σ_2 = {∅, X, Y, Z, W}. For σ_i ∈ Σ_i we use σ_i·a to denote executing an action a after the sequence σ_i.

II EFGs - Sequences. Extension of the utility function to sequences, g_i : Σ_1 × Σ_2 → R: sequentially execute the actions of the players and stop as soon as either a leaf z ∈ Z is reached, in which case g_i(σ_1, σ_2) = u_i(z), or there is no applicable action, in which case g_i(σ_1, σ_2) = 0.

II EFGs - Sequences. Examples in the game above: g_1(A, W) = 0, g_1(AC, W) = 0, g_1(BF, W) = 3.
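To make the definition concrete, here is a minimal Python sketch of g_1 on the example tree (assuming the reconstruction above, with root actions A and B); the tree encoding and the function name are illustrative, not part of the lecture.

    # Explicit tree: ("P1"/"P2", {action: child}) or ("leaf", utility_for_player_1).
    TREE = ("P1", {
        "A": ("P2", {"X": ("P1", {"C": ("leaf", 3), "D": ("leaf", -2)}),
                     "Y": ("P1", {"C": ("leaf", 1), "D": ("leaf", 3)})}),
        "B": ("P2", {"Z": ("P1", {"E": ("leaf", 2), "F": ("leaf", 1)}),
                     "W": ("P1", {"E": ("leaf", 0), "F": ("leaf", 3)})}),
    })

    def g1(seq1, seq2):
        # Walk the tree, consuming actions from the acting player's sequence.
        node, remaining = TREE, {"P1": list(seq1), "P2": list(seq2)}
        while node[0] != "leaf":
            player, children = node
            if not remaining[player] or remaining[player][0] not in children:
                return 0                      # no applicable action -> utility 0
            node = children[remaining[player].pop(0)]
        return node[1]                        # leaf reached -> u_1(z)

    print(g1("A", "W"), g1("AC", "W"), g1("BF", "W"))  # 0 0 3, matching the examples above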

II EFGs Realization Plans. Behavioral strategies are represented as realization plans: probabilities assigned to sequences. Assuming the opponent allows us to play the actions of the sequence, the realization plan r_i(σ_i) is the probability that the sequence σ_i will be played under this strategy.

II EFGs Realization Plans. For the example game:

    r_1(∅) = 1
    r_1(A) + r_1(B) = r_1(∅)
    r_1(AC) + r_1(AD) = r_1(A)
    r_1(BE) + r_1(BF) = r_1(B)
    r_2(∅) = 1
    r_2(X) + r_2(Y) = r_2(∅)
    r_2(Z) + r_2(W) = r_2(∅)

From the network-flow perspective, the realization probability of a sequence is split among its extensions at each information set.
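A small numeric sketch of this correspondence, assuming the example tree reconstructed above (root actions A and B): the realization probability of a sequence is the product of the behavioral probabilities of its own actions, and the resulting plan satisfies the flow constraints. The variable names are illustrative only.

    # Behavioral strategy of the Circle player, one distribution per information set.
    behavioral = {
        "A": 0.6, "B": 0.4,   # root information set
        "C": 0.7, "D": 0.3,   # information set reached after A (following X or Y)
        "E": 0.5, "F": 0.5,   # information set reached after B (following Z or W)
    }

    # Realization plan: product of the behavioral probabilities along the sequence.
    r1 = {
        "": 1.0,
        "A": behavioral["A"],
        "B": behavioral["B"],
        "AC": behavioral["A"] * behavioral["C"],
        "AD": behavioral["A"] * behavioral["D"],
        "BE": behavioral["B"] * behavioral["E"],
        "BF": behavioral["B"] * behavioral["F"],
    }

    # Network-flow constraints from the slide.
    assert abs(r1["A"] + r1["B"] - r1[""]) < 1e-9
    assert abs(r1["AC"] + r1["AD"] - r1["A"]) < 1e-9
    assert abs(r1["BE"] + r1["BF"] - r1["B"]) < 1e-9
    print(r1)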

II EFGs Sequence Form LP. To calculate a NE we optimize against a best response of the opponent. Notation: I(σ) is the information set in which the last action of σ was executed, seq(I) is the sequence leading to the information set I, and v_I is the expected utility in the information set I. The LP computing player 1's equilibrium realization plan is:

    max v_{I(∅)}
    s.t.  r_1(∅) = 1
          ∀ I_{1,k} ∈ I_1, σ = seq(I_{1,k}):  r_1(σ) = Σ_{a ∈ A(I_{1,k})} r_1(σ·a)
          ∀ σ_2 ∈ Σ_2:  v_{I(σ_2)} ≤ Σ_{I_{2,j} : seq(I_{2,j}) = σ_2} v_{I_{2,j}} + Σ_{σ_1 ∈ Σ_1} g_1(σ_1, σ_2) · r_1(σ_1)
          r_1(σ_1) ≥ 0
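As a sanity check, the LP can be instantiated for the small example game and handed to an off-the-shelf solver. The sketch below uses scipy.optimize.linprog and assumes the reconstruction above (Circle's sequences ∅, A, B, AC, AD, BE, BF; Triangle's sequences ∅, X, Y, Z, W); the variable ordering is an arbitrary choice. In practice the matrices are generated directly from the game tree; the point of the sequence form is that the number of variables grows with the number of sequences rather than with the exponentially many pure strategies of the induced normal form.

    import numpy as np
    from scipy.optimize import linprog

    # Variable order: x = [r(∅), r(A), r(B), r(AC), r(AD), r(BE), r(BF), v0, v_XY, v_ZW]
    c = np.zeros(10)
    c[7] = -1.0                                 # maximize v0  ->  minimize -v0

    # Realization-plan (network-flow) equalities.
    A_eq = np.array([
        [ 1,  0,  0, 0, 0, 0, 0, 0, 0, 0],      # r(∅) = 1
        [-1,  1,  1, 0, 0, 0, 0, 0, 0, 0],      # r(A) + r(B) = r(∅)
        [ 0, -1,  0, 1, 1, 0, 0, 0, 0, 0],      # r(AC) + r(AD) = r(A)
        [ 0,  0, -1, 0, 0, 1, 1, 0, 0, 0],      # r(BE) + r(BF) = r(B)
    ])
    b_eq = np.array([1.0, 0.0, 0.0, 0.0])

    # Best-response inequalities, one per Triangle sequence (∅, X, Y, Z, W).
    A_ub = np.array([
        [0, 0, 0,  0,  0,  0,  0, 1, -1, -1],   # v0   <= v_XY + v_ZW
        [0, 0, 0, -3,  2,  0,  0, 0,  1,  0],   # v_XY <= 3 r(AC) - 2 r(AD)   (Triangle plays X)
        [0, 0, 0, -1, -3,  0,  0, 0,  1,  0],   # v_XY <= 1 r(AC) + 3 r(AD)   (Triangle plays Y)
        [0, 0, 0,  0,  0, -2, -1, 0,  0,  1],   # v_ZW <= 2 r(BE) + 1 r(BF)   (Triangle plays Z)
        [0, 0, 0,  0,  0,  0, -3, 0,  0,  1],   # v_ZW <= 0 r(BE) + 3 r(BF)   (Triangle plays W)
    ])
    b_ub = np.zeros(5)

    bounds = [(0, 1)] * 7 + [(None, None)] * 3  # realization probs in [0, 1], values free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    print("game value:", -res.fun)                                   # 11/7 ≈ 1.571 here
    print("Circle's realization plan:", np.round(res.x[:7], 3))      # ≈ [1, 1, 0, 0.714, 0.286, 0, 0]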

Alternative algorithms - learning. Instead of calculating the exact NE, we can learn the best strategy to play: repeatedly play the game and adjust the strategy according to the observations; hopefully, the strategies converge to an equilibrium. The simplest learning rule in games is fictitious play: assume the opponent is playing its (NE) strategy and play the best response against it; we can do that at each iteration.

Fictitious play. Example matrix game (row player maximizes):

         L    R
    U    3    1
    D   -2    3

1. Assume some a-priori strategy of the opponent (e.g., uniform).
2. Calculate a pure best-response strategy.
3. Play this strategy and observe the action of the opponent.
4. Update the belief about the mixed strategy of the opponent.
5. Repeat from step 2.
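A minimal fictitious-play sketch for this matrix game, following the five steps above with both players updating simultaneously; the uniform prior is encoded as initial counts of one, and the variable names are illustrative.

    import numpy as np

    # Row player's payoffs (zero-sum: the column player receives the negative).
    A = np.array([[ 3.0, 1.0],    # U vs (L, R)
                  [-2.0, 3.0]])   # D vs (L, R)

    row_counts = np.array([1.0, 1.0])  # belief about the row player (uniform prior)
    col_counts = np.array([1.0, 1.0])  # belief about the column player (uniform prior)

    for t in range(10000):
        # Pure best response of the row player against the column player's empirical mixture.
        col_belief = col_counts / col_counts.sum()
        row_action = int(np.argmax(A @ col_belief))
        # Pure best response of the column player (minimizer) against the row player's mixture.
        row_belief = row_counts / row_counts.sum()
        col_action = int(np.argmin(row_belief @ A))
        # Both players observe the opponent's action and update their beliefs.
        row_counts[row_action] += 1
        col_counts[col_action] += 1

    print("empirical row strategy:", row_counts / row_counts.sum())  # approaches (5/7, 2/7)
    print("empirical col strategy:", col_counts / col_counts.sum())  # approaches (2/7, 5/7)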

Fictitious play and learning. For many game models (e.g., zero-sum games) FP converges to a NE, but the convergence is rather slow. There are different variants; here we are interested in no-regret learning algorithms.

Reminder - regret. Player i's regret for playing an action a_i when the other players adopt the action profile a_{-i} is the difference between the best utility i could have achieved against a_{-i} and the utility of playing a_i: max_{a'_i} u_i(a'_i, a_{-i}) - u_i(a_i, a_{-i}). We can use the concept of regret to learn the optimal strategy: to find a strategy that would not yield less than playing any pure strategy.

No-regret Learning: Regret Matching. Let α^t be the average reward over iterations 1..t, and let α^t(s) be the average reward over iterations 1..t had the player played s in every iteration while the opponents played as they actually did. The regret the agent experiences at time t for not having played s is R^t(s) = α^t(s) - α^t. A learning rule is no-regret if it guarantees that, with high probability, the agent will experience no positive regret. Regret matching: start with an arbitrary strategy (e.g., uniform); at each time step choose each action with probability proportional to its positive regret:

    p^{t+1}(s) = R^{t,+}(s) / Σ_{s' ∈ S} R^{t,+}(s'),   where R^{t,+}(s) = max(R^t(s), 0).
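A minimal regret-matching sketch on the matrix game used for fictitious play above: each player keeps cumulative regrets for its pure actions (proportional to the average regrets in the rule above, so the induced mixture is the same) and mixes proportionally to their positive parts; the time-averaged strategies approach the equilibrium of the zero-sum game. Names and the iteration count are illustrative.

    import numpy as np

    A = np.array([[ 3.0, 1.0],
                  [-2.0, 3.0]])   # row player's payoffs; the column player gets the negative

    rng = np.random.default_rng(0)
    row_regret, col_regret = np.zeros(2), np.zeros(2)
    row_strategy_sum, col_strategy_sum = np.zeros(2), np.zeros(2)

    def matching_strategy(regret):
        # Probabilities proportional to positive cumulative regrets (uniform if none is positive).
        pos = np.maximum(regret, 0.0)
        return pos / pos.sum() if pos.sum() > 0 else np.full(len(regret), 1.0 / len(regret))

    for t in range(50000):
        p, q = matching_strategy(row_regret), matching_strategy(col_regret)
        row_strategy_sum += p
        col_strategy_sum += q
        # Sample one joint action and compare the realized payoff with every pure alternative.
        i = rng.choice(2, p=p)
        j = rng.choice(2, p=q)
        row_regret += A[:, j] - A[i, j]   # row player maximizes A
        col_regret += A[i, j] - A[i, :]   # column player maximizes -A

    print("average row strategy:", row_strategy_sum / row_strategy_sum.sum())  # ≈ (5/7, 2/7)
    print("average col strategy:", col_strategy_sum / col_strategy_sum.sum())  # ≈ (2/7, 5/7)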

Application of no-regret learning in EFGs: Counterfactual Regret Minimization (CFR). Construct the complete game tree; in each iteration traverse the game tree and adapt the strategy in each information set according to the learning rule (regret matching). This rule minimizes the counterfactual regret in each information set, and it was proven that the algorithm thereby minimizes the overall regret in the game, so in zero-sum games the average strategies converge to a NE.
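The sketch below applies this idea to the small example tree from the earlier slides (again assuming the reconstruction with root actions A and B). It is a plain vanilla-CFR traversal with regret matching in every information set; the tree encoding, information-set labels, and iteration count are illustrative, not the lecture's code.

    from collections import defaultdict

    # Explicit game tree: ("P1"/"P2", info_set_label, {action: child}) or ("leaf", u1).
    GAME_TREE = ("P1", "root", {
        "A": ("P2", "after-A", {
            "X": ("P1", "CD", {"C": ("leaf", 3), "D": ("leaf", -2)}),
            "Y": ("P1", "CD", {"C": ("leaf", 1), "D": ("leaf", 3)}),
        }),
        "B": ("P2", "after-B", {
            "Z": ("P1", "EF", {"E": ("leaf", 2), "F": ("leaf", 1)}),
            "W": ("P1", "EF", {"E": ("leaf", 0), "F": ("leaf", 3)}),
        }),
    })

    regret_sum = defaultdict(lambda: defaultdict(float))
    strategy_sum = defaultdict(lambda: defaultdict(float))

    def current_strategy(info_set, actions):
        # Regret matching: probabilities proportional to positive cumulative regrets.
        pos = {a: max(regret_sum[info_set][a], 0.0) for a in actions}
        total = sum(pos.values())
        return {a: (pos[a] / total if total > 0 else 1.0 / len(actions)) for a in actions}

    def cfr(node, reach1, reach2):
        """One traversal; returns the expected utility of `node` for player 1."""
        if node[0] == "leaf":
            return node[1]
        player, info_set, children = node
        actions = list(children)
        strategy = current_strategy(info_set, actions)
        my_reach, opp_reach = (reach1, reach2) if player == "P1" else (reach2, reach1)
        action_values, node_value = {}, 0.0
        for a in actions:
            if player == "P1":
                action_values[a] = cfr(children[a], reach1 * strategy[a], reach2)
            else:
                action_values[a] = cfr(children[a], reach1, reach2 * strategy[a])
            node_value += strategy[a] * action_values[a]
        # Counterfactual regrets (player 2 minimizes u1) and average-strategy numerator.
        sign = 1.0 if player == "P1" else -1.0
        for a in actions:
            regret_sum[info_set][a] += opp_reach * sign * (action_values[a] - node_value)
            strategy_sum[info_set][a] += my_reach * strategy[a]
        return node_value

    for _ in range(100000):
        cfr(GAME_TREE, 1.0, 1.0)

    for info_set, sums in strategy_sum.items():
        total = sum(sums.values())
        print(info_set, {a: round(w / total, 3) for a, w in sums.items()})
    # The average strategy should put most of the root weight on A and approach
    # C with probability ~5/7 in the "CD" information set.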

Comparing the algorithms.
Sequence form: the leading exact algorithm; suffers from memory requirements; improved by an iterative double-oracle construction of the game tree.
CFR: more memory-efficient; many incremental variants (MC-CFR, …); the leading algorithm for solving Poker; can solve some games with imperfect recall.