Multiagent Value Iteration in Markov Games
Amy Greenwald, Brown University
with Michael Littman and Martin Zinkevich
Stony Brook Game Theory Festival, July 21, 2005

Agenda

Theorem: Value iteration converges to a stationary optimal policy in Markov decision processes.

Question: Does multiagent value iteration converge to a stationary equilibrium policy in Markov games?

Multiagent Q-Learning

Minimax-Q Learning [Littman 1994] provably converges to stationary minimax equilibrium policies in zero-sum Markov games.

Nash-Q Learning [Hu and Wellman 1998] and Correlated-Q Learning [G and Hall 2003] converge empirically to stationary equilibrium policies on a testbed of general-sum Markov games.

Multiagent Value Iteration and Cyclic Equilibria

Theory: Multiagent value iteration converges to cyclic equilibrium policies in Marty's game.

Experiments: Multiagent value iteration converges to cyclic equilibrium policies in
- Michael's game
- randomly generated Markov games
- Grid Game 1 [Hu and Wellman 1998]
- Shopbots and Pricebots [G and Kephart 1999]

Markov Decision Processes (MDPs)

Decision Process:
- S is a set of states
- A is a set of actions
- R : S × A → ℝ is a reward function
- P[s_{t+1} | s_t, a_t, ..., s_0, a_0] is a probabilistic transition function that describes transitions between states, conditioned on past states and actions

MDP = Decision Process + Markov Property:
P[s_{t+1} | s_t, a_t, ..., s_0, a_0] = P[s_{t+1} | s_t, a_t], for all t, s_0, ..., s_t ∈ S, a_0, ..., a_t ∈ A

Bellman's Equations

Q*(s, a) = R(s, a) + γ Σ_{s'} P[s' | s, a] V*(s')    (1)
V*(s) = max_{a ∈ A} Q*(s, a)    (2)

Value Iteration

VI(MDP, γ)
  Inputs: discount factor γ
  Output: optimal state-value function V*, optimal action-value function Q*
  Initialize V arbitrarily
  REPEAT
    for all s ∈ S
      for all a ∈ A
        Q(s, a) = R(s, a) + γ Σ_{s'} P[s' | s, a] V(s')
      V(s) = max_a Q(s, a)
  FOREVER
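The update above translates directly into code. Below is a minimal tabular sketch in Python (not from the talk): the array names R and P and the tolerance-based stopping rule are illustrative choices, since the slide's loop simply runs forever.

import numpy as np

def value_iteration(R, P, gamma, tol=1e-8):
    """Tabular value iteration.

    R : array (S, A), immediate rewards R(s, a)
    P : array (S, A, S), transition probabilities P[s' | s, a]
    Returns the optimal state values V and action values Q.
    """
    S, A = R.shape
    V = np.zeros(S)                      # arbitrary initialization
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P[s' | s, a] V(s'), eq. (1)
        Q = R + gamma * P @ V            # shape (S, A)
        V_new = Q.max(axis=1)            # V(s) = max_a Q(s, a), eq. (2)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new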

Markov Games

Stochastic Game:
- N is a set of players
- S is a set of states
- A_i is the ith player's set of actions
- R_i(s, a) is the ith player's reward at state s given action vector a
- P[s_{t+1} | s_t, a_t, ..., s_0, a_0] is a probabilistic transition function that describes transitions between states, conditioned on past states and actions

Markov Game = Stochastic Game + Markov Property:
P[s_{t+1} | s_t, a_t, ..., s_0, a_0] = P[s_{t+1} | s_t, a_t], for all t, s_0, ..., s_t ∈ S, a_0, ..., a_t ∈ A

Bellman's Analogue

Q*_i(s, a) = R_i(s, a) + γ Σ_{s'} P[s' | s, a] V*_i(s')    (3)
V*_i(s) = Σ_{a ∈ A} π*(s, a) Q*_i(s, a)    (4)

Foe-VI: π*(s) = (σ*_1, σ*_2), a minimax equilibrium policy [Shapley 1953, Littman 1994]
Friend-VI: π*(s) = e_{a*}, where a* ∈ argmax_{a ∈ A} Q*_i(s, a) [Littman 2001]
Nash-VI: π*(s) ∈ Nash(Q*_1(s), ..., Q*_n(s)) [Hu and Wellman 1998]
CE-VI: π*(s) ∈ CE(Q*_1(s), ..., Q*_n(s)) [G and Hall 2003]

Multiagent Value Iteration

MULTI-VI(MGame, γ, f)
  Inputs: discount factor γ, selection mechanism f
  Output: equilibrium state-value function V, equilibrium action-value function Q
  Initialize V arbitrarily
  REPEAT
    for all s ∈ S
      for all a ∈ A
        for all i ∈ N
          Q_i(s, a) = R_i(s, a) + γ Σ_{s'} P[s' | s, a] V_i(s')
      π(s) ← f(Q_1(s), ..., Q_n(s))
      for all i ∈ N
        V_i(s) = Σ_{a ∈ A} π(s, a) Q_i(s, a)
  FOREVER

Friend-or-Foe-VI always converges [Littman 2001].
Nash-VI and CE-VI converge to stationary equilibrium policies in zero-sum and common-interest Markov games [GZ and Hall 2005].
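A direct transcription of the MULTI-VI loop, again as an illustrative Python sketch rather than the authors' implementation. It assumes tabular rewards R[i, s, a] and transitions P[s, a, s'] indexed by a joint-action index a, and takes the selection mechanism f as a caller-supplied function mapping the players' Q-values at a state to a distribution over joint actions (a minimax solver for Foe-VI, a Nash or correlated-equilibrium solver for Nash-VI or CE-VI, and so on).

import numpy as np

def multi_vi(R, P, gamma, f, iters=1000):
    """Multiagent value iteration (MULTI-VI).

    R : array (N, S, A)    rewards R_i(s, a), with a a joint-action index
    P : array (S, A, S)    transition probabilities P[s' | s, a]
    f : selection mechanism; f(Q_s) -> distribution over joint actions,
        where Q_s has shape (N, A) (one row of Q-values per player)
    Returns V (N, S), Q (N, S, A), and pi (S, A).
    """
    N, S, A = R.shape
    V = np.zeros((N, S))                    # arbitrary initialization
    Q = np.zeros((N, S, A))
    pi = np.full((S, A), 1.0 / A)
    for _ in range(iters):                  # the slide's loop runs forever
        for i in range(N):
            Q[i] = R[i] + gamma * P @ V[i]  # Bellman analogue, eq. (3)
        for s in range(S):
            pi[s] = f(Q[:, s, :])           # equilibrium selection mechanism
            V[:, s] = Q[:, s, :] @ pi[s]    # eq. (4)
    return V, Q, pi

# Toy selection mechanism in the spirit of Friend-VI: put all probability
# on a joint action maximizing player 0's Q-values.
def friend_f(Q_s):
    dist = np.zeros(Q_s.shape[1])
    dist[np.argmax(Q_s[0])] = 1.0
    return dist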

Cyclic Correlated Equilibria

A cyclic policy ρ is a sequence of k < ∞ stationary policies.

V_i^{ρ,t}(s) = Σ_{a ∈ A} ρ_t(s, a) Q_i^{ρ,t}(s, a)    (5)
Q_i^{ρ,t}(s, a) = R_i(s, a) + γ Σ_{s' ∈ S} P[s' | s, a] V_i^{ρ, (t mod k) + 1}(s')    (6)

A cyclic policy of length k is a correlated equilibrium if, for all i ∈ N, s ∈ S, a_i, a_i' ∈ A_i, and t ∈ {1, ..., k},

Σ_{a_{-i} ∈ A_{-i}} ρ_t(s, a_{-i} | a_i) Q_i^{ρ,t}(s, a_i, a_{-i}) ≥ Σ_{a_{-i} ∈ A_{-i}} ρ_t(s, a_{-i} | a_i) Q_i^{ρ,t}(s, a_i', a_{-i})    (7)
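Condition (7) is mechanical to check once ρ and the per-phase Q-values are in hand. Here is a small sketch for the two-player case; the array shapes and names are assumptions for illustration, not part of the talk. Because both sides of (7) share the normalizer ρ_t(s, a_i), the check can compare unnormalized conditional sums.

import numpy as np

def is_cyclic_ce(rho, Q, tol=1e-9):
    """Check the correlated-equilibrium condition (7) for a cyclic policy.

    rho : array (k, S, A1, A2)     rho_t(s, a1, a2), a distribution per (t, s)
    Q   : array (2, k, S, A1, A2)  Q_i^{rho,t}(s, a1, a2)
    For every player, phase, state, and recommended action, no deviation
    may improve that player's conditional expected value.
    """
    k, S, A1, A2 = rho.shape
    for t in range(k):
        for s in range(S):
            # player 1: condition on its own recommended action a1
            for a1 in range(A1):
                w = rho[t, s, a1, :]                     # weights over opponent actions
                if w.sum() <= tol:
                    continue                             # a1 is never recommended
                follow = w @ Q[0, t, s, a1, :]
                best_dev = max(w @ Q[0, t, s, d, :] for d in range(A1))
                if best_dev > follow + tol:
                    return False
            # player 2: symmetric check, conditioning on a2
            for a2 in range(A2):
                w = rho[t, s, :, a2]
                if w.sum() <= tol:
                    continue
                follow = w @ Q[1, t, s, :, a2]
                best_dev = max(w @ Q[1, t, s, :, d] for d in range(A2))
                if best_dev > follow + tol:
                    return False
    return True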

Michael's Game: Best-Response Cycle

Observation: Michael's game has no stationary deterministic equilibrium policy when γ > 1/2.

Proof:
(A quits, B quits): A prefers send to quit (2γ > 1)
(A sends, B quits): B prefers send to quit (0 > -1)
(A sends, B sends): A prefers quit to send (1 > 0)
(A quits, B sends): B prefers quit to send (-1 > -2)

Michael's Game: Cyclic Policy

Observation: Michael's game has a deterministic cyclic equilibrium policy when γ = 2/3.

Example:
t  Policy               V(A)          V(B)
1  (A quits, B sends)   (1, -2)       (8/9, -4/9)
2  (A sends, B sends)   (4/3, -2/3)   (8/9, -4/9)
3  (A sends, B quits)   (4/3, -2/3)   (2, -1)
4  (A quits, B quits)   (1, -2)       (2, -1)

Michael's Game: Equilibrium Constraints

V^1_A(A) = Q^1_A(A, quit) = 1 > 16/27 = 0 + (2/3)(8/9) = 0 + γ V^2_A(B) = Q^1_A(A, send)
V^2_A(A) = Q^2_A(A, send) = 0 + γ V^3_A(B) = 0 + (2/3)(2) = 4/3 > 1 = Q^2_A(A, quit)
V^3_A(A) = Q^3_A(A, send) = 0 + γ V^4_A(B) = 0 + (2/3)(2) = 4/3 > 1 = Q^3_A(A, quit)
V^4_A(A) = Q^4_A(A, quit) = 1 > 16/27 = 0 + (2/3)(8/9) = 0 + γ V^1_A(B) = Q^4_A(A, send)

V^1_B(B) = Q^1_B(B, send) = 0 + γ V^2_B(A) = 0 + (2/3)(-2/3) = -4/9 > -1 = Q^1_B(B, quit)
V^2_B(B) = Q^2_B(B, send) = 0 + γ V^3_B(A) = 0 + (2/3)(-2/3) = -4/9 > -1 = Q^2_B(B, quit)
V^3_B(B) = Q^3_B(B, quit) = -1 > -4/3 = 0 + (2/3)(-2) = 0 + γ V^4_B(A) = Q^3_B(B, send)
V^4_B(B) = Q^4_B(B, quit) = -1 > -4/3 = 0 + (2/3)(-2) = 0 + γ V^1_B(A) = Q^4_B(B, send)
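These constraints can be reproduced by evaluating the four-phase cycle with equations (5) and (6). The sketch below fills in details the slides leave implicit, so treat it as an assumption-laden reconstruction: play is turn-taking (player A moves in state A, player B in state B), "send" yields reward 0 to both and passes the turn, "quit" ends the game, and the quit rewards are read off the constraints as (1, -2) in state A and (2, -1) in state B, with γ = 2/3.

from fractions import Fraction as F

gamma = F(2, 3)
# phase -> deterministic action in each state (the best-response cycle above)
cycle = [{"A": "quit", "B": "send"},
         {"A": "send", "B": "send"},
         {"A": "send", "B": "quit"},
         {"A": "quit", "B": "quit"}]
k = len(cycle)

# assumed rewards (to player A, to player B), read off the constraints above
R = {("A", "quit"): (F(1), F(-2)), ("A", "send"): (F(0), F(0)),
     ("B", "quit"): (F(2), F(-1)), ("B", "send"): (F(0), F(0))}
other = {"A": "B", "B": "A"}

def Q(t, s, a):
    """Eq. (6): Q^{rho,t}(s,a) = R(s,a) + gamma * V^{rho,t+1}(s')."""
    r = R[(s, a)]
    if a == "quit":                      # assumption: quitting terminates the game
        return r
    nxt = V((t + 1) % k, other[s])       # sending passes the turn to the other state
    return (r[0] + gamma * nxt[0], r[1] + gamma * nxt[1])

def V(t, s):
    """Eq. (5) for a deterministic phase policy: follow the prescribed action."""
    return Q(t, s, cycle[t][s])

for t in range(k):
    print(t + 1, V(t, "A"), V(t, "B"))   # phase 1 prints (1, -2) and (8/9, -4/9), etc.
    for s, i in (("A", 0), ("B", 1)):    # player A moves in state A, player B in state B
        played = cycle[t][s]
        alt = "send" if played == "quit" else "quit"
        assert Q(t, s, played)[i] >= Q(t, s, alt)[i]   # no profitable one-shot deviation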

Marty's Game: Rewards

Player A: R_A(A, keep) = 1, R_A(A, send) = 0, R_A(B, keep) = 3, R_A(B, send) = 0
Player B: R_B(A, keep) = 0, R_B(A, send) = 3, R_B(B, keep) = 1, R_B(B, send) = 0

[Diagram: the two-state game with states A and B, annotated with these rewards.]

Observation [ZGL 2005]: Marty's game has no stationary deterministic equilibrium policy when γ = 3/4.

Marty's Game: Q-Values and Values

Q_A(A, keep) = 4, Q_A(A, send) = 4, Q_A(B, keep) = 7, Q_A(B, send) = 3
Q_B(A, keep) = 4, Q_B(A, send) = 6, Q_B(B, keep) = 4, Q_B(B, send) = 4
V_A(A) = 4, V_A(B) = 16/3, V_B(A) = 16/3, V_B(B) = 4

Theorem [ZGL 2005]: Marty's game has a unique (probabilistic) stationary equilibrium policy.
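A quick consistency check of these numbers against the Bellman analogue, equations (3) and (4). The transition structure is not spelled out on the slide, so the sketch assumes "keep" stays in the current state and "send" moves to the other one, with player A moving in state A and player B in state B; under that reading, the listed V-values make the mover indifferent between its two actions at every state, which is why the unique stationary equilibrium is probabilistic.

from fractions import Fraction as F

gamma = F(3, 4)
# rewards (to player A, to player B) from the previous slide
R = {("A", "keep"): (F(1), F(0)), ("A", "send"): (F(0), F(3)),
     ("B", "keep"): (F(3), F(1)), ("B", "send"): (F(0), F(0))}
# stationary equilibrium state values listed above: V[player][state]
V = {"A": {"A": F(4), "B": F(16, 3)}, "B": {"A": F(16, 3), "B": F(4)}}
other = {"A": "B", "B": "A"}

def Q(player, state, action):
    """Eq. (3): Q_i(s,a) = R_i(s,a) + gamma * V_i(s'); keep stays, send switches (assumed)."""
    i = 0 if player == "A" else 1
    nxt = state if action == "keep" else other[state]
    return R[(state, action)][i] + gamma * V[player][nxt]

for player in "AB":
    for state in "AB":
        print(player, state, Q(player, state, "keep"), Q(player, state, "send"))

# the mover in each state must be indifferent between its two actions
assert Q("A", "A", "keep") == Q("A", "A", "send") == F(4)
assert Q("B", "B", "keep") == Q("B", "B", "send") == F(4)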

Marty's Games: Tweaked Rewards

Player A: R_A(A, keep) = 1, R_A(A, send) = x, R_A(B, keep) = 3 + x, R_A(B, send) = 0
Player B: R_B(A, keep) = 0, R_B(A, send) = 3, R_B(B, keep) = 1, R_B(B, send) = 0

Observation [ZGL 2005]: These games have no stationary deterministic equilibria for 1 < x < 7/4 and γ = 3/4.

Marty's Games: Q-Values and Tweaked Values

Q_A(A, keep) = 4, Q_A(A, send) = 4, Q_A(B, keep) = 7, Q_A(B, send) = 3
Q_B(A, keep) = 4, Q_B(A, send) = 6, Q_B(B, keep) = 4, Q_B(B, send) = 4
V_A(A) = 4, V_A(B) = 16/3 - (4/3)x, V_B(A) = 16/3, V_B(B) = 4

Theorem [ZGL 2005]: These games have unique (probabilistic) stationary equilibrium policies.

Negative Result

Theorem [ZGL 2005]: There exist infinitely many Marty's games with the same Q-values but different V-values and different stationary equilibrium policies.

Positive Result

Theorem [ZGL 2005]: In Marty's games, given any natural equilibrium selection mechanism, there exists some k > 1 such that multiagent value iteration converges to a cyclic equilibrium policy of length k.

Random Markov Games

N = 2, |A_i| ∈ {2, 3}, |S| ∈ {1, ..., 10}
Random rewards in [0, 99], random deterministic transitions, γ = 3/4

[Figure: two plots, "Simultaneous Move Game, 2 Actions" and "Simultaneous Move Game, 3 Actions", showing the percentage of games (0-100) against the number of states (1-10); legend entries: Cyclic, uCE.]

Multiagent Value Iteration in Markov Games

Summary of Observations

Multiagent value iteration converges to nonstationary, deterministic, cyclic equilibrium policies in Marty's and Michael's games.

Multiagent value iteration converges empirically to (not necessarily deterministic, not necessarily stationary) cyclic equilibrium policies in randomly generated deterministic Markov games.

Open Questions

Do deterministic cyclic equilibrium policies necessarily exist in turn-taking games? If so, does multiagent value iteration necessarily converge to deterministic cyclic equilibrium policies in turn-taking games?

Just as multiagent value iteration necessarily converges to stationary equilibrium policies in zero-sum Markov games, does multiagent value iteration necessarily converge to nonstationary cyclic equilibrium policies in general-sum Markov games?

The Answer is No!

Multiagent value iteration does not necessarily converge to stationary equilibrium policies in general-sum Markov games, regardless of the equilibrium selection mechanism.