arxiv: v1 [cs.gt] 15 Jan 2019


Model and algorithm for time-consistent risk-aware Markov games

Wenjie Huang, Pham Viet Hai and William B. Haskell

January 16, 2019

arXiv:1901.04882v1 [cs.GT] 15 Jan 2019

Abstract

In this paper, we propose a model for non-cooperative Markov games with time-consistent risk-aware players. In particular, our model characterizes the risk arising from both the stochastic state transitions and the randomized strategies of the other players. We give an appropriate equilibrium concept for our risk-aware Markov game model and we demonstrate the existence of such equilibria in stationary strategies. We then propose and analyze a simulation-based Q-learning type algorithm for equilibrium computation, and work through the details for some specific risk measures. Our numerical experiments on a two player queuing game demonstrate the worth and applicability of our model and corresponding Q-learning algorithm.

Keywords: Markov games; time-consistent risk preferences; fixed point theorem; Q-learning

1 Introduction

Markov games generalize Markov decision processes (MDPs) to the multi-player setting. In the classical case, each player seeks to minimize his expected costs. In a corresponding equilibrium, no player can decrease his expected costs by changing his strategy. We often want to compute equilibria to predict the outcome of the game and understand the behavior of the players.

In the present paper, we directly account for the risk preferences of the players in a Markov game in a general way. Players may be risk-averse and thus give more attention to low probability but high cost events than a risk-neutral player would. Models for the risk preferences of a single agent are well established (see e.g. [2, 53] for the single period setting and [52, 55] for the dynamic case). In this paper, we extend these ideas to general sum Markov games and extend the framework of Markov risk measures (see [52, 55]) to the multi-agent setting.

There are two major components to our present paper in this regard. First, we identify an appropriate risk-aware equilibrium concept and then we argue that such equilibria exist in stationary strategies.
Second, we provide a practical equilibrium computation scheme (a simulation-based Q-learning type algorithm). Such a Q-learning type algorithm is model-free: it does not require any knowledge of the true model or the system transitions, and thus can search for equilibria purely from observations.

1.1 Literature review

Risk preferences. Expected utility theory ([59, 18, 58]) is a highly developed framework for modeling preferences. Yet, some experiments (e.g. [42]) show that real human behavior may violate the independence axiom of expected utility theory. Risk measures (as developed in [2, 53]) do not require the independence axiom and have favorable properties for optimization. Alternatively, [15, 14] define preferences in terms of satisfying a continuum of targets (i.e. satisficing and aspirational preferences).

Wenjie Huang (wenjie_huang@u.nus.edu) is a Research Engineer and Ph.D. Candidate in the Department of Industrial Systems Engineering and Management at National University of Singapore. Pham Viet Hai (isepvh@nus.edu.sg) is a Postdoctoral Research Fellow in the Department of Industrial Systems Engineering and Management at National University of Singapore. William B. Haskell (isehwb@nus.edu.sg) is an Assistant Professor in the Department of Industrial Systems Engineering and Management at National University of Singapore.

In the dynamic setting, [52, 55] develop the class of Markov (a.k.a. nested/iterated) risk measures and establish their connection to time-consistency. This class of dynamic risk measures is notable for its recursive formulation, which leads to dynamic programming equations. Several computational schemes for optimizing Markov risk measures have been proposed. For instance, in [32, 30, 31], approximate dynamic programming and Q-learning type algorithms for MDPs with Markov risk measures are developed. In [62], a simulation-based fitted value iteration algorithm is developed for large-scale implementations of this class of MDPs.

In contrast to optimizing a Markov risk measure, one may optimize a single risk measure of the final outcome (see [49]). For example, the conditional value-at-risk of the (finite and infinite horizon) total cost is optimized in [5]. In [38, 39, 6], the expected utility of the total cost is optimized. Further, in [6] it is shown how to solve this problem for general utility functions by doing dynamic programming on an augmented state space. In [26], a general law invariant risk measure of the total cost is optimized using the convex analytic method.

Risk-aware and robust games. Risk-sensitive games have already been considered in [37, 24, 3, 7]. Here, risk-sensitivity refers to the specific utility function $(1/\theta) \ln\left(\mathbb{E}[\exp(\theta X)]\right)$, where $\theta > 0$ is the risk sensitivity parameter. In [24, 3], the focus is on the zero sum continuous time setting. In [33], the authors model preferences as expected utility. In robust games, there is ambiguity about the costs or state transition probabilities of the game. In [1], the authors give a robust equilibrium concept where each player optimizes against the worst-case expected cost over the range of model ambiguity. This paradigm is extended to Markov games in [35], and the existence of robust Markov perfect equilibria is demonstrated. In both [1, 35], a multilinear system formulation is used to compute the corresponding robust equilibrium.

Applications of risk-aware games. Risk-aware games are not artificial; rather, they emerge organically from many real problems. Traffic equilibrium problems with risk-averse agents are analyzed in [8] with non-cooperative game theory.
The preferences of risk-aware adversaries are modeled in Stackelberg security games in [51], and a computational scheme for robust defender strategies is presented. In [20], the authors study commodity trading where the players optimize time-consistent nested risk measures.

1.2 Contributions

We make the following contributions in this paper:

1. First, we develop a model for risk-aware Markov games where agents have time-consistent preferences. This model specifically addresses both sources of risk in a Markov game: (i) the risk from the stochastic state transitions and (ii) the risk from the randomized strategies of the other players.

2. Second, we propose a notion of risk-aware Markov perfect equilibria for this game. We show that there exist risk-aware equilibria in stationary strategies.

3. We create a practical simulation-based Q-learning type algorithm for computing risk-aware Markov perfect equilibria, and we show that it converges almost surely. Such a Q-learning type algorithm is model-free: it does not require any knowledge of the true model or the system transitions, and thus can search for equilibria purely from observations. Moreover, traditional multilinear formulation approaches (see [35, 1]) for computing equilibria fail in our model: because our model addresses both sources of risk, it introduces a bilinear term into the multilinear formulation, which leads to computational intractability. Thus, it is necessary to use an alternative such as a Q-learning type algorithm to compute equilibria.

This paper is organized as follows. Section 2 reviews preliminaries on classical Markov games. Then, Section 3 introduces our model for risk-aware Markov games. In Section 4, we certify the existence of risk-aware Markov perfect equilibria. As for computing these equilibria, Section 5 develops our Q-learning type algorithm. We report numerical experiments for a queuing game in Section 6 and we conclude the paper in Section 7. The detailed proofs of all our main results may be found in the Appendices.

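1.3 Notation

We make use of the following standard notation:

- $\|\cdot\|_\infty$ is the supremum norm on $\mathbb{R}^d$.
- $\langle x, y \rangle := \sum_{i=1}^{d} x_i y_i$ is the Euclidean inner product in $\mathbb{R}^d$.
- $X =_D Y$ is equality in distribution.
- For a finite set $\Delta$: $\mathcal{P}(\Delta)$ is the set of all probability distributions on $\Delta$; $e$ denotes the constant function equal to one on $\Delta$; $e_\delta$ is the $\delta$-th unit vector in $\mathbb{R}^{|\Delta|}$ for $\delta \in \Delta$.
- $A \otimes B$ is the matrix Kronecker product for $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$:
$$A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{pmatrix} \in \mathbb{R}^{mp \times nq}.$$
- $A \circ B$ is the matrix Hadamard product, where $[A \circ B]_{i,j} = A_{i,j} B_{i,j}$ for all $i = 1, \ldots, m$ and $j = 1, \ldots, n$.
- $d_H(A, B)$ is the Hausdorff distance between nonempty subsets $A$ and $B$ of $\mathbb{R}^d$ with respect to the Euclidean norm $\|\cdot\|_2$; explicitly,
$$d_H(A, B) := \max\left\{ \sup_{a \in A} \inf_{b \in B} \|a - b\|_2,\; \sup_{b \in B} \inf_{a \in A} \|a - b\|_2 \right\}.$$

2 Preliminaries

This section briefly reviews the setup for classical Markov games (see e.g. [22, 35, 21]). Our game, denoted $\{\mathcal{I}, \mathcal{S}, \mathcal{A}, P, c\}$, consists of the following ingredients:

- Finite set of players $\mathcal{I}$.
- Finite set of states $\mathcal{S}$.
- Finite set of actions $\mathcal{A}^i$ for each player $i \in \mathcal{I}$; multi-actions $\mathcal{A} := \times_{i \in \mathcal{I}} \mathcal{A}^i$; state-action pairs $\mathcal{K} := \mathcal{S} \times \mathcal{A}$.
- Transition probabilities $P(\cdot \mid s, a) \in \mathcal{P}(\mathcal{S})$ for all $(s, a) \in \mathcal{K}$.
- Cost functions $c^i : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ for all players $i \in \mathcal{I}$.

Each round $t \geq 0$ of the game follows four steps: (i) first, all players observe the current state $s_t \in \mathcal{S}$; (ii) second, each player $i \in \mathcal{I}$ chooses $a_t^i \in \mathcal{A}^i$ (all moves are simultaneous and independent, and the corresponding multi-action is $a_t = (a_t^i)_{i \in \mathcal{I}}$); (iii) third, each player $i \in \mathcal{I}$ realizes cost $c^i(s_t, a_t)$; and (iv) finally, the state transitions to $s_{t+1}$ according to the distribution $P(\cdot \mid s_t, a_t)$.

We next characterize the players' strategies. Let $h_t = (s_0, a_0, s_1, a_1, \ldots, a_{t-1}, s_t)$ be the history up to time $t \geq 0$ (which includes the state $s_t$ at the beginning of time $t$), and let $H_t := (\mathcal{S} \times \mathcal{A})^t \times \mathcal{S}$ be the set of all possible histories up to time $t \geq 0$.

Definition 2.1. (i) A decision rule for player $i$ at time $t$ is a function $d_t^i : H_t \to \mathcal{P}(\mathcal{A}^i)$, where $[d_t^i(h_t)](a^i)$ is the probability that player $i$ will choose $a^i \in \mathcal{A}^i$, conditioned on $h_t$.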

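(ii) A behavioral strategy $\pi^i$ for player $i \in \mathcal{I}$ is a sequence of decision rules $\pi^i := (d_t^i)_{t \geq 0}$, and $\Pi^i$ is the set of all behavioral strategies of player $i$.
(iii) A behavioral strategy $\pi^i$ for player $i \in \mathcal{I}$ is Markov if $[d_t^i(h_t)](a^i) = [d_t^i(s_t)](a^i)$ for all $h_t \in H_t$, $a^i \in \mathcal{A}^i$, $t \geq 0$.
(iv) A behavioral strategy $\pi^i$ for player $i \in \mathcal{I}$ is stationary if $[d_t^i(h_t)](a^i) = [d^i(s_t)](a^i)$ for all $h_t \in H_t$, $a^i \in \mathcal{A}^i$, $t \geq 0$.

We write $\pi^{-i} := (\pi^j)_{j \neq i}$ to denote the complementary behavioral strategy to player $i$, so that multi-strategies can be written as $\pi = (\pi^i, \pi^{-i})$.

Let $\Omega$ denote the set of trajectories $\omega = (s_t, a_t)_{t \geq 0}$, constructed from the history spaces $H_t$, $t \geq 0$. For each $t \geq 0$, we let $\mathcal{F}_t = \sigma(s_0, a_0, \ldots, s_t)$ be the natural filtration on $H_t$, so that $\mathcal{F}_0 = \sigma(s_0)$, $\mathcal{F}_t \subset \mathcal{F}_{t+1}$, and $\mathcal{F}_t \subset \mathcal{F}$, where $\mathcal{F} := \sigma\left(\bigcup_{t \geq 0} \mathcal{F}_t\right)$. It follows that $(\Omega, \mathcal{F})$ is a measurable space. Given any multi-strategy $\pi = (\pi^i, \pi^{-i})$ and initial state $s_0 = s$, we obtain a probability distribution $P_s^\pi$ on $(\Omega, \mathcal{F})$ and we let $\mathbb{E}_s^\pi$ denote expectation with respect to $P_s^\pi$.

In the classical setting, for a discount factor $\gamma \in (0, 1)$, each player's objective is to choose $\pi^i$ to minimize his expected infinite horizon discounted cost
$$J_{s_0}^i(\pi^i, \pi^{-i}) := \mathbb{E}_{s_0}^{(\pi^i, \pi^{-i})}\left[ \sum_{t=0}^{\infty} \gamma^t c^i(s_t, a_t) \right], \tag{2.1}$$
given the complementary strategy $\pi^{-i}$, where the subscript $s_0$ denotes the dependence on the initial state (via $P_{s_0}^\pi$). We now review the equilibrium points in the literature under the risk-neutral setting, which are known as Markov perfect equilibria.

Definition 2.2. ([22], [35, Definition 1]) (Markov perfect equilibrium) A multi-strategy $\pi^* = (\pi^{i*})_{i \in \mathcal{I}}$ is a Markov perfect equilibrium for $\{\mathcal{I}, \mathcal{S}, \mathcal{A}, P, c\}$ if
$$J_{s_0}^i(\pi^{i*}, \pi^{-i*}) \leq J_{s_0}^i(\pi^i, \pi^{-i*}), \quad \forall \pi^i \in \Pi^i,\; i \in \mathcal{I}.$$

This definition states that $\pi^*$ is an equilibrium point if and only if no player can improve his expected infinite horizon discounted cost by unilaterally changing his strategy. In other words, each player's strategy is a best response to the other players' strategies.

3 Risk-aware Markov games

In this section we develop our risk-aware Markov game model. Each player $i$ faces a stream of costs $X_t = c^i(s_t, a_t)$ for all $t \geq 0$. There are two sources of stochasticity in this cost sequence: (i) stochastic state transitions characterized by the transition kernel $P(\cdot \mid s, a)$; and (ii) the randomized mixed strategies of other players characterized by $\pi^{-i}$.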
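To make the two sources of randomness concrete, the following minimal sketch samples one cost stream by following the four steps of a round from Section 2. The container layouts (`P`, `c`, `x`) and the function names are hypothetical choices made for this illustration and do not come from the paper; stationary mixed strategies are assumed.

```python
import random

def simulate_round_costs(P, c, x, s0, horizon, seed=0):
    """Sample one trajectory and return the tuple of player costs per round.

    Hypothetical layouts used only in this sketch:
      P[s][a]    -> list of next-state probabilities for multi-action tuple a
      c[i][s][a] -> cost to player i in state s under multi-action a
      x[i][s]    -> stationary mixed strategy of player i in state s
    """
    rng = random.Random(seed)
    s, stream = s0, []
    for _ in range(horizon):
        # Source (ii): every player randomizes independently in the observed state.
        a = tuple(rng.choices(range(len(xi[s])), weights=xi[s])[0] for xi in x)
        # Each player realizes his cost for this round.
        stream.append(tuple(ci[s][a] for ci in c))
        # Source (i): the next state is drawn from the transition kernel.
        s = rng.choices(range(len(P[s][a])), weights=P[s][a])[0]
    return stream

def discounted_total(stream, player, gamma):
    """Discounted sum of one player's sampled costs (one sample of the sum in (2.1))."""
    return sum(gamma ** t * costs[player] for t, costs in enumerate(stream))
```

Averaging `discounted_total` over many sampled streams would approximate the risk-neutral objective (2.1) truncated at the chosen horizon; the risk-aware model replaces that average by a nested risk evaluation.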
The key question is: how should players account for both sources of stochasticity and evaluate the risk of the tail subsequence $X_t, X_{t+1}, \ldots$ from the perspective of time $t$? We begin by formalizing some details about the risk of finite sequences $X_{t,T} := (X_t, X_{t+1}, \ldots, X_T)$ before we consider the risk of the infinite cost sequence $X_0, X_1, \ldots$ actually faced by the players. For a reference distribution $P$ on $(\Omega, \mathcal{F})$, let $\mathcal{L}_t := \mathcal{L}_\infty(\Omega, \mathcal{F}_t, P)$ and $\mathcal{L}_{t,T} := \mathcal{L}_t \times \mathcal{L}_{t+1} \times \cdots \times \mathcal{L}_T$ for all $0 \leq t \leq T < \infty$.

Definition 3.1. (i) A mapping $\rho_{t,T} : \mathcal{L}_{t,T} \to \mathcal{L}_t$ is called a conditional risk measure if it is monotone: $\rho_{t,T}(Z_{t,T}) \leq \rho_{t,T}(X_{t,T})$ for all $Z_{t,T}, X_{t,T} \in \mathcal{L}_{t,T}$ such that $Z_{t,T} \leq X_{t,T}$.
(ii) A dynamic risk measure is a sequence of conditional risk measures $\{\rho_{t,T}\}_{t=0}^{T}$.

Given a dynamic risk measure $\{\rho_{t,T}\}_{t=0}^{T}$, we may define a larger family of risk measures $\rho_{t,\tau}$ for $0 \leq t \leq \tau \leq T$ via the convention $\rho_{t,\tau}(X_t, \ldots, X_\tau) = \rho_{t,T}(X_t, \ldots, X_\tau, 0, \ldots, 0)$. We now make our key assumptions about players' preferences.

Assumption 3.2. Suppose the dynamic risk measure $\{\rho_{t,T}\}_{t=0}^{T}$ satisfies the following conditions:
(i) (Normalization) $\rho_{t,T}(0, 0, \ldots, 0) = 0$.
(ii) (Conditional translation invariance) For any $X_{t,T} \in \mathcal{L}_{t,T}$, $\rho_{t,T}(X_t, X_{t+1}, \ldots, X_T) = X_t + \rho_{t,T}(0, X_{t+1}, \ldots, X_T)$.
(iii) (Convexity) For any $X_{t,T}, Y_{t,T} \in \mathcal{L}_{t,T}$ and $0 \leq \lambda \leq 1$, $\rho_{t,T}(\lambda X_{t,T} + (1 - \lambda) Y_{t,T}) \leq \lambda\, \rho_{t,T}(X_{t,T}) + (1 - \lambda)\, \rho_{t,T}(Y_{t,T})$.
(iv) (Positive homogeneity) For any $X_{t,T} \in \mathcal{L}_{t,T}$ and $\alpha \geq 0$, $\rho_{t,T}(\alpha X_{t,T}) = \alpha\, \rho_{t,T}(X_{t,T})$.
(v) (Time-consistency) For any $X_{t,T}, Y_{t,T} \in \mathcal{L}_{t,T}$ and $0 \leq \tau \leq \theta \leq T$, the conditions $X_k = Y_k$ for $k = \tau, \ldots, \theta - 1$ and $\rho_{\theta,T}(X_\theta, \ldots, X_T) \leq \rho_{\theta,T}(Y_\theta, \ldots, Y_T)$ imply $\rho_{\tau,T}(X_\tau, \ldots, X_T) \leq \rho_{\tau,T}(Y_\tau, \ldots, Y_T)$.

Many of these properties (monotonicity, convexity, positive homogeneity, and translation invariance) were originally introduced for static risk measures in the pioneering paper [2]. They have since been heavily justified in other work including [54, 10, 46].

The next theorem gives a recursive formulation for dynamic risk measures satisfying Assumption 3.2. This representation is the foundation of [52] and subsequent work on time-consistent risk measures. For this result, we define a mapping $\rho_t : \mathcal{L}_{t+1} \to \mathcal{L}_t$, where $t \geq 0$, to be a one-step (conditional) risk measure if $\rho_t(X_{t+1}) = \rho_{t,t+1}(0, X_{t+1})$.

Theorem 3.3. [52, Theorem 1] Suppose Assumption 3.2 holds, then
$$\rho_{t,T}(X_t, X_{t+1}, \ldots, X_T) = X_t + \rho_t\Big(X_{t+1} + \rho_{t+1}\big(X_{t+2} + \cdots + \rho_{T-1}(X_T) \cdots \big)\Big), \tag{3.1}$$
for all $0 \leq t \leq T$, where $\rho_t, \ldots, \rho_{T-1}$ are one-step risk measures.

Now we may consider the risk of an infinite cost sequence. Based on [52], the discounted measure of risk $\rho_{t,T}^\gamma : \mathcal{L}_{t,T} \to \mathbb{R}$ is defined via
$$\rho_{t,T}^\gamma(X_t, X_{t+1}, \ldots, X_T) := \rho_{t,T}(\gamma^t X_t, \gamma^{t+1} X_{t+1}, \ldots, \gamma^T X_T).$$
Define $\mathcal{L}_{t,\infty} := \mathcal{L}_t \times \mathcal{L}_{t+1} \times \cdots$ for $t \geq 0$ and $\rho^\gamma : \mathcal{L}_{0,\infty} \to \mathbb{R}$ via
$$\rho^\gamma(X_0, X_1, \ldots) := \lim_{T \to \infty} \rho_{0,T}^\gamma(X_0, X_1, \ldots, X_T).$$

To provide our final representation result, we introduce the additional assumption that risk preferences are stationary (they only depend on the sequence of costs ahead, and are independent of the current time).

Assumption 3.4. (Stationary preferences) For all $T \geq 1$ and $s \geq 0$, $\rho_{0,T}^\gamma(X_0, X_1, \ldots, X_T) = \rho_{s,T+s}^\gamma(X_0, X_1, \ldots, X_T)$.
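The nested form in Theorem 3.3 can be sketched on a small scenario tree. The example below is only an illustration: it assumes a finite tree, and it uses the first-order mean-upper-semideviation as the one-step risk measure, which is a standard coherent risk measure but is our choice here and is not singled out by the paper. The tree encoding and the function names are hypothetical.

```python
def semideviation_risk(probs, values, kappa=0.5):
    """One-step risk E[X] + kappa * E[(X - E[X])_+], with 0 <= kappa <= 1."""
    mean = sum(p * v for p, v in zip(probs, values))
    upper = sum(p * max(v - mean, 0.0) for p, v in zip(probs, values))
    return mean + kappa * upper

def nested_risk(node, rho):
    """Evaluate X_t + rho_t(X_{t+1} + rho_{t+1}(...)) by backward recursion.

    A node is (cost, children), where children is a list of
    (probability, child_node) pairs; a leaf has an empty list of children.
    """
    cost, children = node
    if not children:
        return cost
    probs = [p for p, _ in children]
    # Each child's value already includes its own cost plus its nested risk-to-go.
    values = [nested_risk(child, rho) for _, child in children]
    return cost + rho(probs, values)

# Two-stage example tree with deterministic cost 1.0 at the root.
tree = (1.0, [(0.5, (2.0, [(0.5, (0.0, [])), (0.5, (4.0, []))])),
              (0.5, (0.0, []))])
risk_neutral = nested_risk(tree, lambda p, v: semideviation_risk(p, v, kappa=0.0))
risk_averse = nested_risk(tree, semideviation_risk)
```

With `kappa=0.0` the recursion collapses to the expected total cost by the tower property, while a positive `kappa` penalizes the costly branch at every stage.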
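When Assumptions 3.2 and 3.4 are satisfied, the corresponding dynamic risk measure is given by the recursion:
$$\rho^\gamma(X_0, X_1, \ldots, X_T, \ldots) = X_0 + \rho_0\Big(\gamma X_1 + \rho_1\big(\gamma^2 X_2 + \cdots + \rho_{T-1}(\gamma^T X_T + \cdots)\big)\Big), \tag{3.2}$$
where $\rho_0, \rho_1, \ldots$ are all one-step risk measures. Based on representation (3.2), and using positive homogeneity to pull the discount factor out of each one-step risk measure, we may define the risk-aware objective for player $i$ to be:
$$J_{s_0}^i(\pi^i, \pi^{-i}) := \rho^i\Big(c^i(s_0, a_0) + \gamma\, \rho^i\big(c^i(s_1, a_1) + \gamma\, \rho^i(c^i(s_2, a_2) + \cdots)\big)\Big). \tag{3.3}$$
Here we use the same notation $J_{s_0}^i(\pi^i, \pi^{-i})$ for the risk-aware objective as for the risk-neutral one in (2.1). The corresponding best response problem for player $i$ is then:
$$\min_{\pi^i \in \Pi^i} J_{s_0}^i(\pi^i, \pi^{-i}). \tag{3.4}$$
We let $\{\mathcal{I}, \mathcal{S}, \mathcal{A}, P, c, \rho\}$ denote our corresponding risk-aware game with preferences given by the mappings $\{J^i(\pi^i, \pi^{-i})\}_{i \in \mathcal{I}}$. This formulation leads to a natural notion of risk-aware equilibrium. If we replace all the $\rho^i$ with the expectation $\mathbb{E}$ in formulation (3.3), we obtain formulation (2.1), and Problem (3.4) becomes the best response problem of a risk-neutral Markov game. Thus our formulation recovers the risk-neutral game as a special case.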
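As a rough numerical illustration of objective (3.3) and of the risk-neutral special case, the sketch below evaluates the iterated risk by fixed-point iteration. It assumes that all players follow fixed stationary strategies, so that each state has a known finite list of one-step outcomes `(probability, cost, next_state)`; this layout, the function names, and the `mean_max` risk measure (a convex combination of the expectation and the worst case) are illustrative assumptions, not constructions from the paper.

```python
def expectation(probs, values):
    """Risk-neutral one-step evaluation."""
    return sum(p * v for p, v in zip(probs, values))

def mean_max(probs, values, lam=0.3):
    """Illustrative coherent risk measure: (1 - lam) * E[X] + lam * max X."""
    support = [v for p, v in zip(probs, values) if p > 0]
    return (1 - lam) * expectation(probs, values) + lam * max(support)

def evaluate(outcomes, gamma, rho, tol=1e-10, max_iter=10000):
    """Iterate v(s) <- rho(cost + gamma * v(next_state)) until convergence.

    outcomes[s] is a list of (probability, cost, next_state) triples.
    Monotonicity and translation invariance of rho make this map a
    gamma-contraction in the supremum norm, so the iteration converges.
    """
    v = [0.0] * len(outcomes)
    for _ in range(max_iter):
        new = [rho([p for p, _, _ in out],
                   [cost + gamma * v[k] for _, cost, k in out])
               for out in outcomes]
        if max(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new
    return v

# State 0 is risky; state 1 is absorbing with zero cost.
outcomes = [[(0.5, 1.0, 0), (0.5, 3.0, 1)], [(1.0, 0.0, 1)]]
v_neutral = evaluate(outcomes, 0.5, expectation)
v_averse = evaluate(outcomes, 0.5, mean_max)
```

Replacing `rho` by `expectation` reproduces the usual discounted expected cost, which mirrors the remark that the risk-neutral game is a special case; the risk-averse value in state 0 is larger because the worst outcome receives extra weight at every stage.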

Definition 3.5. (Risk-aware equilibrium in behavioral strategies) A multi-strategy $\pi^* = (\pi^{i*})_{i \in \mathcal{I}}$ is a risk-aware equilibrium for $\{\mathcal{I}, \mathcal{S}, \mathcal{A}, P, c, \rho\}$ if
$$J_{s_0}^i(\pi^{i*}, \pi^{-i*}) \leq J_{s_0}^i(\pi^i, \pi^{-i*}), \quad \forall \pi^i \in \Pi^i,\; i \in \mathcal{I}.$$

The interpretation of Definition 3.5 is analogous to Definition 2.2 in the risk-neutral case. In a risk-aware equilibrium $\pi^*$, player $i$ cannot reduce his risk (as measured by $J_{s_0}^i(\pi^{i*}, \pi^{-i*})$) by deviating from his strategy $\pi^{i*}$.

4 Risk-aware Markov perfect equilibria

In this section we consider equilibria of the risk-aware game $\{\mathcal{I}, \mathcal{S}, \mathcal{A}, P, c, \rho\}$ in stationary strategies. First we introduce new notation to characterize these equilibria, and then we demonstrate the existence of risk-aware equilibria in stationary strategies.

Stationary strategies prescribe a player the same probabilities for his choices each time the player visits a certain state, no matter what route he follows to reach that state. In contrast, a general behavioral strategy may condition its choice of mixed action, at any given stage, on the entire history, and therefore its implementation is often a huge task. Since only as many decision rules as states need be remembered, the memoryless property of stationary strategies conforms to real human behavior (see [60]). In addition, stationary strategies are prevalent in the study of stochastic games due to their mathematical tractability (see [60, 22]).

4.1 Characterization of stationary equilibria

For this discussion, we suppose that each player $i$ has a stationary policy $\pi^i \in \Pi^i$ where $\pi^i = (d^i, d^i, \ldots)$ for a decision rule $d^i$. In every stage of the game, each player must evaluate the risk of the random variable $c^i(s, A(\pi)) + \gamma\, v^i(S'(\pi))$, where $A(\pi)$ is the random multi-action chosen from $\mathcal{A}$ (which depends on the multi-strategy $\pi$), $v^i$ is some measure of the future risk for player $i$ (to be determined shortly based on recursion (3.2)), and $S'(\pi)$ is the random next state visited (which first depends on $\pi$ through the random choice of multi-action $a$, and then depends on the transition kernel $P(\cdot \mid s, a)$ after $a \in \mathcal{A}$ is realized). This random variable is defined on the sample space $\mathcal{A} \times \mathcal{S}$, where we write $\mathcal{A}$ before $\mathcal{S}$ to emphasize that the current state $s$ is fixed, then the multi-action $a$ is chosen according to $\pi$, and finally the game transitions to the next state according to $P(\cdot \mid s, a)$. The iterated risk measure in Eq.
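(3.3) states that we are interested in the stage-wise risk of random variables on $\mathcal{A} \times \mathcal{S}$. We may explicitly determine the distribution of $(A(\pi), S'(\pi))$ in terms of $\pi$ and $P$.

To continue, we introduce some simplifying notation to characterize stationary strategies $\pi$ which correspond to decision rules $(d^i)_{i \in \mathcal{I}}$. For each player $i \in \mathcal{I}$ and state $s \in \mathcal{S}$, $x_s^i \in \mathcal{P}(\mathcal{A}^i)$ is the mixed strategy over actions where $x_s^i(a^i) = [d^i(s)](a^i)$ for all $a^i \in \mathcal{A}^i$. We define the strategy $x^i := (x_s^i)_{s \in \mathcal{S}} \in \mathcal{X}^i := \times_{s \in \mathcal{S}} \mathcal{P}(\mathcal{A}^i)$ of player $i$, the multi-strategy $x := (x^i)_{i \in \mathcal{I}} \in \mathcal{X} := \times_{i \in \mathcal{I}} \mathcal{X}^i$ of all players, the complementary strategy $x^{-i} := (x^j)_{j \neq i} \in \mathcal{X}^{-i} := \times_{j \neq i} \mathcal{X}^j$, and the multi-strategy $x_s = (x_s^i)_{i \in \mathcal{I}} \in \mathcal{X}_s := \times_{i \in \mathcal{I}} \mathcal{P}(\mathcal{A}^i)$ for all players in state $s \in \mathcal{S}$. We sometimes write a multi-strategy as $x = (u^i, x^{-i})$ to emphasize player $i$'s strategy. We also introduce the following succinct notation for various probabilities:

- In state $s \in \mathcal{S}$, the probability that an action tuple $a = (a^i)_{i \in \mathcal{I}} \in \mathcal{A}$ is chosen and then the system transitions to state $k$ is $\left(\prod_{i \in \mathcal{I}} x_s^i(a^i)\right) P(k \mid s, a)$.
- The distribution of $(A(\pi), S'(\pi))$ on $\mathcal{A} \times \mathcal{S}$ for every $s \in \mathcal{S}$ is given by the matrix
$$P_s(u_s^i, x_s^{-i}) := \left[ u_s^i(a^i) \left( \prod_{j \neq i} x_s^j(a^j) \right) P(k \mid s, a) \right]_{(a, k) \in \mathcal{A} \times \mathcal{S}}, \tag{4.1}$$
where we explicitly denote the dependence on the multi-strategy $x_s = (u_s^i, x_s^{-i})$ in state $s$.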
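The joint distribution in (4.1) is straightforward to tabulate for a fixed state. The sketch below is a minimal illustration under assumed data layouts: `x_s` is a list holding each player's mixed strategy in the current state and `P_s[a]` is the list of next-state probabilities for the multi-action tuple `a`. These names and layouts are hypothetical and chosen only for this example.

```python
from itertools import product

def joint_distribution(x_s, P_s):
    """Return {(a, k): probability} over multi-actions a and next states k.

    Each entry is prod_i x_s[i][a[i]] * P_s[a][k], i.e. the players
    randomize independently and the state then moves according to the kernel.
    """
    dist = {}
    for a in product(*(range(len(xi)) for xi in x_s)):
        prob_a = 1.0
        for xi, ai in zip(x_s, a):
            prob_a *= xi[ai]
        for k, prob_k in enumerate(P_s[a]):
            dist[(a, k)] = prob_a * prob_k
    return dist

# Two players with two actions each, two possible next states.
x_s = [[0.5, 0.5], [0.25, 0.75]]
P_s = {(0, 0): [1.0, 0.0], (0, 1): [0.5, 0.5],
       (1, 0): [0.2, 0.8], (1, 1): [0.0, 1.0]}
dist = joint_distribution(x_s, P_s)
```

Swapping the first entry of `x_s` for a candidate deviation `u` gives the distribution that player 1 would induce by deviating, which is the object each player's stage-wise risk is evaluated against.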

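For stationary strategies $\pi$, we adopt the convention $J_s^i(u^i, x^{-i}) = J_s^i(\pi^i, \pi^{-i})$ using the above notation. In line with the classical definition of Markov perfect equilibrium in [22], we now define a risk-aware Markov perfect equilibrium.

Definition 4.1. (Risk-aware Markov perfect equilibrium) A multi-strategy $x^* \in \mathcal{X}$ is a risk-aware Markov perfect equilibrium for $\{\mathcal{I}, \mathcal{S}, \mathcal{A}, P, c, \rho\}$ if
$$J_s^i(x^{i*}, x^{-i*}) \leq J_s^i(u^i, x^{-i*}), \quad \forall s \in \mathcal{S},\; u^i \in \mathcal{X}^i,\; i \in \mathcal{I}. \tag{4.2}$$

In Definition 4.1, each player $i \in \mathcal{I}$ implements a (risk-aware) stationary best response given the stationary complementary strategy $x^{-i*}$.

4.2 Existence of stationary equilibria

To assess the stage-wise risk on $\mathcal{A} \times \mathcal{S}$, each player needs an estimate of the future risk starting from the next state $s' \in \mathcal{S}$. This estimate is the value function, which forms part of the description of a stationary equilibrium:

- For each player $i$, the value of the stationary strategy $x \in \mathcal{X}$ in state $s \in \mathcal{S}$ is defined to be $v^i(s) := J_s^i(x)$, and $v^i := (v^i(s))_{s \in \mathcal{S}}$ is the entire value function for player $i$ for all states.
- The space of value functions for all players is $\mathcal{V} := \times_{i \in \mathcal{I}} \mathbb{R}^{|\mathcal{S}|}$, and $\mathcal{V}$ is equipped with the supremum norm $\|v\|_\infty := \max_{s \in \mathcal{S},\, i \in \mathcal{I}} |v^i(s)|$.

Given the value function $v^i$, player $i$ faces the random variable $c^i(s, A) + \gamma\, v^i(S')$, where $A$ is the random action chosen according to $x_s$ and $S'$ is the random next state. Player $i$'s best response in state $s \in \mathcal{S}$ given the complementary strategy $x_s^{-i}$ may then be expressed as
$$\inf_{u_s^i \in \mathcal{P}(\mathcal{A}^i)} \left\{ \rho^i\big(c^i(s, A) + \gamma\, v^i(S')\big) : (A, S') \sim P_s(u_s^i, x_s^{-i}) \right\}.$$
We see that the mapping $c^i(s, A) + \gamma\, v^i(S')$ on $\mathcal{A} \times \mathcal{S}$ is fixed, while the players control the distribution $P_s \in \mathcal{P}(\mathcal{A} \times \mathcal{S})$ through their mixed strategies.

To continue, we will be more specific about the form of the risk functions $\{\rho^i\}_{i \in \mathcal{I}}$. Let $\mathcal{L}$ be the set of all random variables on $\mathcal{A} \times \mathcal{S}$, which are mappings from $\mathcal{A} \times \mathcal{S}$ to $\mathbb{R}$ (such random variables are automatically bounded since $\mathcal{A} \times \mathcal{S}$ is finite). Also let $P_s$ denote a general probability distribution on $\mathcal{A} \times \mathcal{S}$ corresponding to the current state $s \in \mathcal{S}$ (e.g. we will usually take $P_s$ to be $P_s(u_s^i, x_s^{-i})$). By the Fenchel-Moreau theorem (see e.g.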
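[23, 53, 25]), convex risk measures on $\mathcal{L}$ with respect to the underlying probability distribution $P_s$ have the form:
$$\rho^i(X) = \sup_{\mu \in \mathcal{M}_s^i(P_s)} \left\{ \sum_{(a, s') \in \mathcal{A} \times \mathcal{S}} \mu(a, s')\, X(a, s') - \alpha_s^i(\mu) \right\}, \quad \forall X \in \mathcal{L},$$
where $\mathcal{M}_s^i(P_s) \subset \mathcal{P}(\mathcal{A} \times \mathcal{S})$ is a set of probability distributions that depends on $P_s$, $\mathcal{M}_s^i(P_s)$ is closed and convex, and $\alpha_s^i : \mathcal{P}(\mathcal{A} \times \mathcal{S}) \to \mathbb{R}$ is a convex function. In our case, evaluation of the risk-to-go depends on computing worst-case expectations of
$$\mathbb{E}_{(A, S') \sim \mu}\left[ c^i(s, A) + \gamma\, v^i(S') \right],$$
as $\mu$ ranges over $\mathcal{M}_s^i(P_s)$. To simplify this expression, we define the shorthand
$$C_s^i(v^i) := \big( c^i(s, a) + \gamma\, v^i(s') \big)_{(a, s') \in \mathcal{A} \times \mathcal{S}}, \tag{4.3}$$
for the risk-to-go, which depends on $v^i$. Expectation with respect to $\mu \in \mathcal{M}_s^i(P_s)$ may then be written compactly as
$$\langle \mu, C_s^i(v^i) \rangle := \sum_{(a, s') \in \mathcal{A} \times \mathcal{S}} \mu(a, s') \left[ c^i(s, a) + \gamma\, v^i(s') \right].$$
We next introduce the following assumptions, maintained throughout this paper, on the risk functions $\{\rho^i\}_{i \in \mathcal{I}}$, the sets of probability distributions $\{\mathcal{M}_s^i(P_s)\}_{s \in \mathcal{S},\, i \in \mathcal{I}}$ and the functions $\{\alpha_s^i\}_{s \in \mathcal{S},\, i \in \mathcal{I}}$, which will lead to the existence of stationary equilibria.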

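Assumption 4.2. (i) All $\rho^i$ are law invariant, i.e. $\rho^i(X) = \rho^i(Y)$ for all $X =_D Y$, where $=_D$ denotes equality in distribution.
(ii) $\{\mathcal{M}_s^i(P_s)\}_{s \in \mathcal{S},\, i \in \mathcal{I}} \subset \mathcal{P}(\mathcal{A} \times \mathcal{S})$ is a collection of set-valued mappings where the $\mathcal{M}_s^i(P_s)$ are closed and polyhedral convex. Explicitly, we consider
$$\mathcal{M}_s^i(P_s) := \left\{ \mu : A_{s,m}^i\, \mu + f_m^i(P_s) \geq h_{s,m}^i,\; m = 1, \ldots, M,\; e^T \mu = 1,\; \mu \geq 0 \right\}, \tag{4.4}$$
where $A_{s,m}^i$, $m = 1, \ldots, M$, are matrices, $f_m^i$, $m = 1, \ldots, M$, are linear functions in $P_s$, and $h_{s,m}^i$, $m = 1, \ldots, M$, are constants.
(iii) $\{\alpha_s^i\}_{s \in \mathcal{S},\, i \in \mathcal{I}}$ with $\alpha_s^i : \mathcal{P}(\mathcal{A} \times \mathcal{S}) \to \mathbb{R}$ is a collection of functions. All $\alpha_s^i$ are convex and Lipschitz continuous.
(iv) For the multi-strategy $x_s = (u_s^i, x_s^{-i})$ and $(A, S') \sim P_s(u_s^i, x_s^{-i})$,
$$\rho^i\big(c^i(s, A) + \gamma\, v^i(S')\big) = \max_{\mu \in \mathcal{M}_s^i(P_s)} \left\{ \langle \mu, C_s^i(v^i) \rangle - \alpha_s^i(\mu) \right\}.$$

The formulation (4.4) shows the dependence of $\mathcal{M}_s^i(P_s)$ on $P_s$. In addition, if $f_m^i$ depends linearly on $P_s$, then $f_m^i$ also depends linearly on $u_s^i$ and $x_s^{-i}$ by the definition of $P_s$ in (4.1). In computational terms, Assumption 4.2(iv) is close to [35], which assumes polyhedral uncertainty sets for the transition probabilities in its robust Markov game model. This assumption also corresponds to the one in [20] about the representation of agents' risk preferences. Under Assumption 4.2(iv), we may write player $i$'s risk function as
$$\psi_s^i(u_s^i, x_s^{-i}, v^i) := \sup_{\mu \in \mathcal{M}_s^i(P_s)} \left\{ \langle \mu, C_s^i(v^i) \rangle - \alpha_s^i(\mu) \right\}, \tag{4.5}$$
which describes the risk-to-go for player $i$ from state $s \in \mathcal{S}$ under the stationary strategy $(u_s^i, x_s^{-i})$ with value function $v^i$. A value function corresponds to a risk-aware Markov perfect equilibrium when
$$v^{i*}(s) = \min_{u^i \in \mathcal{X}^i} J_s^i(u^i, x^{-i*}), \quad \forall s \in \mathcal{S},\; i \in \mathcal{I}, \tag{4.6}$$
$$x^{i*} \in \arg\min_{u^i \in \mathcal{X}^i} J_s^i(u^i, x^{-i*}), \quad \forall s \in \mathcal{S},\; i \in \mathcal{I}. \tag{4.7}$$
Eqs. (4.6)-(4.7) together simply restate Eq. (4.2). However, Eqs. (4.6)-(4.7) give a computational recipe that we can encode into an operator on multi-strategies. In particular, we define the operator
$$\Phi(x) := \left\{ \tilde{u} \in \mathcal{X} : \tilde{u}_s^i \in \arg\min_{u_s^i \in \mathcal{P}(\mathcal{A}^i)} \psi_s^i(u_s^i, x_s^{-i}, v^i),\;\; v^i(s) = \min_{u_s^i \in \mathcal{P}(\mathcal{A}^i)} \psi_s^i(u_s^i, x_s^{-i}, v^i),\;\; \forall s \in \mathcal{S},\; i \in \mathcal{I} \right\}, \tag{4.8}$$
which returns the set of strategies for every player that are best responses to all other players' strategies. In the definition of $\Phi(x)$, the condition $v^i(s) = \min_{u_s^i \in \mathcal{P}(\mathcal{A}^i)} \psi_s^i(u_s^i, x_s^{-i}, v^i)$ is understood to hold for every $s \in \mathcal{S}$ and $i \in \mathcal{I}$. We wish to establish existence of a fixed point of $\Phi$, which corresponds to a risk-aware Markov perfect equilibrium. Our proof of existence of such a risk-aware Markov perfect equilibrium draws from [22, 36].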
The main idea is to apply Kakutani's fixed point theorem to show that this correspondence has a fixed point, which coincides with an equilibrium in stationary strategies.

Theorem 4.3. Suppose Assumption 4.2 holds, then $\{\mathcal{I}, \mathcal{S}, \mathcal{A}, P, c, \rho\}$ has an equilibrium in stationary strategies.

Proof. (Proof sketch) In Appendix C, we show that $\Phi$ satisfies the three conditions needed to apply Kakutani's theorem (Theorem C.1):
(i) First, we show that the set $\Phi(x)$ is nonempty and a subset of $\mathcal{X}$. In Lemma C.7, we show that $\psi_s^i(u_s^i, x_s^{-i}, v^i)$ is continuous in all its arguments. In Theorems C.3 and C.4, we then show that $\Phi(x)$ is nonempty and a subset of $\mathcal{X}$.

(ii) Second, we show that the set $\Phi(x)$ is closed and convex. This statement is verified in Lemma C.13.
(iii) Finally, we show that the correspondence $\Phi$ is upper semicontinuous. For each multi-strategy, we have already established an operator $T_x$ (see (C.1)) on the space of value functions, and shown that the operator is a contraction. Thus, there is a unique value function for each player. Let us define a mapping from stationary strategies to value functions via
$$\tau^i(x^{-i}) := \left\{ v^i = (v^i(s))_{s \in \mathcal{S}} : v^i(s) = \min_{u_s^i \in \mathcal{P}(\mathcal{A}^i)} \psi_s^i(u_s^i, x_s^{-i}, v^i),\; \forall s \in \mathcal{S},\; i \in \mathcal{I} \right\}.$$
Each $\tau^i(x^{-i})$ returns the value function for player $i$ corresponding to a best response to $x^{-i}$. Denote the $s$-th element of $\tau^i(x^{-i})$ by $\tau_s^i(x^{-i})$, let $(x)_n$ be a sequence of mixed strategies of all players satisfying $\lim_{n \to \infty} (x)_n = x$, and let the corresponding value function for player $i$ be $\tau^i((x^{-i})_n)$. It is shown in Lemma C.10 that if $(x^{-i})_n \to x^{-i}$ and $\tau_s^i((x^{-i})_n) \to v^i(s)$ as $n \to \infty$, then $\tau_s^i(x^{-i}) = v^i(s)$ for any $s \in \mathcal{S}$. Upper semicontinuity of the correspondence $\Phi$ then follows from Lemma C.14 by using the equalities
$$v^i(s) = \psi_s^i(y_s^i, x_s^{-i}, v^i) = \tau_s^i((x^{-i})_n) = \min_{u_s^i \in \mathcal{P}(\mathcal{A}^i)} \psi_s^i(u_s^i, x_s^{-i}, v^i),$$
which are derived from the triangle inequality and Lemma C.10.

We note that in a general-sum Markov game (including our risk-aware variant), multiple Markov perfect equilibria may exist. Each distinct equilibrium leads to a different unique risk-aware value function based on Theorems C.3 and C.4.

5 A Q-Learning Algorithm

In this section, we propose a Q-learning type algorithm for computing equilibria of the risk-aware game $\{\mathcal{I}, \mathcal{S}, \mathcal{A}, P, c, \rho\}$: Risk-aware Nash Q-learning (RaNashQL). RaNashQL is simulation-based, so it does not require a model for the cost functions $\{c^i\}_{i \in \mathcal{I}}$ or the transition kernel $P$. Our algorithm differs from risk-neutral Q-learning in two respects: (i) it estimates risk with a stochastic approximation type iteration and (ii) the Q-value updates are based on the Nash equilibria of stage games. Here the stage games are collections of Q-value arrays for each player for all multi-actions, which are generated in each iteration of our algorithm. To deal with item (i), we draw from [32, 30, 31], where multiple stochastic approximation instances for both risk estimation and Q-value updates are pasted together.
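To deal with item (ii), we show that the Nash equilibrium mapping for stage games is a non-expansive mapping. Then, we can apply stochastic approximation type analysis to prove the convergence of the algorithm. For this section, we assume that our risk measures $\{\rho^i\}_{i \in \mathcal{I}}$ have a special form as a saddle-point problem to facilitate computation.

Assumption 5.1. (Stochastic saddle-point problem) For all $i \in \mathcal{I}$,
$$\rho^i(X) = \min_{y \in \mathcal{Y}^i} \max_{z \in \mathcal{Z}^i} \mathbb{E}_P\left[ G^i(X, y, z) \right], \tag{5.1}$$
where:
(i) $\mathcal{Y}^i \subset \mathbb{R}^{d_1}$ and $\mathcal{Z}^i \subset \mathbb{R}^{d_2}$ are compact and convex with diameters $D_{\mathcal{Y}}$ and $D_{\mathcal{Z}}$, respectively.
(ii) $G^i$ is Lipschitz continuous on $\mathcal{L} \times \mathcal{Y}^i \times \mathcal{Z}^i$ with constant $K_G > 1$.
(iii) $G^i$ is convex in $y \in \mathcal{Y}^i$ and concave in $z \in \mathcal{Z}^i$.
(iv) The subgradients of $G^i$ on $y$ and $z$ are Borel measurable and uniformly bounded for all $X \in \mathcal{L}$.

For our Q-learning algorithm, we specifically focus on risk measures that can be estimated by solving a stochastic saddle-point problem such as Problem (5.1). The following result, based on [31, Theorem 3.2], gives special conditions on $G^i$ for the corresponding risk function $\rho^i$ in (5.1) to have additional structure.

Theorem 5.2. Suppose there is a collection of functions $\{h_z\}_{z \in \mathcal{Z}}$ such that: (i) $h_z$ is $P$-square summable for every $y \in \mathcal{Y}$, $z \in \mathcal{Z}$; (ii) $y \mapsto h_z(X - y)$ is convex; (iii) $z \mapsto h_z(X - y)$ is concave; and (iv) $G : \mathcal{L} \times \mathcal{Y} \times \mathcal{Z} \to \mathbb{R}$ is given by $G(X, y, z) = y + h_z(X - y)$, then the minimax risk measure (5.1) is a convex risk measure.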
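To give a feel for how a risk measure of the form (5.1) can be estimated from samples alone, here is a minimal sketch of a projected stochastic subgradient iteration for the special case $G(X, y, z) = y + (1 - \alpha)^{-1} \max\{X - y, 0\}$ with a trivial set $\mathcal{Z}$ (the CVaR case). The step size $1/\sqrt{t}$, the iterate averaging, and all function names are assumptions made for this illustration; this is not the RaNashQL update itself.

```python
import random

def estimate_minimizer(sample, alpha, y_lo, y_hi, n_iter=20000, seed=0):
    """Projected stochastic subgradient descent on y -> E[y + (X - y)_+ / (1 - alpha)].

    `sample(rng)` draws one observation of X; only samples are used,
    never the distribution itself. Returns the averaged iterate.
    """
    rng = random.Random(seed)
    y, y_sum = y_lo, 0.0
    for t in range(1, n_iter + 1):
        x = sample(rng)
        # Subgradient in y of the sampled objective y + (x - y)_+ / (1 - alpha).
        grad = 1.0 - (1.0 / (1.0 - alpha) if x > y else 0.0)
        # Diminishing step, then projection back onto the compact set [y_lo, y_hi].
        y = min(max(y - grad / t ** 0.5, y_lo), y_hi)
        y_sum += y
    return y_sum / n_iter

def cvar_objective(y, support, alpha):
    """Exact value of E[y + (X - y)_+ / (1 - alpha)] for X uniform on `support`."""
    return y + sum(max(x - y, 0.0) for x in support) / (len(support) * (1.0 - alpha))

support = list(range(10))          # X uniform on {0, ..., 9}
y_bar = estimate_minimizer(lambda rng: rng.choice(support), 0.8, 0.0, 9.0)
risk_estimate = cvar_objective(y_bar, support, 0.8)
```

For this distribution the exact minimum of the objective is 8.5 (the average of the two largest outcomes), so `risk_estimate` should land close to that value. A non-trivial set $\mathcal{Z}$ would add a projected ascent step in `z` alongside the descent step in `y`.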

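We now give some examples of functions $\{h_z\}_{z \in Z}$ satisfying the conditions of Theorem 5.2 such that the corresponding risk-aware Markov perfect equilibrium exists.

Example 5.3. The distance between any probability distribution and a reference distribution may be measured by a φ-divergence function; several examples of φ-divergence functions are shown in Table 3 in Appendix A. We can, in principle, approximate convex φ-divergence functions with piecewise linear convex functions of the form $\hat\phi(\mu) = \max_{j \in J} \{ \langle d_j, \mu \rangle + g_j \}$. Using the above form of $\hat\phi$, we may then define a corresponding set of probability distributions:

$$M^i(P) = \left\{ \mu : \mu = P\xi,\ B\mu = e,\ \mu \ge 0,\ BP(d_j \xi + g_j) \le \alpha^i e,\ \forall j \in J \right\}, \tag{5.2}$$

for constants $\alpha^i \in (0, 1)$ for all $i \in I$. Based on [50, Lemma 1], the risk measure corresponding to (5.2) has the form

$$\rho^i(X) = \inf_{b \ge 0,\ \eta \in \mathbb{R}} \left\{ \eta + b\,\alpha^i + b\, \mathbb{E}_P\left[ \phi^*\left( \frac{X - \eta}{b} \right) \right] \right\}, \tag{5.3}$$

where $\phi^*$ is the convex conjugate of $\hat\phi$.

(i) Let $\phi_z$ denote a family of φ-divergence functions parameterized by $z \in Z$ that is concave in $Z$, and let $\hat\phi_z$ and $\phi_z^*$ denote the corresponding piecewise linear approximation and its convex conjugate, respectively. Then, we may define

$$M^i_z := \left\{ \mu : \mu = P\xi,\ B\mu = e,\ \mu \ge 0,\ BP\hat\phi_z(\xi) \le \alpha^i e \right\}, \tag{5.4}$$

and the risk measure corresponding to $\bigcup_{z \in Z} M^i_z$ is

$$\rho^i(X) = \inf_{b \ge 0,\ \eta \in \mathbb{R}} \max_{z \in Z} \left\{ \eta + b\,\alpha^i + b\, \mathbb{E}_P\left[ \phi_z^*\left( \frac{X - \eta}{b} \right) \right] \right\}. \tag{5.5}$$

Suppose we choose $h_z$ (from Theorem 5.2) to be $h_z(X - \eta) = b\,\phi_z^*\left( \frac{X - \eta}{b} \right)$ for any $\eta \in \mathbb{R}$ and $b > 0$. Assume $X$ has bounded support $[\eta_{\min}, \eta_{\max}]$; then (5.5) becomes

$$\rho^i(X) = \min_{\eta \in [\eta_{\min}, \eta_{\max}]} \max_{z \in Z} \left\{ \eta + \mathbb{E}_P\left[ h_z(X - \eta) \right] \right\},$$

which conforms to the minimax structure in Eq. (5.1).

(ii) To recover CVaR, we let $\alpha^i \in (0, 1)$ for all $i \in I$ and choose the φ-divergence function

$$\phi(x) = \begin{cases} 0, & 0 \le x \le \frac{e}{1 - \alpha^i}, \\ \infty, & \text{otherwise}, \end{cases}$$

and we take

$$M^i(P) = \left\{ \mu : \mu = P\xi,\ B\mu = e,\ \mu \ge 0,\ 0 \le \mu \le \frac{P}{1 - \alpha^i} \right\}. \tag{5.6}$$

If we take the convex conjugate of this φ-divergence function and substitute it into Eq. (5.3), we obtain

$$\rho^i(X) = \min_{\eta \in [\eta_{\min}, \eta_{\max}]} \left\{ \eta + (1 - \alpha^i)^{-1}\, \mathbb{E}_P\left[ \max\{X - \eta, 0\} \right] \right\},$$

corresponding to $h_z(X - \eta) = (1 - \alpha^i)^{-1} \max\{X - \eta, 0\}$ for all $z \in Z$.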
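As a numerical illustration of the CVaR form in Example 5.3, the following minimal sketch evaluates $\min_\eta \{\eta + (1-\alpha)^{-1}\mathbb{E}[\max\{X-\eta,0\}]\}$ on an empirical sample. The function name `cvar_min_form` and the cost sample are illustrative assumptions, not part of the paper. The minimization over $\eta$ is done by enumerating the sample points, which suffices here because the empirical objective is convex and piecewise linear in $\eta$ with kinks only at sample values.

```python
def cvar_min_form(sample, alpha):
    """Empirical CVaR via min over eta of eta + E[max(X - eta, 0)] / (1 - alpha).

    The empirical objective is convex and piecewise linear in eta with kinks
    at the sample points, so a minimizer is attained at one of the samples.
    """
    n = len(sample)

    def objective(eta):
        excess = sum(max(x - eta, 0.0) for x in sample) / n
        return eta + excess / (1.0 - alpha)

    return min(objective(eta) for eta in sample)


# Hypothetical cost sample (costs, so larger values are worse).
costs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
cvar_80 = cvar_min_form(costs, alpha=0.8)  # average of the worst 20% of costs
cvar_0 = cvar_min_form(costs, alpha=0.0)   # reduces to the sample mean
```

With $\alpha = 0.8$ the value coincides with the average of the two largest costs, and with $\alpha = 0$ it coincides with the sample mean, which matches the usual reading of CVaR as a tail average.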

5.1 Risk-aware Nash Q-learning algorithm

The RaNashQL updates are based on future equilibrium costs (which depend on all players). In contrast, single-agent Q-learning updates are only based on the player's own costs. Thus, to predict equilibrium losses, every player must maintain and update a model for all other players' costs and risks. Let $x_*$ be a stationary equilibrium, and let $v_* = (v_*^i)_{i \in I}$ be the value functions corresponding to $x_*$ defined in (4.6). Then, we let

$$Q_*^i(s, a) := \min_{y \in Y} \max_{z \in Z} \mathbb{E}_{P(\cdot | s, a)} \left\{ G^i\left( c^i(s, A) + \gamma\, v_*^i(S'),\ y,\ z \right) \right\}, \quad \forall (s, a) \in S \times A,\ i \in I, \tag{5.7}$$

denote the Q-values corresponding to $x_*$ and its value function $v_*$. In the case of multiple equilibria, different Nash strategy profiles may have different equilibrium Q-values.

A stage game is, informally, a one-shot game. In a multi-agent Q-learning algorithm, the agents are essentially playing a sequence of stage games where the payoffs are the current Q-values. In the following definition, we abuse notation and momentarily drop the dependence on the state $s \in S$.

Definition 5.4. (i) A stage game is a collection $(C^i)_{i \in I}$ of players' costs $C^i := \{c^i(a) : a \in A\}$ for cost functions $c^i : A \to \mathbb{R}$.
(ii) For $x = (x^i)_{i \in I}$ where $x^i \in P(A^i)$ for all $i \in I$, $C^i(x^i, x^{-i}) := \sum_{a \in A} \left( \prod_{j \in I} x^j(a^j) \right) c^i(a)$ is the expected cost to player $i$.
(iii) A multi-strategy $x = (x^i)_{i \in I}$ constitutes a Nash equilibrium for the stage game $(C^i)_{i \in I}$ if $C^i(x^i, x^{-i}) \le C^i(u^i, x^{-i})$, $\forall u^i \in P(A^i)$, $i \in I$, i.e. if no player can reduce his expected cost by deviating from $x$.
(iv) Let $x = (x^i)_{i \in I}$ denote a Nash equilibrium of the stage game $(C^i)_{i \in I}$. Then for all $i \in I$, $\mathrm{Nash}^i (C^j)_{j \in I} := C^i(x^i, x^{-i})$ is player $i$'s expected cost in this equilibrium.

Remark 5.5. There are several methods for computing Nash equilibria of stage games. The Lemke-Howson algorithm for two-player (bimatrix) games is proposed in [41]. This algorithm is efficient in practice, yet in the worst case the number of pivot operations may be exponential in the number of the game's pure strategies. Recently, [44] gives an algorithm for two-player games that achieves polynomial-time complexity. Polynomial-time approximation methods, such as [16, 27, 45], have been proposed for general-sum games with more than two players.

Now, we describe the specifics of the stage games in each round for our risk-aware setting.
In each state $s \in S$, the corresponding stage game is the collection $(Q^i(s))_{i \in I}$, where $Q^i(s) := \{Q^i(s, a) : a \in A\}$ is the array of Q-values for player $i$ for all multi-actions. Let $x$ be an equilibrium of the stage game $(Q^i(s))_{i \in I}$; then

$$\mathrm{Nash}^i (Q^j(s))_{j \in I} := \sum_{a \in A} \prod_{j \in I} x^j(a^j)\, Q^i(s, a), \quad \forall i \in I,$$

gives each player's corresponding expected cost in state $s$ (with respect to the Q-values).

To open the discussion of our algorithm, we first summarize the main steps of Q-learning for risk-neutral Markov games as developed in [29]. Let $\{\theta_t\}_{t \ge 0}$ denote the step-sizes for updating the Q-values. For every iteration $t \ge 0$ and player $i \in I$:
1. Player $i$ observes the current state $s_t$ and then chooses $a_t^i \in A^i$.
2. Player $i$ observes its own cost $c^i(s_t, a_t)$, the actions taken by all other players $(a_t^j)_{j \ne i}$, the other players' costs $\{c^j(s_t, a_t)\}_{j \ne i}$, and lastly the next state $s_{t+1}$ after the transition.
3. Player $i$ computes $\mathrm{Nash}^i (Q_t^j(s_{t+1}))_{j \in I}$.

4. Player $i$ updates its Q-value according to
$$Q_{t+1}^i(s_t, a_t) = (1 - \theta_t)\, Q_t^i(s_t, a_t) + \theta_t \left[ c^i(s_t, a_t) + \gamma\, \mathrm{Nash}^i (Q_t^j(s_{t+1}))_{j \in I} \right].$$

RaNashQL builds upon the algorithm in [29] for the risk-aware case. It is an asynchronous algorithm based on two loops: an outer loop for updating the Q-values (as in the risk-neutral case) and a new inner loop for estimating risk (which is unique to our setting). The details of RaNashQL are given in Algorithm 1, where:
- $N$ is the number of iterations in the outer loop.
- $T$ is the number of iterations in the inner loop.
- $(n, t)$ is an epoch, corresponding to iteration $n$ of the outer loop and iteration $t$ of the inner loop.
- $Q^i_{n,t}(s, a)$ is the Q-value estimate for player $i$ for state-action pair $(s, a) \in K$ in epoch $n \le N$ and $t \le T$.
- $\beta \in (1/2, 1]$ is the learning rate; $\beta = 1$ is a linear learning rate and $\beta \in (1/2, 1)$ is a polynomial learning rate.
- $\{\theta^n_\beta\}_{n \ge 1}$ are the step-sizes for the Q-learning update in the outer loop, for learning rate $\beta$.
- $\{\lambda_{t,\alpha}\}_{t \ge 1}$ are the step-sizes for the risk estimation in the inner loop; we take $\lambda_{t,\alpha} = C t^{-\alpha}$ for $C > 0$ and $\alpha \in (0, 1]$.
- $\tau : [1, n] \to [1, n]$ is a function satisfying $\tau(n) \in [1, n]$.
- $H_Y$ and $H_Z$ are positive parameters.
- $(y^i_{n,t}(s, a), z^i_{n,t}(s, a)) \in Y \times Z$ is the estimate of the saddle point of the minimax problem (5.1) (which represents player $i$'s risk), corresponding to state-action pair $(s, a) \in K$, in epoch $(n, t)$.
- $\pi$ is an exploration policy where, in each state $s \in S$, all multi-actions $a \in A$ have positive probability of being sampled.

Algorithm 1 Risk-aware Nash Q-learning
  Step 0) Initialize: Let $n = 1$ and $t = 1$, and get the initial state $s_1$. Let the learning agent be indexed by $i$. For all $s \in S$, $a \in A$, and $j \in I$, let $Q^j_{n,t}(s, a) = 0$.
  For $n = 1, \ldots, N$ do
    Step 1) Choose $a^n$ based on the exploration policy $\pi$. Observe the actions and costs for all players, then observe a new state;
    For $t = 1, \ldots, T$ do
      Step 2) Compute the Nash Q-values; compute the risk-aware cost-to-go for all players;
      Step 3) Update each $Q^i_{n,t}$, $i \in I$, using stochastic approximation;
      Step 4) Stochastic approximation of the risk measure by SASP;
    end for
  end for
  Return approximated Q-values $Q^i_{N,T}$, $i \in I$.

We give further details on each step of Algorithm 1 as follows. We will shortly require

$$\Pi_{Y \times Z}[(y, z)] := \arg\min_{(y', z') \in Y \times Z} \left\| (y, z) - (y', z') \right\|_2,$$

i.e., the Euclidean projection onto $Y \times Z$.

Step 0: Initialization:

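Step 0a: Initialize all Q-values $Q^i_{1,1}(s, a)$ for all $(s, a) \in K$ and $i \in I$;
Step 0b: Initialize $(y^i_{0,t}(s, a), z^i_{0,t}(s, a))$ for all $t \le T$, $(s, a) \in K$, and $i \in I$.

Step 1: For all $(s, a) \in K$ and $i \in I$, set $(y^i_{n,1}(s, a), z^i_{n,1}(s, a)) = (y^i_{n-1,T}(s, a), z^i_{n-1,T}(s, a))$ and $Q^i_{n,1}(s, a) = Q^i_{n-1,T}(s, a)$.

Step 2: All agents observe the current state $s^n_t$:
Step 2a: Generate an action $a^n$ from policy $\pi$ (which gives some positive probability to all actions);
Step 2b: Observe actions $a^n = (a^{n,i})_{i \in I}$, costs $\{c^i(s^n_t, a^n)\}_{i \in I}$, and next state $s^n_{t+1} \sim P(\cdot | s^n_t, a^n)$.

Step 3: Compute Nash Q-values $v^i_{n-1}(s^n_{t+1}) = \mathrm{Nash}^i (Q^j_{n-1,T}(s^n_{t+1}))_{j \in I}$ for all $i \in I$:
Step 3a: Compute

$$\hat q^i_{n,t}(s^n_t, a^n) = G^i\left( c^i(s^n_t, a^n) + \gamma\, v^i_{n-1}(s^n_{t+1}),\ \bar y^i_{n,t}(s^n_t, a^n),\ \bar z^i_{n,t}(s^n_t, a^n) \right), \tag{5.8}$$

and

$$\left( \bar y^i_{n,t}(s^n_t, a^n),\ \bar z^i_{n,t}(s^n_t, a^n) \right) = \frac{1}{t - \tau(t) + 1} \sum_{\tau = \tau(t)}^{t} \left( y^i_{n,\tau}(s^n_t, a^n),\ z^i_{n,\tau}(s^n_t, a^n) \right), \tag{5.9}$$

for all $i \in I$, where the bars mark the averaged iterates. This step observes a new state and computes the estimated Q-value $\hat q^i_{n,t}$;
Step 3b: Use the $Q^i_{n-1,T}$ at the last stage for estimation rather than recording all the estimated Q-values for each $t \le T$ at stage $n - 1$.

Step 4: For all $(s, a) \in K$ and $i \in I$, compute

$$Q^i_{n,t}(s, a) = \left( 1 - \theta^n_\beta(s, a) \right) Q^i_{n-1,T}(s, a) + \theta^n_\beta(s, a)\, \hat q^i_{n,t}(s^n_t, a^n). \tag{5.10}$$

This update is the same as in standard Q-learning w.r.t. the outer loop.

Step 5: Update

$$\left( y^i_{n,t+1}(s^n_t, a^n),\ z^i_{n,t+1}(s^n_t, a^n) \right) = \Pi_{Y \times Z} \left\{ \left( y^i_{n,t}(s^n_t, a^n),\ z^i_{n,t}(s^n_t, a^n) \right) - \lambda_{t,\alpha}\, \psi\left( c^i(s^n_t, a^n) + \gamma\, v^i_{n-1}(s^n_{t+1}),\ y^i_{n,t}(s^n_t, a^n),\ z^i_{n,t}(s^n_t, a^n) \right) \right\}, \tag{5.11}$$

for all $i \in I$, and

$$\psi\left( c^i(s^n_t, a^n) + \gamma\, v^i_{n-1}(s^n_{t+1}),\ y^i_{n,t}(s^n_t, a^n),\ z^i_{n,t}(s^n_t, a^n) \right) = \begin{pmatrix} H_Y\, G^i_y\left( c^i(s^n_t, a^n) + \gamma\, v^i_{n-1}(s^n_{t+1}),\ y^i_{n,t}(s^n_t, a^n),\ z^i_{n,t}(s^n_t, a^n) \right) \\ -H_Z\, G^i_z\left( c^i(s^n_t, a^n) + \gamma\, v^i_{n-1}(s^n_{t+1}),\ y^i_{n,t}(s^n_t, a^n),\ z^i_{n,t}(s^n_t, a^n) \right) \end{pmatrix}, \tag{5.12}$$

where $G^i_y$ and $G^i_z$ denote subgradients of $G^i$ with respect to $y$ and $z$. This is the risk estimation step; it updates the current iterate of the risk corresponding to each selected state-action pair.

In Steps (5.9), (5.11) and (5.12), we use the stochastic approximation for saddle-point problems (SASP) algorithm [47, Algorithm 2.1]. Classical stochastic approximation may result in extremely slow convergence for degenerate objectives (i.e. when the objective has a singular Hessian).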
However, the SASP algorithm with a properly chosen parameter $\alpha \in (0, 1]$ preserves a reasonable (close to $O(n^{-1/2})$) convergence rate, even when the objective is non-smooth and/or degenerate. Thus, SASP is a robust choice for solving our saddle-point problem (5.1).
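The inner-loop risk estimation can be illustrated on the simplest case, CVaR, where $G(X, y) = y + (1-\alpha)^{-1}\max\{X - y, 0\}$ has no $z$-component, so the saddle-point step reduces to a projected stochastic subgradient step in $y$ followed by iterate averaging. The sketch below is a simplified, single-player, single state-action illustration written under stated assumptions and is not the paper's implementation: the function names, the initial point, the averaging window (the last half of the iterates, standing in for $\tau(t)$), and the uniform cost sample are all hypothetical.

```python
import random


def project(y, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return min(max(y, lo), hi)


def grad_y(x, y, alpha):
    """A subgradient in y of G(x, y) = y + max(x - y, 0) / (1 - alpha)."""
    return 1.0 - (1.0 / (1.0 - alpha) if x > y else 0.0)


def sasp_cvar(samples, alpha, lo, hi, C=1.0, step_exp=0.5, H_Y=1.0):
    """Projected stochastic subgradient iteration with window averaging.

    Returns the averaged iterate and the sample average of G at that point.
    """
    y = lo                                   # assumed initial point
    iterates = []
    for t, x in enumerate(samples, start=1):
        iterates.append(y)
        lam = C * t ** (-step_exp)           # step-size of the form C * t^(-alpha)
        y = project(y - lam * H_Y * grad_y(x, y, alpha), lo, hi)
    window = iterates[len(iterates) // 2:]   # assumed window: last half
    y_bar = sum(window) / len(window)
    risk = sum(y_bar + max(x - y_bar, 0.0) / (1.0 - alpha)
               for x in samples) / len(samples)
    return y_bar, risk


random.seed(0)
draws = [random.uniform(0.0, 10.0) for _ in range(5000)]  # hypothetical costs
y_bar, risk = sasp_cvar(draws, alpha=0.9, lo=0.0, hi=10.0)
```

Because every iterate is projected onto $[lo, hi]$, the averaged iterate stays in the feasible set, and since $G(x, y) \ge x$ for every $y$ the returned risk estimate is never below the sample mean.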

Game 1    Left     Right
Up        0, 1     10, 7
Down      7, 10    11, 8

Game 2    Left     Right
Up        5, 5     10, 4
Down      4, 10    8, 8

Game 3    Left     Right
Up        0, 1     10, 9
Down      7, 10    8, 8

Table 1: Examples of Stage Games

5.2 Almost sure convergence analysis

We now state the main convergence result for RaNashQL. We would like to show convergence of $Q^i_{n,T}$ to the risk-aware equilibrium Q-values $Q^i_*$ for all players. We are interested in the following special types of Nash equilibria, which play a major role in our analysis as in [29].

Definition 5.6. (i) [29, Definition 12] A multi-strategy $x \in X$ is a global optimal point of $(C^i)_{i \in I}$ if every player minimizes his expected cost at $x$: $C^i(x) \le C^i(x')$, $\forall x' \in X$, $i \in I$.
(ii) [29, Definition 13] A multi-strategy $x \in X$ is a saddle point of $(C^i)_{i \in I}$ if (1) it is a Nash equilibrium and (2) each player would receive a lower expected cost if at least one of the other players deviates: $C^i(x^i, x^{-i}) \ge C^i(x^i, u^{-i})$, $\forall u^{-i} \in X^{-i}$, $i \in I$.
(iii) A multi-strategy $x \in X$ is an $I'$-mixed point of $(C^i)_{i \in I}$ if (1) it is a Nash equilibrium and (2) there exists an index set of players $I' \subset I$ such that: $C^i(x) \le C^i(x')$, $\forall x' \in X$, $i \in I'$, and $C^i(x^i, x^{-i}) \ge C^i(x^i, u^{-i})$, $\forall u^{-i} \in X^{-i}$, $i \in I \setminus I'$.

A global optimal point is always a Nash equilibrium, and it can be shown that all global optima have equal values. Additionally, [29, Lemma 14] shows that all saddle points of stage games have equal values. Our definition of $I'$-mixed points is new; it combines the preceding two ideas. In Definition 5.6(iii), a subset of players $I' \subset I$ minimize their expected costs at $x$. The rest of the players, $I \setminus I'$, each would receive a lower expected cost when at least one of the other players deviates.

Example 5.7. We give an example of $I'$-mixed points in Table 1. Player 1 has choices Up and Down, and Player 2 has choices Left and Right. Player 1's losses are the first entry in each cell, and Player 2's are the second. The first game has a unique Nash equilibrium (0, 1), which is a global optimal point. The second game also has a unique Nash equilibrium (8, 8), which is a saddle point. The third game has two Nash equilibria: a global optimum (0, 1), and a mixed point (8, 8). In equilibrium (8, 8), Player 1 receives a lower cost if Player 2 deviates, while Player 2 receives a higher cost if Player 1 deviates.

We introduce the following assumptions for our analysis of RaNashQL.

Assumption 5.8.
One of the following holds for all stage games $(Q^i_{n,t}(s))_{i \in I}$ for all $n$ and $s \in S$ in Algorithm 1.
Condition A. Every $(Q^i_{n,t}(s))_{i \in I}$ for all $n$ and $s \in S$ has a global optimal point.
Condition B. Every $(Q^i_{n,t}(s))_{i \in I}$ for all $n$ and $s \in S$ has a saddle point.
Condition C. For any two stage games $Q^1, Q^2 \in \{(Q^i_{n,t}(s))_{i \in I}\}$ for all $n$ and $s \in S$, we suppose $Q^1$ has an $I_1$-mixed point $x$ and $Q^2$ has an $I_2$-mixed point $\hat x$. Then: (i) for $i \in I_1 \cap (I \setminus I_2)$, $Q^{1,i}(x) \ge Q^{2,i}(\hat x)$; (ii) for $i \in I_2 \cap (I \setminus I_1)$, $Q^{1,i}(x) \le Q^{2,i}(\hat x)$.

Condition C simply states that there are $I'$-mixed points for any two stage games. Compared with [29, Assumption 3], Condition C in Assumption 5.8 enables wider application of RaNashQL, even if the index sets $I_1$ and $I_2$ of all the stage games $(Q^i_{n,t}(s))_{i \in I}$ differ.
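The classification in Example 5.7 can be checked by brute force for the three games of Table 1. The sketch below enumerates pure-strategy Nash equilibria only (mixed equilibria are not searched), then tests whether each equilibrium is a global optimum and, for each player, whether that player's cost is no higher under any deviation by the opponent. The helper names are illustrative and not from the paper.

```python
from itertools import product

# Cost bimatrices from Table 1: entry [row][col] = (cost to Player 1, cost to Player 2).
GAMES = {
    "Game 1": [[(0, 1), (10, 7)], [(7, 10), (11, 8)]],
    "Game 2": [[(5, 5), (10, 4)], [(4, 10), (8, 8)]],
    "Game 3": [[(0, 1), (10, 9)], [(7, 10), (8, 8)]],
}


def pure_nash(game):
    """Return all pure-strategy Nash equilibria (row, col) of a 2x2 cost game."""
    equilibria = []
    for r, c in product(range(2), range(2)):
        p1_ok = all(game[r][c][0] <= game[r2][c][0] for r2 in range(2))
        p2_ok = all(game[r][c][1] <= game[r][c2][1] for c2 in range(2))
        if p1_ok and p2_ok:
            equilibria.append((r, c))
    return equilibria


def is_global_optimum(game, r, c):
    """True if every player attains their minimum cost over all pure profiles."""
    cells = [game[i][j] for i, j in product(range(2), range(2))]
    return all(game[r][c][p] == min(cell[p] for cell in cells) for p in range(2))


def no_worse_if_other_deviates(game, r, c):
    """Per player: cost at (r, c) is at least the cost under any opponent move."""
    p1 = all(game[r][c][0] >= game[r][c2][0] for c2 in range(2))
    p2 = all(game[r][c][1] >= game[r2][c][1] for r2 in range(2))
    return p1, p2


# For each game: list of (equilibrium costs, global optimum?, saddle-type flags).
results = {
    name: [(game[r][c], is_global_optimum(game, r, c),
            no_worse_if_other_deviates(game, r, c))
           for r, c in pure_nash(game)]
    for name, game in GAMES.items()
}
```

In `results`, Game 1 has the single equilibrium (0, 1) flagged as a global optimum, Game 2 has the single equilibrium (8, 8) with both saddle-type flags set, and Game 3 has (0, 1) as a global optimum plus (8, 8) where only Player 1's flag is set, in line with the description in Example 5.7.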

Remark 5.9. (i) Implementation of RaNashQL is complicated by the fact that there might be multiple Nash equilibria for a stage game. In RaNashQL, we choose a unique Nash equilibrium either based on its expected loss, or based on the order it is ranked in a list of solutions. Such an order is determined by the action sequence, which has little to do with the equilibrium conditions.
(ii) For a two-player game, we calculate Nash equilibria using the Lemke-Howson method (see [41]), which can generate equilibria in a certain order.

We now list the necessary definitions and assumptions for our algorithm, most of which are standard in the stochastic approximation literature. We first define a probability space $(\Omega, \mathcal{G}, P)$ where $\mathcal{G} = \sigma\{(s^n_t, a^n),\ n \le N,\ t \le T\}$, and the filtration

$$\mathcal{G}^n_t = \sigma\left\{ \{(s^m_\tau, a^m_\tau),\ m < n,\ \tau \le T\} \cup \{(s^n_\tau, a^n_\tau),\ \tau \le t\} \right\},$$

for $t \le T$ and $n \le N$, with $\mathcal{G}^0_t = \{\emptyset, \Omega\}$ for all $t \le T$. This filtration is nested: $\mathcal{G}^n_t \subset \mathcal{G}^n_{t+1}$ for $1 \le t \le T - 1$ and $\mathcal{G}^n_T \subset \mathcal{G}^{n+1}_0$. The following assumption reflects our exploration requirement.

Assumption 5.10. (ε-greedy policy) There is an $\varepsilon > 0$ such that the policy $\pi$ satisfies, for any $n \le N$, $t \le T$, and all $(s, a) \in K$,

$$P\left( (s^n_t, a^n) = (s, a) \mid \mathcal{G}^n_{t-1} \right) \ge \varepsilon \quad \text{and} \quad P\left( (s^n_1, a^n) = (s, a) \mid \mathcal{G}^{n-1}_T \right) \ge \varepsilon.$$

In particular, let $x \in X$ be a Nash equilibrium of the stage game $(Q^i(s))_{i \in I}$. Then, with probability $\varepsilon \in (0, 1)$, the action $a$ is chosen randomly from $A$, and with probability $1 - \varepsilon$, the action $a$ is drawn from $A$ according to $x$.

Assumption 5.10 guarantees, by the Extended Borel-Cantelli Lemma in [13], that we will visit every state-action pair infinitely often with probability one. This assumption balances exploration and exploitation in RaNashQL (and in Q-learning more generally). The next assumption contains our requirements on the step-sizes for the Q-value updates.

Assumption 5.11. For all $(s, a) \in K$ and for all $n \le N$, $t \le T$, the step-sizes for the Q-value update satisfy: $\sum_{n=1}^{\infty} \theta^n_\beta(s, a) = \infty$ and $\sum_{n=1}^{\infty} \theta^n_\beta(s, a)^2 < \infty$ for all $t \le T$ and $(s, a) \in K$ a.s. Let $\#(s, a, n)$ denote one plus the number of times, until the beginning of iteration $n$, that the state-action pair $(s, a)$ has been visited, and let $N^{s,a}$ denote the set of outer iterations where action $a$ was performed in state $s$. The step-sizes $\theta^n_\beta(s, a)$ satisfy $\theta^n_\beta(s, a) = \frac{1}{[\#(s, a, n)]^\beta}$ if $n \in N^{s,a}$ and $\theta^n_\beta(s, a) = 0$ otherwise.
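The step-size rule in Assumption 5.11 can be sketched with a per-pair visit counter. The following is an illustrative helper, not taken from the paper; the name `make_step_size` and the example state-action labels are assumptions. Pairs that are not visited in an iteration are simply not updated, which corresponds to a step-size of zero.

```python
from collections import defaultdict


def make_step_size(beta):
    """Return a callable giving the asynchronous step-size 1 / #(s, a, n)^beta.

    The counter equals one plus the number of previous visits to (s, a).
    """
    if not 0.5 < beta <= 1.0:
        raise ValueError("beta must lie in (1/2, 1]")
    visits = defaultdict(int)

    def step(state, action):
        visits[(state, action)] += 1
        return 1.0 / visits[(state, action)] ** beta

    return step


theta = make_step_size(beta=1.0)            # linear learning rate
sizes = [theta(0, "a") for _ in range(4)]   # 1, 1/2, 1/3, 1/4
first_other = theta(1, "a")                 # a fresh pair starts again at 1
```

With $\beta = 1$ this is the harmonic schedule, whose sum diverges while the sum of squares converges, as the assumption requires.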
Assumption 5.11 reflects the asynchronous nature of the Q-learning algorithm as stated in [19]: only a single state-action pair is updated when it is observed in each iteration. Our main convergence result for Algorithm 1 is next.

Theorem 5.12. Suppose Assumptions 5.10, 5.11, and 5.8 hold. For any $T \ge 1$, there exists $0 < \gamma \le 1/K_G$ such that Algorithm 1 generates sequences $\{Q^i_{n,T}\}_{n \ge 1}$ such that $Q^i_{n,T} \to Q^i_*$ almost surely as $n \to \infty$ for all $i \in I$.

Proof. (Proof sketch) The complete steps of the proof are presented in Appendix D.
(i) In Lemma D.2, we show that all mixed points of a stage game have equal value. Consequently, in Lemma D.3, we show that the mapping from Q-values to Nash equilibrium (of the stage game) is nonexpansive.
(ii) We show in Lemma D.1 that the Hausdorff distance between the two subdifferentials (corresponding to the representation of $\rho$ in (5.1))
$$S_{n,t} := \left\{ \left( G_y(c + \gamma v_{n-1}, y_{n,t}, z_{n,t}),\ G_z(c + \gamma v_{n-1}, y_{n,t}, z_{n,t}) \right) \right\}$$
and
$$S^*_{n,t} := \left\{ \left( G_y(c + \gamma v_*, y_{n,t}, z_{n,t}),\ G_z(c + \gamma v_*, y_{n,t}, z_{n,t}) \right) \right\}$$
is bounded by a function of $\|Q_{n-1,T} - Q_*\|_2$.
(iii) Based on parts (i) and (ii), we show that the gaps of all the saddle-point estimation problems, $\|(y_{n,t}, z_{n,t}) - (y_{n,*}, z_{n,*})\|_2^2$ and $\|(y_*, z_*) - (y_{n,*}, z_{n,*})\|_2^2$, are bounded by functions of $\|Q_{n-1,T} - Q_*\|_2^2$ and $\|Q_*\|_2^2$, respectively, in Lemmas D.4 and D.5.
(iv) Finally, based on parts (i)-(iii), we establish that the Q-values in RaNashQL are a well-behaved stochastic approximation sequence (see e.g. [19, Definition 7]). Almost sure convergence of $Q^i_{n,T} \to Q^i_*$ as $n \to \infty$ for all $i \in I$ then follows.

Once we have the Q-values $(Q^i_*)_{i \in I}$, we can compute the corresponding risk-aware Markov perfect equilibrium and value functions.

Remark 5.13. We briefly discuss the computational complexity of RaNashQL. RaNashQL needs to maintain $|I|$ Q-values and $|I||S|$ risk estimates (in terms of computing solutions of the corresponding saddle-point problems). In each iteration, RaNashQL updates all $Q^i(s, a)$ for all $(s, a) \in S \times A$ and $i \in I$. Additionally, it updates $(y^i(s, a), z^i(s, a))$ for all $i \in I$ through SASP. The total number of entries in each array $Q^i$ is $|S||A|$. Since RaNashQL has to maintain the Q-values for every player, the total space requirement is $|I||S||A|$. The storage requirement for the risk estimation is similar. Therefore, the storage requirement of RaNashQL in terms of space is linear in the number of states, polynomial in the number of actions, and exponential in the number of players. The algorithm's running time is dominated by the computation of Nash equilibria for the Q-function updates. In general, the complexity of equilibrium computation in matrix games is unknown. As mentioned in the previous section, some commonly used algorithms for two-player games have exponential worst-case bounds, and approximation methods are typically used for n-player games (see [45]).

6 A Queuing Control Application

We apply our techniques to the single server exponential queuing system from [35]. In this packet switched network, packets (blocks of data) are routed between servers over links shared with other traffic. The service rate of each server can be set to different levels and is controlled by a service provider (Player 1). Packets are routed by a programmable physical device, called a router (Player 2). A router dynamically controls the flow of arriving packets into a finite buffer at each server. The rates chosen by the service provider and router depend on the number of packets in the system. In fact, it is to the benefit of a service provider to increase the amount of packets processed in the system. However, such an increase may result in an increase in packet waiting times in the buffer (called latency), and routers are used to reduce packet waiting times.
Thus, the game theoretical nature of the problem arises because the service provider and the router have such competing objectives.

The state space is $\mathcal{S} = \{0, 1, \ldots, S\}$, where $S < \infty$ is the maximum number of packets allowed in the system. Only one packet can be in service at each time, while the remaining packets wait for service in the buffer. The router admits one packet into the system at each time. Every time a state is visited, the service provider and the router simultaneously choose a service rate $\mu > 0$ and an admission rate $\lambda > 0$. Suppose there are $s$ packets in the system and the players choose the action tuple $(\mu, \lambda)$; then the router incurs a holding cost $h(s)$ and a cost $\theta(\mu, \lambda)$ associated with having packets served at rate $\mu$ when it admits packets at rate $\lambda$. If there are no packets in the system, $\theta(\mu, \lambda)$ can be interpreted as the setup cost of the server. These payoffs are modeled as being paid to the service provider, since the players' objectives are in conflict. The service provider, in turn, pays the router $\beta(\mu, \lambda)$, which represents the reward to the router for choosing the rate $\lambda$. It can also be interpreted as the setup cost of the router. The cost functions for multi-action $a = (\mu, \lambda)$ are then:

$$c^1(s, a) := \beta(a) - \theta(a) \quad \text{and} \quad c^2(s, a) := h(s) + \theta(a) - \beta(a).$$

We assume that the time until the admission of a new packet and the next service completion are both exponentially distributed with means $1/\lambda$ and $1/\mu$, respectively. We can, therefore, model the number of packets in the system as a birth and death process with state transition probabilities:

$$P(k \mid s, a) := \begin{cases} \mu/(\lambda + \mu), & 1 \le s < S,\ k = s - 1, \\ \lambda/(\lambda + \mu), & 0 < s \le S - 1,\ k = s + 1, \\ 1, & s = 0,\ k = 1, \\ 1, & s = S,\ k = S - 1. \end{cases}$$

We choose the following parameters for our example: $S = 30$. Each player has the same two available actions in every state:

- router: the first action (denoted $\bar\lambda$) is to admit one packet into the system every 10 time units; the second action (denoted $\underline\lambda$) is to admit one packet every 25 time units.
- service provider: the first action (denoted $\bar\mu$) is to serve one packet every 11 time units; the second action (denoted $\underline\mu$) is to serve a packet every 20 time units.

Holding costs are exponential, $h(s) = a\, b^{\alpha s}$ for $s \ge 1$ with $a = 1.2$, $b = e$, and $\alpha = 0.2$, and $h(0) = 0$. We set costs: $\theta(\bar\mu, \bar\lambda) = \theta(\bar\mu, \underline\lambda) = 110$, $\theta(\underline\mu, \bar\lambda) = \theta(\underline\mu, \underline\lambda) = 90$, $\beta(\bar\mu, \bar\lambda) = 60$, $\beta(\bar\mu, \underline\lambda) = 30$, $\beta(\underline\mu, \bar\lambda) = 20$, and $\beta(\underline\mu, \underline\lambda) = 70$. In this setting, the router pays the service provider more when the service rate is higher. Also, the router receives higher rewards when both players choose higher rates or both choose lower rates. The router receives lower rewards when the admission and service rates do not match.

              Service Provider ($\alpha^1$)    Router ($\alpha^2$)
Scenario 1    0.1                              0.1
Scenario 2    0.2                              0.2
Scenario 3    0.1                              0.2
Scenario 4    0.2                              0.1

Table 2: Risk Tolerance Level α

We conduct three experiments, where all risk-aware players use CVaR. The CVaR for player $i$ is

$$\mathrm{CVaR}_{\alpha^i}(X) := \min_{\eta \in \mathbb{R}} \left\{ \eta + \frac{1}{1 - \alpha^i}\, \mathbb{E}\left[ \max\{X - \eta, 0\} \right] \right\},$$

where $\alpha^i$ is the risk tolerance for player $i$. When implementing RaNashQL, we use the Lemke-Howson method to compute the Nash equilibria of stage games, and we use the first-Nash learning agent to update the Q-values based on the first Nash equilibrium generated. We run our experiments in Matlab R2015a on a computer with an Intel Core i7 2.30GHz processor, 8GB RAM, running the 64-bit Windows 8 operating system.

Experiment I (RaNashQL vs. Nash Q-learning): We compare RaNashQL for risk-aware Markov games with Nash Q-learning in [29] for risk-neutral Markov games, in terms of their convergence rates. Given any precision $\epsilon > 0$, we record the iteration count $n$ until the convergence criterion $\|Q^i_{n,T} - Q^i_*\|_2 \le \epsilon$ is satisfied (where $Q^i_{n,T}$ is replaced by $Q^i_n$ for Nash Q-learning). Here we choose $T = 10$, and we choose $N = 1 \times 10^5$ for RaNashQL and $N = 1 \times 10^6$ for Nash Q-learning, such that both methods have the same total number of iterations.
When $\epsilon$ is extremely small, e.g., $\epsilon = 0.001$, the total numbers of iterations for RaNashQL and Nash Q-learning for the two players are respectively: 983443 (Nash Q-learning, Service provider), 936761 (Nash Q-learning, Router), and 999991 (RaNashQL, Service provider and Router), which are relatively equal. Moreover, Figure 1 shows that the total number of iterations for Nash Q-learning decreases dramatically as the precision $\epsilon$ increases, which reveals that RaNashQL is more computationally expensive than Nash Q-learning in terms of achieving the same convergence criterion.

Figure 2 presents the Markov perfect equilibrium for the risk-neutral and risk-aware cases. It shows the equilibrium shifting when considering the risk-awareness of players. It also shows that both the risk-neutral and risk-aware Markov perfect equilibria are sensitive to perturbations in the service rate, and that risk-aware strategies for both players fluctuate highly with the change of state (number of packets in the queuing system). We also study how the risk tolerance level α (see Table 2) affects the risk-aware Markov perfect equilibrium, which also shows that the risk-aware Markov perfect equilibrium fluctuates with the change of the risk tolerance level of CVaR.

Next, we evaluate the discounted cost under the risk-neutral and risk-aware Markov perfect equilibria in simulation (1000 samples). The risk tolerance levels are selected as $\alpha^1 = \alpha^2 = 0.1$ for the risk-aware (CVaR) method in Table 3. Table 3 shows that considering risk awareness significantly increases the variance of the discounted cost, which is counterintuitive. A possible reason is that risk-aware strategies fluctuate more with the change of state (number of packets in the queuing system) than risk-neutral strategies.
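The ingredients of the queuing model in this section (birth and death transitions, exponential holding cost, and the two players' stage costs) can be sketched as follows. This is an illustrative reading of the model description, not the authors' Matlab code: the function names are assumptions, the holding cost is taken as $h(s) = a\,b^{\alpha s}$ for $s \ge 1$ with $h(0) = 0$, and rates are taken as the reciprocals of the stated admission and service intervals.

```python
import math

S_MAX = 30  # maximum number of packets allowed in the system


def transition_probs(s, mu, lam, s_max=S_MAX):
    """Birth-death transition probabilities {next_state: prob} from state s."""
    if s == 0:
        return {1: 1.0}
    if s == s_max:
        return {s_max - 1: 1.0}
    return {s - 1: mu / (lam + mu), s + 1: lam / (lam + mu)}


def holding_cost(s, a=1.2, b=math.e, alpha=0.2):
    """Exponential holding cost a * b^(alpha * s) for s >= 1, and 0 at s = 0."""
    return 0.0 if s == 0 else a * b ** (alpha * s)


def stage_costs(s, theta, beta):
    """Stage costs (service provider, router) for given theta(a) and beta(a)."""
    return beta - theta, holding_cost(s) + theta - beta


# Example: serve one packet every 11 time units, admit one every 10.
probs = transition_probs(5, mu=1.0 / 11.0, lam=1.0 / 10.0)
c1, c2 = stage_costs(0, theta=110.0, beta=60.0)
```

Note that the two stage costs always sum to the holding cost $h(s)$, so the game is zero-sum apart from the holding cost borne by the router, and the service provider's stage cost $\beta - \theta$ is negative for all the listed parameter values.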

Figure 1: Comparison between NashQL and RaNashQL

Player              Method               Mean      Variance      5%-CVaR    10%-CVaR
Service Provider    Risk-neutral         -22.22    1.4736e-06    -22.22     -22.22
Service Provider    Risk-aware (CVaR)    -77.78    407.84        -69.34     -68.26
Router              Risk-neutral         37.48     7.32          37.94      38.18
Router              Risk-aware (CVaR)    83.68     491.20        86.03      87.54

Table 3: Simulation for Risk-neutral Strategies and Risk-aware Strategies ($\alpha^1 = \alpha^2 = 0.1$)

In this experiment, incorporating risk helps the service provider reduce its mean cost, while increasing the mean cost of the router. Suppose we consider an extreme case when the router is almost a risk-neutral decision maker ($\alpha^1 = 0.1$ and $\alpha^2 = 0.95$), shown in Table 4. Table 4 shows that the mean cost of the service provider (-44.31) is lower than that under the risk-neutral Markov perfect equilibrium (-22.22), and the mean cost of the router (59.64) is lower than that under the risk-aware Markov perfect equilibrium (83.68). This result shows that incorporating risk preferences, or not, can help a decision maker reach a new equilibrium that further reduces his mean cost compared with the cases where both players are either risk-neutral or risk-aware.

Player              Method               Mean      Variance    5%-CVaR    10%-CVaR
Service Provider    Risk-aware (CVaR)    -44.31    266.06      -43.38     -42.70
Router              Risk-aware (CVaR)    59.64     316.71      61.18      62.77

Table 4: Simulation for Risk-aware Strategies ($\alpha^1 = 0.1$, $\alpha^2 = 0.95$)

Figure 2: Comparison of Risk Neutral and Risk-aware Markov Perfect Equilibrium

Experiment II (RaNashQL vs. Multilinear System): In this experiment, we consider a special case where the risk only comes from the ambiguity over the transition kernel (this setting is basically a risk-aware interpretation of [36]). In this special case, we can compute the risk-aware Markov equilibrium using a multilinear system as detailed in Appendix E. We evaluate performance in terms of the relative error

$$\frac{\sum_{s \in S} \left( \mathrm{Nash}^i (Q^j_{n,T}(s))_{j \in I} - v^i_*(s) \right)^2}{\sum_{s \in S} v^i_*(s)^2}, \quad n \le N.$$

In this experiment, we take the risk measure as 10%-CVaR. The multilinear system is solved by an interior point algorithm within $5 \times 10^7$ maximum function evaluations and $1 \times 10^5$ maximum iterations, and it converges to a local optimal solution in 10471.975 seconds. For RaNashQL, we choose $T = 10$ and $N = 2 \times 10^6$, and the total implementation time for RaNashQL is 10245.314 seconds. Figure 3 validates the almost sure convergence of RaNashQL to the service provider's strategy. For the router, the relative error is large (around 190%). One possible reason is that RaNashQL converges to a different equilibrium than the one solved by the multilinear system. We see that RaNashQL possesses far superior computational performance to the interior point algorithm solving a multilinear system, since the relative error of the service provider is within 25% in $1 \times 10^6$ iterations, and the implementation time will be 5122.657 seconds.

Figure 3: Almost Sure Convergence of RaNashQL

Experiment III (Time Complexity): In terms of time complexity, in [31, Theorem 4.7] it is shown that the single-agent version of RaNashQL has convergence rate

$$\Omega\left( \left( \frac{|S||A| \ln\left( |S||A| / (\delta\epsilon) \right)}{\epsilon^2} \right)^{1/\beta} + \left( \ln \frac{|S||A|}{\epsilon} \right)^{1/(1-\beta)} \right), \tag{6.1}$$

with probability $1 - \delta$. In the multi-agent case, we may replace the single-agent action space with the multi-action space $A$ in the term (6.1) to get a rough estimate of the time complexity of RaNashQL (although we cannot claim a theoretical convergence rate). Instead, we validate this conjecture numerically. In this experiment, we choose the precision level $\epsilon \in [0.01, 0.1]$. Figure 4 illustrates that the optimality gaps of the service provider and router, under the number of iterations computed through (6.1), are bounded by $\epsilon$, which confirms that the order (6.1) is an acceptable estimate of the convergence rate of RaNashQL.

7 Conclusion

In this paper, we propose a model for non-cooperative Markov games with time-consistent risk-aware players. This model characterizes the risk from both the stochastic state transitions and the randomized strategies of the other players.
We first propose an appropriate concept for risk-aware Markov perfect equilibrium and then we demonstrate the existence of such risk-aware equilibria in stationary strategies. The existence of equilibria is derived by Kakutani's fixed point theorem. We then analyze the convergence of a simulation-based Q-learning type algorithm for equilibrium computation, where the risk measures have a special form to