Distributed Learning based on Entropy-Driven Game Dynamics

Bruno Gaujal, joint work with Pierre Coucheney and Panayotis Mertikopoulos. Inria, August 2014.

Model

Shared resource systems (networks, processors) can often be modeled by a finite game in which:
- the structure of the game is unknown to each player (e.g. the set of players),
- players only observe the payoff of their chosen action,
- observations of the payoffs may be corrupted by noise.

Learning Algorithm

In the following we consider a finite potential game. Our goal is to design a distributed algorithm (executed by each user) that learns the Nash equilibria. Each user computes its strategy according to a sequential process that should satisfy the following desired properties:
- it only uses in-game information (stateless),
- it does not require time coordination between the users (asynchronous),
- it tolerates outdated measurements in the observations of the payoffs (delay-oblivious),
- it tolerates random perturbations of the observations (robust),
- it converges fast even when the number of users is very large (scalable).

Notations

- k indexes the players; α, β ∈ A are actions; x ∈ X is a mixed strategy profile;
- u_{kα}(x) is the (expected) payoff of player k when playing α and facing x.

Potential Games (Monderer and Shapley, 1996)

There exists a multilinear function U : X → R such that

    u_{kα}(x) − u_{kβ}(x) = U(α; x_{−k}) − U(β; x_{−k}).

Congestion games are potential games, and mechanism design can turn a game (e.g. a general routing game with additive costs) into a potential game.
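
As a quick illustration of the potential property (a toy example of my own, not from the talk), the sketch below checks the Monderer-Shapley condition on every unilateral deviation of a small 2-player, 2-route congestion game; the names payoff and rosenthal_potential are hypothetical. The slide states the condition on mixed profiles via the multilinear extension, but verifying it on pure profiles is equivalent.

```python
import itertools

# Hypothetical 2-player, 2-route congestion game: each player picks a route
# and pays the number of players sharing that route.
def payoff(k, profile):
    load = sum(1 for choice in profile if choice == profile[k])
    return -load

def rosenthal_potential(profile):
    # Rosenthal potential: minus the sum over routes of 1 + 2 + ... + load(route).
    return -sum(sum(range(1, 1 + sum(1 for c in profile if c == r)))
                for r in set(profile))

routes = (0, 1)
for k in range(2):                                       # each player
    for profile in itertools.product(routes, repeat=2):  # each pure profile
        for beta in routes:                              # each unilateral deviation
            deviation = list(profile)
            deviation[k] = beta
            deviation = tuple(deviation)
            assert (payoff(k, profile) - payoff(k, deviation)
                    == rosenthal_potential(profile) - rosenthal_potential(deviation))
print("potential property verified on all unilateral deviations")
```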

Basic Convergence Results in Potential Games

When facing a vector of payoffs u ∈ R^A coming from a potential game, if the players
- successively play a best response, then there is (asymptotic) convergence to a Nash equilibrium;
- play according to the Gibbs map G : R^A → X,

      G_α(u) = e^{γ u_α} / Σ_β e^{γ u_β},

  then there is convergence with high probability (as γ → ∞) to a NE that globally maximizes the potential.
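
The Gibbs map is a temperature-scaled softmax. Below is a minimal, numerically stable sketch (my own, not from the slides) showing how it concentrates on the best response as γ grows.

```python
import numpy as np

def gibbs(u, gamma):
    """Gibbs map G_alpha(u) = exp(gamma*u_alpha) / sum_beta exp(gamma*u_beta).
    Shifting by max(u) leaves the result unchanged but avoids overflow."""
    z = gamma * (np.asarray(u, dtype=float) - np.max(u))
    w = np.exp(z)
    return w / w.sum()

u = np.array([1.0, 2.0, 0.5])
for gamma in (0.1, 1.0, 10.0, 100.0):
    print(gamma, np.round(gibbs(u, gamma), 4))   # mass concentrates on action 1
```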

Basic Convergence Results in Potential Games II

But in both cases:
- each player needs to know its full payoff vector → stateless: no;
- convergence may fail in the presence of noise → robust: no;
- convergence may fail if players do not play one at a time → asynchronous: no;
- convergence may also fail with delayed observations → delay-oblivious: no;
- convergence is slow when the number of players is large → scalable: no.

Outline of the Talk

- A simple learning scheme based on the full payoff vector.
- Its continuous-time limit: entropy-driven game dynamics. The parameter of the scheme acts both as the weight of past information in the aggregation process and as the rationality level of the players.
- By stochastic approximation: two learning schemes based only on the current payoff.

A General Learning Process

Let us consider a single agent repeatedly observing a stream of payoffs u(t). The process has two stages:
1. Score phase. y_α(t) is the score of action α ∈ A, i.e. the agent's assessment of α up to time t. The score phase describes how y_α(t) is updated using the payoffs u_α(s), s ≤ t.
2. Choice phase. Specifies the choice map Q : R^A → X, which prescribes the agent's mixed strategy x ∈ X given his assessment of the actions (the score vector y).

Score Stage

The score is a discounted aggregation of the payoffs:

    y_α(t) = ∫_0^t e^{−T(t−s)} u_α(x(s)) ds,

or, in differential form,

    ẏ_α(t) = u_α(t) − T y_α(t),

where T is the discount factor of the learning scheme:
- T > 0: exponentially more weight on recent observations;
- T = 0: no discounting, the score is the aggregated payoff;
- T < 0: past observations are reinforced.
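
For a discrete-time intuition, here is a sketch (my own choice of step size and payoff stream, not from the slides) of an Euler step of ẏ = u − T y; with a constant payoff stream and T > 0 the score settles at u/T, mirroring the exponentially discounted integral above.

```python
import numpy as np

def score_step(y, u, T, gamma):
    """One Euler step of the score dynamics  y_dot = u - T*y  with step size gamma."""
    return y + gamma * (np.asarray(u, dtype=float) - T * y)

u = np.array([1.0, 0.2, -0.5])          # constant payoff stream
y = np.zeros_like(u)
T, gamma = 0.5, 0.05
for _ in range(2000):
    y = score_step(y, u, T, gamma)
print(np.round(y, 3), "vs u/T =", np.round(u / T, 3))
```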

Choice Stage: Smoothed Best-Response

At each decision time t, the agent chooses a mixed action x ∈ X according to the choice map Q : R^A → X:

    x(t) = Q(y(t)).

We assume that Q(y) is a smoothed best response, i.e. the unique solution of

    maximize  Σ_β x_β y_β − h(x)
    subject to  x ≥ 0,  Σ_β x_β = 1,

where the entropy function h is smooth and strictly convex on X, with |dh(x)| → ∞ whenever x approaches bd(X).
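
The choice map can be evaluated numerically for any admissible entropy. The sketch below (mine, relying on scipy.optimize; the function name smoothed_best_response is hypothetical) solves the maximization directly on the simplex and, for the Gibbs entropy h(x) = Σ_α x_α log x_α, recovers the softmax closed form.

```python
import numpy as np
from scipy.optimize import minimize

def smoothed_best_response(y, h):
    """Solve  max_x <x, y> - h(x)  over the probability simplex."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    res = minimize(
        lambda x: -(x @ y) + h(x),
        np.full(n, 1.0 / n),                          # start from the barycenter
        method="SLSQP",
        bounds=[(1e-12, 1.0)] * n,                    # keep x strictly positive
        constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}],
    )
    return res.x

gibbs_entropy = lambda x: float(np.sum(x * np.log(x)))
y = np.array([1.0, 2.0, 0.5])
x = smoothed_best_response(y, gibbs_entropy)
print(np.round(x, 4), "softmax:", np.round(np.exp(y) / np.exp(y).sum(), 4))
```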

Entropy-Driven Learning Dynamics

If the agents' payoffs come from a finite game, the continuous-time learning dynamics read:

Score Dynamics

    ẏ_{kα} = u_{kα}(x) − T y_{kα},    x_k = Q_k(y_k),

where Q_k is the choice map of the driving entropy h_k : X_k → R of player k, and T is the discounting parameter. This is the score-based formulation of the dynamics.

Strategy-Based Formulation of the Dynamics

Let us assume that h is decomposable: h(x) = Σ_α θ(x_α). Setting

    Θ(x) = [ Σ_β 1/θ''(x_β) ]^{−1},

a little algebra yields the

Strategy Dynamics

    ẋ_α = (1/θ''(x_α)) [ u_α(x) − Θ(x) Σ_β u_β(x)/θ''(x_β) ]
          − T (1/θ''(x_α)) [ θ'(x_α) − Θ(x) Σ_β θ'(x_β)/θ''(x_β) ].

Boltzmann-Gibbs Entropy

Let h(x) = Σ_α x_α log x_α. Then the dynamics become

    ẏ_α = u_α(x) − T y_α,    x = Gibbs(y),

and

    ẋ_α = x_α [ u_α(x) − Σ_β x_β u_β(x) ] − T x_α [ log x_α − Σ_β x_β log x_β ],

i.e. the replicator dynamics plus a T-weighted entropy-adjustment term.
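
A minimal simulation sketch (mine: a single agent facing a fixed payoff vector rather than a full game, Euler integration, my own parameter values): for T > 0 the trajectory of the entropy-adjusted replicator dynamics should approach the Gibbs point softmax(u/T).

```python
import numpy as np

def adjusted_replicator_rhs(x, u, T):
    """x_dot_a = x_a [u_a - <x,u>] - T x_a [log x_a - <x, log x>]."""
    return x * (u - x @ u) - T * x * (np.log(x) - x @ np.log(x))

u = np.array([3.0, 1.0, 0.0])
T, dt = 1.0, 0.01
x = np.full(3, 1.0 / 3.0)
for _ in range(5000):
    x = x + dt * adjusted_replicator_rhs(x, u, T)
    x = np.clip(x, 1e-12, None)        # numerical guard against drift
    x = x / x.sum()
w = np.exp(u / T)
print("trajectory end:", np.round(x, 4), " Gibbs point:", np.round(w / w.sum(), 4))
```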

Rest Points of the Dynamics

Proposition
1. For T = 0, the rest points of the entropy-driven dynamics are the restricted Nash equilibria of the game.
2. For T > 0, the rest points coincide with the restricted QRE (the solutions of x = Q(u/T)), which converge to restricted NE as T → 0.
3. For T < 0, the rest points are the restricted QRE of the opposite game (opposite payoffs).

The parameter T thus plays a double role: it reflects the importance given by the players to past events (the discount factor), and it determines the rationality level of the rest points.

Convergence for Potential Games

Lemma
Let U : X → R be the potential of a finite potential game. Then F(x) := U(x) − T h(x) is a strict Lyapunov function for the entropy-driven dynamics.

- The dynamics tend to move towards the maximizers of F for every value of T;
- the parameter T modifies the set of maximizers.

Proposition
Let x(t) be a trajectory of the entropy-driven dynamics for a potential game. Then:
1. For T > 0, x(t) converges to a QRE.
2. For T = 0, x(t) converges to a restricted Nash equilibrium.
3. For T < 0, x(t) converges to a vertex or is stationary.

[Figure: phase portraits of the penalty-adjusted (temperature-adjusted) replicator dynamics in a 2×2 potential game with payoffs (3,1), (0,0) / (0,0), (1,3), for four values of the parameter T.]

From Continuous-Time Dynamics to a Distributed Algorithm

Aim: a discretization of the entropy-driven dynamics with the same convergence properties as the continuous dynamics.

Main difficulty: a distributed algorithm should only use random samples of the payoff u(x), i.e. u(A) where A is a random action profile with distribution x.

We derive two stochastic approximations of the dynamics, one with score updates and one with strategy updates.

The Stochastic Approximation Tool

We will consider the random process

    Z(n+1) = Z(n) + γ_{n+1} ( f(Z(n)) + V(n+1) )

as a stochastic approximation of the ODE

    ż = f(z),

where
- (γ_n) is the sequence of step sizes, assumed to be square-summable but not summable (Σ_n γ_n = ∞, Σ_n γ_n² < ∞; typically γ_n = 1/n),
- E[ V(n+1) | F_n ] = 0.
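
A generic Robbins-Monro iteration matching this template (an illustrative sketch with a toy drift f and Gaussian noise of my own choosing, not the algorithm of the talk):

```python
import numpy as np

def stochastic_approximation(f, z0, n_steps, noise_std=1.0, seed=0):
    """Iterate  Z(n+1) = Z(n) + gamma_{n+1} (f(Z(n)) + V(n+1))  with gamma_n = 1/n."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    for n in range(1, n_steps + 1):
        gamma = 1.0 / n
        noise = rng.normal(0.0, noise_std, size=z.shape)   # zero-mean disturbance
        z = z + gamma * (f(z) + noise)
    return z

# Toy ODE z_dot = -(z - 1), with strict Lyapunov function |z - 1|^2:
# the iterates should track the ODE and settle near z* = 1 despite the noise.
print(stochastic_approximation(lambda z: -(z - 1.0), np.zeros(3), 200_000))
```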

The Stochastic Approximation Tool

Theorem (Benaïm, 1999)
Assume that
- the ODE admits a strict Lyapunov function (plus a weak condition on the Lyapunov function),
- sup_n E[ ‖V(n)‖² ] < ∞,
- sup_n ‖Z(n)‖ < ∞ a.s.
Then the set of accumulation points of the sequence (Z(n)) generated by the stochastic approximation of the ODE ż = f(z) almost surely belongs to a connected invariant set of the ODE.

Score-Based Learning with Imperfect Payoff Monitoring

Recall the score-based version of the dynamics:

    ẏ_{kα} = u_{kα}(x) − T y_{kα},    x_k = Q_k(y_k).

A stochastic approximation yields the following algorithm.

Score-Based Algorithm
1. At stage n+1, each player k selects an action α_k(n+1) ∈ A_k based on its mixed strategy X_k(n) ∈ X_k.
2. Every player gets bounded and unbiased estimates û_{kα}(n+1) such that
   2.1 E[ û_{kα}(n+1) | F_n ] = u_{kα}(X(n)),
   2.2 |û_{kα}(n+1)| ≤ C (a.s.).
3. For each action α, the score is updated:
   Y_{kα}(n+1) ← Y_{kα}(n) + γ_n ( û_{kα}(n+1) − T Y_{kα}(n) ).
4. Each player updates its mixed strategy: X_k(n+1) ← Q_k(Y_k(n+1));
and the process repeats ad infinitum.
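
Here is a simulation sketch of the score-based algorithm on the 2×2 coordination game used in the slides' figures (payoffs (3,1), (0,0) / (0,0), (1,3)); the Gibbs entropy is assumed for the choice map, and the noise model, step-size exponent and value of T are my own choices.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[3.0, 0.0], [0.0, 1.0]])   # row player's payoffs
B = np.array([[1.0, 0.0], [0.0, 3.0]])   # column player's payoffs

def softmax(y):
    w = np.exp(y - y.max())
    return w / w.sum()

T, noise, n_steps = 0.2, 0.5, 50_000
Y = [np.zeros(2), np.zeros(2)]           # score vectors
X = [softmax(Y[0]), softmax(Y[1])]       # mixed strategies

for n in range(1, n_steps + 1):
    gamma = n ** -0.6                    # any l2-but-not-l1 summable step size works
    # Bounded, unbiased estimates of the full payoff vectors u_k(X(n)).
    # (In this variant each player is assumed to observe its whole payoff
    # vector, so sampling the actions themselves is not needed here.)
    u_hat = [A @ X[1] + rng.uniform(-noise, noise, 2),
             B.T @ X[0] + rng.uniform(-noise, noise, 2)]
    for k in range(2):
        Y[k] = Y[k] + gamma * (u_hat[k] - T * Y[k])   # score update
        X[k] = softmax(Y[k])                          # Gibbs choice map

print("limit strategies:", np.round(X[0], 3), np.round(X[1], 3))
```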

Score-Based Algorithm Assessment

Theorem
The Score-Based Algorithm converges (a.s.) to a connected set of QRE of G with rationality parameter 1/T.

- Each player needs to know its full payoff vector → stateless: no.
- Convergence to QRE in the presence of noise → robust: yes.
- Players play simultaneously → asynchronous: no.
- Convergence may fail with delayed observations → delay-oblivious: no.
- Convergence is fast when the number of players is large → scalable: yes (based on numerical evidence).

Score-Based Learning Using the In-Game Payoff

Assume now that the only information at the players' disposal is the payoff of their chosen actions (possibly perturbed by some random noise process):

    û_k(n+1) = u_k(α_1(n+1), ..., α_N(n+1)) + ξ_k(n+1),

where the noise ξ_k is a bounded, F-adapted martingale difference (i.e. E[ ξ_k(n+1) | F_n ] = 0 and |ξ_k| ≤ C for some C > 0) with ξ_k(n+1) independent of α_k(n+1).

We can then use the unbiased estimator

    û_{kα}(n+1) = û_k(n+1) · 1(α_k(n+1) = α) / P(α_k(n+1) = α | F_n)
                = û_k(n+1) · 1(α_k(n+1) = α) / X_{kα}(n).
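
A quick numerical check of this importance-weighted estimator (my own snippet, with a fixed expected-payoff vector standing in for u_k(X(n))): only the chosen action's entry is non-zero, rescaled by its selection probability, and the empirical mean matches the true payoff vector.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_payoff_vector(x_k, realized_payoff, chosen_action):
    """u_hat_{k,a} = 1(a_k = a) * u_hat_k / x_{k,a}."""
    u_hat = np.zeros_like(x_k)
    u_hat[chosen_action] = realized_payoff / x_k[chosen_action]
    return u_hat

u_true = np.array([2.0, -1.0, 0.5])      # stand-in for the expected payoffs
x_k = np.array([0.5, 0.3, 0.2])          # current mixed strategy
total, n = np.zeros(3), 100_000
for _ in range(n):
    a = rng.choice(3, p=x_k)                          # in-game action
    observed = u_true[a] + rng.normal(0.0, 0.1)       # noisy in-game payoff
    total += estimate_payoff_vector(x_k, observed, a)
print("empirical mean:", np.round(total / n, 3), " true:", u_true)
```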

Score-Based Learning Using the In-Game Payoff (II)

This allows us to replace the inner action-sweeping loop of the Score-Based Algorithm with the single update, for the chosen action α = α_k(n+1) only:

    Y_{kα} ← Y_{kα} + γ_n ( û_k − T Y_{kα} ) / X_{kα}.

This algorithm works well in practice (it converges to a QRE in potential games whenever T > 0). But both conditions
1. sup_n E[ ‖V(n)‖² ] < ∞,
2. sup_n ‖Z(n)‖ < ∞ a.s.,
are hard (impossible?) to prove (in fact condition 2 implies condition 1).

Strategy-Based Algorithm

The strategy-based version of the dynamics is

    ẋ_α = (1/θ''(x_α)) [ u_α(x) − Θ(x_k) Σ_β u_β(x)/θ''(x_β) ] − T g_{α,θ}(x_k),

and a stochastic approximation of it is

    X_{kα} += γ_n [ (u_k(A) / X_{kA_k}) (1/θ''(X_{kα})) ( 1(A_k = α) − Θ(X_k)/θ''(X_{kA_k}) ) − T g_{α,θ}(X_k) ],

where A is the randomly chosen action profile with distribution X.
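
For the Gibbs entropy θ(x) = x log x one has 1/θ''(x) = x and Θ(x) = 1, and the general update collapses to the simple form used in Algorithm 1 below. Here is a sketch of that special case (the function name and example values are mine).

```python
import numpy as np

def strategy_update_gibbs(X_k, chosen_action, realized_payoff, gamma, T):
    """One strategy-based update for theta(x) = x log x:
    X_ka += gamma * [ (1(a_k = a) - X_ka) * u_hat_k
                      - T * X_ka * (log X_ka - sum_b X_kb log X_kb) ]."""
    indicator = np.zeros_like(X_k)
    indicator[chosen_action] = 1.0
    entropy_term = X_k * (np.log(X_k) - X_k @ np.log(X_k))
    return X_k + gamma * ((indicator - X_k) * realized_payoff - T * entropy_term)

# Single update from the uniform strategy after observing payoff 1.0 for action 0;
# note that the update keeps the strategy on the simplex (the increments sum to 0).
X_k = np.full(4, 0.25)
print(np.round(strategy_update_gibbs(X_k, 0, 1.0, gamma=0.1, T=0.1), 4))
```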

Algorithm Assessment

Theorem
When T > 0, the sequence (X(n)) generated by the strategy-based algorithm remains positive and converges almost surely to a connected set of QRE of G with rationality level 1/T.

- Each player only observes its in-game payoff → stateless: yes.
- Convergence to QRE in the presence of noise → robust: yes.
- Players play simultaneously → asynchronous: no.
- No delays in the observations → delay-oblivious: no.
- Convergence is fast when the number of players is large → scalable: yes (based on numerical evidence).

Asynchronous Version of the Algorithm

Using extensions of the stochastic approximation theorem (Borkar, 2008), convergence to QRE still holds for the asynchronous and delayed version of the algorithm. Let d_{j,k}(n) be the (integer-valued) lag between player j and player k when k plays at stage n. The observed payoff û_k(n+1) of k at stage n+1 depends on his opponents' past actions:

    E[ û_k(n+1) | F_n ] = u_k( X_1(n − d_{1,k}(n)), ..., X_k(n), ..., X_N(n − d_{N,k}(n)) ).

Asynchronous Version of the Algorithm

Algorithm 1: Strategy-based learning for one player (with asynchronous and imperfect in-game observations)

    n ← 0;
    initialize X_k ∈ rel int(X_k) as a mixed strategy with full support;
    repeat
        wait until the next play occurs, at time τ_k(n+1) ∈ T;        # local time clock
        n ← n + 1;
        select a new action α_k according to the mixed strategy X_k;  # current action
        observe û_k;                                                  # receive measured payoff
        foreach action α ∈ A_k do                                     # update strategy
            X_{kα} += γ_n [ (1(α_k = α) − X_{kα}) û_k − T X_{kα} ( log X_{kα} − Σ_β X_{kβ} log X_{kβ} ) ]
    until a termination criterion is reached;
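
Below is a self-contained simulation sketch of this asynchronous, delayed, in-game-payoff variant on a toy N-player, two-resource congestion game; the game, wake-up probabilities, delay model and noise level are my own choices, and only the per-player update rule follows Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(2)

N, R = 20, 2                 # players and resources
T = 0.05                     # rationality / discount parameter
p_update = 0.3               # probability that a player wakes up at a given tick
max_delay = 5                # opponents' actions may be observed with a lag

X = np.full((N, R), 1.0 / R)                                   # full-support strategies
history = rng.integers(0, R, size=(max_delay + 1, N))          # ring buffer of profiles
steps = np.zeros(N, dtype=int)                                 # local clocks

def payoff(k, profile):
    load = np.sum(profile == profile[k])
    return (N - load) / N                  # congestion game: less load, higher payoff

for t in range(20_000):
    awake = rng.random(N) < p_update
    current = history[t % (max_delay + 1)].copy()
    current[awake] = [rng.choice(R, p=X[k]) for k in np.flatnonzero(awake)]
    history[(t + 1) % (max_delay + 1)] = current

    for k in np.flatnonzero(awake):
        steps[k] += 1
        gamma = steps[k] ** -0.6
        # payoff measured against a randomly delayed (outdated) opponent profile
        lag = rng.integers(0, max_delay + 1)
        delayed = history[(t + 1 - lag) % (max_delay + 1)].copy()
        delayed[k] = current[k]                                # own action is current
        u_hat = payoff(k, delayed) + rng.normal(0.0, 0.05)     # noisy in-game payoff
        indicator = np.zeros(R)
        indicator[current[k]] = 1.0
        entropy_term = X[k] * (np.log(X[k]) - X[k] @ np.log(X[k]))
        X[k] = X[k] + gamma * ((indicator - X[k]) * u_hat - T * entropy_term)
        X[k] = np.clip(X[k], 1e-9, None)
        X[k] = X[k] / X[k].sum()                               # numerical guard

print("expected load per resource:", np.round(X.sum(axis=0), 2))
```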

Algorithm Assessment

- Each player only observes its in-game payoff → stateless: yes.
- Convergence to QRE in the presence of noise → robust: yes.
- Players play according to local timers → asynchronous: yes.
- Convergence with delayed observations → delay-oblivious: yes.
- Convergence is fast when the number of players is large → scalable: yes (based on numerical evidence).

Numerical Experiments

[Figure: snapshots of the strategy distribution in the 2×2 potential game with payoffs (3,1), (0,0) / (0,0), (1,3): (a) initial distribution (n = 0), (b) n = 2, (c) n = 5, (d) n = 10.]

Thank you!