Optimal Convergence in Multi-Agent MDPs


Peter Vrancx (1), Katja Verbeeck (2), and Ann Nowé (1)

(1) Computational Modeling Lab, Vrije Universiteit Brussel, {pvrancx, ann.nowe}@vub.ac.be
(2) MICC-IKAT, Maastricht University, k.verbeeck@micc.unimaas.nl

Funded by a Ph.D. grant of the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT Vlaanderen).
Funded by the Interactive Collaborative Information Systems (ICIS) project, supported by the Dutch Ministry of Economic Affairs, nr: BSIK03024 (The Netherlands).

Abstract. Learning Automata (LA) were recently shown to be valuable tools for designing multi-agent reinforcement learning algorithms. One of the principal contributions of LA theory is that a set of decentralized, independent learning automata is able to control a finite Markov chain with unknown transition probabilities and rewards. We extend this result to the framework of multi-agent MDPs, a straightforward extension of single-agent MDPs to distributed cooperative multi-agent decision problems. Furthermore, we combine this result with the application of parametrized learning automata, yielding global optimal convergence results.

1 Introduction

Reinforcement learning was originally developed for Markov Decision Problems (MDPs) [8]. However, the MDP model does not allow multiple agents to act in the same environment. A straightforward extension of the MDP model to the multi-agent case is given by the framework of Markov games [5]. In a Markov game, actions are the result of the joint action selection of all agents, and rewards and state transitions depend on these joint actions. In this paper we assume that all agents share the same reward function, i.e. we consider the so-called team games or multi-agent MDPs (MMDPs). In this case the game is purely cooperative and all agents should learn how to find, and agree on, the same optimal policy.

In addition, the agents face the problem of incomplete information with respect to the action choice. One can assume that the agents get information about their own choice of action as well as that of the others. This is the case in what is called joint action learning, a popular way to address multi-agent learning [5, 3, 2, 4]. In contrast, independent agents only know their own action. The latter is often a more realistic assumption, since distributed multi-agent applications are typically subject to limitations such as partial or non-observability, communication costs, asynchronism and stochasticity.

Since learning automata work strictly on the basis of the response of the environment, and not on the basis of any knowledge regarding other automata (i.e. neither their strategies nor their feedback), they are very well suited as independent multi-agent learners. In this paper we focus on how learning automata (LA) can tackle the problem of learning MMDPs. One of the principal contributions of LA theory is that a set of decentralized learning automata is able to control a finite Markov chain with unknown transition probabilities and rewards. This result can be extended to MMDPs, using a simple extension of the original LA network provided in [6]. A learning automaton is put in each state for every agent. The problem can then be approached from two perspectives. First, as a multi-agent game, in which each agent is represented by all the automata it is associated with in every state. Alternatively, the problem can be viewed as an LA game, i.e. the view in which each automaton itself represents an agent. Both views can be shown to share the same pure equilibrium points, and as such existing LA theorems on LA games and parametrized LA guarantee convergence to globally optimal points of the MMDP.

This paper is organised as follows. First, we start with the definition of MDPs, MMDPs and basic LA theory. Next, the LA models for learning MDPs and MMDPs, respectively, are given. Theoretical convergence guarantees are discussed and a simple example is added.

2 Definitions

2.1 Definition of an MDP

The problem of controlling a finite Markov chain with unknown transition probabilities and rewards, called a Markov Decision Problem (MDP), can be stated as follows. Let S = {s_1, ..., s_N} be the state space of a finite Markov chain {x_l}_{l ≥ 0} and A^i = {a^i_1, ..., a^i_{r_i}} the action set available in state s_i. Each starting state s_i, action choice a^i in A^i and ending state s_j has an associated transition probability T_{ij}(a^i) and reward R_{ij}(a^i). The overall goal is to learn a policy α, or a set of actions, α = (a^1, ..., a^N) with a^j in A^j, so that the expected average reward J(α) is maximized:

    J(α) = lim_{l→∞} (1/l) E[ Σ_{t=0}^{l-1} R_{x(t)x(t+1)}(α) ]    (1)

The policies we consider are limited to stationary, nonrandomized policies. Under the assumption that the Markov chain corresponding to each policy α is ergodic, it can be shown that the best strategy in any state is a pure strategy, independent of the time at which the state is occupied [11]. A Markov chain {x_n}_{n ≥ 0} is said to be ergodic when the distribution of the chain converges to a limiting distribution π(α) = (π_1(α), ..., π_N(α)) with π_i(α) > 0 for all i, as n → ∞. Thus, there are no transient states and the limiting distribution π(α) can be used to rewrite Equation 1 as:

    J(α) = Σ_{i=1}^{N} π_i(α) Σ_{j=1}^{N} T_{ij}(α) R_{ij}(α)    (2)

2.2 Definition of MMDPs

An extension of single-agent Markov decision problems (MDPs) to the cooperative multi-agent case is given by multi-agent MDPs (MMDPs) [1]. In an MMDP, actions are the joint result of multiple agents choosing an action separately. Now A^i_k = {a^i_{k1}, ..., a^i_{k r^i_k}} is the action set available in state s_i for agent k, with k : 1...n, n being the total number of agents present in the system. Transition probabilities T_{ij}(a^i) and rewards R_{ij}(a^i) now depend on a starting state s_i, an ending state s_j and a joint action taken in state s_i, i.e. a^i = (a^i_1, ..., a^i_n) with a^i_k in A^i_k. Since the agents' individual action choices may be jointly suboptimal, the added problem in MMDPs is for the agents to learn to coordinate their actions so that joint optimality is achieved.

The value of a joint policy α = (a^1, ..., a^N), with a^i a joint action of state s_i in A^i_1 × ... × A^i_n, can still be defined by Equation 1. Under the same assumption as above, i.e. that the Markov chain corresponding to each joint policy α is ergodic, it is sufficient to consider only joint policies in which the agents choose pure strategies. Moreover, under this assumption the expected average reward of a joint policy α can also be expressed by Equation 2.

3 Learning Automata

A learning automaton describes the internal state of an agent as a probability distribution according to which actions are chosen [9]. These probabilities are adjusted by some reinforcement scheme according to the success or failure of the actions taken. An LA is defined by a quadruple {A, β, p, T} for which A is the action or output set {a_1, a_2, ..., a_r} of the automaton, β is a random variable in the interval [0, 1], p is the vector of the automaton's (i.e. the agent's) action probabilities, and T denotes the update scheme. An important update scheme is the linear reward-penalty scheme. The philosophy is essentially to increase the probability of an action when it results in a success and to decrease it when the response is a failure. The general algorithm is given by:

    p_m(t+1) = p_m(t) + λ_1 (1 - β(t)) (1 - p_m(t)) - λ_2 β(t) p_m(t)                (3)
               if a_m is the action taken at time t

    p_j(t+1) = p_j(t) - λ_1 (1 - β(t)) p_j(t) + λ_2 β(t) [ (r - 1)^{-1} - p_j(t) ]   (4)
               if a_j ≠ a_m

The constants λ_1 and λ_2 are the reward and penalty parameters respectively. When λ_1 = λ_2 the algorithm is referred to as linear reward-penalty (L_{R-P}), when λ_2 = 0 it is referred to as linear reward-inaction (L_{R-I}), and when λ_2 is small compared to λ_1 it is called linear reward-ε-penalty (L_{R-εP}).
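To make the update scheme concrete, the following is a minimal Python sketch of the update in Equations 3 and 4; setting λ_2 = 0 gives the reward-inaction (L_{R-I}) scheme used later in the paper. The class name and interface are illustrative, not taken from the paper.

    import random

    class LinearRewardPenaltyLA:
        """Learning automaton with the linear reward-penalty update (Eqs. 3-4).

        lambda2 = 0 gives the reward-inaction (L_R-I) scheme;
        lambda2 << lambda1 gives reward-epsilon-penalty (L_R-epsilonP).
        """

        def __init__(self, n_actions, lambda1=0.05, lambda2=0.0):
            self.p = [1.0 / n_actions] * n_actions   # action probability vector
            self.lambda1 = lambda1
            self.lambda2 = lambda2

        def choose(self):
            # Sample an action index according to the probability vector p.
            return random.choices(range(len(self.p)), weights=self.p)[0]

        def update(self, action, beta):
            """Update p for the chosen action and response beta in [0, 1], per Eqs. (3)-(4)."""
            r = len(self.p)
            for j in range(r):
                if j == action:
                    # Eq. (3): update for the chosen action.
                    self.p[j] += (self.lambda1 * (1 - beta) * (1 - self.p[j])
                                  - self.lambda2 * beta * self.p[j])
                else:
                    # Eq. (4): update for every other action.
                    self.p[j] += (-self.lambda1 * (1 - beta) * self.p[j]
                                  + self.lambda2 * beta * (1.0 / (r - 1) - self.p[j]))

Note that these updates leave the probability vector normalized: the mass added to the chosen action by Equation 3 is exactly the mass removed from the other actions by Equation 4, and vice versa for the penalty terms.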

3.1 Learning Automata Games

A play a(t) of n automata is a set of strategies chosen by the automata at stage t, such that a_j(t) is an element of the action set of the jth automaton. Correspondingly, the outcome is now also a vector β(t) = (β_1(t), ..., β_n(t)). At every time step all automata update their probability distributions based on the responses of the environment. Each automaton participating in the game operates without information concerning the number of other participants, their strategies, actions or payoffs. The following result was proved:

Theorem 1 ([7]). When the automata game is repeatedly played with each player making use of the L_{R-I} scheme with a sufficiently small step size, local convergence towards pure Nash equilibria is established.

3.2 Parameterized Learning Automata

Parameterized Learning Automata (PLA) keep an internal state vector u of real numbers, which is not necessarily a probability vector. The probabilities of the various actions are then generated from this vector u by a probability generating function g : R^M × A → [0, 1]. This allows for a richer update mechanism that uses a random perturbation term in the update scheme, based on ideas similar to simulated annealing. It can be shown that, thanks to these perturbations, PLA are able to converge to a globally optimal solution in team games and certain feedforward network systems. When the automaton receives a feedback r(t), it updates the parameter vector u instead of directly modifying the probabilities. In this paper we use the following update rule, proposed by Thathachar and Phansalkar [9]:

    u_i(t+1) = u_i(t) + b r(t) (∂ln g / ∂u_i)(u(t), α(t)) + b h'(u_i(t)) + b s_i(t)    (5)

with:

    h(x) = -K(x - L)^{2n}    if x ≥ L
         = 0                 if |x| ≤ L                                                (6)
         = -K(x + L)^{2n}    if x ≤ -L

where h'(x) is the derivative of h(x), {s_i(t) : t ≥ 0} is a set of i.i.d. random variables with zero mean and variance σ^2, b is the learning parameter, σ and K are positive constants, and n is a positive integer. In this update rule, the second term is a gradient-following term, the third term is used to keep the solutions bounded with |u_i| ≤ L, and the final term is a random term that allows the algorithm to escape local optima that are not globally optimal. In [9] the authors show that the algorithm converges weakly to the solution of a Langevin equation, which globally maximizes the appropriate function.
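As a rough illustration of the PLA update in Equation 5, the sketch below uses a softmax (Boltzmann) probability generating function g(u, a) = exp(u_a) / Σ_b exp(u_b). This choice of g is an assumption on our part (the paper does not fix a particular g here); it makes ∂ln g/∂u_i take the simple closed form 1{i = a} - g(u, i). Default parameter values follow the experimental PLA settings reported later in the paper (K = 1.0, n = 1, L = 1.5, b = 0.04, σ = 0.1); all names are illustrative.

    import math
    import random

    class ParameterizedLA:
        """Parameterized learning automaton (PLA) following the update of Eq. (5),
        with an (assumed) softmax probability generating function."""

        def __init__(self, n_actions, b=0.04, K=1.0, n=1, L=1.5, sigma=0.1):
            self.u = [0.0] * n_actions   # internal parameter vector, not a probability vector
            self.b, self.K, self.n, self.L, self.sigma = b, K, n, L, sigma

        def probabilities(self):
            m = max(self.u)
            e = [math.exp(ui - m) for ui in self.u]   # numerically stable softmax
            z = sum(e)
            return [ei / z for ei in e]

        def choose(self):
            return random.choices(range(len(self.u)), weights=self.probabilities())[0]

        def _h_prime(self, x):
            # Derivative of the bounding function h of Eq. (6): zero inside [-L, L],
            # pulls the parameter back towards the interval outside of it.
            if x > self.L:
                return -2 * self.n * self.K * (x - self.L) ** (2 * self.n - 1)
            if x < -self.L:
                return -2 * self.n * self.K * (x + self.L) ** (2 * self.n - 1)
            return 0.0

        def update(self, action, reward):
            """Apply Eq. (5): gradient term + bounding term + random perturbation."""
            probs = self.probabilities()
            for i in range(len(self.u)):
                dlng = (1.0 if i == action else 0.0) - probs[i]   # d ln g / d u_i for softmax
                noise = random.gauss(0.0, self.sigma)
                self.u[i] += (self.b * reward * dlng
                              + self.b * self._h_prime(self.u[i])
                              + self.b * noise)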

4 Learning in finite MDPs

The problem of controlling a Markov chain can be formulated as a network of automata in which control passes from one automaton to another. In this set-up, every action state (i.e. every state in which more than one action is available) in the Markov chain has an LA that tries to learn the optimal action probabilities in that state using learning scheme (3,4). Only one LA is active at each time step, and the transition to the next state triggers the LA of that state to become active and take some action. The automaton LA^i active in state s_i is not informed of the one-step reward R_{ij}(a^i) resulting from choosing action a^i in A^i in s_i and leading to state s_j. When state s_i is visited again, LA^i receives two pieces of data: the cumulative reward generated by the process up to the current time step and the current global time. From these, LA^i computes the incremental reward generated since this last visit and the corresponding elapsed global time. The environment response, i.e. the input to LA^i, is then taken to be:

    β^i(t_i + 1) = ρ^i(t_i + 1) / η^i(t_i + 1)    (7)

where ρ^i(t_i + 1) is the cumulative total reward generated for action a^i in state s_i and η^i(t_i + 1) the cumulative total time elapsed. The authors in [11] denote updating scheme (3,4) with environment response (7) as learning scheme T1. The following result was proved:

Theorem 2 (Wheeler and Narendra, 1986). Let an automaton LA^i with r_i actions, using learning scheme T1, be associated with each action state s_i of an N-state Markov chain. Assume that the Markov chain corresponding to each policy α is ergodic. Then the decentralized adaptation of the LA is globally ε-optimal with respect to the long-term expected reward per time step, i.e. J(α).

5 Learning in finite MMDPs

In an MMDP the action chosen in any state is the joint result of individual action components performed by the agents present in the system. Instead of putting a single learning automaton in each state of the system, we propose to put an automaton LA^i_k in each state s_i, with i : 1...N, for each agent k, k : 1...n. At each time step only the automata of one state are active; the transition to the next state triggers the automata of that state to become active and take some joint action. As before, the automaton LA^i_k active for agent k in state s_i is not informed of the one-step reward R_{ij}(a^i) resulting from choosing joint action a^i = (a^i_1, ..., a^i_n), with a^i_k in A^i_k, in s_i and leading to state s_j. When state s_i is visited again, all automata LA^i_k receive two pieces of data: the cumulative reward generated by the process up to the current time step and the current global time. From these, all LA^i_k compute the incremental reward generated since this last visit and the corresponding elapsed global time. The environment response, i.e. the input to LA^i_k, is exactly the same as in Equation 7.
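The bookkeeping behind response (7) can be sketched as follows. This is one plausible reading of the scheme described above (increments since the last visit are credited to the action chosen at that visit), with illustrative names; it is not the authors' implementation.

    class T1Response:
        """Per-state bookkeeping for learning scheme T1 (Eq. 7).

        On every visit to the state, the controlling automaton is told the global
        cumulative reward and the global time. The increments since the previous
        visit are credited to the action chosen at that previous visit, and the
        response beta is the cumulative reward for that action divided by the
        cumulative time it has been in effect.
        """

        def __init__(self, n_actions):
            self.rho = [0.0] * n_actions     # cumulative reward credited to each action
            self.eta = [0.0] * n_actions     # cumulative elapsed time credited to each action
            self.last_action = None          # action chosen at the previous visit
            self.last_reward = 0.0           # global cumulative reward at the previous visit
            self.last_time = 0               # global time at the previous visit

        def visit(self, global_cumulative_reward, global_time):
            """Call on each visit to the state; returns (previous action, beta) or None."""
            result = None
            if self.last_action is not None:
                a = self.last_action
                self.rho[a] += global_cumulative_reward - self.last_reward
                self.eta[a] += global_time - self.last_time
                if self.eta[a] > 0:
                    result = (a, self.rho[a] / self.eta[a])   # Eq. (7)
            self.last_reward = global_cumulative_reward
            self.last_time = global_time
            return result

        def record_action(self, action):
            """Remember which action the automaton chose on this visit."""
            self.last_action = action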

As an example, consider the MMDP in Figure 1, with 2 agents and 4 states, where only s_0 and s_1 have more than one action. In both of these states 4 joint actions are present: (0,0), (0,1), (1,0) and (1,1). All transitions, except those leaving states s_2 and s_3, are deterministic, while the transitions leaving state s_2 or s_3 go to any of the other states or back to the same state with uniform probability. Rewards are only given for the transitions (s_1, s_2) and (s_1, s_3).

[Figure 1: An example MMDP with 2 action states, s_0 and s_1, each with 2 actions: 0 and 1. Joint actions and nonzero rewards (R) are shown. Transitions are deterministic, except in the non-action states s_2 and s_3, where the process moves to any state with equal probability (1/4).]

In the multi-agent view, the underlying game played depends on the agents' individual policies. This game is a 2-player identical-payoff game with 4 actions. We have a 2-player game because there are 2 agents, and 4 actions because each agent has 4 possible policies it can follow, i.e. (0,0), (0,1), (1,0) and (1,1). Note that here (a_1, a_2) denotes a policy instead of a joint action, i.e. the agent takes action a_1 in state s_0 and action a_2 in state s_1. Figure 2 shows the game matrix for the MMDP of Figure 1. Four equilibria are present, of which two are optimal and two are suboptimal.

In the LA view we consider the game between all the learning automata present in the different states. For the MMDP of Figure 1 this gives a 4-player game with 2 actions for each player: 0 or 1. The complete game is shown in Figure 3. In [10] we already showed that both views share the same pure attractor points. Combining this result with Theorem 1, we can state the following:

Theorem 3. The Learning Automata model proposed here is able to find an equilibrium in pure strategies in an ergodic MMDP.
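The equilibrium claims of the multi-agent view can be checked mechanically. The sketch below finds the pure Nash equilibria of a 2-player identical-payoff matrix such as the one in Figure 2; the function name and the way the matrix is written down are illustrative.

    def pure_nash_equilibria(payoff):
        """Return all pure Nash equilibria of a 2-player identical-payoff game.

        payoff[i][j] is the common payoff when the row player picks policy i and
        the column player picks policy j; a cell is an equilibrium when neither
        player can improve by deviating unilaterally.
        """
        n_rows, n_cols = len(payoff), len(payoff[0])
        equilibria = []
        for i in range(n_rows):
            for j in range(n_cols):
                best_row = all(payoff[i][j] >= payoff[k][j] for k in range(n_rows))
                best_col = all(payoff[i][j] >= payoff[i][l] for l in range(n_cols))
                if best_row and best_col:
                    equilibria.append((i, j, payoff[i][j]))
        return equilibria

    if __name__ == "__main__":
        # Matrix of Figure 2: rows/columns are the agents' policies (0,0), (0,1), (1,0), (1,1).
        figure2 = [
            [0.2857, 0.1667, 0.1429, 0.0833],
            [0.1667, 0.2,    0.0833, 0.1167],
            [0.1429, 0.0833, 0.2857, 0.1667],
            [0.0833, 0.1167, 0.1667, 0.2   ],
        ]
        print(pure_nash_equilibria(figure2))

Run on the matrix of Figure 2, this reports the four diagonal cells, two with value 0.2857 and two with value 0.2, matching the counts of optimal and suboptimal equilibria given above.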

                         agent 2 policies
               (0,0)     (0,1)     (1,0)     (1,1)
    (0,0)      0.2857    0.1667    0.1429    0.0833
    (0,1)      0.1667    0.2       0.0833    0.1167
    (1,0)      0.1429    0.0833    0.2857    0.1667
    (1,1)      0.0833    0.1167    0.1667    0.2
    (agent 1 policies as rows)

Fig. 2. An identical-payoff game with 4 actions that approximates the multi-agent view of the MMDP of Figure 1.

When we now use parametrized LA instead of reward-inaction LA, we can even achieve global convergence:

Theorem 4. The Learning Automata model proposed here, with all automata being parametrized and using the update scheme given in Equation 5, is able to find an optimal equilibrium in pure strategies in an ergodic MMDP.

    (LA^0_0, LA^0_1, LA^1_0, LA^1_1)   J(α)      (LA^0_0, LA^0_1, LA^1_0, LA^1_1)   J(α)
    (0,0, 0,0)                         0.2857    (1,0, 0,0)                         0.1667
    (0,0, 0,1)                         0.1429    (1,0, 0,1)                         0.0833
    (0,0, 1,0)                         0.1429    (1,0, 1,0)                         0.0833
    (0,0, 1,1)                         0.2       (1,0, 1,1)                         0.1167
    (0,1, 0,0)                         0.1667    (1,1, 0,0)                         0.2857
    (0,1, 0,1)                         0.0833    (1,1, 0,1)                         0.1429
    (0,1, 1,0)                         0.0833    (1,1, 1,0)                         0.1429
    (0,1, 1,1)                         0.1167    (1,1, 1,1)                         0.2

Fig. 3. An identical-payoff game between 4 players, each with 2 actions, that approximates the LA view of the MMDP of Figure 1.

Figure 4 shows experimental results on the MMDP of Figure 1. We compared the reward-inaction scheme, using learning rates 0.01, 0.005 and 0.001, with parameterized LAs. To demonstrate convergence we show a single very long run (10 million time steps) and restart the automata every 2 million steps. Both algorithms were initialized with all LAs having a probability of 0.9 of playing action 1. This gives a large bias towards one of the suboptimal equilibria, ((1,1),(1,1)). The L_{R-I} automata converged to this equilibrium in every trial, while the PLAs manage to escape and converge to the optimal equilibrium ((1,0),(1,0)).

[Figure 4: Experimental results for reward-inaction and PLAs on the MMDP of Figure 1, showing the average reward over the last 1000 steps as a function of time. Both algorithms were initialized with a high bias towards the suboptimal equilibrium. Reward-inaction was tested with learning rates 0.01, 0.005 and 0.001. Settings for the PLAs were: K = 1.0, n = 1, L = 1.5, b = 0.04, σ = 0.1.]

References

1. C. Boutilier. Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge, pages 195-210, Renesse, Holland, 1996.

2. G. Chalkiadakis and C. Boutilier. Coordination in multiagent reinforcement learning: A Bayesian approach. In Proceedings of the 2nd International Joint Conference on Autonomous Agents and Multiagent Systems, pages 709-716, Melbourne, Australia, 2003.
3. C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the 15th National Conference on Artificial Intelligence, pages 746-752, 1998.
4. J. Hu and M. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4:1039-1069, 2003.
5. M. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning, pages 322-328, 1994.
6. K. Narendra and M. Thathachar. Learning Automata: An Introduction. Prentice-Hall International, Inc., 1989.
7. P. Sastry, V. Phansalkar, and M. Thathachar. Decentralized learning of Nash equilibria in multi-person stochastic games with incomplete information. IEEE Transactions on Systems, Man, and Cybernetics, 24(5):769-777, 1994.
8. R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
9. M.A.L. Thathachar and P.S. Sastry. Networks of Learning Automata: Techniques for Online Stochastic Optimization. Kluwer Academic Publishers, 2004.

10. P. Vrancx, K. Verbeeck, and A. Nowé. Decentralized learning of Markov games. Technical Report COMO/12/2006, Computational Modeling Lab, Vrije Universiteit Brussel, Brussels, Belgium, 2006.
11. R.M. Wheeler and K.S. Narendra. Decentralized learning in finite Markov chains. IEEE Transactions on Automatic Control, AC-31:519-526, 1986.