Optimal Convergence in Multi-Agent MDPs
Peter Vrancx¹, Katja Verbeeck², and Ann Nowé¹
¹ {pvrancx, ann.nowe}@vub.ac.be, Computational Modeling Lab, Vrije Universiteit Brussel
² k.verbeeck@micc.unimaas.nl, MICC-IKAT, Maastricht University

Abstract. Learning Automata (LA) were recently shown to be valuable tools for designing Multi-Agent Reinforcement Learning algorithms. One of the principal contributions of LA theory is that a set of decentralized, independent learning automata is able to control a finite Markov chain with unknown transition probabilities and rewards. We extend this result to the framework of Multi-Agent MDPs, a straightforward extension of single-agent MDPs to distributed cooperative multi-agent decision problems. Furthermore, we combine this result with the application of parametrized learning automata, yielding global optimal convergence results.

1 Introduction

Reinforcement Learning was originally developed for Markov Decision Problems (MDPs) [8]. However, the MDP model does not allow multiple agents to act in the same environment. A straightforward extension of the MDP model to the multi-agent case is given by the framework of Markov Games [5]. In a Markov Game, actions are the result of the joint action selection of all agents, and rewards and state transitions depend on these joint actions. In this paper we assume that all agents share the same reward function, i.e. we consider so-called team games or multi-agent MDPs (MMDPs). In this case, the game is purely cooperative and all agents should learn how to find and agree on the same optimal policy. In addition, the agents face the problem of incomplete information with respect to the action choice. One can assume that the agents get information about their own choice of action as well as that of the others. This is the case in what is called joint action learning, a popular way to address multi-agent learning [5, 3, 2, 4].
In contrast, independent agents only know their own action. The latter is often the more realistic assumption, since distributed multi-agent applications are typically subject to limitations such as partial or non-observability, communication costs, asynchronism and stochasticity. Since learning automata work strictly on the basis of the response of the environment, and not on the basis of any knowledge regarding other automata, i.e. neither their strategies nor their feedback, they are very well suited as independent multi-agent learners. In this paper we focus on how learning automata (LA) can tackle the problem of learning MMDPs. One of the principal contributions of LA theory is that a set of decentralized learning automata is able to control a finite Markov chain with unknown transition probabilities and rewards. This result can be extended to MMDPs, using a simple extension of the original LA network provided in [6]: a simple learning automaton is put in place for every agent in each state. The problem can then be approached from two perspectives. First, as a multi-agent game, in which each agent is represented by all the automata it is associated with in every state. Alternatively, the problem can be viewed as an LA game, i.e. the view in which each automaton itself represents an agent. Both views can be shown to share the same pure equilibrium points, and as such, existing LA theorems on LA games and parametrized LA guarantee convergence to global optimal points of the MMDP.

This paper is organised as follows. First, we start with the definitions of MDPs, MMDPs and basic LA theory. Next, the LA models for learning MDPs and MMDPs, respectively, are given. Theoretical convergence guarantees are discussed and a simple example is added. (This research was funded by a Ph.D. grant of the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT Vlaanderen), and by the Interactive Collaborative Information Systems (ICIS) project, supported by the Dutch Ministry of Economic Affairs, nr: BSIK03024, The Netherlands.)

2 Definitions

2.1 Definition of an MDP

The problem of controlling a finite Markov chain for which transition probabilities and rewards are unknown, called a Markov Decision Problem (MDP), can be stated as follows. Let S = {s_1, ..., s_N} be the state space of a finite Markov chain {x_l}_{l ≥ 0} and A^i = {a^i_1, ..., a^i_{r_i}} the action set available in state s_i. Each starting state s_i, action choice a_i ∈ A^i and ending state s_j has an associated transition probability T_ij(a_i) and reward R_ij(a_i).
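To make the notation concrete, the transition probabilities T_ij(a) and rewards R_ij(a) can be stored directly as tables, and the long-run reward per step of a fixed stationary policy can be estimated by simulating the chain. The following minimal sketch uses a small hypothetical 3-state chain of our own; the numbers are illustrative, not from the paper:

```python
import random

# T[i][a][j] is the transition probability T_ij(a);
# R[i][a][j] is the reward R_ij(a). State 0 has two actions, states 1-2 one.
T = {
    0: {0: [0.9, 0.1, 0.0], 1: [0.1, 0.0, 0.9]},
    1: {0: [0.0, 0.5, 0.5]},
    2: {0: [0.5, 0.5, 0.0]},
}
R = {
    0: {0: [0.0, 1.0, 0.0], 1: [0.0, 0.0, 0.2]},
    1: {0: [0.0, 0.0, 0.0]},
    2: {0: [1.0, 0.0, 0.0]},
}

def average_reward(policy, steps=200_000, seed=0):
    """Estimate the expected reward per step of a stationary policy
    (policy[i] is the action chosen in state i) by simulation."""
    rng = random.Random(seed)
    state, total = 0, 0.0
    for _ in range(steps):
        a = policy[state]
        nxt = rng.choices(range(3), weights=T[state][a])[0]
        total += R[state][a][nxt]
        state = nxt
    return total / steps
```

Since every one-step reward here lies in [0, 1], the estimate does too; comparing `average_reward((0, 0, 0))` with `average_reward((1, 0, 0))` amounts to comparing two stationary policies under the average-reward criterion defined next.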
The overall goal is to learn a policy α, or a set of actions, α = (a_1, ..., a_N) with a_j ∈ A^j, so that the expected average reward J(α) is maximized:

    J(\alpha) = \lim_{l \to \infty} \frac{1}{l} \, E\left[ \sum_{t=0}^{l-1} R_{x(t)x(t+1)}(\alpha) \right]    (1)

The policies we consider are limited to stationary, nonrandomized policies. Under the assumption that the Markov chain corresponding to each policy α is ergodic, it can be shown that the best strategy in any state is a pure strategy, independent of the time at which the state is occupied [11]. A Markov chain {x_n}_{n ≥ 0} is said to be ergodic when the distribution of the chain converges, as n → ∞, to a limiting distribution π(α) = (π_1(α), ..., π_N(α)) with π_i(α) > 0 for all i. Thus, there are no transient states, and the limiting distribution π(α) can be used to rewrite Equation 1 as:
    J(\alpha) = \sum_{i=1}^{N} \pi_i(\alpha) \sum_{j=1}^{N} T_{ij}(\alpha) R_{ij}(\alpha)    (2)

2.2 Definition of MMDPs

An extension of single-agent Markov decision problems (MDPs) to the cooperative multi-agent case is given by Multi-agent MDPs (MMDPs) [1]. In an MMDP, actions are the joint result of multiple agents choosing an action separately. Note that A^i_k = {a^i_{k1}, ..., a^i_{k r_i}} is now the action set available in state s_i for agent k, with k: 1...n, n being the total number of agents present in the system. Transition probabilities T_ij(a^i) and rewards R_ij(a^i) now depend on a starting state s_i, an ending state s_j and a joint action from state s_i, i.e. a^i = (a^i_1, ..., a^i_n) with a^i_k ∈ A^i_k. Since the agents' individual action choices may be jointly suboptimal, the added problem in MMDPs is for the agents to learn to coordinate their actions so that joint optimality is achieved. The value of a joint policy α = (a^1, ..., a^N), with a^i a joint action of state s_i in A^i_1 × ... × A^i_n, can still be defined by Equation 1. Under the same assumption considered above, i.e. that the Markov chain corresponding to each joint policy α is ergodic, it is sufficient to consider only joint policies in which the agents choose pure strategies. Moreover, under this assumption the expected average reward of a joint policy α can also be expressed by Equation 2.

3 Learning Automata

A learning automaton describes the internal state of an agent as a probability distribution according to which actions are chosen [9]. These probabilities are adjusted with some reinforcement scheme according to the success or failure of the actions taken. The LA is defined by a quadruple {A, β, p, T}, for which A is the action or output set {a_1, a_2, ..., a_r} of the automaton, β is a random variable in the interval [0, 1], p is the vector of the automaton's action probabilities, and T denotes the update scheme. An important update scheme is the linear reward-penalty scheme.
The philosophy is essentially to increase the probability of an action when it results in a success and to decrease it when the response is a failure. The general algorithm is given by:

    p_m(t+1) = p_m(t) + \lambda_1 (1 - \beta(t)) (1 - p_m(t)) - \lambda_2 \beta(t) p_m(t)    (3)

if a_m is the action taken at time t, and

    p_j(t+1) = p_j(t) - \lambda_1 (1 - \beta(t)) p_j(t) + \lambda_2 \beta(t) \left[ (r-1)^{-1} - p_j(t) \right]    (4)

if a_j ≠ a_m. The constants λ_1 and λ_2 are the reward and penalty parameters respectively. When λ_1 = λ_2 the algorithm is referred to as linear reward-penalty (L_{R-P}), when λ_2 = 0 it is referred to as linear reward-inaction (L_{R-I}), and when λ_2 is small compared to λ_1 it is called linear reward-ε-penalty (L_{R-εP}).

3.1 Learning Automata Games

A play a(t) of n automata is a set of strategies chosen by the automata at stage t, such that a_j(t) is an element of the action set of the j-th automaton. Correspondingly, the outcome is now also a vector β(t) = (β_1(t), ..., β_n(t)). At every time step all automata update their probability distributions based on the responses of the environment. Each automaton participating in the game operates without information concerning the number of other participants, their strategies, actions or payoffs. The following result was proved:

Theorem 1 ([7]). When the automata game is repeatedly played with each player making use of the L_{R-I} scheme with a sufficiently small step size, then local convergence is established towards pure Nash equilibria.

3.2 Parameterized Learning Automata

Parameterized Learning Automata (PLA) keep an internal state vector u of real numbers, which is not necessarily a probability vector. The probabilities of the various actions are then generated from this vector u by a probability generating function g: R^M × A → [0, 1]. This allows for a richer update mechanism that adds a random perturbation term to the update scheme, using ideas similar to Simulated Annealing. It can be shown that, thanks to these perturbations, PLA are able to converge to a globally optimal solution in team games and certain feedforward network systems. When the automaton receives a feedback r(t), it updates the parameter vector u instead of directly modifying the probabilities. In this paper we use the following update rule proposed by Thathachar and Phansalkar [9]:

    u_i(t+1) = u_i(t) + b \, r(t) \, \frac{\partial \ln g}{\partial u_i}(u(t), \alpha(t)) + b \, h'(u_i(t)) + \sqrt{b} \, s_i(t)    (5)

with:

    h(x) = \begin{cases}
      -K(x - L)^{2n} & x \geq L \\
      0              & |x| \leq L \\
      -K(x + L)^{2n} & x \leq -L
    \end{cases}    (6)

where h'(x) is the derivative of h(x), {s_i(t): t ≥ 0} is a set of i.i.d. random variables with zero mean and variance σ², b is the learning parameter, σ and K are positive constants, and n is a positive integer. In this update rule, the second term is a gradient-following term, the third term keeps the solutions bounded with |u_i| ≤ L, and the final term is a random term that allows the algorithm to escape local optima that are not globally optimal. In [9] the authors show that the algorithm converges weakly to the solution of the Langevin equation, which globally maximizes the appropriate function.
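As a concrete illustration of Equations 3-4 (our own sketch, not the paper's code; β follows the convention above, where 0 means full reward and 1 full penalty):

```python
def la_update(p, m, beta, lam1, lam2=0.0):
    """One step of the linear reward-penalty family (Equations 3-4).
    p    : list of action probabilities
    m    : index of the action taken at time t
    beta : environment response in [0, 1] (0 = full reward, 1 = full penalty)
    lam2 = 0 gives reward-inaction (L_RI); lam1 == lam2 gives L_RP."""
    r = len(p)
    q = list(p)
    for j in range(r):
        if j == m:
            # Eq. 3: move toward the chosen action on reward, away on penalty
            q[j] = p[j] + lam1 * (1 - beta) * (1 - p[j]) - lam2 * beta * p[j]
        else:
            # Eq. 4: the other actions absorb the complementary mass
            q[j] = p[j] - lam1 * (1 - beta) * p[j] \
                 + lam2 * beta * (1.0 / (r - 1) - p[j])
    return q
```

By construction the updated vector still sums to one: the λ_1 terms cancel across Equations 3 and 4, and so do the λ_2 terms. Setting lam2 = 0 recovers the L_{R-I} scheme assumed in Theorem 1.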
4 Learning in finite MDPs

The problem of controlling a Markov chain can be formulated as a network of automata in which control passes from one automaton to another. In this set-up, every action state³ in the Markov chain is assigned an LA that tries to learn the optimal action probabilities in that state using learning scheme (3, 4). Only one LA is active at each time step, and a transition to the next state triggers the LA of that state to become active and take some action. The automaton LA_i active in state s_i is not immediately informed of the one-step reward R_ij(a_i) resulting from choosing action a_i ∈ A^i in s_i and leading to state s_j. When state s_i is visited again, LA_i receives two pieces of data: the cumulative reward generated by the process up to the current time step, and the current global time. From these, LA_i computes the incremental reward generated since this last visit and the corresponding elapsed global time. The environment response, or the input to LA_i, is then taken to be:

    \beta_i(t_i + 1) = \frac{\rho^i(t_i + 1)}{\eta^i(t_i + 1)}    (7)

where ρ^i(t_i + 1) is the cumulative total reward generated for action a_i in state s_i and η^i(t_i + 1) the cumulative total time elapsed. The authors in [11] denote updating scheme (3, 4) with environment response as in (7) as learning scheme T1. The following result was proved:

Theorem 2 (Wheeler and Narendra, 1986). Let an automaton LA_i, using learning scheme T1 and having r_i actions, be associated with each action state s_i of an N-state Markov chain. Assume that the Markov chain corresponding to each policy α is ergodic. Then the decentralized adaptation of the LA is globally ε-optimal with respect to the long-term expected reward per time step, i.e. J(α).

5 Learning in finite MMDPs

In an MMDP, the action chosen at any state is the joint result of individual action components performed by the agents present in the system.
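Both the single-agent network of Section 4 and the multi-agent construction below rely on the same environment response (Equation 7): on each revisit to a state, the increments in cumulative reward and global time since the last visit are credited to the action chosen then, and the response is the resulting reward rate. A minimal sketch of this bookkeeping; the class and attribute names are ours:

```python
class T1Response:
    """Per-state bookkeeping for learning scheme T1 (Equation 7)."""

    def __init__(self):
        self.last_cum_reward = 0.0  # cumulative reward at the previous visit
        self.last_time = 0          # global time at the previous visit
        self.rho = {}               # cumulative reward credited per action
        self.eta = {}               # cumulative elapsed time credited per action

    def response(self, prev_action, cum_reward, global_time):
        """Credit the increments since the last visit to prev_action (the
        action chosen at that visit) and return beta = rho / eta."""
        self.rho[prev_action] = (self.rho.get(prev_action, 0.0)
                                 + cum_reward - self.last_cum_reward)
        self.eta[prev_action] = (self.eta.get(prev_action, 0)
                                 + global_time - self.last_time)
        self.last_cum_reward, self.last_time = cum_reward, global_time
        eta = self.eta[prev_action]
        return self.rho[prev_action] / eta if eta > 0 else 0.0
```

If the one-step rewards lie in [0, 1], the returned β also lies in [0, 1], as the update scheme requires.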
Instead of putting a single learning automaton in each state of the system, we propose to put an automaton LA^i_k in each state s_i, with i: 1...N, for each agent k, with k: 1...n. At each time step only the automata of one state are active; a state transition triggers the LA of that state to become active and take some joint action. As before, the automaton LA^i_k, active for agent k in state s_i, is not immediately informed of the one-step reward R_ij(a^i) resulting from choosing joint action a^i = (a^i_1, ..., a^i_n), with a^i_k ∈ A^i_k, in s_i and leading to state s_j. When state s_i is visited again, all automata LA^i_k receive two pieces of data: the cumulative reward generated by the process up to the current time step, and the current global time. From these, all LA^i_k compute the incremental reward generated since this last visit and the corresponding elapsed global time. The environment response, or the input to LA^i_k, is exactly the same as in Equation 7.

³ A state is called an action state when more than one action is present.

As an example, consider the MMDP in Figure 1 with 2 agents and 4 states, where only s_0 and s_1 have more than one action. In both states 4 joint actions are present: (0,0), (0,1), (1,0) and (1,1). All transitions, except those leaving states s_2 and s_3, are deterministic, while the transitions leaving state s_2 or s_3 go to any of the states, including itself, with uniform probability. Rewards are only given for the transitions (s_1, s_2) and (s_1, s_3).

Fig. 1. An example MMDP with 2 action states s_0 and s_1, each with 2 actions: 0 and 1. Joint actions and nonzero rewards (R) are shown: from s_1, joint action (0,0) yields R: 1.0, (1,1) yields R: 0.7, and (1,0), (0,1) yield R: 0.5. Transitions are deterministic, except in the non-action states s_2 and s_3, where the process goes to any state with equal probability (1/4).

In the multi-agent view, the underlying game played depends on the agents' individual policies. This game is a 2-player identical-payoff game with 4 actions. We have a 2-player game because we have 2 agents, and 4 actions because each agent has 4 possible policies it can follow, i.e. (0,0), (0,1), (1,0) and (1,1). Note that here (a_1, a_2) denotes a policy instead of a joint action, i.e. the agent takes action a_1 in state s_0 and action a_2 in state s_1. Figure 2 shows the game matrix for the MMDP of Figure 1. Four equilibria are present, of which 2 are optimal and 2 are sub-optimal. In the LA view we consider the game between all the learning automata present in the different states. For the MMDP of Figure 1 this gives a 4-player game with 2 actions for each player: 0 or 1. The complete game is shown in Figure 3. In [10] we already showed that both views share the same pure attractor points. Combining this result with Theorem 1, we can state the following:

Theorem 3. The Learning Automata model proposed here is able to find an equilibrium in pure strategies in an ergodic MMDP.
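The mechanism behind Theorem 3 can be illustrated in miniature: in the multi-agent view the MMDP reduces to an identical-payoff game, and independent reward-inaction learners on such a game drift to one of its pure equilibria. The sketch below uses a hypothetical 2×2 identical-payoff game of our own, with one optimal and one suboptimal pure equilibrium; it illustrates the convergence behaviour and is not the paper's experiment:

```python
import random

def play_identical_payoff_game(payoff, lam=0.05, steps=20000, seed=1):
    """Two independent reward-inaction automata repeatedly play a game with a
    common payoff in [0, 1]. Each learner observes only its own action and
    the shared reward, and shifts probability toward the action it just
    took, proportionally to the reward received."""
    rng = random.Random(seed)
    p1, p2 = [0.5, 0.5], [0.5, 0.5]  # each player's action probabilities
    for _ in range(steps):
        a1 = 0 if rng.random() < p1[0] else 1
        a2 = 0 if rng.random() < p2[0] else 1
        r = payoff[a1][a2]  # identical payoff for both players
        p1[a1] += lam * r * (1.0 - p1[a1]); p1[1 - a1] = 1.0 - p1[a1]
        p2[a2] += lam * r * (1.0 - p2[a2]); p2[1 - a2] = 1.0 - p2[a2]
    return p1, p2

# Joint actions (0,0) and (1,1) are pure equilibria; (0,0) is optimal.
game = [[1.0, 0.0],
        [0.0, 0.7]]
```

Across runs the pair absorbs at one of the two pure equilibria; which one depends on the random seed and initial probabilities. This matches Theorem 1's local (not global) guarantee, and is precisely the limitation that the parametrized automata of Theorem 4 address.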
Fig. 2. An identical-payoff game with 4 actions, one per agent policy (0,0), (0,1), (1,0), (1,1), that approximates the multi-agent view of the MMDP of Figure 1.

When we now use parametrized LA instead of reward-inaction LA, we can even achieve global convergence:

Theorem 4. The Learning Automata model proposed here, with all automata being parametrized and using the update scheme given in Equation 5, is able to find an optimal equilibrium in pure strategies in an ergodic MMDP.

Fig. 3. An identical-payoff game between 4 players (LA^0_0, LA^0_1, LA^1_0, LA^1_1), each with 2 actions, that approximates the LA view of the MMDP of Figure 1.

Figure 4 shows experimental results on the MMDP of Figure 1. We compared the reward-inaction scheme, using learning rates including 0.01 and 0.05, with parameterized LAs. To demonstrate convergence we show a single very long run (10 million time steps) and restart the automata every 2 million steps. Both algorithms were initialized with all LAs having a probability of 0.9 of playing action 1. This gives a large bias towards one of the suboptimal equilibria, ((1,1),(1,1)). The L_{R-I} automata converged to this equilibrium in every trial, while the PLAs manage to escape it and converge to the optimal equilibrium ((1,0),(1,0)).

Fig. 4. Experimental results for reward-inaction and PLAs on the MMDP of Figure 1, showing the average reward over the last 1000 steps as a function of time (up to 10^7 steps). Both algorithms were initialized with a high bias towards the suboptimal equilibrium. Reward-inaction was tested with learning rates including 0.01 and 0.005. Settings for the PLAs were: K = 1.0, n = 1, L = 1.5, b = 0.04, σ = 0.1.

References

1. C. Boutilier. Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge, Renesse, Holland, 1996.
2. G. Chalkiadakis and C. Boutilier. Coordination in multiagent reinforcement learning: a Bayesian approach. In Proceedings of the 2nd International Joint Conference on Autonomous Agents and Multiagent Systems, Melbourne, Australia, 2003.
3. C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the 15th National Conference on Artificial Intelligence, 1998.
4. J. Hu and M. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4, 2003.
5. M. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning, 1994.
6. K. Narendra and M. Thathachar. Learning Automata: An Introduction. Prentice-Hall International, 1989.
7. P. Sastry, V. Phansalkar, and M. Thathachar. Decentralized learning of Nash equilibria in multi-person stochastic games with incomplete information. IEEE Transactions on Systems, Man, and Cybernetics, 24(5), 1994.
8. R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
9. M.A.L. Thathachar and P.S. Sastry. Networks of Learning Automata: Techniques for Online Stochastic Optimization. Kluwer Academic Publishers, 2004.
10. P. Vrancx, K. Verbeeck, and A. Nowé. Decentralized learning of Markov games. Technical Report COMO/12/2006, Computational Modeling Lab, Vrije Universiteit Brussel, Brussels, Belgium, 2006.
11. R.M. Wheeler and K.S. Narendra. Decentralized learning in finite Markov chains. IEEE Transactions on Automatic Control, AC-31(6), 1986.
Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.
More informationPolicy Gradient Reinforcement Learning for Robotics
Policy Gradient Reinforcement Learning for Robotics Michael C. Koval mkoval@cs.rutgers.edu Michael L. Littman mlittman@cs.rutgers.edu May 9, 211 1 Introduction Learning in an environment with a continuous
More informationBalancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm
Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu
More informationSelecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden
1 Selecting Efficient Correlated Equilibria Through Distributed Learning Jason R. Marden Abstract A learning rule is completely uncoupled if each player s behavior is conditioned only on his own realized
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationUsing Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *
Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationReinforcement Learning
Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo Marc Toussaint University of
More informationMcGill University Department of Electrical and Computer Engineering
McGill University Department of Electrical and Computer Engineering ECSE 56 - Stochastic Control Project Report Professor Aditya Mahajan Team Decision Theory and Information Structures in Optimal Control
More informationConnections Between Cooperative Control and Potential Games Illustrated on the Consensus Problem
Proceedings of the European Control Conference 2007 Kos, Greece, July 2-5, 2007 Connections Between Cooperative Control and Potential Games Illustrated on the Consensus Problem Jason R. Marden, Gürdal
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationLearning ε-pareto Efficient Solutions With Minimal Knowledge Requirements Using Satisficing
Learning ε-pareto Efficient Solutions With Minimal Knowledge Requirements Using Satisficing Jacob W. Crandall and Michael A. Goodrich Computer Science Department Brigham Young University Provo, UT 84602
More informationLecture 8: Policy Gradient
Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve
More informationBest-Response Multiagent Learning in Non-Stationary Environments
Best-Response Multiagent Learning in Non-Stationary Environments Michael Weinberg Jeffrey S. Rosenschein School of Engineering and Computer Science Heew University Jerusalem, Israel fmwmw,jeffg@cs.huji.ac.il
More informationLearning Near-Pareto-Optimal Conventions in Polynomial Time
Learning Near-Pareto-Optimal Conventions in Polynomial Time Xiaofeng Wang ECE Department Carnegie Mellon University Pittsburgh, PA 15213 xiaofeng@andrew.cmu.edu Tuomas Sandholm CS Department Carnegie Mellon
More informationRecursive Learning Automata Approach to Markov Decision Processes
1 Recursive Learning Automata Approach to Markov Decision Processes Hyeong Soo Chang, Michael C. Fu, Jiaqiao Hu, and Steven I. Marcus Abstract We present a sampling algorithm, called Recursive Automata
More informationQ-Learning in Continuous State Action Spaces
Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental
More informationGame-Theoretic Learning:
Game-Theoretic Learning: Regret Minimization vs. Utility Maximization Amy Greenwald with David Gondek, Amir Jafari, and Casey Marks Brown University University of Pennsylvania November 17, 2004 Background
More informationMulti-Agent Learning with Policy Prediction
Multi-Agent Learning with Policy Prediction Chongjie Zhang Computer Science Department University of Massachusetts Amherst, MA 3 USA chongjie@cs.umass.edu Victor Lesser Computer Science Department University
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationThe Reinforcement Learning Problem
The Reinforcement Learning Problem Slides based on the book Reinforcement Learning by Sutton and Barto Formalizing Reinforcement Learning Formally, the agent and environment interact at each of a sequence
More information1 Introduction 2. 4 Q-Learning The Q-value The Temporal Difference The whole Q-Learning process... 5
Table of contents 1 Introduction 2 2 Markov Decision Processes 2 3 Future Cumulative Reward 3 4 Q-Learning 4 4.1 The Q-value.............................................. 4 4.2 The Temporal Difference.......................................
More informationChristopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015
Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)
More informationREINFORCEMENT LEARNING
REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents
More informationTemporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI
Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning
More informationCOMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati
COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning
More informationCS 570: Machine Learning Seminar. Fall 2016
CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or
More informationIn Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G.
In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and J. Alspector, (Eds.). Morgan Kaufmann Publishers, San Fancisco, CA. 1994. Convergence of Indirect Adaptive Asynchronous
More informationTheoretical Advantages of Lenient Q-learners: An Evolutionary Game Theoretic Perspective
Theoretical Advantages of Lenient Q-learners: An Evolutionary Game Theoretic erspective Liviu anait Google Inc 604 Arizona Ave, Santa Monica, CA, USA liviu@google.com Karl Tuyls Maastricht University MiCC-IKAT,
More informationFinding Optimal Strategies for Influencing Social Networks in Two Player Games. MAJ Nick Howard, USMA Dr. Steve Kolitz, Draper Labs Itai Ashlagi, MIT
Finding Optimal Strategies for Influencing Social Networks in Two Player Games MAJ Nick Howard, USMA Dr. Steve Kolitz, Draper Labs Itai Ashlagi, MIT Problem Statement Given constrained resources for influencing
More informationAverage Reward Optimization Objective In Partially Observable Domains
Average Reward Optimization Objective In Partially Observable Domains Yuri Grinberg School of Computer Science, McGill University, Canada Doina Precup School of Computer Science, McGill University, Canada
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More informationReinforcement Learning. Yishay Mansour Tel-Aviv University
Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak
More informationDecision Theory: Q-Learning
Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning
More informationIncremental Policy Learning: An Equilibrium Selection Algorithm for Reinforcement Learning Agents with Common Interests
Brigham Young University BYU ScholarsArchive All Faculty Publications 2004-07-01 Incremental Policy Learning: An Equilibrium Selection Algorithm for Reinforcement Learning Agents with Common Interests
More informationReinforcement learning an introduction
Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,
More informationPlayers as Serial or Parallel Random Access Machines. Timothy Van Zandt. INSEAD (France)
Timothy Van Zandt Players as Serial or Parallel Random Access Machines DIMACS 31 January 2005 1 Players as Serial or Parallel Random Access Machines (EXPLORATORY REMARKS) Timothy Van Zandt tvz@insead.edu
More informationSequential Optimality and Coordination in Multiagent Systems
Sequential Optimality and Coordination in Multiagent Systems Craig Boutilier Department of Computer Science University of British Columbia Vancouver, B.C., Canada V6T 1Z4 cebly@cs.ubc.ca Abstract Coordination
More informationReinforcement Learning. Spring 2018 Defining MDPs, Planning
Reinforcement Learning Spring 2018 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state
More information(Deep) Reinforcement Learning
Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 17 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015
More informationDialogue management: Parametric approaches to policy optimisation. Dialogue Systems Group, Cambridge University Engineering Department
Dialogue management: Parametric approaches to policy optimisation Milica Gašić Dialogue Systems Group, Cambridge University Engineering Department 1 / 30 Dialogue optimisation as a reinforcement learning
More informationReinforcement Learning: An Introduction
Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is
More informationLearning Tetris. 1 Tetris. February 3, 2009
Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are
More information1 Problem Formulation
Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning
More informationConsistency of Fuzzy Model-Based Reinforcement Learning
Consistency of Fuzzy Model-Based Reinforcement Learning Lucian Buşoniu, Damien Ernst, Bart De Schutter, and Robert Babuška Abstract Reinforcement learning (RL) is a widely used paradigm for learning control.
More informationReinforcement Learning
1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision
More informationDecentralized reinforcement learning control of a robotic manipulator
Delft University of Technology Delft Center for Systems and Control Technical report 6-6 Decentralized reinforcement learning control of a robotic manipulator L. Buşoniu, B. De Schutter, and R. Babuška
More information