Optimal Convergence in Multi-Agent MDPs


Peter Vrancx 1, Katja Verbeeck 2, and Ann Nowé 1

1 {pvrancx, ann.nowe}@vub.ac.be, Computational Modeling Lab, Vrije Universiteit Brussel
2 k.verbeeck@micc.unimaas.nl, MICC-IKAT, Maastricht University

(Funding: a Ph.D. grant of the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT Vlaanderen); the Interactive Collaborative Information Systems (ICIS) project, supported by the Dutch Ministry of Economic Affairs, nr: BSIK03024, The Netherlands.)

Abstract. Learning Automata (LA) were recently shown to be valuable tools for designing Multi-Agent Reinforcement Learning algorithms. One of the principal contributions of LA theory is that a set of decentralized, independent learning automata is able to control a finite Markov chain with unknown transition probabilities and rewards. We extend this result to the framework of Multi-Agent MDPs, a straightforward extension of single-agent MDPs to distributed cooperative multi-agent decision problems. Furthermore, we combine this result with the application of parametrized learning automata, yielding global optimal convergence results.

1 Introduction

Reinforcement Learning was originally developed for Markov Decision Problems (MDPs) [8]. However, the MDP model does not allow multiple agents to act in the same environment. A straightforward extension of the MDP model to the multi-agent case is given by the framework of Markov Games [5]. In a Markov Game, actions are the result of the joint action selection of all agents, and rewards and state transitions depend on these joint actions. In this paper we assume that all agents share the same reward function, i.e. we consider the so-called team games or multi-agent MDPs (MMDPs). In this case the game is purely cooperative, and all agents should learn how to find and agree on the same optimal policy. In addition, the agents face the problem of incomplete information with respect to the action choice. One can assume that the agents get information about their own choice of action as well as that of the others. This is the case in what is called joint action learning, a popular way to address multi-agent learning [5, 3, 2, 4]. In contrast, independent agents only know their own action. The latter is often a more realistic assumption, since distributed multi-agent applications are typically subject to limitations such as partial or non-observability, communication costs, asynchronism and stochasticity. Since learning automata work strictly on the basis of the response

of the environment, and not on the basis of any knowledge regarding other automata, i.e. neither their strategies nor their feedback, they are very well suited as independent multi-agent learners.

In this paper we focus on how learning automata (LA) can tackle the problem of learning MMDPs. One of the principal contributions of LA theory is that a set of decentralized learning automata is able to control a finite Markov chain with unknown transition probabilities and rewards. This result can be extended to MMDPs, using a simple extension of the original LA network provided in [6]. A learning automaton is put in place for every agent in each state. The problem can then be approached from two perspectives. First, as a multi-agent game, in which each agent is represented by all the automata it is associated with across the states. Alternatively, the problem can be viewed as an LA game, i.e. the view in which each automaton itself represents an agent. Both views can be shown to share the same pure equilibrium points, and as such existing theorems on LA games and parametrized LA guarantee convergence to globally optimal points of the MMDP.

This paper is organised as follows. First, we give the definitions of MDPs, MMDPs and basic LA theory. Next, the LA models for learning MDPs and MMDPs, respectively, are given. Theoretical convergence guarantees are discussed and a simple example is added.

2 Definitions

2.1 Definition of an MDP

The problem of controlling a finite Markov chain for which transition probabilities and rewards are unknown, called a Markov Decision Problem (MDP), can be stated as follows. Let S = {s_1, ..., s_N} be the state space of a finite Markov chain {x_l}_{l >= 0} and A_i = {a^i_1, ..., a^i_{r_i}} the action set available in state s_i. Each starting state s_i, action choice a^i in A_i and ending state s_j has an associated transition probability T_{ij}(a^i) and reward R_{ij}(a^i). The overall goal is to learn a policy α, i.e. a set of actions α = (a^1, ..., a^N) with a^j in A_j, so that the expected average reward J(α) is maximized:

J(\alpha) \equiv \lim_{l \to \infty} \frac{1}{l} E\Big[ \sum_{t=0}^{l-1} R_{x(t)x(t+1)}(\alpha) \Big]    (1)

The policies we consider are limited to stationary, nonrandomized policies. Under the assumption that the Markov chain corresponding to each policy α is ergodic, it can be shown that the best strategy in any state is a pure strategy, independent of the time at which the state is occupied [11]. A Markov chain {x_n}_{n >= 0} is said to be ergodic when the distribution of the chain converges to a limiting distribution π(α) = (π_1(α), ..., π_N(α)) with π_i(α) > 0 for all i, as n goes to infinity. Thus there are no transient states, and the limiting distribution π(α) can be used to rewrite Equation 1 as:

J(\alpha) = \sum_{i=1}^{N} \pi_i(\alpha) \sum_{j=1}^{N} T_{ij}(\alpha) R_{ij}(\alpha)    (2)
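To make Equation 2 concrete, the following sketch (hypothetical Python, not part of the paper; the function name average_reward and the dict-of-matrices model interface are our own) evaluates J(α) for a fixed pure policy by computing the limiting distribution of the induced ergodic chain.

import numpy as np

def average_reward(T, R, policy):
    """Evaluate J(alpha) of Equation 2 for a pure policy.

    T[a][i, j]: transition probability from state i to j under action a
    R[a][i, j]: reward for that transition
    policy[i]:  action chosen in state i
    """
    N = len(policy)
    # Transition and reward matrices of the chain induced by the policy
    P = np.array([T[policy[i]][i, :] for i in range(N)])
    r = np.array([R[policy[i]][i, :] for i in range(N)])
    # Limiting distribution pi(alpha): left eigenvector of P for eigenvalue 1
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = pi / pi.sum()
    # J(alpha) = sum_i pi_i * sum_j T_ij(alpha) R_ij(alpha)
    return float(np.sum(pi[:, None] * P * r))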

2.2 Definition of MMDPs

An extension of single-agent Markov decision problems (MDPs) to the cooperative multi-agent case is given by Multi-agent MDPs (MMDPs) [1]. In an MMDP, actions are the joint result of multiple agents choosing an action separately. Note that A^i_k = {a^i_{k1}, ..., a^i_{k r_i}} is now the action set available in state s_i for agent k, with k : 1 ... n, n being the total number of agents present in the system. Transition probabilities T_{ij}(a^i) and rewards R_{ij}(a^i) now depend on a starting state s_i, an ending state s_j and a joint action taken in state s_i, i.e. a^i = (a^i_1, ..., a^i_n) with a^i_k in A^i_k. Since the agents' individual action choices may be jointly suboptimal, the added problem in MMDPs is for the agents to learn to coordinate their actions so that joint optimality is achieved. The value of a joint policy α = (a^1, ..., a^N), with a^i a joint action of state s_i in A^i_1 × ... × A^i_n, can still be defined by Equation 1. Under the same assumption considered above, i.e. the Markov chain corresponding to each joint policy α is ergodic, it is sufficient to only consider joint policies in which the agents choose pure strategies. Moreover, under this assumption the expected average reward of a joint policy α can also be expressed by Equation 2.

3 Learning Automata

A learning automaton describes the internal state of an agent as a probability distribution according to which actions should be chosen [9]. These probabilities are adjusted with some reinforcement scheme according to the success or failure of the actions taken. The LA is defined by a quadruple {A, β, p, T}, where A is the action or output set {a_1, a_2, ..., a_r} of the automaton, β is a random variable in the interval [0, 1], p is the vector of the automaton's (i.e. the agent's) action probabilities, and T denotes the update scheme. An important update scheme is the linear reward-penalty scheme. The philosophy is essentially to increase the probability of an action when it results in a success and to decrease it when the response is a failure. The general algorithm is given by:

p_m(t+1) = p_m(t) + \lambda_1 (1 - \beta(t))(1 - p_m(t)) - \lambda_2 \beta(t) p_m(t)    (3)
    if a_m is the action taken at time t,

p_j(t+1) = p_j(t) - \lambda_1 (1 - \beta(t)) p_j(t) + \lambda_2 \beta(t) \big[ (r-1)^{-1} - p_j(t) \big]    (4)
    if a_j \neq a_m.

The constants λ_1 and λ_2 are the reward and penalty parameters, respectively. When λ_1 = λ_2 the algorithm is referred to as linear reward-penalty (L_R-P), when λ_2 = 0 as linear reward-inaction (L_R-I), and when λ_2 is small compared to λ_1 as linear reward-ε-penalty (L_R-εP).
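As a minimal illustration (our sketch, not the paper's code), the update (3)-(4) in Python; setting lam2 = 0 yields the reward-inaction (L_R-I) scheme used later in the paper. The convention implied by (3)-(4) is that β = 0 is a fully favourable response and β = 1 a fully unfavourable one.

import numpy as np

def la_update(p, m, beta, lam1, lam2):
    """Linear reward-penalty update of Equations 3-4.

    p:     action probability vector of the automaton (length r)
    m:     index of the action taken at time t
    beta:  environment response in [0, 1] (0 = favourable, 1 = unfavourable)
    lam1:  reward parameter; lam2: penalty parameter (lam2 = 0 gives L_R-I)
    """
    r = len(p)
    p_new = np.array(p, dtype=float)
    # Equation 3: the action that was taken
    p_new[m] = p[m] + lam1 * (1 - beta) * (1 - p[m]) - lam2 * beta * p[m]
    # Equation 4: all other actions
    for j in range(r):
        if j != m:
            p_new[j] = p[j] - lam1 * (1 - beta) * p[j] + lam2 * beta * (1.0 / (r - 1) - p[j])
    return p_new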

3.1 Learning Automata Games

A play a(t) of n automata is a set of strategies chosen by the automata at stage t, such that a_j(t) is an element of the action set of the j-th automaton. Correspondingly, the outcome is now also a vector β(t) = (β_1(t), ..., β_n(t)). At every time step all automata update their probability distributions based on the responses of the environment. Each automaton participating in the game operates without information concerning the number of other participants, their strategies, actions or payoffs. The following result was proved:

Theorem 1. [7] When the automata game is repeatedly played with each player making use of the L_R-I scheme with a sufficiently small step size, then local convergence is established towards pure Nash equilibria.

3.2 Parameterized Learning Automata

Parameterized Learning Automata (PLA) keep an internal state vector u of real numbers, which is not necessarily a probability vector. The probabilities of the various actions are then generated from this vector u via a probability generating function g : R^M × A -> [0, 1]. This allows for a richer update mechanism that uses a random perturbation term, following ideas similar to Simulated Annealing. It can be shown that, thanks to these perturbations, PLA are able to converge to a globally optimal solution in team games and certain feedforward network systems. When the automaton receives a feedback r(t), it updates the parameter vector u instead of directly modifying the probabilities. In this paper we use the following update rule proposed by Thathachar and Phansalkar [9]:

u_i(t+1) = u_i(t) + b\, r(t) \frac{\partial \ln g}{\partial u_i}(u(t), \alpha(t)) + b\, h'(u_i(t)) + b\, s_i(t)    (5)

with:

h(x) = \begin{cases} -K(x - L)^{2n} & x \geq L \\ 0 & |x| \leq L \\ -K(x + L)^{2n} & x \leq -L \end{cases}    (6)

where h'(x) is the derivative of h(x), {s_i(t) : t >= 0} is a set of i.i.d. random variables with zero mean and variance σ^2, b is the learning parameter, σ and K are positive constants, and n is a positive integer. In this update rule, the second term is a gradient-following term, the third term is used to keep the solutions bounded with |u_i| <= L, and the final term is a random term that allows the algorithm to escape local optima that are not globally optimal. In [9] the authors show that the algorithm converges weakly to the solution of the Langevin equation, which globally maximizes the appropriate function.
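A sketch of the PLA update (5)-(6), under two assumptions not fixed by the text above: the probability generating function g is taken to be the softmax g(u, a_i) = exp(u_i) / sum_j exp(u_j) (for which the log-gradient is 1 - g_i for the chosen action and -g_j for the others), and the perturbations s_i(t) are Gaussian. The default constants are those reported for the experiment in Figure 4; the function name pla_update is ours.

import numpy as np

def pla_update(u, m, reward, b, K=1.0, n=1, L=1.5, sigma=0.1):
    """Parameterized LA update of Equation 5 (softmax generating function assumed).

    u:      internal parameter vector (one entry per action)
    m:      index of the action alpha(t) that was taken
    reward: environment feedback r(t)
    b:      learning parameter; K, n, L, sigma as in Equation 6
    """
    g = np.exp(u - u.max())
    g = g / g.sum()                    # current action probabilities
    # d ln g(u, a_m) / d u_i for the softmax generating function
    dlng = -g
    dlng[m] += 1.0
    # h'(x): derivative of the bounding function of Equation 6
    def h_prime(x):
        if x >= L:
            return -2 * n * K * (x - L) ** (2 * n - 1)
        if x <= -L:
            return -2 * n * K * (x + L) ** (2 * n - 1)
        return 0.0
    hp = np.array([h_prime(x) for x in u])
    s = np.random.normal(0.0, sigma, size=u.shape)   # i.i.d. perturbation term
    return u + b * reward * dlng + b * hp + b * s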

4 Learning in finite MDPs

The problem of controlling a Markov chain can be formulated as a network of automata in which control passes from one automaton to another. In this set-up, every action state (i.e. every state in which more than one action is available) in the Markov chain has an LA that tries to learn the optimal action probabilities in that state with learning scheme (3)-(4). Only one LA is active at each time step, and the transition to the next state triggers the LA of that state to become active and take some action. The automaton LA_i active in state s_i is not informed of the one-step reward R_{ij}(a^i) resulting from choosing action a^i in A_i in s_i and leading to state s_j. Instead, when state s_i is visited again, LA_i receives two pieces of data: the cumulative reward generated by the process up to the current time step and the current global time. From these, LA_i computes the incremental reward generated since this last visit and the corresponding elapsed global time. The environment response, i.e. the input to LA_i, is then taken to be:

\beta^i(t_i + 1) = \frac{\rho^i(t_i + 1)}{\eta^i(t_i + 1)}    (7)

where ρ^i(t_i + 1) is the cumulative total reward generated for action a^i in state s_i and η^i(t_i + 1) the cumulative total time elapsed. The authors in [11] denote updating scheme (3)-(4) with the environment response of (7) as learning scheme T1. The following result was proved:

Theorem 2 (Wheeler and Narendra, 1986). Associate with each action state s_i of an N-state Markov chain an automaton LA_i that uses learning scheme T1 and has r_i actions. Assume that the Markov chain corresponding to each policy α is ergodic. Then the decentralized adaptation of the LA is globally ε-optimal with respect to the long-term expected reward per time step, i.e. J(α).
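The control loop behind learning scheme T1 can be sketched as follows (hypothetical Python; env.step, its (next_state, reward) return value and one-step rewards in [0, 1] are our assumptions, and the response of Equation 7 is treated as a success measure that the reward-inaction update reinforces).

import numpy as np

def control_mdp(env, n_states, n_actions, lam=0.01, steps=100_000):
    """Network of L_R-I automata controlling an MDP (learning scheme T1, response of Eq. 7)."""
    p = [np.ones(n_actions) / n_actions for _ in range(n_states)]  # one LA per state
    rho = np.zeros((n_states, n_actions))   # cumulative reward attributed to (state, action)
    eta = np.zeros((n_states, n_actions))   # cumulative elapsed time attributed to (state, action)
    last = [None] * n_states                # (global time, cumulative reward, action) at last visit
    cum_r, t, s = 0.0, 0, 0
    while t < steps:
        if last[s] is not None:
            t0, r0, a0 = last[s]
            rho[s, a0] += cum_r - r0         # incremental reward since the last visit to s
            eta[s, a0] += t - t0             # elapsed global time since the last visit to s
            beta = rho[s, a0] / eta[s, a0]   # environment response of Equation 7
            # reward-inaction update of LA_s (beta treated as a success measure, rewards in [0, 1])
            p[s] -= lam * beta * p[s]
            p[s][a0] += lam * beta
        a = np.random.choice(n_actions, p=p[s])   # LA_s becomes active and chooses an action
        last[s] = (t, cum_r, a)
        s, r = env.step(s, a)
        cum_r += r
        t += 1
    return p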

5 Learning in finite MMDPs

In an MMDP the action taken in any state is the joint result of individual action components chosen by the agents present in the system. Instead of putting a single learning automaton in each state of the system, we propose to put an automaton LA^i_k in each state s_i, i : 1 ... N, for each agent k, k : 1 ... n. At each time step only the automata of one state are active; the transition to a state triggers the LA of that state to become active and take some joint action. As before, the automaton LA^i_k active for agent k in state s_i is not informed of the one-step reward R_{ij}(a^i) resulting from choosing joint action a^i = (a^i_1, ..., a^i_n), with a^i_k in A^i_k, in s_i and leading to state s_j. When state s_i is visited again, all automata LA^i_k receive two pieces of data: the cumulative reward generated by the process up to the current time step and the current global time. From these, all LA^i_k compute the incremental reward generated since this last visit and the corresponding elapsed global time. The environment response, i.e. the input to LA^i_k, is exactly the same as in Equation 7. (A control-loop sketch of this scheme is given below, after Theorem 3.)

As an example, consider the MMDP in Figure 1, with 2 agents and 4 states, where only s_0 and s_1 have more than one action. In both states 4 joint actions are present: (0,0), (0,1), (1,0) and (1,1). All transitions, except those leaving states s_2 and s_3, are deterministic, while the transitions leaving state s_2 or s_3 have uniform probability of going to any of the states (including back to the same state). Rewards are only given for the transitions (s_1, s_2) and (s_1, s_3).

Fig. 1. An example MMDP with 2 action states s_0 and s_1, each with 2 actions: 0 and 1. Joint actions and nonzero rewards (R) are shown. Transitions are deterministic, except in the non-action states s_2 and s_3, where the process goes to any of the states with equal probability (1/4).

In the multi-agent view, the underlying game played depends on the agents' individual policies. This game is a 2-player identical payoff game with 4 actions: 2 players because we have 2 agents, and 4 actions because each agent has 4 possible policies, i.e. (0,0), (0,1), (1,0) and (1,1). Note that here (a_1, a_2) denotes a policy instead of a joint action, i.e. the agent takes action a_1 in state s_0 and action a_2 in state s_1. Figure 2 shows the resulting game matrix for the MMDP of Figure 1 (such a matrix can also be computed directly from the induced Markov chains; a sketch is given at the end of this section). Four equilibria are present, of which 2 are optimal and 2 are suboptimal.

In the LA view we consider the game between all the learning automata present in the different states. For the MMDP of Figure 1 this gives a 4-player game with 2 actions for each player: 0 or 1. The complete game is shown in Figure 3. In [10] we already showed that both views share the same pure attractor points. Combining this result with Theorem 1, we can state the following:

Theorem 3. The Learning Automata model proposed here is able to find an equilibrium in pure strategies in an ergodic MMDP.
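Extending the previous sketch to the MMDP case changes only who acts and who is updated: every agent k keeps an automaton LA^i_k in each state s_i, all automata of the visited state pick their action components independently, and on the next visit to that state they all receive the same two pieces of data and form the response of Equation 7. A sketch under the same assumptions as before (env.step now takes the joint action):

import numpy as np

def control_mmdp(env, n_states, n_agents, n_actions, lam=0.01, steps=100_000):
    """One L_R-I automaton per (state, agent); each uses the response of Equation 7."""
    p = [[np.ones(n_actions) / n_actions for _ in range(n_agents)] for _ in range(n_states)]
    rho = np.zeros((n_states, n_agents, n_actions))   # cumulative reward per LA^s_k and own action
    eta = np.zeros((n_states, n_agents, n_actions))   # cumulative elapsed time, same indexing
    last = [None] * n_states            # (time, cumulative reward, joint action) at the last visit
    cum_r, t, s = 0.0, 0, 0
    while t < steps:
        if last[s] is not None:
            t0, r0, joint0 = last[s]
            for k in range(n_agents):   # every automaton of state s gets the same two pieces of data
                a0 = joint0[k]
                rho[s, k, a0] += cum_r - r0
                eta[s, k, a0] += t - t0
                beta = rho[s, k, a0] / eta[s, k, a0]      # Equation 7
                p[s][k] -= lam * beta * p[s][k]           # reward-inaction update of LA^s_k
                p[s][k][a0] += lam * beta
        joint = tuple(np.random.choice(n_actions, p=p[s][k]) for k in range(n_agents))
        last[s] = (t, cum_r, joint)
        s, r = env.step(s, joint)
        cum_r += r
        t += 1
    return p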

Fig. 2. An identical payoff game with 4 actions per agent that approximates the multi-agent view of the MMDP of Figure 1.

When we now use parametrized LA instead of reward-inaction LA, we can even achieve global convergence:

Theorem 4. The Learning Automata model proposed here, with all automata being parametrized and using the update scheme given in Equation 5, is able to find an optimal equilibrium in pure strategies in an ergodic MMDP.

Fig. 3. An identical payoff game between 4 players, each with 2 actions, that approximates the LA view of the MMDP of Figure 1.

Figure 4 shows experimental results on the MMDP of Figure 1. We compared the reward-inaction scheme, using learning rates 0.01 and 0.005 among others, with parameterized LAs. To demonstrate convergence we show a single very long run (10 million time steps) and restart the automata every 2 million steps. Both algorithms were initialized with all LAs having a probability of 0.9 of playing action 1. This gives a large bias towards one of the suboptimal equilibria, ((1,1),(1,1)). The L_R-I automata converged to this equilibrium in every trial, while the PLAs manage to escape and converge to the optimal equilibrium ((1,0),(1,0)).
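For completeness, the multi-agent view game of Figure 2 can be reproduced mechanically: each agent policy is a pair (action in s_0, action in s_1), each pair of policies induces a Markov chain, and each cell is the corresponding J(α). The sketch below is hypothetical; it reuses the average_reward helper from the Section 2 sketch and assumes T and R are dicts keyed by joint actions.

from itertools import product

# Reuses average_reward(T, R, policy) from the sketch in Section 2.1.

def policy_game(T, R, action_states=(0, 1), n_states=4, n_actions=2):
    """Multi-agent view (Figure 2): each cell is J(alpha) for a pair of agent policies.

    T and R are assumed to be dicts keyed by joint actions (a1, a2), holding
    n_states x n_states transition and reward matrices.
    """
    agent_policies = list(product(range(n_actions), repeat=len(action_states)))
    game = {}
    for pol1, pol2 in product(agent_policies, repeat=2):
        # translate the two agent policies into one joint action per state
        policy = []
        for s in range(n_states):
            if s in action_states:
                i = action_states.index(s)
                policy.append((pol1[i], pol2[i]))
            else:
                policy.append((0, 0))   # non-action state: its row does not depend on the action
        game[(pol1, pol2)] = average_reward(T, R, policy)
    return game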

Fig. 4. Experimental results for reward-inaction and PLAs on the MMDP of Figure 1. The figure shows the average reward over the preceding 1000 steps, plotted against time. Both algorithms were initialized with a high bias towards the suboptimal equilibrium. Reward-inaction was tested with learning rates 0.01 and 0.005, among others. Settings for the PLAs were: K = 1.0, n = 1, L = 1.5, b = 0.04, σ = 0.1.

References

1. C. Boutilier. Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge, Renesse, Holland, 1996.
2. G. Chalkiadakis and C. Boutilier. Coordination in multiagent reinforcement learning: A Bayesian approach. In Proceedings of the 2nd International Joint Conference on Autonomous Agents and Multiagent Systems, Melbourne, Australia, 2003.
3. C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the 15th National Conference on Artificial Intelligence, 1998.
4. J. Hu and M. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4, 2003.
5. M. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning, 1994.
6. K. Narendra and M. Thathachar. Learning Automata: An Introduction. Prentice-Hall International, Inc., 1989.
7. P. Sastry, V. Phansalkar, and M. Thathachar. Decentralized learning of Nash equilibria in multi-person stochastic games with incomplete information. IEEE Transactions on Systems, Man, and Cybernetics, 24(5), 1994.
8. R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
9. M.A.L. Thathachar and P.S. Sastry. Networks of Learning Automata: Techniques for Online Stochastic Optimization. Kluwer Academic Publishers, 2004.

10. P. Vrancx, K. Verbeeck, and A. Nowé. Decentralized learning of Markov games. Technical Report COMO/12/2006, Computational Modeling Lab, Vrije Universiteit Brussel, Brussels, Belgium, 2006.
11. R.M. Wheeler and K.S. Narendra. Decentralized learning in finite Markov chains. IEEE Transactions on Automatic Control, AC-31, 1986.
