Multiagent (Deep) Reinforcement Learning


Martin Pilát (martin.pilat@mff.cuni.cz)

Reinforcement learning
- The agent needs to learn to perform tasks in an environment, with no prior knowledge about the effects of its actions, while maximizing its utility.
- Mountain Car problem: a typical RL toy problem. The agent (a car) has three actions: left, right, none. The goal is to get up the mountain (to the yellow flag). The engine is weak, so the car cannot simply drive to the right; it must first gain speed by going downhill.

Reinforcement learning
- Formally defined using a Markov Decision Process (MDP) (S, A, R, p): s_t ∈ S is the state space, a_t ∈ A the action space, r_t ∈ R the reward space, and p(s', r | s, a) the probability that performing action a in state s leads to state s' and gives reward r.
- The agent's goal: maximize the discounted return G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = R_{t+1} + γ G_{t+1}.
- The agent learns its policy π(A_t = a | S_t = s), which gives the probability of using action a in state s.
- State value function: V^π(s) = E_π[G_t | S_t = s].
- Action value function: Q^π(s, a) = E_π[G_t | S_t = s, A_t = a].
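
To make the return concrete, here is a minimal sketch (not from the slides) that computes G_t for every step of a finite episode using the recursion G_t = R_{t+1} + γ G_{t+1}; in the code, rewards[t] plays the role of R_{t+1}:

```python
# Minimal sketch: discounted returns for a finite episode,
# computed backwards via G_t = R_{t+1} + gamma * G_{t+1}.
def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):   # work backwards through the episode
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))   # approximately [2.62, 1.8, 2.0]
```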

Q-Learning
- Learns the Q function directly using the Bellman equation: Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_t + γ max_{a'} Q(s_{t+1}, a')).
- During learning, a sampling policy is used (e.g. the ε-greedy policy: use a random action with probability ε, otherwise choose the best action).
- Traditionally, Q is represented as a (sparse) matrix.
- Problems: in many tasks the state space (or action space) is continuous, so some kind of discretization must be performed; learning can be unstable.
Watkins, Christopher J. C. H., and Peter Dayan. "Q-learning." Machine Learning 8, no. 3–4 (1992): 279–292.
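
A compact sketch of tabular Q-learning with ε-greedy sampling, assuming an environment with the classic OpenAI Gym interface (reset/step) and hashable, already discretized states; the hyper-parameter values are illustrative, not taken from the slides:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                           # Q[(state, action)]
    actions = list(range(env.action_space.n))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:                # explore
                a = random.choice(actions)
            else:                                    # exploit: argmax_a Q(s, a)
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done, _ = env.step(a)
            best_next = max(Q[(s_next, a_)] for a_ in actions)
            target = r if done else r + gamma * best_next
            # Q(s, a) <- (1 - alpha) Q(s, a) + alpha (r + gamma max_a' Q(s', a'))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```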

Deep Q-Learning
- The Q function is represented as a deep neural network.
- Experience replay: previous experience (state, action, new state, reward) is stored in a replay buffer and used for training.
- Target network: a separate network that is only rarely updated.
- Optimizes the loss function L(θ) = E[(r + γ max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ))²], where θ and θ⁻ are the parameters of the network and of the target network.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529–533. https://doi.org/10.1038/nature14236.
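
A hedged PyTorch sketch of one DQN update matching the loss above; the network sizes, hyper-parameters, and the layout of the replay-buffer entries (flat float-vector states) are assumptions, not details from the paper:

```python
import random
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())        # theta^- starts as a copy of theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = []                                     # (s, a, r, s_next, done) tuples
gamma = 0.99

def dqn_update(batch_size=32):
    batch = random.sample(replay_buffer, batch_size)   # experience replay
    s, a, r, s_next, done = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a).squeeze(1)            # Q(s, a; theta)
    with torch.no_grad():                              # target uses the rarely updated theta^-
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)        # squared TD error from the loss above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Periodically: target_net.load_state_dict(q_net.state_dict())
```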

Deep Q-Learning
- Successfully used to play single-player Atari games: complex input states (the video of the game), a rather simple discrete action space, rewards given by changes in the game score.
- Better than human-level performance. Human level was measured against an expert who played each game for around 20 episodes of at most 5 minutes, after 2 hours of practice on that game.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529–533. https://doi.org/10.1038/nature14236.

Actor-Critic Methods
- The actor (policy) is trained using a gradient that depends on a critic (an estimate of the value function).
- The critic is a value function: after each action it checks whether things have gone better or worse than expected. The evaluation is the TD error δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t), which is used to evaluate the action selected by the actor: if δ is positive (the outcome was better than expected), the probability of selecting a_t should be strengthened, otherwise lowered.
- Both actor and critic can be approximated by neural networks.
- Policy (π(s, a)) update: Δθ = α ∇_θ log π_θ(s, a) q(s, a).
- Value (q(s, a)) update: Δw = β (R(s, a) + γ q(s_{t+1}, a_{t+1}) − q(s_t, a_t)) ∇_w q(s_t, a_t).
- Works in continuous action spaces.
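
A possible one-step actor-critic update in PyTorch, in which the TD error δ stands in for q(s, a) in the policy update; the network architectures and learning rates are illustrative assumptions:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_update(s, a, r, s_next, done):
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    v = critic(s).squeeze()
    v_next = critic(s_next).squeeze().detach()
    delta = r + gamma * (0.0 if done else v_next) - v                  # TD error
    critic_loss = delta.pow(2)                                         # move V(s) toward the target
    actor_loss = -delta.detach() * torch.log_softmax(actor(s), dim=-1)[a]  # strengthen a if delta > 0
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```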

Multiagent Learning
- Learning in multi-agent environments is more complex: agents need to coordinate with other agents.
- Example: level-based foraging. The goal is to collect all items as fast as possible; an item can be collected if the sum of the collecting agents' levels is greater than the item's level.

Goals of Learning
- Minimax profile: for zero-sum games, (π_i, π_j) is a minimax profile if U_i(π_i, π_j) = −U_j(π_i, π_j) = max_{π'_i} min_{π'_j} U_i(π'_i, π'_j), i.e. each agent is guaranteed its utility against a worst-case opponent.
- Nash equilibrium: a profile (π_1, …, π_n) is a Nash equilibrium if for every agent i and every alternative policy π'_i: U_i(π'_i, π_{−i}) ≤ U_i(π). No agent can improve its utility by unilaterally deviating from the profile (every agent plays a best response to the other agents).
- Correlated equilibrium: agents observe signals x_i with joint distribution ξ(x_1, …, x_n) (e.g. recommended actions). A profile (π_1, …, π_n) conditioned on the signals is a correlated equilibrium if no agent can improve its expected utility by deviating from its recommendation. A Nash equilibrium is a special type of correlated equilibrium with no correlation between the signals.
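
As a small illustration (not from the slides), the following snippet finds the pure-strategy Nash equilibria of a two-player matrix game by testing unilateral deviations; the payoffs are a Prisoner's Dilemma with actions 0 = cooperate, 1 = defect:

```python
import itertools

U = {  # U[(a1, a2)] = (utility of player 1, utility of player 2)
    (0, 0): (3, 3), (0, 1): (0, 5),
    (1, 0): (5, 0), (1, 1): (1, 1),
}

def is_nash(a1, a2):
    # No player may gain by deviating unilaterally.
    if any(U[(d, a2)][0] > U[(a1, a2)][0] for d in (0, 1)):
        return False
    if any(U[(a1, d)][1] > U[(a1, a2)][1] for d in (0, 1)):
        return False
    return True

print([p for p in itertools.product((0, 1), repeat=2) if is_nash(*p)])  # [(1, 1)]
```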

Goals of Learning
- Pareto optimum: a profile π = (π_1, …, π_n) is Pareto-optimal if there is no other profile π' such that ∀i: U_i(π') ≥ U_i(π) and ∃i: U_i(π') > U_i(π); i.e. no agent's utility can be improved without making another agent worse off.
- Social welfare & fairness: the welfare of a profile is the sum of the agents' utilities, its fairness is the product of their utilities. A profile is welfare (or fairness) optimal if it has the maximum possible welfare (fairness).
- No-regret: given the history H^t = (a^0, …, a^{t−1}), agent i's regret for not having taken action a_i is R_i(a_i | H^t) = Σ_{τ<t} [u_i(a_i, a^τ_{−i}) − u_i(a^τ_i, a^τ_{−i})]. A policy π_i achieves no-regret if ∀a_i: lim_{t→∞} (1/t) R_i(a_i | H^t) ≤ 0.
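
A minimal sketch of the regret computation, under the assumption that the history is stored as (own action, other agents' actions) pairs and that a payoff function u_i is available; the helper names are hypothetical:

```python
def regret(u_i, history, a_i):
    """R_i(a_i | H^t): sum over the history of u_i(a_i, a_-i) - u_i(own action, a_-i)."""
    return sum(u_i(a_i, others) - u_i(own, others) for own, others in history)

def average_regret(u_i, history, my_actions):
    """A no-regret policy drives (1/t) R_i(a_i | H^t) to <= 0 for every a_i."""
    t = len(history)
    return {a_i: regret(u_i, history, a_i) / t for a_i in my_actions}
```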

Joint Action Learning
- Learns Q-values for joint actions a ∈ A: the joint action of all agents is a = (a_1, …, a_n), where a_i is the action of agent i.
- Q_{t+1}(a^t, s_t) = (1 − α) Q_t(a^t, s_t) + α u_i^t, where u_i^t is the utility received after the joint action a^t.
- Uses an opponent model to compute the expected utilities of actions:
  E[a_i] = Σ_{a_{−i}} P(a_{−i}) Q_{t+1}((a_i, a_{−i}), s_{t+1}) (joint action learning)
  E[a_i] = Σ_{a_{−i}} P(a_{−i} | a_i) Q_{t+1}((a_i, a_{−i}), s_{t+1}) (conditional joint action learning)
- Opponent models are estimated from the history as relative frequencies of the actions played (conditional frequencies in CJAL); ε-greedy sampling.
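
An illustrative sketch of joint action learning for one agent in a stateless two-agent game (the state argument is dropped for brevity and all names are assumptions); the opponent model is the relative frequency of the other agent's past actions:

```python
from collections import defaultdict
import random

alpha, eps = 0.1, 0.1
Q = defaultdict(float)           # Q[(a_i, a_j)] over joint actions
opp_counts = defaultdict(int)    # frequency model of the other agent

def observe(a_i, a_j, utility):
    Q[(a_i, a_j)] = (1 - alpha) * Q[(a_i, a_j)] + alpha * utility  # Q <- (1-alpha)Q + alpha*u
    opp_counts[a_j] += 1

def expected_utility(a_i, opp_actions):
    # E[a_i] = sum_{a_j} P(a_j) Q(a_i, a_j), with P(a_j) estimated from counts
    total = sum(opp_counts[a_j] for a_j in opp_actions) or 1
    return sum(opp_counts[a_j] / total * Q[(a_i, a_j)] for a_j in opp_actions)

def select_action(my_actions, opp_actions):
    if random.random() < eps:                                       # epsilon-greedy sampling
        return random.choice(list(my_actions))
    return max(my_actions, key=lambda a_i: expected_utility(a_i, opp_actions))
```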

Policy Hill Climbing
- Learns the policy π_i directly by hill-climbing in policy space:
  π_i^{t+1}(s_t, a_i) = π_i^t(s_t, a_i) + δ if a_i is the best action according to Q(s_t, a_i),
  π_i^{t+1}(s_t, a_i) = π_i^t(s_t, a_i) − δ/(|A_i| − 1) otherwise.
- The parameter δ can be made adaptive: larger when losing and smaller when winning (the Win or Learn Fast principle).
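
A small sketch of the policy hill-climbing update for a single state; the clipping and renormalization needed to keep a valid probability distribution are assumptions the slide leaves implicit:

```python
def phc_update(policy, q_s, delta):
    """policy: dict action -> probability for the current state
       q_s:    dict action -> Q(s, a)
       delta:  step size (made adaptive in the Win or Learn Fast variant)."""
    best = max(q_s, key=q_s.get)
    n = len(policy)
    for a in policy:
        if a == best:
            policy[a] = min(1.0, policy[a] + delta)           # strengthen the greedy action
        else:
            policy[a] = max(0.0, policy[a] - delta / (n - 1)) # weaken the others
    total = sum(policy.values())                              # renormalize after clipping
    return {a: p / total for a, p in policy.items()}
```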

Counterfactual Multi-agent Policy Gradients
- Centralized training with decentralized execution (more information is available during training).
- The critic conditions on the current observed state and the actions of all agents; the actors condition only on their own observations.
- Credit assignment is based on difference rewards: the reward of agent i is roughly the difference between the reward received by the system when the joint action a was used and the reward it would have received if agent i had used a default action. Difference rewards require assigning a default action to each agent; COMA instead marginalizes over all possible actions of agent i.
- Used to train micro-management of units in StarCraft.
Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual Multi-Agent Policy Gradients." arXiv:1705.08926 [cs], May 24, 2017. http://arxiv.org/abs/1705.08926.
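
A sketch of COMA's counterfactual baseline for a single agent, assuming the centralized critic has already produced a Q-value for every alternative action of that agent; the tensor shapes and names are assumptions:

```python
import torch

def coma_advantage(q_values: torch.Tensor, pi_a: torch.Tensor, chosen: int):
    """q_values: (n_actions,) critic outputs Q(s, (u_-a, u'_a)) for each alternative u'_a
       pi_a:     (n_actions,) agent a's policy probabilities in the current state
       chosen:   index of the action agent a actually took."""
    baseline = (pi_a * q_values).sum()        # counterfactual baseline: marginalize agent a out
    return q_values[chosen] - baseline        # advantage used in the actor's gradient
```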

Counterfactual Multi-agent Policy Gradients
(Figure from Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual Multi-Agent Policy Gradients." arXiv:1705.08926 [cs], May 24, 2017. http://arxiv.org/abs/1705.08926.)

Ad hoc Teamwork
- Typically, the whole team of agents is provided by a single organization/team, so there is some pre-coordination (communication protocols, coordination, …).
- In ad hoc teamwork, the team consists of agents provided by different organizations that need to cooperate. Example: the RoboCup Drop-In Competition mixes players from different teams.
- Many algorithms are not suitable for ad hoc teamwork: they need many iterations of the game (while typically only a limited amount of time is available) and they are designed for self-play (all agents use the same strategy), whereas in ad hoc teamwork there is no control over the other agents.

Ad hoc Teamwork
- Type-based methods: assume several possible types of agents; based on the interaction history, compute a belief over the types of the other agents; play own actions based on these beliefs. The types can also be parameterized.
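
A possible sketch of the type-based belief update via Bayes' rule; the helper names are hypothetical, and each type model is assumed to expose an action_prob method giving the probability that this type plays a given action in a state:

```python
def update_belief(belief, types, state, observed_action):
    """belief: dict type_name -> probability; types: dict type_name -> model."""
    posterior = {
        name: belief[name] * types[name].action_prob(state, observed_action)
        for name in types
    }
    total = sum(posterior.values())
    if total == 0:                 # observation impossible under every type
        return belief              # keep the previous belief
    return {name: p / total for name, p in posterior.items()}

def best_response(belief, types, state, my_actions, expected_utility):
    # Choose the own action that maximizes utility expected under the belief over types.
    return max(my_actions,
               key=lambda a: sum(belief[n] * expected_utility(types[n], state, a)
                                 for n in types))
```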

Other problems in MAL
- Analysis of emergent behaviors: typically no new learning algorithms, but single-agent learning algorithms evaluated in a multi-agent environment.
- Emergent language: teach agents to use some language. E.g. the signaling game: two agents are shown two images, one of them (the sender) is told which is the target and can send a message (from a fixed vocabulary) to the receiver; both agents receive a positive reward if the receiver identifies the correct image.
- Learning communication: agents can typically exchange vectors of numbers for communication; maximization of a shared utility by means of communication in a partially observable environment.
- Learning cooperation.
- Agents modelling agents.

References and Further Reading
Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual Multi-Agent Policy Gradients." arXiv:1705.08926 [cs], May 24, 2017. http://arxiv.org/abs/1705.08926.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529–533. https://doi.org/10.1038/nature14236.
Albrecht, Stefano, and Peter Stone. "Multiagent Learning: Foundations and Recent Trends." http://www.cs.utexas.edu/~larg/ijcai17_tutorial/ (a nice presentation about general multi-agent learning, slides available).
OpenAI Gym. https://gym.openai.com/ (environments for reinforcement learning).
Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. "Continuous Control with Deep Reinforcement Learning." arXiv:1509.02971 [cs, stat], September 9, 2015. http://arxiv.org/abs/1509.02971 (an actor-critic method for reinforcement learning with continuous actions).
Hernandez-Leal, Pablo, Bilal Kartal, and Matthew E. Taylor. "Is Multiagent Deep Reinforcement Learning the Answer or the Question? A Brief Survey." arXiv:1810.05587 [cs], October 12, 2018. http://arxiv.org/abs/1810.05587 (a survey on multiagent deep reinforcement learning).
Lazaridou, Angeliki, Alexander Peysakhovich, and Marco Baroni. "Multi-Agent Cooperation and the Emergence of (Natural) Language." arXiv:1612.07182 [cs], December 21, 2016. http://arxiv.org/abs/1612.07182 (emergence of language in multiagent communication).