Counterfactual Multi-Agent Policy Gradients
|
|
- Emery Carter
- 6 years ago
- Views:
Transcription
1 Counterfactual Multi-Agent Policy Gradients Shimon Whiteson Dept. of Computer Science University of Oxford joint work with Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, and Nantas Nardelli July 6, 2017 Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
2 Single-Agent Paradigm Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
3 Multi-Agent Paradigm Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
4 Multi-Agent Systems are Everywhere Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
5 Types of Multi-Agent Systems Cooperative: Shared team reward Coordination problem Competitive: Zero-sum games Individual opposing rewards Minimax equilibria Mixed: General-sum games Nash equilibria What is the question? Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
6 Coordination Problems are Everywhere Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
7 Multi-Agent MDP All agents see the global state s Individual actions: u a U State transitions: P(s s, u) : S U S [0, 1] Shared team reward: r(s, u) : S U R Equivalent to an MDP with a factored action space Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
8 Dec-POMDP Observation function: O(s, a) : S A Z Action-observation history: τ a T (Z U) Decentralised policies: π a (u a τ a ) : T U [0, 1] Natural decentralisation: communication and sensory constraints Artificial decentralisation: coping with joint action space Centralised learning of decentralised policies Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
9 Key Challenges Curse of dimensionality in actions Multi-agent credit assignment Modelling other agents information state Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
10 Single-Agent Policy Gradient Methods Optimise π θ with gradient ascent on expected return: J θ = E s ρ π (s),u π θ (s, ) [r(s, u)] Good when: Greedification is hard, e.g., continuous actions Policy is simpler than value function Policy gradient theorem [Sutton et al. 2000]: θ J θ = E s ρ π (s),u π θ (s, ) [ θ log π θ (u s)q π (s, u)] REINFORCE [Williams 1992]: T θ J θ g(τ) = θ log π θ (u t s t )R t t=0 Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
11 Single-Agent Actor-Critic Methods [Sutton et al. 00] Reduce variance in g (τ ) by learning a critic Q(s, u): g (τ ) = T X θ log πθ (ut st )Q(st, ut ) t=0 Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
12 Single-Agent Baselines Further reduce variance with a baseline b(s): T g(τ) = θ log π θ (u t s t )(Q(s t, u t ) b(s t )) t=0 b(s) = V (s) = Q(s, u) b(s) = A(s, a), the advantage function: T g(τ) = θ log π θ (u t s t )A(s t, u t ) t=0 TD-error r t + γv (s t+1 ) V (s) is an unbiased estimate of A(s t, u t ): T g(τ) = θ log π θ (u t s t )(r t + γv (s t+1 ) V (s t )) t=0 Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
13 Single-Agent Deep Actor-Critic Methods Actor and critic are both deep neural networks Convolutional and recurrent layers Actor and critic share layers Both trained with stochastic gradient descent Actor trained on policy gradient Critic trained on TD(λ) or Sarsa(λ): L t (ψ) = (y (λ) C( t, ψ)) 2 y (λ) = (1 λ) G (n) t = n=1 λ n 1 G (n) t n γ k 1 r t+k + γ n C( t+n, ψ) k=1 Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
14 Independent Actor-Critic Inspired by independent Q-learning [Tan 1993] Each agent learns independently with its own actor and critic Treats other agents as part of the environment Speed learning with parameter sharing Different inputs, including a, induce different behaviour Still independent: critics condition only on τ a and u a Variants: IAC-V: TD-error gradient using V (τ a ) IAC-Q: Advantage-based gradient using A(τ a, u a ) = Q(τ a, u a ) V (τ a ) Limitations: Nonstationary learning Hard to learn to coordinate Multi-agent credit assignment Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
15 Counterfactual Multi-Agent Policy Gradients Centralised critic: stabilise learning to coordinate Counterfactual baseline: tackle multi-agent credit assignment Efficient critic representation: scale to large NNs Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
16 Centralised Critic Centralisation Hard greedification actor-critic T g a (τ) = θ log π θ (ut a τt a )(r t + γv (s t+1 ) V (s t )) t=0 A 1 t (h 1, ) critic (h 2, ) A 2 t h 1 actor 1 u 1 t u 2 t actor 2 h 2 o 1 t u 1 t s t r t o 2 u 2 t t environment Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
17 Wonderful Life Utility [Wolpert & Tumer 2000] Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
18 Difference Rewards [Tumer & Agogino 2007] Per-agent shaped reward: where c a is a default action D a (s, u) = r(s, u) r(s, (u a, c a )) Important property: D a (s, (u a, u a )) > D a (s, u) = r(s, (u a, u a )) > r(s, (u a, a)) Limitations: Need (extra) simulation to estimate counterfactual r(s, (u a, c a )) Need expertise to choose c a Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
19 Counterfactual Baseline Use Q(s, u) to estimate difference rewards: T g a (τ) = θ log π θ (ut a τt a )A a (s t, u t ) t=0 A a (s, u) = Q(s, u) u a π a (u a τ a )Q(s, (u a, u a )) Baseline marginalises out u a Critic obviates need for extra simulations Marginalised action obviates need for default Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
20 Efficient Critic Representation h a t, ) A a t (u a t, a t ) COMA h a t U (h a t ) {Q(u a =1, u -a t,..),..,q(ua = U, u -a t,..)}, u a t-1 ) (u-a t, s t, oa t, a, u t-1 ) ) (c) Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
21 Starcraft Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
22 Starcraft Micromanagement [Synnaeve et al. 2016] Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
23 Centralised Performance Local Field of View (FoV) map heur. IAC-V IAC-Q cnt-v cnt-qv COMA mean best Full FoV, Central Control heur. DQN GMEZO 3m m w d 3z Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
24 Decentralised Starcraft Micromanagement x Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
25 Heuristic Performance Local Field of View (FoV) map heur. IAC-V IAC-Q cnt-v cnt-qv COMA mean best Full FoV, Central Control heur. DQN GMEZO 3m m w d 3z Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
26 Baseline Algorithms IAC-V: independent actor-critic with V (τ a ) IAC-Q: independent actor-critic with A(τ a, u a ) = Q(τ a, u a ) V (τ a ) Central-V: centralised critic V (s) with TD-error-based gradient Central-QV: Centralised critics Q(s, u) and V (s) Advantage gradient A(s, u) = Q(s, u) V (s) COMA but with b(s) = V (s) Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
27 Results (3m, 5m, 5w, 2d-3z) Average Win % Average Win % COMA central-v central-qv heuristic 20k 40k 60k 80k 100k120k140k # Episodes 5k 10k 15k 20k 25k 30k 35k # Episodes Average Win % Average Win % IAC-V IAC-Q 10k 20k 30k 40k 50k 60k 70k # Episodes 5k 10k 15k 20k 25k 30k 35k 40k # Episodes Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
28 Compared to Centralised Controllers Local Field of View (FoV) map heur. IAC-V IAC-Q cnt-v cnt-qv COMA mean best Full FoV, Central Control heur. DQN GMEZO 3m m w d 3z Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
29 Future Work Factored centralised critics for many agents Multi-agent exploration Starcraft macromanagement Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
30 Paper Counterfactual Multi-Agent Policy Gradients Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
31 Thank You Microsoft! This work was made possible thanks to a generous donation of Azure cloud credits from Microsoft. Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31
Multiagent (Deep) Reinforcement Learning
Multiagent (Deep) Reinforcement Learning MARTIN PILÁT (MARTIN.PILAT@MFF.CUNI.CZ) Reinforcement learning The agent needs to learn to perform tasks in environment No prior knowledge about the effects of
More informationLecture 8: Policy Gradient
Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve
More information(Deep) Reinforcement Learning
Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 17 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015
More informationVariance Reduction for Policy Gradient Methods. March 13, 2017
Variance Reduction for Policy Gradient Methods March 13, 2017 Reward Shaping Reward Shaping Reward Shaping Reward shaping: r(s, a, s ) = r(s, a, s ) + γφ(s ) Φ(s) for arbitrary potential Φ Theorem: r admits
More informationREINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationIntroduction to Reinforcement Learning. CMPT 882 Mar. 18
Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and
More informationDeep Reinforcement Learning
Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 50 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015
More information1 Introduction 2. 4 Q-Learning The Q-value The Temporal Difference The whole Q-Learning process... 5
Table of contents 1 Introduction 2 2 Markov Decision Processes 2 3 Future Cumulative Reward 3 4 Q-Learning 4 4.1 The Q-value.............................................. 4 4.2 The Temporal Difference.......................................
More informationQ-Learning in Continuous State Action Spaces
Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental
More informationTemporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI
Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning
More informationUsing Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *
Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
More informationREINFORCEMENT LEARNING
REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents
More informationQUICR-learning for Multi-Agent Coordination
QUICR-learning for Multi-Agent Coordination Adrian K. Agogino UCSC, NASA Ames Research Center Mailstop 269-3 Moffett Field, CA 94035 adrian@email.arc.nasa.gov Kagan Tumer NASA Ames Research Center Mailstop
More informationReinforcement Learning
Reinforcement Learning Policy gradients Daniel Hennes 26.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 Policy based reinforcement learning So far we approximated the action value
More informationReinforcement Learning
Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo Marc Toussaint University of
More informationDeep Reinforcement Learning SISL. Jeremy Morton (jmorton2) November 7, Stanford Intelligent Systems Laboratory
Deep Reinforcement Learning Jeremy Morton (jmorton2) November 7, 2016 SISL Stanford Intelligent Systems Laboratory Overview 2 1 Motivation 2 Neural Networks 3 Deep Reinforcement Learning 4 Deep Learning
More informationApproximation Methods in Reinforcement Learning
2018 CS420, Machine Learning, Lecture 12 Approximation Methods in Reinforcement Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html Reinforcement
More informationPolicy Gradient Methods. February 13, 2017
Policy Gradient Methods February 13, 2017 Policy Optimization Problems maximize E π [expression] π Fixed-horizon episodic: T 1 Average-cost: lim T 1 T r t T 1 r t Infinite-horizon discounted: γt r t Variable-length
More information15-889e Policy Search: Gradient Methods Emma Brunskill. All slides from David Silver (with EB adding minor modificafons), unless otherwise noted
15-889e Policy Search: Gradient Methods Emma Brunskill All slides from David Silver (with EB adding minor modificafons), unless otherwise noted Outline 1 Introduction 2 Finite Difference Policy Gradient
More informationIntroduction of Reinforcement Learning
Introduction of Reinforcement Learning Deep Reinforcement Learning Reference Textbook: Reinforcement Learning: An Introduction http://incompleteideas.net/sutton/book/the-book.html Lectures of David Silver
More informationCS599 Lecture 2 Function Approximation in RL
CS599 Lecture 2 Function Approximation in RL Look at how experience with a limited part of the state set be used to produce good behavior over a much larger part. Overview of function approximation (FA)
More informationReinforcement Learning
Reinforcement Learning Policy Search: Actor-Critic and Gradient Policy search Mario Martin CS-UPC May 28, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 28, 2018 / 63 Goal of this lecture So far
More informationINF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018
Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)
More informationReinforcement Learning
Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo MLR, University of Stuttgart
More informationTrust Region Policy Optimization
Trust Region Policy Optimization Yixin Lin Duke University yixin.lin@duke.edu March 28, 2017 Yixin Lin (Duke) TRPO March 28, 2017 1 / 21 Overview 1 Preliminaries Markov Decision Processes Policy iteration
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Reinforcement learning II Daniel Hennes 11.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Eligibility traces n-step TD returns
More informationReinforcement Learning In Continuous Time and Space
Reinforcement Learning In Continuous Time and Space presentation of paper by Kenji Doya Leszek Rybicki lrybicki@mat.umk.pl 18.07.2008 Leszek Rybicki lrybicki@mat.umk.pl Reinforcement Learning In Continuous
More informationDeep Reinforcement Learning: Policy Gradients and Q-Learning
Deep Reinforcement Learning: Policy Gradients and Q-Learning John Schulman Bay Area Deep Learning School September 24, 2016 Introduction and Overview Aim of This Talk What is deep RL, and should I use
More informationDialogue management: Parametric approaches to policy optimisation. Dialogue Systems Group, Cambridge University Engineering Department
Dialogue management: Parametric approaches to policy optimisation Milica Gašić Dialogue Systems Group, Cambridge University Engineering Department 1 / 30 Dialogue optimisation as a reinforcement learning
More informationReinforcement Learning: An Introduction
Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is
More informationDecoupling deep learning and reinforcement learning for stable and efficient deep policy gradient algorithms
Decoupling deep learning and reinforcement learning for stable and efficient deep policy gradient algorithms Alfredo Vicente Clemente Master of Science in Computer Science Submission date: August 2017
More informationDeep reinforcement learning. Dialogue Systems Group, Cambridge University Engineering Department
Deep reinforcement learning Milica Gašić Dialogue Systems Group, Cambridge University Engineering Department 1 / 25 In this lecture... Introduction to deep reinforcement learning Value-based Deep RL Deep
More informationLecture 9: Policy Gradient II (Post lecture) 2
Lecture 9: Policy Gradient II (Post lecture) 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 Additional reading: Sutton and Barto 2018 Chp. 13 2 With many slides from or derived from David Silver
More informationTemporal difference learning
Temporal difference learning AI & Agents for IET Lecturer: S Luz http://www.scss.tcd.ie/~luzs/t/cs7032/ February 4, 2014 Recall background & assumptions Environment is a finite MDP (i.e. A and S are finite).
More informationOlivier Sigaud. September 21, 2012
Supervised and Reinforcement Learning Tools for Motor Learning Models Olivier Sigaud Université Pierre et Marie Curie - Paris 6 September 21, 2012 1 / 64 Introduction Who is speaking? 2 / 64 Introduction
More informationMisc. Topics and Reinforcement Learning
1/79 Misc. Topics and Reinforcement Learning http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: some parts of this slides is based on Prof. Mengdi Wang s, Prof. Dimitri Bertsekas and Prof.
More informationLearning-based physical layer communications for multi-agent collaboration
Learning-based physical layer communications for multi-agent collaboration 1 Arsham Mostaani 1, Osvaldo Simeone 2, Symeon Chatzinotas 1 and Bjorn Ottersten 1 1 SIGCOM laboratory, SnT Interdisciplinary
More informationCSC321 Lecture 22: Q-Learning
CSC321 Lecture 22: Q-Learning Roger Grosse Roger Grosse CSC321 Lecture 22: Q-Learning 1 / 21 Overview Second of 3 lectures on reinforcement learning Last time: policy gradient (e.g. REINFORCE) Optimize
More informationReinforcement Learning II. George Konidaris
Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2017 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes
More informationNotes on Reinforcement Learning
1 Introduction Notes on Reinforcement Learning Paulo Eduardo Rauber 2014 Reinforcement learning is the study of agents that act in an environment with the goal of maximizing cumulative reward signals.
More informationPOMDPs and Policy Gradients
POMDPs and Policy Gradients MLSS 2006, Canberra Douglas Aberdeen Canberra Node, RSISE Building Australian National University 15th February 2006 Outline 1 Introduction What is Reinforcement Learning? Types
More informationROB 537: Learning-Based Control. Announcements: Project background due Today. HW 3 Due on 10/30 Midterm Exam on 11/6.
ROB 537: Learning-Based Control Week 5, Lecture 1 Policy Gradient, Eligibility Traces, Transfer Learning (MaC Taylor Announcements: Project background due Today HW 3 Due on 10/30 Midterm Exam on 11/6 Reading:
More informationLecture 8: Policy Gradient I 2
Lecture 8: Policy Gradient I 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 Additional reading: Sutton and Barto 2018 Chp. 13 2 With many slides from or derived from David Silver and John Schulman
More informationReinforcement Learning II. George Konidaris
Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2018 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes
More informationReinforcement Learning
Reinforcement Learning Function approximation Daniel Hennes 19.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Eligibility traces n-step TD returns Forward and backward view Function
More informationLecture 9: Policy Gradient II 1
Lecture 9: Policy Gradient II 1 Emma Brunskill CS234 Reinforcement Learning. Winter 2019 Additional reading: Sutton and Barto 2018 Chp. 13 1 With many slides from or derived from David Silver and John
More informationMultiagent Value Iteration in Markov Games
Multiagent Value Iteration in Markov Games Amy Greenwald Brown University with Michael Littman and Martin Zinkevich Stony Brook Game Theory Festival July 21, 2005 Agenda Theorem Value iteration converges
More informationDavid Silver, Google DeepMind
Tutorial: Deep Reinforcement Learning David Silver, Google DeepMind Outline Introduction to Deep Learning Introduction to Reinforcement Learning Value-Based Deep RL Policy-Based Deep RL Model-Based Deep
More informationCS Deep Reinforcement Learning HW2: Policy Gradients due September 19th 2018, 11:59 pm
CS294-112 Deep Reinforcement Learning HW2: Policy Gradients due September 19th 2018, 11:59 pm 1 Introduction The goal of this assignment is to experiment with policy gradient and its variants, including
More informationReinforcement Learning via Policy Optimization
Reinforcement Learning via Policy Optimization Hanxiao Liu November 22, 2017 1 / 27 Reinforcement Learning Policy a π(s) 2 / 27 Example - Mario 3 / 27 Example - ChatBot 4 / 27 Applications - Video Games
More informationDeep Reinforcement Learning via Policy Optimization
Deep Reinforcement Learning via Policy Optimization John Schulman July 3, 2017 Introduction Deep Reinforcement Learning: What to Learn? Policies (select next action) Deep Reinforcement Learning: What to
More informationActor-critic methods. Dialogue Systems Group, Cambridge University Engineering Department. February 21, 2017
Actor-critic methods Milica Gašić Dialogue Systems Group, Cambridge University Engineering Department February 21, 2017 1 / 21 In this lecture... The actor-critic architecture Least-Squares Policy Iteration
More informationReinforcement Learning as Classification Leveraging Modern Classifiers
Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions
More informationReinforcement Learning. George Konidaris
Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom
More informationReinforcement Learning for NLP
Reinforcement Learning for NLP Advanced Machine Learning for NLP Jordan Boyd-Graber REINFORCEMENT OVERVIEW, POLICY GRADIENT Adapted from slides by David Silver, Pieter Abbeel, and John Schulman Advanced
More informationCS599 Lecture 1 Introduction To RL
CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming
More informationHuman-level control through deep reinforcement. Liia Butler
Humanlevel control through deep reinforcement Liia Butler But first... A quote "The question of whether machines can think... is about as relevant as the question of whether submarines can swim" Edsger
More informationQ-learning. Tambet Matiisen
Q-learning Tambet Matiisen (based on chapter 11.3 of online book Artificial Intelligence, foundations of computational agents by David Poole and Alan Mackworth) Stochastic gradient descent Experience
More informationReinforcement Learning. Spring 2018 Defining MDPs, Planning
Reinforcement Learning Spring 2018 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state
More informationA Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games
International Journal of Fuzzy Systems manuscript (will be inserted by the editor) A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games Mostafa D Awheda Howard M Schwartz Received:
More informationELEC-E8119 Robotics: Manipulation, Decision Making and Learning Policy gradient approaches. Ville Kyrki
ELEC-E8119 Robotics: Manipulation, Decision Making and Learning Policy gradient approaches Ville Kyrki 9.10.2017 Today Direct policy learning via policy gradient. Learning goals Understand basis and limitations
More informationProximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO) default reinforcement learning algorithm at OpenAI Policy Gradient On-policy Off-policy Add constraint DeepMind https://youtu.be/gn4nrcc9twq OpenAI https://blog.openai.com/o
More informationA reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation
A reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation Hajime Fujita and Shin Ishii, Nara Institute of Science and Technology 8916 5 Takayama, Ikoma, 630 0192 JAPAN
More informationReal Time Value Iteration and the State-Action Value Function
MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing
More informationReinforcement Learning and Deep Reinforcement Learning
Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q
More informationLecture 7: Value Function Approximation
Lecture 7: Value Function Approximation Joseph Modayil Outline 1 Introduction 2 3 Batch Methods Introduction Large-Scale Reinforcement Learning Reinforcement learning can be used to solve large problems,
More informationReinforcement Learning
Reinforcement Learning Function Approximation Continuous state/action space, mean-square error, gradient temporal difference learning, least-square temporal difference, least squares policy iteration Vien
More informationarxiv: v1 [cs.ai] 5 Nov 2017
arxiv:1711.01569v1 [cs.ai] 5 Nov 2017 Markus Dumke Department of Statistics Ludwig-Maximilians-Universität München markus.dumke@campus.lmu.de Abstract Temporal-difference (TD) learning is an important
More informationChapter 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1
Chapter 7: Eligibility Traces R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 Midterm Mean = 77.33 Median = 82 R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
More informationLecture 3: Markov Decision Processes
Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov
More informationContributions to deep reinforcement learning and its applications in smartgrids
1/60 Contributions to deep reinforcement learning and its applications in smartgrids Vincent François-Lavet University of Liege, Belgium September 11, 2017 Motivation 2/60 3/60 Objective From experience
More informationCS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study
CS 287: Advanced Robotics Fall 2009 Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study Pieter Abbeel UC Berkeley EECS Assignment #1 Roll-out: nice example paper: X.
More informationReinforcement Learning Part 2
Reinforcement Learning Part 2 Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ From previous tutorial Reinforcement Learning Exploration No supervision Agent-Reward-Environment
More informationDeep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)
CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information
More informationDeep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)
CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information
More informationLearning Dexterity Matthias Plappert SEPTEMBER 6, 2018
Learning Dexterity Matthias Plappert SEPTEMBER 6, 2018 OpenAI OpenAI is a non-profit AI research company, discovering and enacting the path to safe artificial general intelligence. OpenAI OpenAI is a non-profit
More informationDeterministic Policy Gradient, Advanced RL Algorithms
NPFL122, Lecture 9 Deterministic Policy Gradient, Advanced RL Algorithms Milan Straka December 1, 218 Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics
More informationDecision Theory: Q-Learning
Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning
More informationMachine Learning I Continuous Reinforcement Learning
Machine Learning I Continuous Reinforcement Learning Thomas Rückstieß Technische Universität München January 7/8, 2010 RL Problem Statement (reminder) state s t+1 ENVIRONMENT reward r t+1 new step r t
More informationCOMP9444 Neural Networks and Deep Learning 10. Deep Reinforcement Learning. COMP9444 c Alan Blair, 2017
COMP9444 Neural Networks and Deep Learning 10. Deep Reinforcement Learning COMP9444 17s2 Deep Reinforcement Learning 1 Outline History of Reinforcement Learning Deep Q-Learning for Atari Games Actor-Critic
More informationLecture 23: Reinforcement Learning
Lecture 23: Reinforcement Learning MDPs revisited Model-based learning Monte Carlo value function estimation Temporal-difference (TD) learning Exploration November 23, 2006 1 COMP-424 Lecture 23 Recall:
More informationImplicit Incremental Natural Actor Critic
Implicit Incremental Natural Actor Critic Ryo Iwaki and Minoru Asada Osaka University, -1, Yamadaoka, Suita city, Osaka, Japan {ryo.iwaki,asada}@ams.eng.osaka-u.ac.jp Abstract. The natural policy gradient
More informationGame-Theoretic Cooperative Lane Changing Using Data-Driven Models
Game-Theoretic Cooperative Lane Changing Using Data-Driven Models Guohui Ding, Sina Aghli, Christoffer Heckman, and Lijun Chen Abstract Self-driving cars are being regularly deployed in the wild, and with
More informationBalancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm
Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu
More informationReinforcement Learning. Machine Learning, Fall 2010
Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30
More informationarxiv: v1 [cs.lg] 20 Jun 2018
RUDDER: Return Decomposition for Delayed Rewards arxiv:1806.07857v1 [cs.lg 20 Jun 2018 Jose A. Arjona-Medina Michael Gillhofer Michael Widrich Thomas Unterthiner Sepp Hochreiter LIT AI Lab Institute for
More informationPayments System Design Using Reinforcement Learning: A Progress Report
Payments System Design Using Reinforcement Learning: A Progress Report A. Desai 1 H. Du 1 R. Garratt 2 F. Rivadeneyra 1 1 Bank of Canada 2 University of California Santa Barbara 16th Payment and Settlement
More informationLearning Control for Air Hockey Striking using Deep Reinforcement Learning
Learning Control for Air Hockey Striking using Deep Reinforcement Learning Ayal Taitler, Nahum Shimkin Faculty of Electrical Engineering Technion - Israel Institute of Technology May 8, 2017 A. Taitler,
More informationReinforcement Learning in a Nutshell
Reinforcement Learning in a Nutshell V. Heidrich-Meisner 1, M. Lauer 2, C. Igel 1 and M. Riedmiller 2 1- Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany 2- Neuroinformatics Group, University
More informationA Survey on Methods and Applications of Deep Reinforcement Learning
A Survey on Methods and Applications of Deep Reinforcement Learning Siyi Li Department of Computer Science and Engineering The Hong Kong University of Science and Technology sliay@cse.ust.hk Supervisor
More informationChristopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015
Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)
More informationReinforcement learning
Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error
More informationReinforcement Learning
Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation
More informationAnimal learning theory
Animal learning theory Based on [Sutton and Barto, 1990, Dayan and Abbott, 2001] Bert Kappen [Sutton and Barto, 1990] Classical conditioning: - A conditioned stimulus (CS) and unconditioned stimulus (US)
More informationOptimization of a Sequential Decision Making Problem for a Rare Disease Diagnostic Application.
Optimization of a Sequential Decision Making Problem for a Rare Disease Diagnostic Application. Rémi Besson with Erwan Le Pennec and Stéphanie Allassonnière École Polytechnique - 09/07/2018 Prenatal Ultrasound
More informationMarkov Decision Processes and Solving Finite Problems. February 8, 2017
Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:
More informationThe Reinforcement Learning Problem
The Reinforcement Learning Problem Slides based on the book Reinforcement Learning by Sutton and Barto Formalizing Reinforcement Learning Formally, the agent and environment interact at each of a sequence
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationMarks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:
Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,
More information