Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016)


Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016) Yoonho Lee Department of Computer Science and Engineering Pohang University of Science and Technology October 11, 2016

Outline Reinforcement Learning Definition of RL Mathematical formulations Algorithms Neural Fitted Q Iteration Deep Q Network Double Deep Q Network Prioritized Experience Replay Dueling Network

Definition of RL general setting of RL

Definition of RL atari setting

Outline Reinforcement Learning Definition of RL Mathematical formulations Algorithms Neural Fitted Q Iteration Deep Q Network Double Deep Q Network Prioritized Experience Replay Dueling Network

Mathematical formulations: objective of RL
Definition: The return G_t is the cumulative discounted reward from time t: G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ...
Definition: A policy π is a function that selects actions given states: π(s) = a. The goal of RL is to find the π that maximizes G_0.
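As a small aside (not part of the slides), the return for a finite reward sequence can be computed by folding from the back, since G_t = r_{t+1} + γ G_{t+1}:

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = r_1 + γ r_2 + γ² r_3 + ... (rewards[0] is r_1)."""
    g = 0.0
    # Iterate backwards so each step applies G = r + γ * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```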

Mathematical formulations: Q-value
G_t = Σ_{i=0}^∞ γ^i r_{t+i+1}
Definition: The action-value (Q-value) function Q^π(s, a) is the expectation of G_t when taking action a in state s and following policy π afterwards: Q^π(s, a) = E_π[G_t | S_t = s, A_t = a, A_{t+i} = π(S_{t+i}) ∀ i ∈ N]
How good is action a in state s?

Mathematical formulations: optimal Q-value
Definition: The optimal Q-value function Q*(s, a) is the maximum Q-value over all policies: Q*(s, a) = max_π Q^π(s, a)
Theorem: There exists a policy π* such that Q^{π*}(s, a) = Q*(s, a) for all s, a. Thus, it suffices to find Q*.

Mathematical formulations: Q-Learning
Bellman optimality equation: Q*(s, a) satisfies Q*(s, a) = R(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q*(s', a')
Q-Learning: let a be ε-greedy w.r.t. Q, and let a' be greedy (optimal) w.r.t. Q at the next state s'. Q converges to Q* if we iteratively apply the update Q(s, a) ← α (R(s, a) + γ Q(s', a')) + (1 - α) Q(s, a)
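A minimal tabular sketch of this update and the ε-greedy behaviour policy (illustrative names, not from the slides):

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, n_actions, alpha=0.1, gamma=0.99):
    """One tabular step: Q(s,a) ← α (r + γ max_a' Q(s',a')) + (1 - α) Q(s,a)."""
    best_next = max(Q[(s_next, b)] for b in range(n_actions))
    Q[(s, a)] = alpha * (r + gamma * best_next) + (1 - alpha) * Q[(s, a)]

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    """Behaviour policy: a random action with probability ε, else greedy w.r.t. Q."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(s, a)])

Q = defaultdict(float)  # tabular Q; unseen (state, action) pairs default to 0
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, n_actions=2, alpha=0.5, gamma=0.9)
```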

Mathematical formulations: other approaches to RL
Value-based RL: estimate Q*(s, a), e.g. Deep Q Network.
Policy-based RL: search directly for the optimal policy π*, e.g. DDPG, TRPO, ...
Model-based RL: use the (learned or given) transition model of the environment, e.g. tree search, DYNA, ...

Outline Reinforcement Learning Definition of RL Mathematical formulations Algorithms Neural Fitted Q Iteration Deep Q Network Double Deep Q Network Prioritized Experience Replay Dueling Network

Neural Fitted Q Iteration: supervised learning on (s, a, r, s') tuples

Neural Fitted Q Iteration input

Neural Fitted Q Iteration target equation

Neural Fitted Q Iteration: shortcomings
Exploration is independent of experience.
Exploitation does not occur at all.
Policy evaluation does not occur at all.
Even in the exact (table-lookup) case, it is not guaranteed to converge to Q*.
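The regression step NFQ performs can be sketched as follows (an illustrative sketch: the function name is ours, and terminal-state handling is omitted). Each stored transition becomes one supervised training pair for the current Q estimate:

```python
def nfq_regression_set(transitions, q_fn, gamma, n_actions):
    """Turn stored (s, a, r, s') transitions into supervised pairs:
    input = (s, a), target = r + γ max_a' Q(s', a') under the current q_fn."""
    inputs, targets = [], []
    for s, a, r, s_next in transitions:
        target = r + gamma * max(q_fn(s_next, b) for b in range(n_actions))
        inputs.append((s, a))
        targets.append(target)
    return inputs, targets
```

A regressor (in NFQ, a neural network) is then fit to these pairs, and the process repeats with the new Q estimate.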

Outline Reinforcement Learning Definition of RL Mathematical formulations Algorithms Neural Fitted Q Iteration Deep Q Network Double Deep Q Network Prioritized Experience Replay Dueling Network

Deep Q Network action policy Choose an ɛ-greedy policy w.r.t. Q

Deep Q Network network freezing Get a copy of the network every C steps for stability
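The freezing schedule is simple to state in code (a sketch with parameters held in plain dicts; names are illustrative): targets are always computed with the frozen copy, which is refreshed only every C steps.

```python
def maybe_sync_target(step, C, online_params, target_params):
    """Every C steps, overwrite the frozen target parameters with a copy of
    the online ones; between copies, targets use the stale target_params."""
    if step % C == 0:
        target_params = dict(online_params)  # shallow copy of the snapshot
    return target_params
```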

Deep Q Network

Deep Q Network performance

Deep Q Network overoptimism What happens when we overestimate Q?

Deep Q Network: problems
Overestimation of the Q function at any s' spills over to actions that lead to s' (addressed by Double DQN).
Sampling transitions uniformly from D is inefficient (addressed by Prioritized Experience Replay).

Outline Reinforcement Learning Definition of RL Mathematical formulations Algorithms Neural Fitted Q Iteration Deep Q Network Double Deep Q Network Prioritized Experience Replay Dueling Network

Double Deep Q Network: target
We can write the DQN target as: Y_t^DQN = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t⁻); θ_t⁻)
Double DQN's target is: Y_t^DDQN = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻)
This has the effect of decoupling action selection from action evaluation.
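The two targets differ only in which network picks the action. A numpy sketch (function names are ours), given the Q-vectors both networks produce for the next state:

```python
import numpy as np

def dqn_target(r, q_next_target, gamma):
    """Y^DQN: the frozen target network both selects and evaluates the action."""
    a_star = int(np.argmax(q_next_target))
    return r + gamma * q_next_target[a_star]

def double_dqn_target(r, q_next_online, q_next_target, gamma):
    """Y^DDQN: the online network selects the action; the target network evaluates it."""
    a_star = int(np.argmax(q_next_online))
    return r + gamma * q_next_target[a_star]
```

When the target network happens to overestimate one action, plain DQN bootstraps from that inflated value, while Double DQN only does so if the online network also prefers that action.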

Double Deep Q Network performance: Much more stable, with very little change to the code

Outline Reinforcement Learning Definition of RL Mathematical formulations Algorithms Neural Fitted Q Iteration Deep Q Network Double Deep Q Network Prioritized Experience Replay Dueling Network

Prioritized Experience Replay Update more surprising experiences more frequently
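The sampling rule can be sketched in a few lines (illustrative only: real implementations use a sum-tree for efficiency and importance-sampling weights to correct the induced bias):

```python
import random

def sample_prioritized(priorities, k, alpha=0.6):
    """Sample k replay indices with probability proportional to priority^alpha,
    where the priority is typically |TD error| plus a small constant.
    alpha = 0 recovers uniform sampling."""
    weights = [p ** alpha for p in priorities]
    return random.choices(range(len(priorities)), weights=weights, k=k)
```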

Outline Reinforcement Learning Definition of RL Mathematical formulations Algorithms Neural Fitted Q Iteration Deep Q Network Double Deep Q Network Prioritized Experience Replay Dueling Network

Dueling Network The scalar approximates V, and the vector approximates A

Dueling Network: forward pass equation
The exact forward pass equation is: Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) - max_{a'} A(s, a'; θ, α))
The following module was found to be more stable without losing much performance: Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) - (1/|A|) Σ_{a'} A(s, a'; θ, α))
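The aggregation module itself is a one-liner once the two streams have produced their outputs; a numpy sketch of both variants (the function name is ours):

```python
import numpy as np

def dueling_q(v, adv, mean_version=True):
    """Combine the scalar V(s) and the advantage vector A(s, .) into Q(s, .).
    mean_version=True is the mean-subtracted module the paper found more stable;
    False is the exact max-subtracted form."""
    adv = np.asarray(adv, dtype=float)
    if mean_version:
        return v + (adv - adv.mean())
    return v + (adv - adv.max())
```

Subtracting the mean (or max) pins down the otherwise unidentifiable split between V and A, since adding a constant to A and subtracting it from V leaves Q unchanged.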

Dueling Network attention: The advantage stream only pays attention when the choice of current action is crucial

Dueling Network performance Achieves state of the art in the Atari domain among DQN algorithms

Dueling Network Summary
Since this is an improvement only in network architecture, methods that improve DQN (e.g. Double DQN) are all applicable here as well.
Solves the problem of V and A typically being of different scales.
Updates Q-values more frequently than a single-stream DQN, where only a single Q-value is updated for each observation.
Implicitly splits the credit assignment problem into a recursive binary problem of now or later.

Dueling Network Shortcomings
Only works for finite action spaces (|A| < ∞).
Still not able to solve tasks involving long-term planning.
Better than DQN, but sample complexity is still high.
ε-greedy exploration is essentially random guessing.