Deep reinforcement learning
Milica Gašić
Dialogue Systems Group, Cambridge University Engineering Department

In this lecture...
- Introduction to deep reinforcement learning
- Value-based Deep RL: Deep Q-network
- Policy-based Deep RL: Advantage actor-critic
- Model-based Deep RL

Deep reinforcement learning
Deep reinforcement learning is reinforcement learning in which the value function, the policy, or the model is approximated by a neural network. A neural network provides a non-linear function approximation, which is preferred in reinforcement learning. However, the approximation offers no interpretation and the estimate is only a local optimum, which is not always desirable.

Deep representations
A deep representation is a composition of many functions. Its gradient can be backpropagated by the chain rule.

Deep neural networks
A neural network transforms an input vector x into an output y:

h_0 = g_0(W_0 x + b_0)
h_i = g_i(W_i h_{i-1} + b_i), 0 < i < m
y = g_m(W_m h_{m-1} + b_m)

where
- g_i are (differentiable) activation functions, typically the hyperbolic tangent tanh or the sigmoid σ, 0 ≤ i ≤ m
- W_i, b_i are parameters to be estimated, 0 ≤ i ≤ m

In the regression case the network is trained to minimise the loss L = ||y* − y||² with stochastic gradient descent. In the classification case it minimises the cross entropy −∑_i y*_i log y_i.
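As a minimal sketch of this forward pass and loss (the layer sizes, tanh activations and numpy implementation are assumptions, not from the slides):

```python
import numpy as np

def forward(x, weights, biases):
    """Compute h_i = tanh(W_i h_{i-1} + b_i) for the hidden layers, linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)
    return weights[-1] @ h + biases[-1]        # regression output y

# Example: 4-dimensional input, one hidden layer of 8 units, scalar output.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(1, 8))]
biases = [np.zeros(8), np.zeros(1)]
y = forward(rng.normal(size=4), weights, biases)
loss = np.sum((1.0 - y) ** 2)                  # squared error against a target y* = 1.0
```

Stochastic gradient descent would then update each W_i and b_i along the negative gradient of this loss.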

Weight sharing
A recurrent neural network shares weights between time-steps.
A convolutional neural network shares weights between local regions.
[Figure: an unrolled recurrent network with input feature vectors, hidden layers and outputs at time-steps t0, t1, ..., tn; a convolutional network with input feature vector x and hidden layers h1, h2.]
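As a minimal illustration of weight sharing in the recurrent case (a plain-numpy tanh recurrence; the names are assumptions), the same parameters are applied at every time-step:

```python
import numpy as np

def rnn(inputs, W_h, W_x, b):
    """inputs: list of feature vectors x_t; returns the hidden states h_t."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:                  # one step per time-step, always the same weights
        h = np.tanh(W_h @ h + W_x @ x + b)
        states.append(h)
    return states
```

A convolutional layer shares weights in the same spirit: one small filter is applied at every local region of the input.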

Q-networks
Q-networks approximate the Q-function with a neural network. There are two architectures:
1. the Q-network takes an input (s, a) and produces Q(s, a)
2. the Q-network takes an input s and produces a vector Q(s, a_1), ..., Q(s, a_k)
[Figure: the two architectures side by side.]
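A minimal PyTorch sketch of the two architectures (layer sizes and activations are assumptions):

```python
import torch
import torch.nn as nn

class QNetworkSA(nn.Module):
    """Architecture 1: takes (s, a) and outputs a single value Q(s, a)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class QNetworkS(nn.Module):
    """Architecture 2: takes s and outputs the vector Q(s, a_1), ..., Q(s, a_k)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)
```

The second architecture is convenient for the DQN below, since it gives max_a' Q(s', a') in a single forward pass.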

Deep Q-network
Q(s, a, θ) is a neural network trained to minimise

MSVE = (r + γ max_a' Q(s', a', θ) − Q(s, a, θ))²

This is the Q-learning algorithm where the Q-function estimate is a neural network. The algorithm provides a biased estimate and it diverges because
- states are correlated
- targets are non-stationary
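A sketch of this naive update (assuming the second Q-network architecture above and batched PyTorch tensors; the names are illustrative). The same parameters θ appear in both the target and the prediction, which is what makes the targets non-stationary:

```python
import torch
import torch.nn.functional as F

def naive_q_loss(q_net, s, a, r, s_next, gamma=0.99):
    """MSVE with θ shared by target and prediction."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a, θ)
    with torch.no_grad():                                      # semi-gradient: no grad through the target
        target = r + gamma * q_net(s_next).max(dim=1).values   # r + γ max_a' Q(s', a', θ)
    return F.mse_loss(q_sa, target)
```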

DQN - Experience replay
To deal with correlated states, the agent builds a dataset of experience and samples from it at random.
To deal with non-stationary targets, the agent fixes the target parameters θ⁻ and only updates them with some frequency:

MSVE = (r + γ max_a' Q(s', a', θ⁻) − Q(s, a, θ))²
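A sketch of these two stabilisers, a replay buffer plus a periodically synchronised target network (buffer size, batch size and variable names are assumptions; q_net is the architecture-2 network from the sketch above):

```python
import copy
import random
from collections import deque

import torch
import torch.nn.functional as F

buffer = deque(maxlen=100_000)                    # experience dataset
target_net = copy.deepcopy(q_net)                 # frozen parameters θ⁻

def store(s, a, r, s_next, done):
    buffer.append((s, a, r, s_next, done))

def dqn_update(optimizer, batch_size=32, gamma=0.99):
    batch = random.sample(buffer, batch_size)     # random sampling decorrelates states
    s, a, r, s_next, done = map(torch.stack, zip(*batch))
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                         # targets use the fixed θ⁻
        target = r + gamma * (1 - done.float()) * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# With some frequency: target_net.load_state_dict(q_net.state_dict())
```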

Atari

DQN for Atari [Mnih et al., 2015]
- End-to-end learning of values Q(s, a) from pixels s
- State s is a stack of raw pixels from the last 4 frames
- Action a is one of 18 joystick/button positions
- Reward r is the change in score for that step

Results - Atari

Prioritised replay [Schaul et al., 2015]
- Related to prioritised sweeping in the Dyna-Q framework
- Instead of selecting experience at random, order the experience by some measure of priority
- The priority is typically proportional to the TD-error

δ = r + γ max_a' Q(s', a', θ⁻) − Q(s, a, θ)
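A sketch of proportional prioritised sampling over the stored TD-errors (the exponent α and the small constant ε are assumptions in the spirit of Schaul et al. (2015)):

```python
import numpy as np

def sample_indices(td_errors, batch_size, alpha=0.6, eps=1e-5):
    """Sample transitions with probability proportional to (|δ| + ε)^α."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```

After each update the TD-errors of the sampled transitions would be refreshed with their new values.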

Double DQN [van Hasselt et al., 2015]
Removes the upward bias caused by max_a' Q(s', a', θ⁻). The idea is to use two Q-networks:
1. the current Q-network θ is used to select actions
2. the older Q-network θ⁻ is used to evaluate actions

MSVE = (r + γ Q(s', argmax_a' Q(s', a', θ), θ⁻) − Q(s, a, θ))²
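A sketch of the Double DQN target computation (same assumed setup as the DQN sketch above):

```python
import torch
import torch.nn.functional as F

def double_dqn_loss(q_net, target_net, s, a, r, s_next, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1)              # select actions with the current θ
        q_next = target_net(s_next).gather(1, a_star.unsqueeze(1)).squeeze(1)  # evaluate with θ⁻
        target = r + gamma * q_next
    return F.mse_loss(q_sa, target)
```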

Dueling Q-network [Wang et al., 2015]
A dueling Q-network combines two streams to produce the Q-function:
1. one for the state value
2. another for the advantage function
The network can learn the value of states in which actions have no effect, and the dueling architecture can more quickly identify the correct action when actions are redundant.
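A sketch of the dueling architecture, combining the two streams as Q(s, a) = V(s) + A(s, a) − mean_a A(s, a) (layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # state-value stream V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # advantage stream A(s, a)

    def forward(self, s):
        h = self.features(s)
        v, adv = self.value(h), self.advantage(h)
        return v + adv - adv.mean(dim=-1, keepdim=True)
```

Subtracting the mean advantage keeps the decomposition into V and A identifiable.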

Dueling Q-network
[Figure: traditional DQN architecture versus the dueling DQN architecture.]
The value stream learns to pay attention to the road; the advantage stream learns to pay attention only when there are cars immediately in front.

Asynchronous deep reinforcement learning
- Exploits multithreading on a standard CPU
- Executes many instances of the agent in parallel
- Network parameters are shared between threads
- Parallelism decorrelates the data
- A viable alternative to experience replay

Policy approximation
The policy π is a neural network parametrised by ω ∈ R^n: π(a | s, ω).
The performance measure J(ω) is the value of the initial state:

J(ω) = V_{π(ω)}(s_0) = E_{π(ω)}[r_0 + γ r_1 + γ² r_2 + ...]

The update of the parameters is

ω_{t+1} = ω_t + α ∇J(ω_t)

and the gradient is given by the policy gradient theorem:

∇J(ω) = E_π[γ^t R_t ∇_ω log π(a_t | s_t, ω)]

This gives the REINFORCE algorithm for a neural network policy.
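A sketch of the resulting REINFORCE update for one episode (assuming a PyTorch policy whose log-probabilities log π(a_t | s_t, ω) were recorded while acting; the names are illustrative):

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """log_probs[t] = log π(a_t | s_t, ω); rewards[t] = reward after taking a_t."""
    returns, R = [], 0.0
    for r in reversed(rewards):                   # return recursion R_t = r + γ R_{t+1}
        R = r + gamma * R
        returns.insert(0, R)
    loss = 0.0
    for t, (logp, R_t) in enumerate(zip(log_probs, returns)):
        loss = loss - (gamma ** t) * R_t * logp   # ascend ∇J = E[γ^t R_t ∇_ω log π]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```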

Natural actor-critic with neural network approximations
- Approximate the advantage function by a neural network: γ^t A(s, a, θ)
- Approximate the policy by a neural network: π(a | s, ω)
- Critic evaluation: choose θ and J to minimise ∑_t (γ^t A(s_t, a_t, θ) + J − R)²
- Actor update: ω ← ω + α θ, using compatible function approximation, where θ is the natural gradient of J(ω)

Advantage actor-critic [Mnih et al., 2016]
- Approximate the policy by a neural network π(a | s, ω)
- Define the objective J(ω) = V_{π(ω)}(s_0) = E_{π(ω)}[r_0 + γ r_1 + γ² r_2 + ...]
- Update ω with ∇J(ω), where ∇J(ω) = E_π[γ^t (R_t − V(s_t, θ)) ∇_ω log π(a_t | s_t, ω)]
- Approximate the value function by a neural network V(s, θ)
- Define the loss L(θ) = γ^t (R_t − V(s_t, θ))²
- Update θ with ∇L(θ)
- Compatible function approximation: ∇J(ω) depends on the current estimate of V(s, θ)

Advantage actor-critic

Algorithm 1 Advantage actor-critic
 1: Input: neural network parametrisation of π(ω)
 2: Input: neural network parametrisation of V(θ)
 3: repeat
 4:   Initialise θ, ω, V(terminal, θ) = 0
 5:   Initialise s_0
 6:   Obtain an episode s_0, a_0, r_1, ..., r_T, s_T according to π(ω)
 7:   R_T = 0, ∇J = 0, ∇L = 0
 8:   for t = T downto 0 do
 9:     R_{t−1} = r_t + γ V(s_t, θ)
10:     ∇J = ∇J + γ^t (R_t − V(s_t, θ)) ∇_ω log π(a_t | s_t, ω)
11:     ∇L = ∇L + γ^t ∇_θ (R_t − V(s_t, θ))²
12:   end for
13:   ω = ω + α ∇J
14:   θ = θ − β ∇L
15: until convergence
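A sketch of one update of this algorithm in PyTorch (assuming the per-step log-probabilities log π(a_t | s_t, ω) and values V(s_t, θ) were recorded while generating the episode, with separate networks and optimisers for actor and critic; the names are illustrative):

```python
import torch

def a2c_update(actor_opt, critic_opt, log_probs, values, rewards, gamma=0.99):
    """log_probs[t] = log π(a_t | s_t, ω); values[t] = V(s_t, θ); rewards[t] = r_{t+1}.
    The terminal value V(s_T, θ) is taken to be 0, as in Algorithm 1."""
    T = len(rewards)
    actor_loss, critic_loss = 0.0, 0.0
    for t in range(T):
        v_next = values[t + 1].detach() if t + 1 < T else torch.zeros(())
        R_t = rewards[t] + gamma * v_next            # bootstrapped target r_{t+1} + γ V(s_{t+1}, θ)
        advantage = R_t - values[t].detach()         # do not backpropagate the actor through V
        actor_loss = actor_loss - (gamma ** t) * advantage * log_probs[t]
        critic_loss = critic_loss + (gamma ** t) * (R_t - values[t]) ** 2
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
```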

Model-based Deep RL
The Dyna-Q framework can be used, where the transition probabilities, the rewards and the Q-function are all approximated by neural networks. Planning is challenging due to compounding errors:
- errors in the transition model compound over the trajectory
- planned trajectories differ from executed trajectories
- at the end of a long, unusual trajectory, the rewards are totally wrong

Summary
- Neural networks can be used to approximate the value function, the policy or the model in reinforcement learning.
- Any algorithm that assumes a parametric approximation can be applied with neural networks.
- However, vanilla versions might not converge due to biased estimates and correlated samples.
- With methods such as prioritised replay, double Q-networks or dueling networks, stability can be achieved.
- Neural networks can also be applied to actor-critic methods.
- Using them for model-based methods does not always work well due to compounding errors.

References I
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529-533.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized experience replay. CoRR, abs/1511.05952.

References II
van Hasselt, H., Guez, A., and Silver, D. (2015). Deep reinforcement learning with double Q-learning. CoRR, abs/1509.06461.
Wang, Z., de Freitas, N., and Lanctot, M. (2015). Dueling network architectures for deep reinforcement learning. CoRR, abs/1511.06581.