Deep Reinforcement Learning
Jeremy Morton (jmorton2)
November 7, 2016
SISL, Stanford Intelligent Systems Laboratory

Overview
1. Motivation
2. Neural Networks
3. Deep Reinforcement Learning
4. Deep Learning Frameworks
5. Additional Resources
[Image: Learning to Play Atari Games]

Reinforcement Learning Challenges
1. Balancing exploration with exploitation
2. Credit assignment
3. Generalizing from experience
How can we generalize effectively in large (and possibly continuous) state and action spaces?
[Images: Driving from Images; Continuous Action Space]

Global Approximation
Recall global approximation: Q(s, a) = θ_a^T β(s). What is β(s)? If β(s) = s, this may not be a good approximation of Q(s, a). But how can we learn appropriate features and weights so that this approximation is accurate?
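As a concrete illustration, here is a minimal NumPy sketch of this global linear approximation. The feature map β, the state dimension, and the number of actions are made up for the example; only the form Q(s, a) = θ_a^T β(s) comes from the slide.

```python
import numpy as np

def beta(s):
    # Hypothetical hand-crafted feature map beta(s): the raw state plus
    # a bias term and a squared term.
    return np.array([1.0, s[0], s[0] ** 2])

n_actions, n_features = 2, 3
theta = np.zeros((n_actions, n_features))   # one weight vector per action

def q_value(s, a):
    # Global linear approximation: Q(s, a) = theta_a^T beta(s)
    return theta[a] @ beta(s)

s = np.array([0.5])
print(q_value(s, 0))   # 0.0 until theta is learned
```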

Neural Networks: Overview
Here, we have a supervised learning problem and would like to approximate y.
The first layer computes a pre-activation i_1 = W_1 x + b_1, which is passed through a nonlinearity to give the activation o_1 = σ(W_1 x + b_1).
Stacking further layers gives the network output:
ŷ = W_3 σ(W_2 σ(W_1 x + b_1) + b_2) + b_3
Neural networks transform inputs via a set of matrix multiplications and element-wise nonlinearities.
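For illustration, a minimal NumPy sketch of this forward pass; the layer sizes and random weights are arbitrary choices, not anything from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 4 inputs, two hidden layers of 8 units, scalar output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(1, 8)), np.zeros(1)

def forward(x):
    # y_hat = W3 sigma(W2 sigma(W1 x + b1) + b2) + b3
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    return W3 @ h2 + b3

x = rng.normal(size=4)
print(forward(x))
```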

Neural Network Properties
Neural networks are universal function approximators. Recall the properties of the sigmoid function: we can control its sharpness and shift it.
[Plots of σ(x), σ(1000x), and σ(1000x − 5000)]

Neural Network Properties
Neural networks are universal function approximators. Imagine we wanted to approximate f(x) = x:
[Plot of f(x) = x over 0 ≤ x ≤ 3]

Neural Network Properties
Neural networks are universal function approximators. Imagine you have a neural network with three neurons that take in x and output the following:
[Plots of σ(1000x − 1000), σ(1000x − 2000), and 2σ(1000x − 2000)]

Neural Network Properties
Neural networks are universal function approximators. Now sum the neuron outputs together:
[Plot of the summed neuron outputs over 0 ≤ x ≤ 3]

Neural Network Properties
Neural networks are universal function approximators. As more neurons are added, the approximation to f(x) improves.
[Plot of the approximation with more neurons over 0 ≤ x ≤ 3]
Good explanation here.
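A small NumPy sketch of this construction, assuming the staircase is built from sharp sigmoids spaced evenly over [0, 3]; the exact shifts and step heights are illustrative choices rather than the ones plotted on the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60, 60)))   # clip avoids overflow

def staircase(x, n_steps=3, x_max=3.0):
    # Sum of sharp, shifted sigmoids: each term contributes one step of
    # height x_max / n_steps, producing a staircase that tracks f(x) = x.
    step = x_max / n_steps
    y = np.zeros_like(x)
    for k in range(1, n_steps + 1):
        y += step * sigmoid(1000.0 * (x - (k - 0.5) * step))
    return y

x = np.linspace(0.0, 3.0, 7)
print(np.round(staircase(x), 2))              # coarse approximation of x
print(np.round(staircase(x, n_steps=30), 2))  # improves with more neurons
```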

Neural Network Properties
Neural networks can be trained efficiently through backpropagation. Example for a network with a single hidden layer:
ŷ = W_2 σ(W_1 x + b_1) + b_2
If the loss function L is a function of ŷ, then by the chain rule:
∂L/∂W_2 = (∂L/∂ŷ)(∂ŷ/∂W_2) = (∂L/∂ŷ) σ(W_1 x + b_1)^T
Finally, we can update W_2 using gradient descent:
W_2 ← W_2 − α ∂L/∂W_2
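The same single-hidden-layer example written out as a hedged NumPy sketch: the forward pass, the chain-rule gradients, and the gradient-descent update. The layer sizes, data, and learning rate are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)
alpha = 0.1                              # learning rate

x = rng.normal(size=3)
y = np.array([2.0])                      # regression target

for _ in range(100):
    # Forward pass: y_hat = W2 sigma(W1 x + b1) + b2
    h = sigmoid(W1 @ x + b1)
    y_hat = W2 @ h + b2

    # Backward pass for the squared loss L = 0.5 (y_hat - y)^2
    dL_dyhat = y_hat - y                 # dL/dy_hat
    dL_dW2 = np.outer(dL_dyhat, h)       # dL/dW2 = (dL/dy_hat) h^T
    dL_db2 = dL_dyhat
    dL_dh = W2.T @ dL_dyhat
    dL_dz1 = dL_dh * h * (1.0 - h)       # sigmoid derivative
    dL_dW1 = np.outer(dL_dz1, x)
    dL_db1 = dL_dz1

    # Gradient-descent updates, e.g. W2 <- W2 - alpha dL/dW2
    W2 -= alpha * dL_dW2; b2 -= alpha * dL_db2
    W1 -= alpha * dL_dW1; b1 -= alpha * dL_db1

print((W2 @ sigmoid(W1 @ x + b1) + b2).item())   # should be close to 2.0
```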

Deep Q-Learning
V. Mnih, et al. (2013). Playing Atari with Deep Reinforcement Learning. Available at https://arxiv.org/abs/1312.5602.

Deep Q-Learning: Objective
Recall the Bellman equation:
Q*(s, a) = E_{s' ~ E}[R(s, a) + γ max_{a'} Q*(s', a')]
We estimate the value of Q*(s, a) using a neural network with parameters θ. At iteration i, the loss function is given by the temporal difference error:
L_i(θ_i) = E_{s,a ~ ρ(·); s' ~ E}[(R(s, a) + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i))^2]
where θ_{i−1} are the parameter values from the previous iteration.
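A rough NumPy sketch of this loss, assuming a stand-in linear Q-function and a hand-built batch of transitions; it only illustrates how the target uses the previous parameters θ_{i−1} while the prediction uses θ_i.

```python
import numpy as np

gamma = 0.99

def q_values(states, theta):
    # Stand-in Q "network": a linear map from state features to per-action
    # values; a real implementation would use a deep network.
    return states @ theta

def dqn_loss(batch, theta, theta_prev):
    states, actions, rewards, next_states, dones = batch
    # TD target built with the previous-iteration parameters theta_prev.
    q_next = q_values(next_states, theta_prev).max(axis=1)
    targets = rewards + gamma * (1.0 - dones) * q_next
    # Q(s, a; theta_i) for the actions actually taken.
    q_pred = q_values(states, theta)[np.arange(len(actions)), actions]
    return np.mean((targets - q_pred) ** 2)

# Tiny made-up example: 4 transitions, 3 state features, 2 actions.
rng = np.random.default_rng(2)
theta = rng.normal(size=(3, 2))
batch = (rng.normal(size=(4, 3)), rng.integers(0, 2, size=4),
         rng.normal(size=4), rng.normal(size=(4, 3)), np.zeros(4))
print(dqn_loss(batch, theta, theta_prev=theta.copy()))
```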

Deep Q-Learning: Challenges
Often in supervised learning it is assumed that successive samples are i.i.d. However, in reinforcement learning successive samples are highly correlated. To combat this, transitions were stored in a replay memory D. During training, random transitions (s_t, a_t, r_t, s_{t+1}) were sampled from D and used to train the network.
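A minimal sketch of such a replay memory in Python, assuming uniform random sampling from a bounded buffer; the capacity and the dummy transitions are placeholders.

```python
import random
from collections import deque

class ReplayMemory:
    # Minimal replay memory: store transitions, sample them uniformly at random.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop off

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between successive samples.
        return random.sample(list(self.buffer), batch_size)

memory = ReplayMemory()
for t in range(1000):
    memory.push(t, t % 4, 1.0, t + 1)          # dummy transitions
print(len(memory.sample(32)))                  # 32
```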

Deep Q-Learning: Challenges
When performing regression in supervised learning, it is common to use the least-squares loss:
L = (1/2)(y − ŷ)^2
where y is the target value and ŷ is the model prediction. In the case of Q-learning, this target value is:
y = r_t + γ max_{a'} Q(s_{t+1}, a'; θ)
Because of this, the network target value changes as the network is trained. The deep Q-learning algorithm uses a target network whose parameters are changed less frequently in order to have a stationary target value.

Deep Q-Learning: Challenges
Input images are 210 × 160 pixels with three color channels, so the inputs and state space are huge. Images are downsampled, converted to grayscale, and cropped to 84 × 84 pixels. Individual images are static, so many states are perceptually aliased; the input to the network is therefore a sequence of 4 preprocessed frames.
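A rough sketch of this preprocessing in NumPy. The grayscale conversion and nearest-neighbor resize here are simplifications of the pipeline described in the paper, and the helper names are made up.

```python
import numpy as np

def preprocess(frame):
    # Average the color channels to get grayscale, then nearest-neighbor
    # resize to 84 x 84. (The original work resizes and crops the playing
    # area more carefully.)
    gray = frame.mean(axis=2)                      # 210 x 160 x 3 -> 210 x 160
    rows = np.linspace(0, gray.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, 84).astype(int)
    return gray[np.ix_(rows, cols)]                # 84 x 84

def network_input(recent_frames):
    # The network sees a stack of the 4 most recent preprocessed frames.
    return np.stack([preprocess(f) for f in recent_frames[-4:]], axis=0)

frames = [np.zeros((210, 160, 3)) for _ in range(4)]
print(network_input(frames).shape)   # (4, 84, 84)
```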

Deep Q-Learning: Network Architecture

Deep Q-Learning: Algorithm

Deep Q-Learning: Results

Policy Gradient Methods
In deep Q-learning, a neural network is used to approximate Q*(s, a). In policy gradient methods, neural networks are instead used to represent a policy π_θ(a | s). Here, the policy is stochastic, i.e. it provides a distribution over possible actions. The parameters θ are optimized to maximize the expected future reward E_{a ~ π_θ}[Q^{π_θ}(s, a)]. Policy gradient methods can be applied to both discrete and continuous action spaces.
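As an illustration, a small NumPy sketch of a stochastic policy for a discrete action space: a tiny network whose softmax output is π_θ(a | s), from which actions are sampled. The architecture and sizes are arbitrary choices for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical tiny policy network: one hidden layer, softmax over 2 actions.
rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(16, 4)), np.zeros(16)
W2, b2 = rng.normal(size=(2, 16)), np.zeros(2)

def policy(s):
    # pi_theta(a | s): a distribution over the discrete actions.
    h = np.tanh(W1 @ s + b1)
    return softmax(W2 @ h + b2)

s = rng.normal(size=4)
probs = policy(s)
a = rng.choice(len(probs), p=probs)   # sample an action from the policy
print(probs, a)
```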

Calculating Gradients
How do we find gradients for the policy parameters?
∇_θ E_{a ~ π_θ}[Q^{π_θ}(s, a)] = ∇_θ ∫ π_θ(a | s) Q^{π_θ}(s, a) da
= ∫ ∇_θ π_θ(a | s) Q^{π_θ}(s, a) da
= ∫ (∇_θ π_θ(a | s) / π_θ(a | s)) π_θ(a | s) Q^{π_θ}(s, a) da
= ∫ ∇_θ log π_θ(a | s) π_θ(a | s) Q^{π_θ}(s, a) da
= E_{a ~ π_θ}[∇_θ log π_θ(a | s) Q^{π_θ}(s, a)]
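A quick numerical sanity check of the final identity, assuming a simple softmax policy over two actions with a scalar parameter and a fixed, made-up Q(s, a) table: the score-function form E_{a ~ π_θ}[∇_θ log π_θ(a | s) Q(s, a)] should match a finite-difference gradient of E_{a ~ π_θ}[Q(s, a)].

```python
import numpy as np

q = np.array([1.0, 3.0])                     # made-up Q(s, a) for two actions

def pi(theta):
    # Softmax policy over two actions with a single scalar parameter.
    logits = np.array([theta, -theta])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    # d/dtheta log pi_theta(a), with logits [theta, -theta].
    p = pi(theta)
    dlogits = np.array([1.0, -1.0])
    return dlogits[a] - p @ dlogits

theta, eps = 0.3, 1e-5
# Score-function form: sum_a pi(a) grad log pi(a) Q(a).
score = sum(pi(theta)[a] * grad_log_pi(theta, a) * q[a] for a in range(2))
# Finite-difference gradient of E_{a ~ pi_theta}[Q(a)].
fd = (pi(theta + eps) @ q - pi(theta - eps) @ q) / (2 * eps)
print(score, fd)                             # the two should agree
```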

Policy Gradients: Estimating Q
Given a sample trajectory {s_1, a_1, r_1, ..., s_{T−1}, a_{T−1}, r_T}, we can create an empirical estimate of Q:
Q̂^{π_θ}(s_t, a_t) = Σ_{k=t}^{T} γ^{k−1} r_k
Then perform the following parameter update at each time step:
θ ← θ + α ∇_θ log π_θ(a_t | s_t) Q̂^{π_θ}(s_t, a_t)
This is known as the REINFORCE algorithm. It is also possible to use a second neural network to estimate Q; this is known as an actor-critic algorithm.
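A hedged sketch of the REINFORCE update in NumPy, assuming a helper grad_log_pi that returns ∇_θ log π_θ(a | s) for whatever policy parameterization is in use; everything in the toy usage is made up.

```python
import numpy as np

gamma, alpha = 0.99, 0.01

def reinforce_update(theta, trajectory, grad_log_pi):
    # One REINFORCE pass over a trajectory of (s_t, a_t, r_t) tuples.
    # grad_log_pi(theta, s, a) is assumed to return the gradient of
    # log pi_theta(a | s) with respect to theta.
    rewards = [r for (_, _, r) in trajectory]
    T = len(trajectory)
    for t, (s, a, _) in enumerate(trajectory):
        # Empirical return from time t (discounting as on the slide).
        q_hat = sum(gamma ** k * rewards[k] for k in range(t, T))
        # theta <- theta + alpha * grad log pi_theta(a_t | s_t) * Q_hat
        theta = theta + alpha * grad_log_pi(theta, s, a) * q_hat
    return theta

# Toy usage with a made-up 1-D score function, just to show the update shape.
grad_log_pi = lambda theta, s, a: np.array([s * (a - 0.5)])
trajectory = [(0.2, 1, 1.0), (0.4, 0, 0.0), (0.1, 1, 2.0)]
print(reinforce_update(np.zeros(1), trajectory, grad_log_pi))
```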

Policy Gradients: Demonstration
Full video here.

Further Reading
J. Schulman, et al. (2015). Trust Region Policy Optimization. Available at http://arxiv.org/abs/1502.05477.
J. Schulman, et al. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. Available at https://arxiv.org/abs/1506.02438.
T. Lillicrap, et al. (2015). Continuous control with deep reinforcement learning. Available at https://arxiv.org/abs/1509.02971.
D. Silver, et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp. 484-489. (AlphaGo paper)
Y. Duan, et al. (2016). Benchmarking Deep Reinforcement Learning for Continuous Control. Available at https://arxiv.org/abs/1604.06778. (Good survey)

Deep Learning Frameworks
Comparison taken from CS 231n course slides. Full presentation here. Also, check out Keras.

Additional Resources
Neural Networks:
CS 231n Course Notes (http://cs231n.stanford.edu)
Michael Nielsen's free textbook Neural Networks and Deep Learning (http://neuralnetworksanddeeplearning.com)
Deep Learning by Goodfellow, Bengio, and Courville (http://www.deeplearningbook.org)
Deep Reinforcement Learning:
CS 294 Course Notes (UC Berkeley) (http://rll.berkeley.edu/deeprlcourse/)
David Silver's RL Course Notes (http://www0.cs.ucl.ac.uk/staff/d.silver/web/teaching.html)
OpenAI Gym (https://gym.openai.com)

Thank You
Questions?