Deep Reinforcement Learning Jeremy Morton (jmorton2) November 7, 2016 SISL Stanford Intelligent Systems Laboratory
Overview 2
1. Motivation
2. Neural Networks
3. Deep Reinforcement Learning
4. Deep Learning Frameworks
5. Additional Resources
(Figure: Learning to Play Atari Games)
Reinforcement Learning Challenges 3
1. Balancing exploration with exploitation
2. Credit assignment
3. Generalizing from experience
How can we generalize effectively in large (and possibly continuous) state and action spaces?
(Figures: Driving from Images; Continuous Action Space)
Global Approximation 4
Recall global approximation: $Q(s, a) = \theta_a^\top \beta(s)$
What is $\beta(s)$? If $\beta(s) = s$, then this may not be a good approximation of $Q(s, a)$. But how can we learn appropriate features and weights so that this approximation is accurate?
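As a rough sketch of what global approximation looks like in code (the basis functions, sizes, learning rate, and example values below are illustrative assumptions, not from the slides):

```python
import numpy as np

def beta(s):
    """Hypothetical basis functions for a 2-D continuous state (illustration only)."""
    x, y = s
    return np.array([1.0, x, y, x * y, x**2, y**2])

n_actions = 4
theta = np.zeros((n_actions, 6))  # one weight vector theta_a per action

def Q(s, a):
    # Global approximation: Q(s, a) = theta_a^T beta(s)
    return theta[a] @ beta(s)

def update(s, a, target, alpha=0.01):
    # Move theta_a toward a target value (e.g., a TD target) by gradient descent
    theta[a] += alpha * (target - Q(s, a)) * beta(s)

update(np.array([0.5, -1.0]), a=2, target=1.0)
```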
Neural Networks: Overview 5
Here, we have a supervised learning problem and would like to approximate y.
The first hidden layer computes a pre-activation $i_1 = W_1 x + b_1$ and an output $o_1 = \sigma(W_1 x + b_1)$.
Stacking layers gives the network output $\hat{y} = W_3 \sigma(W_2 \sigma(W_1 x + b_1) + b_2) + b_3$.
Neural networks transform inputs via a set of matrix multiplications and element-wise nonlinearities.
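A minimal NumPy sketch of this forward pass, assuming arbitrary layer sizes and random weights purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Matches y_hat = W3 σ(W2 σ(W1 x + b1) + b2) + b3; sizes are arbitrary choices.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 16)), np.zeros(16)
W3, b3 = rng.normal(size=(1, 16)), np.zeros(1)

def forward(x):
    o1 = sigmoid(W1 @ x + b1)   # first hidden layer
    o2 = sigmoid(W2 @ o1 + b2)  # second hidden layer
    return W3 @ o2 + b3         # linear output layer

y_hat = forward(np.ones(4))
```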
Neural Network Properties 6
Neural networks are universal function approximators.
Recall the properties of the sigmoid function: we can control its sharpness and shift it.
(Plots: $\sigma(x)$, $\sigma(1000x)$, $\sigma(1000x - 5000)$)
Imagine we wanted to approximate $f(x) = x$ on $[0, 3]$.
Imagine you have a neural network with three neurons that take in $x$ and output the following:
(Plots: $\sigma(1000x - 1000)$, $\sigma(1000x - 2000)$, $2\sigma(1000x - 2000)$)
Now sum the neuron outputs together.
(Plot: the summed neuron outputs form a step-function approximation of $f(x) = x$ on $[0, 3]$)
As more neurons are added, the approximation to $f(x)$ improves. Good explanation here.
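The staircase idea can be checked numerically. The sketch below sums many steep, shifted sigmoids to approximate $f(x) = x$ on $[0, 3]$; the step locations and weights are illustrative choices rather than the exact neurons from the slide:

```python
import numpy as np

def sigmoid(z):
    z = np.clip(z, -60.0, 60.0)  # avoid overflow in exp for very large |z|
    return 1.0 / (1.0 + np.exp(-z))

# Approximate f(x) = x on [0, 3] with a staircase of steep, shifted sigmoids.
def staircase(x, n_neurons=30):
    steps = np.linspace(0.0, 3.0, n_neurons, endpoint=False)  # where each neuron "switches on"
    weight = 3.0 / n_neurons                                  # height contributed by each step
    return sum(weight * sigmoid(1000.0 * (x - s)) for s in steps)

x = np.linspace(0.0, 3.0, 301)
print(np.max(np.abs(staircase(x) - x)))  # max error shrinks as n_neurons grows
```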
Neural Network Properties 7
Neural networks can be trained efficiently through backpropagation.
Example for a network with a single hidden layer: $\hat{y} = W_2 \sigma(W_1 x + b_1) + b_2$
If the loss function $L$ is a function of $\hat{y}$, then by the chain rule:
$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \, \sigma(W_1 x + b_1)$
Finally, we can update $W_2$ using gradient descent: $W_2 \leftarrow W_2 - \alpha \frac{\partial L}{\partial W_2}$
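A small NumPy sketch of one such gradient-descent step, assuming a least-squares loss and illustrative layer sizes (this is a hand-derived example, not code from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One step for y_hat = W2 σ(W1 x + b1) + b2 with L = 0.5 (y - y_hat)^2.
rng = np.random.default_rng(0)
D, H = 3, 8
W1, b1 = rng.normal(size=(H, D)), np.zeros(H)
W2, b2 = rng.normal(size=(1, H)), np.zeros(1)
alpha = 0.1

def sgd_step(x, y):
    global W1, b1, W2, b2
    # forward pass
    h = sigmoid(W1 @ x + b1)
    y_hat = W2 @ h + b2
    # backward pass (chain rule)
    dL_dyhat = y_hat - y                 # dL/dy_hat
    dL_dW2 = np.outer(dL_dyhat, h)       # dL/dW2 = dL/dy_hat * σ(W1 x + b1)
    dL_db2 = dL_dyhat
    dL_dh = W2.T @ dL_dyhat
    dL_dz1 = dL_dh * h * (1.0 - h)       # sigmoid derivative
    dL_dW1 = np.outer(dL_dz1, x)
    dL_db1 = dL_dz1
    # gradient-descent updates: W <- W - alpha * dL/dW
    W2 -= alpha * dL_dW2; b2 -= alpha * dL_db2
    W1 -= alpha * dL_dW1; b1 -= alpha * dL_db1

sgd_step(np.array([1.0, 2.0, 3.0]), np.array([1.5]))
```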
Deep Q-Learning 8 V. Mnih, et al. (2013). Playing Atari with Deep Reinforcement Learning. Available at https://arxiv.org/abs/1312.5602
Deep Q-Learning: Objective 9
Recall the Bellman equation:
$Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[ R(s, a) + \gamma \max_{a'} Q^*(s', a') \right]$
We estimate the value of $Q^*(s, a)$ using a neural network with parameters $\theta$. At iteration $i$, the loss function is given by the temporal-difference error:
$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s' \sim \mathcal{E}}\left[ \left( R(s, a) + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right)^2 \right]$
where $\theta_{i-1}$ are the parameter values from the previous iteration.
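A minimal sketch of this loss for a single transition, with placeholder Q-functions standing in for the networks at iterations $i$ and $i-1$ (all values are made up for illustration):

```python
import numpy as np

def q_net(s):       # stands in for Q(s, .; θ_i), one value per action
    return np.array([0.1, 0.4, -0.2])

def q_net_prev(s):  # stands in for Q(s, .; θ_{i-1})
    return np.array([0.0, 0.3, 0.1])

def td_loss(s, a, r, s_next, gamma=0.99):
    target = r + gamma * np.max(q_net_prev(s_next))  # R(s,a) + γ max_a' Q(s', a'; θ_{i-1})
    return (target - q_net(s)[a]) ** 2               # squared TD error

print(td_loss(s=None, a=1, r=1.0, s_next=None))
```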
Deep Q-Learning: Challenges 10
In supervised learning, it is often assumed that successive samples are i.i.d.
In reinforcement learning, however, successive samples are highly correlated.
To combat this, transitions were stored in a replay memory $D$.
During training, random transitions $(s_t, a_t, r_t, s_{t+1})$ were sampled from $D$ and used to train the network.
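A possible replay-memory sketch (capacity and batch size are arbitrary choices):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are dropped when full

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between successive transitions.
        return random.sample(list(self.buffer), batch_size)
```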
Deep Q-Learning: Challenges 11
When performing regression in supervised learning, it is common to use the least-squares loss:
$L = \frac{1}{2}(y - \hat{y})^2$
where $y$ is the target value and $\hat{y}$ is the model prediction. In the case of Q-learning, this target value is:
$y = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$
Because of this, the network's target value changes as the network is trained. The deep Q-learning algorithm uses a target network, whose parameters are changed less frequently, in order to have a stationary target value.
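One common way to realize this, sketched below with placeholder parameters, is to copy the online network's weights into the target network only every fixed number of updates (the sync interval here is an assumption, not the paper's value):

```python
import numpy as np

# The target network is a periodically refreshed copy of the online parameters,
# so the TD target stays fixed between refreshes.
online_params = {"W": np.zeros((4, 4))}                     # placeholder parameters
target_params = {k: v.copy() for k, v in online_params.items()}

def maybe_sync_target(step, sync_every=10_000):
    global target_params
    if step % sync_every == 0:
        target_params = {k: v.copy() for k, v in online_params.items()}
```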
Deep Q-Learning: Challenges 12
Input images are 210 × 160 pixels with three color channels, so the inputs and state space are huge.
Images are downsampled, converted to grayscale, and cropped to be 84 × 84 pixels.
Individual images are static, so many states are perceptually aliased.
The input to the network is therefore a sequence of 4 preprocessed frames.
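A rough preprocessing sketch is shown below; the grayscale conversion, naive downsampling, cropping, and padding are simplifications for illustration and do not reproduce the paper's exact pipeline:

```python
import numpy as np
from collections import deque

def preprocess(frame):                      # frame: (210, 160, 3) uint8
    gray = frame.mean(axis=2)               # crude grayscale conversion
    small = gray[::2, ::2]                  # naive 2x downsample -> (105, 80)
    crop = small[13:97, :]                  # keep an 84-row band -> (84, 80)
    out = np.zeros((84, 84))
    out[:, :80] = crop                      # pad columns to 84 (illustrative, not the paper's crop)
    return out

frames = deque(maxlen=4)                    # most recent 4 preprocessed frames

def make_input(frame):
    frames.append(preprocess(frame))
    while len(frames) < 4:                  # repeat the first frame at episode start
        frames.append(frames[-1])
    return np.stack(frames, axis=0)         # network input of shape (4, 84, 84)
```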
Deep Q-Learning: Network Architecture 13
Deep Q-Learning: Algorithm 14
Deep Q-Learning: Results 15
Policy Gradient Methods 16
In deep Q-learning, a neural network is used to approximate $Q^*(s, a)$.
In policy gradient methods, neural networks are instead used to represent a policy $\pi_\theta(a \mid s)$.
Here, the policy is stochastic, i.e. it provides a distribution over possible actions.
The parameters $\theta$ are optimized to maximize the expectation over future rewards, $\mathbb{E}_{a \sim \pi_\theta}[Q^{\pi_\theta}(s, a)]$.
Policy gradient methods can be applied to both discrete and continuous action spaces.
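As a hedged sketch of what such a stochastic policy can look like, here is a linear softmax parameterization over discrete actions with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_state_dim, n_actions = 4, 3
theta = rng.normal(scale=0.1, size=(n_actions, n_state_dim))

def policy(s):
    # π_θ(. | s): softmax over action logits
    logits = theta @ s
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def sample_action(s):
    # The policy is stochastic: actions are drawn from the distribution it defines.
    return rng.choice(n_actions, p=policy(s))
```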
Calculating Gradients 17
How do we find gradients for the policy parameters?
$\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[Q^{\pi_\theta}(s, a)] = \nabla_\theta \int \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\, da$
$= \int \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\, da$
$= \int \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}\, \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\, da$
$= \int \nabla_\theta \log \pi_\theta(a \mid s)\, \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\, da$
$= \mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)]$
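This score-function estimator can be sanity-checked numerically. The toy setup below (a Gaussian policy with a simple quadratic Q and no state dependence) is an assumption made purely for illustration:

```python
import numpy as np

# Policy a ~ N(θ, σ²), Q(a) = -(a - 2)²; compare the sampled estimate to the exact gradient.
rng = np.random.default_rng(0)
theta, sigma = 0.0, 1.0
a = rng.normal(theta, sigma, size=500_000)

Q = -(a - 2.0) ** 2
score = (a - theta) / sigma**2            # ∇_θ log π_θ(a) for a Gaussian policy
grad_estimate = np.mean(score * Q)        # E_{a~π_θ}[∇_θ log π_θ(a) Q(a)]
grad_exact = -2.0 * (theta - 2.0)         # d/dθ of E[Q] = -(θ-2)² - σ²

print(grad_estimate, grad_exact)          # the two values should be close
```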
Policy Gradients: Estimating Q 18
Given a sample trajectory $\{s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_T\}$, we can create an empirical estimate of Q:
$\hat{Q}^{\pi_\theta}(s_t, a_t) = \sum_{k=t}^{T} \gamma^{k-1} r_k$
Then perform the following parameter update at each time step:
$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{Q}^{\pi_\theta}(s_t, a_t)$
This is known as the REINFORCE algorithm. It is also possible to use a second neural network to estimate Q; this is known as an actor-critic algorithm.
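A compact REINFORCE sketch for a linear softmax policy; the environment interface (env.reset / env.step) and all sizes and hyperparameters are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_state_dim, n_actions, alpha, gamma = 4, 2, 0.01, 0.99
theta = np.zeros((n_actions, n_state_dim))

def policy(s):
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(s, a):
    # ∇_θ log π_θ(a|s) for a linear softmax policy: (one_hot(a) - π(.|s)) ⊗ s
    p = policy(s)
    return np.outer(np.eye(n_actions)[a] - p, s)

def reinforce_episode(env):
    global theta
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        a = rng.choice(n_actions, p=policy(s))
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    # empirical Q estimate (discounted from the episode start, as on the slide), then the update
    for t in range(len(rewards)):
        Q_hat = sum(gamma**k * rewards[k] for k in range(t, len(rewards)))
        theta += alpha * grad_log_pi(states[t], actions[t]) * Q_hat
```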
Policy Gradients: Demonstration 19 Full video here.
Further Reading 20 J. Schulman, et al. (2015). Trust Region Policy Optimization. Available at http://arxiv.org/abs/1502.05477. J. Schulman, et al. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. Available at https://arxiv.org/abs/1506.02438. T. Lillicrap, et al. (2015). Continuous control with deep reinforcement learning. Available at https://arxiv.org/abs/1509.02971. D. Silver, et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp.484-489. (AlphaGo paper) Y. Duan, et al. (2016). Benchmarking Deep Reinforcement Learning for Continuous Control. Available at https://arxiv.org/abs/1604.06778. (Good Survey)
Deep Learning Frameworks 21 Comparison taken from CS 231n course slides. Full presentation here. Also, check out Keras.
Additional Resources 22
Neural Networks:
- CS 231n Course Notes (http://cs231n.stanford.edu)
- Michael Nielsen's free textbook Neural Networks and Deep Learning (http://neuralnetworksanddeeplearning.com)
- Deep Learning by Goodfellow et al. (http://www.deeplearningbook.org)
Deep Reinforcement Learning:
- CS 294 Course Notes (UC Berkeley) (http://rll.berkeley.edu/deeprlcourse/)
- David Silver's RL Course Notes (http://www0.cs.ucl.ac.uk/staff/d.silver/web/teaching.html)
- OpenAI Gym (https://gym.openai.com)
Thank You 23 Questions?