CS230: Lecture 9 - Deep Reinforcement Learning. Kian Katanforoosh
Today's outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Learning IV. Application of Deep Q-Learning: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics
I. Motivation: AlphaGo and Human Level Control through Deep Reinforcement Learning [Silver, Schrittwieser, Simonyan et al. (2017): Mastering the game of Go without human knowledge] [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
I. Motivation How would you solve Go with classic supervised learning? Input: the board state; output class: the move of the professional player. Issues: - Ground truth is probably wrongly defined. - Too many states in this game. - We will likely not generalize. Source: https://deepmind.com/blog/alphago-zero-learning-scratch/ Why RL? Delayed labels, and making sequences of decisions. Examples of RL applications: robotics, games, advertisement. What is RL? Automatically learning to make good sequences of decisions.
Today's outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Networks IV. Application of Deep Q-Network: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics
II. Recycling is good: an introduction to RL. Problem statement: five states in a row, State 1, State 2 (initial), State 3, State 4, State 5; the agent STARTs in State 2. Number of states: 5. Types of states: initial, normal, terminal. Define a reward r in every state: +2, 0, 0, +1, +10 for States 1 to 5. The agent's possible actions are to move between adjacent states. Goal: maximize the return (the sum of rewards). Additional rule: the garbage collector is coming in 3 min, and it takes 1 min to move between states. What is the best strategy to follow if γ = 1? How do we define the long-term return? Discounted return: $R = \sum_{t=0}^{\infty} \gamma^t r_t = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$
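To make the discounted-return formula concrete, here is a minimal Python sketch (the reward sequences correspond to the moves in this toy example; the helper name is ours, not from the lecture):

```python
# Minimal sketch: discounted return R = sum_t gamma^t * r_t.
def discounted_return(rewards, gamma):
    """rewards[t] is the reward collected after the t-th move."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Moving right from State 2 collects 0 (State 3), +1 (State 4), +10 (State 5):
print(discounted_return([0, 1, 10], gamma=1.0))   # 11.0
print(discounted_return([0, 1, 10], gamma=0.9))   # 9.0
# Moving left from State 2 collects +2 (State 1) and the episode ends:
print(discounted_return([2], gamma=1.0))          # 2.0
```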
II. Recycling is good: an introduction to RL. What do we want to learn? How good it is to take a given action in a given state, e.g. action 1 in State 2. This is the Q-table: a matrix Q with one entry per (state, action) pair (#states x #actions, here 5 x 2). How do we fill it in? Use the discounted return $R = \sum_{t=0}^{\infty} \gamma^t r_t = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$ and work backwards from the terminal states, assuming γ = 0.9 (rewards +2, 0, 0, +1, +10 for States 1 to 5):
- From State 4: moving right to State 5 is worth +10; moving left to State 3 is worth +9 (= 0 + 0.9 x 10).
- From State 3: moving right to State 4 is worth +10 (= 1 + 0.9 x 10); moving left to State 2 is worth +8.1 (= 0 + 0.9 x 9).
- From State 2: moving left to State 1 is worth +2; moving right to State 3 is worth +9 (= 0 + 0.9 x 10).
The filled-in Q-table (columns = States 1 to 5, rows = actions; terminal states have value 0):
move left:  0   2   8.1   9    0
move right: 0   9   10    10   0
II. Recycling is good: an introduction to RL. The filled-in Q-table gives the best strategy to follow when γ = 0.9. The optimal Q-table satisfies the Bellman equation (optimality equation): $Q^*(s,a) = r + \gamma \max_{a'} Q^*(s',a')$. The policy $\pi^*(s) = \operatorname{argmax}_a Q^*(s,a)$ is the function telling us our best strategy. When the state and action spaces are too big, this tabular method has a huge memory cost.
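As a sanity check, the optimal Q-table for this toy example can be computed by repeatedly applying the Bellman backup. The sketch below assumes deterministic left/right moves, rewards attached to the state being entered, and Q = 0 at terminal states; these conventions match the worked example above, but the code itself is not from the lecture.

```python
import numpy as np

rewards  = np.array([2.0, 0.0, 0.0, 1.0, 10.0])     # reward for entering States 1..5
terminal = np.array([True, False, False, False, True])
gamma    = 0.9

Q = np.zeros((5, 2))                                 # 5 states x 2 actions (left, right)
for _ in range(100):                                 # iterate the Bellman backup to convergence
    for s in range(5):
        if terminal[s]:
            continue                                 # terminal states keep Q = 0
        for a, s_next in enumerate((s - 1, s + 1)):  # action 0 = move left, 1 = move right
            Q[s, a] = rewards[s_next] + gamma * Q[s_next].max()

print(Q)                    # rows = States 1..5: [[0,0],[2,9],[8.1,10],[9,10],[0,0]]
print(Q.argmax(axis=1))     # greedy policy pi(s) = argmax_a Q*(s, a)
```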
What we've learned so far: - Vocabulary: environment, agent, state, action, reward, total return, discount factor. - Q-table: matrix of entries representing how good it is to take action a in state s. - Policy: function telling us the best strategy to adopt. - Bellman equation: satisfied by the optimal Q-table.
Today's outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Learning IV. Application of Deep Q-Learning: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics
III. Deep Q-Learning. Main idea: find a Q-function to replace the Q-table. Instead of storing one entry per (state, action) pair, encode the state as a vector, e.g. the one-hot vector s = (0, 1, 0, 0, 0) for State 2, forward propagate it through a neural network with a few hidden layers, and read off one output per action: Q(s, action 1) and Q(s, action 2). Then compute the loss and backpropagate. How do we compute the loss?
III. Deep Q-Learning. Recall the Bellman equation $Q^*(s,a) = r + \gamma \max_{a'} Q^*(s',a')$. For a one-hot input state s, the network outputs Q(s, a) for each action. Loss function: $L = (y - Q(s,a))^2$, where the target value is $y = r + \gamma \max_{a'} Q(s_{next}, a')$: the immediate reward r for taking action a in state s, plus the discounted maximum future reward when you are in state $s_{next}$. The target has the same form in either case, whether Q(s, action 1) > Q(s, action 2) or the reverse, and it is held fixed for backprop. [Francisco S. Melo: Convergence of Q-learning: a simple proof]
III. Deep Q-Learning. Loss function (regression): $L = (y - Q(s,a))^2$ with target value $y = r + \gamma \max_{a'} Q(s_{next}, a')$, whichever action currently has the larger Q-value. Backpropagation: compute $\frac{\partial L}{\partial W}$ and update W using stochastic gradient descent.
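A minimal PyTorch sketch of this loss, matching the one-hot 5-state example (the tiny network size, learning rate, and helper name are illustrative assumptions, not lecture code):

```python
import torch
import torch.nn as nn

# Tiny Q-network: one-hot state of size 5 in, Q-values for 2 actions out (sizes assumed).
q_net = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-2)
gamma = 0.9

def dqn_update(s, a, r, s_next):
    """One gradient step on L = (y - Q(s, a))^2 with the target y held fixed."""
    q_sa = q_net(s)[a]                           # Q(s, a), differentiable w.r.t. W
    with torch.no_grad():                        # "hold fixed for backprop"
        y = r + gamma * q_net(s_next).max()      # y = r + gamma * max_a' Q(s_next, a')
    loss = (y - q_sa) ** 2
    optimizer.zero_grad()
    loss.backward()                              # compute dL/dW
    optimizer.step()                             # update W (stochastic gradient descent)
    return loss.item()

# Example: State 2 (one-hot), action 1, reward 0, next state State 3.
s, s_next = torch.eye(5)[1], torch.eye(5)[2]
dqn_update(s, a=1, r=0.0, s_next=s_next)
```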
Recap. DQN implementation: - Initialize your Q-network parameters. - Loop over episodes: - Start from the initial state s. - Loop over time-steps: - Forward propagate s in the Q-network. - Execute the action a that has the maximum Q(s, a) output of the Q-network. - Observe reward r and next state s'. - Compute the target y by forward propagating state s' in the Q-network, then compute the loss. - Update parameters with gradient descent.
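The recap above, written out as a rough, runnable sketch on the 5-state example. The toy environment class, the greedy-only action choice, and the reuse of `q_net` and `dqn_update` from the previous sketch are our assumptions; terminal-state targets and exploration are exactly the issues the next slides address.

```python
import torch

class RecyclingEnv:
    """Toy environment for the 5-state example (our assumption, not lecture code)."""
    REWARDS = [2.0, 0.0, 0.0, 1.0, 10.0]          # reward for entering States 1..5

    def reset(self):
        self.s = 1                                # State 2 is the initial state
        return self.s

    def step(self, a):                            # a = 0: move left, a = 1: move right
        self.s += -1 if a == 0 else 1
        done = self.s in (0, 4)                   # States 1 and 5 are terminal
        return self.s, self.REWARDS[self.s], done

def one_hot(s):
    return torch.eye(5)[s]

env = RecyclingEnv()
for episode in range(200):
    s = env.reset()                               # start from initial state s
    done = False
    while not done:
        with torch.no_grad():
            a = int(q_net(one_hot(s)).argmax())   # action with maximum Q(s, a)
        s_next, r, done = env.step(a)             # observe reward r and next state s'
        dqn_update(one_hot(s), a, r, one_hot(s_next))  # target y, loss, gradient step
        s = s_next                                # no terminal handling / exploration yet
```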
Today's outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Networks IV. Application of Deep Q-Network: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics
IV. Deep Q-Learning application: Breakout (Atari). Goal: play Breakout, i.e. destroy all the bricks. Demo. Input of the Q-network: the raw game frame s. Output of the Q-network: the Q-values, one Q(s, a) per possible action. Would that work? [Video credits to Two minute papers: Google DeepMind's Deep Q-learning playing Atari Breakout https://www.youtube.com/watch?v=v1eynij0rnk] [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
IV. Deep Q-Learning application: Breakout (Atari). Goal: play Breakout, i.e. destroy all the bricks. What is done in preprocessing to build φ(s) from the raw frames? - Convert to grayscale. - Reduce dimensions (h, w). - Keep a history of 4 frames. [Video credits to Two minute papers: Google DeepMind's Deep Q-learning playing Atari Breakout https://www.youtube.com/watch?v=v1eynij0rnk] [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
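A sketch of what φ could look like in code. The 84x84 size and the use of OpenCV are assumptions borrowed from the DQN paper's setup, not details given on the slide:

```python
import numpy as np
import cv2                                            # OpenCV, used for grayscale + resize

def preprocess_frame(frame):
    """One RGB Atari frame -> small grayscale image in [0, 1]."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)    # convert to grayscale
    small = cv2.resize(gray, (84, 84))                # reduce dimensions (h, w)
    return small.astype(np.float32) / 255.0

def phi(last_4_frames):
    """phi(s): stack the 4 most recent preprocessed frames (history)."""
    return np.stack([preprocess_frame(f) for f in last_4_frames], axis=0)  # (4, 84, 84)
```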
IV. Deep Q-Learning application: Breakout (Atari). Input of the Q-network: φ(s). Deep Q-network architecture: φ(s) → CONV + ReLU → CONV + ReLU → CONV + ReLU → FC (ReLU) → FC (linear) → one Q-value Q(s, a) per action. [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
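In PyTorch, an architecture in the spirit of the slide could look like the sketch below: three CONV + ReLU blocks, an FC + ReLU layer, and a linear FC output with one Q-value per action. The specific filter counts, kernel sizes, and strides follow the DQN paper and are assumptions here.

```python
import torch.nn as nn

n_actions = 3                                     # number of Breakout actions (assumed)

q_net = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),    # CONV + ReLU
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),   # CONV + ReLU
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),   # CONV + ReLU
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                   # FC (ReLU)
    nn.Linear(512, n_actions),                                # FC (linear): Q(s, a) per action
)
# Input: phi(s) of shape (batch, 4, 84, 84); output: (batch, n_actions) Q-values.
```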
Recap (+ preprocessing + terminal state). Some training challenges: keep track of the terminal step; experience replay; epsilon-greedy action choice (exploration/exploitation tradeoff). DQN implementation: - Initialize your Q-network parameters. - Loop over episodes: - Start from the initial state φ(s). - Loop over time-steps: - Forward propagate φ(s) in the Q-network. - Execute the action a that has the maximum Q(φ(s), a) output of the Q-network. - Observe reward r and next state s'. - Use s' to create φ(s'). - Compute the target y by forward propagating φ(s') in the Q-network, then compute the loss. - Update parameters with gradient descent. [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
Recap (+ preprocessing + terminal state). Some training challenges: keep track of the terminal step; experience replay; epsilon-greedy action choice (exploration/exploitation tradeoff). DQN implementation: - Initialize your Q-network parameters. - Loop over episodes: - Start from the initial state φ(s). - Create a boolean to detect terminal states: terminal = False. - Loop over time-steps: - Forward propagate φ(s) in the Q-network. - Execute the action a that has the maximum Q(φ(s), a) output of the Q-network. - Observe reward r and next state s'. - Use s' to create φ(s'). - Check whether s' is a terminal state and compute the target: if terminal = False, $y = r + \gamma \max_{a'} Q(\phi(s'), a')$; if terminal = True, y = r (and break out of the time-step loop). Then compute the loss. - Update parameters with gradient descent. [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
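The terminal-state rule in the recap, as a small sketch (it reuses the `q_net` defined above; the discount factor value is an arbitrary choice):

```python
import torch

gamma = 0.99                                      # discount factor (value assumed)

def compute_target(r, phi_s_next, terminal):
    """y = r at terminal states, otherwise y = r + gamma * max_a' Q(phi(s'), a')."""
    if terminal:
        return torch.tensor(r)
    with torch.no_grad():                         # target held fixed for backprop
        return r + gamma * q_net(phi_s_next).max()
```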
IV - DQN training challenges: Experience replay. Current method: start from the initial state s and follow the trajectory, one experience per step, each leading to one iteration of gradient descent: E1 = (φ(s), a, r, φ(s')), E2 = (φ(s'), a', r', φ(s'')), E3 = (φ(s''), a'', r'', φ(s''')), ... With experience replay, each experience is instead stored in a replay memory D, and each training iteration samples from the memory: train on E1, then on sample(E1, E2), then on sample(E1, E2, E3), then on sample(E1, E2, E3, E4), ... This can be used with mini-batch gradient descent. Advantages of experience replay? [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
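A minimal replay-memory sketch (the capacity and batch size below are arbitrary choices, not values from the lecture):

```python
import random
from collections import deque

class ReplayMemory:
    """Replay memory D: store experiences, sample random mini-batches for training."""
    def __init__(self, capacity=100_000):
        self.D = deque(maxlen=capacity)           # oldest experiences are dropped when full

    def add(self, phi_s, a, r, phi_s_next, terminal):
        self.D.append((phi_s, a, r, phi_s_next, terminal))   # one experience E

    def sample(self, batch_size=32):
        return random.sample(self.D, min(batch_size, len(self.D)))

# memory = ReplayMemory()
# memory.add(phi_s, a, r, phi_s_next, terminal)   # after every step
# batch = memory.sample(32)                       # mini-batch gradient descent on this sample
```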
Recap (+ experience replay). Some training challenges: keep track of the terminal step; experience replay; epsilon-greedy action choice (exploration/exploitation tradeoff). DQN implementation: - Initialize your Q-network parameters. - Initialize the replay memory D. - Loop over episodes: - Start from the initial state φ(s). - Create a boolean to detect terminal states: terminal = False. - Loop over time-steps: - Forward propagate φ(s) in the Q-network. - Execute the action a that has the maximum Q(φ(s), a) output of the Q-network. - Observe reward r and next state s'. - Use s' to create φ(s'). - Add the experience (φ(s), a, r, φ(s')) to the replay memory D; this transition will not necessarily be used in this iteration's update. - Sample a random mini-batch of transitions from D and update using the sampled transitions: check if s' is terminal, compute the targets y by forward propagating the sampled φ(s') in the Q-network, then compute the loss. - Update parameters with gradient descent. [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
Exploration vs. Exploitation. From the initial state S1 there are three actions: a1 leads to terminal state S2 with R = +0, a2 leads to terminal state S3 with R = +1, and a3 leads to terminal state S4 with R = +1000. Just after initializing the Q-network, we get Q(S1, a1) = 0.5, Q(S1, a2) = 0.4, Q(S1, a3) = 0.3. Acting greedily, we first try a1 and observe a return of 0 (Q(S1, a1) is updated towards 0), then try a2 and observe a return of 1 (Q(S1, a2) is updated towards 1). From then on, S4 and its reward of +1000 will never be visited, because Q(S1, a3) < Q(S1, a2).
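The standard fix is epsilon-greedy action choice: with probability epsilon pick a random action, otherwise pick the greedy one, so that a3 (and S4 with its +1000 reward) still gets tried occasionally. A small sketch, with hypothetical argument names:

```python
import random
import torch

def epsilon_greedy_action(q_net, phi_s, epsilon, n_actions):
    """With probability epsilon explore (random action); otherwise exploit (greedy)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)        # explore
    with torch.no_grad():
        return int(q_net(phi_s).argmax())         # exploit: argmax_a Q(phi(s), a)

# epsilon is typically annealed from 1.0 down to a small value (e.g. 0.1) during training.
```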
Recap (+ epsilon-greedy action). DQN implementation: - Initialize your Q-network parameters. - Initialize the replay memory D. - Loop over episodes: - Start from the initial state φ(s). - Create a boolean to detect terminal states: terminal = False. - Loop over time-steps: - With probability epsilon, take a random action a. Otherwise, forward propagate φ(s) in the Q-network and execute the action a that has the maximum Q(φ(s), a) output. - Observe reward r and next state s'. - Use s' to create φ(s'). - Add the experience (φ(s), a, r, φ(s')) to the replay memory D. - Sample a random mini-batch of transitions from D. - Check if s' is a terminal state; compute the targets y by forward propagating the sampled φ(s') in the Q-network, then compute the loss. - Update parameters with gradient descent. [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
Overall recap: preprocessing, terminal-state detection, experience replay, epsilon-greedy action. DQN implementation: - Initialize your Q-network parameters. - Initialize the replay memory D. - Loop over episodes: - Start from the initial state φ(s). - Create a boolean to detect terminal states: terminal = False. - Loop over time-steps: - With probability epsilon, take a random action a. Otherwise, forward propagate φ(s) in the Q-network and execute the action a that has the maximum Q(φ(s), a) output. - Observe reward r and next state s'. - Use s' to create φ(s'). - Add the experience (φ(s), a, r, φ(s')) to the replay memory D. - Sample a random mini-batch of transitions from D. - Check if s' is a terminal state; compute the targets y by forward propagating the sampled φ(s') in the Q-network, then compute the loss. - Update parameters with gradient descent. [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
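Putting the whole recap together as one rough sketch. The environment interface, the initial-frame helper, the optimizer and hyperparameter values, and the reuse of `phi`, `q_net`, `n_actions`, `ReplayMemory`, and `epsilon_greedy_action` from the earlier sketches are all assumptions, not lecture code.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)   # optimizer choice assumed
gamma, epsilon, num_episodes = 0.99, 1.0, 10_000
memory = ReplayMemory()

for episode in range(num_episodes):
    frames = collect_initial_frames(env)          # hypothetical helper: first 4 raw frames
    phi_s = torch.as_tensor(phi(frames)).unsqueeze(0)             # initial phi(s)
    terminal = False
    while not terminal:
        # Epsilon-greedy action choice, then act in the (assumed Gym-like) environment.
        a = epsilon_greedy_action(q_net, phi_s, epsilon, n_actions)
        frame_next, r, terminal = env.step(a)     # observe reward r and next state s'
        frames = frames[1:] + [frame_next]
        phi_s_next = torch.as_tensor(phi(frames)).unsqueeze(0)    # use s' to create phi(s')
        memory.add(phi_s, a, r, phi_s_next, terminal)             # add experience to D

        # Sample a random mini-batch of transitions from D and take one gradient step.
        batch = memory.sample(32)
        phi_b, a_b, r_b, phi_next_b, term_b = zip(*batch)
        phi_b, phi_next_b = torch.cat(phi_b), torch.cat(phi_next_b)
        a_b = torch.tensor(a_b)
        r_b = torch.tensor(r_b, dtype=torch.float32)
        term_b = torch.tensor(term_b, dtype=torch.float32)

        q_sa = q_net(phi_b).gather(1, a_b.unsqueeze(1)).squeeze(1)     # Q(phi(s), a)
        with torch.no_grad():                                          # targets held fixed
            y = r_b + gamma * (1 - term_b) * q_net(phi_next_b).max(1).values
        loss = F.mse_loss(q_sa, y)                                     # mean of (y - Q)^2
        optimizer.zero_grad(); loss.backward(); optimizer.step()

        phi_s = phi_s_next
        epsilon = max(0.1, epsilon - 1e-6)        # anneal exploration over time
```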
Results [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning] [Credits: DeepMind, DQN Breakout - https://www.youtube.com/watch?v=tmpftpjtdgg]
Difference between with and without human knowledge Imitation learning [Source: Bellemare et al. (2016): Unifying Count-Based Exploration and Intrinsic Motivation] [Ho et al. (2016): Generative Adversarial Imitation Learning]
Other Atari games Pong SeaQuest Space Invaders [Chia-Hsuan Lee, Atari Seaquest Double DQN Agent - https://www.youtube.com/ watch?v=nirmkc5uvwu] [moooopan, Deep Q-Network Plays Atari 2600 Pong - https://www.youtube.com/watch? v=p88r2_3ywpa] [DeepMind: DQN SPACE INVADERS - https:// www.youtube.com/watch?v=w2caghuiofy&t=2s] [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
Today's outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Networks IV. Application of Deep Q-Network: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics
VI - Advanced topics Alpha Go [DeepMind Blog] [Silver, Schrittwieser, Simonyan et al. (2017): Mastering the game of Go without human knowledge]
VI - Advanced topics Policy Gradient Methods: PPO, TRPO [OpenAI Blog] [Schulman et al. (2015): Trust Region Policy Optimization] [Schulman et al. (2017): Proximal Policy Optimization]
VI - Advanced topics Competitive self-play [Bansal et al. (2017): Emergent Complexity via multi-agent competition] [OpenAI Blog: Competitive self-play]
VI - Advanced topics Meta learning [Finn et al. (2017): Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks]
VI - Advanced topics Exploration in Reinforcement Learning [Bellemare et al. (2016): Unifying Count-Based Exploration and Intrinsic Motivation]
VI - Advanced topics Imitation learning [Ho et al. (2016): Generative Adversarial Imitation Learning]
Announcements For Wednesday 12/05, 11am: No assignment this week, but work on your projects and meet your mentors! This Friday: TA Sections: reading research papers.
VI - Advanced topics Auxiliary task