CS230: Lecture 9 Deep Reinforcement Learning

CS230: Lecture 9 Deep Reinforcement Learning Kian Katanforoosh Menti code: 21 90 15

Today's outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Learning IV. Application of Deep Q-Learning: Breakout (Atari) V. Tips to train Deep Q-Networks VI. Advanced topics

I. Motivation: AlphaGo [Silver, Schrittwieser, Simonyan et al. (2017): Mastering the game of Go without human knowledge] and Human-Level Control through Deep Reinforcement Learning [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

I. Motivation How would you solve Go with classic supervised learning? Input: a Go board position. Output class: the move of the professional player. Issues:
- The ground truth is probably wrongly defined.
- Too many states in this game.
- We will likely not generalize.
[Source: https://deepmind.com/blog/alphago-zero-learning-scratch/]
Why RL? Delayed labels; making sequences of decisions. Examples of RL applications: robotics, games, advertisement. What is RL? Automatically learning to make good sequences of decisions.

Today's outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Networks IV. Application of Deep Q-Network: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics

II. Recycling is good: an introduction to RL
Problem statement: 5 states (State 1, State 2 (initial, START), State 3, State 4, State 5). Goal: maximize the return (rewards).
Types of states: initial, normal, terminal. Define a reward r in every state: +2, 0, 0, +1, +10 for S1..S5.
Agent's possible actions: move left or move right.
Additional rule: the garbage collector is coming in 3 min, and it takes 1 min to move between states.
Best strategy to follow if γ = 1?
How to define the long-term return? Discounted return: R = Σ_{t=0}^{∞} γ^t r_t = r_0 + γ r_1 + γ^2 r_2 + ...
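
To make the discounted return concrete, here is a minimal Python sketch that sums γ^t r_t over a reward sequence; the example path and rewards are the ones from the slide (S2 → S3 → S4 → S5 with γ = 0.9), while the function name is ours.

# Minimal sketch: computing the discounted return R = sum_t gamma^t * r_t
# for a finite sequence of rewards.
def discounted_return(rewards, gamma):
    R = 0.0
    for t, r in enumerate(rewards):
        R += (gamma ** t) * r
    return R

# Rewards collected on the path S2 -> S3 -> S4 -> S5 with gamma = 0.9
print(discounted_return([0, 1, 10], 0.9))   # 0 + 0.9*1 + 0.81*10 = 9.0, the +9 on the slides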

II. Recycling is good: an introduction to RL
What do we want to learn? How good it is to take a given action in a given state, e.g. action 1 in state 2. This is stored in a Q-table Q with one entry Q(s, a) per (state, action) pair (#states x #actions, here 5 x 2).
How? Fill each entry with the discounted return R = Σ_{t=0}^{∞} γ^t r_t = r_0 + γ r_1 + γ^2 r_2 + ... obtained by acting optimally afterwards. Assuming γ = 0.9 and working backwards from the terminal states:
- From S4: moving right to S5 gives +10; moving left to S3 gives +9 (= 0 + 0.9 x 10).
- From S3: moving right to S4 gives +10 (= 1 + 0.9 x 10); moving left to S2 gives +8.1 (= 0 + 0.9 x 9).
- From S2: moving left to S1 gives +2; moving right to S3 gives +9 (= 0 + 0.9 x 10).
The filled Q-table (rows = states S1..S5, columns = actions left / right):
S1: 0, 0 (terminal)
S2: +2, +9
S3: +8.1, +10
S4: +9, +10
S5: 0, 0 (terminal)

II. Recycling is good: an introduction to RL
The optimal Q-table satisfies the Bellman equation (optimality equation): Q*(s, a) = r + γ max_{a'} Q*(s', a').
Policy: π(s) = argmax_a Q*(s, a), the function telling us our best strategy.
Best strategy to follow if γ = 0.9: reading the argmax of each row of the Q-table, always move right, towards the +10 reward.
When the state and action spaces are too big, this tabular method has a huge memory cost.
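
As a sanity check, the small example above can be solved by simply iterating the Bellman optimality equation. The sketch below is ours, not the lecture's: it assumes two deterministic actions (left/right), that the reward is collected on entering the next state, and that S1 and S5 are terminal.

import numpy as np

# Solving the 5-state recycling example by iterating Q*(s,a) = r + gamma * max_a' Q*(s',a').
gamma = 0.9
rewards = [2, 0, 0, 1, 10]                   # rewards of states S1..S5
terminal = [True, False, False, False, True]
LEFT, RIGHT = 0, 1

def step(s, a):
    """Deterministic transition: next state index for action a in state s."""
    return s - 1 if a == LEFT else s + 1

Q = np.zeros((5, 2))                         # rows = states S1..S5, cols = LEFT/RIGHT
for _ in range(50):                          # iterate until the values stop changing
    for s in range(5):
        if terminal[s]:
            continue                         # no actions from terminal states
        for a in (LEFT, RIGHT):
            s_next = step(s, a)
            target = rewards[s_next]
            if not terminal[s_next]:
                target += gamma * Q[s_next].max()
            Q[s, a] = target

print(np.round(Q, 1))                        # S2 -> (2, 9), S3 -> (8.1, 10), S4 -> (9, 10)
policy = Q.argmax(axis=1)                    # pi(s) = argmax_a Q*(s, a): always move right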

What we've learned so far: - Vocabulary: environment, agent, state, action, reward, total return, discount factor. - Q-table: matrix of entries representing how good it is to take action a in state s. - Policy: function telling us what's the best strategy to adopt. - Bellman equation: satisfied by the optimal Q-table.

Today's outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Learning IV. Application of Deep Q-Learning: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics

III. Deep Q-Learning
Main idea: find a Q-function to replace the Q-table. Instead of storing one entry per (state, action) pair, a neural network (a few fully connected hidden layers a^[1], a^[2], a^[3] in the slide's diagram) takes the state as input, encoded here as a one-hot vector s = (0, 1, 0, 0, 0)^T for State 2, and outputs one Q-value per action. We then compute a loss and backpropagate. How to compute the loss?
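
A minimal sketch of what such a Q-network could look like, using PyTorch (the lecture does not prescribe a framework) with illustrative layer sizes: the one-hot state goes in, one Q-value per action comes out.

import torch
import torch.nn as nn

# Hypothetical Q-network for the 5-state example; layer sizes are illustrative.
q_net = nn.Sequential(
    nn.Linear(5, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 2),                          # 2 actions: left / right
)

s = torch.tensor([[0., 1., 0., 0., 0.]])       # one-hot encoding of State 2
q_values = q_net(s)                            # shape (1, 2): Q(s, left), Q(s, right)
best_action = q_values.argmax(dim=1)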

III. Deep Q-Learning
Recall the Bellman equation: Q*(s, a) = r + γ max_{a'} Q*(s', a').
The network outputs one Q-value per action; whichever output is larger (case Q(s, a_1) > Q(s, a_2) or case Q(s, a_1) < Q(s, a_2)) determines the action a taken, and the loss is computed on that output:
Loss function: L = (y − Q(s, a))^2
with target value y = r + γ max_{a'} Q(s_next, a'), where r is the immediate reward for taking action a in state s and γ max_{a'} Q(s_next, a') is the discounted maximum future reward when you are in state s_next. The target y is held fixed for backprop.
[Francisco S. Melo: Convergence of Q-learning: a simple proof]

III. Deep Q-Learning
This is a regression problem. Loss function (regression): L = (y − Q(s, a))^2, with target value y = r + γ max_{a'} Q(s_next, a') in both cases.
Backpropagation: compute ∂L/∂W and update W using stochastic gradient descent.
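
A hedged sketch of this regression step in PyTorch; the small network, learning rate, and variable names are illustrative, and torch.no_grad() is what plays the role of "hold fixed for backprop" when forming the target y.

import torch
import torch.nn as nn
import torch.nn.functional as F

q_net = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-2)

def q_learning_step(s, a, r, s_next, gamma=0.9):
    q_sa = q_net(s)[0, a]                      # Q(s, a) for the action actually taken
    with torch.no_grad():                      # target value, held fixed
        y = r + gamma * q_net(s_next).max()
    loss = F.mse_loss(q_sa, y)                 # L = (y - Q(s, a))^2
    optimizer.zero_grad()
    loss.backward()                            # compute dL/dW
    optimizer.step()                           # SGD update of the weights W
    return loss.item()

# Example: transition from State 2 to State 3 after taking action 2 (move right)
s = torch.tensor([[0., 1., 0., 0., 0.]])
s_next = torch.tensor([[0., 0., 1., 0., 0.]])
q_learning_step(s, a=1, r=0.0, s_next=s_next)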

Recap DQN Implementation:
- Initialize your Q-network parameters
- Loop over episodes:
  - Start from initial state s
  - Loop over time-steps:
    - Forward propagate s in the Q-network
    - Execute the action a that has the maximum Q(s, a) output of the Q-network
    - Observe reward r and next state s'
    - Compute the target y by forward propagating the next state s' in the Q-network, then compute the loss
    - Update parameters with gradient descent

Today's outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Networks IV. Application of Deep Q-Network: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics

IV. Deep Q-Learning application: Breakout (Atari) Goal: play Breakout, i.e. destroy all the bricks. The input of the Q-network is the game frame s; the output is one Q-value per action (e.g. move left, stay, move right). Would that work? [Video credits to Two minute papers: Google DeepMind's Deep Q-learning playing Atari Breakout https://www.youtube.com/watch?v=v1eynij0rnk] [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

IV. Deep Q-Learning application: Breakout (Atari) Goal: play Breakout, i.e. destroy all the bricks. What is done in preprocessing? The raw frame s is mapped to φ(s):
- Convert to grayscale
- Reduce dimensions (h, w)
- Keep a history of the last 4 frames
[Video credits to Two minute papers: Google DeepMind's Deep Q-learning playing Atari Breakout https://www.youtube.com/watch?v=v1eynij0rnk] [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
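
A possible φ(s) along these lines, sketched with NumPy and OpenCV; the 84x84 target size follows Mnih et al. (2015), while the lecture slide only lists the three steps.

from collections import deque
import numpy as np
import cv2   # assumption: OpenCV is available for resizing; any image library works

frame_history = deque(maxlen=4)

def preprocess(frame_rgb):
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)        # convert to grayscale
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)  # reduce (h, w)
    return small.astype(np.float32) / 255.0

def phi(frame_rgb):
    """Return a (4, 84, 84) stack of the most recent preprocessed frames."""
    frame_history.append(preprocess(frame_rgb))
    while len(frame_history) < 4:                             # pad at the start of an episode
        frame_history.append(frame_history[-1])
    return np.stack(frame_history, axis=0)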

IV. Deep Q-Learning application: Breakout (Atari) Deep Q-network architecture? The input is φ(s), the stack of preprocessed frames, and the output is one Q-value per action. The network is: CONV + ReLU → CONV + ReLU → CONV + ReLU → FC (ReLU) → FC (linear). [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
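
A PyTorch sketch of this CONV-CONV-CONV-FC-FC architecture; the filter counts, kernel sizes, and strides are the ones reported in Mnih et al. (2015), since the slide only names the layer types.

import torch
import torch.nn as nn

class AtariQNet(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # FC (ReLU)
            nn.Linear(512, n_actions),               # FC (linear): one Q-value per action
        )

    def forward(self, phi_s):                        # phi_s: (batch, 4, 84, 84)
        return self.head(self.features(phi_s))

q_net = AtariQNet(n_actions=4)
q_values = q_net(torch.zeros(1, 4, 84, 84))          # shape (1, 4)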

Recap (+ preprocessing + terminal state)
Some training challenges: keep track of the terminal step; experience replay; epsilon-greedy action choice (exploration / exploitation tradeoff).
DQN Implementation:
- Initialize your Q-network parameters
- Loop over episodes:
  - Start from initial state φ(s)
  - Loop over time-steps:
    - Forward propagate φ(s) in the Q-network
    - Execute the action a that has the maximum Q(φ(s), a) output of the Q-network
    - Observe reward r and next state s'
    - Use s' to create φ(s')
    - Compute the target y by forward propagating φ(s') in the Q-network, then compute the loss
    - Update parameters with gradient descent
[Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

Recap (+ preprocessing + terminal state)
DQN Implementation:
- Initialize your Q-network parameters
- Loop over episodes:
  - Start from initial state φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - Forward propagate φ(s) in the Q-network
    - Execute the action a that has the maximum Q(φ(s), a) output of the Q-network
    - Observe reward r and next state s'
    - Use s' to create φ(s')
    - Check if s' is a terminal state. Compute the target:
      y = r + γ max_{a'} Q(φ(s'), a')   if terminal = False
      y = r (and break)                 if terminal = True
      then compute the loss.
    - Update parameters with gradient descent
[Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
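
A short sketch of the terminal-aware target computation on a batch of transitions; the tensor names and the default γ are illustrative.

import torch

# y = r when the episode ended, else y = r + gamma * max_a' Q(phi(s'), a').
def compute_targets(q_net, rewards, phi_next, terminal, gamma=0.99):
    """rewards: (B,) float, phi_next: (B, 4, 84, 84), terminal: (B,) bool."""
    with torch.no_grad():                              # targets are held fixed
        max_next_q = q_net(phi_next).max(dim=1).values
    return rewards + gamma * max_next_q * (~terminal).float()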

IV - DQN training challenges: Experience replay
One experience is a tuple (φ(s), a, r, φ(s')) and leads to one iteration of gradient descent. The current method is to start from the initial state and train on experiences in the order they are generated: E1 = (φ(s), a, r, φ(s')), then E2 = (φ(s'), a', r', φ(s'')), then E3 = (φ(s''), a'', r'', φ(s''')), ...
With experience replay, each experience is instead added to a replay memory D, and training samples from everything stored so far: train on E1, then on sample(E1, E2), then on sample(E1, E2, E3), then on sample(E1, E2, E3, E4), ... This can be used with mini-batch gradient descent.
Advantages of experience replay?
[Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
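
A minimal replay memory sketch; the capacity and batch size are illustrative choices.

import random
from collections import deque

# Replay memory D storing (phi_s, a, r, phi_s_next, terminal) tuples.
class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted first

    def add(self, phi_s, a, r, phi_s_next, terminal):
        self.buffer.append((phi_s, a, r, phi_s_next, terminal))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        return list(zip(*batch))               # tuple of columns, one per field

    def __len__(self):
        return len(self.buffer)

Commonly cited advantages: sampling from D breaks the correlation between consecutive updates, and each experience can be reused for several gradient steps.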

Recap (+ experience replay)
DQN Implementation:
- Initialize your Q-network parameters
- Initialize replay memory D
- Loop over episodes:
  - Start from initial state φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - Forward propagate φ(s) in the Q-network
    - Execute the action a that has the maximum Q(φ(s), a) output of the Q-network
    - Observe reward r and next state s'
    - Use s' to create φ(s')
    - Add the experience (φ(s), a, r, φ(s')) to the replay memory D. (The transition resulting from this step is added to D and will not necessarily be used in this iteration's update!)
    - Sample a random mini-batch of transitions from D
    - Check which sampled transitions are terminal. Compute the targets y by forward propagating the sampled φ(s') in the Q-network, then compute the loss.
    - Update parameters with gradient descent
[Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

Exploration vs. Exploitation
Just after initializing the Q-network, we get Q(S1, a1) = 0.5, Q(S1, a2) = 0.4, Q(S1, a3) = 0.3, where from the initial state S1, action a1 leads to terminal state S2 (R = +0), a2 to terminal state S3 (R = +1), and a3 to terminal state S4 (R = +1000).
Acting greedily, the agent first takes a1, observes R = +0, and Q(S1, a1) is updated towards 0. It then takes a2, observes R = +1, and Q(S1, a2) is updated towards 1. From then on it always takes a2: S4, with R = +1000, will never be visited, because Q(S1, a3) < Q(S1, a2).
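
A minimal epsilon-greedy sketch; the epsilon value (and any annealing schedule) is an illustrative choice.

import random
import torch

# With probability epsilon take a random action (explore),
# otherwise take the greedy action (exploit).
def epsilon_greedy(q_net, phi_s, n_actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(n_actions)               # explore
    with torch.no_grad():
        return int(q_net(phi_s).argmax(dim=1).item())    # exploit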

Recap (+ epsilon greedy action)
DQN Implementation:
- Initialize your Q-network parameters
- Initialize replay memory D
- Loop over episodes:
  - Start from initial state φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - With probability epsilon, take a random action a. Otherwise:
      - Forward propagate φ(s) in the Q-network
      - Execute the action a that has the maximum Q(φ(s), a) output of the Q-network
    - Observe reward r and next state s'
    - Use s' to create φ(s')
    - Add the experience (φ(s), a, r, φ(s')) to the replay memory D
    - Sample a random mini-batch of transitions from D
    - Check which sampled transitions are terminal. Compute the targets y by forward propagating the sampled φ(s') in the Q-network, then compute the loss.
    - Update parameters with gradient descent
[Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

Overall recap
Key ingredients: preprocessing, detecting the terminal state, experience replay, epsilon-greedy action choice.
DQN Implementation:
- Initialize your Q-network parameters
- Initialize replay memory D
- Loop over episodes:
  - Start from initial state φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - With probability epsilon, take a random action a. Otherwise:
      - Forward propagate φ(s) in the Q-network
      - Execute the action a that has the maximum Q(φ(s), a) output of the Q-network
    - Observe reward r and next state s'
    - Use s' to create φ(s')
    - Add the experience (φ(s), a, r, φ(s')) to the replay memory D
    - Sample a random mini-batch of transitions from D
    - Check which sampled transitions are terminal. Compute the targets y by forward propagating the sampled φ(s') in the Q-network, then compute the loss.
    - Update parameters with gradient descent
A compact training-loop sketch tying these pieces together is given below.
[Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]
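
The end-to-end sketch below reuses the hypothetical helpers from the earlier sketches (AtariQNet, ReplayMemory, phi, epsilon_greedy, compute_targets) and assumes a Gym-style environment where env.reset() returns a frame and env.step(a) returns (frame, reward, done, info); all hyperparameters are illustrative, not the values used in the lecture or the paper.

import torch
import torch.nn.functional as F

def train_dqn(env, n_actions, n_episodes=1000, gamma=0.99, epsilon=0.1, lr=2.5e-4):
    q_net = AtariQNet(n_actions)
    memory = ReplayMemory(capacity=100_000)
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)

    for episode in range(n_episodes):
        frame = env.reset()
        phi_s = torch.from_numpy(phi(frame)).unsqueeze(0)      # (1, 4, 84, 84)
        terminal = False
        while not terminal:
            # Epsilon-greedy action choice (exploration / exploitation tradeoff)
            a = epsilon_greedy(q_net, phi_s, n_actions, epsilon)
            frame, r, terminal, _ = env.step(a)
            phi_s_next = torch.from_numpy(phi(frame)).unsqueeze(0)

            # Store the transition; it is not necessarily used in this update
            memory.add(phi_s, a, r, phi_s_next, terminal)
            phi_s = phi_s_next

            # Sample a random mini-batch and do one gradient step
            if len(memory) >= 32:
                phis, actions, rewards, phis_next, terminals = memory.sample(32)
                phis = torch.cat(phis)                          # (B, 4, 84, 84)
                phis_next = torch.cat(phis_next)
                actions = torch.tensor(actions)
                rewards = torch.tensor(rewards, dtype=torch.float32)
                terminals = torch.tensor(terminals)
                y = compute_targets(q_net, rewards, phis_next, terminals, gamma)
                q_sa = q_net(phis).gather(1, actions.unsqueeze(1)).squeeze(1)
                loss = F.mse_loss(q_sa, y)                      # L = (y - Q(phi(s), a))^2
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return q_net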

Results [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning] [Credits: DeepMind, DQN Breakout - https://www.youtube.com/watch?v=tmpftpjtdgg]

Difference between with and without human knowledge Imitation learning [Source: Bellemare et al. (2016): Unifying Count-Based Exploration and Intrinsic Motivation] [Ho et al. (2016): Generative Adversarial Imitation Learning]

Other Atari games: Pong, SeaQuest, Space Invaders. [Chia-Hsuan Lee, Atari Seaquest Double DQN Agent - https://www.youtube.com/watch?v=nirmkc5uvwu] [moooopan, Deep Q-Network Plays Atari 2600 Pong - https://www.youtube.com/watch?v=p88r2_3ywpa] [DeepMind: DQN SPACE INVADERS - https://www.youtube.com/watch?v=w2caghuiofy&t=2s] [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

Today's outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Networks IV. Application of Deep Q-Network: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics

VI - Advanced topics AlphaGo [DeepMind Blog] [Silver, Schrittwieser, Simonyan et al. (2017): Mastering the game of Go without human knowledge]

VI - Advanced topics Policy Gradient Methods: TRPO, PPO [OpenAI Blog] [Schulman et al. (2015): Trust Region Policy Optimization] [Schulman et al. (2017): Proximal Policy Optimization]

VI - Advanced topics Competitive self-play [Bansal et al. (2017): Emergent Complexity via multi-agent competition] [OpenAI Blog: Competitive self-play]

VI - Advanced topics Meta learning [Finn et al. (2017): Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks]

VI - Advanced topics Exploration in Reinforcement Learning [Bellemare et al. (2016): Unifying Count-Based Exploration and Intrinsic Motivation]

VI - Advanced topics Imitation learning [Ho et al. (2016): Generative Adversarial Imitation Learning]

Announcements For Wednesday 12/05, 11am: No assignment this week, but work on your projects and meet your mentors! This Friday: TA Sections: reading research papers.

VI - Advanced topics Imitation learning [Source: Bellemare et al. (2016): Unifying Count-Based Exploration and Intrinsic Motivation] [Ho et al. (2016): Generative Adversarial Imitation Learning]

VI - Advanced topics Auxiliary task