CS230: Lecture 9 Deep Reinforcement Learning

Size: px

Start display at page:

Download "CS230: Lecture 9 Deep Reinforcement Learning"

Randall Gray
5 years ago
Views:

1 CS230: Lecture 9 Deep Reinforcement Learning Kian Katanforoosh Menti code:

2 Today s outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Learning IV. Application of Deep Q-Learning: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics

(2017): Mastering the game of Go without human knowledge] [Mnih,

3 I. Motivation AlphaGo Human Level Control through Deep Reinforcement Learning [Silver, Schrittwieser, Simonyan et al. (2017): Mastering the game of Go without human knowledge] [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

I. Motivation How would you solve Go with classic supervised learning?

probably wrongly defined. - Two many states in this Game.

com/blog/ alphago-zero-learning-scratch/ Why RL?

4 I. Motivation How would you solve Go with classic supervised learning? Input output class The move of the professional player issues: - Ground truth probably wrongly defined. - Two many states in this Game. - We will likely not generalize. Source: alphago-zero-learning-scratch/ Why RL? Examples of RL applications Delayed labels Robotics Games Advertisement Making sequences of decisions What is RL? Automatically learn to make good sequences of decision

5 Today s outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Networks IV. Application of Deep Q-Network: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics

6 II. Recycling is good: an introduction to RL Problem statement State 1 State 2 (initial) State 3 State 4 State 5 Goal: maximize the return (rewards) START Number of states: 5 initial normal terminal Types of states: Define reward r in every state Agent s Possible actions: Additional rule: garbage collector coming in 3min, it takes Best strategy to follow if γ = 1 1min to move between states How to define the long-term return? Discounted return R = γ t r t = r 0 + γ r 1 + γ 2 r t=0

7 II. Recycling is good: an introduction to RL Problem statement What do we want to learn? State 1 State 2 (initial) State 3 State 4 State 5 START Define reward r in every state how good is it to take action 1 in state 2 Q-table Q = #actions Q 11 Q 12 Q 21 Q 22 Q 31 Q 32 Q 41 Q 42 Q 51 Q 52 #states S1 S2 S3 S4 S5 Assuming γ = 0.9 Discounted return R = How? γ t r = r + γ r + γ 2 r +... t t=0 S S1 S S2 S S3 S5

8 II. Recycling is good: an introduction to RL Problem statement What do we want to learn? State 1 State 2 (initial) State 3 State 4 State 5 START Define reward r in every state how good is it to take action 1 in state 2 Q-table Q = #actions Q 11 Q 12 Q 21 Q 22 Q 31 Q 32 Q 41 Q 42 Q 51 Q 52 #states S1 S2 S3 S4 S5 Assuming γ = 0.9 Discounted return R = How? γ t r = r + γ r + γ 2 r +... t t=0 S S1 S (= x 0.9) S2 S S3 S5

9 II. Recycling is good: an introduction to RL Problem statement What do we want to learn? State 1 State 2 (initial) State 3 State 4 State 5 START Define reward r in every state how good is it to take action 1 in state 2 Q-table Q = #actions Q 11 Q 12 Q 21 Q 22 Q 31 Q 32 Q 41 Q 42 Q 51 Q 52 #states S1 S2 S3 S4 S5 Assuming γ = 0.9 Discounted return R = How? γ t r = r + γ r + γ 2 r +... t t=0 S (= x 10) S1 S (= x 0.9) S2 S S3 S5

10 II. Recycling is good: an introduction to RL Problem statement What do we want to learn? State 1 State 2 (initial) State 3 State 4 State 5 START Define reward r in every state how good is it to take action 1 in state 2 Q-table Q = #actions Q 11 Q 12 Q 21 Q 22 Q 31 Q 32 Q 41 Q 42 Q 51 Q 52 #states S1 S2 S3 S4 S5 Assuming γ = 0.9 Discounted return R = How? γ t r = r + γ r + γ 2 r +... t t=0 S (= x 10) S1 S (= x 0.9) S2 S (= x 10) S3 S5

11 II. Recycling is good: an introduction to RL Problem statement What do we want to learn? State 1 State 2 (initial) State 3 State 4 State 5 START Define reward r in every state how good is it to take action 1 in state 2 Q-table Q = #actions Q 11 Q 12 Q 21 Q 22 Q 31 Q 32 Q 41 Q 42 Q 51 Q 52 #states S1 S2 S3 S4 S5 Discounted return R = Assuming γ = 0.9 How? (= x 9) γ t r = r + γ r + γ 2 r +... t t=0 S2 + 9 (= x 10) +2 S1 S (= x 0.9) S2 S (= x 10) S3 S5

12 II. Recycling is good: an introduction to RL Problem statement What do we want to learn? State 1 State 2 (initial) State 3 State 4 State 5 START Define reward r in every state how good is it to take action 1 in state 2 Q-table Q = #actions Q 11 Q 12 Q 21 Q 22 Q 31 Q 32 Q 41 Q 42 Q 51 Q 52 #states S1 S2 S3 S4 S5 Discounted return R = Assuming γ = 0.9 How? (= x 9) γ t r = r + γ r + γ 2 r +... t t=0 S2 + 9 (= x 10) +2 S1 S3 (= x 9) (= x 0.9) S2 S (= x 10) S3 S5

13 II. Recycling is good: an introduction to RL Problem statement What do we want to learn? State 1 State 2 (initial) State 3 State 4 State 5 START Define reward r in every state how good is it to take action 1 in state 2 Q-table Q = #actions #states S1 S2 S3 S4 S5 Discounted return R = Assuming γ = 0.9 How? (= x 9) γ t r = r + γ r + γ 2 r +... t t=0 S2 + 9 (= x 10) +2 S1 S3 (= x 9) (= x 0.9) S2 S (= x 10) S3 S5

14 II. Recycling is good: an introduction to RL Problem statement What do we want to learn? State 1 State 2 (initial) State 3 State 4 State 5 START Define reward r in every state how good is it to take action 1 in state 2 Q-table Q = #actions #states Bellman equation (optimality equation) Best strategy to follow if γ = 0.9 Q * (s,a) = r + γ max(q * (s',a')) a' When state and actions space are too big, this method has huge memory cost Policy π (s) = argmax a (Q * (s,a)) Function telling us our best strategy

15 What we ve learned so far: - Vocabulary: environment, agent, state, action, reward, total return, discount factor. - Q-table: matrix of entries representing how good is it to take action a in state s - Policy: function telling us what s the best strategy to adopt - Bellman equation satisfied by the optimal Q-table

16 Today s outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Learning IV. Application of Deep Q-Learning: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics

2 8.1 9 0 0 9 10 10 0 #states s = 0 1 0 0 0 [1] a 1 a [1] 2 [1] a 3 [1] a 4 Then compute

17 III. Deep Q-Learning Main idea: find a Q-function to replace the Q-table Problem statement Neural Network State 1 State 2 (initial) State 3 State 4 State 5 START Q = Q-table #actions #states s = [1] a 1 a [1] 2 [1] a 3 [1] a 4 Then compute loss, backpropagate. a 1 [2] a 1 [3] [2] a 2 [3] a 1 [2] a 3 Q(s, ) Q(s, ) How to compute the loss?

18 III. Deep Q-Learning Q * (s,a) = r + γ max(q * (s',a')) a' 0 a 1 [1] s = a 2 [1] [1] a 3 [1] a 4 a 1 [2] a 2 [2] a 3 [2] a 1 [3] a 1 [3] Q(s, ) Q(s, ) Loss function L = ( y Q(s, )) 2 Target value Case: Q(s, ) > Q(s, ) Case: Q(s, ) < Q(s, ) Immediate reward for taking action in state s Discounted maximum future reward when you are in state s next [Francisco S. Melo: Convergence of Q-learning: a simple proof] y = r + γ max(q(s next,a')) a' Hold fixed for backprop y = r + γ max a' Immediate Reward for taking action in state s (Q(s next,a')) Hold fixed for backprop Discounted maximum future reward when you are in state s next

19 III. Deep Q-Learning 0 a 1 [1] s = a 2 [1] [1] a 3 [1] a 4 a 1 [2] a 2 [2] a 3 [2] a 1 [3] a 1 [3] Q(s, ) Q(s, ) Loss function (regression) L = ( y Q(s, )) 2 Target value Case: Q(s, ) > Q(s, ) Case: Q(s, ) < Q(s, ) y = r + γ max(q(s next,a')) a' y = r + γ max(q(s next,a')) a' Backpropagation L Compute and update W using stochastic gradient descent W

20 Recap DQN Implementation: - Initialize your Q-network parameters State 1 State 2 (initial) State 3 State 4 State 5 START - Loop over episodes: - Start from initial state s - Loop over time-steps: - Forward propagate s in the Q-network - Execute action a (that has the maximum Q(s,a) output of Q-network) - Observe reward r and next state s - Compute targets y by forward propagating state s in the Q-network, then compute loss. - Update parameters with gradient descent

21 Today s outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Networks IV. Application of Deep Q-Network: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics

22 IV. Deep Q-Learning application: Breakout (Atari) Goal: play breakout, i.e. destroy all the bricks. Demo input of Q-network Output of Q-network Q-values s = Q(s, ) Q(s, ) Q(s, ) Would that work? [Video credits to Two minute papers: Google DeepMind's Deep Q-learning playing Atari Breakout [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

23 IV. Deep Q-Learning application: Breakout (Atari) Goal: play breakout, i.e. destroy all the bricks. Demo input of Q-network Output of Q-network Q-values s = Q(s, ) Q(s, ) Q(s, ) Preprocessing - Convert to grayscale What is done in preprocessing? φ(s) [Video credits to Two minute papers: Google DeepMind's Deep Q-learning playing Atari Breakout [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning] - Reduce dimensions (h,w) - History (4 frames)

24 IV. Deep Q-Learning application: Breakout (Atari) input of Q-network φ(s) = Deep Q-network architecture? Q(s, ) φ(s) CONV ReLU CONV ReLU CONV ReLU FC (RELU) FC (LINEAR) Q(s, ) Q(s, ) [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

25 Recap (+ preprocessing + terminal state) Some training challenges: DQN Implementation: - Initialize your Q-network parameters - Keep track of terminal step - Experience replay - Epsilon greedy action choice - Loop over episodes: - Start from initial state s - Loop over time-steps: φ(s) φ(s) (Exploration / Exploitation tradeoff) - Forward propagate s in the Q-network φ(s) - Execute action a (that has the maximum Q(s,a) output of Q-network) - Observe reward r and next state s - - Use s to create φ(s') φ(s') - Compute targets y by forward propagating state s in the Q-network, then compute loss. - Update parameters with gradient descent [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

26 Recap (+ preprocessing + terminal state) Some training challenges: DQN Implementation: - Initialize your Q-network parameters - Keep track of terminal step - Experience replay - Epsilon greedy action choice - Loop over episodes: - Start from initial state s - Loop over time-steps: φ(s) φ(s) (Exploration / Exploitation tradeoff) - Forward propagate s in the Q-network φ(s) - Execute action a (that has the maximum Q(s,a) output of Q-network) - Observe reward r and next state s - - Use s to create φ(s') φ(s') - Compute targets y by forward propagating state s in the Q-network, then compute loss. - Update parameters with gradient descent [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

27 Recap (+ preprocessing + terminal state) DQN Implementation: - Initialize your Q-network parameters Some training challenges: - Keep track of terminal step - Experience replay - Loop over episodes: - Start from initial state s φ(s) - Epsilon greedy action choice (Exploration / Exploitation tradeoff) - Create a boolean to detect terminal states: terminal = False - Loop over time-steps: φ(s) - Forward propagate s in the Q-network φ(s) if terminal = False : y = r + γ max(q(s',a')) a' if terminal = True : y = r (break) - Execute action a (that has the maximum Q(s,a) output of Q-network) - Observe reward r and next state s - Use s to create φ(s') φ(s') - Check if s is a terminal state. Compute targets y by forward propagating state s in the Q-network, then compute loss. - Update parameters with gradient descent [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

IV - DQN training challenges Experience replay 1 experience (leads to one iteration of gradient descent) Current method is to start from Experience Replay initial state s and follow: E1 E1 φ(s) a r

28 IV - DQN training challenges Experience replay 1 experience (leads to one iteration of gradient descent) Current method is to start from Experience Replay initial state s and follow: E1 E1 φ(s) a r φ(s') E2 E2 E1 E2 φ(s') a' r ' φ(s'') E3 E3 E3 φ(s'') a'' r '' φ(s''')... Replay memory (D) Training: E1 E2 E3 Training: E1 sample(e1, E2) sample(e1, E2, E3) sample(e1, E2, E3, E4) Can be used with mini batch gradient descent [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning] Advantages of experience replay?

29 Recap (+ experience replay) DQN Implementation: - Initialize your Q-network parameters - Initialize replay memory D - Loop over episodes: Some training challenges: - Keep track of terminal step - Experience replay - Epsilon greedy action choice - Start from initial state φ(s) (Exploration / Exploitation tradeoff) - Create a boolean to detect terminal states: terminal = False - Loop over time-steps: - Forward propagate in the Q-network - Execute action a (that has the maximum Q( φ(s),a) output of Q-network) - Observe reward r and next state s - Use s to create φ(s') φ(s) The transition resulting from this is added to D, and will not necessarily be used in this iteration s update! - Add experience (φ(s),a,r,φ(s')) to replay memory (D) Update using sampled transitions - Sample random mini-batch of transitions from D - Check if s is a terminal state. Compute targets y by forward propagating state φ(s') in the Q-network, then compute loss. - Update parameters with gradient descent [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

30 Exploration vs. Exploitation Terminal state S2 R = +0 Just after initializing the a 1 Q-network, we get: Initial state Terminal state Q(S1,a 1 ) = 0.5 a 2 S1 S3 R = +1 Q(S1,a 2 ) = 0.4 a 3 Q(S1,a 3 ) = 0.3 Terminal state S4 R = +1000

31 Exploration vs. Exploitation Terminal state a 1 a2 S2 R = +0 Just after initializing the Q-network, we get: Initial state Terminal state Q(S1,a 1 ) = S1 S3 R = +1 Q(S1,a 2 ) = 0.4 a 3 Q(S1,a 3 ) = 0.3 Terminal state S4 R = +1000

32 Exploration vs. Exploitation Terminal state a 1 a2 S2 R = +0 Just after initializing the Q-network, we get: Initial state Terminal state Q(S1,a 1 ) = S1 S3 R = +1 Q(S1,a 2 ) = a 3 Q(S1,a 3 ) = 0.3 Terminal state S4 R = +1000

33 Exploration vs. Exploitation Terminal state a 1 a2 S2 R = +0 Just after initializing the Q-network, we get: Initial state Terminal state Q(S1,a 1 ) = S1 S3 R = +1 Q(S1,a 2 ) = a 3 Q(S1,a 3 ) = 0.3 Terminal state S4 R = Will never be visited, because Q(S1,a3) < Q(S1,a2)

34 Recap (+ epsilon greedy action) DQN Implementation: - Initialize your Q-network parameters - Initialize replay memory D - Loop over episodes: - Start from initial state φ(s) - Create a boolean to detect terminal states: terminal = False - Loop over time-steps: - With probability epsilon, take random action a. - Otherwise: - Forward propagate φ(s) in the Q-network - Execute action a (that has the maximum Q( φ(s),a) output of Q-network). - Observe reward r and next state s - Use s to create φ(s') - Add experience (φ(s),a,r,φ(s')) to replay memory (D) - Sample random mini-batch of transitions from D - Check if s is a terminal state. Compute targets y by forward propagating state φ(s') in the Q-network, then compute loss. - Update parameters with gradient descent [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

35 Overall recap DQN Implementation: - Initialize your Q-network parameters - Initialize replay memory D - Loop over episodes: - Preprocessing - Detect terminal state - Experience replay - Start from initial state φ(s) - Epsilon greedy action - Create a boolean to detect terminal states: terminal = False - Loop over time-steps: - With probability epsilon, take random action a. - Otherwise: - Forward propagate φ(s) in the Q-network - Execute action a (that has the maximum Q( φ(s),a) output of Q-network). - Observe rewards r and next state s - Use s to create φ(s') - Add experience (φ(s),a,r,φ(s')) to replay memory (D) - Sample random mini-batch of transitions from D - Check if s is a terminal state. Compute targets y by forward propagating state φ(s') in the Q-network, then compute loss. - Update parameters with gradient descent [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

36 Results [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning] [Credits: DeepMind, DQN Breakout -

37 Difference between with and without human knowledge Imitation learning [Source: Bellemare et al. (2016): Unifying Count-Based Exploration and Intrinsic Motivation] [Ho et al. (2016): Generative Adversarial Imitation Learning]

Other Atari games Pong SeaQuest Space Invaders

2600 Pong - https://www.youtube.com/watch?

38 Other Atari games Pong SeaQuest Space Invaders [Chia-Hsuan Lee, Atari Seaquest Double DQN Agent - watch?v=nirmkc5uvwu] [moooopan, Deep Q-Network Plays Atari 2600 Pong - v=p88r2_3ywpa] [DeepMind: DQN SPACE INVADERS [Mnih, Kavukcuoglu, Silver et al. (2015): Human Level Control through Deep Reinforcement Learning]

39 Today s outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Networks IV. Application of Deep Q-Network: Breakout (Atari) V. Tips to train Deep Q-Network VI. Advanced topics

40 VI - Advanced topics Alpha Go [DeepMind Blog] [Silver, Schrittwieser, Simonyan et al. (2017): Mastering the game of Go without human knowledge]

41 VI - Advanced topics Policy Gradient Methods PPO TRPO [Open AI Blog] [TRPO] [Schulman et al. (2017): Trust Region Policy Optimization] [Schulman et al. (2017): Proximal Policy Optimization]

42 VI - Advanced topics Competitive self-play [Bansal et al. (2017): Emergent Complexity via multi-agent competition] [OpenAI Blog: Competitive self-play]

43 VI - Advanced topics Meta learning [Finn et al. (2017): Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks]

44 VI - Advanced topics Exploration in Reinforcement Learning [Bellemare et al. (2017):Unifying Count-Based Exploration and Intrinsic Motivation]

45 VI - Advanced topics Imitation learning [Ho et al. (2016): Generative Adversarial Imitation Learning]

46 Announcements For Wednesday 12/05, 11am: No assignment this week, but work on your projects and meet your mentors! This Friday: TA Sections: reading research papers.

47 VI - Advanced topics Imitation learning [Source: Bellemare et al. (2016): Unifying Count-Based Exploration and Intrinsic Motivation] [Ho et al. (2016): Generative Adversarial Imitation Learning]

48 VI - Advanced topics Auxiliary task

Deep Reinforcement Learning SISL. Jeremy Morton (jmorton2) November 7, Stanford Intelligent Systems Laboratory

Deep Reinforcement Learning SISL. Jeremy Morton (jmorton2) November 7, Stanford Intelligent Systems Laboratory Deep Reinforcement Learning Jeremy Morton (jmorton2) November 7, 2016 SISL Stanford Intelligent Systems Laboratory Overview 2 1 Motivation 2 Neural Networks 3 Deep Reinforcement Learning 4 Deep Learning