CS885 Reinforcement Learning Lecture 7a: May 23, 2018



1 CS885 Reinforcement Learning
Lecture 7a: May 23, 2018
Policy Gradient Methods
[SutBar] Sec , 13.7 [SigBuf] Sec , [RusNor] Sec
CS885 Spring 2018 Pascal Poupart

2 Outline
- Stochastic policy gradient
- REINFORCE algorithm
- AlphaGo

3 Model-free Policy-based Methods
- Q-learning
  - Model-free value-based method
  - No explicit policy representation
- Policy gradient
  - Model-free policy-based method
  - No explicit value function representation

4 Stochastic Policy
- Consider a stochastic policy $\pi_\theta(a|s) = \Pr(a|s;\theta)$ parametrized by $\theta$.
- Finitely many discrete actions
  - Softmax: $\pi_\theta(a|s) = \frac{\exp(h(s,a;\theta))}{\sum_{a'} \exp(h(s,a';\theta))}$
  - where $h(s,a;\theta)$ might be linear in $\theta$: $h(s,a;\theta) = \sum_i \theta_i f_i(s,a)$ for features $f_i$
  - or non-linear in $\theta$: $h(s,a;\theta) = \mathrm{neuralNet}(s,a;\theta)$
- Continuous actions
  - Gaussian: $\pi_\theta(a|s) = N(a \mid \mu(s;\theta), \Sigma(s;\theta))$
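The softmax and Gaussian policies on this slide can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the feature function `one_hot_features`, the function names, and the one-dimensional Gaussian are hypothetical choices and do not come from the lecture.

```python
import math
import random

def softmax_policy(theta, features, state, actions):
    """Return {action: probability} for a softmax over linear preferences h = theta . f(s, a)."""
    prefs = [sum(t * f for t, f in zip(theta, features(state, a))) for a in actions]
    m = max(prefs)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return {a: e / total for a, e in zip(actions, exps)}

def sample_gaussian_action(mean, std, rng=random):
    """Draw a one-dimensional continuous action from N(mean, std^2)."""
    return rng.gauss(mean, std)

# Hypothetical one-hot features for a single-state problem with two actions.
def one_hot_features(state, action):
    return [1.0 if action == i else 0.0 for i in range(2)]

# Preferences 0 and log(3) give odds of 1:3 between the two actions.
probs = softmax_policy([0.0, math.log(3.0)], one_hot_features, state=None, actions=[0, 1])
continuous_action = sample_gaussian_action(mean=0.0, std=1.0, rng=random.Random(0))
```

Subtracting the maximum preference does not change the probabilities; it only avoids overflow in `math.exp`.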

5 Supervised Learning
- Consider a stochastic policy $\pi_\theta(a|s)$
- Data: state-action pairs $\{(s_1, a_1), (s_2, a_2), \ldots\}$
- Maximize log likelihood of the data
  $\theta^* = \mathrm{argmax}_\theta \sum_n \log \pi_\theta(a_n|s_n)$
- Gradient update
  $\theta_{n+1} \leftarrow \theta_n + \alpha_n \nabla_\theta \log \pi_\theta(a_n|s_n)$
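As a rough illustration of the gradient update above, the sketch below assumes a softmax policy with linear preferences, for which $\nabla_\theta \log \pi_\theta(a|s) = f(s,a) - \sum_b \pi_\theta(b|s) f(s,b)$. The feature vectors, step size, and function names are hypothetical.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(theta, feats, chosen):
    """feats[b] is the feature vector f(s, b); returns grad_theta log pi_theta(chosen | s)."""
    prefs = [sum(t * x for t, x in zip(theta, f)) for f in feats]
    probs = softmax(prefs)
    # Expected feature vector under the current policy.
    expected = [sum(p * f[i] for p, f in zip(probs, feats)) for i in range(len(theta))]
    return [feats[chosen][i] - expected[i] for i in range(len(theta))]

def supervised_step(theta, feats, expert_action, alpha):
    """One step of theta <- theta + alpha * grad log pi_theta(a | s)."""
    g = grad_log_softmax(theta, feats, expert_action)
    return [t + alpha * gi for t, gi in zip(theta, g)]

# Assumed toy data: one state, two actions, and the expert always picks action 1.
feats = [[1.0, 0.0], [0.0, 1.0]]
theta = [0.0, 0.0]
for _ in range(200):
    theta = supervised_step(theta, feats, expert_action=1, alpha=0.5)
p_expert = softmax([sum(t * x for t, x in zip(theta, f)) for f in feats])[1]
```

Repeating the update on the same expert action moves the probability of that action toward 1, which is what maximizing the log likelihood asks for.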

6 Reinforcement Learning
- Consider a stochastic policy $\pi_\theta(a|s)$
- Data: state-action-reward triples $\{(s_1, a_1, r_1), (s_2, a_2, r_2), \ldots\}$
- Maximize discounted sum of rewards
  $\theta^* = \mathrm{argmax}_\theta \sum_n \gamma^n E_\theta[r_n \mid s_n, a_n]$
- Gradient update
  $\theta_{n+1} \leftarrow \theta_n + \alpha_n \gamma^n G_n \nabla_\theta \log \pi_\theta(a_n|s_n)$
  where $G_n = \sum_{t=0}^{\infty} \gamma^t r_{n+t}$
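The return $G_n$ used in the update can be computed for every step of a finite episode with one backward pass, using $G_n = r_n + \gamma G_{n+1}$. This is a minimal sketch assuming the episode has ended, so the infinite sum is truncated at the last reward; the function name is hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where G_n = sum_t gamma^t * r_{n+t} over the rest of the episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for n in reversed(range(len(rewards))):
        running = rewards[n] + gamma * running  # G_n = r_n + gamma * G_{n+1}
        returns[n] = running
    return returns

returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
# G_2 = 2.0, G_1 = 0.0 + 0.5 * 2.0 = 1.0, G_0 = 1.0 + 0.5 * 1.0 = 1.5
```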

7 Stochastic Gradient Policy Theorem
- Stochastic Gradient Policy Theorem
  $\nabla V_\theta(s_0) \propto \sum_s \mu_\theta(s) \sum_a \nabla \pi_\theta(a|s) Q_\theta(s,a)$
- $\mu_\theta(s)$: stationary state distribution when executing the policy parametrized by $\theta$
- $Q_\theta(s,a)$: discounted sum of rewards when starting in $s$, executing $a$ and following the policy parametrized by $\theta$ thereafter.

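8 Derivation
$\nabla V_\theta(s) = \nabla \left[ \sum_a \pi_\theta(a|s) Q_\theta(s,a) \right] \quad \forall s$
$= \sum_a \left[ \nabla \pi_\theta(a|s) Q_\theta(s,a) + \pi_\theta(a|s) \nabla Q_\theta(s,a) \right]$
$= \sum_a \left[ \nabla \pi_\theta(a|s) Q_\theta(s,a) + \pi_\theta(a|s) \nabla \sum_{s',r} \Pr(s',r|s,a) (r + \gamma V_\theta(s')) \right]$
$= \sum_a \left[ \nabla \pi_\theta(a|s) Q_\theta(s,a) + \pi_\theta(a|s) \sum_{s'} \gamma \Pr(s'|s,a) \nabla V_\theta(s') \right]$
$= \sum_a \left[ \nabla \pi_\theta(a|s) Q_\theta(s,a) + \pi_\theta(a|s) \sum_{s'} \gamma \Pr(s'|s,a) \sum_{a'} \left[ \nabla \pi_\theta(a'|s') Q_\theta(s',a') + \pi_\theta(a'|s') \sum_{s''} \gamma \Pr(s''|s',a') \nabla V_\theta(s'') \right] \right]$
$= \sum_x \sum_{n=0}^{\infty} \gamma^n \Pr(s \to x; n, \theta) \sum_a \nabla \pi_\theta(a|x) Q_\theta(x,a)$
where $\Pr(s \to x; n, \theta)$ is the probability of reaching $x$ from $s$ at time step $n$.
$\nabla V_\theta(s_0) = \sum_x \sum_{n=0}^{\infty} \gamma^n \Pr(s_0 \to x; n, \theta) \sum_a \nabla \pi_\theta(a|x) Q_\theta(x,a) \propto \sum_s \mu_\theta(s) \sum_a \nabla \pi_\theta(a|s) Q_\theta(s,a)$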
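The final expression of the derivation can be checked numerically on a tiny problem by comparing it with finite differences of $V_\theta(s_0)$. The two-state episodic MDP below is an assumption made up for this check: in $s_0$, action 0 ends the episode with reward 1 and action 1 moves to $s_1$ with reward 0; in $s_1$, the two actions end the episode with rewards 0 and 3. The policy is a tabular softmax with two parameters per state.

```python
import math

GAMMA = 0.9

def softmax2(a, b):
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return [ea / (ea + eb), eb / (ea + eb)]

def value_s0(theta):
    """V_theta(s0) for the assumed MDP; theta = [s0 logits (2), s1 logits (2)]."""
    p0 = softmax2(theta[0], theta[1])
    p1 = softmax2(theta[2], theta[3])
    v1 = p1[0] * 0.0 + p1[1] * 3.0
    return p0[0] * 1.0 + p0[1] * GAMMA * v1

def softmax_jacobian(p):
    # d p[a] / d logit[k] for a two-way softmax: p[a] * (1[a == k] - p[k])
    return [[p[a] * ((1.0 if a == k else 0.0) - p[k]) for k in range(2)] for a in range(2)]

def policy_gradient(theta):
    """sum_x sum_n gamma^n Pr(s0 -> x; n) sum_a grad pi(a|x) Q(x, a)."""
    p0 = softmax2(theta[0], theta[1])
    p1 = softmax2(theta[2], theta[3])
    q1 = [0.0, 3.0]
    v1 = p1[0] * q1[0] + p1[1] * q1[1]
    q0 = [1.0, GAMMA * v1]
    j0, j1 = softmax_jacobian(p0), softmax_jacobian(p1)
    grad = [0.0] * 4
    for k in range(2):
        # x = s0 is reached at step n = 0 with probability 1.
        grad[k] = sum(j0[a][k] * q0[a] for a in range(2))
        # x = s1 is reached at step n = 1 with probability pi(1 | s0).
        grad[2 + k] = GAMMA * p0[1] * sum(j1[a][k] * q1[a] for a in range(2))
    return grad

def finite_difference(theta, eps=1e-6):
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((value_s0(up) - value_s0(down)) / (2 * eps))
    return grad

theta = [0.2, -0.1, 0.3, 0.5]
analytic = policy_gradient(theta)
numeric = finite_difference(theta)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
```

In this episodic example the equality holds exactly, up to finite-difference error, before the normalization that turns the visitation weights into $\mu_\theta$.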

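9 REINFORCE: Monte Carlo Policy Gradient
$\nabla V_\theta \propto \sum_s \mu_\theta(s) \sum_a Q_\theta(s,a) \nabla \pi_\theta(a|s)$
$= E_\theta\left[ \gamma^n \sum_a Q_\theta(S_n,a) \nabla \pi_\theta(a|S_n) \right]$
$= E_\theta\left[ \gamma^n \sum_a \pi_\theta(a|S_n) Q_\theta(S_n,a) \frac{\nabla \pi_\theta(a|S_n)}{\pi_\theta(a|S_n)} \right]$
$= E_\theta\left[ \gamma^n Q_\theta(S_n,A_n) \frac{\nabla \pi_\theta(A_n|S_n)}{\pi_\theta(A_n|S_n)} \right]$
$= E_\theta\left[ \gamma^n G_n \frac{\nabla \pi_\theta(A_n|S_n)}{\pi_\theta(A_n|S_n)} \right]$
$= E_\theta\left[ \gamma^n G_n \nabla \log \pi_\theta(A_n|S_n) \right]$
Stochastic gradient: $\nabla V_\theta \approx \gamma^n G_n \nabla \log \pi_\theta(a_n|s_n)$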

10 REINFORCE Algorithm (stochastic policy)
REINFORCE($s_0$, $\pi_\theta$)
  Initialize $\pi_\theta$ to anything
  Loop forever (for each episode)
    Generate episode $s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T, a_T, r_T$ with $\pi_\theta$
    Loop for each step of the episode $n = 0, 1, \ldots, T$
      $G_n \leftarrow \sum_{t=0}^{T-n} \gamma^t r_{n+t}$
      Update policy: $\theta \leftarrow \theta + \alpha \gamma^n G_n \nabla \log \pi_\theta(a_n|s_n)$
  Return $\pi_\theta$
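The algorithm above might look as follows in plain Python. Everything about the environment is an assumption for illustration: a hypothetical four-state corridor where the agent starts in state 0, moves left or right, and receives reward 1 on reaching state 3. The policy is a tabular softmax, for which $\nabla \log \pi_\theta(a|s)$ with respect to the logit of action $b$ is $1[b = a] - \pi_\theta(b|s)$. The loop runs for a fixed number of episodes rather than forever.

```python
import math
import random

N_STATES, GOAL, MAX_STEPS = 4, 3, 20  # hypothetical corridor: states 0..3, goal at 3

def policy_probs(theta, s):
    m = max(theta[s])
    exps = [math.exp(h - m) for h in theta[s]]
    total = sum(exps)
    return [e / total for e in exps]

def generate_episode(theta, rng):
    """Roll out the current policy; returns a list of (state, action, reward)."""
    s, episode = 0, []
    for _ in range(MAX_STEPS):
        probs = policy_probs(theta, s)
        a = 0 if rng.random() < probs[0] else 1  # 0 = left, 1 = right
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == GOAL else 0.0
        episode.append((s, a, r))
        s = s_next
        if s == GOAL:
            break
    return episode

def reinforce(num_episodes=3000, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]  # "initialize to anything"
    for _ in range(num_episodes):
        episode = generate_episode(theta, rng)
        # Compute G_n for every step with a backward pass.
        g, returns = 0.0, [0.0] * len(episode)
        for n in reversed(range(len(episode))):
            g = episode[n][2] + gamma * g
            returns[n] = g
        # theta <- theta + alpha * gamma^n * G_n * grad log pi(a_n | s_n)
        for n, (s, a, _) in enumerate(episode):
            probs = policy_probs(theta, s)
            for b in range(2):
                grad_log = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha * (gamma ** n) * returns[n] * grad_log
    return theta

theta = reinforce()
p_right_at_start = policy_probs(theta, 0)[1]
```

Since moving right is the only way to reach the goal, one would expect `p_right_at_start` to end up above 0.5. The estimate is noisy because no baseline is subtracted from $G_n$.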

11 Example: Game of Go
(simplified) rules:
- Two players (black and white)
- Players alternate to place a stone of their color on a vacant intersection.
- Connected stones without any liberty (i.e., no adjacent vacant intersection) are captured and removed from the board
- Winner: player that controls the largest number of intersections at the end of the game
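The capture rule can be made concrete with a flood fill that collects a connected group of stones and its liberties. The board encoding (`'B'`, `'W'`, `'.'` in a list of strings) and the function name are assumptions for this sketch, not something used in the lecture.

```python
def group_and_liberties(board, row, col):
    """Return (stones, liberties) for the connected group containing (row, col).

    board is a list of equal-length strings using 'B', 'W' and '.' (vacant).
    """
    color = board[row][col]
    if color == '.':
        raise ValueError("no stone at the given intersection")
    stones, liberties = set(), set()
    stack = [(row, col)]
    while stack:
        r, c = stack.pop()
        if (r, c) in stones:
            continue
        stones.add((r, c))
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < len(board) and 0 <= nc < len(board[0]):
                if board[nr][nc] == '.':
                    liberties.add((nr, nc))   # adjacent vacant intersection
                elif board[nr][nc] == color:
                    stack.append((nr, nc))    # same-colored neighbor joins the group

    return stones, liberties

board = [
    ".BB.",
    "BWWB",
    ".BB.",
]
white_stones, white_liberties = group_and_liberties(board, 1, 1)
black_stones, black_liberties = group_and_liberties(board, 0, 1)
```

Here the two white stones form one group with no liberties, so under the rule above they would be captured, while the black group on the top row still has two liberties.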

12 Computer Go
[Figure; surviving labels: Deep RL, Monte Carlo Tree Search, Oct 2015:]

13 Computer Go
- March 2016: AlphaGo defeats Lee Sedol (9-dan)
  "[AlphaGo] can't beat me" - Ke Jie (world champion)
- May 2017: AlphaGo defeats Ke Jie (world champion)
  "Last year, [AlphaGo] was still quite humanlike when it played. But this year, it became like a god of Go" - Ke Jie (world champion)

14 Winning Strategy
Four steps:
1. Supervised Learning of Policy Networks
2. Policy gradient with Policy Networks
3. Value gradient with Value Networks
4. Searching with Policy and Value Networks


15 Policy Network
- Train policy network to imitate Go experts based on a database of 30 million board configurations from the KGS Go Server.
- Policy network: $\pi(a|s)$
  - Input: state $s$ (board configuration)
  - Output: distribution over actions $a$ (intersection on which the next stone will be placed)

16 Supervised Learning of the Policy Network
- Let $w$ be the weights of the policy network
- Training:
  - Data: suppose $a$ is optimal in $s$
  - Objective: maximize $\log \pi_w(a|s)$
  - Gradient: $\nabla w = \frac{\partial \log \pi_w(a|s)}{\partial w}$
  - Weight update: $w \leftarrow w + \alpha \nabla w$

17 Policy gradient for the Policy Network
- How can we update a policy network based on reinforcements instead of the optimal action?
- Let $G_n = \sum_t \gamma^t r_{n+t}$ be the discounted sum of rewards in a trajectory that starts in $s$ at time $n$ by executing $a$.
- Gradient: $\nabla w = \frac{\partial \log \pi_w(a|s)}{\partial w} \gamma^n G_n$
  - Intuition: rescale the supervised learning gradient by $G_n$
- Policy update: $w \leftarrow w + \alpha \nabla w$

18 Policy gradient for the Policy Network
- In computer Go, the program repeatedly plays games against its former self.
- For each game, $G_n = 1$ for a win and $G_n = -1$ for a loss
- For each $(s_n, a_n)$ at turn $n$ of the game, assume $\gamma = 1$ and compute
  - Gradient: $\nabla w = \frac{\partial \log \pi_w(a_n|s_n)}{\partial w} \gamma^n G_n$
  - Policy update: $w \leftarrow w + \alpha \nabla w$
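With $\gamma = 1$ and a return of $+1$ or $-1$ shared by every turn, the update pushes all moves of a won game up and all moves of a lost game down. The sketch below shows this with a tabular softmax standing in for the policy network; the table `w`, the state indices, and the function names are hypothetical.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def selfplay_policy_update(w, game, outcome, alpha):
    """game: list of (state_index, action_index); outcome: +1.0 for a win, -1.0 for a loss.

    w[s] holds the logits of a tabular softmax policy (a stand-in for the network weights).
    """
    new_w = [list(row) for row in w]
    for s, a in game:
        probs = softmax(new_w[s])
        for b in range(len(probs)):
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            # gamma = 1, so gamma^n = 1 and every turn is scaled by the same outcome.
            new_w[s][b] += alpha * outcome * grad_log
    return new_w

w = [[0.0, 0.0], [0.0, 0.0]]
moves = [(0, 1), (1, 0)]
won = selfplay_policy_update(w, moves, outcome=+1.0, alpha=0.5)
lost = selfplay_policy_update(w, moves, outcome=-1.0, alpha=0.5)
```

The same moves produce opposite updates depending on the outcome, which is the rescaling of the supervised gradient by $G_n$ described on the previous slide.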

19 Value Network
- Predict $V(s')$ (i.e., who will win the game) in each state $s'$ with a value network
  - Input: state $s$ (board configuration)
  - Output: expected discounted sum of rewards $V(s')$

20 Gradient Value Learning with Value Networks
- Let $w$ be the weights of the value network
- Training:
  - Data: $(s, G)$ where $G = 1$ for a win and $G = -1$ for a loss
  - Objective: minimize $\frac{1}{2} (V_w(s) - G)^2$
  - Gradient: $\nabla w = \frac{\partial V_w(s)}{\partial w} (V_w(s) - G)$
  - Weight update: $w \leftarrow w - \alpha \nabla w$
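The weight update above can be illustrated with a linear value function $V_w(s) = w \cdot x(s)$, for which $\partial V_w(s) / \partial w = x(s)$. The linear form, the feature vector `x`, and the step size are assumptions for this sketch; the lecture uses a neural network.

```python
def value(w, x):
    """Linear value estimate V_w(s) = w . x(s)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def value_step(w, x, g, alpha):
    """One step of w <- w - alpha * dV/dw * (V_w(s) - G), with dV/dw = x for a linear model."""
    error = value(w, x) - g
    return [wi - alpha * error * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
x = [1.0, 2.0]           # hypothetical features of one board configuration
for _ in range(100):
    w = value_step(w, x, g=1.0, alpha=0.1)   # G = 1: the game from this state was won
prediction = value(w, x)
```

With these numbers each step halves the prediction error, so `prediction` approaches the target $G = 1$.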

21 Searching with Policy and Value Networks
- AlphaGo combines policy and value networks into a Monte Carlo Tree Search (MCTS) algorithm
- Idea: construct a search tree
  - Node: $s$
  - Edge: $a$
- We will discuss MCTS in a few lectures

22 Competition
