Lecture 8: Policy Gradient I


1 Lecture 8: Policy Gradient I
Emma Brunskill, CS234 Reinforcement Learning, Winter 2018
Additional reading: Sutton and Barto 2018, Chp. 13
With many slides from or derived from David Silver and John Schulman and Pieter Abbeel

2 Last Time: We want RL Algorithms that Perform
Optimization
Delayed consequences
Exploration
Generalization
And do it statistically and computationally efficiently

3 Last Time: Generalization and Efficiency Can use structure and additional knowledge to help constrain and speed reinforcement learning Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 5 Winter / 62

4 Class Structure Last time: Imitation Learning This time: Policy Search Next time: Policy Search Cont. Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 6 Winter / 62

5 Table of Contents 1 Introduction 2 Policy Gradient 3 Score Function and Policy Gradient Theorem 4 Policy Gradient Algorithms and Reducing Variance Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 7 Winter / 62

6 Policy-Based Reinforcement Learning
In the last lecture we approximated the value or action-value function using parameters θ:
V_θ(s) ≈ V^π(s)
Q_θ(s, a) ≈ Q^π(s, a)
A policy was generated directly from the value function, e.g. using ε-greedy
In this lecture we will directly parametrize the policy:
π_θ(s, a) = P[a | s, θ]
Goal is to find a policy π with the highest value function V^π
We will focus again on model-free reinforcement learning

7 Value-Based and Policy-Based RL
Value-Based: learnt value function, implicit policy (e.g. ε-greedy)
Policy-Based: no value function, learnt policy
Actor-Critic: learnt value function, learnt policy

8 Advantages of Policy-Based RL
Advantages: better convergence properties; effective in high-dimensional or continuous action spaces; can learn stochastic policies
Disadvantages: typically converge to a local rather than global optimum; evaluating a policy is typically inefficient and high variance

9 Example: Rock-Paper-Scissors Two-player game of rock-paper-scissors Scissors beats paper Rock beats scissors Paper beats rock Consider policies for iterated rock-paper-scissors A deterministic policy is easily exploited A uniform random policy is optimal (i.e. Nash equilibrium) Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 11 Winter / 62

10 Example: Aliased Gridworld (1)
The agent cannot differentiate the grey states
Consider features of the following form (for all N, E, S, W):
φ(s, a) = 1(wall to N, a = move E)
Compare value-based RL, using an approximate value function Q_θ(s, a) = f(φ(s, a), θ), to policy-based RL, using a parametrised policy π_θ(s, a) = g(φ(s, a), θ)

11 Example: Aliased Gridworld (2) Under aliasing, an optimal deterministic policy will either move W in both grey states (shown by red arrows) move E in both grey states Either way, it can get stuck and never reach the money Value-based RL learns a near-deterministic policy e.g. greedy or ɛ-greedy So it will traverse the corridor for a long time Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 13 Winter / 62

12 Example: Aliased Gridworld (3)
An optimal stochastic policy will randomly move E or W in grey states:
π_θ(wall to N and S, move E) = 0.5
π_θ(wall to N and S, move W) = 0.5
It will reach the goal state in a few steps with high probability
Policy-based RL can learn the optimal stochastic policy

13 Policy Objective Functions
Goal: given a policy π_θ(s, a) with parameters θ, find best θ
But how do we measure the quality of a policy π_θ?
In episodic environments we can use the start value of the policy:
J_1(θ) = V^{π_θ}(s_1) = E_{π_θ}[v_1]
In continuing environments we can use the average value:
J_avV(θ) = Σ_s d^{π_θ}(s) V^{π_θ}(s)
where d^{π_θ}(s) is the stationary distribution of the Markov chain for π_θ
Or the average reward per time-step:
J_avR(θ) = Σ_s d^{π_θ}(s) Σ_a π_θ(s, a) R(s, a)
For simplicity, today will mostly discuss the episodic case, but can easily extend to the continuing / infinite horizon case

14 Policy optimization Policy based reinforcement learning is an optimization problem Find policy parameters θ that maximize V π θ Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 16 Winter / 62

15 Policy optimization Policy based reinforcement learning is an optimization problem Find policy parameters θ that maximize V π θ Can use gradient free optimization: Hill climbing Simplex / amoeba / Nelder Mead Genetic algorithms Cross-Entropy method (CEM) Covariance Matrix Adaptation (CMA) Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 17 Winter / 62

16 Recall Human-in-the-Loop Exoskeleton Optimization (Zhang et al. Science 2017)
Optimization was done using CMA-ES, a variant of Covariance Matrix Adaptation

17 Gradient Free Policy Optimization
Can often work embarrassingly well: "discovered that evolution strategies (ES), an optimization technique that's been known for decades, rivals the performance of standard reinforcement learning (RL) techniques on modern RL benchmarks (e.g. Atari/MuJoCo)"

18 Gradient Free Policy Optimization Often a great simple baseline to try Benefits Can work with any policy parameterizations, including non-differentiable Frequently very easy to parallelize Limitations Typically not very sample efficient because it ignores temporal structure Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 20 Winter / 62
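As a concrete illustration of the gradient-free approach, here is a minimal cross-entropy method (CEM) sketch, one of the methods listed two slides above. It assumes a hypothetical black-box evaluate_policy(theta) that rolls out the policy with parameters theta for a few episodes and returns the average return; that helper and all names below are illustrative, not from the slides.

```python
# A minimal cross-entropy method (CEM) sketch for gradient-free policy search.
# evaluate_policy(theta) is an assumed black-box returning an average episode
# return; it need not be differentiable.
import numpy as np

def cem_policy_search(evaluate_policy, dim, iterations=50,
                      pop_size=100, elite_frac=0.2, init_std=1.0):
    mean = np.zeros(dim)                      # mean of the sampling distribution
    std = np.full(dim, init_std)              # per-dimension standard deviation
    n_elite = int(pop_size * elite_frac)
    for _ in range(iterations):
        # Sample a population of candidate parameter vectors
        thetas = mean + std * np.random.randn(pop_size, dim)
        returns = np.array([evaluate_policy(th) for th in thetas])
        # Keep the top-performing candidates and refit the Gaussian to them
        elite = thetas[np.argsort(returns)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```

Because only whole-episode returns are used, this works even for non-differentiable policy parameterizations, but it ignores the temporal structure of the return, which is the sample-efficiency limitation noted on the slide above.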

19 Policy optimization
Policy based reinforcement learning is an optimization problem: find policy parameters θ that maximize V^{π_θ}
Can use gradient free optimization
Greater efficiency often possible using gradient:
Gradient descent
Conjugate gradient
Quasi-Newton
We focus on gradient descent, many extensions possible
And on methods that exploit sequential structure

20 Table of Contents 1 Introduction 2 Policy Gradient 3 Score Function and Policy Gradient Theorem 4 Policy Gradient Algorithms and Reducing Variance Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 22 Winter / 62

21 Policy Gradient
Define V(θ) = V^{π_θ} to make explicit the dependence of the value on the policy parameters
Assume episodic MDPs (easy to extend to related objectives, like average reward)

22 Policy Gradient
Define V(θ) = V^{π_θ} to make explicit the dependence of the value on the policy parameters
Assume episodic MDPs
Policy gradient algorithms search for a local maximum in V(θ) by ascending the gradient of the policy, w.r.t. parameters θ:
Δθ = α ∇_θ V(θ)
where ∇_θ V(θ) is the policy gradient and α is a step-size parameter
∇_θ V(θ) = (∂V(θ)/∂θ_1, ..., ∂V(θ)/∂θ_n)^T

23 Computing Gradients by Finite Differences
To evaluate the policy gradient of π_θ(s, a):
For each dimension k ∈ [1, n], estimate the kth partial derivative of the objective function w.r.t. θ by perturbing θ by a small amount ε in the kth dimension:
∂V(θ)/∂θ_k ≈ (V(θ + ε u_k) - V(θ)) / ε
where u_k is a unit vector with 1 in the kth component, 0 elsewhere

24 Computing Gradients by Finite Differences
To evaluate the policy gradient of π_θ(s, a):
For each dimension k ∈ [1, n], estimate the kth partial derivative of the objective function w.r.t. θ by perturbing θ by a small amount ε in the kth dimension:
∂V(θ)/∂θ_k ≈ (V(θ + ε u_k) - V(θ)) / ε
where u_k is a unit vector with 1 in the kth component, 0 elsewhere
Uses n evaluations to compute policy gradient in n dimensions
Simple, noisy, inefficient, but sometimes effective
Works for arbitrary policies, even if policy is not differentiable
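A short sketch of this finite-difference estimate, assuming a hypothetical black-box V(theta) that returns the (noisy) value of the policy with parameters theta, e.g. the average return over a few rollouts; the helper name is illustrative.

```python
# Finite-difference estimate of the policy gradient: one perturbation per
# parameter dimension, as on the slide above.
import numpy as np

def finite_difference_gradient(V, theta, eps=1e-2):
    grad = np.zeros_like(theta, dtype=float)
    base = V(theta)                       # V(theta)
    for k in range(len(theta)):
        u_k = np.zeros_like(theta, dtype=float)
        u_k[k] = 1.0                      # unit vector in the k-th dimension
        grad[k] = (V(theta + eps * u_k) - base) / eps
    return grad                           # n perturbed evaluations for n dimensions
```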

25 Training AIBO to Walk by Finite Difference Policy Gradient
Goal: learn a fast AIBO walk (useful for Robocup)
Adapt these parameters by finite difference policy gradient
Evaluate performance of policy by field traversal time
Kohl and Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. ICRA 2004. ai-lab/pubs/icra04.pdf

26 AIBO Policy Parameterization
AIBO walk policy is an open-loop policy: no state, choosing a set of action parameters that define an ellipse
Specified by 12 continuous parameters (elliptical loci):
The front locus (3 parameters: height, x-pos., y-pos.)
The rear locus (3 parameters)
Locus length
Locus skew multiplier in the x-y plane (for turning)
The height of the front of the body
The height of the rear of the body
The time each foot takes to move through its locus
The fraction of time each foot spends on the ground
New policies: for each parameter, randomly add (+ε, 0, or -ε)

27 AIBO Policy Experiments All of the policy evaluations took place on actual robots... only human intervention required during an experiment involved replacing discharged batteries... about once an hour. Ran on 3 Aibos at once Evaluated 15 policies per iteration. Each policy evaluated 3 times (to reduce noise) and averaged Each iteration took 7.5 minutes Used η = 2 (learning rate for their finite difference approach) Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 30 Winter / 62

28 Training AIBO to Walk by Finite Difference Policy Gradient Results Authors discuss that performance is likely impacted by: initial starting policy parameters, ɛ (how much policies are perturbed), η (how much to change policy), as well as policy parameterization Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 31 Winter / 62

29 AIBO Walk Policies

30 Table of Contents 1 Introduction 2 Policy Gradient 3 Score Function and Policy Gradient Theorem 4 Policy Gradient Algorithms and Reducing Variance Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 33 Winter / 62

31 Computing the gradient analytically
We now compute the policy gradient analytically
Assume policy π_θ is differentiable whenever it is non-zero and we know the gradient ∇_θ π_θ(s, a)

32 Likelihood Ratio Policies
Denote a state-action trajectory as τ = (s_0, a_0, r_0, ..., s_{T-1}, a_{T-1}, r_{T-1}, s_T)
Use R(τ) = Σ_{t=0}^{T-1} R(s_t, a_t) to be the sum of rewards for a trajectory τ

33 Likelihood Ratio Policies
Denote a state-action trajectory as τ = (s_0, a_0, r_0, ..., s_{T-1}, a_{T-1}, r_{T-1}, s_T)
Use R(τ) = Σ_{t=0}^{T-1} R(s_t, a_t) to be the sum of rewards for a trajectory τ
Policy value is
V(θ) = E_{π_θ}[ Σ_{t=0}^{T-1} R(s_t, a_t) ; π_θ ] = Σ_τ P(τ; θ) R(τ)    (1)
where P(τ; θ) is used to denote the probability over trajectories when executing policy π(θ)
In this new notation, our goal is to find the policy parameters θ:
arg max_θ V(θ) = arg max_θ Σ_τ P(τ; θ) R(τ)    (2)

34 Likelihood Ratio Policy Gradient
Goal is to find the policy parameters θ:
arg max_θ V(θ) = arg max_θ Σ_τ P(τ; θ) R(τ)    (3)
Take the gradient with respect to θ:
∇_θ V(θ) = ∇_θ Σ_τ P(τ; θ) R(τ)

35 Likelihood Ratio Policy Gradient
Goal is to find the policy parameters θ:
arg max_θ V(θ) = arg max_θ Σ_τ P(τ; θ) R(τ)    (4)
Take the gradient with respect to θ:
∇_θ V(θ) = ∇_θ Σ_τ P(τ; θ) R(τ)
= Σ_τ ∇_θ P(τ; θ) R(τ)
= Σ_τ (P(τ; θ) / P(τ; θ)) ∇_θ P(τ; θ) R(τ)
= Σ_τ P(τ; θ) R(τ) [∇_θ P(τ; θ) / P(τ; θ)]    (the likelihood ratio)
= Σ_τ P(τ; θ) R(τ) ∇_θ log P(τ; θ)

36 Likelihood Ratio Policy Gradient
Goal is to find the policy parameters θ:
arg max_θ V(θ) = arg max_θ Σ_τ P(τ; θ) R(τ)    (5)
Take the gradient with respect to θ:
∇_θ V(θ) = Σ_τ P(τ; θ) R(τ) ∇_θ log P(τ; θ)
Approximate with empirical estimate for m sample paths under policy π_θ:
∇_θ V(θ) ≈ ĝ = (1/m) Σ_{i=1}^{m} R(τ^(i)) ∇_θ log P(τ^(i); θ)

37 Score Function Gradient Estimator: Intuition
Consider the generic form of R(τ^(i)) ∇_θ log P(τ^(i); θ):
ĝ_i = f(x_i) ∇_θ log p(x_i | θ)
f(x) measures how good the sample x is
Moving in the direction ĝ_i pushes up the logprob of the sample, in proportion to how good it is
Valid even if f(x) is discontinuous and unknown, or the sample space (containing x) is a discrete set

38 Score Function Gradient Estimator: Intuition ĝ_i = f(x_i) ∇_θ log p(x_i | θ)

39 Score Function Gradient Estimator: Intuition ĝ_i = f(x_i) ∇_θ log p(x_i | θ)
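A toy numerical illustration of the estimator ĝ_i = f(x_i) ∇_θ log p(x_i | θ): take p = N(θ, 1), so ∇_θ log p(x | θ) = (x - θ), and treat f as a black box. The quadratic f below is just for illustration and is not from the slides.

```python
# Score function (likelihood ratio) gradient estimator on a toy problem:
# estimate d/dtheta E_{x ~ N(theta, 1)}[ f(x) ] using g_i = f(x_i) * (x_i - theta).
import numpy as np

def score_function_gradient(theta, f, num_samples=10000):
    x = theta + np.random.randn(num_samples)      # samples x_i ~ N(theta, 1)
    score = x - theta                             # grad_theta log p(x_i | theta)
    return np.mean(f(x) * score)                  # Monte Carlo average of g_i

f = lambda x: -(x - 3.0) ** 2                     # "goodness" of a sample, treated as a black box
print(score_function_gradient(theta=0.0, f=f))    # positive: pushes theta toward 3
```

With θ = 0 and f peaked at x = 3, the estimate is positive: it pushes up the logprob of good samples in proportion to how good they are, exactly the intuition above.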

40 Decomposing the Trajectories Into States and Actions
Approximate with empirical estimate for m sample paths under policy π_θ:
∇_θ V(θ) ≈ ĝ = (1/m) Σ_{i=1}^{m} R(τ^(i)) ∇_θ log P(τ^(i); θ)
∇_θ log P(τ^(i); θ) = ?

41 Decomposing the Trajectories Into States and Actions
Approximate with empirical estimate for m sample paths under policy π_θ:
∇_θ V(θ) ≈ ĝ = (1/m) Σ_{i=1}^{m} R(τ^(i)) ∇_θ log P(τ^(i); θ)
∇_θ log P(τ^(i); θ) = ∇_θ log [ μ(s_0) Π_{t=0}^{T-1} π_θ(a_t | s_t) P(s_{t+1} | s_t, a_t) ]
(μ(s_0): initial state distribution; π_θ(a_t | s_t): policy; P(s_{t+1} | s_t, a_t): dynamics model)
= ∇_θ [ log μ(s_0) + Σ_{t=0}^{T-1} log π_θ(a_t | s_t) + Σ_{t=0}^{T-1} log P(s_{t+1} | s_t, a_t) ]
= Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t | s_t)    (no dynamics model required!)

42 Score Function
Define the score function as ∇_θ log π_θ(s, a)

43 Likelihood Ratio / Score Function Policy Gradient
Putting this together, the goal is to find the policy parameters θ:
arg max_θ V(θ) = arg max_θ Σ_τ P(τ; θ) R(τ)    (6)
Approximate with empirical estimate for m sample paths under policy π_θ using the score function:
∇_θ V(θ) ≈ ĝ = (1/m) Σ_{i=1}^{m} R(τ^(i)) ∇_θ log P(τ^(i); θ)
= (1/m) Σ_{i=1}^{m} R(τ^(i)) Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t^(i) | s_t^(i))
Do not need to know the dynamics model

44 Policy Gradient Theorem
Theorem: The policy gradient theorem generalizes the likelihood ratio approach. For any differentiable policy π_θ(s, a), and for any of the policy objective functions J = J_1 (episodic reward), J_avR (average reward per time step), or (1/(1-γ)) J_avV (average value), the policy gradient is
∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(s, a) Q^{π_θ}(s, a) ]

45 Table of Contents 1 Introduction 2 Policy Gradient 3 Score Function and Policy Gradient Theorem 4 Policy Gradient Algorithms and Reducing Variance Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 48 Winter / 62

46 Likelihood Ratio / Score Function Policy Gradient
∇_θ V(θ) ≈ (1/m) Σ_{i=1}^{m} R(τ^(i)) Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t^(i) | s_t^(i))
Unbiased but very noisy
Fixes that can make it practical:
Temporal structure
Baseline
Next time will discuss some additional tricks

47 Policy Gradient: Use Temporal Structure
Previously:
∇_θ E_τ[R] = E_τ[ ( Σ_{t=0}^{T-1} r_t ) ( Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t | s_t) ) ]
We can repeat the same argument to derive the gradient estimator for a single reward term r_t':
∇_θ E[r_t'] = E[ r_t' Σ_{t=0}^{t'} ∇_θ log π_θ(a_t | s_t) ]
Summing this formula over t', we obtain
∇_θ E[R] = E[ Σ_{t'=0}^{T-1} r_t' Σ_{t=0}^{t'} ∇_θ log π_θ(a_t | s_t) ] = E[ Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t | s_t) Σ_{t'=t}^{T-1} r_t' ]

48 Policy Gradient: Use Temporal Structure
Recall for a particular trajectory τ^(i), Σ_{t'=t}^{T-1} r_t'^(i) is the return G_t^(i)
∇_θ E[R] ≈ (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t | s_t) G_t^(i)
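A small helper for the return G_t = Σ_{t'=t}^{T-1} r_t' (reward-to-go) used in the estimator above; undiscounted by default to match the slide, with an optional discount factor as an added assumption.

```python
# Compute the reward-to-go G_t for one trajectory, scanning the rewards backwards.
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):       # accumulate from the end of the episode
        running = rewards[t] + gamma * running
        G[t] = running
    return G

print(rewards_to_go([1.0, 0.0, 2.0]))             # [3. 2. 2.]
```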

49 Monte-Carlo Policy Gradient (REINFORCE)
Leverages the likelihood ratio / score function and temporal structure:
Δθ_t = α ∇_θ log π_θ(s_t, a_t) G_t    (7)
REINFORCE:
Initialize policy parameters θ arbitrarily
for each episode {s_1, a_1, r_2, ..., s_{T-1}, a_{T-1}, r_T} ~ π_θ do
for t = 1 to T - 1 do
θ ← θ + α ∇_θ log π_θ(s_t, a_t) G_t
end for
end for
return θ
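A minimal REINFORCE sketch with a linear softmax policy (the softmax policy class appears on the slides below). sample_episode(theta) is a hypothetical helper that rolls out one episode and returns the per-step feature vectors φ(s_t), the actions a_t, and the rewards r_t; it and the other names are assumptions for illustration, not part of the slides.

```python
# REINFORCE with a linear softmax policy: theta has one row of weights per action.
import numpy as np

def softmax_probs(theta, phi):
    logits = theta @ phi                              # one logit per action
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_softmax(theta, phi, a):
    probs = softmax_probs(theta, phi)
    grad = -np.outer(probs, phi)                      # -pi(b|s) * phi(s) for every action b
    grad[a] += phi                                    # +phi(s) for the taken action
    return grad                                       # = (1{b=a} - pi(b|s)) phi(s)

def reinforce(sample_episode, theta, alpha=0.01, episodes=1000):
    for _ in range(episodes):
        phis, actions, rewards = sample_episode(theta)
        G = np.cumsum(rewards[::-1])[::-1]            # returns-to-go G_t
        for phi, a, g in zip(phis, actions, G):
            theta = theta + alpha * grad_log_softmax(theta, phi, a) * g
    return theta
```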

50 Differentiable Policy Classes Many choices of differentiable policy classes including: Softmax Gaussian Neural networks Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 53 Winter / 62

51 Softmax Policy
Weight actions using a linear combination of features φ(s, a)^T θ
Probability of action is proportional to exponentiated weight:
π_θ(s, a) = e^{φ(s,a)^T θ} / Σ_{a'} e^{φ(s,a')^T θ}    (8)
The score function is
∇_θ log π_θ(s, a) = φ(s, a) - E_{π_θ}[φ(s, ·)]
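A direct sketch of this softmax policy and its score function, with phi_sa a (num_actions × d) array holding the feature vector φ(s, a) for each action in the current state; the names are illustrative.

```python
# Softmax policy pi_theta(s, a) proportional to exp(phi(s, a)^T theta) and its score function.
import numpy as np

def softmax_policy(phi_sa, theta):
    logits = phi_sa @ theta                           # phi(s, a)^T theta for each action a
    z = np.exp(logits - logits.max())
    return z / z.sum()

def score_softmax(phi_sa, theta, a):
    probs = softmax_policy(phi_sa, theta)
    expected_phi = probs @ phi_sa                     # E_{pi_theta}[ phi(s, .) ]
    return phi_sa[a] - expected_phi                   # phi(s, a) - E[phi(s, .)]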

52 Gaussian Policy
In continuous action spaces, a Gaussian policy is natural
Mean is a linear combination of state features: μ(s) = φ(s)^T θ
Variance may be fixed at σ², or can also be parametrised
Policy is Gaussian: a ~ N(μ(s), σ²)
The score function is
∇_θ log π_θ(s, a) = (a - μ(s)) φ(s) / σ²
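The corresponding sketch for the Gaussian policy and its score function, with phi_s the state feature vector φ(s); again the names are illustrative.

```python
# Gaussian policy with mean mu(s) = phi(s)^T theta and fixed variance sigma^2.
import numpy as np

def gaussian_policy_sample(phi_s, theta, sigma):
    mu = phi_s @ theta                                # mu(s) = phi(s)^T theta
    return np.random.normal(mu, sigma)                # a ~ N(mu(s), sigma^2)

def score_gaussian(phi_s, theta, sigma, a):
    mu = phi_s @ theta
    return (a - mu) * phi_s / sigma**2                # grad_theta log pi_theta(s, a)
```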

53 Likelihood Ratio / Score Function Policy Gradient
∇_θ V(θ) ≈ (1/m) Σ_{i=1}^{m} R(τ^(i)) Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t^(i) | s_t^(i))
Unbiased but very noisy
Fixes that can make it practical:
Temporal structure
Baseline
Next time will discuss some additional tricks

54 Policy Gradient: Introduce Baseline
Reduce variance by introducing a baseline b(s):
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) ( Σ_{t'=t}^{T-1} r_t' - b(s_t) ) ]
For any choice of b(s), the gradient estimator is unbiased
Near optimal choice is the expected return, b(s_t) ≈ E[ r_t + r_{t+1} + ... + r_{T-1} ]
Interpretation: increase the logprob of action a_t proportionally to how much the returns Σ_{t'=t}^{T-1} r_t' are better than expected

55 Baseline b(s) Does Not Introduce Bias: Derivation
E_τ[ ∇_θ log π(a_t | s_t, θ) b(s_t) ]
= E_{s_{0:t}, a_{0:(t-1)}}[ E_{s_{(t+1):T}, a_{t:(T-1)}}[ ∇_θ log π(a_t | s_t, θ) b(s_t) ] ]

56 Baseline b(s) Does Not Introduce Bias: Derivation
E_τ[ ∇_θ log π(a_t | s_t, θ) b(s_t) ]
= E_{s_{0:t}, a_{0:(t-1)}}[ E_{s_{(t+1):T}, a_{t:(T-1)}}[ ∇_θ log π(a_t | s_t, θ) b(s_t) ] ]    (break up expectation)
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) E_{s_{(t+1):T}, a_{t:(T-1)}}[ ∇_θ log π(a_t | s_t, θ) ] ]    (pull baseline term out)
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) E_{a_t}[ ∇_θ log π(a_t | s_t, θ) ] ]    (remove irrelevant variables)
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) Σ_a π_θ(a_t | s_t) ( ∇_θ π(a_t | s_t, θ) / π_θ(a_t | s_t) ) ]    (likelihood ratio)
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) Σ_a ∇_θ π(a_t | s_t, θ) ]
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) ∇_θ Σ_a π(a_t | s_t, θ) ]
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) ∇_θ 1 ]
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) · 0 ] = 0

57 Vanilla Policy Gradient Algorithm
Initialize policy parameter θ, baseline b
for iteration = 1, 2, ... do
Collect a set of trajectories by executing the current policy
At each timestep in each trajectory, compute the return R_t = Σ_{t'=t}^{T-1} r_t', and the advantage estimate Â_t = R_t - b(s_t)
Re-fit the baseline, by minimizing ||b(s_t) - R_t||², summed over all trajectories and timesteps
Update the policy, using a policy gradient estimate ĝ, which is a sum of terms ∇_θ log π(a_t | s_t, θ) Â_t (plug ĝ into SGD or ADAM)
end for
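One iteration of the algorithm above as a sketch, with a linear baseline b(s) = w^T φ(s) re-fit by least squares. collect_trajectories and grad_log_pi are hypothetical helpers (rollout collection and the per-step score function, e.g. one of the policy classes above); they are assumptions, not part of the slides.

```python
# One iteration of vanilla policy gradient with a least-squares linear baseline.
import numpy as np

def vanilla_pg_iteration(collect_trajectories, grad_log_pi, theta, w, alpha=0.01):
    trajectories = collect_trajectories(theta)        # list of (phis, actions, rewards)
    all_phis, all_returns, grad = [], [], np.zeros_like(theta)
    for phis, actions, rewards in trajectories:
        R = np.cumsum(rewards[::-1])[::-1]            # returns R_t = sum_{t' >= t} r_t'
        for phi, a, R_t in zip(phis, actions, R):
            A_hat = R_t - w @ phi                     # advantage estimate A_t = R_t - b(s_t)
            grad += grad_log_pi(theta, phi, a) * A_hat
            all_phis.append(phi); all_returns.append(R_t)
    theta = theta + alpha * grad                      # policy gradient step (SGD here)
    # Re-fit the baseline weights by least squares on (phi(s_t), R_t) pairs
    w, *_ = np.linalg.lstsq(np.array(all_phis), np.array(all_returns), rcond=None)
    return theta, w
```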

58 Practical Implementation with Autodiff
Usual formula Σ_t ∇_θ log π(a_t | s_t; θ) Â_t is inefficient; want to batch data
Define a surrogate function using data from the current batch:
L(θ) = Σ_t log π(a_t | s_t; θ) Â_t
Then the policy gradient estimator is ĝ = ∇_θ L(θ)
Can also include the value function fit error:
L(θ) = Σ_t ( log π(a_t | s_t; θ) Â_t - ||V(s_t) - R̂_t||² )
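A sketch of this surrogate loss with an autodiff framework (PyTorch assumed here; any autodiff framework works). The vf_coef weighting on the value error is an added assumption for illustration, the slide simply sums the two terms.

```python
# Surrogate loss for batched policy gradient with autodiff.
# logits and values come from the current policy/value networks on a batch of states;
# actions, advantages, returns are tensors built from the collected batch.
import torch

def surrogate_loss(logits, values, actions, advantages, returns, vf_coef=0.5):
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    policy_term = (log_probs * advantages).sum()       # L(theta) = sum_t log pi(a_t|s_t) * A_t
    value_error = ((values - returns) ** 2).sum()      # value function fit error
    # We maximize the surrogate and minimize the value error, so return a loss to minimize
    return -(policy_term - vf_coef * value_error)

# loss = surrogate_loss(...); loss.backward() then yields ghat = grad_theta L(theta),
# which can be plugged into SGD or Adam.
```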

59 Value Functions
Recall the Q-function / state-action value function:
Q^{π,γ}(s, a) = E_π[ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s, a_0 = a ]
The state-value function can serve as a great baseline:
V^{π,γ}(s) = E_π[ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s ] = E_{a~π}[ Q^{π,γ}(s, a) ]
Advantage function: combining Q with baseline V
A^{π,γ}(s, a) = Q^{π,γ}(s, a) - V^{π,γ}(s)

60 N-step estimators
Can also consider blending between TD and MC estimators for the target to substitute for the true state-action value function:
R̂_t^(1) = r_t + γ V(s_{t+1})
R̂_t^(2) = r_t + γ r_{t+1} + γ² V(s_{t+2})
R̂_t^(∞) = r_t + γ r_{t+1} + γ² r_{t+2} + ...
If we subtract baselines from the above, we get advantage estimators:
Â_t^(1) = r_t + γ V(s_{t+1}) - V(s_t)
Â_t^(2) = r_t + γ r_{t+1} + γ² V(s_{t+2}) - V(s_t)
Â_t^(∞) = r_t + γ r_{t+1} + γ² r_{t+2} + ... - V(s_t)
Â_t^(1) has low variance & high bias; Â_t^(∞) has high variance but low bias. (Why? Like which model-free policy estimation techniques?)
Using an intermediate k (say, 20) can give an intermediate amount of bias and variance
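A sketch of the k-step advantage estimator Â_t^(k) defined above, where values holds V(s_0), ..., V(s_T) from the baseline/critic (V(s_T) should be 0 if s_T is terminal); the helper name is illustrative.

```python
# k-step advantage: k rewards, bootstrap with V(s_{t+k}), subtract the baseline V(s_t).
import numpy as np

def k_step_advantage(rewards, values, t, k, gamma=0.99):
    T = len(rewards)                                  # values has length T + 1
    steps = min(k, T - t)                             # truncate at the end of the episode
    target = sum(gamma**i * rewards[t + i] for i in range(steps))
    target += gamma**steps * values[t + steps]        # bootstrap with V(s_{t+steps})
    return target - values[t]                         # A_t^(k) = target - V(s_t)
```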

61 Application: Robot Locomotion Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 64 Winter / 62

62 Class Structure Last time: Imitation Learning This time: Policy Search Next time: Policy Search Cont. Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 65 Winter / 62
