Lecture 8: Policy Gradient I
- Cory Hope McLaughlin
1 Lecture 8: Policy Gradient I 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 Additional reading: Sutton and Barto 2018 Chp With many slides from or derived from David Silver and John Schulman and Pieter Abbeel Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 3 Winter / 62
2 Last Time: We Want RL Algorithms That Perform
- Optimization
- Delayed consequences
- Exploration
- Generalization
- And do it statistically and computationally efficiently
3 Last Time: Generalization and Efficiency
- Can use structure and additional knowledge to help constrain and speed reinforcement learning
4 Class Structure
- Last time: Imitation Learning
- This time: Policy Search
- Next time: Policy Search Cont.
5 Table of Contents
1. Introduction
2. Policy Gradient
3. Score Function and Policy Gradient Theorem
4. Policy Gradient Algorithms and Reducing Variance
6 Policy-Based Reinforcement Learning
- In the last lecture we approximated the value or action-value function using parameters θ: $V_\theta(s) \approx V^\pi(s)$, $Q_\theta(s, a) \approx Q^\pi(s, a)$
- A policy was generated directly from the value function, e.g. using ε-greedy
- In this lecture we will directly parametrize the policy: $\pi_\theta(s, a) = \mathbb{P}[a \mid s; \theta]$
- Goal is to find a policy π with the highest value function $V^\pi$
- We will focus again on model-free reinforcement learning
7 Value-Based and Policy-Based RL
- Value-based: learnt value function, implicit policy (e.g. ε-greedy)
- Policy-based: no value function, learnt policy
- Actor-critic: learnt value function, learnt policy
8 Advantages of Policy-Based RL
Advantages:
- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages:
- Typically converge to a local rather than global optimum
- Evaluating a policy is typically inefficient and high variance
9 Example: Rock-Paper-Scissors
- Two-player game of rock-paper-scissors
  - Scissors beats paper
  - Rock beats scissors
  - Paper beats rock
- Consider policies for iterated rock-paper-scissors
  - A deterministic policy is easily exploited
  - A uniform random policy is optimal (i.e. Nash equilibrium)
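The exploitability claim on this slide can be checked numerically. The sketch below is an illustration only: the helper names `opponent_payoff` and `exploitability` are hypothetical and not from the lecture, and the payoff convention (+1 win, -1 loss, 0 tie, from the opponent's point of view) is an assumption.

```python
MOVES = ["rock", "paper", "scissors"]
# BEATS[x] is the move that x defeats.
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}

def opponent_payoff(ours, theirs):
    """Payoff to the opponent: +1 if they win, -1 if they lose, 0 for a tie."""
    if BEATS[theirs] == ours:
        return 1
    if BEATS[ours] == theirs:
        return -1
    return 0

def exploitability(policy):
    """Best expected payoff any fixed opponent move achieves against `policy`.

    `policy` maps each move to the probability that we play it.
    """
    return max(
        sum(policy[ours] * opponent_payoff(ours, theirs) for ours in MOVES)
        for theirs in MOVES
    )

always_rock = {"rock": 1.0, "paper": 0.0, "scissors": 0.0}
uniform = {m: 1.0 / 3.0 for m in MOVES}

print(exploitability(always_rock))  # opponent always plays paper and wins
print(exploitability(uniform))      # no opponent move gains anything
```

Under these assumptions the deterministic policy gives the opponent the maximum payoff every round, while the uniform policy gives the opponent an expected payoff of zero whatever they play.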
10 Example: Aliased Gridworld (1)
- The agent cannot differentiate the grey states
- Consider features of the following form (for all N, E, S, W): $\phi(s, a) = \mathbb{1}(\text{wall to N},\ a = \text{move E})$
- Compare value-based RL, using an approximate value function $Q_\theta(s, a) = f(\phi(s, a), \theta)$
- To policy-based RL, using a parametrised policy $\pi_\theta(s, a) = g(\phi(s, a), \theta)$
11 Example: Aliased Gridworld (2)
- Under aliasing, an optimal deterministic policy will either
  - move W in both grey states (shown by red arrows), or
  - move E in both grey states
- Either way, it can get stuck and never reach the money
- Value-based RL learns a near-deterministic policy, e.g. greedy or ε-greedy
- So it will traverse the corridor for a long time
12 Example: Aliased Gridworld (3)
- An optimal stochastic policy will randomly move E or W in grey states:
  - $\pi_\theta(\text{wall to N and W, move E}) = 0.5$
  - $\pi_\theta(\text{wall to N and W, move W}) = 0.5$
- It will reach the goal state in a few steps with high probability
- Policy-based RL can learn the optimal stochastic policy
13 Policy Objective Functions
- Goal: given a policy $\pi_\theta(s, a)$ with parameters θ, find the best θ
- But how do we measure the quality of a policy $\pi_\theta$?
- In episodic environments we can use the start value of the policy: $J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]$
- In continuing environments we can use the average value: $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$, where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain for $\pi_\theta$
- Or the average reward per time-step: $J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a)\, R_s^a$
- For simplicity, today we will mostly discuss the episodic case, but this can easily be extended to the continuing / infinite horizon case
15 Policy Optimization
- Policy based reinforcement learning is an optimization problem
- Find policy parameters θ that maximize $V^{\pi_\theta}$
- Can use gradient free optimization:
  - Hill climbing
  - Simplex / amoeba / Nelder-Mead
  - Genetic algorithms
  - Cross-Entropy method (CEM)
  - Covariance Matrix Adaptation (CMA)
16 Recall Human-in-the-Loop Exoskeleton Optimization (Zhang et al. Science 2017)
- Optimization was done using CMA-ES, a variant of covariance matrix adaptation
17 Gradient Free Policy Optimization
- Can often work embarrassingly well: discovered that evolution strategies (ES), an optimization technique that's been known for decades, rivals the performance of standard reinforcement learning (RL) techniques on modern RL benchmarks (e.g. Atari/MuJoCo)
18 Gradient Free Policy Optimization
- Often a great simple baseline to try
- Benefits:
  - Can work with any policy parameterization, including non-differentiable ones
  - Frequently very easy to parallelize
- Limitations:
  - Typically not very sample efficient because it ignores temporal structure
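As an illustration of the simplest method in the gradient-free list, here is a minimal hill-climbing sketch. It assumes the policy value can be queried as a black-box function of θ; `toy_value` is a hypothetical stand-in for $V(\theta)$ (not from the lecture), and the Gaussian perturbation and step size are illustrative choices.

```python
import random

def hill_climb(evaluate, theta, step=0.1, iterations=500, seed=0):
    """Random-perturbation hill climbing: keep a candidate only if it scores higher."""
    rng = random.Random(seed)
    best_value = evaluate(theta)
    for _ in range(iterations):
        # Perturb every parameter with Gaussian noise (no gradient is used).
        candidate = [x + rng.gauss(0.0, step) for x in theta]
        value = evaluate(candidate)
        if value > best_value:
            theta, best_value = candidate, value
    return theta, best_value

def toy_value(theta):
    # Hypothetical objective standing in for V(theta); maximized at (1, -2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)

theta_star, value_star = hill_climb(toy_value, [0.0, 0.0])
print(theta_star, value_star)
```

Note that `evaluate` is only ever called, never differentiated, which is why this family of methods works for non-differentiable policy parameterizations. In an RL setting each call would itself be a noisy rollout estimate.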
19 Policy Optimization
- Policy based reinforcement learning is an optimization problem
- Find policy parameters θ that maximize $V^{\pi_\theta}$
- Can use gradient free optimization
- Greater efficiency is often possible using the gradient:
  - Gradient descent
  - Conjugate gradient
  - Quasi-Newton
- We focus on gradient descent, many extensions possible
- And on methods that exploit sequential structure
22 Policy Gradient
- Define $V(\theta) = V^{\pi_\theta}$ to make explicit the dependence of the value on the policy parameters
- Assume episodic MDPs (easy to extend to related objectives, like average reward)
- Policy gradient algorithms search for a local maximum in $V(\theta)$ by ascending the gradient of the policy w.r.t. parameters θ: $\Delta\theta = \alpha \nabla_\theta V(\theta)$
- Where $\nabla_\theta V(\theta)$ is the policy gradient and α is a step-size parameter: $\nabla_\theta V(\theta) = \left( \frac{\partial V(\theta)}{\partial \theta_1}, \ldots, \frac{\partial V(\theta)}{\partial \theta_n} \right)^\top$
24 Computing Gradients by Finite Differences
- To evaluate the policy gradient of $\pi_\theta(s, a)$
- For each dimension $k \in [1, n]$:
  - Estimate the kth partial derivative of the objective function w.r.t. θ
  - By perturbing θ by a small amount ε in the kth dimension: $\frac{\partial V(\theta)}{\partial \theta_k} \approx \frac{V(\theta + \epsilon u_k) - V(\theta)}{\epsilon}$, where $u_k$ is a unit vector with 1 in the kth component, 0 elsewhere
- Uses n evaluations to compute the policy gradient in n dimensions
- Simple, noisy, inefficient, but sometimes effective
- Works for arbitrary policies, even if the policy is not differentiable
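The finite-difference rule above can be sketched in a few lines. This is an illustration under assumptions: `toy_value` is a hypothetical deterministic stand-in for $V(\theta)$, whereas in RL each evaluation would be a noisy rollout estimate. The sketch also spends one extra evaluation on the unperturbed θ in addition to the n perturbed ones.

```python
def finite_difference_gradient(value_fn, theta, eps=1e-5):
    """Forward-difference estimate of dV/dtheta_k for every dimension k."""
    base = value_fn(theta)  # V(theta), shared by all n difference quotients
    grad = []
    for k in range(len(theta)):
        perturbed = list(theta)
        perturbed[k] += eps  # theta + eps * u_k
        grad.append((value_fn(perturbed) - base) / eps)
    return grad

def toy_value(theta):
    # Hypothetical objective standing in for V(theta).
    return -(theta[0] - 1.0) ** 2 - 3.0 * (theta[1] + 2.0) ** 2

grad = finite_difference_gradient(toy_value, [0.0, 0.0])
print(grad)  # close to the analytic gradient (2, -12) at theta = (0, 0)
```

With noisy evaluations, ε has to be large enough that the difference is not swamped by noise, which is one reason the AIBO experiments that follow average several evaluations per policy.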
25 Training AIBO to Walk by Finite Difference Policy Gradient
- Goal: learn a fast AIBO walk (useful for Robocup)
- Adapt these parameters by finite difference policy gradient
- Evaluate performance of policy by field traversal time
- Source: Kohl and Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. ICRA ai-lab/pubs/icra04.pdf
26 AIBO Policy Parameterization
- AIBO walk policy is an open-loop policy
- No state; choosing a set of action parameters that define an ellipse
- Specified by 12 continuous parameters (elliptical loci):
  - The front locus (3 parameters: height, x-pos., y-pos.)
  - The rear locus (3 parameters)
  - Locus length
  - Locus skew multiplier in the x-y plane (for turning)
  - The height of the front of the body
  - The height of the rear of the body
  - The time each foot takes to move through its locus
  - The fraction of time each foot spends on the ground
- New policies: for each parameter, randomly add (+ε, 0, or −ε)
27 AIBO Policy Experiments
- "All of the policy evaluations took place on actual robots... only human intervention required during an experiment involved replacing discharged batteries... about once an hour."
- Ran on 3 AIBOs at once
- Evaluated 15 policies per iteration
- Each policy evaluated 3 times (to reduce noise) and averaged
- Each iteration took 7.5 minutes
- Used η = 2 (learning rate for their finite difference approach)
28 Training AIBO to Walk by Finite Difference Policy Gradient: Results
- Authors discuss that performance is likely impacted by: initial starting policy parameters, ε (how much policies are perturbed), η (how much to change the policy), as well as the policy parameterization
29 AIBO Walk Policies
31 Computing the Gradient Analytically
- We now compute the policy gradient analytically
- Assume the policy $\pi_\theta$ is differentiable whenever it is non-zero and we know the gradient $\nabla_\theta \pi_\theta(s, a)$
33 Likelihood Ratio Policies
- Denote a state-action trajectory as $\tau = (s_0, a_0, r_0, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$
- Use $R(\tau) = \sum_{t=0}^{T} R(s_t, a_t)$ to be the sum of rewards for a trajectory τ
- Policy value is $V(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T} R(s_t, a_t); \pi_\theta \right] = \sum_\tau P(\tau; \theta) R(\tau)$ (1), where $P(\tau; \theta)$ denotes the probability over trajectories when executing policy $\pi_\theta$
- In this new notation, our goal is to find the policy parameters θ: $\arg\max_\theta V(\theta) = \arg\max_\theta \sum_\tau P(\tau; \theta) R(\tau)$ (2)
36 Likelihood Ratio Policy Gradient
- Goal is to find the policy parameters θ: $\arg\max_\theta V(\theta) = \arg\max_\theta \sum_\tau P(\tau; \theta) R(\tau)$
- Take the gradient with respect to θ:
  $\nabla_\theta V(\theta) = \nabla_\theta \sum_\tau P(\tau; \theta) R(\tau)$
  $= \sum_\tau \nabla_\theta P(\tau; \theta) R(\tau)$
  $= \sum_\tau \frac{P(\tau; \theta)}{P(\tau; \theta)} \nabla_\theta P(\tau; \theta) R(\tau)$
  $= \sum_\tau P(\tau; \theta) R(\tau) \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)}$ (the fraction is the likelihood ratio)
  $= \sum_\tau P(\tau; \theta) R(\tau) \nabla_\theta \log P(\tau; \theta)$
- Approximate with an empirical estimate for m sample paths under policy $\pi_\theta$: $\nabla_\theta V(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \nabla_\theta \log P(\tau^{(i)}; \theta)$
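To see the estimator $\hat{g}$ at work outside the RL setting, the sketch below applies it to a one-parameter toy problem chosen for illustration (not from the lecture): $x \sim \text{Bernoulli}(p)$ with $p = \sigma(\theta)$, and a hypothetical "return" $f(x)$ equal to 3 if $x = 1$ and 1 otherwise. For this model $\nabla_\theta \log p(x; \theta) = x - p$, and the exact gradient of $\mathbb{E}[f(x)] = 1 + 2p$ is $2p(1 - p)$, so the sample estimate can be compared with the truth.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(x):
    # Hypothetical stand-in for R(tau): how good the sample is.
    return 3.0 if x == 1 else 1.0

def score_function_gradient(theta, m=100_000, seed=0):
    """Monte Carlo estimate (1/m) * sum_i f(x_i) * d/dtheta log p(x_i; theta)."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    total = 0.0
    for _ in range(m):
        x = 1 if rng.random() < p else 0
        total += f(x) * (x - p)  # (x - p) is the score of a Bernoulli(sigmoid(theta))
    return total / m

def exact_gradient(theta):
    p = sigmoid(theta)
    return 2.0 * p * (1.0 - p)  # d/dtheta of E[f(x)] = 1 + 2p

estimate = score_function_gradient(0.0)
print(estimate, exact_gradient(0.0))
```

The estimator only evaluates `f` at sampled points and never differentiates it, which is what makes it usable when the return is discontinuous or unknown.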
37 Score Function Gradient Estimator: Intuition
- Consider the generic form of $R(\tau^{(i)}) \nabla_\theta \log P(\tau^{(i)}; \theta)$: $\hat{g}_i = f(x_i) \nabla_\theta \log p(x_i \mid \theta)$
- $f(x)$ measures how good the sample x is
- Moving in the direction $\hat{g}_i$ pushes up the logprob of the sample, in proportion to how good it is
- Valid even if $f(x)$ is discontinuous and unknown, or the sample space (containing x) is a discrete set
41 Decomposing the Trajectories Into States and Actions
- Approximate with an empirical estimate for m sample paths under policy $\pi_\theta$: $\nabla_\theta V(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \nabla_\theta \log P(\tau^{(i)}; \theta)$
- Expand the trajectory probability into the initial state distribution μ, the policy, and the dynamics model:
  $\nabla_\theta \log P(\tau^{(i)}; \theta) = \nabla_\theta \log \left[ \mu(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t) \right]$
  $= \nabla_\theta \left[ \log \mu(s_0) + \sum_{t=0}^{T-1} \left( \log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t) \right) \right]$
  $= \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
- No dynamics model required: the initial state and dynamics terms do not depend on θ
42 Score Function
- Define the score function as $\nabla_\theta \log \pi_\theta(s, a)$
43 Likelihood Ratio / Score Function Policy Gradient
- Putting this together
- Goal is to find the policy parameters θ: $\arg\max_\theta V(\theta) = \arg\max_\theta \sum_\tau P(\tau; \theta) R(\tau)$
- Approximate with an empirical estimate for m sample paths under policy $\pi_\theta$ using the score function:
  $\nabla_\theta V(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \nabla_\theta \log P(\tau^{(i)}; \theta)$
  $= \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$
- Do not need to know the dynamics model
44 Policy Gradient Theorem
- The policy gradient theorem generalizes the likelihood ratio approach
- Theorem: for any differentiable policy $\pi_\theta(s, a)$, for any of the policy objective functions $J = J_1$ (episodic reward), $J_{avR}$ (average reward per time step), or $\frac{1}{1 - \gamma} J_{avV}$ (average value), the policy gradient is $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a) \right]$
46 Likelihood Ratio / Score Function Policy Gradient
- $\nabla_\theta V(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$
- Unbiased but very noisy
- Fixes that can make it practical:
  - Temporal structure
  - Baseline
- Next time will discuss some additional tricks
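47 Policy Gradient: Use Temporal Structure
- Previously: $\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\left[ \left( \sum_{t=0}^{T-1} r_t \right) \left( \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) \right]$
- We can repeat the same argument to derive the gradient estimator for a single reward term $r_{t'}$: $\nabla_\theta \mathbb{E}[r_{t'}] = \mathbb{E}\left[ r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$
- Summing this formula over $t'$, we obtain
  $\nabla_\theta \mathbb{E}[R] = \mathbb{E}\left[ \sum_{t'=0}^{T-1} r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$
  $= \mathbb{E}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T-1} r_{t'} \right]$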
48 Policy Gradient: Use Temporal Structure
- Recall that for a particular trajectory $\tau^{(i)}$, $\sum_{t'=t}^{T-1} r_{t'}^{(i)}$ is the return $G_t^{(i)}$
- $\nabla_\theta \mathbb{E}[R] \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, G_t^{(i)}$
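The returns $G_t$ for one trajectory can be computed in a single backward pass over its rewards. A minimal sketch, assuming undiscounted rewards stored in a plain list (the function name `rewards_to_go` is illustrative, not from the lecture):

```python
def rewards_to_go(rewards):
    """Return [G_0, ..., G_{T-1}] where G_t = r_t + r_{t+1} + ... + r_{T-1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses the already-computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        returns[t] = running
    return returns

print(rewards_to_go([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]
```

Each score term at time t is then weighted by `returns[t]` rather than by the full-trajectory return, so rewards earned before time t no longer contribute noise to that term.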
49 Monte-Carlo Policy Gradient (REINFORCE)
- Leverages likelihood ratio / score function and temporal structure: $\Delta\theta_t = \alpha \nabla_\theta \log \pi_\theta(s_t, a_t)\, G_t$ (7)
- REINFORCE:
    Initialize policy parameters θ arbitrarily
    for each episode $\{s_1, a_1, r_2, \ldots, s_{T-1}, a_{T-1}, r_T\} \sim \pi_\theta$ do
      for t = 1 to T − 1 do
        $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s_t, a_t)\, G_t$
      endfor
    endfor
    return θ
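A runnable sketch of the REINFORCE update on the smallest possible problem, chosen here for illustration: a stateless task with one-step episodes (a bandit), so that $G_t$ is just the immediate reward. The softmax-over-preferences policy, the reward values, and the function name `reinforce_bandit` are assumptions and not part of the lecture.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(rewards, alpha=0.1, episodes=2000, seed=0):
    """REINFORCE with one-step episodes; theta holds one preference per action."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(episodes):
        probs = softmax(theta)
        a = rng.choices(range(len(rewards)), weights=probs)[0]
        G = rewards[a]  # return of this one-step episode
        for k in range(len(theta)):
            # d/dtheta_k log pi(a) = 1[k == a] - pi(k) for a softmax over preferences
            score_k = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += alpha * score_k * G
    return softmax(theta)

final_probs = reinforce_bandit([0.0, 1.0])
print(final_probs)  # probability mass shifts toward the rewarded action
```

For longer episodes the inner update would loop over the time steps and weight each score by the return from that step onward, as in the pseudocode above.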
50 Differentiable Policy Classes
- Many choices of differentiable policy classes, including:
  - Softmax
  - Gaussian
  - Neural networks
51 Softmax Policy
- Weight actions using a linear combination of features $\phi(s, a)^\top \theta$
- Probability of an action is proportional to the exponentiated weight: $\pi_\theta(s, a) = \frac{e^{\phi(s, a)^\top \theta}}{\sum_{a'} e^{\phi(s, a')^\top \theta}}$ (8)
- The score function is $\nabla_\theta \log \pi_\theta(s, a) = \phi(s, a) - \mathbb{E}_{\pi_\theta}[\phi(s, \cdot)]$
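The softmax score function can be written directly from the formula above. In this sketch the feature vectors and θ are made-up illustrative values, and the state is implicit: `features[a]` plays the role of $\phi(s, a)$ for one fixed state.

```python
import math

def softmax_policy(features, theta):
    """pi_theta(s, a) for every action; features[a] is phi(s, a) for a fixed state."""
    logits = [sum(f * w for f, w in zip(phi, theta)) for phi in features]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def softmax_score(features, theta, action):
    """grad_theta log pi_theta(s, a) = phi(s, a) - E_pi[phi(s, .)]."""
    probs = softmax_policy(features, theta)
    expected_phi = [
        sum(p * phi[k] for p, phi in zip(probs, features))
        for k in range(len(theta))
    ]
    return [features[action][k] - expected_phi[k] for k in range(len(theta))]

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # illustrative phi(s, a) for 3 actions
theta = [0.5, -0.25]
probs = softmax_policy(features, theta)
scores = [softmax_score(features, theta, a) for a in range(3)]
# The score has zero mean under the policy, a fact used later for baselines.
mean_score = [sum(p * s[k] for p, s in zip(probs, scores)) for k in range(2)]
print(mean_score)
```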
52 Gaussian Policy
- In continuous action spaces, a Gaussian policy is natural
- Mean is a linear combination of state features: $\mu(s) = \phi(s)^\top \theta$
- Variance may be fixed at $\sigma^2$, or can also be parametrised
- Policy is Gaussian: $a \sim \mathcal{N}(\mu(s), \sigma^2)$
- The score function is $\nabla_\theta \log \pi_\theta(s, a) = \frac{(a - \mu(s))\, \phi(s)}{\sigma^2}$
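The Gaussian score function can be checked against a numerical derivative of the log-density. A small sketch with fixed σ and made-up feature values (the function names are illustrative, not from the lecture):

```python
import math

def gaussian_log_prob(a, phi, theta, sigma):
    """log N(a; mu(s), sigma^2) with mu(s) = phi(s)^T theta."""
    mu = sum(f * w for f, w in zip(phi, theta))
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (a - mu) ** 2 / (2.0 * sigma ** 2)

def gaussian_score(a, phi, theta, sigma):
    """grad_theta log pi_theta(s, a) = (a - mu(s)) * phi(s) / sigma^2."""
    mu = sum(f * w for f, w in zip(phi, theta))
    return [(a - mu) * f / sigma ** 2 for f in phi]

phi = [1.0, 2.0]      # illustrative state features phi(s)
theta = [0.3, -0.1]   # so mu(s) = 0.3 - 0.2 = 0.1
score = gaussian_score(0.7, phi, theta, sigma=0.5)
print(score)  # approximately [2.4, 4.8]
```

The score points along $\phi(s)$, scaled by how far the sampled action lies above or below the mean, so actions above the mean with good returns pull the mean upward.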
54 Policy Gradient: Introduce Baseline
- Reduce variance by introducing a baseline $b(s)$: $\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \left( \sum_{t'=t}^{T-1} r_{t'} - b(s_t) \right) \right]$
- For any choice of $b(s)$, the gradient estimator is unbiased
- Near optimal choice is the expected return: $b(s_t) \approx \mathbb{E}[r_t + r_{t+1} + \cdots + r_{T-1}]$
- Interpretation: increase the logprob of action $a_t$ proportionally to how much the returns $\sum_{t'=t}^{T-1} r_{t'}$ are better than expected
56 Baseline b(s) Does Not Introduce Bias: Derivation
  $\mathbb{E}_\tau\left[ \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t) \right]$
  $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[ \mathbb{E}_{s_{(t+1):T}, a_{t:(T-1)}}\left[ \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t) \right] \right]$ (break up expectation)
  $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[ b(s_t)\, \mathbb{E}_{s_{(t+1):T}, a_{t:(T-1)}}\left[ \nabla_\theta \log \pi(a_t \mid s_t, \theta) \right] \right]$ (pull baseline term out)
  $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[ b(s_t)\, \mathbb{E}_{a_t}\left[ \nabla_\theta \log \pi(a_t \mid s_t, \theta) \right] \right]$ (remove irrelevant variables)
  $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[ b(s_t) \sum_a \pi_\theta(a \mid s_t) \frac{\nabla_\theta \pi(a \mid s_t, \theta)}{\pi_\theta(a \mid s_t)} \right]$ (likelihood ratio)
  $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[ b(s_t) \sum_a \nabla_\theta \pi(a \mid s_t, \theta) \right]$
  $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[ b(s_t)\, \nabla_\theta \sum_a \pi(a \mid s_t, \theta) \right]$
  $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[ b(s_t)\, \nabla_\theta 1 \right] = \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[ b(s_t) \cdot 0 \right] = 0$
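The result can be verified exactly on a two-action example by enumerating both actions instead of sampling. The setup is an assumption made for illustration: a single state, $\pi(a = 1) = \sigma(\theta)$, so the score is $a - \sigma(\theta)$, and made-up rewards of 10 and 11. The sketch computes the mean and variance of the single-sample estimator $\nabla_\theta \log \pi(a)\,(r(a) - b)$ with and without a baseline.

```python
import math

def estimator_moments(theta, rewards, baseline):
    """Exact mean and variance of score(a) * (r(a) - b) over a in {0, 1}."""
    p1 = 1.0 / (1.0 + math.exp(-theta))  # pi(a = 1)
    probs = [1.0 - p1, p1]
    # Score of a Bernoulli(sigmoid(theta)) policy is (a - p1).
    samples = [(a - p1) * (rewards[a] - baseline) for a in (0, 1)]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

mean_no_b, var_no_b = estimator_moments(0.0, [10.0, 11.0], baseline=0.0)
mean_b, var_b = estimator_moments(0.0, [10.0, 11.0], baseline=10.5)
print(mean_no_b, var_no_b)  # same mean as below, large variance
print(mean_b, var_b)        # same mean, variance reduced
```

In this example the baseline equal to the expected reward under the policy leaves the mean untouched while removing the variance, which is the behaviour the derivation predicts for the mean and motivates for the variance.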
57 Vanilla Policy Gradient Algorithm
    Initialize policy parameter θ, baseline b
    for iteration = 1, 2, ... do
      Collect a set of trajectories by executing the current policy
      At each timestep in each trajectory, compute the return $R_t = \sum_{t'=t}^{T-1} r_{t'}$ and the advantage estimate $\hat{A}_t = R_t - b(s_t)$
      Re-fit the baseline by minimizing $\lVert b(s_t) - R_t \rVert^2$, summed over all trajectories and timesteps
      Update the policy using a policy gradient estimate $\hat{g}$, which is a sum of terms $\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, \hat{A}_t$ (plug $\hat{g}$ into SGD or ADAM)
    endfor
58 Practical Implementation with Autodiff
- The usual formula $\sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t$ is inefficient: we want to batch data
- Define a surrogate function using data from the current batch: $L(\theta) = \sum_t \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t$
- Then the policy gradient estimator is $\hat{g} = \nabla_\theta L(\theta)$
- Can also include the value function fit error: $L(\theta) = \sum_t \left( \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t - \lVert V(s_t) - \hat{R}_t \rVert^2 \right)$
59 Value Functions
- Recall the Q-function / state-action-value function: $Q^{\pi, \gamma}(s, a) = \mathbb{E}_\pi\left[ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, a_0 = a \right]$
- The state-value function can serve as a great baseline: $V^{\pi, \gamma}(s) = \mathbb{E}_\pi\left[ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s \right] = \mathbb{E}_{a \sim \pi}\left[ Q^{\pi, \gamma}(s, a) \right]$
- Advantage function, combining Q with baseline V: $A^{\pi, \gamma}(s, a) = Q^{\pi, \gamma}(s, a) - V^{\pi, \gamma}(s)$
60 N-step Estimators
- Can also consider blending between TD and MC estimators for the target to substitute for the true state-action value function:
  $\hat{R}_t^{(1)} = r_t + \gamma V(s_{t+1})$
  $\hat{R}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$
  $\hat{R}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$
- If we subtract baselines from the above, we get advantage estimators:
  $\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t)$
  $\hat{A}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t)$
  $\hat{A}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - V(s_t)$
- $\hat{A}_t^{(1)}$ has low variance and high bias; $\hat{A}_t^{(\infty)}$ has high variance but low bias (Why? Like which model-free policy estimation techniques?)
- Using an intermediate k (say, 20) can give an intermediate amount of bias and variance
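The family of estimators above can be computed by one function that truncates the reward sum after n steps and bootstraps from the value estimate. A minimal sketch, assuming a finite episode whose value estimates are stored in a list with one extra entry for the terminal state; the numbers are made up for illustration.

```python
def n_step_advantage(rewards, values, t, n, gamma):
    """A_t^(n) = r_t + ... + gamma^(n-1) r_{t+n-1} + gamma^n V(s_{t+n}) - V(s_t).

    `values` has len(rewards) + 1 entries; the last is the terminal state's value.
    If t + n runs past the end of the episode, the sum stops at the terminal state.
    """
    T = len(rewards)
    end = min(t + n, T)
    target = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    target += gamma ** (end - t) * values[end]  # bootstrap from V(s_end)
    return target - values[t]

rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 1.5, 0.0]  # illustrative V(s_0..s_3); terminal value is 0
print(n_step_advantage(rewards, values, t=0, n=1, gamma=0.5))   # TD-style: 1.0
print(n_step_advantage(rewards, values, t=0, n=2, gamma=0.5))   # 1.875
print(n_step_advantage(rewards, values, t=0, n=10, gamma=0.5))  # Monte Carlo: 2.25
```

Small n leans on the (possibly wrong) value estimates, which is the source of bias; large n leans on sampled rewards, which is the source of variance.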
61 Application: Robot Locomotion
Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system
More informationReinforcement Learning II. George Konidaris
Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2017 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes
More informationUsing Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *
Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
More informationReinforcement Learning and Control
CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make
More informationReinforcement Learning II. George Konidaris
Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2018 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes
More informationNotes on Reinforcement Learning
1 Introduction Notes on Reinforcement Learning Paulo Eduardo Rauber 2014 Reinforcement learning is the study of agents that act in an environment with the goal of maximizing cumulative reward signals.
More informationLecture 1: March 7, 2018
Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights
More informationLecture 23: Reinforcement Learning
Lecture 23: Reinforcement Learning MDPs revisited Model-based learning Monte Carlo value function estimation Temporal-difference (TD) learning Exploration November 23, 2006 1 COMP-424 Lecture 23 Recall:
More informationInternet Monetization
Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition
More information15-780: ReinforcementLearning
15-780: ReinforcementLearning J. Zico Kolter March 2, 2016 1 Outline Challenge of RL Model-based methods Model-free methods Exploration and exploitation 2 Outline Challenge of RL Model-based methods Model-free
More informationReal Time Value Iteration and the State-Action Value Function
MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing
More informationCS885 Reinforcement Learning Lecture 7a: May 23, 2018
CS885 Reinforcement Learning Lecture 7a: May 23, 2018 Policy Gradient Methods [SutBar] Sec. 13.1-13.3, 13.7 [SigBuf] Sec. 5.1-5.2, [RusNor] Sec. 21.5 CS885 Spring 2018 Pascal Poupart 1 Outline Stochastic
More informationECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control
ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Tianyu Wang: tiw161@eng.ucsd.edu Yongxi Lu: yol070@eng.ucsd.edu
More informationO P T I M I Z I N G E X P E C TAT I O N S : T O S T O C H A S T I C C O M P U TAT I O N G R A P H S. john schulman. Summer, 2016
O P T I M I Z I N G E X P E C TAT I O N S : FROM DEEP REINFORCEMENT LEARNING T O S T O C H A S T I C C O M P U TAT I O N G R A P H S john schulman Summer, 2016 A dissertation submitted in partial satisfaction
More informationCombining PPO and Evolutionary Strategies for Better Policy Search
Combining PPO and Evolutionary Strategies for Better Policy Search Jennifer She 1 Abstract A good policy search algorithm needs to strike a balance between being able to explore candidate policies and
More informationDeep Reinforcement Learning SISL. Jeremy Morton (jmorton2) November 7, Stanford Intelligent Systems Laboratory
Deep Reinforcement Learning Jeremy Morton (jmorton2) November 7, 2016 SISL Stanford Intelligent Systems Laboratory Overview 2 1 Motivation 2 Neural Networks 3 Deep Reinforcement Learning 4 Deep Learning
More informationIntroduction of Reinforcement Learning
Introduction of Reinforcement Learning Deep Reinforcement Learning Reference Textbook: Reinforcement Learning: An Introduction http://incompleteideas.net/sutton/book/the-book.html Lectures of David Silver
More informationToday s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes
Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks
More informationReinforcement Learning
Reinforcement Learning Function approximation Daniel Hennes 19.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Eligibility traces n-step TD returns Forward and backward view Function
More informationBalancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm
Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu
More informationLearning Dexterity Matthias Plappert SEPTEMBER 6, 2018
Learning Dexterity Matthias Plappert SEPTEMBER 6, 2018 OpenAI OpenAI is a non-profit AI research company, discovering and enacting the path to safe artificial general intelligence. OpenAI OpenAI is a non-profit
More informationReinforcement Learning as Variational Inference: Two Recent Approaches
Reinforcement Learning as Variational Inference: Two Recent Approaches Rohith Kuditipudi Duke University 11 August 2017 Outline 1 Background 2 Stein Variational Policy Gradient 3 Soft Q-Learning 4 Closing
More informationINTRODUCTION TO EVOLUTION STRATEGY ALGORITHMS. James Gleeson Eric Langlois William Saunders
INTRODUCTION TO EVOLUTION STRATEGY ALGORITHMS James Gleeson Eric Langlois William Saunders REINFORCEMENT LEARNING CHALLENGES f(θ) is a discrete function of theta How do we get a gradient θ f? Credit assignment
More informationAdvanced Policy Gradient Methods: Natural Gradient, TRPO, and More. March 8, 2017
Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More March 8, 2017 Defining a Loss Function for RL Let η(π) denote the expected return of π [ ] η(π) = E s0 ρ 0,a t π( s t) γ t r t We collect
More informationCondensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C.
Condensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C. Spall John Wiley and Sons, Inc., 2003 Preface... xiii 1. Stochastic Search
More informationTutorial on Policy Gradient Methods. Jan Peters
Tutorial on Policy Gradient Methods Jan Peters Outline 1. Reinforcement Learning 2. Finite Difference vs Likelihood-Ratio Policy Gradients 3. Likelihood-Ratio Policy Gradients 4. Conclusion General Setup
More informationARTIFICIAL INTELLIGENCE. Reinforcement learning
INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
More informationReinforcement Learning: An Introduction
Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is
More informationPolicy Gradient. U(θ) = E[ R(s t,a t );π θ ] = E[R(τ);π θ ] (1) 1 + e θ φ(s t) E[R(τ);π θ ] (3) = max. θ P(τ;θ)R(τ) (6) P(τ;θ) θ log P(τ;θ)R(τ) (9)
CS294-40 Learning for Robotics and Control Lecture 16-10/20/2008 Lecturer: Pieter Abbeel Policy Gradient Scribe: Jan Biermeyer 1 Recap Recall: H U() = E[ R(s t,a ;π ] = E[R();π ] (1) Here is a sample path
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More informationDiscrete Variables and Gradient Estimators
iscrete Variables and Gradient Estimators This assignment is designed to get you comfortable deriving gradient estimators, and optimizing distributions over discrete random variables. For most questions,
More information6 Reinforcement Learning
6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationDeep reinforcement learning. Dialogue Systems Group, Cambridge University Engineering Department
Deep reinforcement learning Milica Gašić Dialogue Systems Group, Cambridge University Engineering Department 1 / 25 In this lecture... Introduction to deep reinforcement learning Value-based Deep RL Deep
More informationLecture 6: CS395T Numerical Optimization for Graphics and AI Line Search Applications
Lecture 6: CS395T Numerical Optimization for Graphics and AI Line Search Applications Qixing Huang The University of Texas at Austin huangqx@cs.utexas.edu 1 Disclaimer This note is adapted from Section
More informationReinforcement Learning Part 2
Reinforcement Learning Part 2 Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ From previous tutorial Reinforcement Learning Exploration No supervision Agent-Reward-Environment
More informationReinforcement Learning with Function Approximation. Joseph Christian G. Noel
Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is
More informationAn Introduction to Reinforcement Learning
An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@cse.iitb.ac.in Department of Computer Science and Engineering Indian Institute of Technology Bombay April 2018 What is Reinforcement
More informationCMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro
CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING
More informationDecision Theory: Q-Learning
Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning
More informationDiscrete Latent Variable Models
Discrete Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 14 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 14 1 / 29 Summary Major themes in the course
More informationArtificial Intelligence
Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important
More informationReinforcement Learning as Classification Leveraging Modern Classifiers
Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions
More information1 Introduction 2. 4 Q-Learning The Q-value The Temporal Difference The whole Q-Learning process... 5
Table of contents 1 Introduction 2 2 Markov Decision Processes 2 3 Future Cumulative Reward 3 4 Q-Learning 4 4.1 The Q-value.............................................. 4 4.2 The Temporal Difference.......................................
More informationReinforcement Learning
Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo Marc Toussaint University of
More informationReinforcement Learning
Reinforcement Learning RL in continuous MDPs March April, 2015 Large/Continuous MDPs Large/Continuous state space Tabular representation cannot be used Large/Continuous action space Maximization over action
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationLecture 10 - Planning under Uncertainty (III)
Lecture 10 - Planning under Uncertainty (III) Jesse Hoey School of Computer Science University of Waterloo March 27, 2018 Readings: Poole & Mackworth (2nd ed.)chapter 12.1,12.3-12.9 1/ 34 Reinforcement
More informationChristopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015
Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)
More informationReinforcement Learning. Machine Learning, Fall 2010
Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30
More informationLecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning
More informationCourse 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016
Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the
More informationDialogue management: Parametric approaches to policy optimisation. Dialogue Systems Group, Cambridge University Engineering Department
Dialogue management: Parametric approaches to policy optimisation Milica Gašić Dialogue Systems Group, Cambridge University Engineering Department 1 / 30 Dialogue optimisation as a reinforcement learning
More informationReinforcement Learning. Yishay Mansour Tel-Aviv University
Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak
More informationReinforcement Learning
Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques
More informationReinforcement Learning and Deep Reinforcement Learning
Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q
More informationLecture 7: Value Function Approximation
Lecture 7: Value Function Approximation Joseph Modayil Outline 1 Introduction 2 3 Batch Methods Introduction Large-Scale Reinforcement Learning Reinforcement learning can be used to solve large problems,
More informationDeterministic Policy Gradient, Advanced RL Algorithms
NPFL122, Lecture 9 Deterministic Policy Gradient, Advanced RL Algorithms Milan Straka December 1, 218 Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics
More informationSequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague
Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague https://cw.fel.cvut.cz/wiki/courses/b4b36zui/prednasky pagenda Previous lecture:
More informationREINFORCEMENT LEARNING
REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents
More informationReinforcement Learning
1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision
More information