Deterministic Policy Gradient, Advanced RL Algorithms

1 NPFL122, Lecture 9: Deterministic Policy Gradient, Advanced RL Algorithms
Milan Straka, December 10, 2018
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

2 REINFORCE with Baseline

The returns can be arbitrary: better-than-average and worse-than-average returns cannot be recognized from the absolute value of the return alone. Hopefully, we can generalize the policy gradient theorem to use a baseline $b(s)$:

$$\nabla_\theta J(\theta) \propto \sum_{s \in S} \mu(s) \sum_{a \in A} \big(q_\pi(s, a) - b(s)\big) \nabla_\theta \pi(a \mid s; \theta).$$

A good choice for $b(s)$ is $v_\pi(s)$, which can be shown to minimize the variance of the estimator. Such a baseline resembles centering of returns, given that $v_\pi(s) = \mathbb{E}_a\, q_\pi(s, a)$. Then, better-than-average returns are positive and worse-than-average returns are negative.

The resulting value $a_\pi(s, a) \stackrel{\text{def}}{=} q_\pi(s, a) - v_\pi(s)$ is also called the advantage function.

Of course, the baseline $v_\pi(s)$ can only be approximated. If neural networks are used to estimate $\pi(a \mid s; \theta)$, then some part of the network is usually shared between the policy and the value function estimate, which is trained using the mean squared error of the predicted and observed returns.
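
As a small illustration (all numbers below are invented for the example), the following sketch computes the centered returns and the two losses from one episode:

    import numpy as np

    # Hypothetical episode data: observed returns G_t, baseline estimates b(S_t),
    # and log pi(A_t|S_t; theta) of the taken actions.
    returns   = np.array([10.0, 12.0,  8.0, 11.0])
    baseline  = np.array([10.2, 10.4, 10.1, 10.3])   # approximation of v(s)
    log_probs = np.log(np.array([0.3, 0.5, 0.2, 0.4]))

    # Centered returns: better-than-average positive, worse-than-average negative.
    advantages = returns - baseline

    # REINFORCE with baseline scales each log-probability by (G_t - b(S_t)).
    policy_loss = -np.mean(advantages * log_probs)

    # The baseline itself is trained by MSE of predicted vs. observed returns.
    value_loss = np.mean((returns - baseline) ** 2)
    print(policy_loss, value_loss)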

3 Parallel Advantage Actor Critic

An alternative to independent workers is to train in a synchronous and centralized way, by having the workers only generate episodes. Such an approach was described in May 2017 by Clemente et al., who named their agent parallel advantage actor-critic (PAAC); a sketch of the synchronous loop follows the figure.

Figure 1 of the paper "Efficient Parallel Methods for Deep Reinforcement Learning" by Alfredo V. Clemente et al.: $n_w$ workers step $n_e$ environments and send states and rewards to a master, whose single DNN computes the actions and the targets and performs the learning.
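
The following is a schematic sketch of this loop (ToyEnv and policy are invented stand-ins, not the paper's implementation): the master steps all $n_e$ environments with a single batched network call, so there is only one copy of the parameters.

    import numpy as np

    class ToyEnv:
        """Stand-in environment: random rewards, 4-dimensional states."""
        def reset(self):
            return np.random.randn(4)
        def step(self, action):
            return np.random.randn(4), np.random.randn(), False

    def policy(states):
        """Stand-in for the master's single network: batched action selection."""
        return np.random.randint(0, 2, size=len(states))

    envs = [ToyEnv() for _ in range(8)]          # n_e parallel environments
    states = np.stack([env.reset() for env in envs])

    for step in range(5):
        actions = policy(states)                 # one forward pass for all envs
        results = [env.step(a) for env, a in zip(envs, actions)]
        next_states, rewards, dones = map(np.array, zip(*results))
        # ... the master would compute targets from (states, actions, rewards,
        # next_states) and perform a single centralized gradient update here ...
        states = next_states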

4 Continuous Action Space

Until now, the actions were discrete. However, many environments naturally accept actions from a continuous space. We now consider actions which come from a range $[a, b]$ for $a, b \in \mathbb{R}$, or more generally from a Cartesian product of several such ranges $\prod_i [a_i, b_i]$.

A simple way to parametrize the action distribution is to choose actions from the normal distribution. Given mean $\mu$ and variance $\sigma^2$, the probability density function of $N(\mu, \sigma^2)$ is

$$p(x) \stackrel{\text{def}}{=} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

Figure from Section 13.7 of "Reinforcement Learning: An Introduction, Second Edition": probability density functions of normal distributions for several values of $\mu$ and $\sigma^2$.

5 Continuous Action Space in Gradient Methods

Utilizing continuous action spaces in gradient-based methods is straightforward. Instead of the softmax distribution we suitably parametrize the action value, usually using the normal distribution. Considering only one real-valued action, we therefore have

$$\pi(a \mid s; \theta) \stackrel{\text{def}}{=} P\big(a \sim N(\mu(s; \theta), \sigma(s; \theta)^2)\big),$$

where $\mu(s; \theta)$ and $\sigma(s; \theta)$ are function approximations of the mean and standard deviation of the action distribution.

The mean and standard deviation are usually computed from a shared representation, with the mean being computed as a regular regression (i.e., one output neuron without activation), and the standard deviation (which must be positive) being computed again as a regression, followed most commonly by either $\exp$ or $\operatorname{softplus}$, where $\operatorname{softplus}(x) \stackrel{\text{def}}{=} \log(1 + e^x)$.

6 Continuous Action Space in Gradient Methods

During training, we compute $\mu(s; \theta)$ and $\sigma(s; \theta)$ and then sample the action value (clipping it to $[a, b]$ if required). To compute the loss, we utilize the probability density function of the normal distribution (and usually also add an entropy penalty).

    mu = tf.layers.dense(hidden_layer, 1)[:, 0]
    log_sd = tf.layers.dense(hidden_layer, 1)[:, 0]
    sd = tf.exp(log_sd)  # or sd = tf.nn.softplus(log_sd)

    normal_dist = tf.distributions.Normal(mu, sd)

    # Loss computed as -log pi(a|s) * return - entropy_regularization * entropy
    loss = - normal_dist.log_prob(self.actions) * self.returns \
           - args.entropy_regularization * normal_dist.entropy()

7 Deterministic Policy Gradient Theorem

Combining continuous actions and Deep Q Networks is not straightforward. In order to do so, we need a different variant of the policy gradient theorem. Recall that the policy gradient theorem states

$$\nabla_\theta J(\theta) \propto \sum_{s \in S} \mu(s) \sum_{a \in A} q_\pi(s, a) \nabla_\theta \pi(a \mid s; \theta).$$

Deterministic Policy Gradient Theorem. Assume that the policy $\pi(s; \theta)$ is deterministic and computes an action $a \in \mathbb{R}$. Then, under several assumptions about continuity, the following holds:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mu(s)} \Big[ \nabla_\theta \pi(s; \theta) \, \nabla_a q_\pi(s, a) \big|_{a = \pi(s; \theta)} \Big].$$

The theorem was first proven in the paper Deterministic Policy Gradient Algorithms by David Silver et al.
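
To see the chain rule of the theorem at work, consider an invented toy example: a linear deterministic policy $\pi(s; \theta) = \theta s$ and a known critic $q(s, a) = -(a - 2s)^2$, whose optimum is $\theta = 2$:

    import numpy as np

    theta = 0.5                       # policy parameter
    states = np.random.randn(1000)    # samples s ~ mu(s)

    def pi(s, theta):                 # deterministic policy pi(s; theta)
        return theta * s

    def dq_da(s, a):                  # grad_a q(s, a) for q(s, a) = -(a - 2s)^2
        return -2.0 * (a - 2.0 * s)

    a = pi(states, theta)
    # Deterministic policy gradient: E_s[grad_theta pi(s) * grad_a q(s, a)|a=pi(s)];
    # here grad_theta pi(s; theta) = s.
    grad = np.mean(states * dq_da(states, a))
    print(grad)  # positive: pushes theta from 0.5 toward the optimum theta = 2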

8 Deterministic Policy Gradient Theorem Proof

The proof is very similar to the proof of the original (stochastic) policy gradient theorem. We assume that $p(s' \mid s, a)$, $\nabla_a p(s' \mid s, a)$, $r(s, a)$, $\nabla_a r(s, a)$, $\pi(s; \theta)$ and $\nabla_\theta \pi(s; \theta)$ are continuous in all parameters.

$$\begin{aligned}
\nabla_\theta v_\pi(s) &= \nabla_\theta q_\pi(s, \pi(s; \theta)) \\
&= \nabla_\theta \Big( r(s, \pi(s; \theta)) + \gamma \int_{s'} p(s' \mid s, \pi(s; \theta))\, v_\pi(s') \,\mathrm{d}s' \Big) \\
&= \nabla_\theta \pi(s; \theta)\, \nabla_a r(s, a) \big|_{a = \pi(s; \theta)} + \gamma \nabla_\theta \int_{s'} p(s' \mid s, \pi(s; \theta))\, v_\pi(s') \,\mathrm{d}s' \\
&= \nabla_\theta \pi(s; \theta)\, \nabla_a \Big( r(s, a) + \gamma \int_{s'} p(s' \mid s, a)\, v_\pi(s') \,\mathrm{d}s' \Big) \Big|_{a = \pi(s; \theta)} + \gamma \int_{s'} p(s' \mid s, \pi(s; \theta))\, \nabla_\theta v_\pi(s') \,\mathrm{d}s' \\
&= \nabla_\theta \pi(s; \theta)\, \nabla_a q_\pi(s, a) \big|_{a = \pi(s; \theta)} + \gamma \int_{s'} p(s' \mid s, \pi(s; \theta))\, \nabla_\theta v_\pi(s') \,\mathrm{d}s'
\end{aligned}$$

Similarly to the stochastic policy gradient theorem, we finish the proof by continually expanding $\nabla_\theta v_\pi(s')$.

9 Deep Deterministic Policy Gradients

Note that the formulation of the deterministic policy gradient theorem allows an off-policy algorithm, because the loss no longer depends on actions sampled from the current policy (similarly to how expected Sarsa is also an off-policy algorithm).

We therefore train function approximations for both $\pi(s; \theta)$ and $q(s, a; \theta)$, training $q(s, a; \theta)$ using a deterministic variant of the Bellman equation

$$q(S_t, A_t; \theta) = \mathbb{E}_{R_{t+1}, S_{t+1}} \big[ R_{t+1} + \gamma q(S_{t+1}, \pi(S_{t+1}; \theta)) \big]$$

and $\pi(s; \theta)$ according to the deterministic policy gradient theorem.

The algorithm was first described in the paper Continuous Control with Deep Reinforcement Learning by Timothy P. Lillicrap et al. (2015).

The authors utilize a replay buffer, a target network (updated by exponential moving average with $\tau = 0.001$), batch normalization for CNNs, and perform exploration by adding normally-distributed noise to the predicted actions. Training is performed by Adam with learning rates of 1e-4 and 1e-3 for the policy and the critic network, respectively.

10 Deep Deterministic Policy Gradients

Algorithm 1: DDPG algorithm
    Randomly initialize critic network Q(s, a | θ^Q) and actor μ(s | θ^μ) with weights θ^Q and θ^μ.
    Initialize target networks Q′ and μ′ with weights θ^{Q′} ← θ^Q, θ^{μ′} ← θ^μ
    Initialize replay buffer R
    for episode = 1, …, M do
        Initialize a random process N for action exploration
        Receive initial observation state s_1
        for t = 1, …, T do
            Select action a_t = μ(s_t | θ^μ) + N_t according to the current policy and exploration noise
            Execute action a_t and observe reward r_t and new state s_{t+1}
            Store transition (s_t, a_t, r_t, s_{t+1}) in R
            Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R
            Set y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
            Update the critic by minimizing the loss L = 1/N Σ_i (y_i − Q(s_i, a_i | θ^Q))²
            Update the actor policy using the sampled policy gradient:
                ∇_{θ^μ} J ≈ 1/N Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s_i}
            Update the target networks:
                θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′}
                θ^{μ′} ← τ θ^μ + (1 − τ) θ^{μ′}
        end for
    end for

Algorithm 1 of the paper "Continuous Control with Deep Reinforcement Learning" by Timothy P. Lillicrap et al.
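
Below is a self-contained toy sketch of one DDPG update under strong simplifications (scalar states and actions, a linear actor, a critic quadratic in the action; all constants are arbitrary). It mirrors the structure of the algorithm above but is not the paper's implementation:

    import numpy as np

    gamma, tau, lr_actor, lr_critic = 0.99, 0.001, 1e-4, 1e-3

    # Critic q(s, a; w) = w0 + w1*s + w2*a + w3*a^2 (quadratic so grad_a is useful),
    # actor mu(s; th) = th0 + th1*s; target networks start as copies.
    w = np.array([0.0, 0.1, 0.1, -0.1]); w_targ = w.copy()
    th = np.array([0.0, 0.5]);           th_targ = th.copy()

    def q(s, a, w):     return w[0] + w[1]*s + w[2]*a + w[3]*a*a
    def dq_da(s, a, w): return w[2] + 2*w[3]*a
    def mu(s, th):      return th[0] + th[1]*s

    # One minibatch of transitions (s, a, r, s'), as if sampled from a replay buffer.
    s, a, r, s2 = (np.random.randn(64) for _ in range(4))

    # Critic: TD target from the target networks, then a step on the squared error.
    y = r + gamma * q(s2, mu(s2, th_targ), w_targ)
    td = q(s, a, w) - y
    features = np.stack([np.ones_like(s), s, a, a*a])     # grad_w q(s, a; w)
    w -= lr_critic * (features * td).mean(axis=1)

    # Actor: deterministic policy gradient, grad_th mu(s) * grad_a q(s, mu(s)).
    g = dq_da(s, mu(s, th), w)
    th += lr_actor * (np.stack([np.ones_like(s), s]) * g).mean(axis=1)

    # Target networks: exponential moving average with tau.
    w_targ  = tau * w  + (1 - tau) * w_targ
    th_targ = tau * th + (1 - tau) * th_targ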

11 Deep Deterministic Policy Gradients

Figure 2 of the paper "Continuous Control with Deep Reinforcement Learning" by Timothy P. Lillicrap et al.: performance curves for a selection of domains (Cart Pendulum Swing-up, Cartpole Swing-up, Fixed Reacher, Monoped Balancing, Gripper, Blockworld, Puck Shooting, Cheetah, Moving Gripper) using variants of DPG: the original DPG algorithm (minibatch NFQCA) with batch normalization (light grey), with target network (dark grey), with target networks and batch normalization (green), and with target networks from pixel-only inputs (blue). Target networks are crucial.

12 Deep Deterministic Policy Gradients

Table 1 of the paper "Continuous Control with Deep Reinforcement Learning" by Timothy P. Lillicrap et al.: results using the low-dimensional (lowd) version of the environment, the pixel representation (pix) and the DPG reference (cntrl), reporting average and best rewards ($R_{av}$, $R_{best}$) for each setting on environments including blockworld, canada, cart, cartpole variants, cheetah, fixedreacher variants, gripper, hopper, hyq, movinggripper, pendulum, reacher variants, walker2d and torcs.

13 Natural Policy Gradient

The following approach was introduced by Kakade (2002).

Using the policy gradient theorem, we are able to compute $\nabla v$. Normally, we update the parameters directly along this gradient. This choice is justified by the fact that the vector $d$ which maximizes $v_\pi(s; \theta + d)$, under the constraint that $|d|^2$ is bounded by a small constant, is exactly the gradient $\nabla v$.

Normally, the length $|d|^2$ is computed using the Euclidean metric. But in general, any metric could be used. Representing a metric by a positive-definite matrix $G$ (the identity matrix for the Euclidean metric), we can compute the distance as $|d|_G^2 = \sum_{ij} G_{ij} d_i d_j = d^T G d$. The steepest ascent direction is then given by $G^{-1} \nabla v$.

Note that when $G$ is the Hessian $H$, the above process is exactly Newton's method.
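
A tiny numerical illustration (the metric and gradient are invented): under a metric $G$, the steepest-ascent direction is $G^{-1} \nabla v$ rather than $\nabla v$ itself:

    import numpy as np

    grad = np.array([1.0, 1.0])                 # plain gradient of v
    G = np.array([[100.0, 0.0], [0.0, 1.0]])    # metric: first direction "expensive"

    d = np.linalg.solve(G, grad)                # steepest ascent under G: G^{-1} grad
    print(d)  # [0.01, 1.0] -- the metric shrinks the step in the expensive direction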

14 Natural Policy Gradient

Figure 3 of the paper "Reinforcement learning of motor skills with policy gradients" by Jan Peters et al.: (a) vanilla policy gradient vs. (b) natural policy gradient.

15 Natural Policy Gradient

A suitable choice for the metric is the Fisher information matrix, defined as

$$F_s(\theta) \stackrel{\text{def}}{=} \mathbb{E}_{\pi(a \mid s; \theta)} \Big[ \nabla_\theta \log \pi(a \mid s; \theta) \big( \nabla_\theta \log \pi(a \mid s; \theta) \big)^T \Big].$$

It can be shown that the Fisher information metric is the only Riemannian metric (up to rescaling) invariant to changes of parameters under sufficient statistics.

Recall the Kullback-Leibler distance (or relative entropy) defined as

$$D_{\mathrm{KL}}(p \,\|\, q) \stackrel{\text{def}}{=} \sum_i p_i \log \frac{p_i}{q_i} = H(p, q) - H(p).$$

The Fisher information matrix is also the Hessian of $D_{\mathrm{KL}}\big(\pi(a \mid s; \theta) \,\|\, \pi(a \mid s; \theta')\big)$:

$$F_s(\theta)_{ij} = \frac{\partial^2}{\partial \theta'_i \partial \theta'_j} D_{\mathrm{KL}}\big(\pi(a \mid s; \theta) \,\|\, \pi(a \mid s; \theta')\big) \Big|_{\theta' = \theta}.$$
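
As an empirical check (the logits are invented), the Fisher matrix of a categorical softmax policy can be computed exactly from the definition and compared with its closed form $\operatorname{diag}(\pi) - \pi \pi^T$:

    import numpy as np

    theta = np.array([0.2, -0.1, 0.5])               # logits for one state
    pi = np.exp(theta) / np.exp(theta).sum()         # pi(a|s; theta) = softmax

    # Score of a categorical softmax: grad_theta log pi(a) = onehot(a) - pi.
    scores = np.eye(3) - pi                          # one row per action

    # F_s(theta) = E_{a~pi}[ score(a) score(a)^T ]
    F = sum(pi[a] * np.outer(scores[a], scores[a]) for a in range(3))

    # Closed form for softmax: diag(pi) - pi pi^T; the two agree.
    print(np.allclose(F, np.diag(pi) - np.outer(pi, pi)))  # True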

16 Natural Policy Gradient

Using the metric $F$, we want to update the parameters using $d_F \stackrel{\text{def}}{=} F(\theta)^{-1} \nabla v$, where $F(\theta) = \mathbb{E}_{s \sim \mu} F_s(\theta)$.

An interesting property of using $d_F$ to update the parameters is that while updating using $\nabla v$ will choose an arbitrary better action in state $s$, updating using $F_s(\theta)^{-1} \nabla v$ chooses the best action (maximizing expected return), similarly to tabular greedy policy improvement.

However, computing $d_F$ in a straightforward way is too costly.

17 Truncated Natural Policy Gradient

Duan et al. (2016), in the paper Benchmarking Deep Reinforcement Learning for Continuous Control, propose a modification of NPG to compute $d_F$ efficiently. Following Schulman et al. (2015), they suggest using the conjugate gradient algorithm, which can solve a system of linear equations $Ax = b$ in an iterative manner, by using only matrix-vector products $Av$ for suitable $v$.

Therefore, $d_F$ is found as a solution of

$$F(\theta)\, d_F = \nabla v,$$

and using only 10 iterations of the algorithm seems to suffice according to the experiments.

Furthermore, Duan et al. suggest using a specific learning rate suggested by Peters et al. (2008) of

$$\frac{\alpha}{\sqrt{(\nabla v)^T F(\theta)^{-1} \nabla v}}.$$
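
A minimal sketch of the conjugate gradient solver, using only a matrix-vector product function as described above (the explicit matrix below merely stands in for the Fisher matrix in the check; in TNPG the product $F(\theta) v$ is computed without materializing $F$):

    import numpy as np

    def conjugate_gradient(Av, b, iterations=10):
        """Solve A x = b using only the product function Av(v)."""
        x = np.zeros_like(b)
        r = b.copy()            # residual b - A x (x = 0 initially)
        p = r.copy()            # search direction
        rr = r @ r
        for _ in range(iterations):
            Ap = Av(p)
            alpha = rr / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r
            if rr_new < 1e-12:  # converged early
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    F = np.array([[4.0, 1.0], [1.0, 3.0]])   # stand-in SPD "Fisher" matrix
    g = np.array([1.0, 2.0])
    d_F = conjugate_gradient(lambda v: F @ v, g)
    print(np.allclose(d_F, np.linalg.solve(F, g)))  # True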

18 Trust Region Policy Optimization

Schulman et al. in 2015 wrote an influential paper introducing TRPO as an improved variant of NPG.

Considering two policies $\pi$ and $\tilde\pi$, we can write

$$v_{\tilde\pi} = v_\pi + \mathbb{E}_{s \sim \mu(\tilde\pi)}\, \mathbb{E}_{a \sim \tilde\pi(a \mid s)} \big[ a_\pi(s, a) \big],$$

where $a_\pi(s, a)$ is the advantage function $q_\pi(s, a) - v_\pi(s)$ and $\mu(\tilde\pi)$ is the on-policy distribution of the policy $\tilde\pi$.

Analogously to policy improvement, we see that if $\mathbb{E}_{a \sim \tilde\pi(a \mid s)} \big[ a_\pi(s, a) \big] \geq 0$ in every state, the policy performance increases (or stays the same if the advantages are zero everywhere).

However, sampling states $s \sim \mu(\tilde\pi)$ is costly. Therefore, we instead consider

$$L_\pi(\tilde\pi) = v_\pi + \mathbb{E}_{s \sim \mu(\pi)}\, \mathbb{E}_{a \sim \tilde\pi(a \mid s)} \big[ a_\pi(s, a) \big].$$

19 Trust Region Policy Optimization

$$L_\pi(\tilde\pi) = v_\pi + \mathbb{E}_{s \sim \mu(\pi)}\, \mathbb{E}_{a \sim \tilde\pi(a \mid s)} \big[ a_\pi(s, a) \big]$$

It can be shown that for a parametrized policy $\pi(a \mid s; \theta)$, $L_\pi(\tilde\pi)$ matches $v_{\tilde\pi}$ to the first order.

Schulman et al. additionally prove that if we denote $\alpha = \max_s D_{\mathrm{KL}}\big(\pi_{\mathrm{old}}(\cdot \mid s) \,\|\, \pi_{\mathrm{new}}(\cdot \mid s)\big)$, then

$$v_{\pi_{\mathrm{new}}} \geq L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{4 \varepsilon \gamma}{(1 - \gamma)^2} \alpha, \quad \text{where } \varepsilon = \max_{s, a} |a_\pi(s, a)|.$$

Therefore, TRPO maximizes $L_{\pi_{\theta_{\mathrm{old}}}}(\pi_\theta)$ subject to $D_{\mathrm{KL}}^{\theta_{\mathrm{old}}}(\theta) < \delta$, where

- $D_{\mathrm{KL}}^{\theta_{\mathrm{old}}}(\theta) = \mathbb{E}_{s \sim \mu(\pi_{\theta_{\mathrm{old}}})} \big[ D_{\mathrm{KL}}\big(\pi_{\mathrm{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \big]$ is used instead of $\max_s D_{\mathrm{KL}}$ for performance reasons;
- $\delta$ is a constant found empirically, as the one implied by the above equation is too small;
- importance sampling is used to account for sampling actions from $\pi_{\theta_{\mathrm{old}}}$.

20 Trust Region Policy Optimization

The parameters are updated using $d_F = F(\theta)^{-1} \nabla L_{\pi_{\theta_{\mathrm{old}}}}(\pi_\theta)$, utilizing the conjugate gradient algorithm as described earlier for TNPG (note that the algorithm was designed originally for TRPO and only later employed for TNPG).

To guarantee improvement and respect the $D_{\mathrm{KL}}$ constraint, a line search is in fact performed. We start with the learning rate of $\sqrt{\delta / (d_F^T F(\theta)\, d_F)}$ and shrink it exponentially until the constraint is satisfied and the objective improves.
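
A schematic sketch of the backtracking line search; surrogate_improvement and kl below are placeholder callables standing for estimates computed from sampled data, not functions from any library:

    import numpy as np

    def line_search(theta, d_F, delta, surrogate_improvement, kl,
                    initial_lr, shrink=0.5, max_steps=10):
        """Shrink the step exponentially until the KL constraint holds
        and the surrogate objective improves."""
        lr = initial_lr
        for _ in range(max_steps):
            theta_new = theta + lr * d_F
            if kl(theta, theta_new) < delta and \
                    surrogate_improvement(theta, theta_new) > 0:
                return theta_new
            lr *= shrink
        return theta  # no acceptable step found; keep the old parameters

    # Dummy check with an invented quadratic objective and Euclidean "KL":
    t = np.array([1.0, 0.0])
    obj = lambda th: -np.sum((th - t) ** 2)
    new_theta = line_search(np.zeros(2), np.array([1.0, 0.0]), delta=2.0,
                            surrogate_improvement=lambda a, b: obj(b) - obj(a),
                            kl=lambda a, b: np.sum((b - a) ** 2),
                            initial_lr=1.0)
    print(new_theta)  # [1. 0.]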

21 Trust Region Policy Optimization

Figure 1 of the paper "Benchmarking Deep Reinforcement Learning for Continuous Control" by Duan et al.: illustration of locomotion tasks: (a) Swimmer; (b) Hopper; (c) Walker; (d) Half-Cheetah; (e) Ant; (f) Simple Humanoid; and (g) Full Humanoid.

Table 1 of the paper "Benchmarking Deep Reinforcement Learning for Continuous Control" by Duan et al.: mean returns (± standard deviations) of Random, REINFORCE, TNPG, RWR, REPS, TRPO, CEM, CMA-ES and DDPG on Cart-Pole Balancing, Inverted Pendulum, Mountain Car, Acrobot, Double Inverted Pendulum, Swimmer, Hopper, 2D Walker, Half-Cheetah, Ant, Simple Humanoid and Full Humanoid.

22 Proximal Policy Optimization

A simplification of TRPO which can be implemented using a few lines of code.

Let $r_t(\theta) \stackrel{\text{def}}{=} \frac{\pi(A_t \mid S_t; \theta)}{\pi(A_t \mid S_t; \theta_{\mathrm{old}})}$. PPO maximizes the objective

$$L^{\mathrm{CLIP}}(\theta) \stackrel{\text{def}}{=} \mathbb{E}_t \Big[ \min\big( r_t(\theta) \hat A_t,\ \mathrm{clip}(r_t(\theta), 1 - \varepsilon, 1 + \varepsilon) \hat A_t \big) \Big].$$

Such an $L^{\mathrm{CLIP}}(\theta)$ is a lower (pessimistic) bound of the unclipped objective.

Figure 1 of the paper "Proximal Policy Optimization Algorithms" by Schulman et al.: a single term of $L^{\mathrm{CLIP}}$ as a function of $r$, for positive ($A > 0$) and negative ($A < 0$) advantages.
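
A minimal sketch of the clipped objective (the numbers in the check are invented):

    import numpy as np

    def ppo_clip_objective(log_prob_new, log_prob_old, advantages, epsilon=0.2):
        """Clipped PPO surrogate: a pessimistic lower bound on the ratio objective."""
        r = np.exp(log_prob_new - log_prob_old)          # r_t(theta)
        unclipped = r * advantages
        clipped = np.clip(r, 1 - epsilon, 1 + epsilon) * advantages
        return np.mean(np.minimum(unclipped, clipped))   # maximized wrt theta

    # A large ratio with a positive advantage gets clipped, so the objective
    # stops rewarding further increases of the action's probability.
    print(ppo_clip_objective(np.log([0.9, 0.2]), np.log([0.3, 0.4]),
                             np.array([1.0, -1.0])))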

23 Proximal Policy Optimization

The advantages $\hat A_t$ are additionally estimated using generalized advantage estimation. Instead of the usual

$$\hat A_t \stackrel{\text{def}}{=} \sum_{i=0}^{T-t-1} \gamma^i R_{t+1+i} + \gamma^{T-t} V(S_T) - V(S_t),$$

the authors employ

$$\hat A_t \stackrel{\text{def}}{=} \sum_{i=0}^{T-t-1} (\gamma\lambda)^i \delta_{t+i}, \quad \text{where } \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t).$$

Algorithm 1: PPO, Actor-Critic Style
    for iteration = 1, 2, … do
        for actor = 1, 2, …, N do
            Run policy π_{θ_old} in the environment for T timesteps
            Compute advantage estimates Â_1, …, Â_T
        end for
        Optimize the surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
        θ_old ← θ
    end for

Algorithm 1 of the paper "Proximal Policy Optimization Algorithms" by Schulman et al.
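
A minimal sketch of generalized advantage estimation, exploiting the recursion $\hat A_t = \delta_t + \gamma\lambda \hat A_{t+1}$ (the inputs in the check are invented):

    import numpy as np

    def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
        """Generalized advantage estimation: discounted sum of TD errors delta_t."""
        values = np.append(values, last_value)       # V(S_0), ..., V(S_T)
        advantages = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            running = delta + gamma * lam * running  # A_t = delta_t + gamma*lam*A_{t+1}
            advantages[t] = running
        return advantages

    print(gae(np.array([1.0, 1.0, 1.0]), np.array([0.5, 0.5, 0.5]), last_value=0.5))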

24 Proximal Policy Optimization

Figure 3 of the paper "Proximal Policy Optimization Algorithms" by Schulman et al.

25 Soft Actor Critic

The paper Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor by Tuomas Haarnoja et al. introduces a different off-policy algorithm for continuous action spaces.

The general idea is to introduce the entropy directly into the value function we want to maximize.
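
Concretely, the maximum-entropy objective from the paper augments the expected return with the policy entropy at every visited state, weighted by a temperature $\alpha$:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \big].$$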

26 Soft Actor Critic

Algorithm 1: Soft Actor-Critic
    Initialize parameter vectors ψ, ψ̄, θ, φ.
    for each iteration do
        for each environment step do
            a_t ∼ π_φ(a_t | s_t)
            s_{t+1} ∼ p(s_{t+1} | s_t, a_t)
            D ← D ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})}
        end for
        for each gradient step do
            ψ ← ψ − λ_V ∇̂_ψ J_V(ψ)
            θ_i ← θ_i − λ_Q ∇̂_{θ_i} J_Q(θ_i) for i ∈ {1, 2}
            φ ← φ − λ_π ∇̂_φ J_π(φ)
            ψ̄ ← τ ψ + (1 − τ) ψ̄
        end for
    end for

Algorithm 1 of the paper "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" by Tuomas Haarnoja et al.

27 Soft Actor Critic

Figure 1 of the paper "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" by Tuomas Haarnoja et al.: average return as a function of millions of environment steps for SAC, DDPG, PPO, SQL and TD3 (concurrent) on (a) Hopper-v1, (b) Walker2d-v1, (c) HalfCheetah-v1, (d) Ant-v1, (e) Humanoid-v1 and (f) Humanoid (rllab).
