Deep Reinforcement Learning

1 Deep Reinforcement Learning. John Schulman (Berkeley Artificial Intelligence Research Lab). MLSS, May 2016, Cadiz.

2 Agenda: Introduction and Overview; Markov Decision Processes; Reinforcement Learning via Black-Box Optimization; Policy Gradient Methods; Variance Reduction for Policy Gradients; Trust Region and Natural Gradient Methods; Open Problems. Course materials: goo.gl/5wsgbj

3 Introduction and Overview

4 What is Reinforcement Learning? Branch of machine learning concerned with taking sequences of actions. Usually described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward. (Diagram: the agent sends an action to the environment; the environment returns an observation and a reward.)

5 Motor Control and Robotics Robotics: Observations: camera images, joint angles Actions: joint torques Rewards: stay balanced, navigate to target locations, serve and protect humans

6 Business Operations Inventory Management Observations: current inventory levels Actions: number of units of each item to purchase Rewards: profit Resource allocation: who to provide customer service to first Routing problems: in management of shipping fleet, which trucks / truckers to assign to which cargo

7 Games A different kind of optimization problem (min-max), but still considered to be RL. Go (complete information, deterministic): AlphaGo [2]. Backgammon (complete information, stochastic): TD-Gammon [3]. Stratego (incomplete information, deterministic). Poker (incomplete information, stochastic). [2] David Silver, Aja Huang, et al. Mastering the game of Go with deep neural networks and tree search. In: Nature (2016). [3] Gerald Tesauro. Temporal difference learning and TD-Gammon. In: Communications of the ACM 38.3 (1995).

8 Approaches to RL (Diagram: two overlapping families of methods.) Policy Optimization: DFO / Evolution, Policy Gradients. Dynamic Programming: Policy Iteration, Value Iteration, Q-Learning, modified policy iteration. In the overlap: Actor-Critic Methods.

9 What is Deep RL? RL using nonlinear function approximators Usually, updating parameters with stochastic gradient descent

10 What's Deep RL? Whatever the front half of the cerebral cortex does (motor and executive cortices)

11 Markov Decision Processes

12 Definition Markov Decision Process (MDP) defined by (S, A, P), where S: state space; A: action space; P(r, s' | s, a): a transition probability distribution. Extra objects defined depending on the problem setting: µ: initial state distribution; γ: discount factor.

13 Episodic Setting In each episode, the initial state is sampled from µ, and the process proceeds until the terminal state is reached. For example: Taxi robot reaches its destination (termination = good) Waiter robot finishes a shift (fixed time) Walking robot falls over (termination = bad) Goal: maximize expected reward per episode

14 Policies Deterministic policies: a = π(s). Stochastic policies: a ~ π(a | s). Parameterized policies: π_θ.

15 Episodic Setting s_0 ~ µ(s_0); a_0 ~ π(a_0 | s_0); s_1, r_0 ~ P(s_1, r_0 | s_0, a_0); a_1 ~ π(a_1 | s_1); s_2, r_1 ~ P(s_2, r_1 | s_1, a_1); ...; a_{T-1} ~ π(a_{T-1} | s_{T-1}); s_T, r_{T-1} ~ P(s_T, r_{T-1} | s_{T-1}, a_{T-1}). Objective: maximize η(π), where η(π) = E[r_0 + r_1 + ... + r_{T-1} | π].

16 Episodic Setting (Diagram: agent with policy π emits actions a_0, a_1, ..., a_{T-1}; environment with dynamics P and initial distribution µ_0 returns states s_0, s_1, ..., s_T and rewards r_0, r_1, ..., r_{T-1}.) Objective: maximize η(π), where η(π) = E[r_0 + r_1 + ... + r_{T-1} | π].

17 Parameterized Policies A family of policies indexed by a parameter vector θ ∈ R^d. Deterministic: a = π(s, θ). Stochastic: π(a | s, θ). Analogous to classification or regression with input s, output a. E.g. for neural network stochastic policies: Discrete action space: network outputs a vector of probabilities. Continuous action space: network outputs the mean and diagonal covariance of a Gaussian.
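
As an illustration (not from the original slides), here is a minimal PyTorch sketch of the two output parameterizations just described. Layer sizes and class names are arbitrary choices.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class DiscretePolicy(nn.Module):
    """Stochastic policy for a discrete action space: outputs action probabilities (via logits)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))
    def forward(self, obs):
        return Categorical(logits=self.net(obs))  # pi(a | s, theta)

class GaussianPolicy(nn.Module):
    """Stochastic policy for a continuous action space: outputs mean and diagonal covariance."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent log std
    def forward(self, obs):
        # Sum log_prob over action dimensions to get the joint log-probability.
        return Normal(self.mean_net(obs), self.log_std.exp())

# Usage: dist = policy(torch.as_tensor(obs, dtype=torch.float32))
#        a = dist.sample(); logp = dist.log_prob(a)
```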

18 Reinforcement Learning via Black-Box Optimization

19 Derivative Free Optimization Approach Objective: maximize E[R | π(·, θ)]. View θ → R as a black box. Ignore all information other than the total reward R collected during the episode.

20 Cross-Entropy Method Evolutionary algorithm. Works embarrassingly well. István Szita and András Lörincz. Learning Tetris using the noisy cross-entropy method. In: Neural Computation (2006). Victor Gabillon, Mohammad Ghavamzadeh, and Bruno Scherrer. Approximate Dynamic Programming Finally Performs Well in the Game of Tetris. In: Advances in Neural Information Processing Systems. 2013.

21 Cross-Entropy Method Evolutionary algorithm. Works embarrassingly well. A similar algorithm, Covariance Matrix Adaptation, has become standard in graphics.

22 Cross-Entropy Method Initialize µ ∈ R^d, σ ∈ R^d. for iteration = 1, 2, ... do: Collect n samples θ_i ~ N(µ, diag(σ)). Perform a noisy evaluation R_i of each θ_i. Select the top p% of samples (e.g. p = 20), which we'll call the elite set. Fit a Gaussian distribution, with diagonal covariance, to the elite set, obtaining a new µ, σ. end for. Return the final µ.
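
A compact numpy sketch of this loop (illustrative, not from the slides), assuming the user supplies a possibly noisy evaluation function `evaluate(theta)`; names and hyperparameters are arbitrary.

```python
import numpy as np

def cross_entropy_method(evaluate, dim, n_iters=50, n_samples=100, elite_frac=0.2,
                         init_std=1.0, seed=0):
    """Maximize a black-box function evaluate(theta) over theta in R^dim."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.full(dim, init_std)
    n_elite = int(n_samples * elite_frac)
    for _ in range(n_iters):
        thetas = mu + sigma * rng.standard_normal((n_samples, dim))  # theta_i ~ N(mu, diag(sigma^2))
        returns = np.array([evaluate(th) for th in thetas])          # noisy evaluations R_i
        elite = thetas[np.argsort(returns)[-n_elite:]]               # top p% of samples
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6     # refit diagonal Gaussian
    return mu

# Example: maximize a simple quadratic (optimum at the all-ones vector).
# best = cross_entropy_method(lambda th: -np.sum((th - 1.0) ** 2), dim=5)
```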

23 Cross-Entropy Method Analysis: a very similar algorithm is a minorization-maximization (MM) algorithm, guaranteed to monotonically increase expected reward. Recall that the Monte-Carlo EM algorithm collects samples, reweights them, and then maximizes their logprob. We can derive an MM algorithm where at each iteration you maximize Σ_i R_i log p(θ_i).

24 Policy Gradient Methods

25 Policy Gradient Methods: Overview Problem: maximize E[R | π_θ]. Intuitions: collect a bunch of trajectories, and 1. Make the good trajectories more probable; 2. Make the good actions more probable (actor-critic, GAE); 3. Push the actions towards good actions (DPG, SVG).

26 Score Function Gradient Estimator Consider an expectation E_{x ~ p(x|θ)}[f(x)]. Want to compute the gradient wrt θ:
∇_θ E_x[f(x)] = ∇_θ ∫ dx p(x|θ) f(x)
             = ∫ dx ∇_θ p(x|θ) f(x)
             = ∫ dx p(x|θ) (∇_θ p(x|θ) / p(x|θ)) f(x)
             = ∫ dx p(x|θ) ∇_θ log p(x|θ) f(x)
             = E_x[f(x) ∇_θ log p(x|θ)].
The last expression gives us an unbiased gradient estimator. Just sample x_i ~ p(x|θ), and compute ĝ_i = f(x_i) ∇_θ log p(x_i|θ). Need to be able to compute and differentiate the density p(x|θ) wrt θ.
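
A small numpy check of this estimator (illustrative, not from the slides): for x ~ N(θ, 1) and f(x) = x², we have E[f(x)] = θ² + 1, so the true gradient is 2θ, and the score is ∇_θ log p(x|θ) = x - θ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
n = 200_000

x = theta + rng.standard_normal(n)   # x_i ~ N(theta, 1)
f = x ** 2                           # f(x_i)
score = x - theta                    # d/dtheta log N(x | theta, 1)
g_hat = np.mean(f * score)           # score function estimate of d/dtheta E[f(x)]

print(g_hat, 2 * theta)              # estimate vs. true gradient 2*theta = 3.0
```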

27 Derivation via Importance Sampling Alternate derivation using importance sampling:
E_{x~θ}[f(x)] = E_{x~θ_old}[ (p(x|θ) / p(x|θ_old)) f(x) ]
∇_θ E_{x~θ}[f(x)] = E_{x~θ_old}[ (∇_θ p(x|θ) / p(x|θ_old)) f(x) ]
∇_θ E_{x~θ}[f(x)] |_{θ=θ_old} = E_{x~θ_old}[ (∇_θ p(x|θ)|_{θ=θ_old} / p(x|θ_old)) f(x) ]
                              = E_{x~θ_old}[ ∇_θ log p(x|θ)|_{θ=θ_old} f(x) ]

28 Score Function Gradient Estimator: Intuition ĝ_i = f(x_i) ∇_θ log p(x_i | θ). Let's say that f(x) measures how good the sample x is. Moving in the direction ĝ_i pushes up the logprob of the sample, in proportion to how good it is. Valid even if f(x) is discontinuous or unknown, or the sample space (containing x) is a discrete set.

29 Score Function Gradient Estimator: Intuition (figure slide) ĝ_i = f(x_i) ∇_θ log p(x_i | θ)

30 Score Function Gradient Estimator: Intuition (figure slide) ĝ_i = f(x_i) ∇_θ log p(x_i | θ)

31 Score Function Gradient Estimator for Policies Now the random variable x is a whole trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{T-1}, a_{T-1}, r_{T-1}, s_T). Just need to write out p(τ|θ):
∇_θ E_τ[R(τ)] = E_τ[ ∇_θ log p(τ|θ) R(τ) ]
p(τ|θ) = µ(s_0) Π_{t=0}^{T-1} [ π(a_t | s_t, θ) P(s_{t+1}, r_t | s_t, a_t) ]
log p(τ|θ) = log µ(s_0) + Σ_{t=0}^{T-1} [ log π(a_t | s_t, θ) + log P(s_{t+1}, r_t | s_t, a_t) ]
∇_θ log p(τ|θ) = Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ)
∇_θ E_τ[R] = E_τ[ R Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) ]
Interpretation: using good trajectories (high R) as supervised examples in classification / regression.

32 Policy Gradient: Slightly Better Formula Previous slide:
∇_θ E_τ[R] = E_τ[ (Σ_{t=0}^{T-1} r_t) (Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ)) ]
But we can cut the trajectory to t steps and derive a gradient estimator for a single reward term r_t:
∇_θ E[r_t] = E[ r_t Σ_{t'=0}^{t} ∇_θ log π(a_{t'} | s_{t'}, θ) ]
Summing this formula over t, we obtain
∇_θ E[R] = E[ Σ_{t=0}^{T-1} r_t Σ_{t'=0}^{t} ∇_θ log π(a_{t'} | s_{t'}, θ) ]
         = E[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) Σ_{t'=t}^{T-1} r_{t'} ]
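
A small helper (illustrative, not from the slides) that computes the reward-to-go terms Σ_{t'=t}^{T-1} r_{t'} appearing in the second form of the estimator.

```python
import numpy as np

def rewards_to_go(rewards):
    """Return an array whose t-th entry is r_t + r_{t+1} + ... + r_{T-1}."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# rewards_to_go(np.array([1.0, 0.0, 2.0]))  ->  array([3., 2., 2.])
```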

33 Adding a Baseline Suppose f(x) ≥ 0 for all x. Then for every x_i, the gradient estimator ĝ_i tries to push up its density. We can derive a new unbiased estimator that avoids this problem, and only pushes up the density for better-than-average x_i:
∇_θ E_x[f(x)] = ∇_θ E_x[f(x) - b] = E_x[ ∇_θ log p(x|θ) (f(x) - b) ]
A near-optimal choice of b is E[f(x)] (which must be estimated).

34 Policy Gradient with Baseline Recall
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) Σ_{t'=t}^{T-1} r_{t'} ]
Using the fact that E_{a_t}[∇_θ log π(a_t | s_t, θ)] = 0, we can show
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) ( Σ_{t'=t}^{T-1} r_{t'} - b(s_t) ) ]
for any baseline function b : S → R. Increase the logprob of action a_t proportionally to how much the returns Σ_{t'=t}^{T-1} r_{t'} are better than expected. Later: use value functions to further isolate the effect of the action, at the cost of bias. For a more general picture of the score function gradient estimator, see stochastic computation graphs [4]. [4] John Schulman, Nicolas Heess, et al. Gradient Estimation Using Stochastic Computation Graphs. In: Advances in Neural Information Processing Systems. 2015.
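
A hedged PyTorch sketch of the resulting estimator: given a batch of log-probabilities and advantages (returns-to-go minus a baseline), minimizing this surrogate loss produces the policy gradient above. Tensor names are assumptions.

```python
import torch

def policy_gradient_loss(logp, returns_to_go, baseline):
    """Surrogate loss whose gradient is the baselined score-function estimator.

    logp:          log pi(a_t | s_t, theta), shape [T], must carry gradients w.r.t. theta
    returns_to_go: sum_{t' >= t} r_{t'},     shape [T]
    baseline:      b(s_t), e.g. a value-function estimate, shape [T]
    """
    advantage = (returns_to_go - baseline).detach()   # treat the weight as a constant
    return -(logp * advantage).mean()                  # minimizing this ascends the policy gradient

# Typical use: loss = policy_gradient_loss(logp, rtg, value_net(obs)); loss.backward()
```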

35 That's all for today. Course materials: goo.gl/5wsgbj

36 Variance Reduction for Policy Gradients

37 Review (I) Process for generating trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{T-1}, a_{T-1}, r_{T-1}, s_T): s_0 ~ µ(s_0); a_0 ~ π(a_0 | s_0); s_1, r_0 ~ P(s_1, r_0 | s_0, a_0); a_1 ~ π(a_1 | s_1); s_2, r_1 ~ P(s_2, r_1 | s_1, a_1); ...; a_{T-1} ~ π(a_{T-1} | s_{T-1}); s_T, r_{T-1} ~ P(s_T, r_{T-1} | s_{T-1}, a_{T-1}). Given parameterized policy π(a | s, θ), the optimization problem is: maximize_θ E_τ[R | π(·, θ)], where R = r_0 + r_1 + ... + r_{T-1}.

38 Review (II) In general, we can compute gradients of expectations with the score function gradient estimator:
∇_θ E_{x ~ p(x|θ)}[f(x)] = E_x[ ∇_θ log p(x|θ) f(x) ]
We derived a formula for the policy gradient:
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) ( Σ_{t'=t}^{T-1} r_{t'} - b(s_t) ) ]

39 Value Functions The state-value function V^π is defined as: V^π(s) = E[r_0 + r_1 + r_2 + ... | s_0 = s]. Measures expected future return, starting from state s. The state-action value function Q^π is defined as: Q^π(s, a) = E[r_0 + r_1 + r_2 + ... | s_0 = s, a_0 = a]. The advantage function A^π is: A^π(s, a) = Q^π(s, a) - V^π(s). Measures how much better action a is than what the policy π would've done.
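
For intuition (illustrative, not from the slides), V^π and Q^π can be estimated by Monte Carlo: average the empirical return over rollouts that start from the given state (and, for Q, take the given first action). `rollout` is an assumed helper that runs the policy and returns the reward sequence.

```python
import numpy as np

def mc_value(rollout, s, n=1000):
    """Monte Carlo estimate of V^pi(s): average return of rollouts starting at s."""
    return np.mean([sum(rollout(start_state=s)) for _ in range(n)])

def mc_q_value(rollout, s, a, n=1000):
    """Monte Carlo estimate of Q^pi(s, a): first action forced to a, then follow pi."""
    return np.mean([sum(rollout(start_state=s, first_action=a)) for _ in range(n)])

# Advantage estimate: A^pi(s, a) ~= mc_q_value(rollout, s, a) - mc_value(rollout, s)
```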

40 Refining the Policy Gradient Formula Recall
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) ( Σ_{t'=t}^{T-1} r_{t'} - b(s_t) ) ]
           = Σ_{t=0}^{T-1} E_{s_0...a_t}[ ∇_θ log π(a_t | s_t, θ) E_{r_t, s_{t+1}...s_T}[ Σ_{t'=t}^{T-1} r_{t'} - b(s_t) ] ]
           = Σ_{t=0}^{T-1} E_{s_0...a_t}[ ∇_θ log π(a_t | s_t, θ) ( Q^π(s_t, a_t) - b(s_t) ) ]
where the last equality used the fact that E_{r_t, s_{t+1}...s_T}[ Σ_{t'=t}^{T-1} r_{t'} ] = Q^π(s_t, a_t).

41 Refining the Policy Gradient Formula From the previous slide, we've obtained
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) ( Q^π(s_t, a_t) - b(s_t) ) ]
Now let's define b(s) = V^π(s), which turns out to be near-optimal [5]. We get
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) A^π(s_t, a_t) ]
Intuition: increase the probability of good actions (positive advantage); decrease the probability of bad ones (negative advantage). [5] Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. In: The Journal of Machine Learning Research 5 (2004).

42 Variance Reduction Now we have the following policy gradient formula:
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) A^π(s_t, a_t) ]
A^π is not known, but we can plug in a random variable Â_t, an advantage estimator. Previously, we showed that taking Â_t = r_t + r_{t+1} + r_{t+2} + ... - b(s_t), for any function b(s_t), gives an unbiased policy gradient estimator. b(s_t) ≈ V^π(s_t) gives variance reduction.

43 The Delayed Reward Problem One reason RL is difficult is the long delay between action and reward

44 The Delayed Reward Problem With policy gradient methods, we are confounding the effect of multiple actions: Â_t = r_t + r_{t+1} + r_{t+2} + ... - b(s_t) mixes the effects of a_t, a_{t+1}, a_{t+2}, ... The SNR of Â_t scales roughly as 1/T. Only a_t contributes to the signal A^π(s_t, a_t), but a_{t+1}, a_{t+2}, ... contribute to noise.

45 Var. Red. Idea 1: Using Discounts A discount factor γ, 0 < γ < 1, downweights the effect of rewards that are far in the future, i.e., ignores long-term dependencies. We can form an advantage estimator using the discounted return: Â^γ_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... - b(s_t), which reduces to our previous estimator when γ = 1. So that the advantage has expectation zero, we should fit the baseline to be the discounted value function V^{π,γ}(s) = E_τ[ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s ]. Â^γ_t is a biased estimator of the advantage function.

46 Var. Red. Idea 2: Value Functions in the Future Another approach to variance reduction is to use the value function to estimate future rewards: r_t + r_{t+1} + r_{t+2} + ... (use empirical rewards); r_t + V(s_{t+1}) (cut off at one timestep); r_t + r_{t+1} + V(s_{t+2}) (cut off at two timesteps); ... Adding the baseline again, we get the advantage estimators: Â_t = r_t + V(s_{t+1}) - V(s_t) (cut off at one timestep); Â_t = r_t + r_{t+1} + V(s_{t+2}) - V(s_t) (cut off at two timesteps); ...

47 Combining Ideas 1 and 2 Can combine discounts and value functions in the future, e.g., Â_t = r_t + γ V(s_{t+1}) - V(s_t), where V approximates the discounted value function V^{π,γ}. The above formula is called an actor-critic method, where the actor is the policy π and the critic is the value function V [6]. Going further, the generalized advantage estimator [7] is
Â^{γ,λ}_t = δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + ..., where δ_t = r_t + γ V(s_{t+1}) - V(s_t).
It interpolates between the two previous estimators: λ = 0: r_t + γ V(s_{t+1}) - V(s_t) (low variance, high bias); λ = 1: r_t + γ r_{t+1} + γ² r_{t+2} + ... - V(s_t) (low bias, high variance). [6] Vijay R. Konda and John N. Tsitsiklis. Actor-Critic Algorithms. In: Advances in Neural Information Processing Systems. Vol. 13. 1999. [7] John Schulman, Philipp Moritz, et al. High-dimensional continuous control using generalized advantage estimation. In: arXiv preprint (2015).
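
A numpy sketch of the generalized advantage estimator for a single trajectory (illustrative; assumes `values` has one extra entry for the final state, with V(s_T) = 0 at termination).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute A-hat^{gamma,lambda}_t = sum_k (gamma*lam)^k delta_{t+k}.

    rewards: r_0 ... r_{T-1}, shape [T]
    values:  V(s_0) ... V(s_T), shape [T+1]
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]   # delta_t = r_t + gamma V(s_{t+1}) - V(s_t)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):                           # backward recursion over the trajectory
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

# lam=0 recovers the one-step TD advantage; lam=1 recovers the discounted return minus V(s_t).
```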

48 Alternative Approach: Reparameterization Suppose the problem has a continuous action space, a ∈ R^d. Then dQ^π(s, a)/da tells us how to improve our action. We can use the reparameterization trick, so a is a deterministic function a = f(s, z), where z is noise. Then
∇_θ E_τ[R] = E_τ[ ∇_θ Q^π(s_0, a_0) + ∇_θ Q^π(s_1, a_1) + ... ]
This method is called the deterministic policy gradient [8]. A generalized version, which also uses a dynamics model, is described as the stochastic value gradient [9]. [8] David Silver, Guy Lever, et al. Deterministic policy gradient algorithms. In: ICML. 2014; Timothy P. Lillicrap et al. Continuous control with deep reinforcement learning. In: arXiv preprint (2015). [9] Nicolas Heess et al. Learning continuous control policies by stochastic value gradients. In: Advances in Neural Information Processing Systems. 2015.
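
A hedged PyTorch sketch of the deterministic-policy-gradient idea: differentiate a critic Q(s, a) through the action produced by the policy. `q_net` and `policy` are assumed modules; this is only the policy-improvement step, not a full DDPG algorithm.

```python
import torch

def dpg_policy_loss(policy, q_net, obs):
    """Loss whose gradient is -dQ/da * da/dtheta, averaged over the batch of states."""
    actions = policy(obs)                 # a = pi(s, theta), differentiable w.r.t. theta
    return -q_net(obs, actions).mean()    # ascend Q(s, pi(s, theta)) by descending this loss

# policy_optimizer.zero_grad()
# dpg_policy_loss(policy, q_net, obs_batch).backward()   # gradients flow through the action
# policy_optimizer.step()
```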

49 Trust Region and Natural Gradient Methods

50 Optimization Issues with Policy Gradients Hard to choose a reasonable stepsize that works for the whole optimization: we have a gradient estimate but no objective for line search, and the statistics of the data (observations and rewards) change during learning. They also make inefficient use of data: each experience is only used to compute one gradient. Given a batch of trajectories, what's the most we can do with it?

51 Policy Performance Function Let η(π) denote the performance of policy π: η(π) = E_τ[R | π]. The following neat identity holds:
η(π̃) = η(π) + E_{τ ~ π̃}[ A^π(s_0, a_0) + A^π(s_1, a_1) + A^π(s_2, a_2) + ... ]
Proof: consider nonstationary policies that follow π̃ for the first t steps and π afterwards, and write the difference as a telescoping sum:
η(π̃ π̃ π̃ ...) = η(π π π ...) + [η(π̃ π π ...) - η(π π π ...)] + [η(π̃ π̃ π ...) - η(π̃ π π ...)] + [η(π̃ π̃ π̃ ...) - η(π̃ π̃ π ...)] + ...
The t-th difference term equals E[A^π(s_t, a_t)].

52 Local Approximation We just derived an expression for the performance of a policy π̃ relative to π:
η(π̃) = η(π) + E_{τ ~ π̃}[ A^π(s_0, a_0) + A^π(s_1, a_1) + ... ]
      = η(π) + E_{s_0:∞ ~ π̃}[ E_{a_0:∞ ~ π̃}[ A^π(s_0, a_0) + A^π(s_1, a_1) + ... ] ]
We can't use this to optimize π̃ because the state distribution has a complicated dependence on π̃. Let's define L_π, the local approximation, which ignores the change in state distribution and can be estimated by sampling from π:
L_π(π̃) = E_{s_0:∞ ~ π}[ Σ_{t=0}^{T-1} E_{a ~ π̃}[ A^π(s_t, a_t) ] ]
        = E_{s_0:∞ ~ π}[ Σ_{t=0}^{T-1} E_{a ~ π}[ (π̃(a_t | s_t) / π(a_t | s_t)) A^π(s_t, a_t) ] ]
        = E_{τ ~ π}[ Σ_{t=0}^{T-1} (π̃(a_t | s_t) / π(a_t | s_t)) A^π(s_t, a_t) ]

53 Local Approximation Now let's consider a parameterized policy, π(a | s, θ). We sample with θ_old, and write the local approximation in terms of θ:
L_{θ_old}(θ) = E_{s_0:∞ ~ θ_old}[ Σ_{t=0}^{T-1} E_{a_t}[ (π(a_t | s_t, θ) / π(a_t | s_t, θ_old)) A^{θ_old}(s_t, a_t) ] ]
L_{θ_old}(θ) matches η(θ) to first order around θ_old:
∇_θ L_{θ_old}(θ) |_{θ=θ_old} = E_{s_0:∞}[ Σ_{t=0}^{T-1} E_{a_t}[ (∇_θ π(a_t | s_t, θ)|_{θ=θ_old} / π(a_t | s_t, θ_old)) A^{θ_old}(s_t, a_t) ] ]
                             = E_{s_0:∞}[ Σ_{t=0}^{T-1} E_{a_t}[ ∇_θ log π(a_t | s_t, θ)|_{θ=θ_old} A^{θ_old}(s_t, a_t) ] ]
                             = ∇_θ η(θ) |_{θ=θ_old}
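
A PyTorch sketch of this surrogate objective, estimated from data collected under θ_old (illustrative; `logp_new` must be recomputed with the current parameters, while `logp_old` and `adv` are fixed from the rollout).

```python
import torch

def surrogate_objective(logp_new, logp_old, adv):
    """L_{theta_old}(theta) estimated from samples: mean of probability ratio * advantage."""
    ratio = torch.exp(logp_new - logp_old.detach())   # pi(a|s,theta) / pi(a|s,theta_old)
    return (ratio * adv.detach()).mean()

# Maximizing this objective with small steps around theta_old approximately improves eta(theta);
# TRPO additionally constrains the KL divergence between the old and new policies.
```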

54 MM Algorithm Theorem (ignoring some details) [10]:
η(θ) ≥ L_{θ_old}(θ) - C max_s D_KL[ π(· | s, θ_old) || π(· | s, θ) ]
where L_{θ_old}(θ) is the local approximation to η and the KL term is a penalty for changing the policy. (Figure: η(θ) and the lower bound L(θ) - C·KL, which touches η at θ_old.) If the step θ_old → θ_new improves the lower bound, it's guaranteed to improve η. [10] John Schulman, Sergey Levine, et al. Trust Region Policy Optimization. In: arXiv preprint (2015).

55 Review Want to optimize η(θ). We collected data with policy parameter θ_old, and now want to do an update. Derived the local approximation L_{θ_old}(θ). Optimizing the KL-penalized local approximation gives guaranteed improvement to η. More approximations give a practical algorithm, called TRPO.

56 TRPO Approximations Steps: Instead of the max over the state space, take the mean. Linear approximation to L, quadratic approximation to the KL divergence. Use a hard constraint on the KL divergence instead of a penalty. Solve the following problem approximately: maximize L_{θ_old}(θ) subject to D_KL[θ_old || θ] ≤ δ. Solve approximately through line search in the natural gradient direction s = F⁻¹ g. The resulting algorithm is a refined version of the natural policy gradient (Sham Kakade. A Natural Policy Gradient. In: NIPS).
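
A simplified numpy sketch of the natural gradient step for a small parameter vector, assuming the Fisher matrix F and the surrogate gradient g have already been estimated (a real TRPO implementation uses conjugate gradient with Fisher-vector products instead of forming F explicitly).

```python
import numpy as np

def natural_gradient_step(g, F, delta=0.01):
    """Return the step along s = F^{-1} g, rescaled so that 0.5 * step^T F step ~= delta."""
    s = np.linalg.solve(F + 1e-8 * np.eye(len(g)), g)        # natural gradient direction
    step_size = np.sqrt(2.0 * delta / (s @ F @ s + 1e-12))   # satisfy the quadratic KL constraint
    return step_size * s

# theta_new = theta_old + natural_gradient_step(g, F)
# TRPO then backtracks along this direction until the surrogate improves and the KL constraint holds.
```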

57 Empirical Results: TRPO + GAE TRPO, with neural network policies, was applied to learn controllers for 2D robotic swimming, hopping, and walking, and for playing Atari games [12]. TRPO along with generalized advantage estimation was used to optimize locomotion policies for 3D simulated robots [13]. [12] John Schulman, Sergey Levine, et al. Trust Region Policy Optimization. In: arXiv preprint (2015). [13] John Schulman, Philipp Moritz, et al. High-dimensional continuous control using generalized advantage estimation. In: arXiv preprint (2015).

58 Putting In Perspective Quick and incomplete overview of recent results with deep RL algorithms. Policy gradient methods: TRPO + GAE; standard policy gradient (no trust region) + deep nets + parallel implementation [14]; reparameterization trick [15]. Q-learning [16] and modifications [17]. Combining search + supervised learning [18]. [14] V. Mnih et al. Playing Atari with Deep Reinforcement Learning. In: arXiv preprint (2013). [15] Nicolas Heess et al. Learning continuous control policies by stochastic value gradients. In: Advances in Neural Information Processing Systems. 2015; Timothy P. Lillicrap et al. Continuous control with deep reinforcement learning. In: arXiv preprint (2015). [16] V. Mnih et al. Playing Atari with Deep Reinforcement Learning. In: arXiv preprint (2013). [17] Ziyu Wang, Nando de Freitas, and Marc Lanctot. Dueling Network Architectures for Deep Reinforcement Learning. In: arXiv preprint (2015); Hado V. Hasselt. Double Q-learning. In: Advances in Neural Information Processing Systems. 2010. [18] X. Guo et al. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In: Advances in Neural Information Processing Systems. 2014; Sergey Levine et al. End-to-end training of deep visuomotor policies. In: arXiv preprint (2015); Igor Mordatch et al. Interactive Control of Diverse Complex Characters with Neural Networks. In: Advances in Neural Information Processing Systems. 2015.

59 Open Problems

60 What's the Right Core Model-Free Algorithm? Policy gradients (score function vs. reparameterization, natural vs. not natural) vs. Q-learning vs. derivative-free optimization vs. others. Desiderata: scalable; sample-efficient; robust; learns from off-policy data.

61 Exploration Exploration: actively encourage the agent to reach unfamiliar parts of the state space, and avoid getting stuck in a local maximum of performance. Can solve finite MDPs in polynomial time with exploration [19]: optimism about new states and actions; maintain a distribution over possible models, and plan with them (Bayesian RL, Thompson sampling). How to do exploration in the deep RL setting? Thompson sampling [20], novelty bonus [21]. [19] Alexander L. Strehl et al. PAC model-free reinforcement learning. In: Proceedings of the 23rd International Conference on Machine Learning. ACM. 2006. [20] Ian Osband et al. Deep Exploration via Bootstrapped DQN. In: arXiv preprint (2016). [21] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models. In: arXiv preprint (2015).

62 Hierarchy (Diagram: levels of temporal abstraction.) Tasks (task 1, task 2, task 3, task 4): ~10 timesteps / day. High-level actions (walk to x, fetch object y, say z): 0.01 Hz, 10^3 timesteps / day. Footstep planning: 1 Hz, 10^5 timesteps / day. Torque control: 100 Hz, 10^7 timesteps / day.

63 More Open Problems Using learned models Learning from demonstrations

64 The End Questions?
