Deep Reinforcement Learning

1 Deep Reinforcement Learning. John Schulman (Berkeley Artificial Intelligence Research Lab). MLSS, May 2016, Cadiz.

2 Agenda: Introduction and Overview; Markov Decision Processes; Reinforcement Learning via Black-Box Optimization; Policy Gradient Methods; Variance Reduction for Policy Gradients; Trust Region and Natural Gradient Methods; Open Problems. Course materials: goo.gl/5wsgbj

3 Introduction and Overview

4 What is Reinforcement Learning? Branch of machine learning concerned with taking sequences of actions. Usually described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward. (Diagram: the agent sends an action to the environment; the environment returns an observation and a reward.)

5 Motor Control and Robotics Robotics: Observations: camera images, joint angles Actions: joint torques Rewards: stay balanced, navigate to target locations, serve and protect humans

6 Business Operations Inventory Management Observations: current inventory levels Actions: number of units of each item to purchase Rewards: profit Resource allocation: who to provide customer service to first Routing problems: in management of shipping fleet, which trucks / truckers to assign to which cargo

7 Games A different kind of optimization problem (min-max), but still considered to be RL. Go (complete information, deterministic): AlphaGo [2]. Backgammon (complete information, stochastic): TD-Gammon [3]. Stratego (incomplete information, deterministic). Poker (incomplete information, stochastic). [2] David Silver, Aja Huang, et al. Mastering the game of Go with deep neural networks and tree search. In: Nature (2016). [3] Gerald Tesauro. Temporal difference learning and TD-Gammon. In: Communications of the ACM 38.3 (1995).

8 Approaches to RL (Diagram: two overlapping families of methods.) Policy Optimization: DFO / Evolution, Policy Gradients. Dynamic Programming: Policy Iteration, Value Iteration, Q-Learning, modified policy iteration. In the overlap: Actor-Critic Methods.

9 What is Deep RL? RL using nonlinear function approximators Usually, updating parameters with stochastic gradient descent

10 What's Deep RL? Whatever the front half of the cerebral cortex does (motor and executive cortices)

11 Markov Decision Processes

12 Definition Markov Decision Process (MDP) defined by (S, A, P), where S: state space; A: action space; P(r, s' | s, a): a transition probability distribution. Extra objects defined depending on the problem setting: µ: initial state distribution; γ: discount factor.

13 Episodic Setting In each episode, the initial state is sampled from µ, and the process proceeds until the terminal state is reached. For example: Taxi robot reaches its destination (termination = good) Waiter robot finishes a shift (fixed time) Walking robot falls over (termination = bad) Goal: maximize expected reward per episode

14 Policies Deterministic policies: a = π(s). Stochastic policies: a ~ π(a | s). Parameterized policies: π_θ.

15 Episodic Setting s_0 ~ µ(s_0); a_0 ~ π(a_0 | s_0); s_1, r_0 ~ P(s_1, r_0 | s_0, a_0); a_1 ~ π(a_1 | s_1); s_2, r_1 ~ P(s_2, r_1 | s_1, a_1); ...; a_{T-1} ~ π(a_{T-1} | s_{T-1}); s_T, r_{T-1} ~ P(s_T, r_{T-1} | s_{T-1}, a_{T-1}). Objective: maximize η(π), where η(π) = E[r_0 + r_1 + ... + r_{T-1} | π].

16 Episodic Setting (Diagram: agent with policy π emits actions a_0, a_1, ..., a_{T-1}; environment with dynamics P and initial distribution µ_0 returns states s_0, s_1, ..., s_T and rewards r_0, r_1, ..., r_{T-1}.) Objective: maximize η(π), where η(π) = E[r_0 + r_1 + ... + r_{T-1} | π].

17 Parameterized Policies A family of policies indexed by a parameter vector θ ∈ R^d. Deterministic: a = π(s, θ). Stochastic: π(a | s, θ). Analogous to classification or regression with input s, output a. E.g. for neural network stochastic policies: Discrete action space: network outputs a vector of probabilities. Continuous action space: network outputs the mean and diagonal covariance of a Gaussian.
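
As an illustration (not from the original slides), here is a minimal PyTorch sketch of the two output parameterizations just described. Layer sizes and class names are arbitrary choices.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class DiscretePolicy(nn.Module):
    """Stochastic policy for a discrete action space: outputs action probabilities (via logits)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))
    def forward(self, obs):
        return Categorical(logits=self.net(obs))  # pi(a | s, theta)

class GaussianPolicy(nn.Module):
    """Stochastic policy for a continuous action space: outputs mean and diagonal covariance."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent log std
    def forward(self, obs):
        # Sum log_prob over action dimensions to get the joint log-probability.
        return Normal(self.mean_net(obs), self.log_std.exp())

# Usage: dist = policy(torch.as_tensor(obs, dtype=torch.float32))
#        a = dist.sample(); logp = dist.log_prob(a)
```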

18 Reinforcement Learning via Black-Box Optimization

19 Derivative Free Optimization Approach Objective: maximize E[R | π(·, θ)]. View θ → R as a black box. Ignore all information other than the total reward R collected during the episode.

20 Cross-Entropy Method Evolutionary algorithm. Works embarrassingly well. István Szita and András Lörincz. Learning Tetris using the noisy cross-entropy method. In: Neural Computation (2006). Victor Gabillon, Mohammad Ghavamzadeh, and Bruno Scherrer. Approximate Dynamic Programming Finally Performs Well in the Game of Tetris. In: Advances in Neural Information Processing Systems. 2013.

21 Cross-Entropy Method Evolutionary algorithm. Works embarrassingly well. A similar algorithm, Covariance Matrix Adaptation, has become standard in graphics.

22 Cross-Entropy Method Initialize µ ∈ R^d, σ ∈ R^d. for iteration = 1, 2, ... do: Collect n samples θ_i ~ N(µ, diag(σ)). Perform a noisy evaluation R_i of each θ_i. Select the top p% of samples (e.g. p = 20), which we'll call the elite set. Fit a Gaussian distribution, with diagonal covariance, to the elite set, obtaining a new µ, σ. end for. Return the final µ.
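
A compact numpy sketch of this loop (illustrative, not from the slides), assuming the user supplies a possibly noisy evaluation function `evaluate(theta)`; names and hyperparameters are arbitrary.

```python
import numpy as np

def cross_entropy_method(evaluate, dim, n_iters=50, n_samples=100, elite_frac=0.2,
                         init_std=1.0, seed=0):
    """Maximize a black-box function evaluate(theta) over theta in R^dim."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.full(dim, init_std)
    n_elite = int(n_samples * elite_frac)
    for _ in range(n_iters):
        thetas = mu + sigma * rng.standard_normal((n_samples, dim))  # theta_i ~ N(mu, diag(sigma^2))
        returns = np.array([evaluate(th) for th in thetas])          # noisy evaluations R_i
        elite = thetas[np.argsort(returns)[-n_elite:]]               # top p% of samples
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6     # refit diagonal Gaussian
    return mu

# Example: maximize a simple quadratic (optimum at the all-ones vector).
# best = cross_entropy_method(lambda th: -np.sum((th - 1.0) ** 2), dim=5)
```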

23 Cross-Entropy Method Analysis: a very similar algorithm is a minorization-maximization (MM) algorithm, guaranteed to monotonically increase expected reward. Recall that the Monte-Carlo EM algorithm collects samples, reweights them, and then maximizes their logprob. We can derive an MM algorithm where at each iteration you maximize Σ_i R_i log p(θ_i).

24 Policy Gradient Methods

25 Policy Gradient Methods: Overview Problem: maximize E[R | π_θ]. Intuitions: collect a bunch of trajectories, and 1. Make the good trajectories more probable; 2. Make the good actions more probable (actor-critic, GAE); 3. Push the actions towards good actions (DPG, SVG).

26 Score Function Gradient Estimator Consider an expectation E_{x ~ p(x|θ)}[f(x)]. Want to compute the gradient wrt θ:
∇_θ E_x[f(x)] = ∇_θ ∫ dx p(x|θ) f(x)
             = ∫ dx ∇_θ p(x|θ) f(x)
             = ∫ dx p(x|θ) (∇_θ p(x|θ) / p(x|θ)) f(x)
             = ∫ dx p(x|θ) ∇_θ log p(x|θ) f(x)
             = E_x[f(x) ∇_θ log p(x|θ)].
The last expression gives us an unbiased gradient estimator. Just sample x_i ~ p(x|θ), and compute ĝ_i = f(x_i) ∇_θ log p(x_i|θ). Need to be able to compute and differentiate the density p(x|θ) wrt θ.
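
A small numpy check of this estimator (illustrative, not from the slides): for x ~ N(θ, 1) and f(x) = x², we have E[f(x)] = θ² + 1, so the true gradient is 2θ, and the score is ∇_θ log p(x|θ) = x - θ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
n = 200_000

x = theta + rng.standard_normal(n)   # x_i ~ N(theta, 1)
f = x ** 2                           # f(x_i)
score = x - theta                    # d/dtheta log N(x | theta, 1)
g_hat = np.mean(f * score)           # score function estimate of d/dtheta E[f(x)]

print(g_hat, 2 * theta)              # estimate vs. true gradient 2*theta = 3.0
```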

27 Derivation via Importance Sampling Alternate derivation using importance sampling:
E_{x~θ}[f(x)] = E_{x~θ_old}[ (p(x|θ) / p(x|θ_old)) f(x) ]
∇_θ E_{x~θ}[f(x)] = E_{x~θ_old}[ (∇_θ p(x|θ) / p(x|θ_old)) f(x) ]
∇_θ E_{x~θ}[f(x)] |_{θ=θ_old} = E_{x~θ_old}[ (∇_θ p(x|θ)|_{θ=θ_old} / p(x|θ_old)) f(x) ]
                              = E_{x~θ_old}[ ∇_θ log p(x|θ)|_{θ=θ_old} f(x) ]

28 Score Function Gradient Estimator: Intuition ĝ_i = f(x_i) ∇_θ log p(x_i | θ). Let's say that f(x) measures how good the sample x is. Moving in the direction ĝ_i pushes up the logprob of the sample, in proportion to how good it is. Valid even if f(x) is discontinuous or unknown, or the sample space (containing x) is a discrete set.

29 Score Function Gradient Estimator: Intuition (figure slide) ĝ_i = f(x_i) ∇_θ log p(x_i | θ)

30 Score Function Gradient Estimator: Intuition (figure slide) ĝ_i = f(x_i) ∇_θ log p(x_i | θ)

31 Score Function Gradient Estimator for Policies Now the random variable x is a whole trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{T-1}, a_{T-1}, r_{T-1}, s_T). Just need to write out p(τ|θ):
∇_θ E_τ[R(τ)] = E_τ[ ∇_θ log p(τ|θ) R(τ) ]
p(τ|θ) = µ(s_0) Π_{t=0}^{T-1} [ π(a_t | s_t, θ) P(s_{t+1}, r_t | s_t, a_t) ]
log p(τ|θ) = log µ(s_0) + Σ_{t=0}^{T-1} [ log π(a_t | s_t, θ) + log P(s_{t+1}, r_t | s_t, a_t) ]
∇_θ log p(τ|θ) = Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ)
∇_θ E_τ[R] = E_τ[ R Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) ]
Interpretation: using good trajectories (high R) as supervised examples in classification / regression.

32 Policy Gradient: Slightly Better Formula Previous slide:
∇_θ E_τ[R] = E_τ[ (Σ_{t=0}^{T-1} r_t) (Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ)) ]
But we can cut the trajectory to t steps and derive a gradient estimator for a single reward term r_t:
∇_θ E[r_t] = E[ r_t Σ_{t'=0}^{t} ∇_θ log π(a_{t'} | s_{t'}, θ) ]
Summing this formula over t, we obtain
∇_θ E[R] = E[ Σ_{t=0}^{T-1} r_t Σ_{t'=0}^{t} ∇_θ log π(a_{t'} | s_{t'}, θ) ]
         = E[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) Σ_{t'=t}^{T-1} r_{t'} ]
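
A small helper (illustrative, not from the slides) that computes the reward-to-go terms Σ_{t'=t}^{T-1} r_{t'} appearing in the second form of the estimator.

```python
import numpy as np

def rewards_to_go(rewards):
    """Return an array whose t-th entry is r_t + r_{t+1} + ... + r_{T-1}."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# rewards_to_go(np.array([1.0, 0.0, 2.0]))  ->  array([3., 2., 2.])
```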

33 Adding a Baseline Suppose f(x) ≥ 0 for all x. Then for every x_i, the gradient estimator ĝ_i tries to push up its density. We can derive a new unbiased estimator that avoids this problem, and only pushes up the density for better-than-average x_i:
∇_θ E_x[f(x)] = ∇_θ E_x[f(x) - b] = E_x[ ∇_θ log p(x|θ) (f(x) - b) ]
A near-optimal choice of b is E[f(x)] (which must be estimated).

34 Policy Gradient with Baseline Recall
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) Σ_{t'=t}^{T-1} r_{t'} ]
Using the fact that E_{a_t}[∇_θ log π(a_t | s_t, θ)] = 0, we can show
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) ( Σ_{t'=t}^{T-1} r_{t'} - b(s_t) ) ]
for any baseline function b : S → R. Increase the logprob of action a_t proportionally to how much the returns Σ_{t'=t}^{T-1} r_{t'} are better than expected. Later: use value functions to further isolate the effect of the action, at the cost of bias. For a more general picture of the score function gradient estimator, see stochastic computation graphs [4]. [4] John Schulman, Nicolas Heess, et al. Gradient Estimation Using Stochastic Computation Graphs. In: Advances in Neural Information Processing Systems. 2015.
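
A hedged PyTorch sketch of the resulting estimator: given a batch of log-probabilities and advantages (returns-to-go minus a baseline), minimizing this surrogate loss produces the policy gradient above. Tensor names are assumptions.

```python
import torch

def policy_gradient_loss(logp, returns_to_go, baseline):
    """Surrogate loss whose gradient is the baselined score-function estimator.

    logp:          log pi(a_t | s_t, theta), shape [T], must carry gradients w.r.t. theta
    returns_to_go: sum_{t' >= t} r_{t'},     shape [T]
    baseline:      b(s_t), e.g. a value-function estimate, shape [T]
    """
    advantage = (returns_to_go - baseline).detach()   # treat the weight as a constant
    return -(logp * advantage).mean()                  # minimizing this ascends the policy gradient

# Typical use: loss = policy_gradient_loss(logp, rtg, value_net(obs)); loss.backward()
```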

35 That's all for today. Course materials: goo.gl/5wsgbj

36 Variance Reduction for Policy Gradients

37 Review (I) Process for generating trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{T-1}, a_{T-1}, r_{T-1}, s_T): s_0 ~ µ(s_0); a_0 ~ π(a_0 | s_0); s_1, r_0 ~ P(s_1, r_0 | s_0, a_0); a_1 ~ π(a_1 | s_1); s_2, r_1 ~ P(s_2, r_1 | s_1, a_1); ...; a_{T-1} ~ π(a_{T-1} | s_{T-1}); s_T, r_{T-1} ~ P(s_T, r_{T-1} | s_{T-1}, a_{T-1}). Given parameterized policy π(a | s, θ), the optimization problem is: maximize_θ E_τ[R | π(·, θ)], where R = r_0 + r_1 + ... + r_{T-1}.

38 Review (II) In general, we can compute gradients of expectations with the score function gradient estimator:
∇_θ E_{x ~ p(x|θ)}[f(x)] = E_x[ ∇_θ log p(x|θ) f(x) ]
We derived a formula for the policy gradient:
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) ( Σ_{t'=t}^{T-1} r_{t'} - b(s_t) ) ]

39 Value Functions The state-value function V^π is defined as: V^π(s) = E[r_0 + r_1 + r_2 + ... | s_0 = s]. Measures expected future return, starting from state s. The state-action value function Q^π is defined as: Q^π(s, a) = E[r_0 + r_1 + r_2 + ... | s_0 = s, a_0 = a]. The advantage function A^π is: A^π(s, a) = Q^π(s, a) - V^π(s). Measures how much better action a is than what the policy π would've done.
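
For intuition (illustrative, not from the slides), V^π and Q^π can be estimated by Monte Carlo: average the empirical return over rollouts that start from the given state (and, for Q, take the given first action). `rollout` is an assumed helper that runs the policy and returns the reward sequence.

```python
import numpy as np

def mc_value(rollout, s, n=1000):
    """Monte Carlo estimate of V^pi(s): average return of rollouts starting at s."""
    return np.mean([sum(rollout(start_state=s)) for _ in range(n)])

def mc_q_value(rollout, s, a, n=1000):
    """Monte Carlo estimate of Q^pi(s, a): first action forced to a, then follow pi."""
    return np.mean([sum(rollout(start_state=s, first_action=a)) for _ in range(n)])

# Advantage estimate: A^pi(s, a) ~= mc_q_value(rollout, s, a) - mc_value(rollout, s)
```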

40 Refining the Policy Gradient Formula Recall
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) ( Σ_{t'=t}^{T-1} r_{t'} - b(s_t) ) ]
           = Σ_{t=0}^{T-1} E_{s_0...a_t}[ ∇_θ log π(a_t | s_t, θ) E_{r_t, s_{t+1}...s_T}[ Σ_{t'=t}^{T-1} r_{t'} - b(s_t) ] ]
           = Σ_{t=0}^{T-1} E_{s_0...a_t}[ ∇_θ log π(a_t | s_t, θ) ( Q^π(s_t, a_t) - b(s_t) ) ]
where the last equality used the fact that E_{r_t, s_{t+1}...s_T}[ Σ_{t'=t}^{T-1} r_{t'} ] = Q^π(s_t, a_t).

41 Refining the Policy Gradient Formula From the previous slide, we've obtained
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) ( Q^π(s_t, a_t) - b(s_t) ) ]
Now let's define b(s) = V^π(s), which turns out to be near-optimal [5]. We get
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) A^π(s_t, a_t) ]
Intuition: increase the probability of good actions (positive advantage); decrease the probability of bad ones (negative advantage). [5] Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. In: The Journal of Machine Learning Research 5 (2004).

42 Variance Reduction Now we have the following policy gradient formula:
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) A^π(s_t, a_t) ]
A^π is not known, but we can plug in a random variable Â_t, an advantage estimator. Previously, we showed that taking Â_t = r_t + r_{t+1} + r_{t+2} + ... - b(s_t), for any function b(s_t), gives an unbiased policy gradient estimator. b(s_t) ≈ V^π(s_t) gives variance reduction.

43 The Delayed Reward Problem One reason RL is difficult is the long delay between action and reward

44 The Delayed Reward Problem With policy gradient methods, we are confounding the effect of multiple actions: Â_t = r_t + r_{t+1} + r_{t+2} + ... - b(s_t) mixes the effects of a_t, a_{t+1}, a_{t+2}, ... The SNR of Â_t scales roughly as 1/T. Only a_t contributes to the signal A^π(s_t, a_t), but a_{t+1}, a_{t+2}, ... contribute to noise.

45 Var. Red. Idea 1: Using Discounts A discount factor γ, 0 < γ < 1, downweights the effect of rewards that are far in the future, i.e., ignores long-term dependencies. We can form an advantage estimator using the discounted return: Â^γ_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... - b(s_t), which reduces to our previous estimator when γ = 1. So that the advantage has expectation zero, we should fit the baseline to be the discounted value function V^{π,γ}(s) = E_τ[ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s ]. Â^γ_t is a biased estimator of the advantage function.

46 Var. Red. Idea 2: Value Functions in the Future Another approach to variance reduction is to use the value function to estimate future rewards: r_t + r_{t+1} + r_{t+2} + ... (use empirical rewards); r_t + V(s_{t+1}) (cut off at one timestep); r_t + r_{t+1} + V(s_{t+2}) (cut off at two timesteps); ... Adding the baseline again, we get the advantage estimators: Â_t = r_t + V(s_{t+1}) - V(s_t) (cut off at one timestep); Â_t = r_t + r_{t+1} + V(s_{t+2}) - V(s_t) (cut off at two timesteps); ...

47 Combining Ideas 1 and 2 Can combine discounts and value functions in the future, e.g., Â_t = r_t + γ V(s_{t+1}) - V(s_t), where V approximates the discounted value function V^{π,γ}. The above formula is called an actor-critic method, where the actor is the policy π and the critic is the value function V [6]. Going further, the generalized advantage estimator [7] is
Â^{γ,λ}_t = δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + ..., where δ_t = r_t + γ V(s_{t+1}) - V(s_t).
It interpolates between the two previous estimators: λ = 0: r_t + γ V(s_{t+1}) - V(s_t) (low variance, high bias); λ = 1: r_t + γ r_{t+1} + γ² r_{t+2} + ... - V(s_t) (low bias, high variance). [6] Vijay R. Konda and John N. Tsitsiklis. Actor-Critic Algorithms. In: Advances in Neural Information Processing Systems. Vol. 13. 1999. [7] John Schulman, Philipp Moritz, et al. High-dimensional continuous control using generalized advantage estimation. In: arXiv preprint (2015).
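
A numpy sketch of the generalized advantage estimator for a single trajectory (illustrative; assumes `values` has one extra entry for the final state, with V(s_T) = 0 at termination).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute A-hat^{gamma,lambda}_t = sum_k (gamma*lam)^k delta_{t+k}.

    rewards: r_0 ... r_{T-1}, shape [T]
    values:  V(s_0) ... V(s_T), shape [T+1]
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]   # delta_t = r_t + gamma V(s_{t+1}) - V(s_t)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):                           # backward recursion over the trajectory
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

# lam=0 recovers the one-step TD advantage; lam=1 recovers the discounted return minus V(s_t).
```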

48 Alternative Approach: Reparameterization Suppose the problem has a continuous action space, a ∈ R^d. Then dQ^π(s, a)/da tells us how to improve our action. We can use the reparameterization trick, so a is a deterministic function a = f(s, z), where z is noise. Then
∇_θ E_τ[R] = E_τ[ ∇_θ Q^π(s_0, a_0) + ∇_θ Q^π(s_1, a_1) + ... ]
This method is called the deterministic policy gradient [8]. A generalized version, which also uses a dynamics model, is described as the stochastic value gradient [9]. [8] David Silver, Guy Lever, et al. Deterministic policy gradient algorithms. In: ICML. 2014; Timothy P. Lillicrap et al. Continuous control with deep reinforcement learning. In: arXiv preprint (2015). [9] Nicolas Heess et al. Learning continuous control policies by stochastic value gradients. In: Advances in Neural Information Processing Systems. 2015.
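
A hedged PyTorch sketch of the deterministic-policy-gradient idea: differentiate a critic Q(s, a) through the action produced by the policy. `q_net` and `policy` are assumed modules; this is only the policy-improvement step, not a full DDPG algorithm.

```python
import torch

def dpg_policy_loss(policy, q_net, obs):
    """Loss whose gradient is -dQ/da * da/dtheta, averaged over the batch of states."""
    actions = policy(obs)                 # a = pi(s, theta), differentiable w.r.t. theta
    return -q_net(obs, actions).mean()    # ascend Q(s, pi(s, theta)) by descending this loss

# policy_optimizer.zero_grad()
# dpg_policy_loss(policy, q_net, obs_batch).backward()   # gradients flow through the action
# policy_optimizer.step()
```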

49 Trust Region and Natural Gradient Methods

50 Optimization Issues with Policy Gradients Hard to choose a reasonable stepsize that works for the whole optimization: we have a gradient estimate but no objective for line search, and the statistics of the data (observations and rewards) change during learning. They also make inefficient use of data: each experience is only used to compute one gradient. Given a batch of trajectories, what's the most we can do with it?

51 Policy Performance Function Let η(π) denote the performance of policy π: η(π) = E_τ[R | π]. The following neat identity holds:
η(π̃) = η(π) + E_{τ ~ π̃}[ A^π(s_0, a_0) + A^π(s_1, a_1) + A^π(s_2, a_2) + ... ]
Proof: consider nonstationary policies that follow π̃ for the first t steps and π afterwards, and write the difference as a telescoping sum:
η(π̃ π̃ π̃ ...) = η(π π π ...) + [η(π̃ π π ...) - η(π π π ...)] + [η(π̃ π̃ π ...) - η(π̃ π π ...)] + [η(π̃ π̃ π̃ ...) - η(π̃ π̃ π ...)] + ...
The t-th difference term equals E[A^π(s_t, a_t)].

52 Local Approximation We just derived an expression for the performance of a policy π̃ relative to π:
η(π̃) = η(π) + E_{τ ~ π̃}[ A^π(s_0, a_0) + A^π(s_1, a_1) + ... ]
      = η(π) + E_{s_0:∞ ~ π̃}[ E_{a_0:∞ ~ π̃}[ A^π(s_0, a_0) + A^π(s_1, a_1) + ... ] ]
We can't use this to optimize π̃ because the state distribution has a complicated dependence on π̃. Let's define L_π, the local approximation, which ignores the change in state distribution and can be estimated by sampling from π:
L_π(π̃) = E_{s_0:∞ ~ π}[ Σ_{t=0}^{T-1} E_{a ~ π̃}[ A^π(s_t, a_t) ] ]
        = E_{s_0:∞ ~ π}[ Σ_{t=0}^{T-1} E_{a ~ π}[ (π̃(a_t | s_t) / π(a_t | s_t)) A^π(s_t, a_t) ] ]
        = E_{τ ~ π}[ Σ_{t=0}^{T-1} (π̃(a_t | s_t) / π(a_t | s_t)) A^π(s_t, a_t) ]

53 Local Approximation Now let's consider a parameterized policy, π(a | s, θ). We sample with θ_old, and write the local approximation in terms of θ:
L_{θ_old}(θ) = E_{s_0:∞ ~ θ_old}[ Σ_{t=0}^{T-1} E_{a_t}[ (π(a_t | s_t, θ) / π(a_t | s_t, θ_old)) A^{θ_old}(s_t, a_t) ] ]
L_{θ_old}(θ) matches η(θ) to first order around θ_old:
∇_θ L_{θ_old}(θ) |_{θ=θ_old} = E_{s_0:∞}[ Σ_{t=0}^{T-1} E_{a_t}[ (∇_θ π(a_t | s_t, θ)|_{θ=θ_old} / π(a_t | s_t, θ_old)) A^{θ_old}(s_t, a_t) ] ]
                             = E_{s_0:∞}[ Σ_{t=0}^{T-1} E_{a_t}[ ∇_θ log π(a_t | s_t, θ)|_{θ=θ_old} A^{θ_old}(s_t, a_t) ] ]
                             = ∇_θ η(θ) |_{θ=θ_old}
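
A PyTorch sketch of this surrogate objective, estimated from data collected under θ_old (illustrative; `logp_new` must be recomputed with the current parameters, while `logp_old` and `adv` are fixed from the rollout).

```python
import torch

def surrogate_objective(logp_new, logp_old, adv):
    """L_{theta_old}(theta) estimated from samples: mean of probability ratio * advantage."""
    ratio = torch.exp(logp_new - logp_old.detach())   # pi(a|s,theta) / pi(a|s,theta_old)
    return (ratio * adv.detach()).mean()

# Maximizing this objective with small steps around theta_old approximately improves eta(theta);
# TRPO additionally constrains the KL divergence between the old and new policies.
```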

54 MM Algorithm Theorem (ignoring some details) [10]:
η(θ) ≥ L_{θ_old}(θ) - C max_s D_KL[ π(· | s, θ_old) || π(· | s, θ) ]
where L_{θ_old}(θ) is the local approximation to η and the KL term is a penalty for changing the policy. (Figure: η(θ) and the lower bound L(θ) - C·KL, which touches η at θ_old.) If the step θ_old → θ_new improves the lower bound, it's guaranteed to improve η. [10] John Schulman, Sergey Levine, et al. Trust Region Policy Optimization. In: arXiv preprint (2015).

55 Review Want to optimize η(θ). We collected data with policy parameter θ_old, and now want to do an update. Derived the local approximation L_{θ_old}(θ). Optimizing the KL-penalized local approximation gives guaranteed improvement to η. More approximations give a practical algorithm, called TRPO.

56 TRPO Approximations Steps: Instead of the max over the state space, take the mean. Linear approximation to L, quadratic approximation to the KL divergence. Use a hard constraint on the KL divergence instead of a penalty. Solve the following problem approximately: maximize L_{θ_old}(θ) subject to D_KL[θ_old || θ] ≤ δ. Solve approximately through line search in the natural gradient direction s = F⁻¹ g. The resulting algorithm is a refined version of the natural policy gradient (Sham Kakade. A Natural Policy Gradient. In: NIPS).
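
A simplified numpy sketch of the natural gradient step for a small parameter vector, assuming the Fisher matrix F and the surrogate gradient g have already been estimated (a real TRPO implementation uses conjugate gradient with Fisher-vector products instead of forming F explicitly).

```python
import numpy as np

def natural_gradient_step(g, F, delta=0.01):
    """Return the step along s = F^{-1} g, rescaled so that 0.5 * step^T F step ~= delta."""
    s = np.linalg.solve(F + 1e-8 * np.eye(len(g)), g)        # natural gradient direction
    step_size = np.sqrt(2.0 * delta / (s @ F @ s + 1e-12))   # satisfy the quadratic KL constraint
    return step_size * s

# theta_new = theta_old + natural_gradient_step(g, F)
# TRPO then backtracks along this direction until the surrogate improves and the KL constraint holds.
```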

57 Empirical Results: TRPO + GAE TRPO, with neural network policies, was applied to learn controllers for 2D robotic swimming, hopping, and walking, and for playing Atari games [12]. TRPO along with generalized advantage estimation was used to optimize locomotion policies for 3D simulated robots [13]. [12] John Schulman, Sergey Levine, et al. Trust Region Policy Optimization. In: arXiv preprint (2015). [13] John Schulman, Philipp Moritz, et al. High-dimensional continuous control using generalized advantage estimation. In: arXiv preprint (2015).

58 Putting In Perspective Quick and incomplete overview of recent results with deep RL algorithms. Policy gradient methods: TRPO + GAE; standard policy gradient (no trust region) + deep nets + parallel implementation [14]; reparameterization trick [15]. Q-learning [16] and modifications [17]. Combining search + supervised learning [18]. [14] V. Mnih et al. Playing Atari with Deep Reinforcement Learning. In: arXiv preprint (2013). [15] Nicolas Heess et al. Learning continuous control policies by stochastic value gradients. In: Advances in Neural Information Processing Systems. 2015; Timothy P. Lillicrap et al. Continuous control with deep reinforcement learning. In: arXiv preprint (2015). [16] V. Mnih et al. Playing Atari with Deep Reinforcement Learning. In: arXiv preprint (2013). [17] Ziyu Wang, Nando de Freitas, and Marc Lanctot. Dueling Network Architectures for Deep Reinforcement Learning. In: arXiv preprint (2015); Hado V. Hasselt. Double Q-learning. In: Advances in Neural Information Processing Systems. 2010. [18] X. Guo et al. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In: Advances in Neural Information Processing Systems. 2014; Sergey Levine et al. End-to-end training of deep visuomotor policies. In: arXiv preprint (2015); Igor Mordatch et al. Interactive Control of Diverse Complex Characters with Neural Networks. In: Advances in Neural Information Processing Systems. 2015.

59 Open Problems

60 What's the Right Core Model-Free Algorithm? Policy gradients (score function vs. reparameterization, natural vs. not natural) vs. Q-learning vs. derivative-free optimization vs. others. Desiderata: scalable; sample-efficient; robust; learns from off-policy data.

61 Exploration Exploration: actively encourage the agent to reach unfamiliar parts of the state space, and avoid getting stuck in a local maximum of performance. Can solve finite MDPs in polynomial time with exploration [19]: optimism about new states and actions; maintain a distribution over possible models, and plan with them (Bayesian RL, Thompson sampling). How to do exploration in the deep RL setting? Thompson sampling [20], novelty bonus [21]. [19] Alexander L. Strehl et al. PAC model-free reinforcement learning. In: Proceedings of the 23rd International Conference on Machine Learning. ACM. 2006. [20] Ian Osband et al. Deep Exploration via Bootstrapped DQN. In: arXiv preprint (2016). [21] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models. In: arXiv preprint (2015).

62 Hierarchy (Diagram: levels of temporal abstraction.) Tasks (task 1, task 2, task 3, task 4): ~10 timesteps / day. High-level actions (walk to x, fetch object y, say z): 0.01 Hz, 10^3 timesteps / day. Footstep planning: 1 Hz, 10^5 timesteps / day. Torque control: 100 Hz, 10^7 timesteps / day.

63 More Open Problems Using learned models Learning from demonstrations

64 The End Questions?
