ROB 537: Learning-Based Control. Announcements: Project background due Today. HW 3 Due on 10/30 Midterm Exam on 11/6.

Size: px

Start display at page:

Download "ROB 537: Learning-Based Control. Announcements: Project background due Today. HW 3 Due on 10/30 Midterm Exam on 11/6."

Blanche Bernice Perry
5 years ago
Views:

1 ROB 537: Learning-Based Control Week 5, Lecture 1 Policy Gradient, Eligibility Traces, Transfer Learning (MaC Taylor Announcements: Project background due Today HW 3 Due on 10/30 Midterm Exam on 11/6 Reading: Transfer Learning survey by Taylor & Stone Policy Gradient Gradient Decent MSE( θ t = s S p(s ( V π (s V t (s 2 Basic parameter update (based on uniform p(s : θ ( 2 Δ θ t = 1 2 α V π (s t V t (s t Update parameters based on the negauve of the gradient 1

2 Gradient Descent Sarsa Gradient descent for state- acuon values: Δ θ t = α ( Q π (s t Q t (s t Q θ t (s t Δ! θ t = α ( r t +γ Q t (s t+1, a t+1 Q t (s t, a t! Q θ t(s t, a t Linear Methods Special case of gradient descent methods: V t is linear with respect to the parameters θ t There is a vector of features φ for each state FuncUon approximauon becomes: n V t (s = θ t (i φ s (i i=1 2

3 Coarse Coding Cover the space with concentric features (for example circles Weights are 0 or 1 Value is based on coverage x Tile Coding ExhausUve paruuons (Ulings Same number of features as number of Ulings x 3

4 Radial Basis Functions Coarse coding generalized to conunuous value features Each feature can be weighted by a real number (instead of 0 or 1 σ i c i-1 c i c i+1 4

5 Eligibility Traces Eligibility Traces are the basic mechanism of assigning credit in Reinforcement Learning In TD(λ, λ refers to an eligibility trace Basic Idea: Determine how many rewards will be used in evaluaung a value funcuon Determine which states are eligible for an update Connect TD learning and Monte Carlo Learning Eligibility Traces Forward view: TheoreUcal Connect TD learning and Monte Carlo learning In TD learning only previous state is eligible In Monte Carlo, all visited states are eligible IntuiUon: How far ahead do you need to look to figure out value of a state Backward view: MechanisUc Temporary record of events VisiUng states, taking acuons, Receiving Rewards IntuiUon: How many state value pairs do I need to store to figure out value of a state 5

6 Forward View Recall Monte Carlo methods for a T step episode: R t = r t+1 + γ r t+2 + γ 2 r t γ T t 1 r T The update is based on all the returns In TD backup: R t (1 = r t+1 + γ V t (s t+1 The update is based on 1- step backup Why restrict ourselves to these two extremes? Forward View: n-step TD Why not look at 2- step backup: R t (2 = r t+1 + γ r t+2 + γ 2 V t (s t+2 How about n- step backup? R t (n = r t+1 + γ r t+2 + γ 2 r t γ n 1 r t+n + γ n V t (s t+n Called corrected n- step truncated return or n- step return 6

7 Forward View (SuCon & Barto Backward View Causal, incremental view Implementable Introduce a new memory variable: eligibility trace e t (s On each step eligibility trace for each state decays by γλ Eligibility trace of one state increased by 1 e t (s = γ λ e (s if s s t 1 t γ λ e t 1 (s +1 if s = s t λ is trace decay parameter γ is the discount rate used previously 7

8 Intuition At any Ume eligibility traces keep track of which states have been recently visited Recently is quanufied by γλ The traces capture how eligible a state is for an update Recall basic reinforcement learning: NewEstimate OldEstimate + Stepsize (Target OldEstimate Now, update all values based on eligibility trace Implementation NewEstimate OldEstimate + Stepsize (Target OldEstimate For TD, the error was: δ t = r(s t + γ V(s t+1 V(s t Then update becomes: V t +1 (s = V t (s + α δ t e(s t for all s S 8

9 Special Cases Let s explore two special cases based on this update rule V t +1 (s = V t (s + α δ t e(s t for all s S For λ=0: All traces are 0 at t except state visited at t: s t Reduces to TD update rule For λ=1: All states receive credit decaying by γ This is the Monte Carlo updates If γ=1 as well, there is no decay in Ume and we have undiscounted Monte Carlo updates Forward and Backward Views Forward view 9

10 Forward and Backward Views Forward view Backward view (SuCon & Barto Sarsa (λ Recall Sarsa learning: Q(s t Q(s t + α ( r(s t + γ Q(s t Q(s t Now, we have: Q(s t Q(s t + α δ t e(s t for all s,a Increase eligibility if s,a picked Decay eligibility otherwise 10

11 Sarsa (λ Recall Sarsa learning: Q(s t Q(s t + α ( r(s t + γ Q(s t Q(s t Now, we have: where: Q(s t Q(s t + α δ t e(s t δ t = r(s t + γ Q(s t Q(s t for all s,a e t (s,a = γ λ e t 1(s,a +1 if s = s t and a = a t for all s,a γ λ e t 1 (s,a otherwise Example: Sarsa (λ Recall Sarsa updates: only last state receives update. All other state acuon pairs will update when visited/taken next Path taken Action values increased by one-step Sarsa Action values increased by Sarsa(! with!=0.9 11

12 Q(λ Recall Q- learning Q(s t Q(s t + α r(s t + γ maxq(s t +1,a Q(s t Now, we have: ( a where Q(s t Q(s t + α δ t e(s t for all s,a δ t = r(s t + γ maxq(s t +1,a Q(s t a γ λ e e t (s,a = I sst I t 1 (s,a +1 if Q t -1 (s t = max a Q t -1 (s t,a aat 0 otherwise Q(λ Q learning with eligibility traces: Like in Sarsa learning, except eligibility trace set to zero aner an exploratory move (not greedy IntuiUon: two step process Step 1: Either each state- acuon pair decayed by λ Or all set to 0 aner an exploratory step Step 2: Trace for current state acuon pair incremented by 1 (this is Watkins Q(λ; There are other versions as well. 12

13 Back to Gradient Descent Sarsa Gradient descent for state- acuon values: Δ θ t = α ( Q π (s t Q t (s t Q θ t (s t Δ θ t = α ( r t +1 + γ Q t (s t Q t (s t Q θ t (s t Gradient descent Sarsa: 1- step backup Extend to eligibility traces Gradient Descent Sarsa(λ Gradient descent Sarsa(λ: where Δ θ t = α δ t e t δ t = r t +1 + γ Q t (s t Q t (s t e t = γ λ e t 1 + θ Q t (s t 13

14 Transfer Learning So far, we learned all policies from scratch Can be unnecessarily slow Humans can use past informauon Soccer with different numbers of players How do we leverage learned knowledge in novel tasks Bias learning : speedup method 14

15 Transfer Learning MaChew E. Taylor and Peter Stone. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research, 10(1: , (Slides courtesy of MaC Taylor Primary Questions Is it possible to transfer learned knowledge? Is it possible to transfer without providing a task mapping? Focus on reinforcement learning tasks - Plenty of work in supervised setngs Source S SOURCE, A SOURCE Target S TARGET, A TARGET 15

16 Value Function Transfer Source Task Environment Action S State S Reward S Q S not defined on S T and A T ρ(q S (S S, A S = Q T (S T, A T Agent Target Task Environment Q S : S S A S R ρ Q T : S T A T R AcUon- Value funcuon transferred ρ is task- dependant: relies on inter- task mappings Action T State T Reward T Agent Inter Task Mappings Source Target Source Target 16

17 Inter Task Mappings χ x: s target s source Given state variable in target task (some x from s=x 1, x 2, x n Return corresponding state variable in source task χ A: a target a source Given acuon in target task Return corresponding acuon in source task S SOURCE A SOURCE Source x1 x k {a 1 a j } χ x χ A Target x1 x n {a 1 a m } S TARGET A TARGET IntuiUve mappings exist in some domains (Oracle Used to construct transfer funcuon Q Value reuse Can be considered a type of reward shaping Directly use Q values from source task to bias learner in target task Not funcuon- approximator specific No iniualizauon step needed between learning the two tasks Drawbacks: an increased lookup Ume and larger memory requirements 17

18 Keepaway [Stone, Sutton, and Kuhlmann 2005] K 1 K 2 Goal: Maintain possession of ball 3 vs. 2: T 1 5 agents 3 (stochasuc acuons 13 (noisy & conunuous state variables K 3 T 2 Keeper with ball may hold ball or pass to either teammate Both takers move towards player with ball 4 vs. 3: 7 agents 4 acuons 19 state variables Transfer Learning in Keepaway 3 vs. 2 Keepaway Or 4 vs. 3 Can learning transfer from 3 vs. 2 to 4 vs. 3? 18

19 Learning Keepaway Sarsa update RBF and neural network approximauon successful Q π (s,a: Predicted number of steps episode will last Reward = +1 for every Umestep ρ s Effect on function approximator For each weight in 4 vs. 3 funcuon approximator: - Use inter- task mapping to find corresponding 3 vs. 2 weight 3 vs. 2 ρ 4 vs. 3 19

20 Keepaway Hand-coded χ A AcUons in 4 vs. 3 have similar acuons in 3 vs. 2 Hold 4v3 Hold 3v2 Pass1 4v3 Pass1 3v2 Pass2 4v3 Pass2 3v2 Pass3 4v3 Pass2 3v2 Value Function Transfer: Time Threshold in 4 vs. 3 No Transfer Simulator Hours Total Time # 3 vs. 2 Episodes } Target Task Time Avg. 4 vs. 3 Ume Avg. 3 vs. 2 Ume 20

Veloso, 2006] Mazes with different structures [Konidaris and

21 Example Transfer Domains Series of mazes with different goals [Fernandez and Veloso, 2006] Example Transfer Domains Series of mazes with different goals [Fernandez and Veloso, 2006] Mazes with different structures [Konidaris and Barto, 2007] 21

Example Transfer Domains Series of mazes with different goals [Fernandez and Veloso, 2006] Mazes with different structures [Konidaris and Barto, 2007] Keepaway with different numbers of players

22 Example Transfer Domains Series of mazes with different goals [Fernandez and Veloso, 2006] Mazes with different structures [Konidaris and Barto, 2007] Keepaway with different numbers of players [Taylor and Stone, 2005] Example Transfer Domains Series of mazes with different goals [Fernandez and Veloso, 2006] Mazes with different structures [Konidaris and Barto, 2007] Keepaway with different numbers of players [Taylor and Stone, 2005] Keepaway to Breakaway [Torrey et al, 2005] 22

23 Transfer Learning So far, we talked about transfer learning in a specific domain Task: MDP Domain: Setng for semanucally similar task What about Cross- Domain Transfer? Source task could be much simpler Source and Target can be less similar What about transferring policies (or just rules? Rule Transfer Overview Learn π Learn D source Translate (D source D target Use D target 1. Learn a policy (π : S A in the source task TD, Policy Search, Model- Based, etc. 2. Learn a decision list, D source, summarizing π 3. Translate (D source D target (applies to target task State variables and acuons can differ in two tasks 4. Use D target to learn a policy in target task Allows for different learning methods and funcuon approximators in source and target tasks 23

24 Rule Transfer Details Learn π Learn D source Translate (D source D target Use D target For example, use Sarsa o Q : S A Return Other learning methods possible Source Task Environment Ac<on State Reward Agent Rule Transfer Details Learn π Learn D source Translate (D source D target Use D target Use learned policy to record S, A pairs Extract a rules list (for example: use JRip (RIPPER in Weka Environment Ac<on State Reward IF s 1 < 4 and s 2 > 5 a 1 ELSEIF s 1 < 3 a 2 ELSEIF s 3 > 7 a 1 Agent State Ac<on 24

25 Rule Transfer Details Learn π Learn D source Translate (D source D target Use D target Inter- task Mappings χ x: s target s source Given state variable in target task (some x from s = x 1, x 2, x n Return corresponding state variable in source task χ A: a target a source Similar, but for acuons χ x χ A rule translate rule Rule Transfer Details Learn π Learn D source Translate (D source D target Use D target K 1 K 2 T 1 K 3 T 2 χ A χ x IF dist(player, Opponent > 4 Stay Stay Hold Ball Run Near Pass to K 2 Run Far Pass to K 3 dist(player, Opponent dist(k 1,T 1 IF dist(k 1,T 1 > 4 Hold Ball 25

26 Rule Transfer Details Learn π Learn D source Translate (D source D target Use D target Many possible ways to use D target o Value Bonus (shaping o Extra AcUon (iniually force agent to select o Extra Variable (iniually force agent to select Assuming TD learner in target task o Should generalize to other learning methods Evaluate agent s 3 acuons in state s = s 1, s 2 Q(s 1, s 2, as 32 1, a = 1 5 = 5 Q(s 1, s 2, as 32, a = 2 3 = Q(s 1, s 2, as 32, a = 3 4 = 4 Q(s 1, s 2, a 4 = 7 (take acuon a 2 D target (s = a 2 Problem Statement Meta- planning problem for agent learning Learn from scratch? Or learn through a series of similar tasks? Start with similar, simple task and move closer to target task MDP MDP MDP MDP 26

Week 6, Lecture 1. Reinforcement Learning (part 3) Announcements: HW 3 due on 11/5 at NOON Midterm Exam 11/8 Project draft due 11/15

Week 6, Lecture 1. Reinforcement Learning (part 3) Announcements: HW 3 due on 11/5 at NOON Midterm Exam 11/8 Project draft due 11/15 ME 537: Learning Based Control Week 6, Lecture 1 Reinforcement Learning (part 3 Announcements: HW 3 due on 11/5 at NOON Midterm Exam 11/8 Project draft due 11/15 Suggested reading : Chapters 7-8 in Reinforcement