Reinforcement Learning

1 Reinforcement Learning: Inverse Reinforcement Learning. Inverse RL, behaviour cloning, apprenticeship learning, imitation learning. Vien Ngo, Marc Toussaint, University of Stuttgart

2 Outline: Introduction to Inverse RL; Inverse RL vs. behavioral cloning; IRL algorithms. (Inspired by a lecture by Pieter Abbeel.)

3 Inverse RL: Informal Definition. Given: measurements of an agent's behaviour $\pi$ over time, $(s_t, a_t, s'_t)$, in different circumstances. If available, the transition model is also given (the reward function is not). Goal: find the reward function $R^\pi(s, a, s')$.

4 Inverse Reinforcement Learning [Diagram with the labels: RL, Agent, Reward, Dynamics Model, Policy, Imitation/Apprenticeship Learning, IRL, Expert's Demonstration. Inspired by a poster by Boularias, Kober, Peters.]

6 Motivation: Two Sources. (1) The potential use of RL and related methods as computational models of animal and human learning: bee foraging (Montague et al 1995), song-bird vocalization (Doya & Sejnowski 1995), ... (2) Construction of an intelligent agent in a particular domain: car driving, helicopter flight (Ng et al), ... (imitation learning, apprenticeship learning).

7 Examples: car driving simulation (Abbeel et al 2004, etc.); autonomous helicopter flight (Andrew Ng et al.); urban navigation (Ziebart, Maas, Bagnell and Dey, AAAI 2008: route recommendation and destination prediction); etc.

8 Problem Formulation. Given: state space $S$ and action space $A$; transition model $T(s, a, s') = P(s' \mid s, a)$; no reward function $R(s, a, s')$; the teacher's demonstration (from the teacher's policy $\pi^*$): $s_0, a_0, s_1, a_1, \ldots$ IRL: recover $R$. Apprenticeship learning via IRL: use $R$ to compute a good policy. Behaviour cloning: use supervised learning to learn the teacher's policy directly.

9 IRL vs. behavioral cloning

12 IRL vs. Behavioral cloning. Behavioral cloning is formulated as a supervised-learning problem (using SVMs, neural networks, deep learning, ...): given pairs $(s_0, a_0), (s_1, a_1), \ldots$ generated from a policy $\pi^*$, estimate a policy mapping $s$ to $a$. Behavioral cloning can only mimic the teacher's trajectories, so it cannot cope with a change of goal/destination or with a non-Markovian environment (e.g. car driving). In short, IRL vs. behavioral cloning is estimating $\hat{R}$ vs. estimating $\hat{\pi}$.
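
As an illustration of the supervised-learning view, here is a minimal sketch of tabular behavioral cloning. It assumes discrete states and a majority vote per state as the "classifier"; the function name `clone_policy` and the corridor demonstrations are hypothetical and not taken from the slides.

```python
from collections import Counter, defaultdict

def clone_policy(demos):
    """Tabular behavioral cloning: for each demonstrated state,
    pick the action the teacher chose most often in that state."""
    counts = defaultdict(Counter)
    for trajectory in demos:
        for state, action in trajectory:
            counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

# Hypothetical demonstrations on a 1-D corridor; states are integers.
demos = [
    [(0, "right"), (1, "right"), (2, "right")],
    [(1, "right"), (2, "right")],
    [(2, "left")],
]
policy_hat = clone_policy(demos)
```

The estimated policy is defined only on states that appear in the demonstrations (state 3 has no entry here), which is one way to see why cloning struggles when the goal or the visited region changes.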

13 Inverse Reinforcement Learning

15 IRL: Mathematical Formulation. Given: state space $S$ and action space $A$; transition model $T(s, a, s') = P(s' \mid s, a)$; no reward function $R(s, a, s')$; the teacher's demonstration (from the teacher's policy $\pi^*$): $s_0, a_0, s_1, a_1, \ldots$ Find $R^*$ such that
$$\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\Big|\, \pi^*\Big] \;\ge\; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\Big|\, \pi\Big], \quad \forall \pi.$$
Challenges: $R = 0$ is a solution (reward function ambiguity), and multiple $R^*$ satisfy the condition above. $\pi^*$ is only given partially, through trajectories, so it is unclear how to evaluate the expectation terms.
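
The ambiguity can be checked directly from the condition. The short derivation below is a sketch that assumes an infinite horizon with $\gamma < 1$; the constants $c$ and $b$ are introduced here for illustration only.

```latex
% R = 0: both sides vanish, so the condition holds for every policy.
\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \cdot 0 \,\Big|\, \pi^*\Big]
  = 0 \;\ge\; 0 =
\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \cdot 0 \,\Big|\, \pi\Big]
  \quad \forall \pi

% Positive scaling and constant shifts: for c \ge 0 and any constant b,
\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \big(c\,R^*(s_t) + b\big) \,\Big|\, \pi\Big]
  = c\,\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\Big|\, \pi\Big]
  + \frac{b}{1-\gamma}
% The shift b/(1-\gamma) is the same for every policy, so if R^* satisfies
% the condition, then c R^* + b satisfies it as well.
```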

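17 IRL: Finite state spaces. Bellman equations: $V^\pi = (I - \gamma P^\pi)^{-1} R$. IRL then finds $R$ such that
$$(P_{a^*} - P_a)(I - \gamma P_{a^*})^{-1} R \ge 0, \quad \forall a$$
(considering only deterministic policies). IRL as linear programming with an $\ell_1$ penalty:
$$\max_R \; \sum_{i=1}^{|S|} \min_{b \in A \setminus \{a^*\}} \Big\{ \big(P_{a^*}(i) - P_b(i)\big)(I - \gamma P_{a^*})^{-1} R \Big\} - \lambda \|R\|_1$$
$$\text{s.t.} \quad (P_{a^*} - P_b)(I - \gamma P_{a^*})^{-1} R \ge 0 \;\; \forall b \in A \setminus \{a^*\}, \qquad |R(i)| \le R_{\max}.$$
The objective maximizes the sum of differences between the value of the optimal action and that of the next-best action, with an $\ell_1$ penalty on $R$.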
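
The feasibility condition above can be checked numerically for a candidate reward. The sketch below assumes a hypothetical 3-state chain in which the expert always moves right, and it approximates $(I - \gamma P_{a^*})^{-1} R$ by the fixed-point iteration $V \leftarrow R + \gamma P_{a^*} V$ rather than an explicit matrix inverse. It only tests feasibility; it does not solve the linear program.

```python
def matvec(P, v):
    """Multiply a row-stochastic matrix (list of rows) by a vector."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def evaluate(P, R, gamma, iters=2000):
    """Approximate V = (I - gamma P)^{-1} R via V <- R + gamma P V."""
    V = [0.0] * len(R)
    for _ in range(iters):
        V = [r + gamma * pv for r, pv in zip(R, matvec(P, V))]
    return V

def is_consistent(P_expert, P_others, R, gamma, tol=1e-9):
    """Check (P_a* - P_b)(I - gamma P_a*)^{-1} R >= 0 for every other action b."""
    V = evaluate(P_expert, R, gamma)
    expert_next = matvec(P_expert, V)
    for P_b in P_others:
        other_next = matvec(P_b, V)
        if any(e - o < -tol for e, o in zip(expert_next, other_next)):
            return False
    return True

# Hypothetical 3-state chain: "right" is the expert action a*, "left" the alternative.
P_right = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]
P_left = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]

goal_right = is_consistent(P_right, [P_left], [0.0, 0.0, 1.0], gamma=0.9)
goal_left = is_consistent(P_right, [P_left], [1.0, 0.0, 0.0], gamma=0.9)
zero_reward = is_consistent(P_right, [P_left], [0.0, 0.0, 0.0], gamma=0.9)
```

A reward at the right end explains the expert, a reward at the left end does not, and the all-zero reward passes the check as well, which is the degeneracy the $\ell_1$-penalized objective is meant to resolve.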

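19 IRL: With function approximation (FA) in large state spaces. Using FA: $R(s) = w \cdot \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$. Thus,
$$\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\Big|\, \pi\Big] = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, w^\top \phi(s_t) \,\Big|\, \pi\Big] = w^\top\, \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\Big|\, \pi\Big] = w \cdot \eta(\pi).$$
The optimization problem: find $w^*$ such that $w^* \cdot \eta(\pi^*) \ge w^* \cdot \eta(\pi)$ for all $\pi$. $\eta(\pi)$ can be estimated from $N$ trajectories sampled from $\pi$, where trajectory $i$ has length $T_i$:
$$\eta(\pi) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} \gamma^t \phi\big(s_t^{(i)}\big).$$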
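
The sample-based estimate of $\eta(\pi)$ is a discounted sum of feature vectors averaged over trajectories. A minimal sketch follows, assuming one-hot features over three states; the feature map `phi`, the trajectories, and the weight vector are hypothetical.

```python
def feature_expectations(trajectories, phi, gamma):
    """Monte Carlo estimate: eta = (1/N) * sum_i sum_t gamma^t * phi(s_t^(i))."""
    n = len(phi(trajectories[0][0]))
    eta = [0.0] * n
    for states in trajectories:
        discount = 1.0
        for s in states:
            for k, f in enumerate(phi(s)):
                eta[k] += discount * f
            discount *= gamma
    return [e / len(trajectories) for e in eta]

def phi(s):
    """Hypothetical one-hot features over 3 states."""
    return [1.0 if s == k else 0.0 for k in range(3)]

trajs = [[0, 1, 2], [1, 2]]          # two sampled state sequences
eta = feature_expectations(trajs, phi, gamma=0.5)

w = [0.0, 0.0, 1.0]                  # reward only in state 2
value = sum(wk * ek for wk, ek in zip(w, eta))   # w . eta(pi)
```

Because the return is linear in $w$, the same `eta` can be reused to score the policy under any candidate weight vector without resampling.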

20 Apprenticeship learning (Abbeel & Ng, 2004)

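22 Apprenticeship learning. Find a policy $\pi$ whose performance is as close as possible to the expert policy's performance: $|w \cdot \eta(\pi^*) - w \cdot \eta(\pi)| \le \epsilon$.
1: Assume $R(s) = w \cdot \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$.
2: Initialize $\pi_0$.
3: for $i = 1, 2, \ldots$ do
4: Find a reward function such that the teacher maximally outperforms all previously found controllers:
$$\max_{\gamma,\; \|w\|_2 \le 1} \gamma \quad \text{s.t.} \quad w \cdot \eta(\pi^*) \ge w \cdot \eta(\pi) + \gamma, \quad \forall \pi \in \{\pi_0, \pi_1, \ldots, \pi_{i-1}\}$$
(in this step $\gamma$ denotes the margin, not the discount factor).
5: Find the optimal policy $\pi_i$ for the reward function $R_w$ w.r.t. the current $w$.
6: end for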
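
The loop above can be sketched on a toy problem. Note the substitution: instead of solving the max-margin program in step 4, this sketch uses the simpler projection variant described by Abbeel & Ng (2004), where $w$ is the difference between the expert's feature expectations and a running projection point. The 3-state chain, the one-hot features, and all function names are assumptions made for illustration.

```python
GAMMA, N_STATES, HORIZON = 0.9, 3, 200
ACTIONS = (-1, +1)  # left, right on a hypothetical 3-state chain

def step(s, a):
    return min(max(s + a, 0), N_STATES - 1)

def phi(s):
    return [1.0 if s == k else 0.0 for k in range(N_STATES)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def eta(policy, start=0):
    """Discounted feature expectations of a deterministic policy (truncated rollout)."""
    total, s, discount = [0.0] * N_STATES, start, 1.0
    for _ in range(HORIZON):
        total = [t + discount * f for t, f in zip(total, phi(s))]
        s, discount = step(s, policy[s]), discount * GAMMA
    return total

def optimal_policy(w, iters=500):
    """Step 5: value iteration for the reward R(s) = w . phi(s)."""
    V = [0.0] * N_STATES
    for _ in range(iters):
        V = [dot(w, phi(s)) + GAMMA * max(V[step(s, a)] for a in ACTIONS)
             for s in range(N_STATES)]
    return [max(ACTIONS, key=lambda a: V[step(s, a)]) for s in range(N_STATES)]

def apprenticeship(eta_expert, eps=1e-6, max_iters=20):
    policy = [-1] * N_STATES              # pi_0: always move left
    eta_bar = eta(policy)                 # running projection point
    w = None
    for _ in range(max_iters):
        gap = [e - b for e, b in zip(eta_expert, eta_bar)]
        if dot(gap, gap) ** 0.5 <= eps:   # expert matched within eps
            break
        w = gap                           # projection variant of step 4
        policy = optimal_policy(w)
        d = [m - b for m, b in zip(eta(policy), eta_bar)]
        if dot(d, d) == 0.0:              # no progress possible
            break
        scale = dot(d, w) / dot(d, d)     # project eta_expert onto the line through eta_bar
        eta_bar = [b + scale * x for b, x in zip(eta_bar, d)]
    return policy, w

expert = [+1] * N_STATES                  # the teacher always moves right
learned_policy, w = apprenticeship(eta(expert))
```

On this chain the first weight vector already rewards the right-hand state more than the left-hand one, so the learned policy matches the teacher after a single policy-optimization step; larger problems generally need several iterations.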

23 Examples

24 Simulated Highway Driving. The dynamics model $T(s, a, s')$ is given. Each teacher demonstrates for 1 minute. (Abbeel et al. 2004)

25 Simulated Highway Driving [Figure: expert demonstration (left), learned control (right).]

26 Urban Navigation [Figure: picture from a tutorial by Pieter Abbeel.]

27 References
Andrew Y. Ng, Stuart J. Russell: Algorithms for Inverse Reinforcement Learning. ICML 2000.
Pieter Abbeel, Andrew Y. Ng: Apprenticeship learning via inverse reinforcement learning. ICML 2004.
Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng: An Application of Reinforcement Learning to Aerobatic Helicopter Flight. NIPS 2006: 1-8.
Adam Coates, Pieter Abbeel, Andrew Y. Ng: Apprenticeship learning for helicopter control. Commun. ACM 52(7) (2009).
