Reinforcement Learning

Inverse Reinforcement Learning
LfD, imitation learning/behavioral cloning, apprenticeship learning, IRL.
Hung Ngo, MLR Lab, University of Stuttgart

Outline
- Learning from Demonstrations (LfD)
- Behavioral Cloning / Imitation Learning
- Inverse Reinforcement Learning (IRL) algorithms

Learning from Demonstrations (LfD)
Setting: an oracle teaches an agent how to perform a given task.
Given: samples of an MDP agent's behavior over time and in different circumstances, drawn from a supposedly optimal policy $\pi^o$, i.e.,
- a set of trajectories $\{\xi_i\}_{i=1}^n$, $\xi_i = \{(s_t, a_t)\}_{t=0}^{H_i-1}$, $a_t \sim \pi^o(s_t)$,
- the reward signal $r_t = R(s_t, a_t, s_{t+1})$ is unobserved,
- the transition model $T(s, a, s') = P(s' \mid s, a)$ may be known or unknown.
Goals:
- Recover the teacher's policy $\pi^o$ directly: behavioral cloning, or imitation learning.
- Recover the teacher's latent reward function $R^o(s, a, s')$: IRL.
- Recover the teacher's policy $\pi^o$ indirectly by first recovering $R^o(s, a, s')$: apprenticeship learning via IRL.

Behavioral Cloning
Formulated as a supervised-learning problem:
- Given training data $\{\xi_i\}_{i=1}^n$, $\xi_i = \{(s_t, a_t)\}_{t=0}^{H_i-1}$, $a_t \sim \pi^o(s_t)$.
- Learn a policy mapping $\hat\pi^o : S \to A$.
- Solved using SVMs, (deep) neural networks, etc. (a minimal sketch follows below).
Limitations of behavioral cloning / imitation learning:
- it can only mimic the teacher's trajectories, with no transfer w.r.t. the task (e.g., the environment changes but the goals stay similar),
- it may fail in non-Markovian environments (e.g., in driving, states from several time steps are sometimes needed).
IRL vs. behavioral cloning: recover $\hat{R}^o$ vs. $\hat\pi^o$. Why not recover $V^{\pi^o}$ instead?
- the reward function is more succinct (more easily generalizable/transferable),
- values are trajectory-dependent.
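To make the supervised-learning view concrete, here is a minimal behavioral-cloning sketch. It assumes vector-valued states, discrete actions, and scikit-learn's LogisticRegression as the policy class; none of this is prescribed by the slides, and the demonstration data is made up for illustration.

```python
# Behavioral cloning as supervised learning: a minimal sketch (assumed setup,
# not the lecture's implementation). States are feature vectors, actions are
# discrete labels; any off-the-shelf classifier can serve as the policy class.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical demonstrations: a list of trajectories of (state, action) pairs.
demos = [
    [(np.array([0.0, 1.0]), 0), (np.array([0.5, 0.8]), 1)],
    [(np.array([0.1, 0.9]), 0), (np.array([0.6, 0.7]), 1)],
]

# Flatten the trajectories into a plain supervised dataset of (s_t, a_t) pairs.
X = np.array([s for traj in demos for (s, a) in traj])
y = np.array([a for traj in demos for (s, a) in traj])

# The "cloned" policy is simply the trained classifier.
classifier = LogisticRegression().fit(X, y)

def pi_hat(state):
    """Imitation policy: predict the teacher's action for a given state."""
    return classifier.predict(state.reshape(1, -1))[0]

print(pi_hat(np.array([0.2, 0.9])))
```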

Why IRL?
As a computational model for learning behaviors in the natural world:
- bee foraging (Montague et al., 1995),
- song-bird vocalization (Doya & Sejnowski, 1995).
For constructing an intelligent agent in a particular domain:
- modeling humans and other adversarial/cooperative agents,
- collaborative robots (learn a reward function and plan for cooperative tasks),
- as an intermediate step in apprenticeship learning (autonomous driving, driver preferences, autonomous flight, e.g., helicopter, etc.): Abbeel et al. '04, Ziebart et al. '08, Andrew Ng et al.

Example: Urban Navigation
(Picture from a tutorial by Pieter Abbeel.)

IRL Formulation #1: Small, Discrete MDPs
Given: an incomplete MDP $M = \langle S, A, T, R, \gamma \rangle$ with
- a known transition model $T(s, a, s') = P(s' \mid s, a)$ for all $s, a, s'$,
- an unobserved but bounded reward signal, $|R(s, a, s')| \le r_{\max}$ for all $s, a, s'$ (for simplicity, consider state-dependent reward functions $R(s)$),
- a known, supposedly optimal policy $\pi^o(s)$ for all $s \in S$, instead of trajectories $\{\xi_i\}_{i=1}^n$.
Find $R : S \to [-r_{\max}, r_{\max}]$ such that the teacher's policy $\pi^o$ is optimal and, furthermore, the reward function is simple and robust.
Notes: in the following we fix an enumeration of the state space, $S = \{s_1, \dots, s_{|S|}\}$. Then $R$ is a column vector in $\mathbb{R}^{|S|}$ with $R_i = R(s_i)$.
Andrew Ng, Stuart Russell: Algorithms for Inverse Reinforcement Learning. ICML 2000.

IRL Formulation #1: Small, Discrete MDPs
Find $R \in \mathbb{R}^{|S|}$ such that the teacher's policy $\pi^o$ is optimal. Recall the Bellman optimality theorem (for a known MDP):
$$\pi^o \text{ is optimal} \iff \pi^o(s) \in \arg\max_a Q^{\pi^o}(s, a), \;\; \forall s \in S \iff Q^{\pi^o}(s, \pi^o(s)) \ge Q^{\pi^o}(s, a), \;\; \forall s \in S, a \in A. \quad (*)$$
Define policy-conditioned transition matrices $P^o, P^a \in [0, 1]^{|S| \times |S|}$:
$$[P^o]_{ij} := P(s_j \mid s_i, \pi^o(s_i)), \qquad [P^a]_{ij} := P(s_j \mid s_i, a), \qquad \forall s_i, s_j \in S.$$
We can represent the constraints (*) on $R$ as
$$(P^o - P^a)(I - \gamma P^o)^{-1} R \ge 0, \quad \forall a \in A. \quad (**)$$
Proof: the Bellman equations give $Q^{\pi^o}(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a) V^{\pi^o}(s')$ and $V^{\pi^o} = (I - \gamma P^o)^{-1} R$. Denote by $Q^{\pi^o}_\pi$ the length-$|S|$ column vector with elements $Q^{\pi^o}_\pi(s) := Q^{\pi^o}(s, \pi(s))$, i.e., $Q^{\pi^o}_\pi = R + \gamma P^\pi V^{\pi^o}$. The set of $|S| \cdot |A|$ constraints in (*) can then be written in matrix form (by fixing an action $a$ for all starting states $s \in S$) as $Q^{\pi^o}_{o} - Q^{\pi^o}_{a} \ge 0$ for all $a \in A$, which equals (**) up to the positive factor $\gamma$.
Here $x \ge y$ denotes vectorial (component-wise) inequality: $x_i \ge y_i$ for every index $i$.
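As a numerical sanity check of constraint (**), the sketch below builds the matrices $(P^o - P^a)(I - \gamma P^o)^{-1}$ for a small random MDP and tests whether a candidate reward vector satisfies the optimality constraints for a given policy. The toy MDP and all variable names are assumptions made purely for illustration.

```python
# Check the IRL feasibility constraints (P^o - P^a)(I - gamma P^o)^{-1} R >= 0
# on a toy discrete MDP (assumed setup, for illustration only).
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

# Random transition model: P[a] is an |S| x |S| row-stochastic matrix.
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)

pi_o = np.array([0, 1, 2, 0])          # assumed teacher policy
R = rng.random(S)                       # candidate state-dependent reward

P_o = np.array([P[pi_o[s], s] for s in range(S)])   # policy-conditioned matrix
inv_term = np.linalg.inv(np.eye(S) - gamma * P_o)   # (I - gamma P^o)^{-1}

feasible = True
for a in range(A):
    lhs = (P_o - P[a]) @ inv_term @ R               # one block of constraints
    feasible &= np.all(lhs >= -1e-9)
print("pi_o optimal under R:", feasible)
```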

IRL Formulation #1: Small, Discrete MDPs
Challenges:
- What if the teacher is noisy? (i.e., $a_t \ne \pi^o(s_t)$ at some $t$)
- What if, instead of the full $\pi^o(s)$ for all $s \in S$, we are only given sampled trajectories $\{\xi_i\}_{i=1}^n$?
- Computationally expensive/infeasible: $|S| \cdot |A|$ constraints for each candidate $R$.
- Reward-function ambiguity: IRL is ill-posed! ($R = 0$ is a solution.)
From reward-shaping theory: if the MDP $M$ with reward function $R$ admits $\pi^o$ as an optimal policy, then $M$ with the affine-transformed reward function below also admits $\pi^o$ as an optimal policy:
$$R'(s, a, s') = \alpha R(s, a, s') + \gamma \psi(s') - \psi(s), \quad \text{with } \psi : S \to \mathbb{R}, \; \alpha > 0.$$
One solution (to the reward-ambiguity issue): find a simple and robust $R$, e.g., use an $\ell_1$-norm penalty $\|R\|_1$, and maximize the sum over states of the value margin $\Delta V^{\pi^o}(s)$ between $\pi^o$ and the second-best action:
$$\Delta V^{\pi^o}(s) = Q^{\pi^o}(s, \pi^o(s)) - \max_{a \ne \pi^o(s)} Q^{\pi^o}(s, a) = \min_{a \ne \pi^o(s)} \left[ Q^{\pi^o}(s, \pi^o(s)) - Q^{\pi^o}(s, a) \right].$$

IRL Formulation #1: Small, Discrete MDPs
Combining it all together:
$$\max_{R \in \mathbb{R}^{|S|}} \; \sum_{s \in S} \min_{a \in A \setminus \{\pi^o(s)\}} \left\{ (P^o_s - P^a_s)(I - \gamma P^o)^{-1} R \right\} - \lambda \|R\|_1$$
$$\text{s.t.} \quad (P^o - P^a)(I - \gamma P^o)^{-1} R \ge 0, \;\; \forall a \in A, \qquad |R(s)| \le r_{\max}, \;\; \forall s \in S,$$
with $P^a_s$ the row vector of transition probabilities $P(s' \mid s, a)$, $s' \in S$, i.e., $P^o_s$, $P^a_s$ are the $s$-th rows of $P^o$, $P^a$, respectively.
Linear-program hints: introduce two dummy length-$|S|$ column vectors, $U = |R|$ (element-wise) and $\Gamma$ with $s$-th element $\Gamma_s = \min_{a \in A \setminus \{\pi^o(s)\}} \left\{ (P^o_s - P^a_s)(I - \gamma P^o)^{-1} R \right\}$, and stack them into a length-$3|S|$ variable $x = (R, U, \Gamma)$. With the length-$3|S|$ cost vector $c = (0, -\lambda \mathbf{1}, \mathbf{1})$, the LP becomes
$$\max_x \; c^\top x \quad \text{s.t.} \quad -U \le R \le U, \;\; 0 \le U \le r_{\max}\mathbf{1}, \;\; \Gamma \ge 0, \;\; A^a R \ge 0, \;\; \bar{A}^a R \ge \bar{\Gamma}^a, \;\; \forall a \in A,$$
with $A^a = (P^o - P^a)(I - \gamma P^o)^{-1}$, and $\bar{A}^a$, $\bar{\Gamma}^a$ the matrix and vector obtained by deleting from $A^a$, $\Gamma$ the rows $s$ for which $\pi^o(s) = a$.
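Rather than building the stacked LP by hand, one can also state the penalized objective directly and let a disciplined-convex-programming solver perform the dummy-variable reformulation. The sketch below is an illustration under assumptions: the random toy MDP mirrors the previous snippet, and r_max and lam are made-up hyperparameters, not values from the lecture.

```python
# Ng & Russell-style IRL for a toy discrete MDP, stated directly in cvxpy
# (illustrative sketch only; the random MDP, r_max and lam are assumptions).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
S, A, gamma, r_max, lam = 4, 3, 0.9, 1.0, 0.1
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)          # random row-stochastic transitions
pi_o = np.array([0, 1, 2, 0])              # assumed "teacher" policy

P_o = np.array([P[pi_o[s], s] for s in range(S)])
inv_term = np.linalg.inv(np.eye(S) - gamma * P_o)
A_mats = [(P_o - P[a]) @ inv_term for a in range(A)]   # the matrices A^a

R = cp.Variable(S)
margins, constraints = [], [cp.abs(R) <= r_max]
for s in range(S):
    others = [a for a in range(A) if a != pi_o[s]]
    # value margin of pi_o(s) over every other action in state s
    margins.append(cp.min(cp.hstack([A_mats[a][s] @ R for a in others])))
    constraints += [A_mats[a][s] @ R >= 0 for a in others]

objective = cp.Maximize(cp.sum(cp.hstack(margins)) - lam * cp.norm1(R))
cp.Problem(objective, constraints).solve()
print("recovered reward:", R.value)
```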

IRL Formulation #2: With Linear Function Approximation (LFA)
For large/continuous domains, with sampled trajectories. Assume $s_0 \sim P_0(S)$; for the teacher's policy $\pi^o$ to be optimal:
$$E\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\Big|\, \pi^o\Big] \ge E\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\Big|\, \pi\Big], \quad \forall \pi.$$
Using LFA: $R(s) = w^\top \phi(s)$, where $w \in \mathbb{R}^n$, $\|w\|_2 \le 1$, and $\phi : S \to \mathbb{R}^n$. Then
$$E\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\Big|\, \pi\Big] = E\Big[\sum_{t=0}^{\infty} \gamma^t w^\top \phi(s_t) \,\Big|\, \pi\Big] = w^\top E\Big[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\Big|\, \pi\Big] = w^\top \eta(\pi).$$
The problem becomes: find $w$ such that $w^\top \eta(\pi^o) \ge w^\top \eta(\pi)$ for all $\pi$.
$\eta(\pi)$, the feature expectation of policy $\pi$, can be estimated from sampled trajectories of $\pi$:
$$\eta(\pi) = E\Big[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\Big|\, \pi\Big] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} \gamma^t \phi\big(s^{(i)}_t\big).$$
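The Monte-Carlo estimate of $\eta(\pi)$ on the last line is short to write in practice; below is a small sketch with an assumed trajectory format (lists of states) and an assumed toy feature map, purely for illustration.

```python
# Monte-Carlo estimate of the feature expectation eta(pi) from sampled
# trajectories (illustrative sketch; trajectory format and phi are assumptions).
import numpy as np

def feature_expectation(trajectories, phi, gamma=0.9):
    """trajectories: list of lists of states; phi: state -> feature vector."""
    est = None
    for traj in trajectories:
        for t, s in enumerate(traj):
            f = (gamma ** t) * phi(s)
            est = f if est is None else est + f
    return est / len(trajectories)

# Toy example: scalar states, two-dimensional features [s, s^2].
phi = lambda s: np.array([s, s ** 2])
trajs = [[0.0, 0.5, 1.0], [0.1, 0.4, 0.9]]
print(feature_expectation(trajs, phi))
```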

Apprenticeship Learning: Literature
- Pieter Abbeel, Andrew Ng: Apprenticeship Learning via Inverse RL. ICML 2004.
- Pieter Abbeel et al.: An Application of RL to Aerobatic Helicopter Flight. NIPS 2006.
- Nathan Ratliff et al.: Maximum Margin Planning. ICML 2006.
- Brian D. Ziebart et al.: Maximum Entropy Inverse Reinforcement Learning. AAAI 2008.
- Adam Coates et al.: Apprenticeship Learning for Helicopter Control. Communications of the ACM, 2009.

Apprenticeship Learning via IRL: Max-Margin
From IRL formulation #2, find a policy $\pi$ whose performance is as close to the performance of the oracle's policy $\pi^o$ as possible:
$$\left| w^\top \eta(\pi^o) - w^\top \eta(\pi) \right| \le \epsilon.$$
Also maximize the value margin $\gamma = \min_\pi \left[ w^\top \eta(\pi^o) - w^\top \eta(\pi) \right]$ (here $\gamma$ denotes the margin, not the discount factor).
Constraint-generation algorithm:
1: Initialize $\pi_0$ (depending on the chosen RL algorithm, e.g., tabular, approximate RL, etc.).
2: for $i = 1, 2, \dots$ do
3: Find a reward function such that the teacher maximally outperforms all previously found controllers:
$$\max_{\gamma, \|w\|_2 \le 1} \; \gamma \quad \text{s.t.} \quad w^\top \eta(\pi^o) \ge w^\top \eta(\pi) + \gamma, \;\; \forall \pi \in \{\pi_0, \pi_1, \dots, \pi_{i-1}\}.$$
4: Find an optimal policy $\pi_i$ for the reward function $R_w$ w.r.t. the current $w$ (using any RL algorithm, e.g., tabular, approximate RL, etc.).
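A skeleton of this loop might look as follows; it is a sketch under assumptions, not the lecture's code. The max-margin step is solved with cvxpy, and the caller supplies two functions, `solve_mdp_for_reward` (any RL algorithm, step 4) and `estimate_feature_expectation` (e.g., the Monte-Carlo estimator sketched earlier); both names are hypothetical placeholders.

```python
# Max-margin apprenticeship learning via IRL: constraint-generation skeleton.
# Illustrative sketch only; `solve_mdp_for_reward` and
# `estimate_feature_expectation` are hypothetical callables supplied by the user.
import cvxpy as cp

def max_margin_step(eta_expert, eta_policies):
    """Step 3: find w (||w||_2 <= 1) maximizing the expert's margin over all
    previously found policies."""
    w = cp.Variable(eta_expert.shape[0])
    margin = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ eta_expert >= w @ eta_p + margin for eta_p in eta_policies]
    cp.Problem(cp.Maximize(margin), constraints).solve()
    return w.value, margin.value

def apprenticeship_learning(eta_expert, initial_policy, solve_mdp_for_reward,
                            estimate_feature_expectation, n_iters=20, eps=1e-3):
    policies = [initial_policy]
    etas = [estimate_feature_expectation(initial_policy)]
    w = None
    for _ in range(n_iters):
        w, margin = max_margin_step(eta_expert, etas)       # step 3
        if margin <= eps:            # expert no longer clearly outperforms: stop
            break
        pi_i = solve_mdp_for_reward(w)                      # step 4 (any RL alg)
        policies.append(pi_i)
        etas.append(estimate_feature_expectation(pi_i))
    return w, policies
```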

Other Resources
- An excellent survey on LfD and its various formulations (see also section ~/Current_Work).
- Pieter Abbeel's simulated highway driving.
- MLR Lab's learning to open a door.
- Relational activity processes for the toolbox-assembly LfD task.

Appendix: Quick Review of Convex Optimization
Slides from Marc Toussaint's Introduction to Optimization lectures.
Solvers: CVX (MATLAB), CVXOPT (Python), etc.

Linear and Quadratic Programs
Linear Program (LP): $\min_x \; c^\top x$ s.t. $Gx \le h$, $Ax = b$.
LP in standard form: $\min_x \; c^\top x$ s.t. $x \ge 0$, $Ax = b$.
Quadratic Program (QP): $\min_x \; \tfrac{1}{2} x^\top Q x + c^\top x$ s.t. $Gx \le h$, $Ax = b$,
where $x \in \mathbb{R}^n$ and $Q$ is positive definite.
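Both problem classes can be handed directly to one of the solvers mentioned in the appendix. The snippet below solves a tiny LP and QP with cvxpy on made-up data; the problem data and variable names are illustrative assumptions.

```python
# Solving a small LP and QP with cvxpy (made-up data, for illustration only).
import numpy as np
import cvxpy as cp

c = np.array([1.0, 2.0])
G, h = np.array([[-1.0, 0.0], [0.0, -1.0]]), np.zeros(2)   # encodes x >= 0
A, b = np.array([[1.0, 1.0]]), np.array([1.0])             # x1 + x2 = 1

x = cp.Variable(2)
lp = cp.Problem(cp.Minimize(c @ x), [G @ x <= h, A @ x == b])
lp.solve()
print("LP solution:", x.value)

Q = np.array([[2.0, 0.0], [0.0, 2.0]])                     # positive definite
qp = cp.Problem(cp.Minimize(0.5 * cp.quad_form(x, Q) + c @ x),
                [G @ x <= h, A @ x == b])
qp.solve()
print("QP solution:", x.value)
```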

Transforming an LP into standard form
LP problem:
$$\min_x \; c^\top x \quad \text{s.t.} \quad Gx \le h, \; Ax = b.$$
Define slack variables:
$$\min_{x, \xi} \; c^\top x \quad \text{s.t.} \quad Gx + \xi = h, \; Ax = b, \; \xi \ge 0.$$
Express $x = x^+ - x^-$ with $x^+, x^- \ge 0$:
$$\min_{x^+, x^-, \xi} \; c^\top (x^+ - x^-) \quad \text{s.t.} \quad G(x^+ - x^-) + \xi = h, \; A(x^+ - x^-) = b, \; \xi \ge 0, \; x^+ \ge 0, \; x^- \ge 0,$$
where $(x^+, x^-, \xi) \in \mathbb{R}^{2n + m}$.
This now conforms to the standard form (substituting $(x^+, x^-, \xi) \to z$, etc.):
$$\min_z \; w^\top z \quad \text{s.t.} \quad z \ge 0, \; Dz = e.$$
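The substitution can be carried out mechanically by stacking block matrices; here is a small illustrative helper (the function name and interface are assumptions, not part of the slides).

```python
# Convert min c'x s.t. Gx <= h, Ax = b into standard form min w'z s.t. z >= 0, Dz = e
# by splitting x = x+ - x- and adding slack variables (illustrative helper).
import numpy as np

def lp_to_standard_form(c, G, h, A, b):
    m, n = G.shape
    p = A.shape[0]
    # z = (x+, x-, xi), with xi the slacks for the inequality constraints.
    w = np.concatenate([c, -c, np.zeros(m)])
    D = np.block([[G, -G, np.eye(m)],
                  [A, -A, np.zeros((p, m))]])
    e = np.concatenate([h, b])
    return w, D, e
```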

Algorithms for Linear Programming
- Constrained-optimization methods: augmented Lagrangian (LANCELOT software), penalty, log barrier ("interior point method", "[central] path following"), primal-dual Newton.
- The simplex algorithm: walking on the constraints. (The emphasis in the notion of "interior point" methods is to distinguish them from such constraint-walking methods.)
- Interior-point and simplex methods are comparably efficient; which one is better depends on the problem.

Quadratic Programming
$$\min_x \; \tfrac{1}{2} x^\top Q x + c^\top x \quad \text{s.t.} \quad Gx \le h, \; Ax = b.$$
Efficient algorithms: interior point (log barrier), augmented Lagrangian, penalty.
Highly relevant applications: support vector machines and similar max-margin modeling methods.

Example: Support Vector Machine
Primal:
$$\max_{\beta, \|\beta\| = 1} \; M \quad \text{s.t.} \quad \forall i: \; y_i \, (\phi(x_i)^\top \beta) \ge M.$$
Dual:
$$\min_\beta \; \|\beta\|^2 \quad \text{s.t.} \quad \forall i: \; y_i \, (\phi(x_i)^\top \beta) \ge 1.$$
(Figure omitted: two classes A and B separated by a max-margin hyperplane.)
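To connect this with the QP machinery above, the second formulation can be solved directly as a small QP. The snippet uses synthetic linearly separable data and the identity feature map; all data and names are assumptions for illustration.

```python
# Hard-margin SVM as a QP: min ||beta||^2 s.t. y_i (x_i . beta) >= 1
# (toy synthetic data, identity feature map phi(x) = x, for illustration).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[3.0, 3.0], size=(20, 2))
X_neg = rng.normal(loc=[-3.0, -3.0], size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(20), -np.ones(20)])

beta = cp.Variable(2)
constraints = [cp.multiply(y, X @ beta) >= 1]     # y_i (x_i . beta) >= 1
cp.Problem(cp.Minimize(cp.sum_squares(beta)), constraints).solve()
print("beta:", beta.value, "margin:", 1 / np.linalg.norm(beta.value))
```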
