Maximum Margin Planning

Size: px

Start display at page:

Download "Maximum Margin Planning"

Edwina Day
6 years ago
Views:

1 Maximum Margin Planning Nathan Ratliff, Drew Bagnell and Martin Zinkevich Presenters: Ashesh Jain, Michael Hu CS6784 Class Presentation

2 Theme 1. Supervised learning 2. Unsupervised learning 3. Reinforcement learning Reward based learning Inverse Reinforcement Learning

3 Motivating Example (Abbeel et. al)

4 Outline Basics of Reinforcement Learning State and action space Activity Introducing Inverse RL Max-margin formulation Optimization algorithms Activity Conclusion

5 Reward-based Learning

6 Markov Decision Process (MDP) MDP is defined by (X, A, p, R) X A p R State space Action space Dynamics model p(x x, a) Reward function R(x) or R(x, a) 0.1 Y X

7 Markov Decision Process (MDP) MDP is defined by (X, Policy A, p, R) X A p R π: X A State space Action space Dynamics model p(x x, a) Reward function R(x) or R(x, a) 0.1 Y Example policy X

8 Graphical representation of MDP a t x t R t a t+1 x t+1 R t+1 Goal: Learn an optimal policy π: X A π = arg max E π γt R t (x t, π(x t )) t=0 x t, π x t x t+1, π x t+1

9 Activity Very Easy!!! Deterministic Dynamics Draw the optimal Policy!

10 Activity Very Easy!!!

11 Activity Solution

12 Outline Basics of Reinforcement Learning State and action space Activity Introducing Inverse RL Max-margin formulation Optimization algorithms Activities Conclusion

13 RL to Inverse-RL (IRL) State space: X Action space:a Optimal Reward Function Policy π π: R(x, a) a Inverse RL Algorithm Dynamics Model p(x x, a) Optimal Reward Function Policy π π: R(x, a) a Inverse problem: Given optimal policy π or samples drawn from it, recover the reward R(x, a)

14 Why Inverse-RL? State space: X Action space:a Reward Function R(x, a) RL Algorithm Specifying R(x, a) is hard, but samples from π are easily available Ratliff et. al. Kober et. al. Abbeel et. al.

15 IRL framework π : x a Expert Interacts y = x 1, a 1 x 2, a 2 x 3, a 3 x n, a n Demonstration w T f y = w T w T w T w T Reward

16 Paper contributions 1. Formulated IRL as structural prediction Maximum margin approach 2. Two optimization algorithms Problem specific solver Sub-gradient method 3. Robotic application: 2D navigation etc.

17 Outline Basics of Reinforcement Learning State and action space Activity Introducing Inverse RL Max-margin formulation Optimization algorithms Activities Conclusion

18 Formulation n Input X i, A i, p i, f i, y i, L i i=1 X i A i p i y i f i ( ) State space Action space Dynamics model Expert demonstration Feature function L i (, y i ) Loss function 0.1 Y X

19 Max-margin formulation λ min w,ξ i 2 w n s. t. i w T f i y i i β i ξ i max y Y i w T f i y + L y, y i Features linearly decompose over path f i y ξ i = F i μ F i = d μ = Feature map X A Frequency counts X A

20 Max-margin formulation μ λ min w,ξ i 2 w n s. t. i i β i ξ i w T F i μ i max μ G i w T F i μ + l i T μ ξ i satisfies Bellman-flow constraints x x,a μ x,a p i x x, a + s i x = a μ x,a Inflow Outflow

21 Outline Basics of Reinforcement Learning State and action space Activity Introducing Inverse RL Max-margin formulation Optimization algorithms Problem specific QP-formulation Sub-gradient method Activities Conclusion

22 Problem specific QP-formulation λ min w,ξ i 2 w n s. t. i i β i ξ i w T F max w T F i μ + l T i μ i s T i v i ξ i i μ ξ i μ G i μ x,a p i x x x, a + s i = μ x,a i, x, a v x i w T F i + l x,a i + p i x x x x, a v i x,a x a Inflow Outflow Can be optimized using off-the-shelf QP solvers

23 Outline Basics of Reinforcement Learning State and action space Activity Introducing Inverse RL Max-margin formulation Optimization algorithms Problem specific QP-formulation Sub-gradient method Activities Conclusion

24 Optimizing via Sub-gradient λ min w,ξ i 2 w n s. t. i i β i ξ i w T F i μ i max μ G i w T F i μ + l i T μ ξ i Re-writing constraint Its tight at optimality ξ i max μ G i w T F i μ + l i T μ w T F i μ i 0 ξ i = max μ G i w T F i μ + l i T μ w T F i μ i μ satisfies Bellman-flow constraints

25 Optimizing via Sub-gradient c w λ = min w,ξ i 2 w n i β i max μ G i w T F i μ + l i T μ w T F i μ i Ratliff, PhD Thesis

26 Weight update via Sub-gradient Standard gradient descent update w t+1 w t α αg wt c(w) λ min w,ξ i 2 w n β i μ = argmax μ G i w t T F i + l i T μ i max μ G i w T F i μ + l i T μ w T F i μ i Loss-augmented reward g wt = λw t + 1 n i β i F i (μ μ i )

27 Convergence summary n c w = λ 2 w n i=1 r i w w w αg w How to chose α? If w g w G then a constant step size 0 < α 1 λ gives linear convergence to a region around the minimum w t+1 w 2 κ t+1 w 0 w 2 + αg2 λ Optimization error 0 < κ < 1

28 Outline Basics of Reinforcement Learning State and action space Activity Introducing Inverse RL Max-margin formulation Optimization algorithms Activity Conclusion

29 Convergence proof We know w g w G w t+1 w 2 = w t αg t w 2 = w t w 2 + α 2 g t 2 2αg t T (w t w )? c(w) w w c w c w + g T w w w + λ 2 λ Substitute w w w w t 2 w t w 2 g t T (w t w ) Use c w c(w t ) w w 2 1 αλ w t w 2 + α 2 G 2

30 Outline Basics of Reinforcement Learning State and action space Activity Introducing Inverse RL Max-margin formulation Optimization algorithms Activity Conclusion

31 2D path planning Expert s Demonstration Learned Reward Planning in a new Environment

32 Car driving (Abbeel et. al)

33 Bad driver (Abbeel et. al)

34 Recent works 1. N. Ratliff, D. Bradley, D. Bagnell, J. Chestnutt. Boosting Structures Prediction for Imitation Learning. In NIPS B. Ziebart, A. Mass, D. Bagnell, A. K. Dey. Maximum Entropy Inverse Reinforcement Learning. In AAAI P. Abbeel, A. Coats, A. Ng. Autonomous helicopter aerobatics through apprenticeship learning. In IJRR S. Ross, G. Gordon, D. Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In AISTATS B. Akgun, M. Cakmak, K. Jiang, A. Thomaz. Keyframe-based leaning from demonstrations. In IJSR A. Jain, B. Wojcik, T. Joachims, A. Saxena. Learning Trajectory Preferences for Manipulators via Iterative Improvement. In NIPS 2013

35 Questions Summary of key points: Reward-based learning can be modeled by the MDP framework. Inverse Reinforcement Learning takes perception features as input and outputs a reward function. The max-margin planning problem can be formulated as a tractable QP. Sub-gradient descent solves an equivalent problem with provable convergence bound.

Reinforcement Learning

Reinforcement Learning Inverse Reinforcement Learning Inverse RL, behaviour cloning, apprenticeship learning, imitation learning. Vien Ngo Marc Toussaint University of Stuttgart Outline Introduction to