Optimal Control with Learned Forward Models

Size: px

Start display at page:

Download "Optimal Control with Learned Forward Models"

Jeffry McCormick
5 years ago
Views:

1 Optimal Control with Learned Forward Models Pieter Abbeel UC Berkeley Jan Peters TU Darmstadt 1

2 Where we are? Reinforcement Learning Data = {(x i, u i, x i+1, r i )}} x u xx r u xx V (x) π (u x) Now V (x) π (u x) Policy Search 2 π (u x) Optimal Control with Model Learning Value Function Methods

3 Goal of this Lecture 1. Step: Learn an Forward Model 2. Step: Use your favorite Optimal Control Method to get an optimal policy action Robot Policy Forward Model state action Forward Model state 3

4 Outline of the Lecture 1.Introduction to Optimal Control 2.Solving Linear-Quadratic Optimal Control Problems 3.Optimal Control with Learned Models 4.Hot story: Marc Deisenroth s PILCO Approach 5.Final Remarks 4

5 Example: Ball Paddling What are the states x? Ball Velocity Racket Velocity Ball Position Racket Position 5

6 Example: Ball Paddling What are the actions u? All motor torques? If you do not have a model... Joint Accelerations? Perfect, if you have a good model... Maybe identify the proper degrees of freedom? Accelerations in Operational Space? Ideally!... but only if you have a good operational space control law! 6

7 Example: Ball Paddling What are good rewards r? Task knowledge or success/failure? For some algorithms rewards in {1,0} are perfect... Real problems often require reward shaping... What s a good reward for our problem? Height of the ball? Distance between ball and the paddle? Ball needs to move in a certain region? All of the above? Additional punishments? All of these together do the job! 7

8 So can we get this to work? The state space has at least 12 dimensions. The action space has at least 3 dimensions. Can discretizations deal with such spaces? No! Finding an Optimal Value Function is limited by the curse of dimensionality. 8

9 Example: Real world application... So how can you get this by RL? 9

10 Outline of the Lecture 1.Introduction to Optimal Control 2.Solving Linear-Quadratic Optimal Control Problems 3.Optimal Control with Learned Models 4.Hot story: Marc Deisenroth s PILCO Approach 5.Final Remarks 10

11 Optimal Control Goal The goal of optimal control is find a policy u ~ π(x) such that J(π) = lim T 1 T E T k=0 r(x t, u t ) is maximal for a given reward function such as r(x, u) = x T Qx u T Ru system: x = Ax + Bu + For Simplicity of Derivation! Nothing Changes! 11...so how do we solve this?!

12 Plot the Problem x = Ax + Bu + r(x, u) = x T Qx u T Ru 12

Bellman Recipe: Steps 1+2 1.At the last step, we have the value function V T (x) =0! 2.

13 Bellman Recipe: Steps At the last step, we have the value function V T (x) =0! 2.For t=t-1, compute optimal policy such that πt (u x) =argmax π r(x, u)+v t+1 (f(x, u)) determined by d du 13 r(x, u)+v t+1 (f(x, u)) = 0 d du x T Qx u T Ru) = 0 u = 0

14 Bellman Recipe: Step Obtain next value function V t (x) =max π r(x, u)+v t+1 (f(x, u)) V t+1(x) = r(x, u )+V t+1(f(x, u )) = x T Qx u T Ru = x T Qx 4.As not converged, go back to Step 2. 14!

15 Bellman Recipe: Step 2 again! 2.For t<t-1, compute optimal policy such that πt (u x) =argmax π r(x, u)+v t+1 (f(x, u)) determined by d r(x, u)+v du t+1 (f(x, u)) = 0 d x T Qx u T Ru f(x, u) T P t+1 f(x, u) du = 0 d x T Qx u T Ru (Ax + Bu) T P t+1 (Ax + Bu) du = 0 which implies d du Ru B T P t+1 (Ax + Bu) = 0 Policy Parameters! 15 u = (R + B T P t+1 B) 1 B T P t+1 Ax = θ t x!

16 Bellman Recipe: Step Obtain next value function V t (x) =max π r(x, u)+v t+1 (f(x, u)) 4. We have a converged to Recursion: P t = Q θ T t+1rθ t+1 (A + Bθ t+1 ) T P t+1 +(A + Bθ t+1 ) 16 V t (x) = r(x, u )+V t+1(ax + Bu ) = x T Qx (θ t x) T R(θ t x) (Ax + Bθ t+1 x) T P t+1 +(Ax + Bθ t x) = x T [Q θ T t Rθ t (A + Bθ t ) T P t+1 +(A + Bθ t )]x = x T P t x θ t = (R + B T P t+1 B) 1 B T P t+1 A!

17 Optimal Solution u = θ t x 17!

18 Outline of the Lecture 1.Introduction to Optimal Control 2.Solving Linear-Quadratic Optimal Control Problems 3.Optimal Control with Learned Models 4.Hot story: Marc Deisenroth s PILCO Approach 5.Final Remarks 18

19 Example: Swing-Up System Reward 19

20 Possible: Learn Solutions only where needed! If you know places where we start we can just look ahead! 20 Atkeson & Schaal, 1995

21 21 Atkeson & Schaal, 1995 Local Solutions Every smooth function can be modeled with a Taylor expansion f(x) =f(a)+ df dx (x a)+ 1 (x a)t d2 x=a 2 f dx 2 (x a)+... x=a Hence, we can also approximate: x f(ˆx, û)+ df dx (x ˆx)+ df (u û) ˆx,û duˆx,û r r(ˆx, û)+ = a 0 t + A t(x ˆx)+B t (u û) dr dx dr du T x ˆx u û x ˆx u û T d 2 r dx 2 d 2 r dxdu d 2 r dxdu d 2 r du 2 x ˆx u û

22 Bellman Recipe: Steps At the last step, we have the value function V T (x) =0! 2.For t=t-1, compute optimal policy such that πt (u x) =argmax π r(x, u)+v t+1 (f(x, u)) u = û gives. 3.Obtain next value function 4.As not converged, go back to Step V t (x) =max π r(x, u)+v t+1 (f(x, u)) V t (x) = r t+1 (x ˆx t ) T Q t+1 (x ˆx t )

23 Bellman Recipe: Step 2-4 again! 2.For t<t-1, compute optimal policy such that πt (u x) =argmax π r(x, u)+v t+1 (f(x, u)) determined by u = û t (R t + B T t P t B t ) 1 B T t P t+1 (a 0 t + A t (x ˆx t )) = θ 1 t (x ˆx t )+θ 0 t 3.Obtain the recursions θ 1 t = û t (R t + B T t P t+1 B t ) 1 B T t P t+1 A t (x ˆx t ) θ 0 t = û t (R t + B T t P t+1b t ) 1 B T t P t+1a 0 t P t = Q t θ T t R t θ 1 t +(A + B t θ 1 t )P t+1 (A + B t θ 1 t ) 23!

24 How do we get the Optimal Policy 1.Forward Propagation: Run Simulator to Obtain Linearizations 2.Backward Solution: Compute Optimal Control Law 3.If not converged, go to 1. 24

25 Model Learning with subsequent Policy Optimization 25 Atkeson & Schaal, 1996; Schaal, 1997

26 Model Learning with subsequent Policy Optimization 26 Atkeson & Schaal, 1996; Schaal, 1997

27 Outline of the Lecture 1.Introduction to Optimal Control 2.Solving Linear-Quadratic Optimal Control Problems 3.Optimal Control with Learned Models 4.Hot story: Marc Deisenroth s PILCO Approach 5.Final Remarks 27

Probabilistic Forward Models Marc Deisenroth Many forward models explain measured data. Choosing a bad model destroys will cause an optimization bias.

28 Probabilistic Forward Models Marc Deisenroth Many forward models explain measured data. Choosing a bad model destroys will cause an optimization bias. Can we average ensure robustness towards bad approximations? Yes! Even in a Bayesian way with Gaussian Process Regression! 28 Deisenroth, Fox, Rasmussen, R:SS 2011

29 Basic Idea Marc Deisenroth With a GP, you can compute the distributions over all future states based on all forward models weighted by their likelihood. The propagation is still approximate and done by moment matching From the distribution: Expected Return: Gradient: We can do policy updates! 29 Deisenroth, Fox, Rasmussen, R:SS 2011

30 Applications Marc Deisenroth 30 Deisenroth, Fox, Rasmussen, R:SS 2011

31 Outline of the Lecture 1.Introduction to Optimal Control 2.Solving Linear-Quadratic Optimal Control Problems 3.Optimal Control with Learned Models 4.Hot story: Marc Deisenroth s PILCO Approach 5.Final Remarks 31

32 Conclusions You have learned about optimal control today! Only two cased are solvable: linear & discrete! Linear scales but does not generalize. Discrete generalizes but does not scale. Using Learned Models, you can compute at least optimal policy tubes. If you have many many tubes, in good regions, you have a policy. We will continue with Value Function and Policy Search Methods. 32

33 Further Reading C. G. Atkeson (1994), Using Local Trajectory Optimizers to Speed Up Global Optimization in Dynamic Programming, Proceedings, Neural Information Processing Systems, Denver, Colorado, December, 1993, In: Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds. Morgan Kaufmann, Schaal, S. (1997). Learning from demonstration. In: M.C. Mozer, M. Jordan, & T. Petsche (eds.), Advances in Neural Information Processing Systems 9, pp Cambridge, MA: MIT Press Marc P. Deisenroth, Carl E. Rasmussen, Dieter Fox (2011). Learning to Control a Low-Cost Robotic Manipulator Using Data-Efficient Reinforcement Learning, Robotics: Science & Systems (RSS 2011) 33

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol