Optimal Control, Trajectory Optimization, Learning Dynamics
1 Carnegie Mellon School of Computer Science
Deep Reinforcement Learning and Control
Optimal Control, Trajectory Optimization, Learning Dynamics
Katerina Fragkiadaki
2 So far..
- Most reinforcement learning literature: learning policies without knowing how the world works (model-free).
- Model-based RL: build a model from data and use it to sample experience, as opposed to only using experience gathered by interacting with the world.
- Imitation learning: learn from a teacher. The most recent and successful formulations suggest training a classifier/regressor with data augmentation/scheduling to handle compounding errors.
- Planners (e.g., Monte Carlo Tree Search): search for actions while knowing how the world works (model-based); search in a discrete space.
- Imitation learning: learn from a planner, e.g., imitate MCTS to learn to play Atari games better than what you would get from model-free RL.
- Sidd assumed he knew some simplified models of physics, which allowed him to do formidable look-aheads using MCTS and particle trajectories.
3 Today's lecture
(The same landscape as above: model-free RL, model-based RL, imitation learning from a teacher, planners such as MCTS, imitation learning from a planner, plus:)
- Optimal control: search for actions while knowing how the world works (model-based); search in a continuous space (we will use derivatives).
4 Today's lecture
- Optimal control, trajectory optimization formulation
- Special but important case: linear dynamics, quadratic costs (LQR)
- Iterative LQR / differential dynamic programming for non-linear dynamical systems
- Examples of when it works and when it does not
- Learning dynamics: global models
5 Next two lectures
- Learning dynamics: local models
- Learning local or global controllers with a trust region based on the KL divergence of their trajectory distributions (i-LQR, TRPO)
- Imitating optimal controllers to find general policies: execute the desired task from any initial state
- Successful examples from the literature
6 Optimal Control (Open Loop)
The optimal control problem:
$$\min_{x,u} \sum_{t=0}^{T} c_t(x_t, u_t) \quad \text{s.t.} \quad x_0 = \bar{x}_0, \quad x_{t+1} = f(x_t, u_t), \;\; t = 0, \ldots, T-1$$
7 Optimal Control (Open Loop)
The optimal control problem:
$$\min_{x,u} \sum_{t=0}^{T} c_t(x_t, u_t) \quad \text{s.t.} \quad x_0 = \bar{x}_0, \quad x_{t+1} = f(x_t, u_t), \;\; t = 0, \ldots, T-1$$
Solution: a sequence of controls u and the resulting state sequence x.
In general this is a non-convex optimization problem; it can be solved with sequential convex programming (SCP): ee364b/lectures/seq_slides.pdf
8 Optimal Control (Closed Loop, a.k.a. MPC)
Given: $\bar{x}_0$
For $t = 0, 1, 2, \ldots, T$:
- Solve $\min_{x,u} \sum_{k=t}^{T} c_k(x_k, u_k)$ s.t. $x_{k+1} = f(x_k, u_k) \;\forall k \in \{t, t+1, \ldots, T-1\}$ and $x_t = \bar{x}_t$
- Execute $u_t$
- Observe the resulting state $x_{t+1}$
Initialize with the solution from $t-1$ to solve fast at time $t$. A minimal sketch of this loop follows.
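The receding-horizon loop above, as a minimal Python sketch. `solve_ocp` and `true_dynamics` are hypothetical placeholders for any trajectory optimizer (e.g., the i-LQR sketch later in this lecture) and the real system; they are not a specific library API.

```python
def mpc(x0, T, solve_ocp, true_dynamics):
    x, warm_start = x0, None
    executed = []
    for t in range(T):
        # Re-solve the remaining-horizon problem, warm-started from the
        # previous solution so the re-solve at time t is fast.
        u_seq = solve_ocp(x, horizon=T - t, init=warm_start)
        executed.append(u_seq[0])
        x = true_dynamics(x, u_seq[0])  # execute u_t, observe x_{t+1}
        warm_start = u_seq[1:]          # shift the old plan by one step
    return executed
```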
9 Shooting methods vs collocation methods
Collocation method: optimize over actions and states, with constraints:
$$\min_{u_1,\ldots,u_T,\, x_1,\ldots,x_T} \sum_{t=1}^{T} c(x_t, u_t) \quad \text{s.t.} \quad x_t = f(x_{t-1}, u_{t-1})$$
Diagram: Sergey Levine
10 Shooting methods vs collocation methods
Shooting method: optimize over actions only:
$$\min_{u_1,\ldots,u_T} c(x_1, u_1) + c(f(x_1, u_1), u_2) + \cdots + c(f(f(\ldots)\ldots), u_T)$$
Indeed, the states x are not needed as decision variables, since every control sequence u induces (through the dynamics) a state sequence x, from which in turn the cost can be computed.
It is not clear how to initialize in a way that nudges the solution towards a goal state.
Diagram: Sergey Levine
11 Bellman's Curse of Dimensionality
- For an n-dimensional state space, the number of states grows exponentially in n (for a fixed number of discretization levels per coordinate).
- In practice, discretization is considered computationally feasible only up to 5- or 6-dimensional state spaces, even when using:
  - variable-resolution discretization
  - highly optimized implementations
12 Linear case: LQR
Very special case: optimal control for linear dynamics and quadratic cost (a.k.a. the LQ setting).
- Can solve the continuous state-space optimal control problem exactly.
- Running time: $O(Tn^3)$.
13 Linear dynamics: Newtonian dynamics
$$x_{t+1} = x_t + \Delta t\,\dot{x}_t + \Delta t^2 F_x, \qquad y_{t+1} = y_t + \Delta t\,\dot{y}_t + \Delta t^2 F_y$$
$$\dot{x}_{t+1} = \dot{x}_t + \Delta t\,F_x, \qquad \dot{y}_{t+1} = \dot{y}_t + \Delta t\,F_y$$
These equations can be written as one linear system $x_{t+1} = A x_t + B u_t$, as sketched below.
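The point-mass equations above in matrix form, a sketch assuming unit mass and time step `dt` (state $[x, y, \dot{x}, \dot{y}]$, control $[F_x, F_y]$):

```python
import numpy as np

dt = 0.1  # time step (unit mass assumed)

# x_{t+1} = A x_t + B u_t for the Newtonian point mass above
A = np.array([[1.0, 0.0, dt,  0.0],
              [0.0, 1.0, 0.0, dt ],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
B = np.array([[dt**2, 0.0  ],
              [0.0,   dt**2],
              [dt,    0.0  ],
              [0.0,   dt   ]])

def step(state, force):
    # one discrete step of the linear dynamics
    return A @ state + B @ force
```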
14 What is the state x?
In most robotic tasks, the state is hand-engineered and includes:
- positions and velocities of the robotic joints
- position and velocity of the object being manipulated
Both are assumed known: the robot knows its own state, and we perceive the state of the objects in the world.
In tasks where we do not even want to bother with object state, we just concatenate the robotic state across multiple time steps to implicitly infer the interaction (collision with the object).
Later lecture: learning the state from raw pixels (Embed to Control).
15 What is the cost $c(x_t, u_t)$?
$$c(x_t, u_t) = \|x_t - x^*\| + \|u_t\|$$
where $x^*$ is a final goal configuration we want to achieve.
In the final time step, you can add a term with higher weight, e.g. a final cost $c(x_T, u_T) = 2(\|x_T - x^*\| + \|u_T\|)$.
For object manipulation, $x^*$ includes not only the desired pose of the end effector but also the desired pose of the objects.
16 Linear Quadratic Regulator (LQR)
Definitions:
- $Q(x_t, u_t)$: optimal action-value function, the cost-to-go from state $x_t$ when executing control $u_t$
- $V(x_t)$: optimal state-value function, the cost-to-go from state $x_t$: $V(x_t) = \min_{u_t} Q(x_t, u_t)$
17 Linear Quadratic Regulator (LQR)
Value iteration: backward propagation! Start from $u_T$ and work backwards.
18 Linear Quadratic Regulator (LQR)
Value iteration: backward propagation! Start from $u_T$ and work backwards.
Cost matrices for the last time step define a quadratic cost in $(x_T, u_T)$.
20 Linear Quadratic Regulator (LQR)
Remember: $V(x_t) = \min_{u_t} Q(x_t, u_t)$
Substituting the minimizer $u_T$ into $Q(x_T, u_T)$ gives us $V(x_T)$: the cost-to-go as a function of the final state.
21 Linear Quadratic Regulator (LQR)
We propagate the optimal value function backwards: immediate cost plus best cost-to-go. (Compare the generic backup $q^*(s, a) = r(s, a) + \sum_{s' \in S} T(s' \mid s, a)\, v^*(s')$.)
Since the dynamics are linear and the cost quadratic, we can eliminate $x_T$ and write $V(x_T)$ only in terms of $x_{T-1}, u_{T-1}$: the result is quadratic in these quantities.
22 Linear Quadratic Regulator (LQR)
We have now written the optimal action-value function $Q(x_{T-1}, u_{T-1})$, quadratic in the state-control pair, only in terms of $x_{T-1}, u_{T-1}$. Repeating the backup gives the full backward recursion, sketched below.
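The backward recursion as a minimal numpy sketch, for the simpler non-affine case with dynamics $x_{t+1} = A x_t + B u_t$ and stage cost $x^\top Q x + u^\top R u$ (the slides' affine/offset terms are omitted here to keep the sketch short):

```python
import numpy as np

def lqr_backward(A, B, Q, R, T):
    """Return the time-varying gains K_0..K_{T-1} with u_t = K_t x_t."""
    V = Q                                   # cost-to-go Hessian at the last step
    gains = []
    for _ in range(T):
        # Q-function blocks from one Bellman backup
        Q_uu = R + B.T @ V @ B
        Q_ux = B.T @ V @ A
        K = -np.linalg.solve(Q_uu, Q_ux)    # minimizer of the quadratic in u
        V = Q + A.T @ V @ A + A.T @ V @ B @ K   # propagate cost-to-go backwards
        gains.append(K)
    return gains[::-1]                      # chronological order K_0 ... K_{T-1}
```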
23 Linear case: LQR
(Diagram of the full backward recursion starting from $x_0$.) Diagram: Sergey Levine
24 Stochastic dynamics
The same control solution $u_t = K_t x_t + k_t$ holds for stochastic but Gaussian dynamics:
$$p(x_{t+1} \mid x_t, u_t) = \mathcal{N}\!\left(F_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t,\; \Sigma_t\right)$$
i.e., the mean is the deterministic transition $f(x_t, u_t) = F_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t$.
25 Non-linear case: use iterative approximations!
- First-order Taylor expansion of the dynamics around a trajectory $\hat{x}_t, \hat{u}_t,\; t = 1 \ldots T$.
- Second-order Taylor expansion of the cost around a trajectory $\hat{x}_t, \hat{u}_t,\; t = 1 \ldots T$.
26 Iterative LQR (i-LQR)
Initialization: given $\hat{x}_0$, pick a random control sequence $\hat{u}_0 \ldots \hat{u}_T$ and obtain the corresponding state sequence $\hat{x}_0 \ldots \hat{x}_T$.
The resulting time-varying affine controller is $u_t = \hat{u}_t + K_t(x_t - \hat{x}_t) + k_t \;\;\forall t$.
27 Iterative LQR (i-LQR)
Initialization: given $\hat{x}_0$, pick a random control sequence $\hat{u}_0 \ldots \hat{u}_T$ and obtain the corresponding state sequence $\hat{x}_0 \ldots \hat{x}_T$. Repeat:
- Linearize/quadratize around $\hat{x}, \hat{u}$.
- Find $\delta u_t,\; t = 1 \ldots T$, so that $\hat{u}_t + \delta u_t$ minimizes the approximation: $u_t = \hat{u}_t + K_t(x_t - \hat{x}_t) + k_t \;\;\forall t$.
- Update the reference trajectory: $\hat{x} \leftarrow \hat{x} + \delta x_t$ and $\hat{u} \leftarrow \hat{u} + \delta u_t$.
A sketch of this loop follows.
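The i-LQR outer loop as a sketch. `linearize_dynamics`, `quadratize_cost`, and `lqr_backward_affine` are hypothetical helpers standing in for the Taylor expansions and the affine LQR backup described above; only `rollout` is spelled out.

```python
def rollout(f, x0, us):
    xs = [x0]
    for u in us:
        xs.append(f(xs[-1], u))
    return xs

def ilqr(f, cost, x0, u_hat, n_iters=50):
    x_hat = rollout(f, x0, u_hat)                      # nominal trajectory
    for _ in range(n_iters):
        F = [linearize_dynamics(f, x, u) for x, u in zip(x_hat, u_hat)]
        C = [quadratize_cost(cost, x, u) for x, u in zip(x_hat, u_hat)]
        K, k = lqr_backward_affine(F, C)               # time-varying affine gains
        # forward pass with the REAL nonlinear dynamics
        x, new_xs, new_us = x0, [x0], []
        for t, u0 in enumerate(u_hat):
            u = u0 + K[t] @ (x - x_hat[t]) + k[t]
            x = f(x, u)
            new_us.append(u)
            new_xs.append(x)
        x_hat, u_hat = new_xs, new_us                  # re-expand around new trajectory
    return u_hat
```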
28 Iterative LQR (i-LQR)
i-LQR approximates Newton's method for solving the trajectory optimization problem.
As with Newton's method, since the original objective is non-convex, the optimization can get stuck in local minima.
What are we missing to be an exact Newton's method?
29 Differential Dynamic Programming
A second-order approximation of the dynamics: the resulting method is called differential dynamic programming (DDP).
Reference: Jacobson and Mayne, Differential Dynamic Programming, 1970
30 Nonlinear case: DDP/iterative LQR
The quadratic approximation is invalid too far away from the reference trajectory.
Instead of taking the full minimizing step, do a line search over $\alpha$: run a forward pass with the real nonlinear dynamics and $u_t = \hat{u}_t + K_t(x_t - \hat{x}_t) + \alpha k_t$, as sketched below.
More principled ways of doing this for stochastic policies in the next lecture.
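A backtracking line search on the open-loop term $k_t$, as a sketch. `forward_pass` (which rolls out the real nonlinear dynamics with $u_t = \hat{u}_t + K_t(x_t - \hat{x}_t) + \alpha k_t$) and `trajectory_cost` are hypothetical helpers.

```python
def line_search(x0, x_hat, u_hat, K, k, forward_pass, trajectory_cost):
    baseline = trajectory_cost(x_hat, u_hat)
    alpha = 1.0
    while alpha > 1e-4:
        xs, us = forward_pass(x0, x_hat, u_hat, K, k, alpha)
        if trajectory_cost(xs, us) < baseline:
            return xs, us, alpha   # accept this step size
        alpha *= 0.5               # otherwise shrink alpha and retry
    return x_hat, u_hat, 0.0       # no improving step found
```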
31 Nonlinear case: DDP/iterative LQR
So far we have been planning (e.g., 100 steps ahead), then closing our eyes and hoping our modeling was accurate enough.
At convergence of i-LQR and DDP, we end up with a linearization around the (state, input) trajectory the algorithm converged to. In practice, the system may not be on this trajectory, due to perturbations, the initial state being off, the dynamics model being off, etc.
Can we handle such noise better?
32 Model Predictive Control
Yes! If we close the loop: model predictive control!
Solution: at time t, when asked to generate the control input u_t, re-solve the control problem using i-LQR or DDP over time steps t through T.
Re-planning the entire trajectory is often impractical; in practice, replan over a shorter horizon H (receding horizon control).
33 i-lqr: When it works Synthesis and stabilization of complex behaviors with online trajectory optimization, Tassa, Erez, Todorov, IROS 2012
34 i-lqr: When it works Cost: kx t x k x Direction for minimizing the cost x t
35 i-lqr: When it doesn t work Cost: kx t x k x x t Due to discontinuities of contact, the local search fails! Solution? Initialize using a human demonstration instead of random! Learning Dexterous manipulation Policies from Experience and Imitation, Kumar, Gupta, Todorov, Levine 2016
38 Learning Dynamics
So far we assumed we knew how the world works: we used discrete or continuous searches (e.g., MCTS with samples, i-LQR with derivatives) to look ahead. We got good results, better than model-free RL. Dynamics help.
However, knowing the transition model (other than for rigid objects) is unrealistic!
- We do not have good physics simulators easily accessible for even basic things like turning a wheel. A huge part (deformable objects, object interactions, etc.) we do not know how to model well.
- Even for rigid objects, where we know the equations of motion, we do not know the coefficients, e.g., friction, mass, etc.
Thus: instead of assuming the dynamics known, we need to learn them.
39 Learning Dynamics
Newtonian physics equations vs. a general parametric form (no priors from physics knowledge):
- System identification: we assume the dynamics equations are given and only a few parameters are unknown. Much easier to learn, but suffers from undermodeling (bad models).
- Neural networks: tons of unknown parameters. Very flexible, but very hard to get to generalize.
40 Learning Dynamics: Regression
Search with learnt dynamics:
1. Run a base policy $\pi_0(u_t \mid x_t)$ (e.g., a random policy) to collect $D = \{(x, u, x')_i\}$
2. Learn a dynamics model $f(x, u)$ to minimize $\sum_i \|f(x_i, u_i) - x'_i\|^2$
3. Plan under the learnt dynamics to choose actions (e.g., MCTS or i-LQR)
A minimal sketch of these three steps follows.
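Steps 1-3 as a sketch: random data collection, a least-squares linear dynamics fit (a neural-network regressor is a drop-in replacement), then planning with the learned model. `collect_random_rollouts` and `plan` are hypothetical helpers, not a specific library API.

```python
import numpy as np

# 1. collect D = {(x, u, x')} with a random policy
X, U, Xn = collect_random_rollouts(env, n_steps=10000)

# 2. fit f(x, u) = [x; u] W by least squares: minimize sum_i ||f(x_i,u_i) - x'_i||^2
Z = np.hstack([X, U])
W, *_ = np.linalg.lstsq(Z, Xn, rcond=None)
f = lambda x, u: np.concatenate([x, u]) @ W   # learned dynamics model

# 3. plan under the learned dynamics (e.g., the i-LQR sketch above)
u_plan = plan(f, x0)
```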
41 Learning Dynamics: Regression
Search with learnt dynamics (steps 1-3 as above). Let's apply it:
1. A robot randomly interacts (pokes objects) and collects data (x, u, x')
2. Fit a dynamics model
3. Plan actions to accomplish a particular goal, e.g., push an object as far away as possible
Does it work? No! What goes wrong: the part of the state space explored during step 1 differs from the part visited during step 3 (the planner says "go right to get higher!"), so there is a state-distribution mismatch between dynamics learning and policy execution.
42 Learning Dynamics: DAGGER
1. Run a base policy $\pi_0(u_t \mid x_t)$ (e.g., a random policy) to collect $D = \{(x, u, x')_i\}$
2. Learn a dynamics model $f(x, u)$ to minimize $\sum_i \|f(x_i, u_i) - x'_i\|^2$
3. Plan under the learnt dynamics to choose actions (e.g., MCTS or i-LQR)
4. Execute those actions and add the resulting data $\{(x, u, x')_j\}$ to $D$
43 Learning Dynamics: DAGGER
(Steps 1-4 as above.)
If the action sequence is long, the model errors accumulate over time. It is impossible for our dynamics model to be perfect enough to tolerate long chaining. It is also not biologically plausible: humans, for example, are very bad at predicting ball collisions accurately.
44 Learning Dynamics: MPC
1. Run a base policy $\pi_0(u_t \mid x_t)$ (e.g., a random policy) to collect $D = \{(x, u, x')_i\}$
2. Learn a dynamics model $f(x, u)$ to minimize $\sum_i \|f(x_i, u_i) - x'_i\|^2$
3. Plan under the learnt dynamics to choose actions (e.g., MCTS or i-LQR)
4. Execute only the first planned action, observe the resulting state $x'$ (MPC)
5. Append $(x, u, x')$ to dataset $D$; go back to step 3 (and refit the model every N steps)
A sketch of this loop follows.
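Steps 1-5 as one loop: replan at every step with the learned model, execute only the first action, and keep growing the dataset. A sketch; `env`, `collect_random_rollouts`, `fit_dynamics`, and `plan` are hypothetical stand-ins for the environment, data collection, the regression step, and the planner (e.g., the i-LQR sketch above).

```python
D = collect_random_rollouts(env, n_steps=1000)      # step 1
for episode in range(20):
    f = fit_dynamics(D)                             # step 2: refit every N steps
    x = env.reset()
    for t in range(100):
        u_seq = plan(f, x, horizon=15)              # step 3: short receding horizon H
        x_next = env.step(u_seq[0])                 # step 4: execute first action only
        D.append((x, u_seq[0], x_next))             # step 5: grow the dataset
        x = x_next
```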
45 Learning Dynamics (Summary)
- Regression: collect random samples, train dynamics, plan.
  Pro: simple, no iterative procedure. Con: distribution mismatch problem.
- DAGGER: iteratively collect data, replan, collect data.
  Pro: simple, solves distribution mismatch. Con: open-loop plan might perform poorly, especially in stochastic domains.
- MPC: iteratively collect data using MPC (replan at each step).
  Pro: robust to small model errors. Con: computationally expensive; requires having a planning algorithm available.
46 Examples of Learning Dynamics
53 Learning Action-Conditioned Dynamics
Newtonian physics:
- Galileo: Perceiving Physical Object Properties by Integrating a Physics Engine with Deep Learning, Wu et al.
- Newtonian Image Understanding: Unfolding the Dynamics of Objects in Static Images, Mottaghi et al.
Predictive physics (next slides).
54 Learning Action-Conditioned Dynamics Predictive Visual Models of Physics for Playing Billiards, K.F.*, Pulkit Agrawal*, Sergey Levine, Jitendra Malik, ICLR
55 Learning Action-Conditioned Dynamics
(Figure: a CNN takes the frame and the applied forces F as input.)
Problem: forces are translation invariant!
56 Object-centric prediction
(Figure: world-centric prediction vs. object-centric prediction, with applied forces F.)
63 (Figure: a CNN maps the forces applied to the centered ball to the displacements of the centered ball.)
64 Network Architecture
(Diagram, unrolled over two time steps: conv-relu-pool1 → conv-relu-pool2 → conv-relu3 → conv-relu4 → encode-relu1 → encode-relu2 → lstm1 → lstm2 → decode-relu1 → decode-relu2 → decode3.)
65 Test Time: Visual Imaginations
(Figure: for all objects, simulator and visual renderer.)
67 Playing Billiards
69 Learning Action-Conditioned Dynamics → Policy Learning
70 Handling uncertainty in prediction
The model presented so far (a CNN mapping the forces applied to the centered ball to its displacements) was deterministic: regression. It cannot handle uncertainty, and suffers from regression to the mean.
Solutions:
- Gaussian mixture model
- Stochastic networks
A sketch of the mixture-model option follows.
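One concrete way to avoid regression to the mean is a mixture-density head: predict K Gaussians over the next state instead of a single point. A PyTorch sketch under standard assumptions; the class and names are illustrative, not the model from the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNHead(nn.Module):
    def __init__(self, hidden_dim, state_dim, K=5):
        super().__init__()
        self.K, self.d = K, state_dim
        self.pi = nn.Linear(hidden_dim, K)               # mixture weights
        self.mu = nn.Linear(hidden_dim, K * state_dim)   # component means
        self.log_sigma = nn.Linear(hidden_dim, K * state_dim)

    def forward(self, h):
        log_pi = F.log_softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.K, self.d)
        sigma = self.log_sigma(h).view(-1, self.K, self.d).exp()
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, target):
    # negative log-likelihood of the target under the mixture
    dist = torch.distributions.Normal(mu, sigma)
    log_prob = dist.log_prob(target.unsqueeze(1)).sum(-1)   # per component
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```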
71 Text/Handwriting Generation using RNNs
Machine-generated sentence: "Besides these markets (notably a son of humor)."
Machine-generated handwriting (examples). Graves et al., Sutskever et al.
72 Text/Handwriting Generation using RNNs
(Diagram, unrolled over time: encode1 → lstm1 → lstm2 → lstm3 → decode1. Graves et al.)
73 Learning Dynamics from Motion Capture
(Diagram, unrolled over time: encode1 + noise → lstm1 → lstm2 → lstm3 → decode1, after Graves et al.)
74 Encoder-Recurrent-Decoder (ERD)
(Diagram, unrolled over time: encode-relu1 → encode-relu2 → lstm1 → lstm2 → decode-relu1 → decode-relu2 → decode3.)
75 (Figure: qualitative comparison of ground-truth vs. ERD, LSTM-3LR, CRBM, NGRAM, and GP; green: conditioning, blue: generation.)
76 Video prediction under multimodal future distributions Stochastic networks
77 Video prediction under multimodal future distributions
Train with a (conditional) variational autoencoder: a generator plus a recognition network. During training, the loss is a combination of the KL divergence (between the recognition network's output and the prior) and the reconstruction loss, as sketched below.
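The training loss described above as a sketch: a reparameterized sample from the recognition network, a reconstruction term, and a KL term against a standard normal prior. `decoder`, `q_mu`, `q_logvar`, `context`, and `target` are generic names, not the paper's exact interface.

```python
import torch

def cvae_loss(decoder, q_mu, q_logvar, context, target, beta=1.0):
    # reparameterized sample z ~ q(z | context, target) from the recognition net
    eps = torch.randn_like(q_mu)
    z = q_mu + (0.5 * q_logvar).exp() * eps
    recon = ((decoder(z, context) - target) ** 2).mean()           # reconstruction
    kl = -0.5 * (1 + q_logvar - q_mu**2 - q_logvar.exp()).mean()   # KL(q || N(0, I))
    return recon + beta * kl
```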
78 Video prediction under multimodal future distributions
79 Video prediction under multimodal future distributions green: input, red: sampled future motion field and corresponding frame completion
80 Video prediction under multimodal future distributions
(green: input; red: sampled future completions, with additional future completions shown)
81 Global models
Despite the abundance of research papers on the forecasting problem, such approaches have so far not been successful in aiding model predictive control.
Naturally, not all parts of the state space will have accurate dynamics, and such inaccuracies will be exploited by the planner.
For MPC, local models currently dominate.
82 Today's lecture
- Optimal control, trajectory optimization formulation
- Special but important case: linear dynamics, quadratic costs (LQR)
- Iterative LQR / differential dynamic programming for non-linear dynamical systems
- Examples of when it works and when it does not
- Learning dynamics: global models
83 Next two lectures
- Learning dynamics: local models
- Learning local or global controllers with a trust region based on the KL divergence of their trajectory distributions (i-LQR, TRPO)
- Imitating optimal controllers to find general policies that do not depend on the initial state x_0
- Successful examples from the literature