Real Time Stochastic Control and Decision Making: From theory to algorithms and applications
1 Real Time Stochastic Control and Decision Making: From theory to algorithms and applications Evangelos A. Theodorou Autonomous Control and Decision Systems Lab
2 Challenges in control
Uncertainty: stochastic uncertainty, parametric uncertainty, stochastic and parametric uncertainty, total uncertainty.
Scalability: state space, control parameterization.
Computational efficiency: real time, few data/interactions.
Optimality: global optimality.
3 First principles
Stochastic control theory (Richard Bellman, Lev Pontryagin), statistical physics (Richard Feynman, Mark Kac), parallel computation, machine learning.
[Figure: velocity profiles \|(u_x, u_y)\| over time for the 6 m/s and 8 m/s targets.]
4 Theoretical areas and applications
Theoretical areas: control theory and dynamical systems; stochastic control (SC); sampling, Feynman-Kac and information-theoretic methods; machine learning; linearization and uncertainty representation; physics and mechanics; neuroscience and bio-control.
Applications: autonomy, aerospace engineering, robotics, neuroscience and bioinspired robotics.
5 Dynamic Programming
Problem formulation. Cost under minimization:
V(x,t_i) = \min_u E_Q\left[ \phi(x,t_N) + \int_{t_i}^{t_N} L(x,u,t)\,dt \right]
Running cost:
L(x,u,t) = q_0(x,t) + q_1(x,t)^T u + \frac{1}{2} u^T R u
Controlled diffusion dynamics (drift plus noise):
dx = F(x,u)\,dt + B(x)\,d\omega, \qquad F(x,u) = f(x) + g(x)u
E_Q: expectation w.r.t. the controlled dynamics; E_P: expectation w.r.t. the uncontrolled dynamics.
Bellman principle (discrete): cost-to-go(current state) = min[ cost to reach(next state) + cost-to-go(next state) ].
[Figure: a four-stage graph from Start (node 1) to Goal (node 9), with edge costs L_{4,6}, L_{4,7}, L_{4,8}, L_{6,9}, L_{7,9}, L_{8,9}, illustrating the backward recursion for V(x_1), ..., V(x_8).]

6 Dynamic Programming (continued)
Two difficulties with the Bellman recursion: 1) it is a backward process, and 2) in continuous time it becomes a partial differential equation that is hard to solve.
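The discrete Bellman principle on the slide's stage graph can be sketched in a few lines. The edge costs below are illustrative placeholders (the slide only names the edges, not their values); only the backward recursion itself is from the slide.

```python
# Backward dynamic programming on the small stage graph from the slide:
# V(goal) = 0 and V(n) = min over successors m of [ L(n,m) + V(m) ].
# Edge costs are made-up illustrative numbers.
edges = {
    1: {2: 1.0, 3: 4.0},            # stage 1 -> stage 2
    2: {4: 2.0, 5: 1.0},            # stage 2 -> stage 3
    3: {4: 1.0, 5: 3.0},
    4: {6: 2.0, 7: 1.0, 8: 5.0},    # stage 3 -> stage 4 (L_{4,6}, L_{4,7}, L_{4,8})
    5: {7: 2.0, 8: 1.0},
    6: {9: 1.0}, 7: {9: 3.0}, 8: {9: 1.0},  # stage 4 -> goal (L_{6,9}, L_{7,9}, L_{8,9})
}
V = {9: 0.0}                        # cost-to-go at the goal
for node in sorted(edges, reverse=True):   # successors are always numbered higher
    V[node] = min(cost + V[succ] for succ, cost in edges[node].items())
```

Sweeping the nodes in reverse order is exactly the "backward process" the slide refers to; in continuous time this recursion becomes the HJB PDE.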
7-8 Making Control Linear
Hamilton-Jacobi-Bellman equation:
-\partial_t V = q_0 + (\nabla_x V)^T f - \frac{1}{2} (\nabla_x V)^T G R^{-1} G^T (\nabla_x V) + \frac{1}{2} \mathrm{tr}\big( (\nabla_{xx} V) B B^T \big)
Desirability: \Psi(x,t) = \exp\!\big( -\tfrac{1}{\lambda} V(x,t) \big)
Noise regulates control authority: \lambda\, G(x) R^{-1} G(x)^T = B(x) B(x)^T
Backward Chapman-Kolmogorov equation (linear in \Psi):
-\partial_t \Psi = -\tfrac{1}{\lambda} q_0 \Psi + f^T (\nabla_x \Psi) + \frac{1}{2} \mathrm{tr}\big( (\nabla_{xx} \Psi) B B^T \big)
Statistical physics, Feynman-Kac (Richard Feynman, Mark Kac): sample from the uncontrolled dynamics
dx = f(x)\,dt + B(x)\,d\omega
and evaluate the expectation:
\Psi(x,t_i) = E_P\Big[ \exp\!\Big( -\tfrac{1}{\lambda} \int_{t_i}^{t_N} q(x)\,dt \Big)\, \Psi(x,t_N) \Big]
Find the optimal controls:
u_{PI}(x,t) = \lambda R^{-1} G(x)^T \frac{\nabla_x \Psi(x,t)}{\Psi(x,t)}
They push the dynamics to states with high desirability.

9 Making Control Linear (continued)
Lower bound:
-\lambda \log E_P\Big[ \exp\!\Big( -\tfrac{1}{\lambda} \Big( \phi(x,t_N) + \int_{t_i}^{t_N} q(x,t)\,dt \Big) \Big) \Big] \le E_Q\Big[ \phi(x,t_N) + \int_{t_i}^{t_N} L(x,u,t)\,dt \Big]
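The Feynman-Kac expectation above can be estimated by plain Monte Carlo: simulate the uncontrolled SDE with Euler-Maruyama and average the exponentiated path costs. Everything in this sketch (the scalar dynamics f(x) = -x, noise scale B, costs q and phi, temperature lambda) is a toy assumption, not a model from the talk.

```python
import math, random

random.seed(0)

# Toy 1-D Feynman-Kac estimate of the desirability Psi(x0, 0).
# All model choices are illustrative: f(x) = -x, B = 0.5,
# q(x) = x^2, terminal cost phi(x) = x^2, temperature lam = 1.0.
f = lambda x: -x
B, lam, dt, T = 0.5, 1.0, 0.01, 1.0
q = lambda x: x * x
phi = lambda x: x * x

def psi_estimate(x0, n_samples=2000):
    """Psi(x0,0) = E_P[ exp(-(1/lam)(integral of q dt + phi(x_T))) ]."""
    total = 0.0
    for _ in range(n_samples):
        x, path_cost = x0, 0.0
        for _ in range(int(T / dt)):
            path_cost += q(x) * dt                      # accumulate state cost
            x += f(x) * dt + B * math.sqrt(dt) * random.gauss(0.0, 1.0)
        total += math.exp(-(path_cost + phi(x)) / lam)  # Feynman-Kac weight
    return total / n_samples

psi0 = psi_estimate(0.0)
value0 = -lam * math.log(psi0)      # recover the value: V(x,t) = -lam * log Psi(x,t)
```

Since every sampled path has strictly positive cost, the estimate satisfies 0 < Psi < 1 and the recovered value is positive; no PDE solver and no backward sweep are needed.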
10 Generalized Path Integral Control
Local controls:
u_L(x_{t_i}, t_i) = R^{-1} G(x_{t_i})^T \big( G(x_{t_i}) R^{-1} G(x_{t_i})^T \big)^{-1} G(x_{t_i})\, dw(t_i)
Trajectory probability:
P(\tilde{x}) = \frac{ \exp(-\tfrac{1}{\lambda} S(\tilde{x})) }{ \int \exp(-\tfrac{1}{\lambda} S(\tilde{x}))\, d\tilde{x} }
Intuition: sample M trajectories from the uncontrolled dynamics, starting at Start with noise realizations dw^{(1)}_{t_0}, ..., dw^{(M)}_{t_0} and path costs S_1(\tilde{x}), ..., S_M(\tilde{x}); then
u = \sum_i P_i\, dw^{(i)}_{t_0}, \qquad P_i = \frac{ \exp(-\tfrac{1}{\lambda} S_i(\tilde{x})) }{ \sum_j \exp(-\tfrac{1}{\lambda} S_j(\tilde{x})) }
E. Theodorou, J. Buchli, S. Schaal. AISTATS 2009; JMLR 2010.
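The softmax weighting of noise realizations on this slide is a one-liner in practice. The path costs S_i and noise draws dw^{(i)} below are made-up numbers purely to exercise the formula; subtracting the minimum cost before exponentiating is a standard numerical-stability trick.

```python
import math

# Path-integral weighting: P_i = exp(-S_i/lam) / sum_j exp(-S_j/lam),
# control update u = sum_i P_i * dw_i.  Costs and noise draws are illustrative.
lam = 1.0
S = [3.0, 1.0, 2.0, 10.0]        # path costs S_i(x~)
dw = [0.5, -0.2, 0.1, 4.0]       # noise realizations dw^{(i)} at t_0

m = min(S)                       # shift by the min for numerical stability
w = [math.exp(-(s - m) / lam) for s in S]
P = [wi / sum(w) for wi in w]    # normalized path probabilities
u = sum(Pi * dwi for Pi, dwi in zip(P, dw))
```

Note how the high-cost rollout (S = 10) contributes almost nothing: its weight is suppressed exponentially, so the averaged control follows the cheap trajectories.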
11 Iterative Path Integral Control
Sampling with the uncontrolled dynamics:
\Psi(x,t_i) = \int \exp\!\Big( -\tfrac{1}{\lambda} \int q(x_t,t)\,dt \Big)\, \Psi(x,t_N)\, dP \quad \text{(uncontrolled)}
Sampling with the controlled dynamics:
\Psi(x,t_i) = \int \exp\!\Big( -\tfrac{1}{\lambda} \int q(x_t,t)\,dt \Big)\, \Psi(x,t_N)\, \frac{dP}{dQ}\, dQ \quad \text{(controlled; } dP/dQ \text{ is the Radon-Nikodym derivative)}
Find the optimal controls:
u_{PI}(x,t) = \lambda R^{-1} G(x)^T \frac{\nabla_x \Psi(x,t)}{\Psi(x,t)}
Iterative path integral control:
u_{new}(t_k) = u_{old}(t_k) + \sum_i P_i\, dw^{(i)}(t_k)
Path probability:
P_i = \frac{ \exp\!\big( -\tfrac{1}{\lambda} ( S(\tilde{x}_i) + \text{correction} ) \big) }{ \sum_j \exp\!\big( -\tfrac{1}{\lambda} ( S(\tilde{x}_j) + \text{correction} ) \big) }
E. Theodorou, J. Buchli, S. Schaal, JMLR 2010. E. Theodorou, E. Todorov, CDC 2012.
12 Overview
Optimal control theory (constrained optimization, dynamic programming) leads to the Hamilton-Jacobi-Bellman PDE for V(x,t). The transformation V(x,t) = -\lambda \log \Psi(x,t) yields the backward Chapman-Kolmogorov PDE for \Psi(x,t), which the Feynman-Kac lemma solves by sampling, and which also gives the fundamental bound.
E. Theodorou, E. Todorov, CDC 2012. E. Theodorou, Entropy 2015.
13 Information Theoretic View
Free energy:
F = -\lambda \log \int \exp\!\big( -\tfrac{1}{\lambda} J(x) \big)\, dP(\omega)
Generalized entropy:
S(Q \| P) = - \int \frac{dQ(\omega)}{dP(\omega)} \log \frac{dQ(\omega)}{dP(\omega)}\, dP(\omega)
Fundamental inequality:
-\lambda \log \int \exp\!\big( -\tfrac{1}{\lambda} J(x) \big)\, dP(\omega) \le E_Q\big[ J(x) \big] + \lambda\, KL(Q \| P)
that is, Free Energy \le Work - Temperature \times Generalized Entropy.
Optimal measure:
dQ^* = \frac{ \exp(-\tfrac{1}{\lambda} J(x))\, dP }{ \int \exp(-\tfrac{1}{\lambda} J(x))\, dP }
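The free-energy inequality and the optimality of dQ* can be checked numerically on a finite sample space, where the integrals become sums. The cost values and the candidate distribution Q below are arbitrary illustrative numbers; the point is that the bound holds for any Q and is tight exactly at the exponentially tilted measure Q*.

```python
import math

# Discrete check of  -lam*log E_P[exp(-J/lam)] <= E_Q[J] + lam*KL(Q||P).
# Costs and distributions are arbitrary illustrative numbers.
lam = 0.5
J = [1.0, 2.0, 0.5, 3.0]
P = [0.25] * 4                                  # "uncontrolled" measure

F = -lam * math.log(sum(p * math.exp(-j / lam) for p, j in zip(P, J)))

def rhs(Q):
    """Right-hand side of the bound: E_Q[J] + lam * KL(Q||P)."""
    EJ = sum(q * j for q, j in zip(Q, J))
    KL = sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)
    return EJ + lam * KL

Q_any = [0.1, 0.2, 0.6, 0.1]                    # some controlled measure
Z = sum(p * math.exp(-j / lam) for p, j in zip(P, J))
Q_star = [p * math.exp(-j / lam) / Z for p, j in zip(P, J)]  # optimal measure
```

Evaluating `rhs(Q_any)` gives a number strictly above the free energy `F`, while `rhs(Q_star)` matches `F` to machine precision, which is the content of the "optimal measure" line on the slide.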
14 Non-Classical View
Uncontrolled dynamics:
P: \; dx = f(x)\,dt + \sqrt{\lambda}\, B(x)\, d\omega^{(0)}
Controlled dynamics:
Q: \; dx = f(x)\,dt + B(x)\big( u\,dt + \sqrt{\lambda}\, d\omega^{(1)} \big)
Fundamental relationship:
-\lambda \log E_P\big[ \exp(-\tfrac{1}{\lambda} J(x)) \big] \le E_Q\big[ J(x) \big] + \lambda\, KL(Q \| P)
Radon-Nikodym derivative (Girsanov):
\frac{dQ}{dP} = \exp\!\Big( -\frac{1}{2\lambda} \int_{t_i}^{t_N} u^T u\, dt + \frac{1}{\sqrt{\lambda}} \int_{t_i}^{t_N} u^T dw^{(1)}(t) \Big)
Substituting gives the fundamental relationship between free energy and cost function:
\xi(x,t_i) = -\lambda \log E_P\big[ \exp(-\tfrac{1}{\lambda} J(x)) \big] \le E_Q\Big[ J(x) + \int_{t_i}^{t_N} \tfrac{1}{2} u^T u\, dt \Big]
There is a lower bound on the cost function. How do we find this lower bound? The formulation permits generalizations to other classes of stochastic dynamics.
15 Non-Classical View
Basic relationship:
\xi(x,t_i) = -\lambda \log \int \exp\!\big( -\tfrac{1}{\lambda} J(x) \big)\, dP(\omega) \le E_Q\Big[ J(x) + \int_{t_i}^{t_N} \tfrac{1}{2} u^T R u\, dt \Big]
Backward Chapman-Kolmogorov equation:
-\partial_t \Phi = -\tfrac{1}{\lambda} q_0 \Phi + f^T (\nabla_x \Phi) + \frac{1}{2} \mathrm{tr}\big( (\nabla_{xx} \Phi) B B^T \big)
Exponential transformation: \xi(x,t) = -\lambda \log \Phi(x,t)
Hamilton-Jacobi equation:
-\partial_t \xi = q_0 + (\nabla_x \xi)^T f - \frac{1}{2\lambda} (\nabla_x \xi)^T B B^T (\nabla_x \xi) + \frac{1}{2} \mathrm{tr}\big( (\nabla_{xx} \xi) B B^T \big)
\xi(x,t) is a value function; \Phi(x,t) is a desirability function.
16 Overview
Two routes to the same structure. From optimal control theory: constrained optimization and dynamic programming give the Hamilton-Jacobi-Bellman PDE for V(x,t); the transformation \Psi(x,t) = \exp(-\tfrac{1}{\lambda} V(x,t)) gives the backward Chapman-Kolmogorov PDE, the Feynman-Kac lemma, and the fundamental bound. From information theory: free energy and relative entropy, via Jensen's inequality and Girsanov's theorem, give the fundamental bound \xi(x,t) = -\lambda \log \Phi(x,t); the backward Chapman-Kolmogorov PDE and the Feynman-Kac lemma then recover the Hamilton-Jacobi-Bellman PDE.
E. Theodorou, E. Todorov, CDC 2012. E. Theodorou, Entropy 2015.
17-22 Non-Classical View (generalized dynamics)
Fundamental relationship:
-\lambda \log \int \exp\!\big( -\tfrac{1}{\lambda} J(x) \big)\, dP(\omega) \le E_Q\big[ J(x) \big] + \lambda\, KL(Q \| P)
Uncontrolled dynamics:
P: \; dx = F(x,0)\,dt + C(x)\, dw^{(0)}
Controlled dynamics:
Q: \; dx = F(x,u)\,dt + C(x)\, dw^{(1)}
Optimal measure:
dQ^* = \frac{ \exp(-\tfrac{1}{\lambda} J(x))\, dP }{ \int \exp(-\tfrac{1}{\lambda} J(x))\, dP }
New optimal control formulation:
u^* = \arg\min_u KL\big( Q^* \| Q(u) \big)
with the Radon-Nikodym decomposition \frac{dQ^*}{dQ(u)} = \frac{dQ^*}{dP} \frac{dP}{dQ(u)}.
Control parameterization (piecewise constant):
u_t = u_0 \text{ if } 0 \le t < \Delta t; \; u_1 \text{ if } \Delta t \le t < 2\Delta t; \; \ldots; \; u_j \text{ if } j\Delta t \le t < (j+1)\Delta t; \; \ldots
Generalized importance sampling:
E_P\big[ f(x) \big] = E_{Q_{m,\nu}}\Big[ f(x)\, \frac{dP}{dQ_{m,\nu}} \Big]
G. Williams et al., ICRA 2016.
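One MPPI-style update built from these pieces (piecewise-constant controls, sampled rollouts, exponential weighting of the noise) can be sketched on a toy system. The dynamics (a 1-D double integrator driven toward a target position), the cost, and every constant below are illustrative assumptions, not the vehicle models from the talk.

```python
import math, random

random.seed(1)

# Minimal MPPI-style update on a toy 1-D double integrator.
# All model details are illustrative assumptions.
dt, H, K = 0.05, 20, 256            # step, horizon, number of rollouts
lam, sigma, target = 0.5, 2.0, 1.0  # temperature, exploration noise, goal

def rollout_cost(u_seq, noise):
    """Deterministic cost of applying controls u_seq + noise."""
    x, v, cost = 0.0, 0.0, 0.0
    for u, eps in zip(u_seq, noise):
        v += (u + eps) * dt
        x += v * dt
        cost += (x - target) ** 2 * dt
    return cost

u_nom = [0.0] * H                   # nominal piecewise-constant control sequence
noises = [[sigma * random.gauss(0.0, 1.0) for _ in range(H)] for _ in range(K)]
costs = [rollout_cost(u_nom, n) for n in noises]

m = min(costs)                      # subtract the min for numerical stability
w = [math.exp(-(c - m) / lam) for c in costs]
W = sum(w)
u_new = [u_nom[t] + sum(w[k] * noises[k][t] for k in range(K)) / W
         for t in range(H)]
```

Because the cost is convex in the control sequence, Jensen's inequality guarantees the updated sequence costs no more than the weighted average of the rollout costs, and in practice it beats the nominal (zero) controls after a single update.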
23 MPPI with offline model learning, 7-8 m/s
24 Applications in Robotics: High-Speed MPPI with offline learned dynamics
[Figure: paths taken (Y vs. X position) for the 6 m/s and 8 m/s targets; multi-step prediction errors over time in position (P_x, P_y), heading, velocity (U_x, U_y), and heading rate.]
Learned dynamics model (per output dimension i):
f_i(x) \sim \mathcal{N}\big( \mu_i(x),\, \sigma_i^2(x) \big)
Mean: \mu_i(x) = \sigma_n^{-2}\, \phi(x)^T A_i^{-1} \Phi(X) Y_i
Variance: \sigma_i^2(x) = \phi(x)^T A_i^{-1} \phi(x)
A_i = \sigma_n^{-2}\, \Phi(X) \Phi(X)^T + \Sigma_p^{-1}
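Read as Bayesian linear regression in feature space (which is how the reconstructed mean/variance equations above are structured), the predictive equations collapse to scalars for a single feature. The feature map, data, and noise level below are made-up for illustration; only the form of the equations follows the slide.

```python
# Scalar instance of the predictive equations (one-feature Bayesian linear
# regression with unit prior):
#   A      = sigma_n^{-2} * sum_i phi(x_i)^2 + 1
#   mu(x*) = sigma_n^{-2} * phi(x*) * A^{-1} * sum_i phi(x_i) * y_i
#   var(x*)= phi(x*)^2 / A
# Data and feature map are illustrative.
sigma_n = 0.1
phi = lambda x: x                       # trivial one-dimensional feature map
X = [0.0, 0.5, 1.0, 1.5, 2.0]
Y = [0.05, 1.02, 2.1, 2.95, 4.1]        # roughly y = 2x plus noise

A = sum(phi(x) ** 2 for x in X) / sigma_n ** 2 + 1.0
b = sum(phi(x) * y for x, y in zip(X, Y)) / sigma_n ** 2

def predict(x_star):
    """Predictive mean and variance at a query point."""
    mu = phi(x_star) * b / A
    var = phi(x_star) ** 2 / A
    return mu, var
```

At a query point inside the data range the mean recovers the underlying slope and the variance is small, which is what makes such models usable inside an MPPI optimizer.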
25 Applications in Robotics: MPPI with online model learning, 8-9 m/s
26 Model Predictive Path Integral Control using Artificial Neural Networks
27 Applications in Multi-vehicle
28 Applications in Robotics: Comparisons
[Figure: average running cost vs. number of rollouts (log scale) for MPPI at several parameter settings, compared against the DDP solution; cornering-maneuver snapshots under MPC-DDP and MPPI.]
29 Applications in Multi-vehicle Systems
[Figure: success rates with static and moving obstacles for 1, 3, and 9 quadrotors; real-time CUDA optimization times (ms) per iteration vs. number of rollouts (thousands).]
30 Learning and optimization in different time scales
Control parameterizations: u(x,t,\theta) = \Phi(x,t)\,\theta, updated by path-integral steps \theta^{(k)}_{PI}(t_i), \theta^{(k)}_{PI}(t_{i+1}), \ldots, \theta^{(k)}_{PI}(t_N); and u(x,t,\theta) = \Phi(x)\,\theta(t).
31 Applications in Robotics-Manipulation
32 Applications in Robotics-Manipulation
33 Nonlinear Feynman-Kac
Cost function:
J(t,x; u(\cdot)) = E\Big[ g(x(T)) + \int_t^T \big( q_0(s,x(s)) + q_1(s,x(s))^T |u(s)| \big)\, ds \Big]
Stochastic dynamics:
dx(t) = f(t,x(t))\,dt + G(t,x(t))\, u(t)\,dt + \Sigma(t,x(t))\, dW_t
Hamilton-Jacobi-Bellman:
v_t + \inf_{u \in U} \Big[ \tfrac{1}{2} \mathrm{tr}(v_{xx} \Sigma \Sigma^T) + v_x^T f + \big( v_x^T G + q_1^T D(\mathrm{sgn}(u)) \big) u + q_0 \Big] = 0
Carrying out the minimization over the box-constrained controls gives:
v_t + \tfrac{1}{2} \mathrm{tr}(v_{xx} \Sigma \Sigma^T) + v_x^T f + q_0 + \sum_i \min\Big\{ (v_x^T G + q_1^T)_i\, u_i^{max},\; 0,\; (v_x^T G - q_1^T)_i\, u_i^{min} \Big\} = 0
34-35 Nonlinear Feynman-Kac (FBSDE formulation)
Forward SDE:
dX_s = b(s, X_s)\, ds + \Sigma(s, X_s)\, dW_s
Backward SDE:
dY_s = -h(s, X_s^{t,x}, Y_s, Z_s)\, ds + Z_s^T dW_s
with
b(t,x) = f(t,x), \qquad h(t,x,z) = q_0(t,x) + \sum_i \min\Big\{ (z^T + q_1^T)_i\, u_i^{max},\; 0,\; (z^T - q_1^T)_i\, u_i^{min} \Big\}
Importance-sampled forward SDE (modified drift):
d\tilde{X}_s = \big[ b(s, \tilde{X}_s) + \Sigma(s, \tilde{X}_s) K_s \big]\, ds + \Sigma(s, \tilde{X}_s)\, dW_s
Compensated backward SDE:
d\tilde{Y}_s = \big[ -h(s, \tilde{X}_s, \tilde{Y}_s, \tilde{Z}_s) + \tilde{Z}_s^T K_s \big]\, ds + \tilde{Z}_s^T dW_s
The compensated pair solves the same PDE, since the drift change cancels:
v_t + \tfrac{1}{2} \mathrm{tr}(v_{xx} \Sigma \Sigma^T) + v_x^T (b + \Sigma K) + h(t, x, v, \Sigma^T v_x) - v_x^T \Sigma K = 0
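The effect of the modified drift in the importance-sampled forward SDE can be seen with a plain Euler-Maruyama simulation: adding the term Sigma*K shifts where the forward samples concentrate (the role of importance sampling here), while the compensated backward equation absorbs that shift. The scalar coefficients b(s,x) = -x, Sigma, K, and the horizon below are toy assumptions.

```python
import math, random

random.seed(2)

# Euler-Maruyama sketch of the forward SDE pair from the slide:
#   plain:     dX  = b(s,X) ds + Sigma dW
#   modified:  dX~ = [b(s,X~) + Sigma*K] ds + Sigma dW
# Toy scalar coefficients; b, Sigma, K, horizon are illustrative assumptions.
b = lambda s, x: -x
Sigma, K, dt, N = 0.3, 1.0, 0.01, 100   # noise scale, drift shift, step, steps

def simulate(shift):
    """One terminal sample X_T; shift=1.0 enables the modified drift."""
    x = 0.0
    for i in range(N):
        dw = math.sqrt(dt) * random.gauss(0.0, 1.0)
        x += (b(i * dt, x) + Sigma * K * shift) * dt + Sigma * dw
    return x

plain = [simulate(0.0) for _ in range(500)]
shifted = [simulate(1.0) for _ in range(500)]
mean_plain = sum(plain) / len(plain)
mean_shifted = sum(shifted) / len(shifted)
```

The shifted ensemble ends up centered away from the origin while the plain one stays near zero, which is exactly the steering of forward samples that the compensation term in the backward SDE is designed to account for.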
36 Nonlinear Feynman-Kac: Results
[Figure: cost mean and variance per iteration; mean controlled-system trajectories (angle and angular velocity) per iteration for deterministic open-loop, deterministic feedback, and stochastic feedback schemes.]
Extensions: stochastic cooperative games; risk-sensitive stochastic control; control-affine and state/control-multiplicative dynamics.
37 Summary
Control theory (Richard Bellman, Lev Pontryagin), statistical physics (Richard Feynman, Mark Kac), parallel and neuromorphic computation, machine learning.
[Figure: velocity profiles \|(U_x, U_y)\| over time for the 6 m/s and 8 m/s targets.]
38 Autonomous Control and Decision Systems Lab
"My world is delayed." "My world is quantum mechanical." "My world is infinite dimensional." "My world is variational." "My world is parallel." "My world is probabilistic." "My world is multiple." "What are these guys talking about?"
Collaborators. Sponsors.
39 Thanks!
More informationStochastic Nonlinear Stabilization Part II: Inverse Optimality Hua Deng and Miroslav Krstic Department of Mechanical Engineering h
Stochastic Nonlinear Stabilization Part II: Inverse Optimality Hua Deng and Miroslav Krstic Department of Mechanical Engineering denghua@eng.umd.edu http://www.engr.umd.edu/~denghua/ University of Maryland
More informationMaximum Process Problems in Optimal Control Theory
J. Appl. Math. Stochastic Anal. Vol. 25, No., 25, (77-88) Research Report No. 423, 2, Dept. Theoret. Statist. Aarhus (2 pp) Maximum Process Problems in Optimal Control Theory GORAN PESKIR 3 Given a standard
More informationSolving First Order PDEs
Solving Ryan C. Trinity University Partial Differential Equations Lecture 2 Solving the transport equation Goal: Determine every function u(x, t) that solves u t +v u x = 0, where v is a fixed constant.
More informationReflected Brownian Motion
Chapter 6 Reflected Brownian Motion Often we encounter Diffusions in regions with boundary. If the process can reach the boundary from the interior in finite time with positive probability we need to decide
More information. Frobenius-Perron Operator ACC Workshop on Uncertainty Analysis & Estimation. Raktim Bhattacharya
.. Frobenius-Perron Operator 2014 ACC Workshop on Uncertainty Analysis & Estimation Raktim Bhattacharya Laboratory For Uncertainty Quantification Aerospace Engineering, Texas A&M University. uq.tamu.edu
More informationProbabilistic Differential Dynamic Programming
Probabilistic Differential Dynamic Programming Yunpeng Pan and Evangelos A. Theodorou Daniel Guggenheim School of Aerospace Engineering Institute for Robotics and Intelligent Machines Georgia Institute
More informationMatthew Zyskowski 1 Quanyan Zhu 2
Matthew Zyskowski 1 Quanyan Zhu 2 1 Decision Science, Credit Risk Office Barclaycard US 2 Department of Electrical Engineering Princeton University Outline Outline I Modern control theory with game-theoretic
More informationSolving First Order PDEs
Solving Ryan C. Trinity University Partial Differential Equations January 21, 2014 Solving the transport equation Goal: Determine every function u(x, t) that solves u t +v u x = 0, where v is a fixed constant.
More informationSolution Methods. Richard Lusby. Department of Management Engineering Technical University of Denmark
Solution Methods Richard Lusby Department of Management Engineering Technical University of Denmark Lecture Overview (jg Unconstrained Several Variables Quadratic Programming Separable Programming SUMT
More informationMan Kyu Im*, Un Cig Ji **, and Jae Hee Kim ***
JOURNAL OF THE CHUNGCHEONG MATHEMATICAL SOCIETY Volume 19, No. 4, December 26 GIRSANOV THEOREM FOR GAUSSIAN PROCESS WITH INDEPENDENT INCREMENTS Man Kyu Im*, Un Cig Ji **, and Jae Hee Kim *** Abstract.
More informationProf. Erhan Bayraktar (University of Michigan)
September 17, 2012 KAP 414 2:15 PM- 3:15 PM Prof. (University of Michigan) Abstract: We consider a zero-sum stochastic differential controller-and-stopper game in which the state process is a controlled
More informationTools of stochastic calculus
slides for the course Interest rate theory, University of Ljubljana, 212-13/I, part III József Gáll University of Debrecen Nov. 212 Jan. 213, Ljubljana Itô integral, summary of main facts Notations, basic
More informationControl of McKean-Vlasov Dynamics versus Mean Field Games
Control of McKean-Vlasov Dynamics versus Mean Field Games René Carmona (a),, François Delarue (b), and Aimé Lachapelle (a), (a) ORFE, Bendheim Center for Finance, Princeton University, Princeton, NJ 8544,
More informationQuadrotor Modeling and Control for DLO Transportation
Quadrotor Modeling and Control for DLO Transportation Thesis dissertation Advisor: Prof. Manuel Graña Computational Intelligence Group University of the Basque Country (UPV/EHU) Donostia Jun 24, 2016 Abstract
More informationMulti-Robot Active SLAM with Relative Entropy Optimization
Multi-Robot Active SLAM with Relative Entropy Optimization Michail Kontitsis, Evangelos A. Theodorou and Emanuel Todorov 3 Abstract This paper presents a new approach for Active Simultaneous Localization
More informationTheory in Model Predictive Control :" Constraint Satisfaction and Stability!
Theory in Model Predictive Control :" Constraint Satisfaction and Stability Colin Jones, Melanie Zeilinger Automatic Control Laboratory, EPFL Example: Cessna Citation Aircraft Linearized continuous-time
More informationExact Simulation of Diffusions and Jump Diffusions
Exact Simulation of Diffusions and Jump Diffusions A work by: Prof. Gareth O. Roberts Dr. Alexandros Beskos Dr. Omiros Papaspiliopoulos Dr. Bruno Casella 28 th May, 2008 Content 1 Exact Algorithm Construction
More informationEULER MARUYAMA APPROXIMATION FOR SDES WITH JUMPS AND NON-LIPSCHITZ COEFFICIENTS
Qiao, H. Osaka J. Math. 51 (14), 47 66 EULER MARUYAMA APPROXIMATION FOR SDES WITH JUMPS AND NON-LIPSCHITZ COEFFICIENTS HUIJIE QIAO (Received May 6, 11, revised May 1, 1) Abstract In this paper we show
More informationChapter 5. Pontryagin s Minimum Principle (Constrained OCP)
Chapter 5 Pontryagin s Minimum Principle (Constrained OCP) 1 Pontryagin s Minimum Principle Plant: (5-1) u () t U PI: (5-2) Boundary condition: The goal is to find Optimal Control. 2 Pontryagin s Minimum
More informationStochastic Differential Dynamic Programming
Stochastic Differential Dynamic Programming Evangelos heodorou, Yuval assa & Emo odorov Abstract Although there has been a significant amount of work in the area of stochastic optimal control theory towards
More informationXt i Xs i N(0, σ 2 (t s)) and they are independent. This implies that the density function of X t X s is a product of normal density functions:
174 BROWNIAN MOTION 8.4. Brownian motion in R d and the heat equation. The heat equation is a partial differential equation. We are going to convert it into a probabilistic equation by reversing time.
More informationStrauss PDEs 2e: Section Exercise 2 Page 1 of 6. Solve the completely inhomogeneous diffusion problem on the half-line
Strauss PDEs 2e: Section 3.3 - Exercise 2 Page of 6 Exercise 2 Solve the completely inhomogeneous diffusion problem on the half-line v t kv xx = f(x, t) for < x
More informationRisk-Sensitive Filtering and Smoothing via Reference Probability Methods
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 42, NO. 11, NOVEMBER 1997 1587 [5] M. A. Dahleh and J. B. Pearson, Minimization of a regulated response to a fixed input, IEEE Trans. Automat. Contr., vol.
More informationThe Mathematics of Continuous Time Contract Theory
The Mathematics of Continuous Time Contract Theory Ecole Polytechnique, France University of Michigan, April 3, 2018 Outline Introduction to moral hazard 1 Introduction to moral hazard 2 3 General formulation
More information