Stochastic and Adaptive Optimal Control
Robert Stengel
Optimal Control and Estimation, MAE 546
Princeton University, 2018

- Nonlinear systems with random inputs and perfect measurements
- Stochastic neighboring-optimal control
- Linear-quadratic (LQ) optimal control with perfect measurements
- Adaptive critic control
- Information sets and expected cost

Copyright 2018 by Robert Stengel. All rights reserved. For educational use only.
http://www.princeton.edu/~stengel/mae546.html
http://www.princeton.edu/~stengel/optconest.html

Nonlinear Systems with Random Inputs and Perfect Measurements

Inputs and initial conditions are uncertain, but the state can be measured without error:

    \dot{x}(t) = f[x(t), u(t), w(t), t]
    z(t) = x(t)

    E[x(0)] = \bar{x}(0);  E\{[x(0) - \bar{x}(0)][x(0) - \bar{x}(0)]^T\} = P(0)

Assume that random disturbance effects are small and additive:

    \dot{x}(t) = f[x(t), u(t), t] + L(t) w(t)
    E[w(t)] = 0;  E[w(t) w^T(\tau)] = W(t) \delta(t - \tau)
Cost Must Be an Expected Value

A deterministic cost function cannot be minimized, because the disturbance effect on the state cannot be predicted; the state and control are random variables:

    \min_{u(t)} J = \phi[x(t_f)] + \int_{t_o}^{t_f} L[x(t), u(t)] dt

However, the expected value of a deterministic cost function can be minimized:

    \min_{u(t)} J = \min_{u(t)} E\{ \phi[x(t_f)] + \int_{t_o}^{t_f} L[x(t), u(t)] dt \}

Stochastic Euler-Lagrange Equations?

There is no single optimal trajectory, and expected values of the Euler-Lagrange necessary conditions may not be well defined:

    1)  E[\lambda(t_f)] = E\{ \partial\phi/\partial x [x(t_f)] \}^T
    2)  E[\dot{\lambda}(t)] = -E\{ \partial H/\partial x [x(t), u(t), \lambda(t), t] \}^T
    3)  E\{ \partial H/\partial u [x(t), u(t), \lambda(t), t] \} = 0
Stochastic Value Function for a Nonlinear System

Expected values of the terminal and integral costs are well defined. Apply the Hamilton-Jacobi-Bellman (HJB) equation to the expected terminal and integral costs (Principle of Optimality).

Optimal expected value function at t_1:

    V^*(t_1) = E\{ \phi[x^*(t_f)] + \int_{t_1}^{t_f} L[x^*(\tau), u^*(\tau)] d\tau \}
             = \min_{u} E\{ \phi[x(t_f)] + \int_{t_1}^{t_f} L[x(\tau), u(\tau)] d\tau \}

Rate of Change of the Value Function

Total time derivative of V*:

    dV^*/dt |_{t=t_1} = -E\{ L[x^*(t_1), u^*(t_1)] \}

x(t) and u(t) can be known precisely; therefore,

    dV^*/dt |_{t=t_1} = -L[x^*(t_1), u^*(t_1)]
Incremental Change in the Value Function

Apply the chain rule to the total derivative:

    dV^*/dt = E[ \partial V^*/\partial t + (\partial V^*/\partial x) \dot{x} ]

Expand the incremental change in the value function to second degree:

    \Delta V^* = (dV^*/dt) \Delta t
               = E\{ (\partial V^*/\partial t) \Delta t + (\partial V^*/\partial x) \dot{x} \Delta t + (1/2) \dot{x}^T (\partial^2 V^*/\partial x^2) \dot{x} \Delta t^2 + \cdots \}
               = E\{ (\partial V^*/\partial t) \Delta t + (\partial V^*/\partial x) [f(\cdot) + L w(\cdot)] \Delta t
                     + (1/2) [f(\cdot) + L w(\cdot)]^T (\partial^2 V^*/\partial x^2) [f(\cdot) + L w(\cdot)] \Delta t^2 + \cdots \}

Next: cancel \Delta t.

Introduction of the Trace

The trace of a matrix product is a scalar and is invariant under cyclic permutation:

    x^T Q x = Tr(x^T Q x),  dim = (1 x 1)
    Tr(ABC) = Tr(CAB) = Tr(BCA);  Tr(x x^T Q) = Tr(Q x x^T)

    dV^*/dt = E\{ \partial V^*/\partial t + (\partial V^*/\partial x) [f(\cdot) + L w(\cdot)]
                  + (1/2) [f(\cdot) + L w(\cdot)]^T (\partial^2 V^*/\partial x^2) [f(\cdot) + L w(\cdot)] \Delta t \}
            = E\{ \partial V^*/\partial t + (\partial V^*/\partial x) [f(\cdot) + L w(\cdot)]
                  + (1/2) Tr( (\partial^2 V^*/\partial x^2) [f(\cdot) + L w(\cdot)] [f(\cdot) + L w(\cdot)]^T ) \Delta t \}
Toward the Stochastic HJB Equation

Because x(t) and u(t) can be measured, the first two terms can be taken outside the expectation:

    dV^*/dt = E\{ \partial V^*/\partial t + (\partial V^*/\partial x) [f(\cdot) + L w(\cdot)]
                  + (1/2) Tr( (\partial^2 V^*/\partial x^2) [f(\cdot) + L w(\cdot)] [f(\cdot) + L w(\cdot)]^T ) \Delta t \}
            = \partial V^*/\partial t + (\partial V^*/\partial x) f(\cdot) + E\{ (\partial V^*/\partial x) L w(\cdot) \}
              + (1/2) E\{ Tr( (\partial^2 V^*/\partial x^2) [f(\cdot) + L w(\cdot)] [f(\cdot) + L w(\cdot)]^T ) \Delta t \}

The disturbance is assumed to be zero-mean white noise:

    E[w(t)] = 0  =>  E\{ (\partial V^*/\partial x) L w(\cdot) \} = 0
    E[w(t) w^T(\tau)] = W(t) \delta(t - \tau)  =>  E[w(t) w^T(\tau)] \Delta t -> W(t) as \Delta t -> 0

    dV^*/dt = \partial V^*/\partial t + (\partial V^*/\partial x) f(\cdot)
              + (1/2) \lim_{\Delta t -> 0} Tr\{ (\partial^2 V^*/\partial x^2) [ E(f(\cdot) f(\cdot)^T) \Delta t + L E(w(t) w^T(\tau)) L^T ] \}
            = \partial V^*/\partial t + (\partial V^*/\partial x)(t) f(\cdot) + (1/2) Tr[ (\partial^2 V^*/\partial x^2)(t) L(t) W(t) L^T(t) ]

The uncertain disturbance input can only increase the value function's rate of change.
Stochastic Principle of Optimality (Perfect Measurements)

    dV^*/dt = \partial V^*/\partial t + (\partial V^*/\partial x)(t) f(\cdot) + (1/2) Tr[ (\partial^2 V^*/\partial x^2)(t) L(t) W(t) L^T(t) ]

- Substitute for the total derivative, dV*/dt = -L(x*, u*)
- Solve for the partial derivative, \partial V^*/\partial t
- Expected value of the HJB equation -> stochastic HJB equation:

    -\partial V^*/\partial t = \min_u E\{ L[x^*(t), u(t), t] + (\partial V^*/\partial x)(t) f[x^*(t), u(t), t]
                                          + (1/2) Tr[ (\partial^2 V^*/\partial x^2)(t) L(t) W(t) L^T(t) ] \}
                             = \min_u \{ L[x^*(t), u(t), t] + (\partial V^*/\partial x)(t) f[x^*(t), u(t), t] \}
                               + (1/2) Tr[ (\partial^2 V^*/\partial x^2)(t) L(t) W(t) L^T(t) ]

Boundary (terminal) condition:  V^*(t_f) = E\{ \phi[x(t_f)] \}

Observations of the Stochastic Principle of Optimality (Perfect Measurements)

    -\partial V^*/\partial t = \min_u E\{ L[x^*(t), u(t), t] + (\partial V^*/\partial x)(t) f[x^*(t), u(t), t]
                                          + (1/2) Tr[ (\partial^2 V^*/\partial x^2)(t) L(t) W(t) L^T(t) ] \}

- Control has no effect on the disturbance input
- The criterion for optimality is the same as for the deterministic case
- Disturbance uncertainty increases the magnitude of the total optimal value function, E[V*(0)]
Adaptive Critic Controller

The nonlinear control law, c, takes the general form

    u(t) = c[x(t), a, y^*(t)]

where
    x(t): state
    a: parameters defining an operating point
    y^*(t): command input

On-line adaptive critic controller:
- Nonlinear control law ("action network")
- Criticizes non-optimal performance via "critic network"
- Adapts control gains to improve performance, respond to failures, and accommodate parameter variation
- Adapts cost model to improve performance evaluation
Adaptive Critic Iteration Cycle

Ferrari, Stengel, "Model-Based Adaptive Critic Designs," in Learning and Approximate Dynamic Programming, Si et al., eds., 2004

Overview of Adaptive Critic Designs
Modular Description of Typical Iteration Cycle

Gain-Scheduled Controller Design

PI-LQ controllers (LQR with integral compensation) satisfy requirements at N operating points. The scheduling variable, a, identifies each operating point:

    u(t_k) = C_F(a_i) y^*(t_k) + C_B(a_i) \Delta x(t_k) + C_I(a_i) \int_{t_{k-1}}^{t_k} \Delta y(\tau) d\tau
           = c[x(t_k), a_i, y^*(t_k)],  i = 1, ..., N
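The PI-LQ control law above can be sketched in a few lines. This is a minimal illustration; the gain matrices and signal values below are hypothetical placeholders, not gains from the lecture.

```python
import numpy as np

def pi_lq_control(CF, CB, CI, y_cmd, dx, dy_int):
    """PI-LQ control law: u = C_F y* + C_B dx + C_I int(dy) dt,
    with the gains evaluated at one operating point a_i."""
    return CF @ y_cmd + CB @ dx + CI @ dy_int

# Hypothetical gains and signals at a single operating point
CF = np.array([[1.0]])           # forward (command) gain
CB = np.array([[-2.0, -0.5]])    # state-feedback gain
CI = np.array([[0.8]])           # integral gain
u = pi_lq_control(CF, CB, CI, np.array([1.0]),
                  np.array([0.1, 0.0]), np.array([0.05]))
```

In a gain-scheduled design, a lookup on the scheduling variable a would select (CF, CB, CI) before each control update.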
Replace Gain Matrices by Neural Networks

Replace the control gain matrices by sigmoidal neural networks:

    u(t_k) = NN_F[y^*(t_k), a(t_k)] + NN_B[x(t_k), a(t_k)] + NN_I[ \int \Delta y(\tau) d\tau, a(t_k) ]
           = c[x(t_k), a, y^*(t_k)]

Sigmoidal Neuron Input-Output Characteristic

Logistic sigmoid function:

    u = s(r) = \frac{1}{1 + e^{-r}}

or

    u = s(r) = \frac{e^r - 1}{e^r + 1} = \tanh\frac{r}{2}

Sigmoid with two inputs, one output:

    u = s(r) = \frac{1}{1 + e^{-(w_1 r_1 + w_2 r_2 + b)}}
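A minimal numerical sketch of the sigmoid functions above, using only the Python standard library; the weights in the two-input neuron are arbitrary illustrative values, not trained parameters.

```python
import math

def logistic(r):
    """Logistic sigmoid: s(r) = 1 / (1 + e^(-r))."""
    return 1.0 / (1.0 + math.exp(-r))

def bipolar(r):
    """Bipolar sigmoid: s(r) = (e^r - 1) / (e^r + 1) = tanh(r/2)."""
    return (math.exp(r) - 1.0) / (math.exp(r) + 1.0)

def neuron(r1, r2, w1=0.5, w2=-1.0, b=0.2):
    """Two-input, one-output sigmoidal neuron (illustrative weights)."""
    return logistic(w1 * r1 + w2 * r2 + b)

# The bipolar sigmoid is exactly tanh(r/2):
assert abs(bipolar(0.7) - math.tanh(0.35)) < 1e-12
```

Either squashing function keeps the neuron output bounded, which is what makes sigmoidal networks smooth interpolators between the scheduled gain sets.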
Algebraically Trained Neural Control Law

Algebraic training of neural networks produces an exact fit of the PI-LQ control gains and trim conditions at N operating points:
- Interpolation and gain scheduling via neural networks
- One node per operating point in each neural network
See Ferrari, Stengel, JGCD, 2002 [on Blackboard]

On-line Optimization of Adaptive Critic Neural Network Controller

The critic adapts neural network weights to improve performance using approximate dynamic programming.
Ferrari, Stengel, JGCD, 2004
Dual Heuristic Dynamic Programming Adaptive Critic

Dual heuristic dynamic programming adaptive critic for the receding-horizon optimization problem:

    V[x_a(t_k)] = L[x_a(t_k), u(t_k)] + V[x_a(t_{k+1})]
    \partial V/\partial u = \partial L/\partial u + (\partial V/\partial x_a)(\partial x_a/\partial u) = 0
    \partial V/\partial x_a [x_a(t)] = NN_C[x_a(t), a(t)]

- Critic and Action (i.e., Control) networks adapted concurrently
- LQ-PI cost function applied to the nonlinear problem
- Modified resilient backpropagation for neural network training on the 2nd time step
Ferrari, Stengel, JGCD, 2004

Action Network On-line Training

Train the action network, at time t, holding the critic parameters fixed.
[Block diagram: x_a(t) and a(t) feed NN_A; the aircraft model and transition matrices produce a state prediction and utility function derivatives; NN_C and the optimality condition generate the NN_A training target.]
Critic Network On-line Training

Train the critic network, at time t, holding the action parameters fixed.
[Block diagram: x_a(t) and a(t) feed NN_A; the aircraft model and transition matrices produce a state prediction and utility function derivatives; the old NN_C and the target cost gradient generate the NN_C training target.]

Adaptive Critic Flight Simulation

- 70-deg bank angle command
- Multiple failures (thrust, stabilator, rudder)
Ferrari, Stengel, JGCD, 2004
Stochastic Linear-Quadratic Optimal Control

Stochastic Principle of Optimality Applied to the Linear-Quadratic (LQ) Problem

Quadratic value function:

    V(t_o) = E\{ \phi[x(t_f)] + \int_{t_o}^{t_f} L[x(\tau), u(\tau)] d\tau \}
           = \frac{1}{2} E\{ x^T(t_f) S(t_f) x(t_f)
             + \int_{t_o}^{t_f} \begin{bmatrix} x^T(t) & u^T(t) \end{bmatrix}
               \begin{bmatrix} Q(t) & M(t) \\ M^T(t) & R(t) \end{bmatrix}
               \begin{bmatrix} x(t) \\ u(t) \end{bmatrix} dt \}

Linear dynamic constraint:

    \dot{x}(t) = F(t) x(t) + G(t) u(t) + L(t) w(t)
Components of the LQ Value Function

The quadratic value function has two parts:

    V(t) = \frac{1}{2} x^T(t) S(t) x(t) + v(t)

Certainty-equivalent value function:

    V_{CE}(t) = \frac{1}{2} x^T(t) S(t) x(t)

Stochastic value function increment:

    v(t) = \frac{1}{2} \int_t^{t_f} Tr[ S(\tau) L(\tau) W(\tau) L^T(\tau) ] d\tau

Value Function Gradient and Hessian

Certainty-equivalent value function:

    V_{CE}(t) = \frac{1}{2} x^T(t) S(t) x(t)

Gradient with respect to the state:

    \partial V/\partial x (t) = x^T(t) S(t)

Hessian with respect to the state:

    \partial^2 V/\partial x^2 (t) = S(t)
Linear-Quadratic Stochastic Hamilton-Jacobi-Bellman Equation (Perfect Measurements)

Certainty-equivalent plus stochastic terms:

    -\partial V^*/\partial t = \min_u \frac{1}{2} E\{ x^{*T} Q x^* + 2 x^{*T} M u + u^T R u
                                                     + 2 x^{*T} S (F x^* + G u) + Tr(S L W L^T) \}
                             = \min_u \frac{1}{2} [ x^{*T} Q x^* + 2 x^{*T} M u + u^T R u
                                                    + 2 x^{*T} S (F x^* + G u) ] + \frac{1}{2} Tr(S L W L^T)

Terminal condition:

    V(t_f) = \frac{1}{2} x^T(t_f) S(t_f) x(t_f)

Optimal Control Law (Perfect Measurements)

Differentiate the right side of the HJB equation with respect to u and set the result equal to zero:

    \partial(-\partial V^*/\partial t)/\partial u = x^T M + u^T R + x^T S G = 0

Solve for u, obtaining the feedback control law:

    u(t) = -R^{-1}(t) [ G^T(t) S(t) + M^T(t) ] x(t) \triangleq -C(t) x(t)
LQ Optimal Control Law (Perfect Measurements)

    u(t) = -R^{-1}(t) [ G^T(t) S(t) + M^T(t) ] x(t) = -C(t) x(t)

The zero-mean, white-noise disturbance has no effect on the structure and gains of the LQ feedback control law.

Matrix Riccati Equation for Control

Substitute the optimal control law in the HJB equation:

    -\frac{1}{2} x^T \dot{S} x - \dot{v} = \frac{1}{2} x^T [ Q - M R^{-1} M^T + (F - G R^{-1} M^T)^T S
                                           + S (F - G R^{-1} M^T) - S G R^{-1} G^T S ] x + \frac{1}{2} Tr(S L W L^T)

The matrix Riccati equation provides S(t):

    \dot{S}(t) = -[ Q(t) - M(t) R^{-1}(t) M^T(t) ] - [ F(t) - G(t) R^{-1}(t) M^T(t) ]^T S(t)
                 - S(t) [ F(t) - G(t) R^{-1}(t) M^T(t) ] + S(t) G(t) R^{-1}(t) G^T(t) S(t),
    S(t_f) = \phi_{xx}(t_f)

with the optimal control law

    u(t) = -R^{-1}(t) [ G^T(t) S(t) + M^T(t) ] x(t)

- The stochastic value function increases the cost due to the disturbance
- However, its calculation is independent of the Riccati equation:

    \dot{v}(t) = -\frac{1}{2} Tr[ S(t) L(t) W(t) L^T(t) ]
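The Riccati solution is easy to compute numerically. A minimal sketch, assuming M = 0, time-invariant matrices, and a hypothetical double-integrator plant (not an example from the lecture): S(t) is integrated backward from S(t_f) with Euler steps, and the stochastic value-function increment v is accumulated along the way.

```python
import numpy as np

def lq_riccati_backward(F, G, Q, R, Sf, tf, n_steps, L=None, W=None):
    """Backward-Euler integration of the control Riccati equation (M = 0):
        dS/dt = -(Q + F^T S + S F - S G R^-1 G^T S),  S(tf) = Sf,
    returning S(0), the gain C(0) = R^-1 G^T S(0), and the stochastic
    increment v(0) = 0.5 * integral of Tr(S L W L^T) dt."""
    dt = tf / n_steps
    S = Sf.copy()
    Ri = np.linalg.inv(R)
    v = 0.0
    for _ in range(n_steps):
        if L is not None:
            v += 0.5 * np.trace(S @ L @ W @ L.T) * dt
        dS = -(Q + F.T @ S + S @ F - S @ G @ Ri @ G.T @ S)
        S = S - dS * dt          # step backward from tf toward 0
    C = Ri @ G.T @ S
    return S, C, v

# Hypothetical double-integrator example with disturbance on acceleration
F = np.array([[0.0, 1.0], [0.0, 0.0]])
G = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.array([[1.0]]); Sf = np.zeros((2, 2))
L = np.array([[0.0], [1.0]]); W = np.array([[0.1]])
S0, C0, v0 = lq_riccati_backward(F, G, Q, R, Sf, 10.0, 20000, L, W)
```

For this plant, Q = I, R = 1, the gain approaches the steady-state value C = [1, sqrt(3)] over a long horizon, and v0 > 0: the disturbance adds cost but leaves the gains unchanged, as the slide states.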
Evaluation of the Total Cost (Imperfect Measurements)

Stochastic quadratic cost function, neglecting cross terms:

    J = \frac{1}{2} E\{ x^T(t_f) S(t_f) x(t_f)
        + \int_{t_o}^{t_f} \begin{bmatrix} x^T(t) & u^T(t) \end{bmatrix}
          \begin{bmatrix} Q(t) & 0 \\ 0 & R(t) \end{bmatrix}
          \begin{bmatrix} x(t) \\ u(t) \end{bmatrix} dt \}
      = \frac{1}{2} Tr\{ S(t_f) E[x(t_f) x^T(t_f)]
        + \int_{t_o}^{t_f} ( Q(t) E[x(t) x^T(t)] + R(t) E[u(t) u^T(t)] ) dt \}

or

    J = \frac{1}{2} Tr\{ S(t_f) P(t_f) + \int_{t_o}^{t_f} [ Q(t) P(t) + R(t) U(t) ] dt \}

where

    P(t) \triangleq E[x(t) x^T(t)];  U(t) \triangleq E[u(t) u^T(t)]

Optimal Control Covariance

Optimal control vector:

    u(t) = -C(t) \hat{x}(t)

Optimal control covariance:

    U(t) = C(t) P(t) C^T(t) = R^{-1}(t) G^T(t) S(t) P(t) S(t) G(t) R^{-1}(t)
Express Cost with State and Adjoint Covariance Dynamics

Integration by parts expresses S(t_f)P(t_f) in terms of the initial condition:

    S(t) P(t) |_{t_o}^{t_f} = S(t_f) P(t_f) - S(t_o) P(t_o) = \int_{t_o}^{t_f} [ \dot{S}(t) P(t) + S(t) \dot{P}(t) ] dt

    S(t_f) P(t_f) = S(t_o) P(t_o) + \int_{t_o}^{t_f} [ \dot{S}(t) P(t) + S(t) \dot{P}(t) ] dt

Rewrite the cost function to incorporate the initial cost:

    J = \frac{1}{2} Tr\{ S(t_o) P(t_o) + \int_{t_o}^{t_f} [ Q(t) P(t) + R(t) U(t) + \dot{S}(t) P(t) + S(t) \dot{P}(t) ] dt \}

Evolution of State and Adjoint Covariance Matrices (No Control)

    u(t) = 0;  U(t) = 0

State covariance response to random disturbance and initial uncertainty:

    \dot{P}(t) = F(t) P(t) + P(t) F^T(t) + L(t) W(t) L^T(t),  P(t_o) given

Adjoint covariance response to state weighting and terminal cost weight:

    \dot{S}(t) = -F^T(t) S(t) - S(t) F(t) - Q(t),  S(t_f) given
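The state-covariance equation above is a linear (Lyapunov) differential equation and is straightforward to propagate numerically. A minimal Euler sketch with an illustrative stable scalar system, whose steady-state covariance w/(2a) is known in closed form:

```python
import numpy as np

def propagate_covariance(F, L, W, P0, tf, n_steps):
    """Forward-Euler propagation of the open-loop state covariance:
        dP/dt = F P + P F^T + L W L^T,  P(t_o) = P0."""
    dt = tf / n_steps
    P = P0.copy()
    for _ in range(n_steps):
        P = P + (F @ P + P @ F.T + L @ W @ L.T) * dt
    return P

# Stable scalar example: dp/dt = -2 a p + w has steady state p = w / (2 a)
a, w = 1.0, 0.4
P = propagate_covariance(np.array([[-a]]), np.array([[1.0]]),
                         np.array([[w]]), np.array([[0.0]]), 10.0, 10000)
```

After ten time constants, P should be close to w/(2a) = 0.2, showing how disturbance intensity and system damping set the stationary state uncertainty.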
Evolution of State and Adjoint Covariance Matrices (Optimal Control)

State covariance response to random disturbance and initial uncertainty (dependent on S(t)):

    \dot{P}(t) = [ F(t) - G(t) C(t) ] P(t) + P(t) [ F(t) - G(t) C(t) ]^T + L(t) W(t) L^T(t)

Adjoint covariance response to state weighting and terminal cost weight (independent of P(t)):

    \dot{S}(t) = -F^T(t) S(t) - S(t) F(t) - Q(t) + S(t) G(t) R^{-1}(t) G^T(t) S(t)

Total Cost With and Without Control

With no control:

    J_{no\,control} = \frac{1}{2} Tr\{ S(t_o) P(t_o) + \int_{t_o}^{t_f} S(t) L(t) W(t) L^T(t) dt \}

With optimal control, the equation for the cost is the same:

    J_{optimal\,control} = \frac{1}{2} Tr\{ S(t_o) P(t_o) + \int_{t_o}^{t_f} S(t) L(t) W(t) L^T(t) dt \}

... but the evolutions of S(t) and S(t_o) are different in each case.
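The equivalence of the two cost expressions can be checked numerically for the no-control case: (1/2)Tr{S(t_f)P(t_f) + integral of Q P} should equal (1/2)Tr{S(t_o)P(t_o) + integral of S L W L^T}. A sketch with an illustrative stable scalar system (Euler integration, so agreement holds only to discretization accuracy):

```python
import numpy as np

def costs_no_control(F, L, W, Q, Sf, P0, tf, n):
    """Evaluate the no-control cost two ways, using
        dP/dt = F P + P F^T + L W L^T  (forward) and
        dS/dt = -F^T S - S F - Q       (backward),
    and return (Ja, Jb) for the terminal-based and initial-based forms."""
    dt = tf / n
    # Integrate S backward from tf and store its history
    S_hist = [None] * (n + 1)
    S = Sf.copy(); S_hist[n] = S.copy()
    for k in range(n, 0, -1):
        S = S + (F.T @ S + S @ F + Q) * dt
        S_hist[k - 1] = S.copy()
    # Integrate P forward, accumulating both integrals
    P = P0.copy()
    J1 = 0.0                       # integral of Tr(Q P) dt
    J2 = 0.0                       # integral of Tr(S L W L^T) dt
    for k in range(n):
        J1 += np.trace(Q @ P) * dt
        J2 += np.trace(S_hist[k] @ L @ W @ L.T) * dt
        P = P + (F @ P + P @ F.T + L @ W @ L.T) * dt
    Ja = 0.5 * (np.trace(Sf @ P) + J1)       # terminal-based form
    Jb = 0.5 * (np.trace(S_hist[0] @ P0) + J2)  # initial-based form
    return Ja, Jb

Ja, Jb = costs_no_control(np.array([[-1.0]]), np.array([[1.0]]),
                          np.array([[0.5]]), np.array([[1.0]]),
                          np.array([[0.0]]), np.array([[0.2]]), 5.0, 5000)
```

For these illustrative values the analytic cost is about 0.6125, and the two numerical forms agree to the Euler step size, as the integration-by-parts identity requires.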
Information Sets and Expected Cost

The Information Set, I

- Sigma algebra (Wikipedia definitions):
  - The collection of sets over which a measure is defined
  - The collection of events that can be assigned probabilities
  - A measurable space
- Information available at the current time, t_1:
  - All measurements from the initial time, t_o
  - All control commands from the initial time

    I[t_o, t_1] = \{ z[t_o, t_1], u[t_o, t_1] \}

- Plus available model structure, parameters, and statistics:

    I[t_o, t_1] = \{ z[t_o, t_1], u[t_o, t_1], f(\cdot), Q, R, ... \}
A Derived Information Set, I_D

- Measurements may be directly useful, e.g.,
  - Displays
  - Simple feedback control
- ... or they may require processing, e.g.,
  - Transformation
  - Estimation
- Example of a derived information set: the history of the mean and covariance from a state estimator

    I_D[t_o, t_1] = \{ \hat{x}[t_o, t_1], P[t_o, t_1], u[t_o, t_1] \}

Additional Derived Information Sets

- Markov derived information set: the most current mean and covariance from a state estimator

    I_{MD}(t_1) = \{ \hat{x}(t_1), P(t_1), u(t_1) \}

- Multiple-model derived information set: parallel estimates of the current mean, covariance, and hypothesis probability mass function

    I_{MM}(t_1) = \{ \hat{x}_A(t_1), P_A(t_1), u(t_1), Pr(H_A), \hat{x}_B(t_1), P_B(t_1), u(t_1), Pr(H_B), ... \}
Required and Available Information Sets for Optimal Control

- Optimal control requires propagation of information back from the final time
- Hence, it requires the entire information set, extending from t_o to t_f: I[t_o, t_f]
- Separate the information set into knowable and future sets:

    I[t_o, t_f] = I[t_o, t_1] + I(t_1, t_f]

- The knowable set has been received
- The future set is based on current knowledge of the model, current estimates, and statistics of future uncertainties

Expected Values of State and Control

Expected values of the state and control are conditioned on the information set:

    E[x(t) | I_D] = \hat{x}(t)
    E\{ [x(t) - \hat{x}(t)][x(t) - \hat{x}(t)]^T | I_D \} = P(t)

... where the conditional expected values are estimates from an optimal filter.
Dependence of the Stochastic Cost Function on the Information Set

    J = \frac{1}{2} E\{ E[ Tr( S(t_f) x(t_f) x^T(t_f) ) | I_D ]
        + \int_{t_0}^{t_f} E[ Tr( Q x(t) x^T(t) ) ] dt + \int_{t_0}^{t_f} E[ Tr( R u(t) u^T(t) ) ] dt \}

Expand the state covariance:

    P(t) = E\{ [x(t) - \hat{x}(t)][x(t) - \hat{x}(t)]^T | I_D \}
         = E\{ x(t) x^T(t) - \hat{x}(t) x^T(t) - x(t) \hat{x}^T(t) + \hat{x}(t) \hat{x}^T(t) | I_D \}
         = E\{ x(t) x^T(t) | I_D \} - \hat{x}(t) \hat{x}^T(t)

or

    E\{ x(t) x^T(t) | I_D \} = P(t) + \hat{x}(t) \hat{x}^T(t)

... where the conditional expected values are obtained from an optimal filter.

Certainty-Equivalent and Stochastic Incremental Costs

    J = \frac{1}{2} E\{ Tr[ S(t_f) ( P(t_f) + \hat{x}(t_f) \hat{x}^T(t_f) ) ]
        + \int_{t_0}^{t_f} Tr[ Q ( P(t) + \hat{x}(t) \hat{x}^T(t) ) ] dt
        + \int_{t_0}^{t_f} Tr[ R u(t) u^T(t) ] dt \} \triangleq J_{CE} + J_S

The cost function has two parts, a certainty-equivalent cost and a stochastic increment cost:

    J_{CE} = \frac{1}{2} E\{ Tr[ S(t_f) \hat{x}(t_f) \hat{x}^T(t_f) ]
             + \int_{t_0}^{t_f} Tr[ Q \hat{x}(t) \hat{x}^T(t) ] dt + \int_{t_0}^{t_f} Tr[ R u(t) u^T(t) ] dt \}

    J_S = \frac{1}{2} E\{ Tr[ S(t_f) P(t_f) ] + \int_{t_0}^{t_f} Tr[ Q P(t) ] dt \}
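The expansion E{x x^T | I_D} = P + x-hat x-hat^T can be verified by Monte Carlo sampling of a Gaussian state estimate; the mean and covariance below are arbitrary illustrative values.

```python
import numpy as np

# Numerical check of E[x x^T | I_D] = P + xhat xhat^T for a
# Gaussian conditional density with illustrative parameters.
rng = np.random.default_rng(0)
xhat = np.array([1.0, -2.0])                     # conditional mean
Pcov = np.array([[0.5, 0.1], [0.1, 0.3]])        # conditional covariance
x = rng.multivariate_normal(xhat, Pcov, size=200_000)

# Sample second moment E[x x^T], averaged over all draws
second_moment = (x[:, :, None] * x[:, None, :]).mean(axis=0)
identity = Pcov + np.outer(xhat, xhat)
err = np.abs(second_moment - identity).max()
```

This is the step that splits the cost into its certainty-equivalent part (from x-hat x-hat^T) and its stochastic increment (from P).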
Expected Cost of the Trajectory

Optimized cost function:

    V^*(t_o) \triangleq J^* = E\{ \phi[x^*(t_f)] + \int_{t_o}^{t_f} L[x^*(\tau), u^*(\tau)] d\tau \}

Law of total expectation, E_\xi[\xi] = E[ E(\xi | I) ]:

    E[J^*] = E( J^* | I[t_o, t_1] ) Pr\{ I[t_o, t_1] \} + E( J^* | I(t_1, t_f] ) Pr\{ I(t_1, t_f] \}

Because the past is established at t_1, Pr{I[t_o, t_1]} = 1:

    E[J^*] = E( J^* | I[t_o, t_1] ) + E( J^* | I(t_1, t_f] ) Pr\{ I(t_1, t_f] \}

Expected Cost of the Trajectory

For planning:

    E[J^*] = E( J^* | I(t_0, t_f) ) Pr\{ I(t_0, t_f) \}

For post-trajectory analysis (with perfect measurements), Pr{I(t_0, t_f)} = 1:

    E[J^*] = E( J^* | I(t_0, t_f) )

For real-time control at t_1:

    E[J^*] = E( J^* | I(t_0, t_1) )  [known]  +  E( J^* | I(t_1, t_f) ) Pr\{ I(t_1, t_f) \}  [estimated]
Dual Control (Fel'dbaum, 1965)

- Nonlinear system
- Uncertain system parameters to be estimated
- Parameter estimation can be aided by test inputs
- Approach: minimize a value function with three increments: nominal, cautious, and probing control

    \min_u V^* = \min_u ( V^*_{nominal} + V^*_{cautious} + V^*_{probing} )

Estimation and control calculations are coupled and necessarily recursive.

Next Time: Linear-Quadratic-Gaussian Regulators