Information Theoretic views of Path Integral Control.

Size: px

Start display at page:

Download "Information Theoretic views of Path Integral Control."

Douglas Rodgers
5 years ago
Views:

1 Information Theoretic views of Path Integral Control. Evangelos A. Theodoro and Emannel Todorov Abstract We derive the connections of Path IntegralPI and Klback-LieblerKL control as presented in machine learning and robotics commnities 4 with earlier work in controls theory on the fndamental dalities between relative entropy and free energy and the logarithmic transformations of diffsions processes 5 0. Or analysis provides an information theoretic view of PI stochastic optimal control based on the dality between Free Energy and Relative Entropy as expressed by the Legendre-Fenchel transformation in Statistical Mechanics. Finally we discss the cases of partial observability and min-max control as stdied in controls and conclde. Introdction Recent developments in nonlinear stochastic optimal control and machine learning sggest new ways to solve optimization problems for dynamical systems in continos state action spaces. The mathematical analysis is based on the qantm-mechanical concept of PI. The PI formalism provides an nified view of Newtonian and Qantm mechanics. In stochastic optimal control theory, path integrals can be sed to represent soltions of partial differential eqations. Here we provide the information theoretic view of path integral control and show its connection to mathematical developments in control theory. The Feynman-Kac lemma will play an essential role in or analysis since it creates bridges between cost fnctions sbject to stochastic differential eqations and partial differential eqations. This paper is organized as follows, in section 2 we show the dalities between free energy and relative entropy. In section 3 we derive PI control based on these dalities and discss the cases of min-max control and partial observability. In section 4 we derive PI based on the Bellman principle for continos and discrete time. In the last section 5 we compare the different approaches to PI control and conclde. 2 Free Energy and Relative Entropy Dalities In this section we derive the fndamental dality relationships between free energy and relative entropy 9. This relationship is important for the derivation of stochastic optimal control. Let Z, Z denote a measrable space and PZ the corresponding probability measre defined on the measrable space. For or analysis we consider the following definitions. Definition : Let P PZ and the fnction J x : Z R be a measrable fnction. Then the term: E J x = log exp ρj xdp is called free energy of J x with respect to P. Definition 2: Let P PZ and Q PZ, the relative entropy of P with respect to Q is defined as: { log H Q P = dp if Q << P and log dp L + otherwise

2 We will also consider the objective fnction: ξx = ρ J E x = ρ 0 log E τ i exp ρj x with J x = φx tn + t N qxds the state dependent cost. The objective fnction above takes the form ξx = E τ 0 i J + ρ 2 V ar J as ρ 0. This form allows s to get the basic intition for constrcting sch objective fnctions. Essentially for small ρ the coss a fnction of the mean the variance. When ρ > 0 the cost fnction is risk sensitive while for ρ < 0 is risk seeking. To derive the basic relationship between free energy and relative entropy we express the expectation E τ 0 i taken nder the measre P as a fnction of the expectation E taken nder the probability measre. More precisely will have: E 0 exp ρj x = exp ρj xdp = exp ρj x dp By taking the logarithm of both sides of the eqations above and making se of the Jensen s ineqality we will have: log E τ 0 i exp ρj x = log exp ρj x dp log exp ρj x dp. We mltiply the ineqality above with ρ for case of ρ < 0 or ρ = and ths we have: ξx = E J x E J x + H Q P where E J x = J x. The ineqality above gives s the dality relationship between relative entropy and free energy. Essentially one cold define the following two minimization problems: The infimm in 2 is attained at Q given by: E J x = inf E J x + Q H Q P = exp J xdp 3 exp J xdp When ρ > 0 the ineqality in becomes from to and the inf in 2 and becomes sp. Therefore we will have that: E J x = sp E J x Q H Q P 4 In the next section we show how ineqality 2 is transformed to a stochastic optimal control problem for the case of markov diffsion processes. 3 Information theoretic view of stochastic optimal control. We consider the ncontrolled and controlled stochastic dynamics of the form: dx = fxdt + Bxdw 0 t and dx = fxdt + Bx dt + dw t with x t R n denoting the state of the system, Bx, t : R n R R n n is the control and diffsions matrix, fx, t : R n R R n the passive dynamics, t R n the control vector and dw R p brownian noise. Notice that the difference between the two diffsions above is on the controls. These controls together with the passive dynamics define a new drift term. For or analysis here we assme B exists. Expectations evalated on trajectories generated by the controlled dynamics and ncontrolled dynamics are represented as E 0 and E respectively. The corresponding probability measres of the aforementioned expectations are P and Q. We contine or analysis with the main resln dp and the definition of the Radon-Nikodým derivative: dp = exp ζ and = exp ζ where according to Girsanov s theorem 2 adapted to the diffsion processes considered here, the 2 2

3 term ζ is expressed as: ζ = 2 t N T dt + t N T dw t. Sbstittion of the Radon-Nikodým derivative into 2 gives the following reslt: ξx = 0 log E exp J x E J x + ζ = E J x + tn T dt 2 5 The right term of the ineqality above corresponds to the cost fnction of a stochastic optimal control problem thas bonded from below by the free energy. Besides providing a lower bond on the objective fnction of the stochastic optimal control problem ineqality 5 expresses also how this lower bond shold be compted. This comptation involves forward sampling of the ncontrolled dynamics, evalation of the expectation of the exponentiated state depended part φx tn and qx t and the logarithmic transformation of this expectation. Srprisingly, ineqality 5 was derived withot relying on any principle of optimality. It only takes the application of Girsanov theorem between controlled and ncontrolled stochastic dynamics and the se of dal relationship between free energy and relative entropy to find the lower bond in 5. Essentially ineqality 5 defines a minimization problem in which the right part of the ineqality is minimized with respect ζ and therefore with respect to control. At the minimm, when = then the right part of the ineqality in 5 reaches its optimal ξx. An important qestion to is the link between 5 and the dynamic programming principle. To find this link the next step is to show that ξx satisfies the HJB eqations and therefore is the corresponding vale fnction. More precisely, we introdce a new variable Φx, t defined as Φx, t = E 0 exp ρj x. The Feynman-Kac lemma 3 tells s that this fnction satisfies the backward Chapman Kolmogorov PDE. Therefore the following eqation holds: t Φ = ρq 0 Φ + f T x Φ + 2 tr xx ΦBB T 6 For ρ = < 0 and since ξx = ρ log Φx, t = log Φx, t we will have that tφ = Φ t ξ, x Φ = Φ x ξ and xx Φ = Φ xx ξ + 2 Φ x ξ x ξ T it can be shown that ξx satisfies the nonlinear PDE: t ξ = q 0 + x ξ T f 2 xξ T BB T x ξ + 2 tr xx ξbb T 7 The nonlinear PDEs above corresponds to the HJB eqation 4 for the case of the minimizing optimal control problem with control weight R = I and therefore, ξx is the corresponding minimizing vale fnction. Note than order to derive the PDEs above we did not se any principle of optimality. Similar reslts can be derived for ρ = > 0 as shown in 6. The analysis so far is smmarized by the following corollary in which we se the fnction signx = x < 0 and signx = x > 0. More precisely we will have: Corollary Consider the expectation operators E 0, E evalated on state trajectories sampled according to ncontrolled and controlled dynamics respectively. The fnction ξx, t specified as: ξx, t = signρ log E 0 exp signρj x is the vale fnction of the stochastic optimal control problems: ξx, = min E t N qx 2 T dt, ρ > 0 and ξx, = max E t N qx + 2 T dt, ρ < Information theoretic view of stochastic optimal control: The case of partial observability. In the partial observable case 9, besides the stochastic dynamics there are also the observation diffsions: 3

4 dy = hxdt + Cxdv 0 t and dy = hxdt + Cx bdt + Cxdv t 8 The term y denotes the observations, b plays the role of control that act on observations, and dv 0 and dv 0 is the observation noise nder the ncontrolled and controlled measrement dynamics. In this case the Radon-Nikodým derivative is expressed as dp = exp ζ with ζ = 2 t N T + b T b dt + t N T dw t + b T dv t. Under the observation dynamics above the free energy is the lower bond of the following expression: ξx = 0 log E exp J x E J x + tn T + b T b dt 9 2 The expectation E 0 is with respect to the process and observations noise dw 0 t, dv 0 t while the expectation E is with respect to dw t, dv t. Again the free energy is the lower bond on a cost thas typically fond in stochastic optimal control. The partial observable case incldes controls in the observations. 3.2 Information theoretic view of stochastic optimal control: The case of min-max optimal control. We consider stochastic dynamics : dx = fxdt + Bx dt + dw 0 t and cost fnction Sx = E 0 exp Lx, dt = E exp 0 t N Lx, dt. Next we define free energy as follows E Lx, = log exp ρlx, dp and se the Legendre transformation that leads to maximization problem: ELx, = sp E Lx,. H Q P By taking into accont then stochastic dynamics and the Radon-Nikodým derivative we will have: inf inf log log exp ρlx, dp = inf sp Q exp ρlx, dp = inf sp π E Lx, E Lx, 2 H Q P tn π T πdt 0 where E is the expectation on trajectories generated based on the diffsion dx = fxdt + Bx dt + πdt + dw t. The term π can be thoght as a destabilizing controller. Eqation 0 is a short bt elegant way to show the eqivalence of min-max control and differential game theory with risk sensitivity. Essentially all that takes is an appropriate definition of free energy and the se of the Legendre transformation. Notice that no dynamic programming argments were sed in this analysis. 4 Derivation based on Bellman Principle: The continos case. We consider stochastic optimal control in the classical sense, as a constrained optimization problem, with the cost fnction nder minimization given by the mathematical expression: V x = min E Jx, = min E tn to Lx,, tdt sbject to the nonlinear stochastic dynamics: dx = Fx, dt + Bxdw with x R n denoting the state of the system, R p the control vector and dw R p brownian noise. 4

5 The fnction Fx, is a nonlinear fnction of the state x and affine in controls and therefore is defined as Fx, = fx + Gx. The matrix Gx R n p is the control matrix, Bx R n p is the diffsion matrix and fx R n are the passive dynamics. The cost fnction Jx, is a fnction of states and controls. Under the optimal controls the cost fnction is eqal to the vale fnction V x. The term Lx,,t is the rnning cost and is expressed as: Lx,, t = q 0 x, t + q x, t + 2 T R. Essentially, the rnning cost has three terms, the first q 0 x t, t is a state-dependent cost, the second term depends on states and controls and the third is the control cost with the term R > 0 the corresponding weight. The stochastic HJB eqation 5, 4 associated with this stochastic optimal control problem is expressed as follows: t V = min L + x V T F + 2 tr xx V BB T. The corresponding optimal control is given by the eqation: x t = R q x, t + Gx T x V x, t. These optimal controls will psh the system dynamics in the direction opposite that of the gradient of the vale fnction x V x, t. The vale fnction satisfies nonlinear, second-order PDE: t V = q + x V T f 2 xv T GR G T x V + 2 tr xx V BB T with qx, t and fx, t defined as qx, t = q 0 x, t 2 q x, t T R q x, t and fx, t = fx, t Gx, tr q x, t and the bondary condition V x tn = φx tn. Given the exponential transformation V x, t = λ log Ψx, t and the assmption λgxr Gx T = BxBx T = Σx t = Σ the reslting PDE is formlated as follows: t Ψ = λ qψ + f T x Ψ + 2 tr xxψσ 2 with bondary condition: Ψxt N = exp λ φxt N. By applying the Feynman-Kac lemma to the Chapman-Kolmogorov PDE 2 yields its soltion in form of an expectation over system trajectories. This soltion is mathematically expressed as: tn Ψ x ti = E 0 exp λ qxdt Ψx tn The expectation E 0 is taken on sample paths generated with the forward sampling of the ncontrolled diffsion eqation dx = fx t δt + Bxdw. The optimal controls are specified as: P I x = R q x, t λgx T xψx,t. Since, the initial vale the fnction V x, t is Ψx,t the minimm of the expectation of the objective fnction Jx,, it can be trivially shown that: 3 tn V x, = λ log E 0 exp λ qxdt Ψx tn E Jx, 4 Note that the ineqality above in similar to 5 when the following eqations hold: q x = 0, R = I, λ =, G = B, B = B. The first three eqalities garantee that Jx, = tn J x ρ T dt are identical, and the last two eqalities make sre that the expectations are evalated nder the same diffsions and therefore E 0 E 0 and E E. Under the conditions above the Kolmogorov PDEs 6 and 2 and the HJB eqations and 7 are identical. 4. Derivation based on Bellman Principle: The discrete case. In the KL control framework 3,4,5 the analysis starts with the application of the Bellman principle of optimality on Markov Decision Processes MDP and nder the rnning cost specified as a sm of a state depended term and the Kllback Leibler Divergence between the transition densities of the controlled and ncontrolled dynamics. In particlar, the rnning coss specified as L x, = qx + H Q P = qx + E log px x, px x. The transition probabilities nder the controlled and ncontrolled dynamics are represented as px x, and px x. Application of the Bellman principle of optimality reslts in the minimization of the qantity: 5

6 V t x = min qx + E log px x, + V t+ x U px x Depending on the stochastic optimal control problem wx is eqal to V x, αv x, V t+ x. For or presentation here we choose wx = V t+ x that corresponds to finite horizon case. The dependent terms in the fnctional above are minimized and ths we will have that: E log px x, px + V t+ x = E px x, log x px x exp V t+ x For the prposes the normalization term GΦx is introdced with Φx = exp wx being the desirability fnction defined as GΦx = px xφx = E Φx 0, we will have that: E log x x px x + V t+x = log GΦx + H px x, px xφx GΦx Sbstittion of the expression above into the Bellman minimization eqation reslts in: min U qx log GΦx + H x x px xφx GΦx. The minimm of the Bellman eqation is attained by: p x x, = px xφx GΦx. The last eqation provides the transition probability nder the optimal control law and in that sense it the optimal transition probability. Clearly the optimal distribtion above is identical to eqations 3. Sbstittion of the optimal distribtion above will resln the Bellman eqation: Φx = exp qxgφx. The link with the continos case is established by writing the Bellman eqation for an MDP with continos state space Φ δt x = exp qxδtgφ δt x. Rearrangement of the terms reslts in: exp qxδt Φ δt x = E 0 Φ δt x Φ δt x. Under the limit δ 0 the eqation reslts the backward Chaplman Kolmogorov PDE in 2 for ρ =. 5 Discssion We show the connection of path integral control framework as presented in the machine learning and robotic commnities 3, 6 8 with work in the control theoretic commnity on risk sensitivity 5, 6, 8, 9. Essentially there are two methodological approaches to derive the path integral framework. In the first, stochastic optimal control is specified as minimization of the objective E Jx, sbject to the controlled dynamics. The HJB PDE is derived based on the Bellman principle of optimality. The exponential transformation of the vale fnction V x and the connection between control cost and variance resln the transformation of the HJB in to the backward Chapman Kolmogorov. The Feynman-Kac lemma is applied and the soltion of the Chapman Kolmogorov PDE together with the lower bond on the objective fnction are provided. The second methodological approach starts with the dality between free energy and relative entropy and the reslting optimization problem as expressed in 2. For diffsion processes affine in control and noise and nder the se of Girsanov s theorem, the aforementioned optimization reslts in formlating the bond ξx of the objective fnction E Jx, which is typically fond in stochastic optimal control. The link to Bellman optimality is established by showing that, this bond ξx satisfies the HJB eqation and therefore is a vale fnction. Inside the class of the stochastic dynamics of markov diffsion processes affine in control and noise, Dynamic Programming is more general since incorporates general cost fnctions and stochastic dynamics. This generalization however, is redced by the assmption regarding control cost and the variance of the noise λgxr Gx T = BxBx T. In the second approach the lower bond ξx of the accmlated trajectory cost E Jx, is derived withot relying on the Bellman Principle. In fact, this lower bond defines a new form of optimality which, as is shown in 5,6 as well as in this work, for the case of diffsion processes is eqivalent to the Bellman principle of optimality. In the KL stochastic optimal control framework the derivation relies on the Bellman Principle of Optimality in discrete time. The reslting distribtion 6

7 p k x x, is optimal since is the distribtion that reslts when actions are optimal. In that sense the KL framework, in its initial formlation 3 does not explicitly provide an optimal control law bnstead it provides the optimal distribtion or optimal transition probability nder the se of optimal control law. For the case of control affine diffsions, KL control framework incorporates control-only and state-only depended terms in contrast to PI derived based on the Bellman principle in which cross terms between controls and states may be considered. The compositionally of optimal controls incldes control laws and ths it can incorporate any analytically derived optimal control as well as PI control. The information theoretic view and in particlar the se of the Legendre transformation and the dality between free energy and relative is an elegant way to derive many important reslts in control theory and machine learning. Frthermore, it allows for generalizations in the sense thaneqality 5 holds for more general models of stochasticity sch as the case of jmp diffsions. References H. J. Kappen, Path integrals and symmetry breaking for optimal control theory, Jornal of Statistical Mechanics: Theory and Experiment, vol., p. P0, H. J. Kappen, An introdction to stochastic control theory, path integrals and reinforcement learning, in Cooperative Behavior in Neral Systems J. Marro, P. L. Garrido, and J. J. Torres, eds., vol. 887 of American Institte of Physics Conference Series, pp. 49 8, Feb E. Todorov, Efficient comptation of optimal actions, Proc Natl Acad Sci U S A, vol. 06, no. 28, pp , E. Todorov, Linearly-solvable markov decision problems, in Advances in Neral Information Processing Systems 9 NIPS 2007 B. Scholkopf, J. Platt, and T. Hoffman, eds., Vancover, BC, Cambridge, MA: MIT Press, W. H. Fleming and H. M. Soner, Controlled Markov processes and viscosity soltions. Applications of mathematics, New York: Springer, 2nd ed., W. H. Fleming and H. M. Soner, Controlled Markov processes and viscosity soltions. Applications of mathematics, New York: Springer, nd ed., W. Fleming, Exit probabilities and optimal stochastic control, Applied Math. Optim, vol. 9, pp , W. H. Fleming and W. M. McEneaney, Risk-sensitive control on an infinite time horizon, SIAM J. Control Optim., vol. 33, pp , November P. Dai Pra, L. Meneghini, and W. Rnggaldier, Connections between stochastic control and dynamic games, Mathematics of Control, Signals, and Systems MCSS, vol. 9, no. 4, pp , S. K. Mitter and N. J. Newton, A variational approach to nonlinear estimation, SIAM J. Control Optim., vol. 42, pp , May H. Tochette, The large deviation approach to statistical mechanics, Physics Reports, vol. 478, pp. 69, I. Karatzas and S. E. Shreve, Brownian Motion and Stochastic Calcls Gradate Texts in Mathematics. Springer, 2nd ed., Agst A. Friedman, Stochastic Differential Eqations And Applications. Academic Press, R. F. Stengel, Optimal control and estimation. Dover books on advanced mathematics, New York: Dover Pblications, E. Todorov, Compositionality of optimal control laws, In Advances in Neral Information Processing Systems, vol. 22, pp , E. Theodoro, J. Bchli, and S. Schaal, A generalized path integral approach to reinforcement learning, Jornal of Machine Learning Research, no., pp , E. Theodoro, Iterative Path Integral Stochastic Optimal Control: Theory and Applications to Motor Control. PhD thesis, niversity of sothern California, May J. Bchli, E. Theodoro, F. Stlp, and S. Schaal, Variable impedance control - a reinforcement learning approach, in Robotics: Science and Systems Conference RSS,

Real Time Stochastic Control and Decision Making: From theory to algorithms and applications

Real Time Stochastic Control and Decision Making: From theory to algorithms and applications Evangelos A. Theodorou Autonomous Control and Decision Systems Lab Challenges in control Uncertainty Stochastic