Dynamic Consistency for Stochastic Optimal Control Problems


Dynamic Consistency for Stochastic Optimal Control Problems. Cadarache Summer School CEA/EDF/INRIA 2012. Pierre Carpentier, Jean-Philippe Chancelier, Michel De Lara (SOWG), June 2012.

Lecture outline: Introduction; Distributed formulation; SOC with constraints; Reduction to finite-dimensional problem.

Informal definition of dynamically consistent problems. For a sequence of dynamic optimization problems, we aim at discussing a notion of consistency over time. At the very first time step t_0, formulate an optimization problem that yields optimal decision rules for all the forthcoming time steps t_0, t_1, ..., T. At the next time step t_1, formulate a new optimization problem starting at time t_1 that yields a new sequence of optimal decision rules. This process can be continued until the final time T is reached. A family of optimization problems is said to be dynamically consistent if the optimal strategies obtained when solving the original problem remain optimal for all subsequent problems.

Informal definition of dynamically consistent problems: decision at time t_0. [Figure: state-versus-time diagram showing that the decision u_{0,0} depends on decisions u^1_{0,k}, u^2_{0,k}, u^3_{0,k} computed at later times t_k for possible future states.]

Informal definition of dynamically consistent problems: time-consistent decision at time t_k. [Figure: state-versus-time diagram with decisions u_{0,0}, u_{0,k}, u_{k,k}.] The decision u_{k,k}, computed using a problem formulated at time t_k knowing where I am at time t_k, and the decision u_{0,k}, computed at time t_0 and to be applied at time t_k for the same position, should be the same.

Informal definition of dynamically consistent problems: an example, the deterministic shortest path. If A-B-C-D-E is optimal, then C-D-E is optimal: arriving at C, if I behave rationally, I should not deviate from the route originally chosen starting at A. [Figure: shortest-path graph with nodes A (t=0), B (t=1), C, D (t=2), E.]

Dynamic Programming and dynamically consistent problems (1). Dynamic Programming gives dynamically consistent controls. It gives a family of problems indexed by time: the solution of the problem at time t uses the solution of the problem at time t+1:

V_{t_k}(x) = min_u E[ L_{t_k}(x, u, W_{t_{k+1}}) + V_{t_{k+1}}( f_{t_k}(x, u, W_{t_{k+1}}) ) ].

Choose the control at time t_k by optimizing the sum of a current cost and of a cost-to-go. The cost-to-go has an interpretation as the optimal cost function of a problem at time t_{k+1}.

Dynamic Programming and dynamically consistent problems (2). The problem at time t_{k+1} is the following:

V_{t_{k+1}}(x) = min_{X,U} E[ Σ_{t=t_{k+1}}^{T-1} L_t(X_t, U_t, W_{t+1}) + K(X_T) | X_{t_{k+1}} = x ],
s.t. X_{t+1} = f_t(X_t, U_t, W_{t+1}), t = t_{k+1}, ..., T-1,
     U_t measurable w.r.t. X_t, t = t_{k+1}, ..., T-1.
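To make the backward recursion concrete, here is a minimal Python sketch for a finite-state, finite-control, finite-noise instance; the horizon T, the sizes nx, nu, nw and the arrays f, L, K are hypothetical data invented for the illustration, not objects from the lecture.

import numpy as np

rng = np.random.default_rng(0)
T = 5                                           # horizon (illustrative)
nx, nu, nw = 4, 2, 3                            # hypothetical numbers of states, controls, noise values
p_w = np.array([0.2, 0.5, 0.3])                 # law of W_{t+1}, independent over time
f = rng.integers(0, nx, size=(T, nx, nu, nw))   # f[t, x, u, w]: next state (toy dynamics)
L = rng.random(size=(T, nx, nu, nw))            # L[t, x, u, w]: instantaneous cost
K = rng.random(nx)                              # final cost K(x)

V = np.zeros((T + 1, nx))
policy = np.zeros((T, nx), dtype=int)
V[T] = K
for t in reversed(range(T)):
    # Q[x, u] = E[ L_t(x, u, W) + V_{t+1}(f_t(x, u, W)) ]
    Q = np.einsum('xuw,w->xu', L[t] + V[t + 1][f[t]], p_w)
    policy[t] = Q.argmin(axis=1)
    V[t] = Q.min(axis=1)
# V[k] is also the value function of the subproblem starting at time k, which is why
# the feedback policy computed once at t_0 remains optimal for all subsequent problems.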

Deterministic case. [Figure: state-versus-time diagram with decisions u_{0,0}, u_{0,k}, u_{k,k}.] Exact state match at time t_k.

Deterministic case. [Figure: state-versus-time diagram.] Independence of the initial condition.

Deterministic case. [Figure: state-versus-time diagram.] Taking care of all possible initial states at time t_k through state feedback.

A first example in the deterministic case (1). Problem (D_{t_0}):

min_{u_{t_0},...,u_{T-1}, x_{t_0},...,x_T} Σ_{t=t_0}^{T-1} L_t(x_t, u_t) + K(x_T),
subject to x_{t+1} = f_t(x_t, u_t), x_{t_0} given.

Suppose a solution to this problem exists:
u*_{t_0}, ..., u*_{T-1}: controls indexed by the time t only, not by the state variable;
x*_{t_0}, x*_{t_1}, ..., x*_T: optimal path for these controls.
No need for more information since the model is deterministic. But these controls depend on the hidden parameter x_{t_0}, and they are usually not optimal for x'_{t_0} ≠ x_{t_0}.

A first example in the deterministic case (2). Consider the natural subsequent problems (D_{t_i}) for every t_i ≥ t_0:

min_{u_{t_i},...,u_{T-1}, x_{t_i},...,x_T} Σ_{t=t_i}^{T-1} L_t(x_t, u_t) + K(x_T),
subject to x_{t+1} = f_t(x_t, u_t), x_{t_i} given.

One makes the two following observations.
1. Independence of the initial condition. In the very particular case where the solution to Problem (D_{t_i}) does not depend on x_{t_i}, the Problems {D_{t_i}}_{t_i} are dynamically consistent.
2. True deterministic world. Suppose that the initial condition for Problem (D_{t_i}) is given by x_{t_i} = f_{t_{i-1}}(x*_{t_{i-1}}, u*_{t_{i-1}}) (exact model); then the Problems {D_{t_i}}_{t_i} are dynamically consistent. Otherwise, adding disturbances to the problem brings inconsistency.

A first example in the deterministic case (3). Solve now Problem (D_{t_0}) using Dynamic Programming (DP): φ*_{t_0}, ..., φ*_{T-1} are feedback controls depending on both the time t and the state x. The following result is a direct application of the DP principle.
3. Right amount of information. Suppose that one is looking for strategies as feedback functions φ_t depending on the state x. Then the Problems {D_{t_i}}_{t_i} are dynamically consistent.
As a first conclusion, time consistency is recovered provided we let the decision rules depend upon a sufficiently rich information.

Independence of the initial condition (1). Consider, for every t = t_0, ..., T-1, functions l_t : U_t → R and g_t : U_t → R, and assume that x_t is scalar. Let K be a scalar constant and consider the following deterministic optimal control problem:

min_{x,u} Σ_{t=t_0}^{T-1} l_t(u_t) x_t + K x_T,
s.t. x_{t_0} given, x_{t+1} = g_t(u_t) x_t, t = t_0, ..., T-1.

Independence of the initial condition (2). The variables x_t can be recursively replaced using the dynamics g_t, leading to:

min_u Σ_{t=t_0}^{T-1} l_t(u_t) g_{t-1}(u_{t-1}) ... g_{t_0}(u_{t_0}) x_{t_0} + K g_{T-1}(u_{T-1}) ... g_{t_0}(u_{t_0}) x_{t_0}.

The optimal cost is linear with respect to x_{t_0}. Suppose that x_{t_0} only takes positive values; then x_{t_0} has no influence on the minimizer. This remains true at subsequent time steps provided the dynamics are such that x_t remains positive at every time step, so the dynamic consistency property holds true. This example may look very special, but we will see later on that it is analogous to familiar SOC problems.
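The independence claim can be checked numerically. Below is a small brute-force sketch in Python; the functions l and g, the control grid U and the constant K are illustrative choices (assumptions), and the printout shows that the minimizing control sequence is the same for two different positive initial conditions.

import itertools

T = 3
U = [0.0, 0.5, 1.0]                        # finite control grid (illustrative)
l = lambda t, u: 1.0 + t - u               # hypothetical l_t(u)
g = lambda t, u: 1.0 + 0.1 * t + 0.5 * u   # hypothetical g_t(u) > 0, keeps x_t positive
K = 2.0

def cost(u_seq, x0):
    x, total = x0, 0.0
    for t, u in enumerate(u_seq):
        total += l(t, u) * x
        x = g(t, u) * x
    return total + K * x

for x0 in (1.0, 5.0):
    best = min(itertools.product(U, repeat=T), key=lambda seq: cost(seq, x0))
    print(x0, best)   # the minimizing sequence is the same for both initial conditions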

The guiding principle of the lecture: to obtain dynamic consistency, let the decision rules depend upon a sufficiently rich information set.

Outline
  Introduction: Informal definition of dynamically consistent problems; Dynamic Programming and dynamically consistent problems; Deterministic case
  Distributed formulation: A classical SOC problem; Equivalent distributed formulation
  SOC with constraints: Constraints typology; A toy example with finite probabilities; Back to time consistency
  Reduction to finite-dimensional problem: An equivalent problem with added state and control; Dynamic programming

A classical SOC problem. Control variables U = (U_t)_{t=t_0,...,T-1}; noise variables W = (W_t)_{t=t_1,...,T}. Markovian setting: the random variables X_{t_0}, W_{t_1}, ..., W_T are independent. The problem (S_{t_0}) starting at t_0 writes:

min_{X,U} E[ Σ_{t=t_0}^{T-1} L_t(X_t, U_t, W_{t+1}) + K(X_T) ],   (1)
s.t. X_{t_0} given,
     X_{t+1} = f_t(X_t, U_t, W_{t+1}), t = t_0, ..., T-1,
     U_t measurable w.r.t. (X_{t_0}, W_{t_1}, ..., W_t), t = t_0, ..., T-1.

We know that there is no loss of optimality in looking for the optimal strategy as a state feedback control U_t = φ_t(X_t).

Stochastic optimal control: the classical case (2). Problem (S_{t_0}) can be solved using Dynamic Programming:

V_T(x) = K(x),
V_t(x) = min_{u ∈ U} E[ L_t(x, u, W_{t+1}) + V_{t+1}( f_t(x, u, W_{t+1}) ) ].

It is clear from inspecting the DP equation that the optimal strategies {φ*_t}_{t ≥ t_0} remain optimal for the subsequent optimization problems (S_{t_i}):

min_{U_{t_i},...,U_{T-1}, X_{t_i},...,X_T} E[ Σ_{t=t_i}^{T-1} L_t(X_t, U_t, W_{t+1}) + K(X_T) ],
subject to: X_{t_i} given,
            X_{t+1} = f_t(X_t, U_t, W_{t+1}),
            U_t measurable w.r.t. (X_{t_i}, W_{t_i+1}, ..., W_t).

An equivalent distributed formulation. Let Ξ_t be a linear space of R-valued measurable functions on X_t, and Υ_t a space of signed measures on X_t. We consider a dual pairing between Ξ_t and Υ_t through the bilinear form

⟨ξ, µ⟩ = ∫_{X_t} ξ(x) dµ(x), for ξ ∈ Ξ_t and µ ∈ Υ_t.

When X_t is a random variable taking values in X_t and distributed according to µ_t, then ⟨ξ_t, µ_t⟩ = E[ ξ_t(X_t) ].

Fokker-Planck equation: dynamics of the state probability law. For a given feedback law φ_t : X_t → U_t, we define T_t^{φ_t} : Ξ_{t+1} → Ξ_t by

(T_t^{φ_t} ξ_{t+1})(x) := E[ ξ_{t+1}(X_{t+1}) | X_t = x ], for all x ∈ X_t.

Using the state dynamics and the Markovian setting, we obtain

(T_t^{φ_t} ξ_{t+1})(x) = E[ ξ_{t+1}( f_t(x, φ_t(x), W_{t+1}) ) ], for all x ∈ X_t.

Denoting by (T_t^{φ_t})* the adjoint operator of T_t^{φ_t}, that is, ⟨T_t^{φ_t} ξ, µ⟩ = ⟨ξ, (T_t^{φ_t})* µ⟩, the state probability law driven by the chosen feedback follows:

µ_{t+1} = (T_t^{φ_t})* µ_t, t = t_0, ..., T-1, µ_{t_0} given.
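On a finite state space these operators are just matrices, which the following Python sketch illustrates; the 3-state transition matrix P, the function xi and the law mu are made-up values for the example.

import numpy as np

P = np.array([[0.7, 0.3, 0.0],       # row i: law of X_{t+1} given X_t = i under a fixed feedback phi
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
xi = np.array([1.0, 4.0, 9.0])       # a function xi_{t+1} on the state space
mu = np.array([0.5, 0.3, 0.2])       # law mu_t of X_t

T_xi = P @ xi                        # (T^phi xi)(x) = E[ xi(X_{t+1}) | X_t = x ]
mu_next = P.T @ mu                   # Fokker-Planck step: mu_{t+1} = (T^phi)* mu_t
assert np.isclose(T_xi @ mu, xi @ mu_next)   # duality <T^phi xi, mu> = <xi, (T^phi)* mu>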

Revisiting the expected cost. Next we introduce the function Λ_t^{φ_t} : X_t → R,

Λ_t^{φ_t}(x) := E[ L_t(x, φ_t(x), W_{t+1}) ],

meant to be the expected cost at time t for each possible state value when the feedback function φ_t is applied. Using Λ_t^{φ_t} we have:

E[ L_t(X_t, φ_t(X_t), W_{t+1}) ] = E[ Λ_t^{φ_t}(X_t) ] = ⟨Λ_t^{φ_t}, µ_t⟩.

Equivalent deterministic infinite-dimensional problem. We obtain an equivalent deterministic infinite-dimensional optimal control problem (D_{t_0}):

min_{Φ,µ} Σ_{t=t_0}^{T-1} ⟨Λ_t^{Φ_t}, µ_t⟩ + ⟨K, µ_T⟩,
s.t. µ_{t_0} given,
     µ_{t+1} = (T_t^{Φ_t})* µ_t, t = t_0, ..., T-1.

This is very similar to the "Independence of the initial condition" example.

Solving (D_{t_0}) using Dynamic Programming (1).

V_T(µ) = ⟨K, µ⟩,
V_{T-1}(µ) = min_{φ_{T-1}} ⟨Λ_{T-1}^{φ_{T-1}}, µ⟩ + V_T( (T_{T-1}^{φ_{T-1}})* µ ).

The optimal feedback Γ_{T-1} : µ ↦ φ^µ a priori depends on both x and µ. Using the adjoint relation ⟨K, (T_{T-1}^{φ})* µ⟩ = ⟨T_{T-1}^{φ} K, µ⟩, we get

V_{T-1}(µ) = min_{φ} ⟨Λ_{T-1}^{φ} + T_{T-1}^{φ} K, µ⟩ = min_{φ} ∫_X ( Λ_{T-1}^{φ} + T_{T-1}^{φ} K )(x) µ(dx).

Solving (D_{t_0}) using Dynamic Programming (2). Interchanging the minimization and integration operators leads to:
the optimal Γ*_{T-1} does not depend on µ: Γ*_{T-1} ≡ φ*_{T-1};
V_{T-1} again depends on µ in a multiplicative (linear) manner.
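The following sketch illustrates, on made-up finite data (the cost table Lambda, the transition matrices P and the final cost K are assumptions), that minimizing ⟨Λ^φ + T^φ K, µ⟩ over feedbacks reduces to a pointwise minimization in x, so the optimal feedback is the same for every law µ with positive weights.

import itertools
import numpy as np

nx, nu = 3, 2
Lambda = np.array([[1.0, 2.0],       # Lambda[x, u] = E[ L_{T-1}(x, u, W_T) ]
                   [3.0, 0.5],
                   [2.0, 2.5]])
P = np.array([[[0.6, 0.4, 0.0],      # P[u][i, j] = P(X_T = j | X_{T-1} = i, u)
               [0.1, 0.8, 0.1],
               [0.0, 0.3, 0.7]],
              [[0.9, 0.1, 0.0],
               [0.5, 0.5, 0.0],
               [0.2, 0.3, 0.5]]])
K = np.array([0.0, 1.0, 5.0])

cost_xu = Lambda + np.stack([P[u] @ K for u in range(nu)], axis=1)  # (Lambda^u + T^u K)(x)
pointwise_feedback = cost_xu.argmin(axis=1)

for mu in (np.array([0.8, 0.1, 0.1]), np.array([0.2, 0.3, 0.5])):
    best = min(itertools.product(range(nu), repeat=nx),
               key=lambda phi: mu @ cost_xu[np.arange(nx), list(phi)])
    assert np.array_equal(np.array(best), pointwise_feedback)  # same feedback for every mu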

Outline
  Introduction: Informal definition of dynamically consistent problems; Dynamic Programming and dynamically consistent problems; Deterministic case
  Distributed formulation: A classical SOC problem; Equivalent distributed formulation
  SOC with constraints: Constraints typology; A toy example with finite probabilities; Back to time consistency
  Reduction to finite-dimensional problem: An equivalent problem with added state and control; Dynamic programming

Constraints in stochastic optimal control. Different kinds of constraints arise in stochastic optimization:
almost-sure constraint: g(X_T) ≤ a, P-a.s.;
chance constraint: P( g(X_T) ≤ a ) ≥ p;
expectation constraint: E[ g(X_T) ] ≤ a; ...
A chance constraint can be modelled as an expectation constraint:

P( g(X_T) ≤ a ) = E[ 1_{X^{ad}}(X_T) ], with X^{ad} = { x ∈ X : g(x) ≤ a }.

Chance constraints bring both theoretical and numerical difficulties, especially regarding convexity [Prékopa, 1995]. However, the difficulty we are interested in is common to chance and expectation constraints. In the sequel, we concentrate on adding an expectation constraint.

Probability constraint on state trajectories: P{ X_t ≥ x_t, t ∈ [0,T] } ≥ a. [Figure: water level trajectories over time against a minimal water level.]

A joint probability constraint on the trajectory transformed into an expectation constraint on the final state, E[ g(X_T) ] ≥ a. Motivated by a joint probability constraint:

P{ γ_t(X_t) ≥ b_t, t = t_0, ..., T } ≥ a.

Introducing a new binary state variable Y_t ∈ {0,1}:

Y_{t_0} = 1, Y_{t+1} = Y_t · 1{ γ_{t+1}(X_{t+1}) ≥ b_{t+1} },

we obtain E[ Y_T ] = P{ γ_t(X_t) ≥ b_t, t = t_0, ..., T }, so the joint probability constraint rewrites as E[ Y_T ] ≥ a.
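A quick Monte Carlo sketch of the binary-state construction; the random-walk state, the identity choice for γ_t and the zero thresholds b_t are hypothetical, chosen only for illustration.

import numpy as np

rng = np.random.default_rng(2)
T, n = 5, 200_000
b = np.zeros(T + 1)                                          # illustrative thresholds b_t
X = np.cumsum(rng.normal(size=(n, T + 1)), axis=1) + 1.0     # hypothetical state trajectories
Y = np.ones(n)
for t in range(1, T + 1):
    Y *= (X[:, t] >= b[t])                                   # Y_{t+1} = Y_t * 1{ X_{t+1} >= b_{t+1} }
joint = np.all(X[:, 1:] >= b[1:], axis=1)
print(Y.mean(), joint.mean())                                # E[Y_T] and P(all constraints hold) coincide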

Problem setting with an expectation constraint. Control variables U = (U_t)_{t=t_0,...,T-1}; noise variables W = (W_t)_{t=t_1,...,T}. Markovian setting: the random variables X_{t_0}, W_{t_1}, ..., W_T are independent. The problem starting at t_0 writes:

min_{X,U} E[ Σ_{t=t_0}^{T-1} L_t(X_t, U_t, W_{t+1}) + K(X_T) ],
s.t. X_{t_0} given,
     X_{t+1} = f_t(X_t, U_t, W_{t+1}),
     U_t measurable w.r.t. (X_{t_0}, W_{t_1}, ..., W_t), t = t_0, ..., T-1,
     E[ g(X_T) ] ≤ a, a ∈ R.

Problem at time t_i:

min_{X,U} E[ Σ_{t=t_i}^{T-1} L_t(X_t, U_t, W_{t+1}) + K(X_T) ],
s.t. X_{t+1} = f_t(X_t, U_t, W_{t+1}), X_{t_i} given,
     U_t measurable w.r.t. (X_{t_i}, W_{t_i+1}, ..., W_t), t = t_i, ..., T-1,
     E[ g(X_T) ] ≤ a.

This family of problems is not dynamically consistent with the usual state variable; with a new appropriate state variable, one regains dynamic consistency. The optimal strategy at time t is still a feedback Φ_{t_0,t} : X → U, since the information structure is not modified by the last constraint.

Equivalent deterministic infinite-dimensional problem (with constraint). We obtain an equivalent deterministic infinite-dimensional optimal control problem:

min_{Φ,µ} Σ_{t=t_0}^{T-1} ⟨Λ_t^{Φ_t}, µ_t⟩ + ⟨K, µ_T⟩ + χ_{ {⟨g,·⟩ ≤ a} }(µ_T),
s.t. µ_{t_0} given,
     µ_{t+1} = (T_t^{Φ_t})* µ_t, t = t_0, ..., T-1,

where the characteristic function χ encodes the final constraint ⟨g, µ_T⟩ ≤ a. This last constraint implies that the optimal feedback laws depend on the initial condition µ_{t_0}, and we do not have dynamic consistency.

A toy example with finite probabilities. Outline
  Introduction: Informal definition of dynamically consistent problems; Dynamic Programming and dynamically consistent problems; Deterministic case
  Distributed formulation: A classical SOC problem; Equivalent distributed formulation
  SOC with constraints: Constraints typology; A toy example with finite probabilities; Back to time consistency
  Reduction to finite-dimensional problem: An equivalent problem with added state and control; Dynamic programming

Simple case study: a discrete controlled Markov chain (1). The state space is X = {1,2,3}, the discrete possible values of a reservoir level; x_min = 1 (resp. x_max = 3) is the lower (resp. upper) level of the reservoir. The control action u takes values in U = {0,1}, the value 1 corresponding to using some given water release to produce some electricity. The noise variable (the stochastic inflow) takes its values in W = {-1, 0, +1}, the discrete probability law of each noise variable W_t being characterized by the weights {π_-, π_0, π_+}. Finally, the reservoir dynamics is:

X_{t+1} = min( x_max, max( x_min, X_t - U_t + W_{t+1} ) ).

Simple case study: a discrete controlled Markov chain (2). Let µ_t be the discrete probability law associated with the state random variable X_t, the initial probability law µ_{t_0} being given. In such a discrete case, it is easy to compute the transition matrix P^u giving the Markov chain transitions for each possible value u of the control, with the following interpretation:

P^u_{ij} = P( X_{t+1} = j | X_t = i, U_t = u ) = P{ min( x_max, max( x_min, i - u + W_{t+1} ) ) = j }.

Simple case study: a discrete controlled Markov chain (3). We obtain for the reservoir problem:

P^0 = [ π_- + π_0   π_+    0
        π_-         π_0    π_+
        0           π_-    π_0 + π_+ ],

P^1 = [ 1           0      0
        π_- + π_0   π_+    0
        π_-         π_0    π_+ ],

and we denote by P^u_i the i-th row of the matrix P^u.

Simple case study: a discrete controlled Markov chain (4). Let φ_t : X → U be a feedback law at time t, and denote by Φ_t the set of admissible feedbacks at time t. In our discrete case, card Φ_t = (card U)^{card X} = 2^3 = 8 for all t. The transition matrix P^{φ_t} associated with such a feedback is obtained by properly selecting rows of the transition matrices P^u, namely:

P^{φ_t} = [ P^{φ_t(1)}_1
            P^{φ_t(2)}_2
            P^{φ_t(3)}_3 ].

Then the dynamics given by the Fokker-Planck equation writes

µ_{t+1} = (P^{φ_t})^T µ_t,

where the state µ_t involved in this dynamic equation is the three-dimensional column vector µ_t = (µ_{1,t}, µ_{2,t}, µ_{3,t})^T.
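The row selection and the Fokker-Planck step are easy to reproduce; in the Python sketch below the weights π_-, π_0, π_+, the feedback φ and the law µ_t are illustrative values, and the dynamics is the clipped reservoir recursion above.

import numpy as np

pi = {-1: 0.3, 0: 0.5, +1: 0.2}                # illustrative weights pi_-, pi_0, pi_+
states, controls = [1, 2, 3], [0, 1]

def transition(u):
    P = np.zeros((3, 3))
    for i in states:
        for w, p in pi.items():
            j = min(3, max(1, i - u + w))      # X_{t+1} = min(x_max, max(x_min, X_t - U_t + W_{t+1}))
            P[i - 1, j - 1] += p
    return P

P = {u: transition(u) for u in controls}
phi = {1: 1, 2: 1, 3: 0}                       # one of the 2^3 = 8 admissible feedbacks
P_phi = np.array([P[phi[i]][i - 1] for i in states])   # row selection
mu = np.array([0.2, 0.5, 0.3])                 # law mu_t of X_t
mu_next = P_phi.T @ mu                         # Fokker-Planck step: mu_{t+1} = (P^phi)^T mu_t
print(P[0], P[1], mu_next, sep="\n")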

A toy example with finite probabilities: the reservoir control problem. The cost at time t is supposed to be linear, equal to p_t U_t, p_t being a negative deterministic price; there is no final cost (K ≡ 0). The reservoir level at the final time T must be equal to x_max with a probability at least equal to π. The distributed formulation associated with the reservoir control problem is:

min_{ {φ_t ∈ Φ_t}_{t=t_0,...,T-1} } Σ_{t=t_0}^{T-1} ⟨p_t φ_t, µ_t⟩,   (6a)
subject to: µ_{t+1} = (P^{φ_t})^T µ_t, µ_{t_0} given,   (6b)
            ⟨1_{x_max}, µ_T⟩ ≥ π,   (6c)

with 1_{x_max}(x) = 1 if x = x_max, and 0 otherwise.

A toy example with finite probabilities: curse of dimensionality. The previous problem can be solved using DP on the law µ_t: the state is effectively two-dimensional, since µ_{1,t} + µ_{2,t} + µ_{3,t} = 1. Suppose now that the number of discrete reservoir levels is n rather than 3, with n big enough: the resolution of the distributed formulation suffers from the curse of dimensionality ((n-1)-dimensional state). The ultimate curse is to consider that the level takes values in the interval [x_min, x_max], so that the µ_t's are continuous probability laws.

Back to dynamical consistency (1). We can use Dynamic Programming: let V_t(µ_t) be the optimal cost of the problem starting at time t with initial condition µ_t. Then

V_T(µ) = ⟨K, µ⟩ if ⟨g, µ⟩ ≤ a, and +∞ otherwise,

and, for every t = t_0, ..., T-1 and every probability law µ on X:

V_t(µ) = min_{Φ_t} ⟨Λ_t^{Φ_t}, µ⟩ + V_{t+1}( (T_t^{Φ_t})* µ ).

The optimal feedback functions Φ*_t depend on µ_t. Consistency is recovered for strategies depending on X_t and µ_t.

Back to dynamical consistency (2). The optimal feedback functions Φ*_t depend on µ_t:

Φ*_t(µ) ∈ arg min_{Φ_t} ⟨Λ_t^{Φ_t}, µ⟩ + V_{t+1}( (T_t^{Φ_t})* µ ).

But the decision also depends on X_t because of the distributed formulation U_t = φ_t(X_t)! Consistency is recovered for strategies depending on X_t and µ_t, but the new state variable to consider, (X_t, µ_t), is an infinite-dimensional object!

Outline
  Introduction: Informal definition of dynamically consistent problems; Dynamic Programming and dynamically consistent problems; Deterministic case
  Distributed formulation: A classical SOC problem; Equivalent distributed formulation
  SOC with constraints: Constraints typology; A toy example with finite probabilities; Back to time consistency
  Reduction to finite-dimensional problem: An equivalent problem with added state and control; Dynamic programming

An equivalent problem with added state process Z and control V:

min_{U,V,X,Z} E[ Σ_{t=t_0}^{T-1} L_t(X_t, U_t, W_{t+1}) + K(X_T) ],
X_{t_0} = x_{t_0}, X_{t+1} = f_t(X_t, U_t, W_{t+1}),
Z_{t_0} = 0, Z_{t+1} = Z_t + V_t,
almost-sure final constraint: g(X_T) - Z_T ≤ a,
measurability constraints: U_t measurable w.r.t. F_t, V_t measurable w.r.t. F_{t+1},
additional constraints on the controls: E[ V_t | F_t ] = 0.

Equivalent problem (2). Let U_{t_0}, ..., U_{T-1} and X_{t_0}, ..., X_T satisfy the constraints of the initial problem. We define V_{t_0}, ..., V_{T-1} and Z_{t_0}, ..., Z_T as follows:

V_t = E[ g(X_T) | F_{t+1} ] - E[ g(X_T) | F_t ], Z_{t+1} = Z_t + V_t.

Noting that the σ-field F_{t_0} is the minimal (trivial) σ-field and that the random variable X_T is F_T-measurable, we have:

Z_T = g(X_T) - E[ g(X_T) ].

Hence, U, X, V and Z satisfy the set of constraints of the new problem.
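The construction of V and Z can be verified exactly on a tiny two-period model; in the sketch below the noise law, the function g and the map from the noises to X_T are hypothetical choices made only to check the identities Z_T = g(X_T) - E[g(X_T)] and E[V_{t_0} | F_{t_0}] = 0.

import numpy as np

W_vals, p = [0.0, 1.0], [0.5, 0.5]        # two-period binary noise (illustrative)
g = lambda x: (x - 1.0) ** 2              # hypothetical final constraint function
X_T = lambda w1, w2: w1 + w2              # toy final state as a function of the noises

Eg = sum(p[i] * p[j] * g(X_T(W_vals[i], W_vals[j])) for i in range(2) for j in range(2))
for i, w1 in enumerate(W_vals):
    Eg_given_F1 = sum(p[j] * g(X_T(w1, W_vals[j])) for j in range(2))
    V0 = Eg_given_F1 - Eg                 # F_1-measurable increment, E[V_0 | F_0] = 0
    for j, w2 in enumerate(W_vals):
        V1 = g(X_T(w1, w2)) - Eg_given_F1
        Z_T = V0 + V1                     # Z_{t_0} = 0, Z_{t+1} = Z_t + V_t
        assert np.isclose(Z_T, g(X_T(w1, w2)) - Eg)
assert np.isclose(sum(p[i] * (sum(p[j] * g(X_T(W_vals[i], W_vals[j])) for j in range(2)) - Eg)
                      for i in range(2)), 0.0)   # E[V_0] = 0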

Equivalent problem (3). Conversely, let U_{t_0}, ..., U_{T-1}, X_{t_0}, ..., X_T, V_{t_0}, ..., V_{T-1} and Z_{t_0}, ..., Z_T satisfying the constraints of the (V,Z)-problem be given. Then Z_T = Σ_{t=t_0}^{T-1} V_t, and thus we have:

E[ Z_T ] = Σ_{t=t_0}^{T-1} E[ E[ V_t | F_t ] ] = 0.

Taking expectations in the almost-sure constraint g(X_T) - Z_T ≤ a, it follows that E[ g(X_T) ] ≤ a, and therefore the processes U_{t_0}, ..., U_{T-1} and X_{t_0}, ..., X_T satisfy the constraints of the initial problem. To conclude, since the two problems share the same admissible sets and the same criterion to optimize, they are equivalent.

Equivalent problem (4).
1. Dynamic Programming will give feedbacks as functions of (X_t, Z_t).
2. The new problem is more intricate: added state and added control; V_t depends on W_{t+1} (hazard-decision); some new constraints on the controls have to be taken into account.
3. E[ g(X_T) | F_t ]: perception of the risk constraint at time t.
4. V_t = E[ g(X_T) | F_{t+1} ] - E[ g(X_T) | F_t ]: variation of this perception between time t and time t+1.
5. Z_t: cumulative variation over time of this perception of the risk constraint.

Dynamic programming. In order to use Dynamic Programming on problem (7), we first recall, on a simplified model, how Dynamic Programming works on a classical stochastic optimization problem:

min_{X,U} E[ Σ_{t=t_0}^{T-1} L_t(X_t, U_t, W_{t+1}) + K(X_T) ],
s.t. X_{t_0} given, X_{t+1} = f_t(X_t, U_t, W_{t+1}), U_t measurable w.r.t. F_t, t = t_0, ..., T-1.

Then we show the modifications induced by the hazard-decision control and by the constraint E[ V_t | F_t ] = 0.

Dynamic programming steps in the classical framework.
1. Rewrite the criterion using nested conditional expectations (here with t_0 = 0 and T = 2):

E[ E[ L_0(X_0, U_0) + E[ L_1(X_1, U_1) + K(X_2) | F_1 ] | F_0 ] ].

2. The minimization with respect to U_t and X_t enters up to the conditional expectation with respect to F_t.
3. Recursive computation from the innermost problem to the outermost one (t = t_0). The innermost one (t = T-1) is:

min_{U_{T-1} measurable w.r.t. F_{T-1}} E[ h(X_{T-1}, U_{T-1}, W_T) | F_{T-1} ],
with h(x, u, w) = L_{T-1}(x, u) + K( f_{T-1}(x, u, w) ).

Hazard-decision framework. Hazard-decision: U_t measurable w.r.t. F_{t+1}. The interchange of expectation and minimization can then be done one step further:

E[ min_{U_{T-1} measurable w.r.t. F_T} h(X_{T-1}, U_{T-1}, W_T) | F_{T-1} ],
with h(x, u, w) = L_{T-1}(x, u, w) + K( f_{T-1}(x, u, w) ).

The optimal feedback U*_{T-1} can be chosen as a (X_{T-1}, W_T)-measurable function.
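A short numerical illustration of this interchange, with a hypothetical cost table h[u, w] and noise law p_w (both invented for the example): letting the control see the noise can only decrease the optimal value.

import numpy as np

p_w = np.array([0.25, 0.5, 0.25])
h = np.array([[1.0, 4.0, 0.0],            # h[u, w] for a fixed state x (illustrative)
              [2.0, 1.0, 3.0]])
decision_hazard = (h @ p_w).min()         # choose u before seeing W: min_u E[h(u, W)]
hazard_decision = h.min(axis=0) @ p_w     # choose u after seeing W:  E[min_u h(u, W)]
print(decision_hazard, hazard_decision)
assert hazard_decision <= decision_hazard + 1e-12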

Hazard-decision framework with constraint. Control constraint: E[ U_t | F_t ] = 0. We are led to the following problem at time t = T-1:

min_{U_{T-1} measurable w.r.t. F_T} E[ h(X_{T-1}, U_{T-1}, W_T) | F_{T-1} ],
subject to: E[ U_{T-1} | F_{T-1} ] = 0.

Can the optimal feedback U*_{T-1} still be chosen as a (X_{T-1}, W_T)-measurable function? Yes, and the proof is actually available when F_{T-1} is generated by finitely valued random variables.
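Here is a small sketch of that constrained inner problem for a single state value, two equiprobable noise values and a scalar control; the cost h is a hypothetical choice, and the zero-mean constraint is enforced by eliminating one control value.

import numpy as np

p = np.array([0.5, 0.5])
h = lambda u, w: (u - w) ** 2 + 0.1 * u   # hypothetical cost for a fixed (x, z)
w_vals = np.array([-1.0, 1.0])

best = None
for u0 in np.linspace(-2.0, 2.0, 401):
    u1 = -p[0] * u0 / p[1]                # enforce E[U(W)] = p0*u0 + p1*u1 = 0
    val = p[0] * h(u0, w_vals[0]) + p[1] * h(u1, w_vals[1])
    if best is None or val < best[0]:
        best = (val, u0, u1)
print(best)   # optimal control is a W-measurable function with zero conditional mean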

Back to the equivalent problem. The problem mixes the hazard-decision control V and the decision-hazard control U. Initialization of the Bellman function at time T:

V_T(x, z) = K(x) + χ_{G^{ad}}(x, z), with χ_{G^{ad}} the characteristic function of G^{ad} = { (x, z) : g(x) - z ≤ a }.

Bellman function at time t:

V_t(x, z) = min_{u ∈ U} min_{V measurable w.r.t. W_{t+1}} E[ h(x, u, z, V, W_{t+1}) ],
subject to: E[ V ] = 0,
with h(x, u, z, v, w) := L_t(x, u, w) + V_{t+1}( f_t(x, u, w), z + v ).

The resulting optimal controls are feedbacks: U_t measurable w.r.t. (X_t, Z_t) and V_t measurable w.r.t. (X_t, Z_t, W_{t+1}).