Risk-Averse Control of Partially Observable Markov Systems
Workshop on Dynamic Multivariate Programming, Vienna, March 2018
Partially Observable Discrete-Time Models

Markov process: $\{X_t, Y_t\}$, $t = 1, \dots, T$, on the Borel state space $\mathcal{X} \times \mathcal{Y}$
The process $\{X_t\}$ is observable, while $\{Y_t\}$ is not observable
Control sets: $U_t : \mathcal{X} \rightrightarrows \mathcal{U}$, $t = 1, \dots, T$
Transition kernel: $P\big[(X_{t+1}, Y_{t+1}) \in C \mid x_t, y_t, u_t\big] = Q_t(x_t, y_t, u_t)(C)$
Costs: $Z_t = c_t(X_t, Y_t, U_t)$, $t = 1, \dots, T$

Two relevant filtrations:
$\mathcal{F}_t^{X,Y}$, defined by the full state process $\{X_t, Y_t\}$;
$\mathcal{F}_t^{X}$, defined by the observed process $\{X_t\}$

Space of costs: $\mathcal{Z}_t = \big\{ Z : \Omega \to \mathbb{R} \mid Z \text{ is } \mathcal{F}_t^{X,Y}\text{-measurable and bounded} \big\}$

Classical Problem:
$\min\; E\big[ c_1(X_1, Y_1, U_1) + c_2(X_2, Y_2, U_2) + \dots + c_T(X_T, Y_T, U_T) \big]$

Risk-Averse Problem:
$\min\; \rho_{1,T}\big( c_1(X_1, Y_1, U_1),\; c_2(X_2, Y_2, U_2),\; \dots,\; c_T(X_T, Y_T, U_T) \big)$
Dynamic Risk Measures

Probability space $(\Omega, \mathcal{F}, P)$ with filtration $\mathcal{F}_1 \subseteq \dots \subseteq \mathcal{F}_T \subseteq \mathcal{F}$
Adapted sequence of random variables (costs) $Z_1, Z_2, \dots, Z_T$
Spaces: $\mathcal{Z}_t$ of $\mathcal{F}_t$-measurable functions, and $\mathcal{Z}_{t,T} = \mathcal{Z}_t \times \dots \times \mathcal{Z}_T$

Dynamic Risk Measure
A sequence of conditional risk measures $\rho_{t,T} : \mathcal{Z}_{t,T} \to \mathcal{Z}_t$, $t = 1, \dots, T$.
Monotonicity condition: $\rho_{t,T}(Z) \le \rho_{t,T}(W)$ for all $Z, W \in \mathcal{Z}_{t,T}$ such that $Z \le W$
Local property: for all $A \in \mathcal{F}_t$, $\rho_{t,T}(\mathbf{1}_A Z) = \mathbf{1}_A\, \rho_{t,T}(Z)$
Time Consistency and Nested Representation

A dynamic risk measure $\{\rho_{t,T}\}_{t=1}^{T}$ is time-consistent if for all $1 \le t < T$,
$Z_t = W_t$ and $\rho_{t+1,T}(Z_{t+1}, \dots, Z_T) \le \rho_{t+1,T}(W_{t+1}, \dots, W_T)$
imply that $\rho_{t,T}(Z_t, \dots, Z_T) \le \rho_{t,T}(W_t, \dots, W_T)$

Define one-step mappings: $\rho_t(Z_{t+1}) = \rho_{t,T}(0, Z_{t+1}, 0, \dots, 0)$

Nested Decomposition Theorem
Suppose a dynamic risk measure $\{\rho_{t,T}\}_{t=1}^{T}$ is time-consistent, and
$\rho_{t,T}(0, \dots, 0) = 0$,
$\rho_{t,T}(Z_t, Z_{t+1}, \dots, Z_T) = Z_t + \rho_{t,T}(0, Z_{t+1}, \dots, Z_T)$.
Then for all $t$ we have the representation
$\rho_{t,T}(Z_t, \dots, Z_T) = Z_t + \rho_t\Big( Z_{t+1} + \rho_{t+1}\big( Z_{t+2} + \dots + \rho_{T-1}(Z_T) \big) \Big)$
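The nested representation can be evaluated by backward recursion. A minimal sketch on a hypothetical two-stage binary scenario tree, using Average Value at Risk as every one-step mapping $\rho_t$ (all numerical values are illustrative assumptions, not from the slides):

```python
import numpy as np

def avar(costs, probs, alpha):
    """Average Value at Risk of a finite cost distribution:
    AVaR_alpha(Z) = min_eta { eta + E[(Z - eta)_+] / alpha }.
    For finite distributions the minimum is attained at one of the values."""
    return min(eta + np.dot(probs, np.maximum(costs - eta, 0.0)) / alpha
               for eta in costs)

# Hypothetical two-stage binary tree: Z1 deterministic, Z2 and Z3 depend
# on up/down branches with equal probabilities.
z1 = 1.0
z2 = {"u": 2.0, "d": 0.5}
z3 = {"u": {"u": 3.0, "d": 1.0}, "d": {"u": 2.0, "d": 0.0}}
p = np.array([0.5, 0.5])
alpha = 0.5

# Nested evaluation: rho(Z1, Z2, Z3) = Z1 + rho_1( Z2 + rho_2(Z3) )
inner = {a: z2[a] + avar(np.array([z3[a]["u"], z3[a]["d"]]), p, alpha)
         for a in ("u", "d")}
total = z1 + avar(np.array([inner["u"], inner["d"]]), p, alpha)
print(total)  # → 6.0
```

The recursion runs from the leaves toward the root, exactly mirroring the nested formula: each node's value is its own cost plus the one-step risk of its children's values.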
Issues with General Theory in the Markov Setting

The probability measure $P$ and the processes $\{X_t\}$ and $\{Z_t\}$ depend on the policy $\pi$
We have to deal with a family of risk measures $\rho_{t,T}^{\pi}(\cdot)$
The values of the risk measures may depend on history, and Markov policies cannot be expected
The cost may not be observable

Motivating Example
$\mathcal{X} = \{0, 1\}$, $T = 2$, and $Z_t = Z_t(X_t)$ (the cost depends on the state). Consider the risk measure
$\rho_{2,2}(Z_2)(x_1, x_2) = Z_2(x_2)$
$\rho_{1,2}(Z_1, Z_2)(x_1) = Z_1(x_1) + Z_2(x_1)$ (assumes that $x_1$ will not change)
It is time-consistent and has the normalization, translation, and local properties. But since $\rho_{1,2}$ does not depend on the distribution of $X_2$ given $x_1$, it is useless for controlling Markov models; in fact, it is much worse than the expectation.
Conditional Risk Evaluators

Space of observable random variables:
$\mathcal{S}_t = \big\{ S : \Omega \to \mathbb{R} \mid S \text{ is } \mathcal{F}_t^{X}\text{-measurable and bounded} \big\}$, $t = 1, \dots, T$

A mapping $\rho_{t,T} : \mathcal{Z}_t \times \dots \times \mathcal{Z}_T \to \mathcal{S}_t$ is a conditional risk evaluator.
(i) It is monotonic if $Z_s \le W_s$ for all $s = t, \dots, T$ implies that $\rho_{t,T}(Z_t, \dots, Z_T) \le \rho_{t,T}(W_t, \dots, W_T)$;
(ii) It is normalized if $\rho_{t,T}(0, \dots, 0) = 0$;
(iii) It is translation equivariant if for all $(Z_t, \dots, Z_T) \in \mathcal{S}_t \times \mathcal{Z}_{t+1} \times \dots \times \mathcal{Z}_T$,
$\rho_{t,T}(Z_t, \dots, Z_T) = Z_t + \rho_{t,T}(0, Z_{t+1}, \dots, Z_T)$;
(iv) It is decomposable if a mapping $\varphi_t : \mathcal{Z}_t \to \mathcal{S}_t$ exists such that
$\varphi_t(Z_t) = Z_t$ for all $Z_t \in \mathcal{S}_t$, and
$\rho_{t,T}(Z_t, \dots, Z_T) = \varphi_t(Z_t) + \rho_{t,T}(0, Z_{t+1}, \dots, Z_T)$ for all $Z \in \mathcal{Z}_{t,T}$
Risk Filters and their Time Consistency

A risk filter $\{\rho_{t,T}^{\pi}\}_{t=1,\dots,T}$ is a sequence of conditional risk evaluators $\rho_{t,T}^{\pi} : \mathcal{Z}_{t,T} \to \mathcal{S}_t$.
We have to index the risk filters by the policy $\pi$, because $\pi$ affects the measure $P$.
History: $H_t = (X_1, X_2, \dots, X_t)$, $h_t = (x_1, x_2, \dots, x_t)$

A family of risk filters $\{\rho_{t,T}^{\pi}\}_{\pi \in \Pi,\, t=1,\dots,T}$ is stochastically conditionally time consistent if for any $\pi, \pi' \in \Pi$, for any $1 \le t < T$, for all $h_t \in \mathcal{X}^t$, all $(Z_t, \dots, Z_T) \in \mathcal{Z}_{t,T}$ and all $(W_t, \dots, W_T) \in \mathcal{Z}_{t,T}$, the conditions
$Z_t = W_t$,
$\big( \rho_{t+1,T}^{\pi}(Z_{t+1}, \dots, Z_T) \mid H_t = h_t \big) \preceq_{\mathrm{st}} \big( \rho_{t+1,T}^{\pi'}(W_{t+1}, \dots, W_T) \mid H_t' = h_t \big)$
imply
$\rho_{t,T}^{\pi}(Z_t, Z_{t+1}, \dots, Z_T)(h_t) \le \rho_{t,T}^{\pi'}(W_t, W_{t+1}, \dots, W_T)(h_t)$
The relation $\preceq_{\mathrm{st}}$ is the conditional stochastic order.
Bayes Operator

Belief state: the conditional distribution of $Y_t$, given the initial distribution $\xi_1$ and the history $g_t = (\xi_1, x_1, u_1, x_2, \dots, u_{t-1}, x_t)$:
$\Pi_t(g_t)(A) = P[Y_t \in A \mid g_t]$, for all $A \in \mathcal{B}(\mathcal{Y})$, $t = 1, \dots, T$

Conditional distribution of the observable part:
$P\big[ X_{t+1} \in B \mid g_t, u_t \big] = \int_{\mathcal{Y}} Q_t^X(x_t, y, u_t)(B)\; \Pi_t(g_t)(dy)$,
where $Q_t^X(x_t, y_t, u_t)$ is the marginal of $Q_t(x_t, y_t, u_t)$ on the space $\mathcal{X}$

Transition of the belief state (Bayes operator):
$\Pi_{t+1}(g_{t+1}) = \Phi_t\big( x_t, \Pi_t(g_t), u_t, x_{t+1} \big)$

Example: $\mathcal{Y} = \{y_1, \dots, y_n\}$ and $Q_t(x, y, u)$ has density $q_t(x', y' \mid x, y, u)$:
$\Phi_t(x, \xi, u, x')(\{y_k\}) = \dfrac{\sum_{i=1}^{n} q_t(x', y_k \mid x, y_i, u)\, \xi_i}{\sum_{\ell=1}^{n} \sum_{i=1}^{n} q_t(x', y_\ell \mid x, y_i, u)\, \xi_i}$
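For finite $\mathcal{Y}$, the Bayes operator in the example above is a one-line matrix computation. A minimal sketch on a hypothetical model with two observable states, two hidden states, and one control (the kernel `q` is randomly generated for illustration):

```python
import numpy as np

def bayes_operator(q, x, xi, u, x_next):
    """Discrete Bayes operator for the belief state.
    q[x, y, u, x_next, y_next] -- transition density q_t(x', y' | x, y, u);
    xi -- current belief over Y (nonnegative vector summing to 1).
    Returns the posterior belief with
    xi'_k proportional to sum_i q(x', y_k | x, y_i, u) * xi_i."""
    joint = q[x, :, u, x_next, :]   # joint[i, k] = q(x', y_k | x, y_i, u)
    unnorm = joint.T @ xi           # sum over the current hidden states y_i
    return unnorm / unnorm.sum()    # normalize over y_k

# Tiny hypothetical model: 2 observable states, 2 hidden states, 1 control
rng = np.random.default_rng(0)
q = rng.random((2, 2, 1, 2, 2))
# normalize so that q(., . | x, y, u) is a distribution over (x', y')
q /= q.sum(axis=(3, 4), keepdims=True)

xi = np.array([0.5, 0.5])
xi_next = bayes_operator(q, x=0, xi=xi, u=0, x_next=1)
print(xi_next)
```

The numerator and denominator of the formula correspond to `unnorm` and `unnorm.sum()`, respectively.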
Markov Risk Filters

Extended state history (including belief states): $h_t = (x_1, \xi_1, x_2, \xi_2, \dots, x_t, \xi_t) \in \mathcal{H}_t$
Policies $\pi = (\pi_1, \dots, \pi_T)$ with decision rules $\pi_t(h_t) \in U_t(x_t)$

Markov Policy
For all $h_t, h_t' \in \mathcal{H}_t$, if $x_t = x_t'$ and $\xi_t = \xi_t'$, then $\pi_t(h_t) = \pi_t(h_t') = \pi_t(x_t, \xi_t)$

Policy value function:
$v_t^{\pi}(h_t) = \rho_{t,T}^{\pi}\big( c_t(x_t, Y_t, \pi_t(h_t)), \dots, c_T(X_T, Y_T, \pi_T(H_T)) \big)(h_t)$

A family of risk filters $\{\rho_{t,T}^{\pi}\}_{\pi \in \Pi,\, t=1,\dots,T}$ is Markov if for all Markov policies $\pi \in \Pi$ and all histories $h_t$ and $h_t'$ such that $x_t = x_t'$ and $\xi_t = \xi_t'$, we have
$v_t^{\pi}(h_t) = v_t^{\pi}(h_t') = v_t^{\pi}(x_t, \xi_t)$
Structure of Markov Risk Filters (with Jingnan Fan)

A family of risk filters $\{\rho_{t,T}^{\pi}\}$ is normalized, translation-invariant, stochastically conditionally time consistent, decomposable, and Markov if and only if transition risk mappings exist,
$\sigma_t : \big\{ (x_t, \xi_t, Q_t^{\pi}(h_t)) : \pi \in \Pi,\; h_t \in \mathcal{H}_t \big\} \times \mathcal{V} \to \mathbb{R}$, $t = 1, \dots, T-1$,
such that
(i) $\sigma_t(x, \xi, \cdot, \cdot)$ is normalized and strongly monotonic with respect to stochastic dominance;
(ii) for all $\pi \in \Pi$, all $t = 1, \dots, T-1$, and all $h_t$,
$v_t^{\pi}(h_t) = r_t(x_t, \xi_t, \pi_t(h_t)) + \sigma_t\big( x_t, \xi_t, Q_t^{\pi}(h_t),\; v_{t+1}^{\pi}(h_t, \cdot) \big)$

Evaluation of a Markov policy $\pi$:
$v_t^{\pi}(x_t, \xi_t) = r_t(x_t, \xi_t, \pi_t(x_t, \xi_t)) + \sigma_t\big( x_t, \xi_t, Q_t^{\pi}(x_t, \xi_t),\; x' \mapsto v_{t+1}^{\pi}\big( x', \Phi_t(x_t, \xi_t, \pi_t(x_t, \xi_t), x') \big) \big)$
Examples of Transition Risk Mappings

Average Value at Risk:
$\sigma(x, \xi, m, v) = \min_{\eta \in \mathbb{R}} \Big\{ \eta + \frac{1}{\alpha(x, \xi)} \int_{\mathcal{X}} \big( v(x') - \eta \big)_+\, m(dx') \Big\}$,
where $\alpha(x, \xi) \in [\alpha_{\min}, \alpha_{\max}] \subset (0, 1]$

Mean Semideviation of Order $p$:
$\sigma(x, \xi, m, v) = \int_{\mathcal{X}} v(x')\, m(dx') + \kappa(x, \xi) \Big[ \int_{\mathcal{X}} \big( v(x') - E_m[v] \big)_+^{\,p}\, m(dx') \Big]^{1/p}$,
where $\kappa(x, \xi) \in [0, 1]$

Entropic Mapping:
$\sigma(x, \xi, m, v) = \frac{1}{\gamma(x, \xi)} \ln E_m\big[ e^{\gamma(x, \xi)\, v(x')} \big]$, with $\gamma(x, \xi) > 0$
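For a discrete next-state distribution $m$, all three mappings reduce to elementary vector operations. A minimal sketch with fixed (hypothetical) risk parameters in place of the state-dependent $\alpha(x,\xi)$, $\kappa(x,\xi)$, $\gamma(x,\xi)$:

```python
import numpy as np

def avar(v, m, alpha):
    """Average Value at Risk: min_eta { eta + E_m[(v - eta)_+] / alpha }.
    For a finite m, the minimum is attained at one of the values of v."""
    return min(eta + np.dot(m, np.maximum(v - eta, 0.0)) / alpha for eta in v)

def mean_semideviation(v, m, kappa, p=1):
    """E_m[v] + kappa * ( E_m[ ((v - E_m[v])_+)^p ] )^(1/p)."""
    mean = np.dot(m, v)
    return mean + kappa * np.dot(m, np.maximum(v - mean, 0.0) ** p) ** (1.0 / p)

def entropic(v, m, gamma):
    """(1/gamma) * ln E_m[ exp(gamma * v) ]."""
    return np.log(np.dot(m, np.exp(gamma * v))) / gamma

m = np.array([0.25, 0.25, 0.25, 0.25])  # next-state distribution (hypothetical)
v = np.array([0.0, 1.0, 2.0, 5.0])      # value function at the next states

print(avar(v, m, alpha=0.5))            # → 3.5 (mean of the worst half)
print(mean_semideviation(v, m, kappa=0.5))
print(entropic(v, m, gamma=1.0))
```

All three outputs dominate the plain expectation $E_m[v] = 2$, reflecting the risk aversion built into the mappings.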
Dynamic Programming

Risk-averse optimal control problem:
$\min\; \rho_{1,T}\big( c_1(X_1, Y_1, U_1),\; c_2(X_2, Y_2, U_2),\; \dots,\; c_T(X_T, Y_T, U_T) \big)$

Theorem
If the risk measure is Markovian (plus general conditions), then the optimal solution is given by the dynamic programming equations:
$v_T(x, \xi) = \min_{u \in U_T(x)} r_T(x, \xi, u)$, $\quad x \in \mathcal{X}$, $\xi \in \mathcal{P}(\mathcal{Y})$;
$v_t(x, \xi) = \min_{u \in U_t(x)} \Big\{ r_t(x, \xi, u) + \sigma_t\Big( x, \xi, \int_{\mathcal{Y}} Q_t^X(x, y, u)\, \xi(dy),\; x' \mapsto v_{t+1}\big( x', \Phi_t(x, \xi, u, x') \big) \Big) \Big\}$, $\quad x \in \mathcal{X}$, $\xi \in \mathcal{P}(\mathcal{Y})$, $t = T-1, \dots, 1$

The optimal Markov policy $\hat{\pi} = \{\hat{\pi}_1, \dots, \hat{\pi}_T\}$ is given by the minimizers above.
Risk-Averse Clinical Trials (Darinka Dentcheva and Curtis McGinity)

In stages $t = 1, \dots, T$, successive patients are given drugs (cytotoxic agents), to which a severe toxic response (even death) is possible.
The probability of a toxic response ($x_{t+1} = 1$) depends on the unknown optimal dose $\gamma$ and the administered dose (control) $u_t$:
$F(u_t, \gamma) = 1 - e^{-\varphi(u_t, \gamma)}$
The belief state $\xi_t$, the conditional probability distribution of the unknown optimal dose, is the current state of the system.
The state evolves according to the Bayes operator, depending on the response of the patient: for $\gamma \in \mathcal{Y}$ (the range of doses),
$\xi_{t+1}(\gamma) \propto \begin{cases} F(u_t, \gamma)\, \xi_t(\gamma) & \text{if toxic } (x_{t+1} = 1) \\ \big( 1 - F(u_t, \gamma) \big)\, \xi_t(\gamma) & \text{if not toxic } (x_{t+1} = 0) \end{cases}$
Cost per stage: $c_t(\gamma, u_t) = \beta_t\, |u_t - \gamma|$ (other forms possible)
Medical ethics naturally motivates risk-averse control.
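The belief update above can be sketched on a discretized dose grid. The link function $\varphi(u, \gamma) = a\,u/\gamma$ below is a hypothetical choice for illustration (the slides leave $\varphi$ unspecified), as are the grid and the prior:

```python
import numpy as np

def toxicity_prob(u, gamma, a=1.0):
    """Probability of a toxic response: F(u, gamma) = 1 - exp(-phi(u, gamma)),
    with the hypothetical link phi(u, gamma) = a * u / gamma."""
    return 1.0 - np.exp(-a * u / gamma)

def belief_update(xi, doses, u, toxic):
    """Bayes update of the belief over the unknown optimal dose gamma
    (discretized on the grid `doses`) after administering dose u."""
    lik = toxicity_prob(u, doses) if toxic else 1.0 - toxicity_prob(u, doses)
    post = lik * xi
    return post / post.sum()

doses = np.linspace(0.5, 5.0, 10)  # grid of candidate optimal doses
xi = np.full(10, 0.1)              # uniform prior belief
xi = belief_update(xi, doses, u=2.0, toxic=False)
print(xi)
```

With this link, a non-toxic response at dose $u = 2$ shifts the belief toward larger values of $\gamma$ (the patient tolerated the dose), which is the qualitative behavior the Bayes operator should produce.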
Total Cost Models

Find the best policy $\pi = (\pi_1, \dots, \pi_T)$ to determine the doses $u_t = \pi_t(\xi_t)$

Expected Value Model:
$\min_{\pi}\; E\Big[ \sum_{t=1}^{T+1} \beta_t\, |u_t - \gamma| \Big]$
$\beta_{T+1}$ is the weight of the final recommendation $u_{T+1}$

Risk-Averse Model:
$\min_{\pi}\; \rho_{1,T+1}\Big( \big\{ \beta_t\, |u_t - \gamma| \big\}_{t=1,\dots,T+1} \Big)$

Two sources of risk:
the unknown state (only the belief state $\xi_t$ is available at time $t$);
the unknown evolution of $\{\xi_t\}$ due to the random responses of the patients.
Dynamic Programming Equations

All memory is carried by the belief state $\xi_t$.
For each $\xi_t$ and $u_t$, only two next states are possible, corresponding to $x_{t+1} = 0$ or $1$.

Simplified equation:
$v_t(\xi) = \min_{u} \Big\{ r_t(\xi, u) + \sigma_t\Big( \xi,\; \int_{\mathcal{Y}} P[x' = 1 \mid y, u]\, \xi(dy),\; x' \mapsto v_{t+1}\big( \Phi_t(\xi, u, x') \big) \Big) \Big\}$

Examples: $r_t(\xi, u) = E_\xi\big[ |u - \gamma|^p \big]$; $\sigma_t$ combining $E[\varphi]$ with $\max_{x' \in \{0,1\}} \varphi(x')$.
Any law invariant risk measure on the space of functions on $\mathcal{U}$ (for $r_t$) or on $\mathcal{U} \times \{0, 1\}$ (in the case of $\sigma_t$) can be used here.
Limited Lookahead Policies

At each time $t$, assume that this is the last test before the final recommendation, and solve the corresponding two-stage problem.

Risk-Neutral:
$\min_{u_t}\; E_{\xi_t}\Big[ \beta_t\, |u_t - \gamma| + \bar{\beta}\, E_{\text{response}}\Big[ \min_{u_{t+1}} E_{\xi_{t+1}}\, |u_{t+1} - \gamma| \Big] \Big]$

Risk-Averse:
$\min_{u_t}\; E_{\xi_t}\Big[ \beta_t\, |u_t - \gamma| \Big] + \bar{\beta}\, \max_{\text{response}}\; \min_{u_{t+1}} E_{\xi_{t+1}}\, |u_{t+1} - \gamma|$

with $\bar{\beta} = \sum_{\tau = t+1}^{T+1} \beta_\tau$ (the weight of the future).
Simulation Results for Expected Value and Risk-Averse Policies

[Figure: Comparison of policies — distribution of the dosage under the risk-averse lookahead and the expected-value lookahead policies ($T = 20$, $\eta = 2.6$); the horizontal axis marks the final recommendation $\eta^*_{T+1}$.]
Machine Deterioration (with Jingnan Fan)

We consider the problem of minimizing the costs of operating a machine over $T$ periods.
Unobserved state: $y_t \in \{1, 2\}$, with 1 being the good and 2 the bad state
Observed state: $x_t$, the cost incurred in the previous period
Control: $u_t \in \{0, 1\}$, with 0 meaning "continue" and 1 meaning "replace"

The dynamics of $Y$ is Markovian, with the transition matrices $K[u]$:
$K[0] = \begin{pmatrix} 1-p & p \\ 0 & 1 \end{pmatrix}$, $\qquad K[1] = \begin{pmatrix} 1-p & p \\ 1-p & p \end{pmatrix}$

Distribution of the costs:
$P\big[ x_{t+1} \ge c \mid y_t = i,\; u_t = 0 \big] = \int_{c}^{+\infty} f_i(x)\, dx$, $\quad i = 1, 2$;
$P\big[ x_{t+1} \ge c \mid y_t = i,\; u_t = 1 \big] = \int_{c}^{+\infty} f_1(x)\, dx$, $\quad i = 1, 2$
Value and Policy Monotonicity

Belief state: $\pi_t \in [0, 1]$, the conditional probability of the good state
The optimal value functions: $v_t(x, \pi) = x + w_t(\pi)$, $t = 1, \dots, T+1$

Dynamic programming equations:
$w_t(\pi) = \min\Big\{ R + \sigma_t\big( f_1,\; x' \mapsto x' + w_{t+1}(1 - p) \big);\;\; \sigma_t\big( \pi f_1 + (1 - \pi) f_2,\; x' \mapsto x' + w_{t+1}(\Phi(\pi, x')) \big) \Big\}$,
with the final stage value $w_{T+1}(\cdot) = 0$ and $R$ denoting the replacement cost.

If $f_1 / f_2$ is non-increasing, then the functions $w_t(\cdot)$ are non-increasing, and thresholds $\pi_t^* \in [0, 1]$, $t = 1, \dots, T$, exist such that the policy
$u_t^* = \begin{cases} 0 & \text{if } \pi_t > \pi_t^* \\ 1 & \text{if } \pi_t \le \pi_t^* \end{cases}$
is optimal.
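The backward recursion for $w_t$ can be sketched on a grid of belief values. A minimal implementation under illustrative assumptions: discretized cost distributions with a non-increasing ratio $f_1/f_2$, a mean-semideviation transition risk mapping, and hypothetical parameter values:

```python
import numpy as np

# Discretized cost distributions (hypothetical): the good state is cheaper
costs = np.array([1.0, 2.0, 3.0, 4.0])
f1 = np.array([0.4, 0.3, 0.2, 0.1])  # good state; f1/f2 is non-increasing
f2 = np.array([0.1, 0.2, 0.3, 0.4])  # bad state
R, p, T, kappa = 2.0, 0.2, 10, 0.5   # replacement cost, deterioration prob.

def semidev(v, m):
    """Mean-semideviation transition risk mapping of order 1."""
    mean = np.dot(m, v)
    return mean + kappa * np.dot(m, np.maximum(v - mean, 0.0))

grid = np.linspace(0.0, 1.0, 101)    # grid for the belief pi = P(good state)
w = np.zeros_like(grid)              # terminal value w_{T+1} = 0

for t in range(T, 0, -1):
    w_new = np.empty_like(w)
    for j, pi in enumerate(grid):
        mix = pi * f1 + (1 - pi) * f2   # distribution of the next cost
        post = pi * f1 / mix            # posterior P(good | next cost)
        nxt = post * (1 - p)            # belief after the deterioration step
        keep = semidev(costs + np.interp(nxt, grid, w), mix)
        replace = R + semidev(costs + np.interp(1 - p, grid, w), f1)
        w_new[j] = min(keep, replace)
    w = w_new

print(w[0], w[-1])  # w is non-increasing in the belief of the good state
```

Consistent with the monotonicity result, the computed $w_1(\cdot)$ decreases as the belief in the good state grows, so the "replace" action is chosen below a threshold belief.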
Numerical Illustration

Cost distributions $f_1$ and $f_2$: uniform, with $\int x f_1(x)\, dx \le \int x f_2(x)\, dx$ (the good state is cheaper on average)
Transition risk mapping: mean semideviation

[Figure: empirical distribution of the total cost for the risk-neutral model (blue) and the risk-averse model (orange).]
Partially Observable Jump Process

The unobserved process $\{Y_t\}_{0 \le t \le T}$: a finite-state Markov jump process on the space $\mathcal{Y} = \{1, \dots, n\}$ with the generator $\Lambda(t)$:
$\lambda_{ij}(t) = \begin{cases} \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon}\, P\big[ Y_{t+\varepsilon} = j \mid Y_t = i \big] & \text{if } j \ne i \\ -\sum_{k \ne i} \lambda_{ik}(t) & \text{if } j = i \end{cases}$

The observed process $\{X_t\}_{0 \le t \le T}$: a diffusion following the SDE
$dX_t = A(Y_t, t)\, dt + B(t)\, dW_t$, $\quad 0 \le t \le T$,
with the initial value $x_0$ independent of $Y_0$; $\{W_t\}$ is a Wiener process.

Random final cost: $\Phi(Y_T)$
The Belief State Equation (Wonham, 1964)

Filtration of observable events: $\{\mathcal{F}_t^X\}_{0 \le t \le T}$
The belief state: $\Pi_i(t) = P\big[ Y_t = i \mid \mathcal{F}_t^X \big]$, $i = 1, \dots, n$

Belief State Equation:
$d\Pi_i(s) = \big( \Lambda^{\mathsf{T}} \Pi \big)_i(s)\, ds + \Pi_i(s)\, \dfrac{A(i, s) - \bar{A}(s)}{B(s)}\, d\bar{W}_s$, $\quad \Pi_i(0) = p_i$,
where
$\big( \Lambda^{\mathsf{T}} \Pi \big)_i(s) = \sum_{j=1}^{n} \lambda_{ji}(s)\, \Pi_j(s)$, $\qquad \bar{A}(s) = \sum_{j=1}^{n} A(j, s)\, \Pi_j(s)$,
and $\{\bar{W}_t\}_{0 \le t \le T}$ is a Wiener process (the innovation process) given by the formula
$\bar{W}_t = \int_0^t \dfrac{dX_s - \bar{A}(s)\, ds}{B(s)}$
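A minimal Euler–Maruyama sketch of the filter above for a two-state chain, with all parameter values (generator, drifts, noise level) chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
T, dt = 1.0, 1e-3
steps = int(T / dt)
Lam = np.array([[-1.0, 1.0], [2.0, -2.0]])  # generator (hypothetical rates)
A = np.array([0.0, 1.0])                    # drift of X per hidden state
B = 0.5                                     # observation noise level

y = 0                                       # hidden chain, started in state 0
pi = np.array([0.5, 0.5])                   # belief, started at the prior
for _ in range(steps):
    # simulate a hidden-state jump (first-order jump probability)
    if rng.random() < -Lam[y, y] * dt:
        y = 1 - y
    # simulate the observation increment
    dX = A[y] * dt + B * np.sqrt(dt) * rng.standard_normal()
    # Wonham filter step driven by the innovation dW = (dX - Abar dt) / B
    Abar = A @ pi
    dW = (dX - Abar * dt) / B
    pi = pi + (Lam.T @ pi) * dt + pi * (A - Abar) / B * dW
    pi = np.clip(pi, 1e-12, None); pi /= pi.sum()  # numerical safeguard

print(pi)  # final belief P[Y_T = i | observations]
```

The explicit renormalization step is a standard numerical safeguard: the exact filter preserves the simplex, but the Euler discretization does not.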
Risk Filtering by BSDE (with Ruofan Yan)

Suppose $\Pi(t) = p$ and we use a Markov risk filter $\{\rho_{t,T}\}_{0 \le t \le T}$.
The value function for the final cost case:
$V(t, p) = \rho_{t,T}\big[ \Phi\big( Y_T^{t,p} \big) \big]$

Structure of $\rho_{t,T}[\cdot]$ [using Coquet, Hu, Mémin, Peng (2002)]
If the filter is monotonic, normalized, time consistent, and has the local property (plus minor growth conditions), then a driver $g : [0, T] \times \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}$ exists such that $\rho_{t,T}\big[ \Phi\big( Y_T^{t,p} \big) \big] = V_t$, where $(V, Z)$ solves the backward stochastic differential equation
$-dV_s = g(s, V_s, Z_s)\, ds - Z_s\, d\bar{W}_s$, $\quad s \in [t, T]$, $\quad V_T = \rho_{T,T}\big[ \Phi\big( Y_T^{t,p} \big) \big]$
Equivalent Forward-Backward System

Under the additional condition of law invariance of $\rho_{T,T}[\cdot]$, we obtain the following system.

Forward SDE for the belief state: for $i = 1, \dots, n$ and $0 \le t \le s \le T$,
$d\Pi_i(s) = \big( \Lambda^{\mathsf{T}} \Pi \big)_i(s)\, ds + \Pi_i(s)\, \dfrac{A(i, s) - \bar{A}(s)}{B(s)}\, d\bar{W}_s$, $\quad \Pi_i(t) = p_i$

Backward SDE for the risk measure: for $0 \le t \le s \le T$,
$-dV_s = g(s, V_s, Z_s)\, ds - Z_s\, d\bar{W}_s$, $\quad V_T = r_T\big( \Phi, \Pi(T) \big)$

The functional $r_T(\cdot, \cdot)$ is a law invariant risk measure.
The Running Cost Case

Functional with running cost:
$Z_T = \int_0^T c\big( \Pi(t) \big)\, dt + \Phi(Y_T)$
The value function: $V(t, p) = \rho_{t,T}[Z_T]$

Forward SDE for the belief state: for $i = 1, \dots, n$ and $0 \le t \le s \le T$,
$d\Pi_i(s) = \big( \Lambda^{\mathsf{T}} \Pi \big)_i(s)\, ds + \Pi_i(s)\, \dfrac{A(i, s) - \bar{A}(s)}{B(s)}\, d\bar{W}_s$, $\quad \Pi_i(t) = p_i$

Backward SDE for the risk measure: for $0 \le t \le s \le T$,
$-dV_s = \big[ c\big( \Pi(s) \big) + g(s, V_s, Z_s) \big]\, ds - Z_s\, d\bar{W}_s$, $\quad V_T = r_T\big( \Phi, \Pi(T) \big)$
Controlled Process

Controlled transition rates: $\lambda_{ij}(t, u)$, $u \in U$, where $U$ is a bounded set. The rates are uniformly bounded.

Piecewise-constant control: for $0 = t_0 < t_1 < t_2 < \dots < t_N = T$, we define
$\mathcal{U}_i^N = \big\{ u \in \mathcal{U} \;\big|\; u(t) = u(t_j)\; \forall\, t \in [t_j, t_{j+1}),\; \forall\, j = i, \dots, N-1 \big\}$,
where $\mathcal{U}$ is the set of $U$-valued processes adapted to $\{\mathcal{F}_t\}_{0 \le t \le T}$.

Value function for a fixed control:
$V^u(t_j, p) = \rho_{t_j, t_{j+1}}\Big[ \int_{t_j}^{t_{j+1}} c\big( \Pi^{t_j, p; u}(s), u_s \big)\, ds + V^u\big( t_{j+1}, \Pi^{t_j, p; u}(t_{j+1}) \big) \Big]$
Here $\Pi^{t_j, p; u}(\cdot)$ is the belief process restarted at $t_j$ from the value $p$, while the system is controlled by $u(\cdot) = u(t_j)$ in the interval $[t_j, t_{j+1})$.
Dynamic Programming

Optimal value function: $\hat{V}(t_j, p) = \inf_{u \in \mathcal{U}_j^N} V^u(t_j, p)$

$\hat{V}(t_j, p) = \inf_{\nu \in U}\; \rho_{t_j, t_{j+1}}\Big[ \int_{t_j}^{t_{j+1}} c\big( \Pi^{t_j, p; \nu}(r), \nu \big)\, dr + \hat{V}\big( t_{j+1}, \Pi^{t_j, p; \nu}(t_{j+1}) \big) \Big]$

Each $\rho_{t_j, t_{j+1}}[\cdot]$ is given by a controlled FBSDE system on $[t_j, t_{j+1}]$:
$d\Pi_i^{t_j, p; \nu}(s) = \big( \Lambda(\nu)^{\mathsf{T}} \Pi \big)_i(s)\, ds + \Pi_i^{t_j, p; \nu}(s)\, \dfrac{A(i, s) - \bar{A}(s)}{B(s)}\, d\bar{W}_s$, $\quad \Pi_i^{t_j, p; \nu}(t_j) = p_i$;
$-dV_s = \big[ c\big( \Pi^{t_j, p; \nu}(s), \nu \big) + g(s, V_s, Z_s) \big]\, ds - Z_s\, d\bar{W}_s$, $\quad V_{t_{j+1}} = \hat{V}\big( t_{j+1}, \Pi^{t_j, p; \nu}(t_{j+1}) \big)$

Further research: numerical methods for this FBSDE system.
References

A. Ruszczyński, Risk-averse dynamic programming for Markov decision processes, Mathematical Programming, Series B, 125 (2010), 235–261.
Ö. Çavuş and A. Ruszczyński, Computational methods for risk-averse undiscounted transient Markov models, Operations Research, 62 (2) (2014), 401–417.
Ö. Çavuş and A. Ruszczyński, Risk-averse control of undiscounted transient Markov models, SIAM Journal on Control and Optimization, 52 (6) (2014), 3935–3966.
J. Fan and A. Ruszczyński, Risk measurement and risk-averse control of partially observable discrete-time Markov systems, Mathematical Methods of Operations Research (2018), 1–24.
C. McGinity, D. Dentcheva, and A. Ruszczyński, Risk-averse approximate dynamic programming for partially observable systems with application to clinical trials, in preparation.
R. Yan, J. Yao, and A. Ruszczyński, Risk filtering and risk-averse control of partially observable Markov jump processes, submitted for publication.
D. Dentcheva and A. Ruszczyński, Risk measures for continuous-time Markov chains, submitted for publication.