Risk-Sensitive and Average Optimality in Markov Decision Processes

Karel Sladký

Abstract. This contribution is devoted to risk-sensitive optimality criteria in finite-state Markov decision processes. At first, we rederive necessary and sufficient conditions for average optimality of (classical) risk-neutral unichain models. This approach is then extended to the risk-sensitive case, i.e., to the case when the expectation of the stream of one-stage costs (or rewards) generated by a Markov chain is evaluated by an exponential utility function. We restrict ourselves to irreducible or unichain Markov models, where risk-sensitive average optimality is independent of the starting state. As we show, this problem is closely related to the solution of (nonlinear) Poissonian equations and to their connections with nonnegative matrices.

Keywords: dynamic programming, stochastic models, risk analysis and management.

JEL classification: C44, C61, C63

AMS classification: 90C40, 60J10, 93E20

1 Notation and Preliminaries

In this note, we consider unichain Markov decision processes with finite state space and compact action spaces, where the stream of costs generated by the Markov process is evaluated by an exponential utility function (so-called risk-sensitive models) with a given risk-sensitivity coefficient. To this end, let us consider an exponential utility function, say ū^γ(·), i.e. a separable utility function with constant risk sensitivity γ ∈ ℝ. For γ > 0 (the risk-averse case) ū^γ(·) is convex, if γ < 0 (the risk-seeking case) ū^γ(·) is concave, and if γ = 0 (the risk-neutral case) ū^γ(·) is linear. Observe that the exponential utility function ū^γ(·) is separable and multiplicative if the risk sensitivity γ ≠ 0, and additive for γ = 0. In particular, we have u^γ(ξ_1 + ξ_2) = u^γ(ξ_1) · u^γ(ξ_2) if γ ≠ 0, and u^γ(ξ_1 + ξ_2) = ξ_1 + ξ_2 for γ = 0. Then the utility assigned to the (random) outcome ξ is given by

ū^γ(ξ) := (sign γ) exp(γξ) if γ ≠ 0, and ū^γ(ξ) := ξ for γ = 0.  (1)
For what follows, let u^γ(ξ) := exp(γξ), hence ū^γ(ξ) = (sign γ) u^γ(ξ). Obviously, ū^γ(·) is continuous and strictly increasing. Then for the corresponding certainty equivalent, say Z^γ(ξ), since ū^γ(Z^γ(ξ)) = E[ū^γ(ξ)] (E is reserved for expectation), we immediately get

Z^γ(ξ) = (1/γ) ln{E u^γ(ξ)} if γ ≠ 0, and Z^γ(ξ) = E[ξ] for γ = 0.  (2)

In what follows, we consider a Markov decision chain X = {X_n, n = 0, 1, ...} with finite state space I = {1, 2, ..., N} and a compact set A_i of possible decisions (actions) in state i ∈ I. Supposing that in state i ∈ I action a ∈ A_i is selected, state j is reached in the next transition with a given probability p_ij(a), and the one-stage transition cost c_ij is accrued to this transition.

Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 4, 182 08 Praha 8, Czech Republic, e-mail: sladky@utia.cas.cz
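For intuition, the certainty equivalent (2) can be evaluated directly for a discrete random outcome. The following sketch (plain Python; the two-point cost distribution and all numerical values are illustrative, not taken from the text) shows that for γ > 0 the certainty equivalent of a random cost exceeds its expectation, while for γ < 0 it falls below it:

```python
import math

def certainty_equivalent(outcomes, probs, gamma):
    """Certainty equivalent Z^gamma(xi) = (1/gamma) ln E[exp(gamma * xi)],
    reducing to the expectation E[xi] in the risk-neutral case gamma = 0."""
    if gamma == 0.0:
        return sum(p * x for p, x in zip(probs, outcomes))
    return math.log(sum(p * math.exp(gamma * x) for p, x in zip(probs, outcomes))) / gamma

# Two-point random cost: 0 or 10, each with probability 1/2 (hypothetical data).
outcomes, probs = [0.0, 10.0], [0.5, 0.5]
mean = certainty_equivalent(outcomes, probs, 0.0)        # expected cost 5.0
z_averse = certainty_equivalent(outcomes, probs, 0.5)    # > 5: risk-averse penalty
z_seeking = certainty_equivalent(outcomes, probs, -0.5)  # < 5: risk-seeking discount
```

The monotonicity in γ illustrates why the sign of the risk-sensitivity coefficient separates the risk-averse and risk-seeking regimes.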
A (Markovian) policy controlling the decision process is given by a sequence of decisions (actions) taken at every time point. In particular, a policy controlling the process, π = (f^0, f^1, ...), with π^k = (f^k, f^{k+1}, ...) for k = 1, 2, ... (hence also π = (f^0, f^1, ..., f^{k−1}, π^k)), is identified by a sequence of decision vectors {f^n, n = 0, 1, ...}, where f^n ∈ F ≡ A_1 × ... × A_N for every n = 0, 1, 2, ..., and f_i^n ∈ A_i is the decision (or action) taken at the nth transition if the chain X is in state i. A policy π which selects at all times the same decision rule, i.e. π ∼ (f), is called stationary; then X is a homogeneous Markov chain with transition probability matrix P(f) whose ij-th element equals p_ij(f_i).

Let ξ_n^α = Σ_{k=0}^{n−1} α^k c_{X_k,X_{k+1}} with α ∈ (0, 1), resp. ξ_n = Σ_{k=0}^{n−1} c_{X_k,X_{k+1}}, be the stream of α-discounted, resp. undiscounted, transition costs received in the n next transitions of the considered Markov chain X. Similarly, let ξ^{(m,n)} be reserved for the total (random) cost obtained from the mth up to the nth transition (obviously, ξ_n = c_{X_0,X_1} + ξ^{(1,n)}). Moreover, if the risk sensitivity γ ≠ 0, then ū^γ(ξ_n^α) = (sign γ) u^γ(ξ_n^α), resp. ū^γ(ξ_n) = (sign γ) u^γ(ξ_n), is the (random) utility assigned to ξ_n^α, resp. to ξ_n. Observe that ξ^α := lim_{n→∞} ξ_n^α is well defined, hence u^γ(ξ^α) = exp(γ Σ_{k=0}^{∞} α^k c_{X_k,X_{k+1}}).

In the overwhelming literature on stochastic dynamic programming, attention was mostly paid to the risk-neutral case, i.e. to γ = 0. The following results and techniques, adapted from [8] and [12], will be useful for the derivation of necessary and sufficient average optimality conditions and for their further extensions to risk-sensitive models. Introducing for arbitrary g, w_j ∈ ℝ (i, j ∈ I) the discrepancy function

φ_{i,j}(w, g) := c_{i,j} + w_j − w_i − g  (3)

we can easily verify the following identities:

ξ_n^α = ((1 − α^n)/(1 − α)) g + w_{X_0} − α^n w_{X_n} + Σ_{k=0}^{n−1} α^k [φ_{X_k,X_{k+1}}(w, g) − (1 − α) w_{X_{k+1}}]  (4)

ξ_n = n g + w_{X_0} − w_{X_n} + Σ_{k=0}^{n−1} φ_{X_k,X_{k+1}}(w, g).  (5)
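Identity (5) is purely algebraic (a telescoping sum) and holds pathwise for arbitrary constants g and w_j. A quick numerical check in plain Python, with randomly generated costs and an arbitrary state path (all data hypothetical):

```python
import random

random.seed(1)
N, n = 4, 50
# Arbitrary costs c[i][j] and arbitrary constants g, w — identity (5) needs no structure.
c = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(N)]
w = [random.uniform(-1, 1) for _ in range(N)]
g = 0.3
path = [random.randrange(N) for _ in range(n + 1)]  # states X_0, ..., X_n

def phi(i, j):
    """Discrepancy function (3): phi_{i,j}(w, g) = c_{i,j} + w_j - w_i - g."""
    return c[i][j] + w[j] - w[i] - g

xi_n = sum(c[path[k]][path[k + 1]] for k in range(n))  # undiscounted cost stream
rhs = n * g + w[path[0]] - w[path[n]] + sum(phi(path[k], path[k + 1]) for k in range(n))
assert abs(xi_n - rhs) < 1e-9  # identity (5) holds along every path
```

The discounted identity (4) can be verified the same way; only the telescoping of the w-terms is weighted by α^k there.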
(5) If the process starts in state i and policy π = (f^n) is followed, then for the expected α-discounted, resp. undiscounted, total cost V_i^π(α, n) := E_i^π ξ_n^α, V_i^π(n) := E_i^π ξ_n we immediately get by (4), (5)

V_i^π(α, n) = ((1 − α^n)/(1 − α)) g + w_i + E_i^π { Σ_{k=0}^{n−1} α^k [φ_{X_k,X_{k+1}}(w, g) − (1 − α) w_{X_{k+1}}] − α^n w_{X_n} }  (6)

V_i^π(n) = n g + w_i + E_i^π { Σ_{k=0}^{n−1} φ_{X_k,X_{k+1}}(w, g) − w_{X_n} }.  (7)

Observe that

E_i^π Σ_{k=0}^{n−1} φ_{X_k,X_{k+1}}(w, g) = Σ_{j∈I} p_ij(f_i^0) { φ_{i,j}(w, g) + E_j^{π^1} Σ_{k=1}^{n−1} φ_{X_k,X_{k+1}}(w, g) }.  (8)

It is well known from the dynamic programming literature (cf. e.g. [1, 6, 9, 10, 16]) that if

there exists a state i_0 ∈ I that is accessible from any state i ∈ I for every f ∈ F,  (∗)

then:

(i) For every f ∈ F the resulting transition probability matrix P(f) is unichain (i.e. P(f) has no two disjoint closed sets).

(ii) There exist a decision vector f̂ ∈ F along with numbers ŵ_i, i ∈ I (unique up to an additive constant), and ĝ, being the solution of the set of (nonlinear) equations

ŵ_i + ĝ = min_{a∈A_i} Σ_{j∈I} p_ij(a) [c_{i,j} + ŵ_j] = Σ_{j∈I} p_ij(f̂_i) [c_{i,j} + ŵ_j],  (9)

φ_i(f, f̂) := Σ_{j∈I} p_ij(f_i) [c_{i,j} + ŵ_j] − ŵ_i − ĝ ≥ 0 with φ_i(f̂, f̂) = 0.  (10)
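For a concrete illustration of equations (9), the pair (ĝ, ŵ) can be computed by relative value iteration, a textbook scheme rather than one derived in this note. The two-state model below is hypothetical; states, actions, probabilities and costs are invented for the example:

```python
# Relative value iteration for the average-cost optimality equation (9):
#   w_i + g = min_a sum_j p_ij(a) * (c_ij + w_j)
# on a tiny hypothetical 2-state model (all numbers illustrative).

STATES = [0, 1]
ACTIONS = {0: ["stay", "go"], 1: ["stay"]}
P = {  # P[(i, a)][j] = transition probability p_ij(a)
    (0, "stay"): [0.9, 0.1], (0, "go"): [0.2, 0.8], (1, "stay"): [0.5, 0.5],
}
C = {  # C[(i, a)][j] = one-stage transition cost c_ij under action a
    (0, "stay"): [2.0, 2.0], (0, "go"): [0.5, 0.5], (1, "stay"): [1.0, 1.0],
}

def bellman(w):
    """One sweep of min_a sum_j p_ij(a) * (c_ij + w_j) over all states."""
    return [min(sum(p * (c + w[j])
                    for j, (p, c) in enumerate(zip(P[(i, a)], C[(i, a)])))
                for a in ACTIONS[i])
            for i in STATES]

w = [0.0, 0.0]
for _ in range(500):
    t = bellman(w)
    w = [t_i - t[0] for t_i in t]  # normalize w_0 = 0 (w unique up to a constant)
g = bellman(w)[0]  # at the fixed point with w_0 = 0, this equals g-hat
```

Here taking "go" in state 0 turns out to be optimal, and g converges to the stationary average cost of that policy.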
From (6), (7), (10) we immediately get for V_i^π(α) := lim_{n→∞} V_i^π(α, n)

V_i^π̂(α) = (1/(1 − α)) ĝ + ŵ_i − (1 − α) E_i^π̂ Σ_{k=0}^{∞} α^k ŵ_{X_{k+1}},  V_i^π̂(n) = n ĝ + ŵ_i − E_i^π̂ ŵ_{X_n},

hence for the stationary policy π̂ ∼ (f̂) and an arbitrary policy π = (f^n)

lim_{n→∞} (1/n) V_i^π̂(n) = lim_{α→1−} (1 − α) V_i^π̂(α) = ĝ  (11)

lim_{n→∞} (1/n) V_i^π(n) = ĝ = lim_{α→1−} (1 − α) V_i^π(α) if and only if lim_{n→∞} (1/n) E_i^π Σ_{k=0}^{n−1} φ_{X_k}(f^k, f̂) = 0.  (12)

2 Risk-Sensitive Optimality

On inserting the discrepancy function given by (3) into the exponential function u^γ(·), by (4) we get for the stream of discounted costs

u^γ(ξ_n^α) = e^{γ Σ_{k=0}^{n−1} α^k c_{X_k,X_{k+1}}} = e^{γ[((1 − α^n)/(1 − α)) g + w_{X_0} − α^n w_{X_n}]} · e^{γ Σ_{k=0}^{n−1} α^k [φ_{X_k,X_{k+1}}(w,g) − (1 − α) w_{X_{k+1}}]}  (13)

and for U_i^π(γ, α, n) := E_i^π u^γ(ξ_n^α) we have

U_i^π(γ, α, n) = e^{γ[((1 − α^n)/(1 − α)) g + w_i]} E_i^π e^{γ{Σ_{k=0}^{n−1} α^k [φ_{X_k,X_{k+1}}(w,g) − (1 − α) w_{X_{k+1}}] − α^n w_{X_n}}}.  (14)

Observe that the w_i's are bounded, i.e. |w_{X_k}| ≤ K for some K ≥ 0. Hence it holds that

e^{−|γ| K} ≤ e^{−γ (1 − α) Σ_{k=0}^{∞} α^k w_{X_{k+1}}} ≤ e^{|γ| K},  (15)

and letting n tend to infinity, from (14) we immediately get for U_i^π(γ, α) := lim_{n→∞} U_i^π(γ, α, n)

U_i^π(γ, α) = e^{γ[(1/(1 − α)) g + w_i]} E_i^π e^{γ Σ_{k=0}^{∞} α^k [φ_{X_k,X_{k+1}}(w,g) − (1 − α) w_{X_{k+1}}]}.  (16)

Similarly, for undiscounted models we get by (3), (5)

U_i^π(γ, n) = e^{γ[n g + w_i]} E_i^π e^{γ[Σ_{k=0}^{n−1} φ_{X_k,X_{k+1}}(w,g) − w_{X_n}]}.  (17)

Now observe that

E_i^π e^{γ Σ_{k=0}^{n−1} φ_{X_k,X_{k+1}}(w,g)} = Σ_{j∈I} p_ij(f_i^0) e^{γ[c_{i,j} − w_i + w_j − g]} E_j^{π^1} e^{γ Σ_{k=1}^{n−1} φ_{X_k,X_{k+1}}(w,g)}  (18)

E^π { e^{γ Σ_{k=m}^{n−1} φ_{X_k,X_{k+1}}(w,g)} | X_m = j } = Σ_{l∈I} p_jl(f_j^m) e^{γ[c_{j,l} − w_j + w_l − g]} E_l^{π^{m+1}} e^{γ Σ_{k=m+1}^{n−1} φ_{X_k,X_{k+1}}(w,g)}.  (19)

Employing (18), (19) the following facts can be easily verified from (16), (17).
Result 1. (i) Let for a given stationary policy π ∼ (f) there exist g(f) and w_j(f)'s such that

Σ_{j∈I} p_ij(f_i) e^{γ[c_{i,j} − w_i(f) + w_j(f) − g(f)]} = Σ_{j∈I} p_ij(f_i) e^{γ φ_{i,j}(w(f), g(f))} = 1, i ∈ I.  (20)

Then

U_i^π(γ, α) = e^{γ[(1/(1 − α)) g(f) + w_i(f)]} E_i^π e^{−γ (1 − α) Σ_{k=1}^{∞} α^k w_{X_k}(f)}  (21)

U_i^π(γ, n) = e^{γ[n g(f) + w_i(f)]} E_i^π e^{−γ w_{X_n}(f)}.  (22)

(ii) If it is possible to select g = g* and w_j = w_j*'s, resp. g = ĝ and w_j = ŵ_j's, such that for any f ∈ F, all i ∈ I and some f* ∈ F, resp. f̂ ∈ F, respectively

Σ_{j∈I} p_ij(f_i) e^{γ φ_{i,j}(w*, g*)} ≤ 1 with Σ_{j∈I} p_ij(f*_i) e^{γ φ_{i,j}(w*, g*)} = 1  (23)

Σ_{j∈I} p_ij(f_i) e^{γ φ_{i,j}(ŵ, ĝ)} ≥ 1 with Σ_{j∈I} p_ij(f̂_i) e^{γ φ_{i,j}(ŵ, ĝ)} = 1  (24)

then

e^{γ[(1/(1 − α)) ĝ + ŵ_i]} e^{−|γ| K} ≤ U_i^π̂(γ, α) ≤ U_i^π(γ, α) ≤ e^{γ[(1/(1 − α)) g* + w_i*]} e^{|γ| K}  (25)

e^{γ[n ĝ + ŵ_i]} e^{−|γ| K} ≤ U_i^π̂(γ, n) ≤ U_i^π(γ, n) ≤ e^{γ[n g* + w_i*]} e^{|γ| K}.  (26)

Result 2. Let (cf. (2)) Z_i^π(γ, α) = (1/γ) ln U_i^π(γ, α), Z_i^π(γ, n) = (1/γ) ln U_i^π(γ, n). Then by (25), (26) for the stationary policies π̂ ∼ (f̂), π* ∼ (f*), and by (16), (17) for an arbitrary policy π = (f^n),

lim_{n→∞} (1/n) Z_i^π̂(γ, n) = lim_{α→1−} (1 − α) Z_i^π̂(γ, α) = ĝ, lim_{n→∞} (1/n) Z_i^π*(γ, n) = lim_{α→1−} (1 − α) Z_i^π*(γ, α) = g*,

and for an arbitrary policy π

lim_{n→∞} (1/n) Z_i^π(γ, n) = g*, resp. lim_{n→∞} (1/n) Z_i^π(γ, n) = ĝ,  (27)

if and only if

lim_{n→∞} (1/n) ln[E_i^π e^{γ Σ_{k=0}^{n−1} φ_{X_k,X_{k+1}}(w*, g*)}] = 0, resp. lim_{n→∞} (1/n) ln[E_i^π e^{γ Σ_{k=0}^{n−1} φ_{X_k,X_{k+1}}(ŵ, ĝ)}] = 0.  (28)

3 Poissonian Equations

The system of equations (20) for the considered stationary policy π ∼ (f), and the nonlinear systems of equations (23), (24) for finding a stationary policy with maximal/minimal value of g(f), can also be written as

e^{γ[g(f) + w_i(f)]} = Σ_{j∈I} p_ij(f_i) e^{γ[c_{i,j} + w_j(f)]} (i ∈ I)  (29)

e^{γ[g* + w_i*]} = max_{f_i∈A_i} Σ_{j∈I} p_ij(f_i) e^{γ[c_{i,j} + w_j*]} (i ∈ I)  (30)

e^{γ[ĝ + ŵ_i]} = min_{f_i∈A_i} Σ_{j∈I} p_ij(f_i) e^{γ[c_{i,j} + ŵ_j]} (i ∈ I)  (31)

respectively, for the values g(f), ĝ, g*, w_i(f), w_i*, ŵ_i (i = 1, ..., N); obviously, these values depend on the selected risk sensitivity γ. Eqs. (30), (31) can be called the γ-average reward/cost optimality equations. In particular, if γ → 0, then using the Taylor expansion we have by (29), resp. (31),

g(f) + w_i(f) = Σ_{j∈I} p_ij(f_i) [c_{i,j} + w_j(f)], resp. ĝ + ŵ_i = min_{a∈A_i} Σ_{j∈I} p_ij(a) [c_{i,j} + ŵ_j],

which well corresponds to (9).
On introducing the new variables v_i(f) := e^{γ w_i(f)}, ρ(f) := e^{γ g(f)}, and on replacing the transition probabilities p_ij(f_i) by the general nonnegative numbers defined by q_ij(f_i) := p_ij(f_i) e^{γ c_ij}, (29) can alternatively be written as the following set of equations

ρ(f) v_i(f) = Σ_{j∈I} q_ij(f_i) v_j(f) (i ∈ I)  (32)

and (30), (31) can be rewritten as the following sets of nonlinear equations (here v̂_i := e^{γ ŵ_i}, v_i* := e^{γ w_i*}, ρ̂ := e^{γ ĝ}, ρ* := e^{γ g*})

ρ* v_i* = max_{f_i∈A_i} Σ_{j∈I} q_ij(f_i) v_j*,  ρ̂ v̂_i = min_{f_i∈A_i} Σ_{j∈I} q_ij(f_i) v̂_j (i ∈ I)  (33)

called the γ-average reward/cost optimality equation in multiplicative form.

For what follows it is convenient to consider (32), (33) in matrix form. To this end, we introduce the N × N matrix Q(f) = [q_ij(f_i)] with spectral radius (Perron eigenvalue) ρ(f) along with its right Perron eigenvector v(f) = [v_i(f)]; hence (cf. [5]) ρ(f) v(f) = Q(f) v(f). Similarly, for v(f*) = v*, v(f̂) = v̂, (33) can be written in matrix form as

ρ* v* = max_{f∈F} Q(f) v*,  ρ̂ v̂ = min_{f∈F} Q(f) v̂.  (34)

Recall that the vectorial maximum and minimum in (34) should be considered componentwise, and that v̂, v* are unique up to a multiplicative constant. Furthermore, if the transition probability matrix P(f) is irreducible, then Q(f) is also irreducible and the right Perron eigenvector v(f) can be selected strictly positive. To extend this assertion to unichain models, in contrast to condition (∗) for the risk-neutral case, it is necessary to assume

the existence of a state i_0 ∈ I, accessible from any state i ∈ I for every f ∈ F, that belongs to the basic class of Q(f) (i.e. the irreducible class with spectral radius equal to the Perron eigenvalue of Q(f)).  (∗∗)

If condition (∗∗) is fulfilled, it can be shown (cf. [15]) that in (30), (31) and (33) the eigenvectors v(f), v̂, v* can be selected strictly positive and ρ*, resp. ρ̂, is the maximal, resp. minimal, Perron eigenvalue of the matrix family {Q(f), f ∈ F}. So we have arrived at

Result 3. A sufficient condition for the solvability of the γ-average reward/cost optimality equation is the existence of a state i_0 ∈ I fulfilling condition (∗∗) (trivially fulfilled for irreducible models).
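Since by (32) the number e^{γ g(f)} is the Perron eigenvalue of Q(f), the risk-sensitive average cost of a fixed stationary policy can be computed by power iteration on Q(f). A minimal sketch in plain Python, where the policy data P, C and the value of γ are illustrative assumptions:

```python
import math

gamma = 0.4
# Hypothetical fixed stationary policy: transition matrix P(f) and costs c_ij.
P = [[0.7, 0.3], [0.4, 0.6]]
C = [[1.0, 2.0], [0.5, 1.5]]
N = len(P)

# Q(f) = [p_ij(f_i) * exp(gamma * c_ij)] — nonnegative and irreducible here.
Q = [[P[i][j] * math.exp(gamma * C[i][j]) for j in range(N)] for i in range(N)]

# Power iteration: v converges to the right Perron eigenvector (normalized so
# v_0 = 1) and rho to the Perron eigenvalue rho(f), i.e. equation (32) holds.
v = [1.0] * N
for _ in range(1000):
    u = [sum(Q[i][j] * v[j] for j in range(N)) for i in range(N)]
    rho = u[0]
    v = [x / rho for x in u]

g = math.log(rho) / gamma  # risk-sensitive average cost g(f) = (1/gamma) ln rho(f)
```

For an irreducible Q(f) the iteration converges geometrically, and v stays strictly positive, matching the Perron theory invoked above.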
In particular, for unichain models condition (∗∗) is fulfilled if the risk-sensitivity coefficient γ is sufficiently close to zero (cf. [3, 4, 15]). Finding a solution of (34) can be performed by policy or value iteration; details can be found e.g. in [2, 3, 7, 14, 15].

Finally, we rewrite optimality condition (28) of Result 2 in terms of Q(f), v̂, ρ̂. To this end, first observe that

E_{i_0}^π e^{γ φ_{X_k,X_{k+1}}(ŵ, ĝ)} = Σ_{i_{k+1}∈I} p_{i_k,i_{k+1}}(f_{i_k}^k) e^{γ[c_{i_k,i_{k+1}} + ŵ_{i_{k+1}} − ŵ_{i_k} − ĝ]} = Σ_{i_{k+1}∈I} q_{i_k,i_{k+1}}(f_{i_k}^k) v̂_{i_{k+1}} v̂_{i_k}^{−1} ρ̂^{−1}.  (35)

So equation (28) can also be written in matrix form as

lim_{n→∞} (1/n) ln { ρ̂^{−n} V̂^{−1} [Π_{k=0}^{n−1} Q(f^k)] V̂ } = lim_{n→∞} (1/n) ln { Π_{k=0}^{n−1} [ρ̂^{−1} V̂^{−1} Q(f^k) V̂] } = 0  (36)

where the diagonal matrix V̂ = diag [v̂_i] and 0 is reserved for a null matrix.
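The value-iteration approach mentioned above can be sketched in multiplicative form for the minimizing equation in (33): iterate v ← min over actions of Q(f)v and renormalize, so that the normalizing factor approaches ρ̂ = e^{γ ĝ}. The two-state model below is hypothetical (all transition and cost data are invented for the example), and the scheme follows the general idea of risk-sensitive value iteration studied in [2] only in outline:

```python
import math

gamma = 0.3
# Hypothetical model: two actions in state 0, one action in state 1.
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8], (1, 0): [0.5, 0.5]}
C = {(0, 0): [2.0, 2.0], (0, 1): [0.5, 0.5], (1, 0): [1.0, 1.0]}
ACTIONS = {0: [0, 1], 1: [0]}

def q_row(i, a):
    """Row of Q for state i, action a: q_ij = p_ij(a) * exp(gamma * c_ij)."""
    return [P[(i, a)][j] * math.exp(gamma * C[(i, a)][j]) for j in range(2)]

# Multiplicative value iteration for the minimizing equation in (33):
#   v_i <- min_a sum_j q_ij(a) v_j, renormalized each sweep so v_0 = 1;
# the normalizing factor converges to rho_hat = exp(gamma * g_hat).
v = [1.0, 1.0]
for _ in range(2000):
    u = [min(sum(q * v[j] for j, q in enumerate(q_row(i, a))) for a in ACTIONS[i])
         for i in range(2)]
    rho_hat = u[0]
    v = [x / rho_hat for x in u]

g_hat = math.log(rho_hat) / gamma  # optimal risk-sensitive average cost
```

At the fixed point, v is a strictly positive solution of (33) up to the multiplicative constant fixed by the normalization v_0 = 1.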
4 Conclusions

In this note, necessary and sufficient optimality conditions for discrete-time Markov decision chains are obtained, along with equations for average optimal policies, both for risk-neutral and risk-sensitive models. Our analysis is restricted to unichain models, and for the risk-sensitive case some additional assumptions are made. For multichain models it is necessary to find a suitable partition of the state space into nested classes that retain some properties of the unichain model. Some results in this direction can be found in [11, 15, 17, 18].

Acknowledgements

This research was supported by the Czech Science Foundation under Grants P402/11/0150 and P402/10/0956.

References

[1] Bertsekas, D. P.: Dynamic Programming and Optimal Control. Volume 2. Third Edition. Athena Scientific, Belmont, Mass. 2007.

[2] Cavazos-Cadena, R. and Montes-de-Oca, R.: The value iteration algorithm in risk-sensitive average Markov decision chains with finite state space. Mathematics of Operations Research 28 (2003), 752–756.

[3] Cavazos-Cadena, R.: Solution to the risk-sensitive average cost optimality equation in a class of Markov decision processes with finite state space. Mathematical Methods of Operations Research 57 (2003), 253–285.

[4] Cavazos-Cadena, R. and Hernández-Hernández, D.: A characterization of the optimal risk-sensitive average cost in finite controlled Markov chains. Annals of Applied Probability 15 (2005), 175–212.

[5] Gantmakher, F. R.: The Theory of Matrices. Chelsea, London 1959.

[6] Howard, R. A.: Dynamic Programming and Markov Processes. MIT Press, Cambridge, Mass. 1960.

[7] Howard, R. A. and Matheson, J.: Risk-sensitive Markov decision processes. Management Science 18 (1972), 356–369.

[8] Mandl, P.: On the variance in controlled Markov chains. Kybernetika 7 (1971), 1–12.

[9] Puterman, M. L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York 1994.

[10] Ross, S. M.: Introduction to Stochastic Dynamic Programming. Academic Press, New York 1983.
[11] Rothblum, U. G. and Whittle, P.: Growth optimality for branching Markov decision chains. Mathematics of Operations Research 7 (1982), 582–601.

[12] Sladký, K.: On the set of optimal controls for Markov chains with rewards. Kybernetika 10 (1974), 526–547.

[13] Sladký, K.: On dynamic programming recursions for multiplicative Markov decision chains. Mathematical Programming Study 6 (1976), 216–226.

[14] Sladký, K.: Bounds on discrete dynamic programming recursions I. Kybernetika 16 (1980), 526–547.

[15] Sladký, K.: Growth rates and average optimality in risk-sensitive Markov decision chains. Kybernetika 44 (2008), 205–226.

[16] Tijms, H. C.: A First Course in Stochastic Models. Wiley, Chichester 2003.

[17] Whittle, P.: Optimization Over Time: Dynamic Programming and Stochastic Control. Volume II, Chapter 35. Wiley, Chichester 1983.

[18] Zijm, W. H. M.: Nonnegative Matrices in Dynamic Programming. Mathematical Centre Tract, Amsterdam 1983.