Non-Irreducible Controlled Markov Chains with Exponential Average Cost.


Agustin Brau-Rojas, Departamento de Matemáticas, Universidad de Sonora. Emmanuel Fernández-Gaucherand, Dept. of Electrical & Computer Eng. and Computer Science, University of Cincinnati.

Abstract. We study discrete-time controlled Markov chains with finite state and action spaces. The performance of control policies is measured by the exponential average cost (EAC), a risk-sensitive version of the standard average cost which models risk-sensitivity by means of an exponential (dis)utility function (so that a constant risk-sensitivity coefficient is assumed). The main result is the characterization of the EAC corresponding to an arbitrary stationary deterministic policy in terms of the spectral radii of suitable irreducible matrices. This result generalizes a well-known theorem of Howard and Matheson that deals with the particular case in which the transition probability matrix induced by the control policy is primitive. The following consequences are obtained from the mentioned characterization. It is shown that, when a stationary deterministic policy determines only one class of recurrent states, the corresponding EAC converges to the risk-null average cost as the risk-sensitivity coefficient γ goes to zero. However, it is also shown that for large risk-sensitivity, fundamental differences arise between the two models. In particular, the limiting values of the EAC as γ goes to infinity are determined. Further insight into and illustration of the behavior of the EAC are provided by means of simple examples. Finally, we include a proof of the existence of solutions to the associated optimality equation when a simultaneous Doeblin condition is satisfied and the risk-sensitivity coefficient is small enough.
The just mentioned proof is significantly simpler than that recently provided in [5, 6, 7], and unlike that one, it relies entirely on elementary results of the Perron-Frobenius theory of non-negative matrices, which is the approach in Howard and Matheson's seminal paper [4].

1. Introduction.

We study controlled Markov chains (CMC's) with a risk-sensitive average cost optimality criterion as introduced by Howard and Matheson [4]; see also [5, 6, 0,, 6]. This criterion incorporates risk-sensitivity into the decision process by means of an exponential (dis)utility function U_γ(x) = sgn(γ) e^{γx}, where γ ≠ 0 and sgn(γ) denotes the sign of γ; see [8, 20]. We deal with both cases γ > 0 and γ < 0, which represent risk-aversion and risk-proneness, respectively. For brevity, we will refer to Howard and Matheson's criterion as the exponential average cost (EAC), after the (dis)utility function employed to define it.

The paper is organized as follows. In Section 2, the model and some general notation and terminology are introduced. Sections 3 and 4 are devoted to a comprehensive discussion of the EAC corresponding to a fixed (but otherwise arbitrary) stationary deterministic policy f. In Section 3, we restrict attention to the case in which the transition probability matrix P_f induced by f is irreducible. First, for arbitrary γ ≠ 0, the EAC is characterized as the unique solution of a Poisson equation and is expressed in terms of the spectral radius of the so-called disutility matrix P̃_f(γ); this result constitutes an extension of a theorem of Howard and Matheson [4], which considers only the case γ > 0 and assumes that P is primitive; see also [7, 0]. Then the impact of both small and large risk-sensitivity on the EAC under the irreducibility assumption is studied. We show that (a) the risk-sensitive model approaches the risk-null one as the risk-sensitivity coefficient γ goes to zero, and (b) as γ goes to ∞ (−∞), the EAC converges to the worst (best) arithmetical

average of the costs in the cycles determined by the transition matrix.

Section 4 deals with the EAC for an arbitrary (not necessarily irreducible) P_f. The main result of that section, Theorem 4, characterizes the EAC in terms of the spectral radii of suitable submatrices of P̃_f(γ), thus generalizing Howard and Matheson's characterization. Theorem 4 is employed to compute the EAC in two simple examples which illustrate two peculiar features of risk-sensitivity when the initial state is transient: (a) the EAC is not affected by the probability of entering an irreducible closed class but only by the EAC on that class, and (b) the EAC may depend on the cost structure at the transient states.

Section 5 is concerned with the existence of solutions to the exponential average optimality equation (EAOE) which, as is known, yield the optimal exponential average cost and an optimal stationary deterministic policy; see [4, 6]. First, by using Howard and Matheson's policy improvement argument [4], we show that if P_f is irreducible for every f ∈ Π_SD, then for arbitrary γ ≠ 0 the EAOE has a solution. Then we show that if the mentioned recurrence assumption is relaxed to a simultaneous Doeblin condition, the existence of solutions to the EAOE can be assured only for γ small enough. We must note that both results in this last section, as well as the EAC's characterization as a solution of a Poisson equation (Theorem 1) for irreducible P, were recently obtained by Cavazos-Cadena and Fernández-Gaucherand [5, 6, 7]. However, the proofs we present here, unlike those in [5, 6, 7], are based entirely on elementary results of the Perron-Frobenius theory of non-negative matrices, and are much less involved than those provided in the cited references.

2. Description of the Model.

Consider the standard framework for a CMC, specified by ⟨X, A, {A(i) : i ∈ X}, P, c⟩, where: a) X, the state space, is a discrete set. For ease of notation we will take X = {1, 2, ..., N}.
b) A, the action or control space, is a finite set. c) A(i), the set of admissible actions at state i, is a subset of A. The set of admissible state-action pairs is defined as K := {(i, a) : i ∈ X, a ∈ A(i)}. d) P = {P(· | i, a) : (i, a) ∈ K} is a transition probability on K × 2^X, where 2^X is the family of all subsets of X. For brevity, we will sometimes write P(j | i, a) or P_ij(a) instead of P({j} | i, a). e) c : K → R is the one-stage cost function. We will assume that c is bounded and, without loss of generality for our purposes as we will see later, also non-negative; that is, 0 ≤ c(i, a) ≤ K for some constant K ∈ (0, ∞). To avoid trivial situations we will also assume that c is not identically zero.

The tuple ⟨X, A, {A(i) : i ∈ X}, P, c⟩ represents a stochastic dynamic system observed at times (or epochs) t ∈ N_0 := {0, 1, 2, ...}. The evolution of the system is as follows. Let X_t denote the state at time t ∈ N_0, and A_t the action chosen at that time. If at decision epoch t the system is in state X_t = i ∈ X, and the control A_t = a ∈ A(i) is chosen, then (i) a cost c(i, a) is incurred, and (ii) the system moves to a new state X_{t+1} according to the probability distribution P(· | i, a). Once the transition into the new state has occurred, a new action is chosen, and the process is repeated. We will take the stochastic processes (X_t) and (A_t) as given by the coordinate functions defined on (X × A)^∞ in the usual way; for more details, see [1, 2, 9]. For simplicity we will often denote C_t := c(X_t, A_t).

Let Π denote the set of all admissible (possibly randomized and history dependent) policies (see [1]), and F the set of admissible decision functions, i.e., functions f from X to A such that f(i) ∈ A(i) for all i ∈ X. We will distinguish two subclasses of policies: the class Π_MD of Markovian deterministic policies and the class Π_SD of stationary deterministic policies [1]. The stationary deterministic policy determined by f ∈ F will be denoted by f^∞.
For each policy π ∈ Π and each initial state i ∈ X we define in the usual way a probability measure P_i^π on Ω := (X × A)^∞ [1, 7, 9], and denote the corresponding expectation operator by E_i^π. When π = f^∞ we

write P_i^f and E_i^f for P_i^π and E_i^π respectively, and the transition probability matrix induced by that policy is denoted by P_f, that is, P_f(i, j) := P(j | i, f(i)). The (risk-null) average cost due to a policy π ∈ Π and initial state i ∈ X will be denoted by

φ_π(i) := lim sup_{n→∞} (1/n) E_i^π [ Σ_{t=0}^{n−1} C_t ].   (1)

The certainty equivalent of a random variable Z with respect to the (dis)utility function U_γ is defined as

E(γ, Z) := U_γ^{−1}( E[ U_γ(Z) ] ) = (1/γ) log( E[ e^{γZ} ] ).

Heuristically, a decision maker with utility function U_γ is indifferent between the random (thus uncertain) cost Z and the (certain) cost E(γ, Z). The risk-sensitive average cost or exponential average cost (EAC) is defined as the (long-run) average of the certainty equivalents of the finite horizon costs. In other words, it is obtained by replacing the expectation operator in the definition of the risk-null average cost by the certainty equivalent operator. Thus, the EAC corresponding to a risk-sensitivity coefficient γ, a policy π, and an initial state i is given by

J_π(γ, i) := lim sup_{n→∞} (1/n) U_γ^{−1}( E_i^π [ U_γ( Σ_{t=0}^{n−1} C_t ) ] )
          = lim sup_{n→∞} (1/(nγ)) log( E_i^π [ exp( γ Σ_{t=0}^{n−1} C_t ) ] ).   (2)

Despite the fact that the EAC is not an expected utility criterion [9], it has been widely accepted in the literature as the exponential utility version of the standard (risk-null) average cost; see [4] for a discussion of alternative definitions. The optimal control problem associated to the EAC is, of course, to compute the optimal value function

J*(γ, i) := inf_{π∈Π} J_π(γ, i),   (3)

and to find policies at which the optimal values are attained, that is, to find π* ∈ Π such that

J*(γ, i) = J_{π*}(γ, i).   (4)

Remark 1. Observe that no generality is gained by dropping the assumption of non-negative costs. Indeed, suppose that −K ≤ c(i, a) ≤ K and define c̃ := c + K, C̃_t := C_t + K.
Then 0 ≤ c̃(i, a) ≤ 2K and

lim sup_{n→∞} (1/(nγ)) log( E_i^π [ exp( γ Σ_{t=0}^{n−1} C̃_t ) ] ) = lim sup_{n→∞} (1/(nγ)) log( e^{γnK} E_i^π [ exp( γ Σ_{t=0}^{n−1} C_t ) ] )
   = K + lim sup_{n→∞} (1/(nγ)) log( E_i^π [ exp( γ Σ_{t=0}^{n−1} C_t ) ] ),

where C̃_t is defined similarly to C_t.
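The certainty equivalent defined above is easy to experiment with numerically. The following sketch (not part of the paper; it assumes Python with numpy) evaluates E(γ, Z) = (1/γ) log E[e^{γZ}] for a simple discrete cost Z, illustrating that risk-aversion (γ > 0) pushes the certainty equivalent above the mean, while risk-proneness (γ < 0) pushes it below.

```python
import numpy as np

def certainty_equivalent(gamma, values, probs):
    # E(gamma, Z) = (1/gamma) * log E[exp(gamma * Z)] for a discrete Z
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return np.log(probs @ np.exp(gamma * values)) / gamma

# a random cost Z taking the values 0 or 10 with equal probability; E[Z] = 5
vals, p = [0.0, 10.0], [0.5, 0.5]
ce_averse = certainty_equivalent(2.0, vals, p)    # risk-averse: above the mean
ce_prone = certainty_equivalent(-2.0, vals, p)    # risk-prone: below the mean
ce_small = certainty_equivalent(1e-6, vals, p)    # close to the mean
```

As γ → 0 a Taylor expansion gives E(γ, Z) ≈ E[Z] + (γ/2) Var(Z), which is why the small-γ value hugs the mean.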

Notation and preliminary results. The following notation and terminology of probability theory, most of which is rather standard, will be used in the sequel. For i, j ∈ X, we write i → j and say that i leads to j (or j is accessible from i) when P^n(i, j) = P(X_n = j | X_0 = i) > 0 for some n ≥ 1; observe that with this definition it may happen that i does not lead to i, in which case we say that i is an irrelevant state. Similarly, for C ⊂ X, i → C ("the class C is accessible from i") means that i → j for some j ∈ C. As usual, i ↔ j, which is read "i and j communicate," means that i → j and j → i. A class of states C ⊂ X is called self-communicating (SC) if: a) i → j for every i, j ∈ C, and b) there does not exist a class C′ properly containing C such that (a) holds for C′. If C is a SC class, then we will denote Q_C := ( P(i, j) )_{i,j∈C} and Q̃_C := ( P̃(i, j) )_{i,j∈C} = ( e^{γc(i)} P(i, j) )_{i,j∈C}. Note that if C is SC, then both Q_C and Q̃_C are irreducible.

Now we recall some standard notation and basic definitions and facts about the theory of non-negative matrices that we will use in the sequel. Vectors in R^N and real N × N matrices will be called positive (non-negative) if all of their components are positive (non-negative). Let A denote an N × N non-negative matrix. The spectral radius of A will be denoted by ρ(A); recall that

ρ(A) := max { |λ| : λ is an eigenvalue of A } = lim_{n→∞} ||A^n||^{1/n},   (5)

where ||A|| = max{ Σ_{j=1}^N |A(i, j)| : 1 ≤ i ≤ N }; see [3]. The well-known Perron-Frobenius Theorem (cf. [3]) establishes that, when A is irreducible, (a) ρ(A) is positive, (b) ρ(A) is an algebraically simple eigenvalue of A, and (c) there exist both positive right and positive left eigenvectors corresponding to ρ(A). Moreover, ρ(A) is the unique positive eigenvalue of A having such eigenvectors (see [3]). The asymptotic behavior of the iterates A^n = (A^n(i, j)) shown in the following proposition is sometimes also included as part of the Perron-Frobenius Theorem; see for example [8].

Proposition 1.
Let A = [A(i, j)]_{i,j=1}^N be a non-negative irreducible matrix and φ = (φ_1, φ_2, ..., φ_N) any positive vector. Then

lim_{n→∞} ( Σ_{j=1}^N A^n(i, j) φ_j )^{1/n} = ρ(A)   (6)

for every i ∈ {1, ..., N}.

The following two monotonicity properties of irreducible matrices will be needed in the proofs below; for a proof, see [2].

Proposition 2. Let A and B be non-negative N × N irreducible matrices such that A ≥ B ≥ 0 and A ≠ B. Then ρ(A) > ρ(B).

Proposition 3. Let A be a non-negative irreducible matrix, x a positive vector, and α ∈ (0, ∞) such that Ax ≤ αx and Ax ≠ αx. Then ρ(A) < α.
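Equation (5) and the monotonicity in Proposition 2 are easy to check numerically. The sketch below (illustrative only, not from the paper; it uses numpy) compares the eigenvalue-based spectral radius with the norm-limit estimate ||A^n||^{1/n} for a strictly positive, hence irreducible, matrix.

```python
import numpy as np

def spectral_radius(A):
    return max(abs(np.linalg.eigvals(A)))

rng = np.random.default_rng(0)
A = rng.random((4, 4)) + 0.1      # strictly positive, hence irreducible
B = A.copy()
B[0, 1] *= 0.5                    # A >= B >= 0 with A != B, as in Proposition 2

# (5): rho(A) = lim_n ||A^n||^(1/n), with || . || the max-row-sum norm
n = 60
norm_est = np.max(np.linalg.matrix_power(A, n).sum(axis=1)) ** (1.0 / n)
```

The norm estimate converges like ρ(A) · C^{1/n} for a fixed constant C, so moderate n already gives a few percent of accuracy.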

Proposition 4. If P is a non-negative N × N matrix, then either P is irreducible, or by a permutation similarity it can be brought into the so-called irreducible normal form: a block lower-triangular matrix whose diagonal blocks Q_1, ..., Q_m are each either irreducible or equal to the 1 × 1 zero matrix (0). Moreover, ρ(P) = max{ ρ(Q_i) : Q_i is irreducible }; cf. [3].

In the next two sections we study the EAC corresponding to a (fixed) stationary deterministic policy f^∞ as a function of γ, the risk-sensitivity coefficient. Consequently, throughout both sections the dependence on f is omitted and we consider a simplified model, known as a Markov cost chain (MCC), whose elements are the state space X = {1, ..., N}, a stochastic matrix P = (P_ij)_{i,j=1}^N, and a non-null cost vector c = (c(1), ..., c(N)) with non-negative components. Then, the exponential average cost for the MCC is given by

J(γ, i) := lim sup_{n→∞} (1/n) E_i( γ, Σ_{t=0}^{n−1} c(X_t) )
        = lim sup_{n→∞} (1/(nγ)) log( E_i [ exp( γ Σ_{t=0}^{n−1} c(X_t) ) ] ),   (7)

where E_i and E_i(γ, ·) are respectively the expectation and the certainty equivalent operators induced by the transition probability matrix P and the initial state i ∈ X. If we denote the disutility incurred by the cost chain up to time n by

U_n(γ, i) := E_i [ sgn(γ) exp( γ Σ_{t=0}^{n−1} c(X_t) ) ],

for n = 1, 2, ..., and U_0(γ, j) ≡ sgn(γ), then it is not hard to see that the following recursion formula holds true:

U_{n+1}(γ, i) = Σ_{j=1}^N P_ij e^{γc(i)} U_n(γ, j),   n = 0, 1, ...;

see [3, 4]. Now, if we define the disutility matrix P̃(γ) by P̃(γ) := ( P(i, j) e^{γc(i)} ) and denote U_n := (U_n(γ, 1), ..., U_n(γ, N))^τ (where v^τ denotes the transpose of a vector v), then the above recursion formula can be written in vector form as U_{n+1} = P̃ U_n, n = 0, 1, ..., where U_0 := sgn(γ)(1, ..., 1)^τ. Thus U_n = P̃^n U_0, n = 0, 1, ..., and substituting this expression for U_n in (7) we get

J(γ, i) = lim sup_{n→∞} (1/(nγ)) log Σ_{j=1}^N P̃^n(i, j).   (8)

3. The exponential average cost: the irreducible case.
Throughout this section, we restrict attention to MCC's ⟨X, P, c⟩ for which the transition matrix P is irreducible. In Theorem 1 below, the EAC is

characterized as the unique solution of a Poisson equation and its value is expressed in terms of the spectral radius ρ(P̃) of the disutility matrix P̃. The results in Theorem 1 are an extension of those established by Howard and Matheson in [4], where P is required to be primitive; that is, aperiodicity is assumed there besides irreducibility. We follow closely Howard and Matheson's arguments based on the Perron-Frobenius theory of non-negative matrices, thus showing that aperiodicity is not essential to employ such a tool in the present problem.

Theorem 1. If P is irreducible, then for every i ∈ X

J(γ, i) = lim_{n→∞} (1/(nγ)) log Σ_{j=1}^N P̃^n(i, j) = (1/γ) log ρ(γ) =: J(γ),   (9)

where ρ(γ) is the spectral radius of the (irreducible) disutility matrix P̃(γ). Moreover, for each γ ≠ 0 there exists H(γ, ·) : X → R such that (J(γ), H(γ, ·)) is the unique solution of the Poisson equation

e^{γ[J(γ)+H(γ,i)]} = e^{γc(i)} Σ_{j=1}^N P_ij e^{γH(γ,j)},   i ∈ X,   (10)

with J(γ) > 0 and H(γ, N) = 0.

Proof: Set H_1 := {u : X → R : u(N) = 1 and u(i) > 0 for all i ∈ X}, H_2 := {v : X → R : v(N) = 0}, and let I_γ denote the interval (0, 1) if γ < 0, and (1, ∞) if γ > 0. Then the mapping T : I_γ × H_1 → (0, ∞) × H_2 defined by T(x, u) = ( (1/γ) log x, (1/γ) log u ) is bijective. Moreover, if (x, u) ∈ I_γ × H_1 and T(x, u) = (y, v), then (y, v) satisfies equation (10) if and only if (x, u) satisfies the equation xu = P̃(γ)u (we abuse notation by indistinctly considering u as a function or as a vector in R^N), i.e., if and only if x is the spectral radius of P̃(γ) and u the corresponding positive eigenvector with u(N) = 1. Let ρ(γ) > 0 denote the spectral radius of P̃(γ) and w the positive eigenvector with w_N = 1. First, let us check that ρ(γ) ∈ I_γ. To that end, note that P̃(γ) ≠ P because c is non-null; also, if γ > (<) 0 then P̃(γ) ≥ (≤) P. Thus, by Proposition 2, ρ(γ) > (<) ρ(P) = 1 and our claim is proved. Therefore, by the Perron-Frobenius Theorem, (ρ(γ), w) is the unique couple in I_γ × H_1 for which ρ(γ)w = P̃(γ)w.
Consequently, it follows from the above considerations that if we take (J(γ), H(γ, ·)) := T( ρ(γ), w ) = ( (1/γ) log ρ(γ), (1/γ) log w ), then (J(γ), H(γ, ·)) is the unique solution of the Poisson equation (10). Finally, by Proposition 1, the irreducibility of P̃(γ) implies that for every positive vector φ = (φ_1, ..., φ_N) the limit relationship

lim_{n→∞} (1/n) log Σ_{j=1}^N P̃(γ)^n(i, j) φ_j = log ρ(P̃(γ))   (11)

holds true for every i ∈ X. Thus, (9) follows directly from (11) by taking φ = (1, 1, ..., 1)^τ.

We can observe that there is a clear resemblance between (10) and the value equation for the risk-neutral average cost

φ + h(i) = c(i) + Σ_{j=1}^N P_ij h(j),   (12)
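As a numerical illustration of Theorem 1 (a sketch with our own toy data, not from the paper): for a small irreducible chain we build the disutility matrix P̃(γ), compare (1/γ) log ρ(P̃(γ)) with the finite-horizon quantity in (9), and check that (J(γ), H(γ, ·)) obtained from the Perron eigenvector satisfies the Poisson equation (10).

```python
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.3, 0.7, 0.0]])          # irreducible stochastic matrix
c = np.array([1.0, 2.0, 4.0])            # non-null, non-negative costs
gamma = 0.7

Pt = np.exp(gamma * c)[:, None] * P      # disutility matrix P~(gamma)
vals, vecs = np.linalg.eig(Pt)
k = np.argmax(vals.real)                 # Perron root: largest (real) eigenvalue
rho = vals[k].real
w = np.abs(vecs[:, k].real)              # positive Perron eigenvector
J = np.log(rho) / gamma                  # Theorem 1: J(gamma) = (1/gamma) log rho

# finite-horizon approximation of (9): (1/(n*gamma)) log sum_j P~^n(i, j)
n = 150
J_finite = np.log(np.linalg.matrix_power(Pt, n).sum(axis=1)) / (n * gamma)

# Poisson equation (10) with H(gamma, i) = (1/gamma) log(w_i / w_N)
H = np.log(w / w[-1]) / gamma
lhs = np.exp(gamma * (J + H))
rhs = np.exp(gamma * c) * (P @ np.exp(gamma * H))
```

Since ρw = P̃w, both sides of (10) reduce to ρ w_i / w_N, so lhs and rhs agree to machine precision.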

where φ and h(·) are respectively the (risk-neutral) average cost and the relative value function [1, 3, 2, 9]. Moreover, as the following theorem shows, when P is irreducible the risk-sensitive model converges to the risk-null model as the risk-sensitivity coefficient decreases to zero. In particular, if we define J(0, i) := φ for all i ∈ X, then J(·, i) turns out to be continuous at γ = 0 for every i ∈ X. This result was predicted in [4], but a proof was not provided.

Theorem 2. If P is irreducible, then

lim_{γ→0} J(γ) = φ   and   lim_{γ→0} H(γ, i) = h(i).   (13)

Proof: For each γ ≠ 0, let ρ(γ) denote the spectral radius of P̃(γ) and w(γ) = (w_1(γ), ..., w_N(γ)) a corresponding eigenvector such that w_N(γ) = 1. Since the entries of P̃(γ) are analytic functions of γ, and ρ(γ) is an eigenvalue of multiplicity one, ρ and the w_i are also analytic functions of γ; see [5, Ch. II]. In particular, lim_{γ→0} ρ(γ) = ρ(0) = 1 and lim_{γ→0} w_i(γ) = w_i(0) = 1, i ∈ X. Differentiating both sides of the eigenvalue equations

ρ(γ) w_i(γ) = Σ_{j=1}^N P_ij e^{γc(i)} w_j(γ),   i ∈ X,

with respect to γ we obtain

ρ′(γ) w_i(γ) + ρ(γ) w_i′(γ) = Σ_{j=1}^N P_ij [ w_j′(γ) + w_j(γ) c(i) ] e^{γc(i)},

and letting γ → 0 yields

lim_{γ→0} ρ′(γ) + lim_{γ→0} w_i′(γ) = Σ_{j=1}^N P_ij lim_{γ→0} w_j′(γ) + c(i).   (14)

Since w_N′(γ) ≡ 0 and the solution (φ, h) with h(N) = 0 of the value equation (12) is unique [1], we deduce from (14) that

φ = lim_{γ→0} ρ′(γ)   and   h(i) = lim_{γ→0} w_i′(γ).

Finally, the theorem follows by observing that, from L'Hôpital's rule and Theorem 1,

lim_{γ→0} J(γ) = lim_{γ→0} (log ρ(γ))/γ = lim_{γ→0} ρ′(γ)/ρ(γ) = lim_{γ→0} ρ′(γ) = φ,

and

lim_{γ→0} H(γ, i) = lim_{γ→0} (1/γ) log w_i(γ) = lim_{γ→0} w_i′(γ)/w_i(γ) = lim_{γ→0} w_i′(γ) = h(i).

Remark 2. Since J(γ) = (1/γ) log ρ(γ) is analytic in R \ {0} and lim_{γ→0} γJ(γ) = 0 (by the continuity of J at zero), J is in fact analytic at zero.
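A quick numerical check of Theorem 2 (a sketch using numpy; the chain below is our own arbitrary example): the risk-null average cost φ = π·c, with π the stationary distribution, should be recovered from J(γ) as γ → 0, with J(γ) above φ for γ > 0 and below it for γ < 0.

```python
import numpy as np

P = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.1, 0.9],
              [0.5, 0.5, 0.0]])   # irreducible stochastic matrix
c = np.array([1.0, 3.0, 2.0])

# risk-null average cost phi = pi . c, pi the stationary distribution of P
evals, evecs = np.linalg.eig(P.T)
pi = np.abs(evecs[:, np.argmax(evals.real)].real)
pi = pi / pi.sum()
phi = pi @ c

def J(gamma):
    # Theorem 1: J(gamma) = (1/gamma) * log rho(P~(gamma))
    Pt = np.exp(gamma * c)[:, None] * P
    return np.log(max(abs(np.linalg.eigvals(Pt)))) / gamma
```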

To end this section, we analyze the behavior of the EAC for large risk-sensitivity when the probability transition matrix P is irreducible. We will show that for (infinitely) large risk-aversion, the EAC is given by the worst average cost that can occur (with positive probability) in the long run. Heuristically, we might say that the attitude toward risk of a decision maker with infinitely large risk-aversion is as pessimistic as can be. A similar result is proved as well for (infinitely) large risk-proneness.

We will use the following notation and definitions. A finite directed P-path (or simply a P-path) from state i to state i_k is a finite ordered sequence of states {i, i_1, ..., i_k}, k ≥ 1, such that P(i, i_1) P(i_1, i_2) ⋯ P(i_{k−2}, i_{k−1}) P(i_{k−1}, i_k) > 0. The length of a P-path {i, i_1, ..., i_k} is the number k of states following the initial state i. Thus, a P-path of length k from i to j exists if and only if P^k(i, j) > 0. A P-cycle at state i, which we will denote by Γ_i, is a P-path which begins and ends at i, and such that i occurs exactly twice in the path. A P-cycle of length 1 is called a P-loop. If Γ_i is a P-cycle, then A(Γ_i) will denote the arithmetic average of the costs in the cycle. That is, if Γ_i is a P-loop then A(Γ_i) = c(i), and if Γ_i = {i, i_1, ..., i_k, i} with k ≥ 1, then

A(Γ_i) = (1/(k + 1)) ( c(i) + c(i_1) + ⋯ + c(i_k) ).

Also, we denote P̃(Γ_i) = P(i, i) e^{γc(i)} if Γ_i is a P-loop, and

P̃(Γ_i) = P(i, i_1) P(i_1, i_2) ⋯ P(i_k, i) e^{γ(c(i)+c(i_1)+⋯+c(i_k))}

if Γ_i = {i, i_1, ..., i_k, i} with k ≥ 1.

Theorem 3. If P is irreducible then

lim_{γ→∞} J(γ) = max { A(Γ_i) : Γ_i is a P-cycle }.   (15)

Proof: Let us denote by α the right-hand side of (15). First, observe that if Γ_i is any P-cycle of length k (k ≥ 1), then the positive number P̃(Γ_i) is one of the summands in the (i, i)-entry of P̃^k.
Thus,

ρ( P̃(γ) )^k = ρ( P̃(γ)^k ) ≥ P̃(Γ_i),

which yields

(1/γ) log ρ( P̃(γ) ) ≥ (1/(kγ)) log P̃(Γ_i) = (1/(kγ)) log p + A(Γ_i),

where 0 < p ≤ 1 is the product of the transition probabilities along the cycle. Therefore,

lim inf_{γ→∞} J(γ) ≥ A(Γ_i),

and consequently lim inf_{γ→∞} J(γ) ≥ α, since Γ_i was an arbitrary cycle.

To prove the converse inequality, we will first determine an upper bound in terms of α for the entries of the matrix P̃^n(γ). To that end, observe that any P-path {i_0, i_1, ..., i_n} of length n > N must contain at least one P-cycle. If that cycle is of length r, say {k_0, k_1, ..., k_{r−1}, k_0}, then we have

e^{γ[ c(i_0)+c(i_1)+⋯+c(i_{n−1}) ]} ≤ e^{γ[ c(i_0)+c(j_1)+⋯+c(j_{n−r−1}) ]} e^{γrα},

where {i_0, j_1, ..., j_{n−r}} is the P-path from i_0 to j_{n−r} (= i_n) of length n − r obtained after removing {k_0, k_1, ..., k_{r−1}} from the original P-path. By applying the previous procedure as many times as necessary, we see that there must exist a path {i_0, k_1, ..., k_s, i_n} with s < N, such that

e^{γ[ c(i_0)+c(i_1)+⋯+c(i_{n−1}) ]} ≤ e^{γ[ c(i_0)+c(k_1)+⋯+c(k_s) ]} e^{γ(n−s−1)α}.   (16)

Moreover, for each i ∈ X, c(i) ≤ Nα, because i is contained in at least one P-cycle (recall that P is irreducible). Taking into account this last observation, inequality (16) yields

e^{γ[ c(i_0)+c(i_1)+⋯+c(i_{n−1}) ]} ≤ e^{γ(s+1)Nα} e^{γ(n−s−1)α} ≤ e^{γN²α} e^{γnα} = e^{γ(n+N²)α}.   (17)

Hence, it follows from (17) that, for arbitrary n > N and i, j ∈ {1, ..., N}, we have

P̃^n(i, j) ≤ P^n(i, j) e^{γ(n+N²)α},

and we have accomplished our first step. This inequality is in fact rather rough, yet it will be sufficient for our purposes. From it, it follows immediately that ||P̃^n|| ≤ e^{γ(n+N²)α} ||P^n|| = e^{γ(n+N²)α} for every n > N. Thus,

ρ(P̃)^n = ρ(P̃^n) ≤ ||P̃^n|| ≤ e^{γ(n+N²)α},   n > N,

and applying the function (1/(nγ)) log(·) to both extremes of this inequality yields

(1/γ) log ρ(P̃) ≤ ((n + N²)/n) α,   n > N.

Letting n → ∞, we conclude that J(γ) ≤ α for every γ > 0, and therefore lim sup_{γ→∞} J(γ) ≤ α. The proof of the theorem is complete.

Remark 3. It is not always necessary to consider the set of all the P-cycles when taking the maximum in (15). If, for example, P(i, i) > 0 for every i ∈ X, then it is sufficient to consider the P-loops:

lim_{γ→∞} J(γ) = max { c(1), ..., c(N) }.

As opposed to what happens with large risk-aversion, the attitude toward risk of a decision maker with large risk-proneness may be described as highly optimistic. The rigorous statement, whose proof is similar to that of Theorem 3, is given below.

Lemma 1. If P is irreducible then

lim_{γ→−∞} J(γ) = min { A(Γ_i) : Γ_i is a P-cycle }.   (18)
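Remark 3 is easy to visualize numerically. In the sketch below (illustrative only, with our own toy chain) every state has a self-loop, so by Theorem 3 and Lemma 1 the EAC tends to max c for large γ and to min c for large negative γ.

```python
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.50, 0.00, 0.50]])   # irreducible, with P(i, i) > 0 for all i
c = np.array([1.0, 4.0, 2.0])

def J(gamma):
    # Theorem 1: J(gamma) = (1/gamma) * log rho(P~(gamma))
    Pt = np.exp(gamma * c)[:, None] * P
    return np.log(max(abs(np.linalg.eigvals(Pt)))) / gamma

J_hi = J(40.0)     # near max(c) = 4, by Remark 3
J_lo = J(-40.0)    # near min(c) = 1, by Lemma 1
```

The residual gap is of order |log P(i, i)| / |γ|, which is why the limits are approached only at the rate 1/γ.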

4. Exponential average cost: the general (non-irreducible) case.

The main result of this section, Theorem 4, provides a representation of the EAC in terms of the spectral radii of suitable submatrices of P̃, for an arbitrary (not necessarily irreducible) transition probability matrix P. Then, by means of two simple examples, the effect of risk-sensitivity when P is not irreducible is illustrated. Finally, we study the behavior of the EAC for small risk-sensitivity in the present general case. In particular, convergence of the EAC to the risk-neutral average cost is seen to hold under a recurrence assumption less restrictive than irreducibility. Before proceeding to the main result, we prove an auxiliary result which is important by itself.

Lemma 2. For i, j ∈ X, i → j implies sgn(γ) J(γ, i) ≥ sgn(γ) J(γ, j).

Proof: Since equality trivially holds for i = j, let us assume that i ≠ j, and write S_n := Σ_{t=0}^{n−1} c(X_t). Let us consider first the case γ > 0. Take r such that P^r(i, j) > 0. Then for n > r we have

E_i[exp(γS_n)] ≥ P^r(i, j) E_i[exp(γS_n) | X_r = j]
             ≥ P^r(i, j) E_i[ exp( γ Σ_{t=r}^{n−1} c(X_t) ) | X_r = j ]
             = P^r(i, j) E_j[exp(γS_{n−r})].

Thus,

(1/(nγ)) log( E_i[exp(γS_n)] ) ≥ (1/(nγ)) log( P^r(i, j) E_j[exp(γS_{n−r})] )
   = (1/(nγ)) log P^r(i, j) + (1/(nγ)) log E_j[exp(γS_{n−r})],   (19)

and the claim follows by taking lim sup as n → ∞ in the extremes of (19), since lim_{n→∞} (1/(nγ)) log P^r(i, j) = 0. For the case γ < 0, if we take r and n as above then, taking into account that c(·) ≤ K, we have

E_i[exp(γS_n)] ≥ P^r(i, j) E_i[exp(γS_n) | X_r = j]
             ≥ P^r(i, j) e^{γrK} E_i[ exp( γ Σ_{t=r}^{n−1} c(X_t) ) | X_r = j ]
             = P^r(i, j) e^{γrK} E_j[exp(γS_{n−r})].

Since lim_{n→∞} (1/(nγ)) log( P^r(i, j) e^{γrK} ) = 0 as well, we can proceed similarly as in the case γ > 0 (now with the inequality reversed after dividing by nγ < 0) to obtain J(γ, i) ≤ J(γ, j), and the proof is complete.

Remark 4. (a) It follows from Lemma 2 that J(γ, i) = J(γ, j) when i ↔ j; that is, the EAC is constant within a self-communicating class. (b) Note that the arguments in the proof of Lemma 2 are valid even if the state space X is infinite.
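The monotonicity in Lemma 2 can be observed through the finite-horizon formula (8). The sketch below (not from the paper; our own toy chain) uses a model in which state 3 is transient and leads to two absorbing states; up to the finite-horizon error, sgn(γ) J(γ, 3) dominates sgn(γ) J(γ, j) for j = 1, 2.

```python
import numpy as np

# state 3 is transient and leads to the absorbing states 1 and 2
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.5, 0.0]])
c = np.array([1.0, 3.0, 5.0])

def J_finite(gamma, n=300):
    # finite-horizon EAC (1/(n*gamma)) * log sum_j P~^n(i, j), cf. (8)
    Pt = np.exp(gamma * c)[:, None] * P
    return np.log(np.linalg.matrix_power(Pt, n).sum(axis=1)) / (n * gamma)

Jp = J_finite(0.5)     # roughly (1, 3, 3): state 3 inherits the worst class
Jm = J_finite(-0.5)    # roughly (1, 3, 1): state 3 inherits the best class
```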
Theorem 4. For a Markov cost chain with arbitrary transition probability matrix P, the EAC for i ∈ X is given by

J(γ, i) = (1/γ) log max { ρ(Q̃_C) : C is SC and i → C }.   (20)

Proof: First, note that for γ < 0, equation (20) can be written as

J(γ, i) = min { (1/γ) log ρ(Q̃_C) : C is SC and i → C }.

For brevity of writing, denote by α(γ, i) the right-hand side of (20). First, if i ∈ C and C is a closed SC class, then (20) holds because J(γ, i) = (1/γ) log ρ(Q̃_C) = α(γ, i). Next, if i ∈ C and C is a non-closed (transient) SC class, then for γ > (<) 0 we have

J(γ, i) = lim sup_{n→∞} (1/(nγ)) log Σ_{j=1}^N P̃^n(i, j) ≥ (≤) lim_{n→∞} (1/(nγ)) log Σ_{j∈C} (Q̃_C)^n(i, j) = (1/γ) log ρ(Q̃_C).

Finally, if C is SC, i → C and i ∉ C, then taking j ∈ C, by Lemma 2 we have

J(γ, i) ≥ (≤) J(γ, j) ≥ (≤) (1/γ) log ρ(Q̃_C).

Thus, we have proved that sgn(γ) J(γ, i) ≥ sgn(γ) α(γ, i) for all i ∈ X. To obtain the opposite inequality, define L(i) := {i} ∪ {j : i → j} and P̃_[i] := ( P̃(k, l) )_{k,l∈L(i)}. Then, for γ > (<) 0,

lim sup_{n→∞} (1/(nγ)) log Σ_{j=1}^N P̃^n(i, j) = lim sup_{n→∞} (1/(nγ)) log Σ_{j∈L(i)} P̃_[i]^n(i, j) ≤ (≥) lim_{n→∞} (1/(nγ)) log ||P̃_[i]^n|| = (1/γ) log ρ(P̃_[i]) = α(γ, i).

Since ρ(P̃_[i]) = max{ ρ(Q̃_C) : C is SC and i → C } by Proposition 4, the proof is complete.

Remark 5. Two remarkable differences between the risk-sensitive and the risk-neutral models become apparent from Theorem 4: (1) the EAC when beginning at a transient state is not a typical average of the EACs over the closed classes that are accessible from the initial state; and (2) the EAC may depend on the cost structure at the transient states.

Remark 6. Notice that the proofs of the last two results are still valid if we substitute lim sup by lim inf wherever the former appears. We deduce from that observation that, for the finite state space model, the limit exists in the definition of the EAC (2) and in (8).

Next, we give an example that demonstrates how characteristic (1) in the above remark may cause J(·, x) not to be continuous at zero, i.e., that (13) does not hold in general. The example considers a model with a transient state x which leads to more than one closed class.

Example 1. Consider the cost process with state space X = {1, 2, 3}, transition probability matrix

P = [ 1 0 0 ; 0 1 0 ; p 1−p 0 ],

and cost vector c = (1, 3, 5). In this case we have that

P̃ = [ e^{γc(1)} 0 0 ; 0 e^{γc(2)} 0 ; p e^{γc(3)} (1−p) e^{γc(3)} 0 ],

and state 3 leads to the classes C_1 = {1} and C_2 = {2}. Thus, from Theorem 4 we see that

J(γ, 3) = max{c(1), c(2)} = 3 if γ > 0,   J(γ, 3) = min{c(1), c(2)} = 1 if γ < 0,

so that

lim_{γ→0+} J(γ, 3) = 3 ≠ 1 = lim_{γ→0−} J(γ, 3).

On the other hand, it is easy to check that φ(3) = p c(1) + (1 − p) c(2) = 3 − 2p. Thus, J(·, 3) is neither left nor right continuous at zero.

Remark 7. Note that J(γ, 3) has the same value for all 0 < p < 1. In other words, as opposed to what happens with the risk-null criterion, given that 1 and 2 are accessible from 3, J(γ, 3) does not depend on the probability of entering each of the classes.

It is a well-known fact that if a Markov cost chain has a unichain structure, i.e., only one irreducible closed class and possibly some transient states, then the standard (risk-neutral) average cost does not depend on the initial state. The following example shows how feature (2) in Remark 5 may cause the previously mentioned property to fail in general for the EAC, when the risk-sensitivity coefficient is large.

Example 2. Consider the cost process with state space X = {1, 2}, transition matrix

P = [ 1 0 ; 1−p p ],

and cost vector c = (c(1), c(2)) such that c(1) < c(2). In this case we have Q̃_{C_1} = (e^{γc(1)}) and Q̃_{C_2} = (p e^{γc(2)}), corresponding to the self-communicating classes C_1 = {1} and C_2 = {2}. Thus, from Theorem 4, J(γ, 1) = c(1) for every γ, and J(γ, 2) = max{ c(1), c(2) + (log p)/γ } for γ > 0, while J(γ, 2) = min{ c(1), c(2) + (log p)/γ } for γ < 0; that is,

J(γ, 2) = c(1) if γ ≤ log p / (c(1) − c(2)),   J(γ, 2) = c(2) + (log p)/γ if γ > log p / (c(1) − c(2)).

Thus, if γ > log p / (c(1) − c(2)), then the EAC does depend on the initial state. On the other hand, we can observe that the EAC is better behaved for small values of γ: J(γ, 1) = J(γ, 2) for γ ≤ log p / (c(1) − c(2)).
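Theorem 4 suggests a direct algorithm: enumerate the self-communicating classes, keep those accessible from i, and take the sgn(γ)-extremal value of (1/γ) log ρ(Q̃_C). The sketch below (an illustrative implementation, not the authors' code; the name eac is ours) does exactly that and reproduces Example 1, including the independence from p.

```python
import numpy as np

def eac(P, c, gamma):
    # Theorem 4 (sketch): J(gamma, i) is the sgn(gamma)-extremal value of
    # (1/gamma) * log rho(Q~_C) over the SC classes C accessible from i
    N = len(c)
    reach = (P > 0).astype(int)
    closure = reach.copy()                     # i -> j in at least one step
    for _ in range(N):
        closure = np.minimum(1, closure + closure @ reach)
    classes = []                               # self-communicating classes
    for i in range(N):
        if closure[i, i]:                      # i is not an irrelevant state
            C = tuple(j for j in range(N) if closure[i, j] and closure[j, i])
            if C not in classes:
                classes.append(C)
    J = np.empty(N)
    for i in range(N):
        vals = []
        for C in classes:
            if closure[i, list(C)].max() == 0:     # C not accessible from i
                continue
            Qt = np.exp(gamma * c[list(C)])[:, None] * P[np.ix_(list(C), list(C))]
            vals.append(np.log(max(abs(np.linalg.eigvals(Qt)))) / gamma)
        J[i] = max(vals) if gamma > 0 else min(vals)
    return J

# Example 1: the value from the transient state 3 does not depend on p
p = 0.4
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [p, 1.0 - p, 0.0]])
c = np.array([1.0, 3.0, 5.0])
```

For γ > 0 this returns (1, 3, 3) and for γ < 0 it returns (1, 3, 1), whatever the value of 0 < p < 1, in agreement with Example 1.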

The following corollary to Theorem 4 shows that for γ close enough to zero, the EAC behaves similarly to the risk-neutral average cost in that it is completely determined by the long-run behavior of the underlying stochastic process; that is, its value is not influenced by the cost structure at the transient states.

Corollary 1. There exists γ_0 > 0 such that for every transient state i and γ ∈ (−γ_0, γ_0), γ ≠ 0,

J(γ, i) = (1/γ) log max { ρ(Q̃_C) : i → C and C is SC and closed }.

Proof: Let C be a non-closed (transient) SC class. For γ > 0, we have Q̃_C ≤ e^{γK} Q_C; thus ρ(Q̃_C) ≤ e^{γK} ρ(Q_C) and consequently

(1/γ) log ρ(Q̃_C) ≤ K + (1/γ) log ρ(Q_C).

Taking into account that ρ(Q_C) < 1, which follows from Proposition 2 and the fact that Q_C is irreducible and strictly substochastic, we have, for γ > 0,

lim_{γ→0+} (1/γ) log ρ(Q̃_C) ≤ K + lim_{γ→0+} (1/γ) log ρ(Q_C) = −∞.

Now, for γ < 0, Q̃_C ≤ Q_C; thus ρ(Q̃_C) ≤ ρ(Q_C) and consequently

(1/γ) log ρ(Q̃_C) ≥ (1/γ) log ρ(Q_C).

Then, similarly to the previous case,

lim_{γ→0−} (1/γ) log ρ(Q̃_C) ≥ lim_{γ→0−} (1/γ) log ρ(Q_C) = +∞.

Since there is only a finite number of SC classes, the claim follows from Theorem 4.

Corollary 2. If the probability transition matrix induces only one closed SC class C then, as in the irreducible case, the EAC converges to the risk-neutral average cost when γ goes to zero.

Proof: On one hand, we know that the value φ of the risk-neutral average cost does not depend on the initial state; see for example [1]. On the other hand, by the previous corollary, for γ small enough, J(γ, i) = (1/γ) log ρ(Q̃_C) for every i ∈ X. Therefore, by Theorem 2,

lim_{γ→0} J(γ, i) = lim_{γ→0} (1/γ) log ρ(Q̃_C) = φ   for every i ∈ X.

5. Existence of solutions to the optimality equation.

Similarly to the optimal control problem for the risk-neutral average cost, the risk-sensitive optimal value function is, under appropriate conditions, given by the solutions to the functional equation

sgn(γ) α w(i) = min_{a∈A(i)} { sgn(γ) e^{γc(i,a)} Σ_{j∈X} P_ij(a) w(j) },   i ∈ X,   (21)

which we will call the exponential average optimality equation (EAOE) corresponding to γ, or the γ-EAOE. More precisely, if α > 0 and the function w : X → [K_1, K_2], with K_1 > 0, satisfy (21), then J*(γ, i) = (1/γ) log α for every i ∈ X. Furthermore, if for each i, f*(i) attains the minimum on the right-hand side of (21), then the policy π* = (f*, f*, ...) ∈ Π_SD is optimal; see [4, 0,, 8].

In this section, we show that, under a simultaneous Doeblin condition and for small enough γ, solutions to the γ-EAOE exist. This result was already proved by Cavazos-Cadena and Fernández-Gaucherand [5]; however, the proof we present here (a) is significantly simpler than that in [5], and (b) as with the developments in previous sections of this paper, it relies entirely on elementary results of Perron-Frobenius theory for non-negative matrices (similarly to the approach in the seminal paper of Howard and Matheson [4]). First, we prove the existence claim for the case in which the transition matrix P_f is irreducible for every decision function f. Although this result extends that of Howard and Matheson [4] in that aperiodicity of P_f is not required, the proof essentially follows the policy improvement algorithm devised by these authors.

Lemma 3. If a finite state CMC is irreducible, i.e., P_f = ( P_ij(f(i)) ) is irreducible for every f ∈ F, then for each γ ≠ 0 the γ-EAOE has a solution (α, w).

Proof: Take f ∈ F such that J_f = min{ J_g : g ∈ F }. Let ρ(f) and w_f respectively be the spectral radius and corresponding positive eigenvector of P̃_f, so that J_f = (1/γ) log ρ(f) and ρ(f) w_f = P̃_f w_f. We claim that (ρ(f), w_f) is a solution of the optimality equation. Assume to the contrary that there exists g ∈ F such that

sgn(γ) ρ(f) w_f ≥ sgn(γ) P̃_g w_f   and   ρ(f) w_f ≠ P̃_g w_f.

Then, taking into account that w_f is a positive vector, it follows from Proposition 3 that sgn(γ) ρ(f) > sgn(γ) ρ(g). This last inequality contradicts the way we chose f, since J_f ≤ J_g implies sgn(γ) ρ(f) ≤ sgn(γ) ρ(g).
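Lemma 3 can be checked by brute force on a tiny model (a sketch over our own toy data, not from the paper): enumerate all decision functions, pick the one minimizing sgn(γ) ρ(P̃_f), and verify that its (ρ, w) satisfies the γ-EAOE (21).

```python
import itertools
import numpy as np

# a tiny CMC: 2 states, 2 actions, every induced P_f strictly positive
Pa = {0: np.array([[0.2, 0.8], [0.6, 0.4]]),
      1: np.array([[0.7, 0.3], [0.1, 0.9]])}
ca = np.array([[1.0, 2.0],     # one-stage costs c(i, a), rows indexed by i
               [3.0, 0.5]])
gamma = 0.8

def rho_w(f):
    # spectral radius and Perron eigenvector of the disutility matrix P~_f
    Pf = np.vstack([Pa[f[i]][i] for i in range(2)])
    cf = np.array([ca[i, f[i]] for i in range(2)])
    Pt = np.exp(gamma * cf)[:, None] * Pf
    vals, vecs = np.linalg.eig(Pt)
    k = np.argmax(vals.real)
    return vals[k].real, np.abs(vecs[:, k].real)

# policy improvement endpoint: the policy minimizing rho (here gamma > 0)
best = min(itertools.product([0, 1], repeat=2), key=lambda f: rho_w(f)[0])
rho, w = rho_w(best)

# gamma-EAOE (21) for gamma > 0:
# rho * w(i) = min_a exp(gamma * c(i, a)) * sum_j P_ij(a) * w(j)
rhs = np.array([min(np.exp(gamma * ca[i, a]) * (Pa[a][i] @ w) for a in (0, 1))
                for i in range(2)])
```

By the argument in the proof of Lemma 3, no other decision function can improve on the minimizer, so the minimum on the right-hand side is attained by the policy's own actions and rho * w equals rhs.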
Remark 8 (a) It is clear that under the irreducibility assumption of Lemma 3, a stationary deterministic policy exists which is optimal within Π_SD, because that class of policies is finite. What the lemma guarantees, via the verification theorem cited at the beginning of this section, is that such a policy will be optimal within the whole class Π as well. (b) The smoothness (with respect to γ) of the EAC for a model as in the lemma above has simple yet remarkable implications for the variation of the optimal policies with respect to γ. First, consider arbitrary f, g ∈ Π_SD. If φ_f > φ_g then, from the continuity of the value functions J_f(·) and J_g(·) at γ = 0, we obtain that J_f(γ) > J_g(γ) for every γ in a neighborhood of zero. Now, if φ_f = φ_g, then the analytic character of the value functions implies that for some γ_0 > 0, either J_f(γ) − J_g(γ) > 0 for every 0 < |γ| < γ_0, or J_f(γ) − J_g(γ) = 0 for every |γ| < γ_0. It follows at once from the previous observations that there exist γ_0 > 0 and decision functions f*, g* ∈ F (possibly equal) such that f* is γ-average optimal for every γ ∈ (0, γ_0) and g* is γ-average optimal for every γ ∈ (−γ_0, 0). Moreover, f* and g* are risk-null average optimal. In particular, if there is only one risk-null average optimal policy (an unlikely scenario), that policy must also be γ-average optimal for every γ ∈ (−γ_0, γ_0). Lemma 4 If a finite state CMC satisfies a simultaneous Doeblin condition, i.e., if there exists a state i_0 such that i → i_0 under P_f for every f in the set F of decision functions and every i ∈ X, then the exponential average optimality equation has a solution whenever |γ| is small enough.
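The simultaneous Doeblin condition in Lemma 4 is a purely graph-theoretic requirement on the supports of the transition matrices, so it can be checked mechanically. The sketch below (hypothetical matrices and function names of our own) finds every state i_0 that is accessible from all states under every induced matrix P_f.

```python
import numpy as np

def reaches(P, target):
    """States from which `target` is accessible under transition matrix P,
    found by backward search on the directed graph of positive entries."""
    n = P.shape[0]
    reached = {target}
    changed = True
    while changed:
        changed = False
        for i in range(n):
            if i not in reached and any(P[i, j] > 0 for j in reached):
                reached.add(i)
                changed = True
    return reached

def simultaneous_doeblin(matrices):
    """States i0 with i -> i0 for every i and every induced matrix."""
    n = matrices[0].shape[0]
    return [i0 for i0 in range(n)
            if all(len(reaches(P, i0)) == n for P in matrices)]

# Hypothetical two-decision-function model on three states; state 2 is
# absorbing under both matrices and accessible from everywhere.
P_f = np.array([[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]])
P_g = np.array([[0.5, 0.0, 0.5], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]])
print(simultaneous_doeblin([P_f, P_g]))  # [2]
```

Here state 2 plays the role of i_0 in the lemma: it lies in the recurrent class C(f) of every decision function, which is exactly the observation that starts the proof.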

Proof: Let us denote by C(f) and T(f), respectively, the class of recurrent states and the class of transient states corresponding to the transition probability matrix P_f induced by f. Additionally, denote Q_f := (P_f(i, j))_{i,j ∈ C(f)} and, as usual, by Q̃_f the corresponding disutility matrix. Observe that i_0 ∈ C(f) for every f ∈ F. Since F is finite, Corollary 1 guarantees the existence of γ_0 > 0 such that if |γ| < γ_0, then ρ(Q̃_f) = ρ(P̃_f) =: ρ(f) and J_f(γ, i) = (1/γ) log ρ(Q̃_f) =: J_f, for every i ∈ X and f ∈ F. As in Lemma 3, take f* ∈ F such that J_{f*} = min{J_g : g ∈ F} and denote by w_{f*} a nonnegative eigenvector of P̃_{f*} corresponding to ρ(f*). To prove that (ρ(f*), w_{f*}) is a solution of the optimality equation exactly as we did in Lemma 3, all we need to check is that w_{f*} is in fact a positive vector. To that end, relabel X so that C(f*) = {1, ..., k} and T(f*) = {k+1, ..., N}. Since ρ(f*) is also the spectral radius of Q̃_{f*}, which is irreducible, and (w_{f*}(1), ..., w_{f*}(k)) is a corresponding nonnegative eigenvector, that eigenvector must be positive, that is, we have w_{f*}(i) > 0 for i ∈ C(f*). Consider now i ∈ T(f*) and a positive integer n such that P^n_{f*}(i, i_0) > 0. From the eigenvalue equation P̃^n_{f*} w_{f*} = ρ(f*)^n w_{f*} we obtain the equality

ρ(f*)^n w_{f*}(i) = Σ_{j ∈ C(f*)} P̃^n_{f*}(i, j) w_{f*}(j) + Σ_{j ∈ T(f*)} P̃^n_{f*}(i, j) w_{f*}(j).

Now, P̃^n_{f*}(i, i_0) > 0 and w_{f*}(i_0) > 0 imply that the first sum on the right-hand side of the above equality is positive and, consequently, w_{f*}(i) > 0. Thus, as we noted before, we can now proceed as in Lemma 3 to complete the proof of the present lemma.

Remark 9 Observe that the remarks to Lemma 3 are still valid in the context of Lemma 4.

References

[1] A. Arapostathis, V. S. Borkar, E. Fernández-Gaucherand, M. K. Ghosh, and S. I. Marcus. Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control and Optimization, 31(2):282-344, March 1993.
[2] A. Berman and R. Plemmons. Nonnegative Matrices in the Mathematical Sciences.
Academic Press, New York, 1979.
[3] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice Hall, Englewood Cliffs, N.J., 1987.
[4] A. Brau and E. Fernández-Gaucherand. Controlled Markov chains with risk-sensitive exponential average cost criterion. In Proceedings of the 36th IEEE Conference on Decision and Control, San Diego, CA, 1997.
[5] R. Cavazos-Cadena and E. Fernández-Gaucherand. Controlled Markov chains with risk-sensitive average cost criterion: A counter-example and necessary conditions for optimal solutions under strong recurrence assumptions. Submitted for publication, 1998.
[6] R. Cavazos-Cadena and E. Fernández-Gaucherand. Controlled Markov chains with risk-sensitive criteria: Average cost, optimality equations, and optimal solutions. Mathematical Methods of Operations Research, 49:299-324, 1999.

[7] R. Cavazos-Cadena and E. Fernández-Gaucherand. Risk-sensitive optimal control in communicating average Markov decision chains. In M. Dror, P. L'Ecuyer, and F. Szidarovszky, editors, Modeling Uncertainty: An Examination of Stochastic Theory, Methods and Applications. Kluwer Academic Publishers, 2002.
[8] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Jones and Bartlett, Boston, MA, 1993.
[9] E. A. Feinberg. Controlled Markov processes with arbitrary numerical criteria. Theory of Probability and its Applications, 27, 1982.
[10] W. H. Fleming and D. Hernández-Hernández. Risk-sensitive control of finite state machines on an infinite horizon I. SIAM Journal on Control and Optimization, 35(5):1790-1810, September 1997.
[11] D. Hernández-Hernández and S. I. Marcus. Risk sensitive control of Markov processes in countable state space. Systems and Control Letters, 1997 (to appear).
[12] O. Hernández-Lerma and J. B. Lasserre. Discrete-Time Markov Control Processes. Springer, New York, 1996.
[13] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1982.
[14] R. A. Howard and J. E. Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356-369, March 1972.
[15] T. Kato. A Short Introduction to Perturbation Theory for Linear Operators. Springer-Verlag, New York, 1982.
[16] S. I. Marcus, E. Fernández-Gaucherand, D. Hernández-Hernández, S. Coraluppi, and P. Fard. Risk sensitive Markov decision processes. In C. Byrnes, B. Datta, D. Gilliam, and C. Martin, editors, Systems and Control in the Twenty-First Century, Progress in Systems and Control, Birkhäuser, 1997.
[17] J. Neveu. Mathematical Foundations of the Calculus of Probability. Holden-Day, Inc., San Francisco, Cal., 1965.
[18] J. W. Pratt. Risk aversion in the small and in the large. Econometrica, 32(1):122-136, January-April 1964.
[19] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, 1994.
[20] P. Whittle. Risk-sensitive Optimal Control. John Wiley & Sons, Chichester, 1990.


More information

Auxiliary signal design for failure detection in uncertain systems

Auxiliary signal design for failure detection in uncertain systems Auxiliary signal design for failure detection in uncertain systems R. Nikoukhah, S. L. Campbell and F. Delebecque Abstract An auxiliary signal is an input signal that enhances the identifiability of a

More information

Lecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past.

Lecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past. 1 Markov chain: definition Lecture 5 Definition 1.1 Markov chain] A sequence of random variables (X n ) n 0 taking values in a measurable state space (S, S) is called a (discrete time) Markov chain, if

More information

Simplex Algorithm for Countable-state Discounted Markov Decision Processes

Simplex Algorithm for Countable-state Discounted Markov Decision Processes Simplex Algorithm for Countable-state Discounted Markov Decision Processes Ilbin Lee Marina A. Epelman H. Edwin Romeijn Robert L. Smith November 16, 2014 Abstract We consider discounted Markov Decision

More information

arxiv:quant-ph/ v1 22 Aug 2005

arxiv:quant-ph/ v1 22 Aug 2005 Conditions for separability in generalized Laplacian matrices and nonnegative matrices as density matrices arxiv:quant-ph/58163v1 22 Aug 25 Abstract Chai Wah Wu IBM Research Division, Thomas J. Watson

More information

Point Process Control

Point Process Control Point Process Control The following note is based on Chapters I, II and VII in Brémaud s book Point Processes and Queues (1981). 1 Basic Definitions Consider some probability space (Ω, F, P). A real-valued

More information

Part III. 10 Topological Space Basics. Topological Spaces

Part III. 10 Topological Space Basics. Topological Spaces Part III 10 Topological Space Basics Topological Spaces Using the metric space results above as motivation we will axiomatize the notion of being an open set to more general settings. Definition 10.1.

More information

Numerical Analysis: Solving Systems of Linear Equations

Numerical Analysis: Solving Systems of Linear Equations Numerical Analysis: Solving Systems of Linear Equations Mirko Navara http://cmpfelkcvutcz/ navara/ Center for Machine Perception, Department of Cybernetics, FEE, CTU Karlovo náměstí, building G, office

More information

Applications of Controlled Invariance to the l 1 Optimal Control Problem

Applications of Controlled Invariance to the l 1 Optimal Control Problem Applications of Controlled Invariance to the l 1 Optimal Control Problem Carlos E.T. Dórea and Jean-Claude Hennet LAAS-CNRS 7, Ave. du Colonel Roche, 31077 Toulouse Cédex 4, FRANCE Phone : (+33) 61 33

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo January 29, 2012 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

P i [B k ] = lim. n=1 p(n) ii <. n=1. V i :=

P i [B k ] = lim. n=1 p(n) ii <. n=1. V i := 2.7. Recurrence and transience Consider a Markov chain {X n : n N 0 } on state space E with transition matrix P. Definition 2.7.1. A state i E is called recurrent if P i [X n = i for infinitely many n]

More information

On the asymptotic dynamics of 2D positive systems

On the asymptotic dynamics of 2D positive systems On the asymptotic dynamics of 2D positive systems Ettore Fornasini and Maria Elena Valcher Dipartimento di Elettronica ed Informatica, Univ. di Padova via Gradenigo 6a, 353 Padova, ITALY e-mail: meme@paola.dei.unipd.it

More information

FIXED POINT ITERATIONS

FIXED POINT ITERATIONS FIXED POINT ITERATIONS MARKUS GRASMAIR 1. Fixed Point Iteration for Non-linear Equations Our goal is the solution of an equation (1) F (x) = 0, where F : R n R n is a continuous vector valued mapping in

More information

STOCHASTIC PROCESSES Basic notions

STOCHASTIC PROCESSES Basic notions J. Virtamo 38.3143 Queueing Theory / Stochastic processes 1 STOCHASTIC PROCESSES Basic notions Often the systems we consider evolve in time and we are interested in their dynamic behaviour, usually involving

More information

On Distributed Coordination of Mobile Agents with Changing Nearest Neighbors

On Distributed Coordination of Mobile Agents with Changing Nearest Neighbors On Distributed Coordination of Mobile Agents with Changing Nearest Neighbors Ali Jadbabaie Department of Electrical and Systems Engineering University of Pennsylvania Philadelphia, PA 19104 jadbabai@seas.upenn.edu

More information

Connections between spectral properties of asymptotic mappings and solutions to wireless network problems

Connections between spectral properties of asymptotic mappings and solutions to wireless network problems 1 Connections between spectral properties of asymptotic mappings and solutions to wireless network problems R. L. G. Cavalcante, Member, IEEE, Qi Liao, Member, IEEE, and S. Stańczak, Senior Member, IEEE

More information

Matrix analytic methods. Lecture 1: Structured Markov chains and their stationary distribution

Matrix analytic methods. Lecture 1: Structured Markov chains and their stationary distribution 1/29 Matrix analytic methods Lecture 1: Structured Markov chains and their stationary distribution Sophie Hautphenne and David Stanford (with thanks to Guy Latouche, U. Brussels and Peter Taylor, U. Melbourne

More information

INTRODUCTION TO MARKOV CHAINS AND MARKOV CHAIN MIXING

INTRODUCTION TO MARKOV CHAINS AND MARKOV CHAIN MIXING INTRODUCTION TO MARKOV CHAINS AND MARKOV CHAIN MIXING ERIC SHANG Abstract. This paper provides an introduction to Markov chains and their basic classifications and interesting properties. After establishing

More information

ELA

ELA SUBDOMINANT EIGENVALUES FOR STOCHASTIC MATRICES WITH GIVEN COLUMN SUMS STEVE KIRKLAND Abstract For any stochastic matrix A of order n, denote its eigenvalues as λ 1 (A),,λ n(a), ordered so that 1 = λ 1

More information

Linearizing Symmetric Matrix Polynomials via Fiedler pencils with Repetition

Linearizing Symmetric Matrix Polynomials via Fiedler pencils with Repetition Linearizing Symmetric Matrix Polynomials via Fiedler pencils with Repetition Kyle Curlett Maribel Bueno Cachadina, Advisor March, 2012 Department of Mathematics Abstract Strong linearizations of a matrix

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Markov Decision Processes and Dynamic Programming

Markov Decision Processes and Dynamic Programming Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course In This Lecture A. LAZARIC Markov Decision Processes

More information

Convex Optimization CMU-10725

Convex Optimization CMU-10725 Convex Optimization CMU-10725 Simulated Annealing Barnabás Póczos & Ryan Tibshirani Andrey Markov Markov Chains 2 Markov Chains Markov chain: Homogen Markov chain: 3 Markov Chains Assume that the state

More information

Boolean Inner-Product Spaces and Boolean Matrices

Boolean Inner-Product Spaces and Boolean Matrices Boolean Inner-Product Spaces and Boolean Matrices Stan Gudder Department of Mathematics, University of Denver, Denver CO 80208 Frédéric Latrémolière Department of Mathematics, University of Denver, Denver

More information

Series Expansions in Queues with Server

Series Expansions in Queues with Server Series Expansions in Queues with Server Vacation Fazia Rahmoune and Djamil Aïssani Abstract This paper provides series expansions of the stationary distribution of finite Markov chains. The work presented

More information

arxiv: v2 [cs.ds] 27 Nov 2014

arxiv: v2 [cs.ds] 27 Nov 2014 Single machine scheduling problems with uncertain parameters and the OWA criterion arxiv:1405.5371v2 [cs.ds] 27 Nov 2014 Adam Kasperski Institute of Industrial Engineering and Management, Wroc law University

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information