Risk-Sensitive and Average Optimality in Markov Decision Processes


Karel Sladký
Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 4, 182 08 Praha 8, Czech Republic, e-mail: sladky@utia.cas.cz

Abstract. This contribution is devoted to risk-sensitive optimality criteria in finite state Markov decision processes. At first, we rederive necessary and sufficient conditions for average optimality of (classical) risk-neutral unichain models. This approach is then extended to the risk-sensitive case, i.e., when the expectation of the stream of one-stage costs (or rewards) generated by a Markov chain is evaluated by an exponential utility function. We restrict ourselves to irreducible or unichain Markov models where risk-sensitive average optimality is independent of the starting state. As we show, this problem is closely related to the solution of (nonlinear) Poissonian equations and their connections with nonnegative matrices.

Keywords: dynamic programming, stochastic models, risk analysis and management.

JEL classification: C44, C61, C63
AMS classification: 90C40, 60J10, 93E20

1 Notation and Preliminaries

In this note, we consider unichain Markov decision processes with finite state space and compact action spaces where the stream of costs generated by the Markov process is evaluated by an exponential utility function (so-called risk-sensitive models) with a given risk sensitivity coefficient.

To this end, let us consider an exponential utility function, say $\bar{u}^{\gamma}(\cdot)$, i.e. a separable utility function with constant risk sensitivity $\gamma \in \mathbb{R}$. For $\gamma > 0$ (risk averse case) $\bar{u}^{\gamma}(\cdot)$ is convex; if $\gamma < 0$ (risk seeking case) $\bar{u}^{\gamma}(\cdot)$ is concave; finally, if $\gamma = 0$ (risk neutral case) $\bar{u}^{\gamma}(\cdot)$ is linear. Observe that the exponential utility function $\bar{u}^{\gamma}(\cdot)$ is separable and multiplicative if the risk sensitivity $\gamma \neq 0$ and additive for $\gamma = 0$. In particular, we have $u^{\gamma}(\xi_1 + \xi_2) = u^{\gamma}(\xi_1) \cdot u^{\gamma}(\xi_2)$ if $\gamma \neq 0$ and $u^{\gamma}(\xi_1 + \xi_2) = \xi_1 + \xi_2$ for $\gamma = 0$. Then the utility assigned to the (random) outcome $\xi$ is given by

  $\bar{u}^{\gamma}(\xi) := \begin{cases} (\mathrm{sign}\,\gamma)\,\exp(\gamma\xi), & \text{if } \gamma \neq 0, \\ \xi, & \text{for } \gamma = 0. \end{cases}$    (1)

For what follows let $u^{\gamma}(\xi) := \exp(\gamma\xi)$, hence $\bar{u}^{\gamma}(\xi) = (\mathrm{sign}\,\gamma)\, u^{\gamma}(\xi)$. Obviously $\bar{u}^{\gamma}(\cdot)$ is continuous and strictly increasing. Then for the corresponding certainty equivalent, say $Z^{\gamma}(\xi)$, since $\bar{u}^{\gamma}(Z^{\gamma}(\xi)) = \mathbb{E}\,[\bar{u}^{\gamma}(\xi)]$ ($\mathbb{E}$ is reserved for expectation), we immediately get

  $Z^{\gamma}(\xi) = \begin{cases} \gamma^{-1} \ln\{\mathbb{E}\, u^{\gamma}(\xi)\}, & \text{if } \gamma \neq 0, \\ \mathbb{E}\,[\xi], & \text{for } \gamma = 0. \end{cases}$    (2)

In what follows, we consider a Markov decision chain $X = \{X_n,\ n = 0, 1, \ldots\}$ with finite state space $I = \{1, 2, \ldots, N\}$ and a compact set $A_i$ of possible decisions (actions) in state $i \in I$. Supposing that in state $i \in I$ action $a \in A_i$ is selected, then state $j$ is reached in the next transition with a given probability $p_{ij}(a)$ and one-stage transition cost $c_{ij}$ will be accrued to such transition.
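The certainty equivalent in (2) is straightforward to evaluate numerically. The following is a minimal sketch, assuming a Python/NumPy environment; the cost distribution is illustrative only and serves to show how $\gamma$ shifts the certainty equivalent above (risk averse) or below (risk seeking) the plain expectation.

```python
import numpy as np

def certainty_equivalent(samples, gamma):
    """Monte Carlo estimate of the certainty equivalent Z^gamma(xi) from (2):
    (1/gamma) * ln E[exp(gamma * xi)] for gamma != 0, and E[xi] for gamma == 0."""
    samples = np.asarray(samples, dtype=float)
    if gamma == 0.0:
        return samples.mean()
    g = gamma * samples
    m = g.max()                      # log-sum-exp trick for numerical stability
    return (m + np.log(np.exp(g - m).mean())) / gamma

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    costs = rng.normal(loc=10.0, scale=3.0, size=100_000)   # illustrative costs only
    for gamma in (-0.2, 0.0, 0.2):
        print(f"gamma = {gamma:+.1f}:  Z = {certainty_equivalent(costs, gamma):.3f}")
    # For costs, gamma > 0 pushes Z above the mean (risk aversion),
    # gamma < 0 pushes it below (risk seeking); gamma = 0 recovers E[xi].
```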

A (Markovian) policy controlling the decision process is given by a sequence of decisions (actions) taken at every time point. In particular, a policy controlling the process, $\pi = (f^0, f^1, \ldots)$, with $\pi^k = (f^k, f^{k+1}, \ldots)$ for $k = 1, 2, \ldots$ (hence also $\pi = (f^0, f^1, \ldots, f^{k-1}, \pi^k)$), is identified by a sequence of decision vectors $\{f^n,\ n = 0, 1, \ldots\}$ where $f^n \in F \equiv A_1 \times \ldots \times A_N$ for every $n = 0, 1, 2, \ldots$, and $f^n_i \in A_i$ is the decision (or action) taken at the $n$th transition if the chain $X$ is in state $i$. A policy $\pi$ which selects at all times the same decision rule, i.e. $\pi \sim (f)$, is called stationary; then $X$ is a homogeneous Markov chain with transition probability matrix $P(f)$ whose $ij$th element equals $p_{ij}(f_i)$.

Let $\xi^{\alpha}_n = \sum_{k=0}^{n-1} \alpha^k c_{X_k,X_{k+1}}$ with $\alpha \in (0,1)$, resp. $\xi_n = \sum_{k=0}^{n-1} c_{X_k,X_{k+1}}$, be the stream of $\alpha$-discounted, resp. undiscounted, transition costs received in the next $n$ transitions of the considered Markov chain $X$. Similarly, let $\xi^{(m,n)}$ be reserved for the total (random) cost obtained from the $m$th up to the $n$th transition (obviously, $\xi_n = c_{X_0,X_1} + \xi^{(1,n)}$). Moreover, if the risk sensitivity $\gamma \neq 0$ then $\bar{u}^{\gamma}(\xi^{\alpha}_n) = (\mathrm{sign}\,\gamma)\, u^{\gamma}(\xi^{\alpha}_n)$, resp. $\bar{u}^{\gamma}(\xi_n) = (\mathrm{sign}\,\gamma)\, u^{\gamma}(\xi_n)$, is the (random) utility assigned to $\xi^{\alpha}_n$, resp. to $\xi_n$. Observe that $\xi^{\alpha} := \lim_{n\to\infty} \xi^{\alpha}_n$ is well defined, hence $u^{\gamma}(\xi^{\alpha}) = \exp(\gamma \sum_{k=0}^{\infty} \alpha^k c_{X_k,X_{k+1}})$.

In the overwhelming literature on stochastic dynamic programming attention was mostly paid to the risk neutral case, i.e. $\gamma = 0$. The following results and techniques, adapted from [8] and [12], will be useful for the derivation of necessary and sufficient average optimality conditions and for their further extension to risk-sensitive models.

Introducing for arbitrary $g, w_j \in \mathbb{R}$ ($i, j \in I$) the discrepancy function

  $\phi_{i,j}(w, g) := c_{i,j} - w_i + w_j - g$    (3)

we can easily verify the following identities:

  $\xi^{\alpha}_n = \frac{1-\alpha^n}{1-\alpha}\, g + w_{X_0} - \alpha^n w_{X_n} + \sum_{k=0}^{n-1} \alpha^k \big[ \phi_{X_k,X_{k+1}}(w, g) - (1-\alpha)\, w_{X_{k+1}} \big]$    (4)

  $\xi_n = n g + w_{X_0} - w_{X_n} + \sum_{k=0}^{n-1} \phi_{X_k,X_{k+1}}(w, g).$    (5)

If the process starts in state $i$ and policy $\pi = (f^n)$ is followed, then for the expected $\alpha$-discounted, resp. undiscounted, total cost $V^{\pi}_i(\alpha, n) := \mathbb{E}^{\pi}_i\, \xi^{\alpha}_n$, $V^{\pi}_i(n) := \mathbb{E}^{\pi}_i\, \xi_n$ we immediately get by (4), (5)

  $V^{\pi}_i(\alpha, n) = \frac{1-\alpha^n}{1-\alpha}\, g + w_i + \mathbb{E}^{\pi}_i \Big\{ \sum_{k=0}^{n-1} \alpha^k \big[ \phi_{X_k,X_{k+1}}(w, g) - (1-\alpha)\, w_{X_{k+1}} \big] - \alpha^n w_{X_n} \Big\}$    (6)

  $V^{\pi}_i(n) = n g + w_i + \mathbb{E}^{\pi}_i \Big\{ \sum_{k=0}^{n-1} \phi_{X_k,X_{k+1}}(w, g) - w_{X_n} \Big\}.$    (7)

Observe that

  $\mathbb{E}^{\pi}_i \sum_{k=0}^{n-1} \phi_{X_k,X_{k+1}}(w, g) = \sum_{j \in I} p_{ij}(f^0_i) \Big\{ \phi_{i,j}(w, g) + \mathbb{E}^{\pi^1}_j \sum_{k=1}^{n-1} \phi_{X_k,X_{k+1}}(w, g) \Big\}.$    (8)

It is well known from the dynamic programming literature (cf. e.g. [1, 6, 9, 10, 16]) that if

  there exists a state $i_0 \in I$ that is accessible from any state $i \in I$ for every $f \in F$,    (*)

then:

(i) For every $f \in F$ the resulting transition probability matrix $P(f)$ is unichain (i.e. $P(f)$ has no two disjoint closed sets).

(ii) There exists a decision vector $\hat{f} \in F$ along with numbers $\hat{w}_i$, $i \in I$ (unique up to an additive constant), and $\hat{g}$ being the solution of the set of (nonlinear) equations

  $\hat{w}_i + \hat{g} = \min_{a \in A_i} \sum_{j \in I} p_{ij}(a)\, [c_{i,j} + \hat{w}_j] = \sum_{j \in I} p_{ij}(\hat{f}_i)\, [c_{i,j} + \hat{w}_j],$    (9)

  $\phi_i(f, \hat{f}) := \sum_{j \in I} p_{ij}(f_i)\, [c_{i,j} + \hat{w}_j] - \hat{w}_i - \hat{g} \geq 0 \quad \text{with} \quad \phi_i(\hat{f}, \hat{f}) = 0.$    (10)
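As a small numerical aside, the risk-neutral optimality equation (9) can be solved by relative value iteration; the following is a minimal sketch of that standard scheme (not an algorithm taken from this paper), assuming Python/NumPy. The encoding of the model as action-indexed matrices P[a], C[a], with every action admissible in every state, is an illustrative simplification of the compact action sets $A_i$.

```python
import numpy as np

def relative_value_iteration(P, C, tol=1e-10, max_iter=100_000):
    """Relative value iteration for the risk-neutral optimality equation (9).

    P[a] is the N x N transition matrix and C[a] the N x N one-stage cost matrix
    under action a.  Returns (g_hat, w_hat, f_hat).  Convergence presumes a
    unichain, aperiodic model, cf. condition (*).
    """
    A, N, _ = P.shape
    w = np.zeros(N)
    for _ in range(max_iter):
        # (Tw)_i = min_a sum_j p_ij(a) * (c_ij + w_j)
        Tw = np.einsum('aij,aij->ai', P, C + w[None, None, :]).min(axis=0)
        g = Tw[0]                       # at the fixed point Tw = g + w with w_1 = 0
        w_new = Tw - g
        if np.max(np.abs(w_new - w)) < tol:
            w = w_new
            break
        w = w_new
    f_hat = np.einsum('aij,aij->ai', P, C + w[None, None, :]).argmin(axis=0)
    return g, w, f_hat
```

At the fixed point $w$ is normalised so that its first component vanishes, reflecting that the $\hat{w}_i$ in (9) are unique only up to an additive constant.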

From (6), (7), (10) we immediately get for $V^{\pi}_i(\alpha) := \lim_{n\to\infty} V^{\pi}_i(\alpha, n)$

  $V^{\hat{\pi}}_i(\alpha) = \frac{1}{1-\alpha}\, \hat{g} + \hat{w}_i - (1-\alpha)\, \mathbb{E}^{\hat{\pi}}_i \sum_{k=0}^{\infty} \alpha^k \hat{w}_{X_{k+1}}, \qquad V^{\hat{\pi}}_i(n) = n \hat{g} + \hat{w}_i - \mathbb{E}^{\hat{\pi}}_i\, \hat{w}_{X_n},$

hence for the stationary policy $\hat{\pi} \sim (\hat{f})$ and an arbitrary policy $\pi = (f^n)$

  $\lim_{n\to\infty} \frac{1}{n}\, V^{\hat{\pi}}_i(n) = \lim_{\alpha\to 1} (1-\alpha)\, V^{\hat{\pi}}_i(\alpha) = \hat{g}$    (11)

  $\lim_{n\to\infty} \frac{1}{n}\, V^{\pi}_i(n) = \hat{g} = \lim_{\alpha\to 1} (1-\alpha)\, V^{\pi}_i(\alpha) \quad \text{if and only if} \quad \lim_{n\to\infty} \frac{1}{n} \sum_{k=0}^{n-1} \mathbb{E}^{\pi}_i\, \phi_{X_k}(f^k, \hat{f}) = 0.$    (12)

2 Risk-Sensitive Optimality

On inserting the discrepancy function given by (3) into the exponential function $u^{\gamma}(\cdot)$, by (4) we get for the stream of discounted costs

  $u^{\gamma}(\xi^{\alpha}_n) = e^{\gamma \sum_{k=0}^{n-1} \alpha^k c_{X_k,X_{k+1}}} = e^{\gamma \big[ \frac{1-\alpha^n}{1-\alpha}\, g + w_{X_0} - \alpha^n w_{X_n} \big]} \cdot e^{\gamma \sum_{k=0}^{n-1} \alpha^k \big[ \phi_{X_k,X_{k+1}}(w,g) - (1-\alpha)\, w_{X_{k+1}} \big]}$    (13)

and for $U^{\pi}_i(\gamma, \alpha, n) := \mathbb{E}^{\pi}_i\, u^{\gamma}(\xi^{\alpha}_n)$ we have

  $U^{\pi}_i(\gamma, \alpha, n) = e^{\gamma \big[ \frac{1-\alpha^n}{1-\alpha}\, g + w_i \big]}\; \mathbb{E}^{\pi}_i\, e^{\gamma \big\{ \sum_{k=0}^{n-1} \alpha^k [\, \phi_{X_k,X_{k+1}}(w,g) - (1-\alpha)\, w_{X_{k+1}} ] - \alpha^n w_{X_n} \big\}}$    (14)

Observe that the $w_i$'s are bounded, i.e. $|w_{X_k}| \leq K$ for some $K \geq 0$. Hence it holds that

  $e^{-\gamma K} \leq e^{-\gamma (1-\alpha) \sum_{k=0}^{\infty} \alpha^{k} w_{X_{k+1}}} \leq e^{\gamma K}$    (15)

and for $n$ tending to infinity from (14) we immediately get for $U^{\pi}_i(\gamma, \alpha) := \lim_{n\to\infty} U^{\pi}_i(\gamma, \alpha, n)$

  $U^{\pi}_i(\gamma, \alpha) = e^{\gamma \big[ \frac{1}{1-\alpha}\, g + w_i \big]}\; \mathbb{E}^{\pi}_i\, e^{\gamma \sum_{k=0}^{\infty} \alpha^k \big[ \phi_{X_k,X_{k+1}}(w,g) - (1-\alpha)\, w_{X_{k+1}} \big]}$    (16)

Similarly for undiscounted models we get by (3), (5)

  $U^{\pi}_i(\gamma, n) = e^{\gamma [\, n g + w_i\, ]}\; \mathbb{E}^{\pi}_i\, e^{\gamma \big[ \sum_{k=0}^{n-1} \phi_{X_k,X_{k+1}}(w,g) - w_{X_n} \big]}$    (17)

Now observe that

  $\mathbb{E}^{\pi}_i\, e^{\gamma \sum_{k=0}^{n-1} \phi_{X_k,X_{k+1}}(w,g)} = \sum_{j \in I} p_{ij}(f^0_i)\, e^{\gamma [\, c_{i,j} - w_i + w_j - g\, ]}\; \mathbb{E}^{\pi^1}_j\, e^{\gamma \sum_{k=1}^{n-1} \phi_{X_k,X_{k+1}}(w,g)}$    (18)

  $\mathbb{E}^{\pi}_j \big\{ e^{\gamma \sum_{k=m}^{n-1} \phi_{X_k,X_{k+1}}(w,g)} \mid X_m = j \big\} = \sum_{l \in I} p_{jl}(f^m_j)\, e^{\gamma [\, c_{j,l} - w_j + w_l - g\, ]}\; \mathbb{E}^{\pi^{m+1}}_l\, e^{\gamma \sum_{k=m+1}^{n-1} \phi_{X_k,X_{k+1}}(w,g)}$    (19)

Employing (18) the following facts can be easily verified by (16), (17).

Result. (i) Let for a given stationary policy π (f) there exist g(f), w j (f) s such that p ij (f i ) e γ φi,j(w(f),g(f)) =, i I. (20) Then p ij (f i ) e γ[ci,j wi(f)+wj(f) g(f)] = Ui π (γ, α) = e γ( α g(f)+wi(f)) E π γ ( α) i e α k w Xk (f) k= (2) U π i (γ, n) = e γ(ng(f)+w i(f)) E π i e γ w Xn (f) (22) (ii) If it is possible to select g = g and w j = w j s, resp. g = ĝ and w j = ŵ j s, such that for any f F, all i I and some f F, resp. ˆf F, respectively p ij (f i ) e γ φi,j(w,g ) with p ij (f i ) e γ φi,j(ŵ,ĝ) with p ij (f i ) e γ φ i,j(w,g ) = (23) p ij ( ˆf i ) e γ φi,j(ŵ,ĝ) = (24) then e γ( α ĝ+ŵ i) e γ K U ˆπ i (γ, α) e γ(nĝ+ŵi) e γ K U ˆπ i (γ, n) U π i (γ, α) e γ( α g +w i ) e γ K (25) U π i (γ, n)) e γ(ng +w i ) e γ K. (26) Result 2. Let (cf. (2)) Zi π(γ, α) = γ ln U i π(γ, α), Zπ i (γ, n) = γ ln U i π (γ, n). Then by (25), (26) for stationary policies ˆπ ( ˆf), π (f ), and by (6), (7) for an arbitrary policy π = (f n ) n Z i ˆπ (γ, n) = ( α)zi ˆπ (γ, α) = ĝ, α n Zπ i (γ, n) = g, resp. n ln[e π i e γ 3 Poissonian Equations n Zπ i (γ, n) = ĝ, n Zπ i (γ, n) = α ( α)z π i (γ, α) = g (27) if and only if φ Xk,X k+ (w,g ) ] = 0, resp. n ln[e π i e γ φ Xk,X k+ (ŵ,ĝ) ] = 0. (28) The system of equations (20) for the considered stationary policy π (f) and the nonlinear systems of equations (23), (24) for finding stationary policy with maximal/minimal value of g(f) can be also written as e γ[g(f)+w i(f)] = e γ[g +w i ] = max e γ[ĝ+ŵ i] = min p ij (f i ) e γ[c i,j+w j (f)] (i I) (29) p ij (f i ) e γ[ci,j+w j ] (i I) (30) p ij (f i ) e γ[c i,j+ŵ j ] (i I) (3) respectively, for the values g(f), ĝ, g, w i (f), wi, ŵ i (i =,..., N); obviously, these values depend on the selected risk sensitivity γ. Eqs. (30), (3) can be called the γ-average reward/cost optimality equation. In particular, if γ 0 using the Taylor expansion by (29), resp. (3), we have g(f) + w i (f) = p ij (f i ) [c i,j + w j (f)], resp. ĝ + ŵ i = min that well corresponds to (9). p ij (f i ) [c i,j + ŵ j ] - 802 -

On introducing new variables $v_i(f) := e^{\gamma w_i(f)}$, $\rho(f) := e^{\gamma g(f)}$, and on replacing the transition probabilities $p_{ij}(f_i)$ by the general nonnegative numbers $q_{ij}(f_i) := p_{ij}(f_i)\, e^{\gamma c_{ij}}$, (29) can be alternatively written as the following set of equations

  $\rho(f)\, v_i(f) = \sum_{j \in I} q_{ij}(f_i)\, v_j(f) \quad (i \in I)$    (32)

and (30), (31) can be rewritten as the following sets of nonlinear equations (here $\hat{v}_i := e^{\gamma \hat{w}_i}$, $v^*_i := e^{\gamma w^*_i}$, $\hat{\rho} := e^{\gamma \hat{g}}$, $\rho^* := e^{\gamma g^*}$)

  $\rho^* v^*_i = \max_{f_i \in A_i} \sum_{j \in I} q_{ij}(f_i)\, v^*_j, \qquad \hat{\rho}\, \hat{v}_i = \min_{f_i \in A_i} \sum_{j \in I} q_{ij}(f_i)\, \hat{v}_j \quad (i \in I)$    (33)

called the $\gamma$-average reward/cost optimality equations in multiplicative form.

For what follows it is convenient to consider (32), (33) in matrix form. To this end, we introduce the $N \times N$ matrix $Q(f) = [q_{ij}(f_i)]$ with spectral radius (Perron eigenvalue) $\rho(f)$ along with its right Perron eigenvector $v(f) = [v_i(f)]$, hence (cf. [5]) $\rho(f)\, v(f) = Q(f)\, v(f)$. Similarly, for $v(f^*) = v^*$, $v(\hat{f}) = \hat{v}$, (33) can be written in matrix form as

  $\rho^* v^* = \max_{f \in F} Q(f)\, v^*, \qquad \hat{\rho}\, \hat{v} = \min_{f \in F} Q(f)\, \hat{v}.$    (34)

Recall that the vectorial maximum and minimum in (34) should be considered componentwise and that $\hat{v}$, $v^*$ are unique up to a multiplicative constant. Furthermore, if the transition probability matrix $P(f)$ is irreducible then also $Q(f)$ is irreducible and the right Perron eigenvector $v(f)$ can be selected strictly positive. To extend this assertion to unichain models, in contrast to condition (*) for the risk-neutral case, it is necessary to assume the existence of a state $i_0 \in I$, accessible from any state $i \in I$ for every $f \in F$, that belongs to the basic class of $Q(f)$, i.e. an irreducible class with spectral radius equal to the Perron eigenvalue of $Q(f)$.    (**)

If condition (**) is fulfilled, it can be shown (cf. [15]) that in (30), (31) and (33) the eigenvectors $v(f), \hat{v}, v^*$ can be selected strictly positive and that $\rho^*$, resp. $\hat{\rho}$, is the maximal, resp. minimal, Perron eigenvalue of the matrix family $\{Q(f),\ f \in F\}$. So we have arrived at:

Result 3. A sufficient condition for the existence of solutions of the $\gamma$-average reward/cost optimality equations is the existence of a state $i_0 \in I$ fulfilling condition (**) (trivially fulfilled for irreducible models).

In particular, for unichain models condition (**) is fulfilled if the risk sensitivity coefficient $\gamma$ is sufficiently close to zero (cf. [3, 4, 15]). Finding a solution of (34) can be performed by policy or value iteration. Details can be found e.g. in [2, 3, 7, 14, 15].

Finally, we rewrite the optimality condition (28) of Result 2 in terms of $Q(f)$, $\hat{v}$, $\hat{\rho}$. To this end, first observe that

  $\mathbb{E}^{\pi}_{i_0}\, e^{\gamma \sum_{k=0}^{n-1} \phi_{X_k,X_{k+1}}(\hat{w}, \hat{g})} = \sum_{i_1,\ldots,i_n \in I}\ \prod_{k=0}^{n-1} p_{i_k,i_{k+1}}(f^k_{i_k})\, e^{\gamma [\, c_{i_k,i_{k+1}} + \hat{w}_{i_{k+1}} - \hat{w}_{i_k} - \hat{g}\, ]} = \sum_{i_1,\ldots,i_n \in I}\ \prod_{k=0}^{n-1} q_{i_k,i_{k+1}}(f^k_{i_k})\, \hat{v}_{i_{k+1}}\, \hat{v}^{-1}_{i_k}\, \hat{\rho}^{-1}.$    (35)

So equation (28) can also be written in matrix form as

  $\lim_{n\to\infty} \frac{1}{n} \ln \Big\{ \hat{V}^{-1} \Big[ \prod_{k=0}^{n-1} Q(f^k) \Big] \hat{V}\, \hat{\rho}^{-n} \Big\} = \lim_{n\to\infty} \frac{1}{n} \ln \Big\{ \hat{V}^{-1} \Big[ \prod_{k=0}^{n-1} Q(f^k)\, \hat{\rho}^{-1} \Big] \hat{V} \Big\} = \mathbf{0}$    (36)

where the diagonal matrix $\hat{V} = \mathrm{diag}\,[\hat{v}_i]$ and $\mathbf{0}$ is reserved for a null matrix.
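To make the multiplicative form concrete, here is a minimal value-iteration sketch for the minimising equation in (33)/(34), assuming Python/NumPy. The action-indexed matrices P[a], C[a] (every action admissible in every state) and the example data are illustrative simplifications, and convergence is only to be expected under the conditions of Result 3. As $\gamma$ approaches zero, the returned $\hat{g}$ approaches the risk-neutral average cost of (9), in line with the Taylor-expansion remark above.

```python
import numpy as np

def risk_sensitive_value_iteration(P, C, gamma, tol=1e-12, max_iter=100_000):
    """Value iteration on the minimising multiplicative optimality equation (33)/(34):
    rho_hat * v_hat = min_f Q(f) v_hat,  Q(f) = [p_ij(f_i) * exp(gamma * c_ij)].
    Returns (g_hat, v_hat, f_hat) with g_hat = (1/gamma) * ln(rho_hat)."""
    Q = P * np.exp(gamma * C)            # q_ij(a) = p_ij(a) * exp(gamma * c_ij)
    N = P.shape[1]
    v = np.ones(N)
    rho = 1.0
    for _ in range(max_iter):
        Tv = np.min(Q @ v, axis=0)       # componentwise minimum over actions of Q(f) v
        rho_new, v_new = Tv[0], Tv / Tv[0]   # v_hat is unique up to a multiplicative constant
        if np.max(np.abs(v_new - v)) < tol and abs(rho_new - rho) < tol:
            rho, v = rho_new, v_new
            break
        rho, v = rho_new, v_new
    f_hat = np.argmin(Q @ v, axis=0)     # minimising decision vector
    return np.log(rho) / gamma, v, f_hat

if __name__ == "__main__":
    P = np.array([[[0.9, 0.1], [0.4, 0.6]],      # P[a] for actions a = 0, 1
                  [[0.2, 0.8], [0.7, 0.3]]])
    C = np.array([[[1.0, 4.0], [2.0, 0.5]],      # C[a]: one-stage costs under action a
                  [[3.0, 0.2], [0.3, 2.0]]])
    for gamma in (0.01, 0.5):
        g_hat, v_hat, f_hat = risk_sensitive_value_iteration(P, C, gamma)
        print(f"gamma = {gamma}:  g_hat = {g_hat:.4f},  f_hat = {f_hat}")
```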

4 Conclusions

In this note necessary and sufficient optimality conditions for discrete time Markov decision chains are obtained, along with equations for average optimal policies, both for risk-neutral and risk-sensitive models. Our analysis is restricted to unichain models, and for the risk-sensitive case some additional assumptions are made. For multichain models it is necessary to find a suitable partition of the state space into nested classes that retain some properties of the unichain model. Some results in this direction can be found in [11, 15, 17, 18].

Acknowledgements

This research was supported by the Czech Science Foundation under Grants P402/11/0150 and P402/10/0956.

References

[1] Bertsekas, D. P.: Dynamic Programming and Optimal Control. Volume 2. Third Edition. Athena Scientific, Belmont, Mass., 2007.

[2] Cavazos-Cadena, R. and Montes-de-Oca, R.: The value iteration algorithm in risk-sensitive average Markov decision chains with finite state space. Mathematics of Operations Research 28 (2003), 752–756.

[3] Cavazos-Cadena, R.: Solution to the risk-sensitive average cost optimality equation in a class of Markov decision processes with finite state space. Mathematical Methods of Operations Research 57 (2003), 253–285.

[4] Cavazos-Cadena, R. and Hernández-Hernández, D.: A characterization of the optimal risk-sensitive average cost in finite controlled Markov chains. Annals of Applied Probability 15 (2005), 175–212.

[5] Gantmakher, F. R.: The Theory of Matrices. Chelsea, London, 1959.

[6] Howard, R. A.: Dynamic Programming and Markov Processes. MIT Press, Cambridge, Mass., 1960.

[7] Howard, R. A. and Matheson, J.: Risk-sensitive Markov decision processes. Management Science 23 (1972), 356–369.

[8] Mandl, P.: On the variance in controlled Markov chains. Kybernetika 7 (1971), 1–12.

[9] Puterman, M. L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, 1994.

[10] Ross, S. M.: Introduction to Stochastic Dynamic Programming. Academic Press, New York, 1983.

[11] Rothblum, U. G. and Whittle, P.: Growth optimality for branching Markov decision chains. Mathematics of Operations Research 7 (1982), 582–601.

[12] Sladký, K.: On the set of optimal controls for Markov chains with rewards. Kybernetika 10 (1974), 526–547.

[13] Sladký, K.: On dynamic programming recursions for multiplicative Markov decision chains. Mathematical Programming Study 6 (1976), 216–226.

[14] Sladký, K.: Bounds on discrete dynamic programming recursions I. Kybernetika 16 (1980), 526–547.

[15] Sladký, K.: Growth rates and average optimality in risk-sensitive Markov decision chains. Kybernetika 44 (2008), 205–226.

[16] Tijms, H. C.: A First Course in Stochastic Models. Wiley, Chichester, 2003.

[17] Whittle, P.: Optimization Over Time: Dynamic Programming and Stochastic Control. Volume II, Chapter 35. Wiley, Chichester, 1983.

[18] Zijm, W. H. M.: Nonnegative Matrices in Dynamic Programming. Mathematical Centre Tract, Amsterdam, 1983.