The Role of Discount Factor in Risk Sensitive Markov Decision Processes


2016 5th Brazilian Conference on Intelligent Systems

Valdinei Freire
Escola de Artes, Ciências e Humanidades, Universidade de São Paulo, São Paulo, Brazil
valdinei.freire@usp.br

Abstract: Markov Decision Processes (MDPs) have long been the framework for modeling optimal decisions in stochastic environments. Although less known, Risk Sensitive Markov Decision Processes (RSMDPs) extend MDPs by allowing arbitrary risk attitudes. However, not every environment is well-defined in MDPs and RSMDPs, and both versions make use of discounted costs to turn every problem into a well-defined one; because of the exponential growth in RSMDPs, obtaining a well-defined problem is even harder. Here, we show that the use of discounted costs: (i) induces a risk-prone attitude in MDPs, and (ii) hinders a risk-averse attitude in RSMDPs for some scenarios.

I. INTRODUCTION

Markov Decision Processes (MDPs) have long been a successful framework for planning in stochastic environments [1]. MDPs are used in planning problems or learning problems, underlying the Reinforcement Learning framework [2]. Solutions to MDPs are optimal policies that minimize a cost function based on an immediate cost return. However, in contrast with human decision-makers, who show risk aversion in their preferences [3], MDPs do not take risk attitudes into account [4].

Over the last four decades, some extensions to MDPs have been proposed to take risk attitudes into account. Some of them consider Expected Utility Theory [5] to define risk attitude and propose appropriate methods to account for it [6], [7], [8], [4], whereas others consider the role of variance to define risk [9], [10]. Some authors also consider risk without relying on any general-purpose theory [8], [11].

A direct extension to MDPs is the Risk-Sensitive MDP (RSMDP) framework, where the objective of planning is defined as an exponential utility function with a parameter λ [4]. Just like linear utility functions, which present risk-neutral attitudes, the exponential utility function has some good mathematical properties; for example, if all outcomes are increased by an amount Δ, the certain equivalent is also increased by Δ [4], which guarantees that dynamic programming can be applied to RSMDPs. However, the exponential function presents a problem: it grows too fast.

In the MDP framework two different scenarios are commonly defined: infinite horizon and indefinite horizon [1]. The infinite horizon considers the average immediate cost as the cost function, whereas the indefinite horizon considers the immediate-cost sum as the cost function. Although algorithms for both cost functions are not in general interchangeable, there is a cost function that can unify both scenarios: the discounted immediate-cost sum, which considers a discount factor γ ∈ (0, 1) as a parameter. Algorithms that find optimal policies under the discounted sum can also find good approximations in both scenarios, infinite and indefinite horizon; in fact, the approximation gets closer to optimality as γ → 1 [1].

In RSMDPs both scenarios can also be taken; however, whereas the infinite horizon is well-defined, the indefinite horizon does not converge for every scenario and risk factor λ. Moreover, whereas a discount factor can approximately unify both cost functions within MDPs, within RSMDPs the use of a discount factor makes optimal policies non-stationary.
Finally, in an indefinite horizon scenario with constant immediate cost, i.e., the shortest stochastic path problem, the discounted version of the problem is equivalent to using an exponential utility function, but with a risk-prone attitude [13]. In this paper we analyze what happens to risk attitude in the shortest stochastic path problem when RSMDPs and MDPs are considered with a discount factor. We also analyze two other alternatives to risk sensitivity [8], [11]. In section II we present the MDP and RSMDP frameworks, whereas in sections III and IV we present our results. Finally, section V presents some final considerations.

II. MARKOV COST PROCESSES

We consider a general framework that underlies MDPs and RSMDPs, the Markov Cost Process (MCP) [4]. (In [4], Markov Reward Processes are defined; here, we assume rewards are always negative and we define MCPs instead.) An MCP is defined by a tuple ⟨S, A, B_0, T, c⟩, where: S is a set of states; A is a set of actions; B_0 : S → [0, 1] is an initial state distribution, where B_0(s) = Pr(s_0 = s); T : S × A × S → [0, 1] is a probabilistic transition function, where T(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a); and c : S × A → R is a cost function. A process starts at a state s_0 drawn according to B_0 and at any time step t ≥ 0 evolves as follows: (i) the agent perceives the state s_t and chooses an action a_t; (ii) a cost c_t = c(s_t, a_t) is incurred; and

(iii) the process transits to a next state s_{t+1} following the probability P(s_{t+1} = s' | s_t = s, a_t = a) = T(s, a, s'). An agent guides the process by choosing a stationary policy ρ : S → A, which maps each state into an action at any time step, or a non-stationary policy π : S × N → A, which maps each state and time step t ≥ 0 into an action (N is the set of natural numbers).

An important characteristic regarding MCPs is their horizon. Three horizon types can be defined:

1) The finite horizon defines a final time step N, where the process ends. To make the distinction clearer, we define a finite horizon MCP by the tuple ⟨S, A, B_0, T, c, N⟩.

2) The infinite horizon considers that the process goes on for infinitely many steps and never ends. We define an infinite horizon MCP by the tuple ⟨S, A, B_0, T, c, ∞⟩.

3) The indefinite horizon considers a goal state g which is absorbing, where the process stops, or equivalently, c(g, a) = 0 for all a ∈ A, T(g, a, g) = 1 and T(g, a, s) = 0 for s ≠ g and all a ∈ A. We define an indefinite horizon MCP by the tuple ⟨S, A, B_0, T, c, g⟩.

Since the finite horizon poses no convergence problem, in this paper we consider only the infinite and indefinite horizons.

A. Expected Utility Theory

An objective for an MCP can be defined by a utility function U : X → R, where X is the set of outcomes. Given a probability distribution over the set of outcomes X and a decision d, the value of a decision is given by V(d) = Σ_{x∈X} P(x | d) U(x). Then, given a set of feasible decisions D, the optimal decision d* is the decision with the greatest value, i.e., d* = argmax_{d∈D} V(d).

In an MCP, an outcome is described by the cost history h = (c_0, c_1, c_2, ...), i.e., X ⊆ R^∞. However, it is common to summarize a history h into a unique scalar value, for example the cost sum; then X ⊆ R_+, and in this case the utility function is strictly decreasing. If X ⊆ R_+ and U(·) is continuous and decreasing, then a certain equivalent C_d for a decision d can be defined as C_d = U^{-1}(V(d)), and the expected outcome C̄_d can be defined as C̄_d = Σ_{C∈X} P(C | d) C. Then, attitude towards risk can be defined.

Definition 1 (Risk Attitude): The decision maker is risk neutral if and only if C_d = C̄_d for all d ∈ D, or, equivalently, if and only if the utility function is linear. The decision maker is risk prone if and only if C_d ≤ C̄_d for all d ∈ D, and strictly so for at least one decision, or, equivalently, if and only if the utility function is convex. The decision maker is risk averse if and only if C_d ≥ C̄_d for all d ∈ D, and strictly so for at least one decision, or, equivalently, if and only if the utility function is concave.

We consider two ways of summarizing histories up to some time N: the cost sum and the discounted cost sum. The first one is simply C^N = Σ_{t=0}^{N} c_t, whereas the second one is C^N_γ = Σ_{t=0}^{N} γ^t c_t. Finally, in an MCP the set of decisions is the set of policies. In the next sections we make use of these two summarizations to define the MDP's and RSMDP's utility functions.
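To make Definition 1 concrete, the sketch below computes the certain equivalent C_d and the expected outcome C̄_d for a hypothetical two-outcome cost lottery under a decreasing, concave utility; the utility function, outcomes, and probabilities are illustrative assumptions, not anything prescribed by the text.

```python
import math

def certain_equivalent(outcomes, probs, U, U_inv):
    # C_d = U^{-1}(V(d)), where V(d) = sum_x P(x|d) U(x).
    expected_utility = sum(p * U(c) for c, p in zip(outcomes, probs))
    return U_inv(expected_utility)

def expected_outcome(outcomes, probs):
    # The expected cost of the lottery (the C-bar_d of the text).
    return sum(p * c for c, p in zip(outcomes, probs))

# Hypothetical decreasing, concave utility over costs (risk-averse case).
lam = 0.5
U = lambda c: -math.exp(lam * c)
U_inv = lambda u: math.log(-u) / lam

outcomes, probs = [1.0, 11.0], [0.9, 0.1]   # hypothetical cost lottery
print(certain_equivalent(outcomes, probs, U, U_inv))  # about 6.51
print(expected_outcome(outcomes, probs))              # 2.0
```

Here the certain equivalent exceeds the expected cost, which is the risk-averse case of Definition 1; a convex utility would reverse the inequality.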
B. Markov Decision Process

A Markov Decision Process (MDP) is an MCP associated with a utility function. MDPs consider a linear utility function, and are therefore risk neutral. The solution of an MDP consists in solving a non-linear system of equations over a function v : S → R, from which an optimal policy can be defined. We consider three scenarios: infinite horizon, indefinite horizon, and the discount scenario.

1) Infinite Horizon: In the infinite horizon scenario the utility function is

U(h) = lim_{N→∞} C^N / N = lim_{N→∞} (1/N) Σ_{t=0}^{N} c_t.

To define a utility function in the infinite horizon, a limit condition must be considered. However, the condition of infiniteness is necessary: if the process ends or the costs cease, the utility goes to zero and decisions cannot be differentiated among themselves. The process, under at least one optimal policy, must be ergodic, i.e., all states can be reached from every other state and averaging over time or over the state space is equivalent, so that an average performance μ can be calculated for the optimal policy [4]. If that is not the case, then μ is not unique for every state s ∈ S. Finally, optimal policies are stationary.

2) Indefinite Horizon: In the indefinite horizon scenario the utility function is

U(h) = lim_{N→∞} C^N = lim_{N→∞} Σ_{t=0}^{N} c_t.

Again, to define a utility function in the indefinite horizon a limit condition must be considered, but here the process is supposed to end; if that is not the case, the expected utility diverges. A condition for a solution to this scenario to exist is that there exists a proper policy, i.e., a policy that reaches the absorbing state g with probability 1 [4].

3) Discount Factor: Both versions previously presented have two major drawbacks: (i) in infinite horizons the process under the optimal policy must be ergodic, and (ii) in indefinite horizons there must exist a proper policy. These drawbacks restrict the use of both types of horizon in general. The discounted sum cost function allows the problem to be well-defined in both cases. In the discount scenario the utility function is given by

U(h) = lim_{N→∞} C^N_γ = lim_{N→∞} Σ_{t=0}^{N} γ^t c_t.
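As a small illustration of the three summarizations used by these criteria, the sketch below evaluates the average cost, the cost sum, and the discounted cost sum on a hypothetical finite cost history; the history and the discount factor are made-up values.

```python
def total_cost(costs):
    # Indefinite-horizon summarization C^N: plain sum of the immediate costs.
    return sum(costs)

def average_cost(costs):
    # Infinite-horizon summarization C^N / N: average immediate cost.
    return sum(costs) / len(costs)

def discounted_cost(costs, gamma):
    # Discounted summarization C^N_gamma: sum of gamma^t * c_t.
    return sum((gamma ** t) * c for t, c in enumerate(costs))

history = [1, 1, 1, 1]                   # hypothetical constant unit costs
print(total_cost(history))               # 4
print(average_cost(history))             # 1.0
print(discounted_cost(history, 0.9))     # 1 + 0.9 + 0.81 + 0.729 = 3.439
```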

Although the discounted sum cost function can work in both scenarios, infinite and indefinite horizons, it is not a panacea. As we show in section III, discount and risk proneness walk side by side, and decision-makers may not desire such influence.

C. Risk Sensitive Markov Decision Process

Similar to an MDP, a Risk Sensitive Markov Decision Process (RSMDP) is an MCP associated with an objective function. However, RSMDPs consider an exponential utility function, and therefore allow risk-prone or risk-averse attitudes. The choice of a risk attitude is made by a factor λ: if λ > 0, the decision-maker is risk averse, and if λ < 0, the decision-maker is risk prone. Under a limiting argument, it is possible to show that as λ → 0, the RSMDP becomes risk-neutral. Again, we consider three scenarios: infinite horizon, indefinite horizon, and the discount scenario.

1) Infinite Horizon: In the infinite horizon scenario the utility function is

U(h) = sgn(λ) lim_{N→∞} exp(λ C^N / N).

Here, we have the same restriction regarding ergodicity as in an MDP.

2) Indefinite Horizon: In the indefinite horizon scenario the utility function is

U(h) = sgn(λ) lim_{N→∞} exp(λ C^N).

If we compare the treatment of both scenarios in MDPs and RSMDPs, they are very similar, so we can expect the same kind of problems. In fact, the existence of a proper policy is necessary, but not sufficient: the proper policy must also be λ-feasible. A policy ρ is λ-feasible if the probability of not being in an absorbing state vanishes faster than the exponential accumulated cost grows [15], i.e.,

lim_{t→∞} ((D_ρ)^λ T_ρ)^t = 0,   (1)

where T_ρ is a |S \ {g}| × |S \ {g}| matrix with elements T^ρ_{i,j} = T(i, ρ(i), j) and D_ρ is a diagonal matrix with elements D^ρ_{i,i} = exp(c(i, ρ(i))). Note that the λ-feasibility condition in equation (1) depends on λ; a proper policy only guarantees lim_{t→∞} (T_ρ)^t = 0.

Note that, since costs are always positive, unless a policy only generates trajectories of length less than some constant L, for any proper policy ρ there exists a λ_0 > 0 such that ρ is not λ-feasible for any λ > λ_0; i.e., the risk-averse attitude cannot be arbitrarily large.

3) Discount Factor: Again, the drawbacks of the infinite and indefinite scenarios can be solved by replacing C^N with C^N_γ. In the discount scenario the utility function is

U(h) = sgn(λ) lim_{N→∞} exp(λ C^N_γ) = sgn(λ) lim_{N→∞} exp(Σ_{t=0}^{N} (λγ^t) c(s_t, a_t)).

Every previously analyzed criterion admits stationary optimal policies. The tempting solution for a discounted version of RSMDPs introduces a non-stationary policy for optimality. Define the immediate utility function:

q(s, a) = exp(λ c(s, a)).   (2)

In the discount scenario, the immediate utility function is no longer stationary. Considering the discount factor γ, we have:

q_γ(s, a, t) = exp(γ^t λ c(s, a)).   (3)

If we compare equations (2) and (3), the latter presents a non-stationary risk factor γ^t λ, which induces a non-stationary policy. Another problem is that the solution of a discounted-cost RSMDP is given by the following system of equations, for all s ∈ S and t ∈ N:

v(g, θ) = sgn(λ),   θ ∈ R_+,
v(s, θ) = min_{a∈A} Σ_{s'∈S} T(s, a, s') exp(θ c(s, a)) v(s', θγ),
π*(s, t) = argmin_{a∈A} Σ_{s'∈S} T(s, a, s') exp(λγ^t c(s, a)) v(s', λγ^{t+1}),

and it must be solved for θ ∈ {λγ^t : t ∈ N}, which requires approximation techniques.

There exist other alternatives for formulating risk together with discount where optimal policies are stationary; but in these cases, Expected Utility Theory cannot be used to analyze them [8], [11]. However, in these alternatives the use of a discount factor also hinders a risk-averse attitude. In the next section we prove this result.
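Before moving on, the λ-feasibility condition in equation (1) can be checked numerically: lim_{t→∞} M^t = 0 holds exactly when the spectral radius of M = (D_ρ)^λ T_ρ is strictly below 1. The two-state chain below is a hypothetical example used only to illustrate the check.

```python
import numpy as np

def is_lambda_feasible(T_rho, costs, lam):
    """Check equation (1): ((D_rho)^lambda T_rho)^t -> 0 iff the spectral
    radius of (D_rho)^lambda T_rho is strictly less than 1.
    T_rho: sub-stochastic transitions restricted to the non-goal states;
    costs: immediate cost c(i, rho(i)) for each non-goal state i."""
    D_lam = np.diag(np.exp(lam * np.asarray(costs, dtype=float)))
    M = D_lam @ np.asarray(T_rho, dtype=float)
    return np.max(np.abs(np.linalg.eigvals(M))) < 1.0

# Hypothetical proper policy over two non-goal states: each step costs 1 and
# the goal is reached with probability 0.1 from either state.
T_rho = [[0.0, 0.9],
         [0.9, 0.0]]
costs = [1.0, 1.0]

print(is_lambda_feasible(T_rho, costs, lam=0.05))  # True: mild risk aversion is feasible
print(is_lambda_feasible(T_rho, costs, lam=0.50))  # False: lambda is above the threshold
```

For this chain the policy is proper (the spectral radius of T_ρ alone is 0.9 < 1), yet it stops being λ-feasible once λ exceeds roughly ln(1/0.9) ≈ 0.105, matching the remark above that the risk-averse factor cannot be arbitrarily large.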
III. DISCOUNT AND RISK-PRONE ATTITUDE IN UTILITY-BASED DECISION CRITERIA

In [13], it was proved that, in an indefinite horizon with discount and constant immediate cost, MDPs present a risk-prone attitude. Here, we extend this result to any immediate-cost function in MDPs, and we show how to construct problems where a risk-sensitive MCP with discount does not present a risk-averse attitude. Remember that a necessary condition for a utility function to be risk-averse is C_d ≥ C̄_d for all d ∈ D. To prove our results we follow two steps: (i) define an MCP problem; and (ii) show that the risk-averse condition cannot be assured for every such problem. In fact, we create a single parametrized MCP problem and show that an adequate parametrization of it cannot assure the risk-averse condition.

Definition 2 (MCP General Problem): Consider the simple MCP in Figure 1, where the immediate cost is constant (1) in every step. There are only four policies to be chosen: ρ_A takes for sure a path that reaches the goal in 2 steps (cost 2); ρ_B takes a path that reaches the goal in 1 step (cost 1) with probability 1 − ε_N and in N + 1 steps (cost N + 1) with probability ε_N; ρ_C takes a path that reaches the goal in 1 step (cost 1) with probability 1 − ε_γ and does not reach the goal (infinite cost) with probability ε_γ; and

ρ_D takes a path that never reaches the goal. Set ε_N = 1/N, such that under the indefinite-horizon MDP V(ρ_A) = V(ρ_B) and C̄_A = C̄_B = 2; and ε_γ = 1 − γ, such that under the MDP with discount factor γ, V(ρ_A) = V(ρ_C).

Fig. 1. A simple MCP problem. The cost is constant (1), except in the absorbing state g. The start state is s_0, where there are four options of actions: A, B, C, and D; every other state has only one option for acting. N, λ, and γ are free variables; the transition parameters are set as follows: ε_N = 1/N and ε_γ = 1 − γ.

Our analysis consists in comparing C_i and C̄_i. If C_i < C̄_i, then the agent is not risk-averse. We use policy ρ_A only to contrast choices of attitude.

A. Risk-averse attitude and Proper Policies

Note that only ρ_A and ρ_B are proper policies. Suppose N is big (N ≫ 2); the choice is between: (i) putting up with a for-sure cost of 2, or (ii) taking the risk of having a small cost of 1, but with a very small chance ε_N of having a very big cost. If risk attitude is not considered, an agent should strive for both policies equally, since both were set to have the same expected cost, i.e., C_A = C̄_A = C̄_B = C_B = 2. But policies ρ_A and ρ_B show completely different behaviors. A risk attitude is necessary in order to choose between such policies: a risk-averse attitude chooses policy ρ_A, and a risk-prone attitude chooses policy ρ_B.

Even if we do not consider risk attitudes, we may prefer proper policies over non-proper ones [16]. In the MDP with discount scenario, the values of policy ρ_A and policy ρ_C are the same, but the first policy is proper, whereas the second one is not. Besides, the best case of one is only half of the other (in the best case, policy ρ_A incurs total cost 2 and policy ρ_B incurs total cost 1); in fact, it is easy to design a problem where the difference between best cases is as small as we want. Finally, in the MDP with discount scenario, for an appropriate setting of γ we have that V(ρ_A) = V(ρ_D), but ρ_D never reaches the goal. It is easy to construct similar scenarios where policies ρ_C and ρ_D are equated to, or even considered better than, policy ρ_A in the RSMDP framework with discount. In the next section we consider only policies ρ_A and ρ_B in our analysis.

B. Markov Decision Process with Discount

In the case of an MDP with discount, we prove the following theorem.

Theorem 1: In the indefinite horizon, an MDP with a discount factor presents a risk-prone attitude.

Proof 1: Remember that C^N_γ = Σ_{t=0}^{N} γ^t c_t. If the immediate cost is constant and unitary, then

C^N_γ = Σ_{t=0}^{N} γ^t = (1 − γ^{N+1}) / (1 − γ),

and, if γ < 1, then ln γ < 0 and the utility function is convex, since

U(c_0, c_1, ..., c_N) = (1 − γ^{N+1}) / (1 − γ) = K_1 + K_2 exp((N + 1) ln γ)

for constants K_1 and K_2. Now consider an arbitrary history of immediate costs h = (c_0, c_1, ..., c_N). In the MDP framework with discount, this sequence of immediate costs is equivalent to the sequence h' = (c'_0 = C^N_γ, c'_1 = 0, ..., c'_N = 0), i.e., U(h) = U(h'), and the decision maker is indifferent between the sequence h and the sequence h', in which the value C^N_γ is paid at once and no discount is applied. But, if γ < 1, then C^N_γ < C^N, so h' accumulates strictly less cost than h, characterizing a risk-prone attitude.

If we apply the previous theorem to the simple problem in Figure 1, the discounted MDP strictly prefers the risky policy ρ_B over ρ_A (V(ρ_B) < V(ρ_A), reading V as expected discounted cost) for any γ < 1 and N > 1, even though C̄_A = C̄_B.

C. Risk Sensitive Markov Decision Process with Discount

We use again the example in Figure 1. Following policy ρ_B, consider the utility of the lengthiest trajectory in the RSMDP framework with discount:

U(h) = sgn(λ) exp(λ C^N_γ) = sgn(λ) exp(λ Σ_{t=0}^{N} γ^t) = sgn(λ) exp(λ (1 − γ^{N+1}) / (1 − γ)),   (4)

and

V(ρ_B) = (1 − ε_N) exp(λ) + ε_N exp(λ (1 − γ^{N+1}) / (1 − γ)) = (1 − 1/N) exp(λ) + (1/N) exp(λ (1 − γ^{N+1}) / (1 − γ)).
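The comparison behind the next theorem can also be checked numerically. The sketch below assumes V(ρ_A) = exp(λ(1 + γ)), the exponential of ρ_A's deterministic discounted cost, uses the expression for V(ρ_B) just derived, and treats a smaller expected exponential of discounted cost as preferable for λ > 0; the helper names and the parameter grid are illustrative, not taken from the paper.

```python
import math

def v_rho_A(lam, gamma):
    # Deterministic two-step path of Figure 1: discounted cost 1 + gamma.
    return math.exp(lam * (1.0 + gamma))

def v_rho_B(lam, gamma, N):
    # Short branch (prob 1 - 1/N): discounted cost 1.
    # Long branch (prob 1/N): N + 1 unit costs, discounted sum (1 - gamma^(N+1)) / (1 - gamma).
    eps = 1.0 / N
    long_cost = (1.0 - gamma ** (N + 1)) / (1.0 - gamma)
    return (1.0 - eps) * math.exp(lam) + eps * math.exp(lam * long_cost)

def smallest_risky_preference(lam, gamma, n_max=10**6):
    # Smallest N for which the discounted RSMDP values the risky policy rho_B
    # strictly better (lower) than the safe policy rho_A.
    for N in range(2, n_max):
        if v_rho_B(lam, gamma, N) < v_rho_A(lam, gamma):
            return N
    return None

for lam, gamma in [(1.0, 0.9), (0.1, 0.9), (0.1, 0.99)]:
    print(lam, gamma, smallest_risky_preference(lam, gamma))
```

The larger λ and the closer γ is to 1, the larger the N required before the risky policy is preferred, which is the behavior discussed after the theorem.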
Now, we can enunciate the following theorem.

Theorem 2: In the indefinite horizon, for an RSMDP with a discount factor and constant immediate cost, there exist problems where the agent does not present a risk-averse attitude, even if λ > 0.

Proof 2: We use the example in Figure 1 and show that there exists an N_0 such that, for all N > N_0, the agent presents a risk-prone attitude. Take the value of policy ρ_B as N → ∞; then:

lim_{N→∞} V(ρ_B) = lim_{N→∞} [(1 − 1/N) exp(λ) + (1/N) exp(λ (1 − γ^{N+1}) / (1 − γ))] = exp(λ).

Then, for every ε > 0 there exists N_0 such that V(ρ_B) ≤ exp(λ) + ε for all N > N_0. If we choose ε = exp(λ), we have:

V(ρ_B) ≤ exp(λ) + exp(λ) = 2 exp(λ) = exp(ln 2 + λ) < exp(2λ) = V(ρ_A)

(the last inequality holds once λ > ln 2; for smaller λ > 0 a smaller ε suffices, since exp(λ) < V(ρ_A)), or, equivalently, C_B < C_A = C̄_A = C̄_B, i.e., the decision is risk-prone.

Here, differently from MDPs, it is not the case that in every problem the agent is risk-prone. This is because the agent changes her attitude according to the amount of accumulated cost. Figure 2 shows the disutility function (the negative of the utility function) for different values of γ and λ, as given by equation (4); note the difference in scales for each graph and the change from convex (risk averse) to concave (risk prone) in each configuration. In our simple example, considering γ = 0.99, if λ = 1 then N_0 > 10^42, whereas if λ = 0.1 then N_0 > 10^5. However, if γ = 0.9 and λ = 0.1, then N_0 = 5 works.

Fig. 2. Disutility of the constant cost c_t = 1 in the RSMDP framework with discount, as a function of the accumulated constant cost, for λ ∈ {1.0, 0.1} and γ ∈ {0.99, 0.9}.

IV. NON-UTILITY DECISION CRITERIA WITH RISK AND DISCOUNT

In the previous section we analyzed two alternatives based on utility functions. In this section we consider two alternatives where risk is taken into account together with discount. Instead of considering a utility function and deriving a system of equations, these decision criteria start directly from a fixed-point equation, from which a general analytical utility function cannot be derived.

A. Discount over Certain Equivalent

The first alternative considers a system of equations based on the exponential function and therefore is similar to the RSMDP discussed before. However, discounts are applied over the certain equivalent, instead of over the cost itself [8]. Then, for λ > 0, the system of equations to be solved is:

exp(λ v(s)) = min_{a∈A} Σ_{s'∈S} T(s, a, s') exp(λ c(s, a)) exp(γ λ v(s')),

and the optimal stationary policy is obtained from:

ρ*(s) = argmin_{a∈A} Σ_{s'∈S} T(s, a, s') exp(λ c(s, a)) exp(γ λ v(s')).

Here, v(s) denotes a value closer in semantics to the certain equivalent of state s than to its expected utility. We again use the simple example in Figure 1, comparing policies ρ_A and ρ_B. Note that in this example the only risk to be resolved is at the start state, after which all we have is determinism and no recurrence. In this case, the same analysis done for the RSMDP with discount applies here.

Theorem 3: In the indefinite horizon with discount over the certain equivalent, there exist problems where the agent does not present a risk-averse attitude, even if λ > 0.

Proof 3: The proof is the same as in Theorem 2.

B. Discount with piecewise-linear risk attitude

Again, the second alternative considers a system of equations, but this time based on a piecewise-linear function [11]. Consider the function U : R × (−1, 1) → R defined by:

U_λ(x) = U(x, λ) = (1 − λ) x if x > 0, and (1 + λ) x otherwise.

Then, the following system of equations must be solved to obtain the value of a stationary policy ρ:

0 = Σ_{s'∈S} T(s, ρ(s), s') U_λ[−c(s, ρ(s)) + γ v^ρ(s') − v^ρ(s)],

and the optimal policy is the one that maximizes the value function for all states, i.e., ρ* is optimal if and only if v^{ρ*}(s) ≥ v^ρ(s) for all s ∈ S and all stationary policies ρ. Here, λ is a risk factor, but limited to (−1, 1); it has the same semantics as before, i.e., if λ > 0 we have a risk-averse attitude and if λ < 0 we have a risk-prone attitude.
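A minimal sketch of this piecewise-linear criterion on a hypothetical two-state chain is given below; the damped fixed-point iteration, the step size, and the example numbers are illustrative assumptions, not the algorithm prescribed in [11].

```python
import numpy as np

def U_piecewise(x, lam):
    # The transform U_lambda from the text: (1 - lambda) x for x > 0, (1 + lambda) x otherwise.
    return (1.0 - lam) * x if x > 0 else (1.0 + lam) * x

def evaluate_policy(T, costs, gamma, lam, alpha=0.1, iters=20000):
    # Damped fixed-point iteration for
    #   0 = sum_{s'} T[s][s'] * U_lambda(-c(s) + gamma * v(s') - v(s))
    # over the non-goal states; the last state index is the absorbing goal with v = 0.
    n = len(costs)
    v = np.zeros(n + 1)
    for _ in range(iters):
        for s in range(n):
            residual = sum(T[s][sp] * U_piecewise(-costs[s] + gamma * v[sp] - v[s], lam)
                           for sp in range(n + 1))
            v[s] += alpha * residual
    return v[:n]

# Hypothetical chain: from s0 the goal is reached with prob 0.5, otherwise s1;
# from s1 the goal is reached with prob 1. Every step costs 1.
T = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
costs = [1.0, 1.0]
print(evaluate_policy(T, costs, gamma=0.9, lam=0.0))  # risk-neutral: about [-1.45, -1.0]
print(evaluate_policy(T, costs, gamma=0.9, lam=0.3))  # lambda > 0: s0 is valued lower (worse)
```

With λ = 0 the fixed point reduces to the usual risk-neutral evaluation v(s) = −c(s) + γ Σ_{s'} T(s, s') v(s'); with λ > 0 the negative surprises are upweighted, so the uncertain start state s0 receives a lower value.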
Using again our simple problem, and considering the lack of risk after the first step, the value of state s_0 under policy ρ_A is:

v^{ρ_A}(s_0) = −(1 + γ).

The value of policy ρ_B, in turn, can be found by solving:

0 = (1 − ε_N) U_λ(−1 − v(s_0)) + ε_N U_λ(−(1 − γ^{N+1})/(1 − γ) − v(s_0))
0 = (1 − ε_N)(1 − λ)(−1 − v(s_0)) + ε_N (1 + λ)(−(1 − γ^{N+1})/(1 − γ) − v(s_0))
0 = (1 − 1/N)(1 − λ)(−1 − v(s_0)) + (1/N)(1 + λ)(−(1 − γ^{N+1})/(1 − γ) − v(s_0)).   (5)

Theorem 4: In the indefinite horizon with discount and piecewise-linear risk attitude, there exist problems where the agent does not present a risk-averse attitude, even if λ > 0.

Proof 4: Consider equation (5) when N → ∞; then we have:

0 = (1 − λ)(−1 − v(s_0)), hence v^{ρ_B}(s_0) = −1 > −(1 + γ) = v^{ρ_A}(s_0).

By the same argument used in Theorem 2, there exists N_0 such that for any N > N_0 the agent presents a risk-prone attitude.

V. CONCLUSION

We have shown that using a discount factor to take decisions in sequential decision problems in stochastic environments may produce an undesired effect: the decision-maker behaves with a risk-prone attitude. Even in frameworks where a risk-averse attitude can be set arbitrarily, the use of discount may result in risk-prone attitudes.

Although ignored by much research in the literature, we showed that problems can be designed such that undesirable decisions may be taken if risk attitude is not considered. Decisions may be indifferent between proper policies and non-proper ones and, in fact, the risk-neutral attitude has been used without taking this result into consideration. In the case of RSMDPs with discount, if parameters are set appropriately, then the conversion to a risk-prone attitude can be avoided. However, choosing agents with large risk-averse parameters may also produce undesirable effects, such as paying too much attention to worst-case scenarios. Balancing these two effects is a challenge that must be tackled.

REFERENCES

[1] Mausam and A. Kolobov, Planning with Markov Decision Processes: An AI Perspective, ser. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool Publishers, 2012.
[2] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge Univ Press, 1998.
[3] R. L. Keeney and H. Raiffa, Decisions with Multiple Objectives: Preferences and Value Tradeoffs. New York: Wiley, 1976.
[4] R. A. Howard and J. E. Matheson, "Risk-sensitive Markov decision processes," Management Science, vol. 18, no. 7, pp. 356-369, 1972.
[5] J. von Neumann and O. Morgenstern, The Theory of Games and Economic Behaviour, 2nd ed. Princeton: Princeton University Press, 1947.
[6] Y. Liu and S. Koenig, "Probabilistic planning with nonlinear utility functions," in ICAPS, 2006.
[7] Y. Liu and S. Koenig, "An exact algorithm for solving MDPs under risk-sensitive planning objectives with one-switch utility functions," in AAMAS, 2008.
[8] K.-J. Chung and M. J. Sobel, "Discounted MDP's: distribution functions and exponential utility maximization," SIAM J. Control Optim., vol. 25, pp. 49-62, January 1987.
[9] S. Mannor, D. Simester, P. Sun, and J. N. Tsitsiklis, "Bias and variance in value function estimation," in ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning. New York, NY, USA: ACM, 2004.
[10] S. Mannor and J. N. Tsitsiklis, "Mean-variance optimization in Markov decision processes," in ICML, 2011.
[11] O. Mihatsch and R. Neuneier, "Risk-sensitive reinforcement learning," Machine Learning, vol. 49, pp. 267-290, November 2002.
[12] X.-R. Cao, "A sensitivity view of Markov decision processes and reinforcement learning," in Modeling, Control and Optimization of Complex Systems.
[13] R. Minami and V. F. da Silva, "Shortest stochastic path with risk sensitive evaluation," in 11th Mexican International Conference on Artificial Intelligence (MICAI 2012), ser. Lecture Notes in Artificial Intelligence, I. Batyrshin and M. G. Mendoza, Eds. San Luis Potosí, Mexico: Springer, 2012.
[14] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed. New York, NY: John Wiley and Sons, 1994.
[15] S. D. Patek, "On terminating Markov decision processes with a risk averse objective function," Automatica, vol. 37, 2001.
[16] S. d. L. Pereira, L. N. Barros, and F. G. Cozman, "Strong probabilistic planning," in MICAI '08: Proceedings of the 7th Mexican International Conference on Artificial Intelligence. Berlin, Heidelberg: Springer-Verlag, 2008.
