The Role of Discount Factor in Risk Sensitive Markov Decision Processes


2016 5th Brazilian Conference on Intelligent Systems

Valdinei Freire
Escola de Artes, Ciências e Humanidades, Universidade de São Paulo, São Paulo, Brazil
valdinei.freire@usp.br

Abstract: Markov Decision Processes (MDPs) have long been the framework for modeling optimal decisions in stochastic environments. Although less known, Risk Sensitive Markov Decision Processes (RSMDPs) extend MDPs by allowing arbitrary risk attitudes. However, not every environment is well-defined in MDPs and RSMDPs, and both versions make use of discounted costs to turn every problem into a well-defined one; because of the exponential growth in RSMDPs, obtaining a well-defined problem is even harder. Here, we show that the use of discounted costs: (i) induces a risk-prone attitude in MDPs, and (ii) hinders a risk-averse attitude in RSMDPs for some scenarios.

I. INTRODUCTION

Markov Decision Processes (MDPs) have long been a successful framework for planning in stochastic environments [1]. MDPs are used in planning problems or learning problems, underlying the Reinforcement Learning framework [2]. Solutions to MDPs are optimal policies that minimize a cost function based on an immediate cost return. However, in contrast with human decision-makers, who show risk aversion in their preferences [3], MDPs do not take risk attitudes into account [4].

Over the last four decades, some extensions to MDPs have been proposed to take risk attitudes into account. Some of them consider Expected Utility Theory [5] to define risk attitude and propose appropriate methods to account for it [6], [7], [8], [4], whereas others consider the role of variance to define risk [9], [10]. Some authors also consider risk without relying on any general-purpose theory [8], [11].

A direct extension to MDPs is the Risk-Sensitive MDP (RSMDP) framework, where the objective of planning is defined as an exponential utility function with a parameter λ [4]. Just like linear utility functions, which present risk-neutral attitudes, the exponential utility function has some good mathematical properties; for example, if all outcomes are increased by an amount Δ, the certain equivalent is also increased by Δ [4], which guarantees that dynamic programming can be applied to RSMDPs. However, the exponential function presents a problem: it grows too fast.

In the MDP framework two different scenarios are commonly defined: infinite horizon and indefinite horizon [1]. The infinite horizon considers the average immediate cost as the cost function, whereas the indefinite horizon considers the immediate-cost sum as the cost function. Although algorithms for both cost functions are not in general interchangeable, there is a cost function that can unify both scenarios: the discounted immediate-cost sum, which considers a discount factor γ ∈ (0, 1) as a parameter. Algorithms that find optimal policies under the discounted sum can also find good approximations in both scenarios, infinite and indefinite horizon; in fact, the approximation gets closer to optimality as γ → 1 [1].

In RSMDPs both scenarios can also be taken; however, whereas the infinite horizon is well-defined, the indefinite horizon does not converge for every scenario and risk factor λ. Moreover, whereas a discount factor can approximately unify both cost functions within MDPs, within RSMDPs the use of a discount factor makes optimal policies non-stationary.
Finally, in an indefinite horizon scenario with constant immediate cost, i.e., the shortest stochastic path problem, the discounted version of the problem is equivalent to using an exponential utility function, but with a risk-prone attitude [13]. In this paper we analyze what happens to risk attitude in the shortest stochastic path problem when RSMDPs and MDPs are considered with a discount factor. We also analyze two other alternatives to risk sensitivity [8], [11]. In section II we present the MDP and RSMDP frameworks, whereas in sections III and IV we present our results. Finally, section V presents some final considerations.

II. MARKOV COST PROCESSES

We consider a general framework that underlies MDPs and RSMDPs, the Markov Cost Process (MCP) [4]. (In [4], Markov Reward Processes are defined; here, we assume rewards are always negative and we define MCPs instead.) An MCP is defined by a tuple ⟨S, A, B_0, T, c⟩, where: S is a set of states; A is a set of actions; B_0 : S → [0, 1] is an initial state distribution, where B_0(s) = Pr(s_0 = s); T : S × A × S → [0, 1] is a probabilistic transition function, where T(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a); and c : S × A → R is a cost function. A process starts at a state s_0 drawn according to B_0 and at any time step t ≥ 0 evolves as follows: (i) the agent perceives the state s_t and chooses an action a_t; (ii) a cost c_t = c(s_t, a_t) is incurred; and

(iii) the process transits to a next state s_{t+1} following the probability P(s_{t+1} = s' | s_t = s, a_t = a) = T(s, a, s'). An agent guides the process by choosing a stationary policy ρ : S → A, which maps each state into an action at any time step, or a non-stationary policy π : S × N → A, which maps each state and time step t ≥ 0 into an action (N is the set of natural numbers).

An important characteristic regarding MCPs is their horizon. Three horizon types can be defined:

1) The finite horizon defines a final time step N, where the process ends. To make the distinction clearer, we define a finite horizon MCP by the tuple ⟨S, A, B_0, T, c, N⟩.

2) The infinite horizon considers that the process goes on for infinitely many steps and never ends. We define an infinite horizon MCP by the tuple ⟨S, A, B_0, T, c, ∞⟩.

3) The indefinite horizon considers a goal state g which is absorbing, where the process stops, or equivalently, c(g, a) = 0 for all a ∈ A, T(g, a, g) = 1 and T(g, a, s) = 0 for s ≠ g and all a ∈ A. We define an indefinite horizon MCP by the tuple ⟨S, A, B_0, T, c, g⟩.

Since the finite horizon poses no convergence problem, in this paper we consider only the infinite and indefinite horizons.

A. Expected Utility Theory

An objective for an MCP can be defined by a utility function U : X → R, where X is the set of outcomes. Given a probability distribution over the set of outcomes X and a decision d, the value of a decision is given by V(d) = Σ_{x∈X} P(x | d) U(x). Then, given a set of feasible decisions D, the optimal decision d* is the decision with the greatest value, i.e., d* = argmax_{d∈D} V(d).

In an MCP, an outcome is described by the cost history h = (c_0, c_1, c_2, ...), i.e., X ⊆ R^∞. However, it is common to summarize a history h into a unique scalar value, for example the cost sum; then X ⊆ R_+, and in this case the utility function is strictly decreasing. If X ⊆ R_+ and U(·) is continuous and decreasing, then a certain equivalent C_d for a decision d can be defined as C_d = U^{-1}(V(d)), and the expected outcome C̄_d can be defined as C̄_d = Σ_{C∈X} P(C | d) C. Then, attitude towards risk can be defined.

Definition 1 (Risk Attitude): The decision maker is risk neutral if and only if C_d = C̄_d for all d ∈ D, or, equivalently, if and only if the utility function is linear. The decision maker is risk prone if and only if C_d ≤ C̄_d for all d ∈ D, and strictly so for at least one decision, or, equivalently, if and only if the utility function is convex. The decision maker is risk averse if and only if C_d ≥ C̄_d for all d ∈ D, and strictly so for at least one decision, or, equivalently, if and only if the utility function is concave.

We consider two ways of summarizing histories up to some time N: the cost sum and the discounted cost sum. The first one is simply C^N = Σ_{t=0}^{N} c_t, whereas the second one is C^N_γ = Σ_{t=0}^{N} γ^t c_t. Finally, in an MCP the set of decisions is the set of policies. In the next sections we make use of these two summarizations to define the MDP's and RSMDP's utility functions.
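To make Definition 1 concrete, the sketch below computes the certain equivalent C_d and the expected outcome C̄_d for a hypothetical two-outcome cost lottery under a decreasing, concave utility; the utility function, outcomes, and probabilities are illustrative assumptions, not anything prescribed by the text.

```python
import math

def certain_equivalent(outcomes, probs, U, U_inv):
    # C_d = U^{-1}(V(d)), where V(d) = sum_x P(x|d) U(x).
    expected_utility = sum(p * U(c) for c, p in zip(outcomes, probs))
    return U_inv(expected_utility)

def expected_outcome(outcomes, probs):
    # The expected cost of the lottery (the C-bar_d of the text).
    return sum(p * c for c, p in zip(outcomes, probs))

# Hypothetical decreasing, concave utility over costs (risk-averse case).
lam = 0.5
U = lambda c: -math.exp(lam * c)
U_inv = lambda u: math.log(-u) / lam

outcomes, probs = [1.0, 11.0], [0.9, 0.1]   # hypothetical cost lottery
print(certain_equivalent(outcomes, probs, U, U_inv))  # about 6.51
print(expected_outcome(outcomes, probs))              # 2.0
```

Here the certain equivalent exceeds the expected cost, which is the risk-averse case of Definition 1; a convex utility would reverse the inequality.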
B. Markov Decision Process

A Markov Decision Process (MDP) is an MCP associated with a utility function. MDPs consider a linear utility function, and are therefore risk neutral. The solution of an MDP consists in solving a non-linear system of equations over a function v : S → R, from which an optimal policy can be defined. We consider three scenarios: infinite horizon, indefinite horizon, and the discount scenario.

1) Infinite Horizon: In the infinite horizon scenario the utility function is

U(h) = lim_{N→∞} C^N / N = lim_{N→∞} (1/N) Σ_{t=0}^{N} c_t.

To define a utility function in the infinite horizon, a limit condition must be considered. However, the condition of infiniteness is necessary: if the process ends or the costs cease, the utility goes to zero and decisions cannot be differentiated among themselves. The process, under at least one optimal policy, must be ergodic, i.e., all states can be reached from every other state and averaging over time or over the state space is equivalent, so that an average performance μ can be calculated for the optimal policy [4]. If that is not the case, then μ is not unique for every state s ∈ S. Finally, optimal policies are stationary.

2) Indefinite Horizon: In the indefinite horizon scenario the utility function is

U(h) = lim_{N→∞} C^N = lim_{N→∞} Σ_{t=0}^{N} c_t.

Again, to define a utility function in the indefinite horizon a limit condition must be considered, but here the process is supposed to end; if that is not the case, the expected utility diverges. A condition for a solution to this scenario to exist is that there exists a proper policy, i.e., a policy that reaches the absorbing state g with probability 1 [4].

3) Discount Factor: Both versions previously presented have two major drawbacks: (i) in infinite horizons the process under the optimal policy must be ergodic, and (ii) in indefinite horizons there must exist a proper policy. These drawbacks restrict the use of both types of horizon in general. The discounted sum cost function allows the problem to be well-defined in both cases. In the discount scenario the utility function is given by

U(h) = lim_{N→∞} C^N_γ = lim_{N→∞} Σ_{t=0}^{N} γ^t c_t.
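As a small illustration of the three summarizations used by these criteria, the sketch below evaluates the average cost, the cost sum, and the discounted cost sum on a hypothetical finite cost history; the history and the discount factor are made-up values.

```python
def total_cost(costs):
    # Indefinite-horizon summarization C^N: plain sum of the immediate costs.
    return sum(costs)

def average_cost(costs):
    # Infinite-horizon summarization C^N / N: average immediate cost.
    return sum(costs) / len(costs)

def discounted_cost(costs, gamma):
    # Discounted summarization C^N_gamma: sum of gamma^t * c_t.
    return sum((gamma ** t) * c for t, c in enumerate(costs))

history = [1, 1, 1, 1]                   # hypothetical constant unit costs
print(total_cost(history))               # 4
print(average_cost(history))             # 1.0
print(discounted_cost(history, 0.9))     # 1 + 0.9 + 0.81 + 0.729 = 3.439
```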

Although the discounted sum cost function can work in both scenarios, infinite and indefinite horizons, it is not a panacea. As we show in section III, discount and risk proneness walk side by side, and decision-makers may not desire such influence.

C. Risk Sensitive Markov Decision Process

Similar to an MDP, a Risk Sensitive Markov Decision Process (RSMDP) is an MCP associated with an objective function. However, RSMDPs consider an exponential utility function, and therefore allow risk-prone or risk-averse attitudes. The choice of a risk attitude is made by a factor λ: if λ > 0, the decision-maker is risk averse, and if λ < 0, the decision-maker is risk prone. Under a limiting argument, it is possible to show that as λ → 0, the RSMDP becomes risk-neutral. Again, we consider three scenarios: infinite horizon, indefinite horizon, and the discount scenario.

1) Infinite Horizon: In the infinite horizon scenario the utility function is

U(h) = sgn(λ) lim_{N→∞} exp(λ C^N / N).

Here, we have the same restriction regarding ergodicity as in an MDP.

2) Indefinite Horizon: In the indefinite horizon scenario the utility function is

U(h) = sgn(λ) lim_{N→∞} exp(λ C^N).

If we compare the treatment of both scenarios in MDPs and RSMDPs, they are very similar, so we can expect the same kind of problems. In fact, the existence of a proper policy is necessary, but not sufficient: the proper policy must also be λ-feasible. A policy ρ is λ-feasible if the probability of not being in an absorbing state vanishes faster than the exponential accumulated cost grows [15], i.e.,

lim_{t→∞} ((D_ρ)^λ T_ρ)^t = 0,   (1)

where T_ρ is a |S \ {g}| × |S \ {g}| matrix with elements T^ρ_{i,j} = T(i, ρ(i), j) and D_ρ is a diagonal matrix with elements D^ρ_{i,i} = exp(c(i, ρ(i))). Note that the λ-feasibility condition in equation (1) depends on λ; a proper policy only guarantees lim_{t→∞} (T_ρ)^t = 0.

Note that, since costs are always positive, unless a policy only generates trajectories of length less than some constant L, for any proper policy ρ there exists a λ_0 > 0 such that ρ is not λ-feasible for any λ > λ_0; i.e., the risk-averse attitude cannot be arbitrarily large.

3) Discount Factor: Again, the drawbacks of the infinite and indefinite scenarios can be solved by replacing C^N with C^N_γ. In the discount scenario the utility function is

U(h) = sgn(λ) lim_{N→∞} exp(λ C^N_γ) = sgn(λ) lim_{N→∞} exp(Σ_{t=0}^{N} (λγ^t) c(s_t, a_t)).

Every previously analyzed criterion admits stationary optimal policies. The tempting solution for a discounted version of RSMDPs introduces a non-stationary policy for optimality. Define the immediate utility function:

q(s, a) = exp(λ c(s, a)).   (2)

In the discount scenario, the immediate utility function is no longer stationary. Considering the discount factor γ, we have:

q_γ(s, a, t) = exp(γ^t λ c(s, a)).   (3)

If we compare equations (2) and (3), the latter presents a non-stationary risk factor γ^t λ, which induces a non-stationary policy. Another problem is that the solution of a discounted-cost RSMDP is given by the following system of equations, for all s ∈ S and t ∈ N:

v(g, θ) = sgn(λ),   θ ∈ R_+,
v(s, θ) = min_{a∈A} Σ_{s'∈S} T(s, a, s') exp(θ c(s, a)) v(s', θγ),
π*(s, t) = argmin_{a∈A} Σ_{s'∈S} T(s, a, s') exp(λγ^t c(s, a)) v(s', λγ^{t+1}),

and it must be solved for θ ∈ {λγ^t : t ∈ N}, which requires approximation techniques.

There exist other alternatives for formulating risk together with discount where optimal policies are stationary; but in these cases, Expected Utility Theory cannot be used to analyze them [8], [11]. However, in these alternatives the use of a discount factor also hinders a risk-averse attitude. In the next section we prove this result.
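Before moving on, the λ-feasibility condition in equation (1) can be checked numerically: lim_{t→∞} M^t = 0 holds exactly when the spectral radius of M = (D_ρ)^λ T_ρ is strictly below 1. The two-state chain below is a hypothetical example used only to illustrate the check.

```python
import numpy as np

def is_lambda_feasible(T_rho, costs, lam):
    """Check equation (1): ((D_rho)^lambda T_rho)^t -> 0 iff the spectral
    radius of (D_rho)^lambda T_rho is strictly less than 1.
    T_rho: sub-stochastic transitions restricted to the non-goal states;
    costs: immediate cost c(i, rho(i)) for each non-goal state i."""
    D_lam = np.diag(np.exp(lam * np.asarray(costs, dtype=float)))
    M = D_lam @ np.asarray(T_rho, dtype=float)
    return np.max(np.abs(np.linalg.eigvals(M))) < 1.0

# Hypothetical proper policy over two non-goal states: each step costs 1 and
# the goal is reached with probability 0.1 from either state.
T_rho = [[0.0, 0.9],
         [0.9, 0.0]]
costs = [1.0, 1.0]

print(is_lambda_feasible(T_rho, costs, lam=0.05))  # True: mild risk aversion is feasible
print(is_lambda_feasible(T_rho, costs, lam=0.50))  # False: lambda is above the threshold
```

For this chain the policy is proper (the spectral radius of T_ρ alone is 0.9 < 1), yet it stops being λ-feasible once λ exceeds roughly ln(1/0.9) ≈ 0.105, matching the remark above that the risk-averse factor cannot be arbitrarily large.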
III. DISCOUNT AND RISK-PRONE ATTITUDE IN UTILITY-BASED DECISION CRITERIA

In [13], it was proved that, in an indefinite horizon with discount and constant immediate cost, MDPs present a risk-prone attitude. Here, we extend this result to any immediate-cost function in MDPs, and we show how to construct problems where a risk-sensitive MCP with discount does not present a risk-averse attitude. Remember that a necessary condition for a utility function to be risk-averse is C_d ≥ C̄_d for all d ∈ D. To prove our results we follow two steps: (i) define an MCP problem; and (ii) show that the risk-averse condition cannot be assured for every such problem. In fact, we create a single parametrized MCP problem and show that an adequate parametrization of it cannot assure the risk-averse condition.

Definition 2 (MCP General Problem): Consider the simple MCP in Figure 1, where the immediate cost is constant (1) in every step. There are only four policies to be chosen: ρ_A takes for sure a path that reaches the goal in 2 steps (cost 2); ρ_B takes a path that reaches the goal in 1 step (cost 1) with probability 1 − ε_N and in N + 1 steps (cost N + 1) with probability ε_N; ρ_C takes a path that reaches the goal in 1 step (cost 1) with probability 1 − ε_γ and does not reach the goal (infinite cost) with probability ε_γ; and

ρ_D takes a path that never reaches the goal. Set ε_N = 1/N, such that under the indefinite-horizon MDP V(ρ_A) = V(ρ_B) and C̄_A = C̄_B = 2; and ε_γ = 1 − γ, such that under the MDP with discount factor γ, V(ρ_A) = V(ρ_C).

Fig. 1. A simple MCP problem. The cost is constant (1), except in the absorbing state g. The start state is s_0, where there are four options of actions: A, B, C, and D; every other state has only one option for acting. N, λ, and γ are free variables; the transition parameters are set as follows: ε_N = 1/N and ε_γ = 1 − γ.

Our analysis consists in comparing C_i and C̄_i. If C_i < C̄_i, then the agent is not risk-averse. We use policy ρ_A only to contrast choices of attitude.

A. Risk-averse attitude and Proper Policies

Note that only ρ_A and ρ_B are proper policies. Suppose N is big (N ≫ 2); the choice is between: (i) putting up with a for-sure cost of 2, or (ii) taking the risk of having a small cost of 1, but with a very small chance ε_N of having a very big cost. If risk attitude is not considered, an agent should strive for both policies equally, since both were set to have the same expected cost, i.e., C_A = C̄_A = C̄_B = C_B = 2. But policies ρ_A and ρ_B show completely different behaviors. A risk attitude is necessary in order to choose between such policies: a risk-averse attitude chooses policy ρ_A, and a risk-prone attitude chooses policy ρ_B.

Even if we do not consider risk attitudes, we may prefer proper policies over non-proper ones [16]. In the MDP with discount scenario, the values of policy ρ_A and policy ρ_C are the same, but the first policy is proper, whereas the second one is not. Besides, the best case of one is only half of the other (in the best case, policy ρ_A incurs total cost 2 and policy ρ_B incurs total cost 1); in fact, it is easy to design a problem where the difference between best cases is as small as we want. Finally, in the MDP with discount scenario, for an appropriate setting of γ we have that V(ρ_A) = V(ρ_D), but ρ_D never reaches the goal. It is easy to construct similar scenarios where policies ρ_C and ρ_D are equated to, or even considered better than, policy ρ_A in the RSMDP framework with discount. In the next section we consider only policies ρ_A and ρ_B in our analysis.

B. Markov Decision Process with Discount

In the case of an MDP with discount, we prove the following theorem.

Theorem 1: In the indefinite horizon, an MDP with a discount factor presents a risk-prone attitude.

Proof 1: Remember that C^N_γ = Σ_{t=0}^{N} γ^t c_t. If the immediate cost is constant and unitary, then

C^N_γ = Σ_{t=0}^{N} γ^t = (1 − γ^{N+1}) / (1 − γ),

and, if γ < 1, then ln γ < 0 and the utility function is convex, since

U(c_0, c_1, ..., c_N) = (1 − γ^{N+1}) / (1 − γ) = K_1 + K_2 exp((N + 1) ln γ)

for constants K_1 and K_2. Now consider an arbitrary history of immediate costs h = (c_0, c_1, ..., c_N). In the MDP framework with discount, this sequence of immediate costs is equivalent to the sequence h' = (c'_0 = C^N_γ, c'_1 = 0, ..., c'_N = 0), i.e., U(h) = U(h'), and the decision maker is indifferent between the sequence h and the sequence h', in which the value C^N_γ is paid at once and no discount is applied. But, if γ < 1, then C^N_γ < C^N, so h' accumulates strictly less cost than h, characterizing a risk-prone attitude.

If we apply the previous theorem to the simple problem in Figure 1, the discounted MDP strictly prefers the risky policy ρ_B over ρ_A (V(ρ_B) < V(ρ_A), reading V as expected discounted cost) for any γ < 1 and N > 1, even though C̄_A = C̄_B.

C. Risk Sensitive Markov Decision Process with Discount

We use again the example in Figure 1. Following policy ρ_B, consider the utility of the lengthiest trajectory in the RSMDP framework with discount:

U(h) = sgn(λ) exp(λ C^N_γ) = sgn(λ) exp(λ Σ_{t=0}^{N} γ^t) = sgn(λ) exp(λ (1 − γ^{N+1}) / (1 − γ)),   (4)

and

V(ρ_B) = (1 − ε_N) exp(λ) + ε_N exp(λ (1 − γ^{N+1}) / (1 − γ)) = (1 − 1/N) exp(λ) + (1/N) exp(λ (1 − γ^{N+1}) / (1 − γ)).
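The comparison behind the next theorem can also be checked numerically. The sketch below assumes V(ρ_A) = exp(λ(1 + γ)), the exponential of ρ_A's deterministic discounted cost, uses the expression for V(ρ_B) just derived, and treats a smaller expected exponential of discounted cost as preferable for λ > 0; the helper names and the parameter grid are illustrative, not taken from the paper.

```python
import math

def v_rho_A(lam, gamma):
    # Deterministic two-step path of Figure 1: discounted cost 1 + gamma.
    return math.exp(lam * (1.0 + gamma))

def v_rho_B(lam, gamma, N):
    # Short branch (prob 1 - 1/N): discounted cost 1.
    # Long branch (prob 1/N): N + 1 unit costs, discounted sum (1 - gamma^(N+1)) / (1 - gamma).
    eps = 1.0 / N
    long_cost = (1.0 - gamma ** (N + 1)) / (1.0 - gamma)
    return (1.0 - eps) * math.exp(lam) + eps * math.exp(lam * long_cost)

def smallest_risky_preference(lam, gamma, n_max=10**6):
    # Smallest N for which the discounted RSMDP values the risky policy rho_B
    # strictly better (lower) than the safe policy rho_A.
    for N in range(2, n_max):
        if v_rho_B(lam, gamma, N) < v_rho_A(lam, gamma):
            return N
    return None

for lam, gamma in [(1.0, 0.9), (0.1, 0.9), (0.1, 0.99)]:
    print(lam, gamma, smallest_risky_preference(lam, gamma))
```

The larger λ and the closer γ is to 1, the larger the N required before the risky policy is preferred, which is the behavior discussed after the theorem.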
Now, we can enunciate the following theorem.

Theorem 2: In the indefinite horizon, for an RSMDP with a discount factor and constant immediate cost, there exist problems where the agent does not present a risk-averse attitude, even if λ > 0.

Proof 2: We use the example in Figure 1 and show that there exists an N_0 such that, for all N > N_0, the agent presents a risk-prone attitude. Take the value of policy ρ_B as N → ∞; then:

lim_{N→∞} V(ρ_B) = lim_{N→∞} [(1 − 1/N) exp(λ) + (1/N) exp(λ (1 − γ^{N+1}) / (1 − γ))] = exp(λ).

Then, for every ε > 0 there exists N_0 such that V(ρ_B) ≤ exp(λ) + ε for all N > N_0. If we choose ε = exp(λ), we have:

V(ρ_B) ≤ exp(λ) + exp(λ) = 2 exp(λ) = exp(ln 2 + λ) < exp(2λ) = V(ρ_A)

(the last inequality holds once λ > ln 2; for smaller λ > 0 a smaller ε suffices, since exp(λ) < V(ρ_A)), or, equivalently, C_B < C_A = C̄_A = C̄_B, i.e., the decision is risk-prone.

Here, differently from MDPs, it is not the case that in every problem the agent is risk-prone. This is because the agent changes her attitude according to the amount of accumulated cost. Figure 2 shows the disutility function (the negative of the utility function) for different values of γ and λ, as given by equation (4); note the difference in scales for each graph and the change from convex (risk averse) to concave (risk prone) in each configuration. In our simple example, considering γ = 0.99, if λ = 1 then N_0 > 10^42, whereas if λ = 0.1 then N_0 > 10^5. However, if γ = 0.9 and λ = 0.1, then N_0 = 5 works.

Fig. 2. Disutility of the constant cost c_t = 1 in the RSMDP framework with discount, as a function of the accumulated constant cost, for λ ∈ {1.0, 0.1} and γ ∈ {0.99, 0.9}.

IV. NON-UTILITY DECISION CRITERIA WITH RISK AND DISCOUNT

In the previous section we analyzed two alternatives based on utility functions. In this section we consider two alternatives where risk is taken into account together with discount. Instead of considering a utility function and deriving a system of equations, these decision criteria start directly from a fixed-point equation, from which a general analytical utility function cannot be derived.

A. Discount over Certain Equivalent

The first alternative considers a system of equations based on the exponential function and therefore is similar to the RSMDP discussed before. However, discounts are applied over the certain equivalent, instead of over the cost itself [8]. Then, for λ > 0, the system of equations to be solved is:

exp(λ v(s)) = min_{a∈A} Σ_{s'∈S} T(s, a, s') exp(λ c(s, a)) exp(γ λ v(s')),

and the optimal stationary policy is obtained from:

ρ*(s) = argmin_{a∈A} Σ_{s'∈S} T(s, a, s') exp(λ c(s, a)) exp(γ λ v(s')).

Here, v(s) denotes a value closer in semantics to the certain equivalent of state s than to its expected utility. We again use the simple example in Figure 1, comparing policies ρ_A and ρ_B. Note that in this example the only risk to be resolved is at the start state, after which all we have is determinism and no recurrence. In this case, the same analysis done for the RSMDP with discount applies here.

Theorem 3: In the indefinite horizon with discount over the certain equivalent, there exist problems where the agent does not present a risk-averse attitude, even if λ > 0.

Proof 3: The proof is the same as in Theorem 2.

B. Discount with piecewise-linear risk attitude

Again, the second alternative considers a system of equations, but this time based on a piecewise-linear function [11]. Consider the function U : R × (−1, 1) → R defined by:

U_λ(x) = U(x, λ) = (1 − λ) x if x > 0, and (1 + λ) x otherwise.

Then, the following system of equations must be solved to obtain the value of a stationary policy ρ:

0 = Σ_{s'∈S} T(s, ρ(s), s') U_λ[−c(s, ρ(s)) + γ v^ρ(s') − v^ρ(s)],

and the optimal policy is the one that maximizes the value function for all states, i.e., ρ* is optimal if and only if v^{ρ*}(s) ≥ v^ρ(s) for all s ∈ S and all stationary policies ρ. Here, λ is a risk factor, but limited to (−1, 1); it has the same semantics as before, i.e., if λ > 0 we have a risk-averse attitude and if λ < 0 we have a risk-prone attitude.
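A minimal sketch of this piecewise-linear criterion on a hypothetical two-state chain is given below; the damped fixed-point iteration, the step size, and the example numbers are illustrative assumptions, not the algorithm prescribed in [11].

```python
import numpy as np

def U_piecewise(x, lam):
    # The transform U_lambda from the text: (1 - lambda) x for x > 0, (1 + lambda) x otherwise.
    return (1.0 - lam) * x if x > 0 else (1.0 + lam) * x

def evaluate_policy(T, costs, gamma, lam, alpha=0.1, iters=20000):
    # Damped fixed-point iteration for
    #   0 = sum_{s'} T[s][s'] * U_lambda(-c(s) + gamma * v(s') - v(s))
    # over the non-goal states; the last state index is the absorbing goal with v = 0.
    n = len(costs)
    v = np.zeros(n + 1)
    for _ in range(iters):
        for s in range(n):
            residual = sum(T[s][sp] * U_piecewise(-costs[s] + gamma * v[sp] - v[s], lam)
                           for sp in range(n + 1))
            v[s] += alpha * residual
    return v[:n]

# Hypothetical chain: from s0 the goal is reached with prob 0.5, otherwise s1;
# from s1 the goal is reached with prob 1. Every step costs 1.
T = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
costs = [1.0, 1.0]
print(evaluate_policy(T, costs, gamma=0.9, lam=0.0))  # risk-neutral: about [-1.45, -1.0]
print(evaluate_policy(T, costs, gamma=0.9, lam=0.3))  # lambda > 0: s0 is valued lower (worse)
```

With λ = 0 the fixed point reduces to the usual risk-neutral evaluation v(s) = −c(s) + γ Σ_{s'} T(s, s') v(s'); with λ > 0 the negative surprises are upweighted, so the uncertain start state s0 receives a lower value.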
Using again our simple problem, and considering the lack of risk after the first step, the value of state s_0 under policy ρ_A is:

v^{ρ_A}(s_0) = −(1 + γ).

The value of policy ρ_B, in turn, can be found by solving:

0 = (1 − ε_N) U_λ(−1 − v(s_0)) + ε_N U_λ(−(1 − γ^{N+1})/(1 − γ) − v(s_0))
0 = (1 − ε_N)(1 − λ)(−1 − v(s_0)) + ε_N (1 + λ)(−(1 − γ^{N+1})/(1 − γ) − v(s_0))
0 = (1 − 1/N)(1 − λ)(−1 − v(s_0)) + (1/N)(1 + λ)(−(1 − γ^{N+1})/(1 − γ) − v(s_0)).   (5)

Theorem 4: In the indefinite horizon with discount and piecewise-linear risk attitude, there exist problems where the agent does not present a risk-averse attitude, even if λ > 0.

Proof 4: Consider equation (5) when N → ∞; then we have:

0 = (1 − λ)(−1 − v(s_0)), hence v^{ρ_B}(s_0) = −1 > −(1 + γ) = v^{ρ_A}(s_0).

By the same argument used in Theorem 2, there exists N_0 such that for any N > N_0 the agent presents a risk-prone attitude.

V. CONCLUSION

We have shown that using a discount factor to take decisions in sequential decision problems in stochastic environments may produce an undesired effect: the decision-maker behaves with a risk-prone attitude. Even in frameworks where a risk-averse attitude can be set arbitrarily, the use of discount may result in risk-prone attitudes.

Although ignored by much research in the literature, we showed that problems can be designed such that undesirable decisions may be taken if risk attitude is not considered. Decisions may be indifferent between proper policies and non-proper ones and, in fact, the risk-neutral attitude has been used without taking this result into consideration. In the case of RSMDPs with discount, if parameters are set appropriately, then the conversion to a risk-prone attitude can be avoided. However, choosing agents with large risk-averse parameters may also produce undesirable effects, such as paying too much attention to worst-case scenarios. Balancing these two effects is a challenge that must be tackled.

REFERENCES

[1] Mausam and A. Kolobov, Planning with Markov Decision Processes: An AI Perspective, ser. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool Publishers, 2012.
[2] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge Univ Press, 1998.
[3] R. L. Keeney and H. Raiffa, Decisions with Multiple Objectives: Preferences and Value Tradeoffs. New York: Wiley, 1976.
[4] R. A. Howard and J. E. Matheson, "Risk-sensitive Markov decision processes," Management Science, vol. 18, no. 7, pp. 356-369, 1972.
[5] J. von Neumann and O. Morgenstern, The Theory of Games and Economic Behaviour, 2nd ed. Princeton: Princeton University Press, 1947.
[6] Y. Liu and S. Koenig, "Probabilistic planning with nonlinear utility functions," in ICAPS, 2006.
[7] Y. Liu and S. Koenig, "An exact algorithm for solving MDPs under risk-sensitive planning objectives with one-switch utility functions," in AAMAS, 2008.
[8] K.-J. Chung and M. J. Sobel, "Discounted MDP's: distribution functions and exponential utility maximization," SIAM J. Control Optim., vol. 25, pp. 49-62, January 1987.
[9] S. Mannor, D. Simester, P. Sun, and J. N. Tsitsiklis, "Bias and variance in value function estimation," in ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning. New York, NY, USA: ACM, 2004.
[10] S. Mannor and J. N. Tsitsiklis, "Mean-variance optimization in Markov decision processes," in ICML, 2011.
[11] O. Mihatsch and R. Neuneier, "Risk-sensitive reinforcement learning," Machine Learning, vol. 49, pp. 267-290, November 2002.
[12] X.-R. Cao, "A sensitivity view of Markov decision processes and reinforcement learning," in Modeling, Control and Optimization of Complex Systems.
[13] R. Minami and V. F. da Silva, "Shortest stochastic path with risk sensitive evaluation," in 11th Mexican International Conference on Artificial Intelligence (MICAI 2012), ser. Lecture Notes in Artificial Intelligence, I. Batyrshin and M. G. Mendoza, Eds. San Luis Potosí, Mexico: Springer, 2012.
[14] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed. New York, NY: John Wiley and Sons, 1994.
[15] S. D. Patek, "On terminating Markov decision processes with a risk averse objective function," Automatica, vol. 37, 2001.
[16] S. d. L. Pereira, L. N. Barros, and F. G. Cozman, "Strong probabilistic planning," in MICAI '08: Proceedings of the 7th Mexican International Conference on Artificial Intelligence. Berlin, Heidelberg: Springer-Verlag, 2008.
