The Role of Discount Factor in Risk Sensitive Markov Decision Processes
2016 5th Brazilian Conference on Intelligent Systems

Valdinei Freire
Escola de Artes, Ciências e Humanidades, Universidade de São Paulo, São Paulo, Brazil
valdinei.freire@usp.br

Abstract—Markov Decision Processes (MDPs) have long been the framework for modeling optimal decisions in stochastic environments. Although less known, Risk Sensitive Markov Decision Processes (RSMDPs) extend MDPs by allowing an arbitrary risk attitude. However, not every environment yields a well-defined problem in MDPs and RSMDPs, and both frameworks use a discount on costs to make every problem well-defined; because of the exponential growth in RSMDPs, obtaining a well-defined problem there is even harder. Here, we show that the use of a discount on costs: (i) in MDPs, induces a risk-prone attitude, and (ii) in RSMDPs, hinders a risk-averse attitude in some scenarios.

I. INTRODUCTION

Markov Decision Processes (MDPs) have long been a successful framework for planning in stochastic environments [1]. MDPs are used in planning problems and in learning problems, underlying the Reinforcement Learning framework [2]. Solutions to MDPs are optimal policies that minimize a cost function based on an immediate cost return. However, in contrast with human decision-makers, who show risk aversion in their preferences [3], MDPs do not take risk attitudes into account [4]. Over the last four decades, some extensions to MDPs have been proposed to take risk attitudes into account. Some of them consider Expected Utility Theory [5] to define risk attitude and propose appropriate methods to account for it [6], [7], [8], [4], whereas others consider the role of variance to define risk [9], [10]. Some authors also consider risk without relying on any general-purpose theory [8], [11]. A direct extension of MDPs is the Risk-Sensitive MDP (RSMDP) framework, where the objective of planning is defined through an exponential utility function with a parameter λ [4].
Just like linear utility functions, which present a risk-neutral attitude, the exponential utility function has some good mathematical properties; for example, if all outcomes are increased by an amount Δ, the certain equivalent is also increased by Δ [4], which guarantees that dynamic programming can be applied to RSMDPs. However, the exponential function presents a problem: it grows too fast. In the MDP framework two different scenarios are commonly defined: infinite horizon and indefinite horizon [1]. The infinite horizon considers the average immediate cost as the cost function, whereas the indefinite horizon considers the immediate-cost sum. Although algorithms for these two cost functions are not in general interchangeable, there is a cost function that can unify both scenarios: the discounted immediate-cost sum, which takes a discount factor γ ∈ [0, 1) as a parameter. Algorithms that find optimal policies under the discounted sum can also find good approximations in both scenarios, infinite and indefinite horizon; in fact, the approximation gets closer to optimality as γ → 1 [1]. In RSMDPs both scenarios can also be considered; however, whereas the infinite horizon is well-defined, the indefinite horizon does not converge for every scenario and risk factor λ. Moreover, while a discount factor can approximately unify both cost functions within MDPs, within RSMDPs the use of a discount factor makes optimal policies non-stationary. Finally, in an indefinite horizon scenario with constant immediate cost, i.e., the stochastic shortest path problem, the discounted version of the problem is equivalent to an exponential utility function, but with a risk-prone attitude [13]. In this paper we analyze what happens to the risk attitude in the stochastic shortest path problem when RSMDPs and MDPs are considered with a discount factor. We also analyze two other risk-sensitive alternatives [8], [11]. In section II we present the MDP and RSMDP frameworks, whereas in sections III and IV we present our results.
Finally, section V presents some final considerations.

II. MARKOV COST PROCESSES

We consider a general framework that underlies MDPs and RSMDPs, the Markov Cost Process (MCP) [4]. An MCP is defined by a tuple ⟨S, A, B0, T, c⟩, where: S is a set of states; A is a set of actions; B0 : S → [0, 1] is an initial state distribution, where B0(s) = Pr(s0 = s); T : S × A × S → [0, 1] is a probabilistic transition function, where T(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a); and c : S × A → R is a cost function. A process starts at a state s0 drawn according to B0 and at any time step t ≥ 0 evolves as follows: (i) the agent perceives the state s_t and chooses an action a_t; (ii) a cost c_t = c(s_t, a_t) is incurred;

Footnote: In [4], Markov Reward Processes are defined; here, we assume rewards are always negative and we define MCPs instead. © IEEE, DOI 10.1109/BRACIS
and (iii) the process transits to a next state s_{t+1} following probability P(s_{t+1} = s' | s_t = s, a_t = a) = T(s, a, s'). An agent guides the process by choosing a stationary policy ρ : S → A, which maps each state into an action at any time step, or a non-stationary policy π : S × N → A, which maps each state and time step t ≥ 0 into an action; N is the set of natural numbers. An important characteristic of MCPs is their horizon. Three horizon types can be defined:
1) The finite horizon defines a final time step N at which the process ends. To make the distinction clearer, we define a finite horizon MCP by the tuple ⟨S, A, B0, T, c, N⟩.
2) The infinite horizon considers that the process goes on for infinitely many steps and never ends. We define an infinite horizon MCP by the tuple ⟨S, A, B0, T, c, ∞⟩.
3) The indefinite horizon considers a goal state g which is absorbing and at which the process stops; equivalently, c(g, a) = 0, T(g, a, g) = 1, and T(g, a, s) = 0 for s ≠ g, for all a ∈ A. We define an indefinite horizon MCP by the tuple ⟨S, A, B0, T, c, g⟩.
Since the finite horizon poses no convergence problem, in this paper we consider only the infinite and indefinite horizons.

A. Expected Utility Theory

An objective for an MCP can be defined by a utility function U : X → R, where X is the set of outcomes. Given a probability distribution over the set of outcomes X, the value of a decision d is given by V(d) = Σ_{x∈X} P(x|d)U(x). Then, given a set of feasible decisions D, the optimal decision d* is the decision with the greatest value, i.e., d* = argmax_{d∈D} V(d). In an MCP, an outcome is described by the cost history h = (c0, c1, c2, ...), i.e., X ⊆ R^∞. However, it is common to summarize a history h in a unique scalar value, for example the cost sum; then X ⊆ R+, and in this case the utility function is strictly decreasing.
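The MCP tuple and its one-step dynamics can be sketched as plain data. This is a minimal sketch only; the states, actions, and probability values below are illustrative and not taken from the paper.

```python
import random

# A minimal sketch of an MCP <S, A, B0, T, c> as plain Python data.
S = ["s0", "g"]
A = ["a"]
B0 = {"s0": 1.0}                              # B0(s) = Pr(s0 = s)
T = {("s0", "a"): {"g": 0.5, "s0": 0.5},      # T(s, a, s')
     ("g", "a"): {"g": 1.0}}                  # g is absorbing: T(g, a, g) = 1
c = {("s0", "a"): 1.0, ("g", "a"): 0.0}       # c(g, a) = 0 for all a

def step(s, a, rng):
    """One step of the process: incur cost c(s, a), sample s' ~ T(s, a, .)."""
    dist = T[(s, a)]
    s_next = rng.choices(list(dist), weights=list(dist.values()))[0]
    return c[(s, a)], s_next

rng = random.Random(0)
cost, s1 = step("s0", "a", rng)   # cost is 1.0; s1 is either "s0" or "g"
```

A cost history h = (c0, c1, ...) is obtained by iterating `step` from a state sampled from B0.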
If X ⊆ R+ and U(·) is continuous and decreasing, then a certain equivalent C_d for a decision d can be defined as C_d = U^{-1}(V(d)), and the expected outcome C̄_d can be defined as C̄_d = Σ_{C∈X} P(C|d)C. Then, attitude to risk can be defined.

Definition 1: (Risk Attitude) The decision maker is risk neutral if and only if C_d = C̄_d for all d ∈ D, or, equivalently, if and only if the utility function is linear. The decision maker is risk prone if and only if C_d ≤ C̄_d for all d ∈ D, with strict inequality for at least one decision, or, equivalently, if and only if the utility function is convex. The decision maker is risk averse if and only if C_d ≥ C̄_d for all d ∈ D, with strict inequality for at least one decision, or, equivalently, if and only if the utility function is concave.

We consider two ways of summarizing histories up to some time N: the cost sum and the discounted cost sum. The first one is simply C^N = Σ_{t=0}^{N} c_t, whereas the second one is C^N_γ = Σ_{t=0}^{N} γ^t c_t. Finally, in an MCP the set of decisions is the set of policies. In the next sections we use these two summarizations to define the MDP's and RSMDP's utility functions.

B. Markov Decision Process

A Markov Decision Process (MDP) is an MCP associated with a utility function. MDPs consider a linear utility function, and are therefore risk neutral. Solving an MDP consists in solving a non-linear system of equations over a function v : S → R, from which an optimal policy can be defined. We consider three scenarios: infinite horizon, indefinite horizon, and the discount scenario.

1) Infinite Horizon: In the infinite horizon scenario the utility function is

U(h) = lim_{N→∞} C^N / N = lim_{N→∞} (1/N) Σ_{t=0}^{N} c_t.

To define a utility function in the infinite horizon a limit condition must be considered. The condition on infiniteness is necessary; if the process ends or costs cease, the utility goes to zero and decisions cannot be differentiated among themselves.
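The certain equivalent and risk-attitude definitions above can be sketched numerically. The block below uses exp(λC) as a disutility over total cost (equivalently, the concave utility −exp(λC) for λ > 0, hence risk aversion); the lottery is illustrative and not from the paper.

```python
import math

# Sketch of Definition 1: certain equivalent C_d = U^{-1}(V(d)) under the
# exponential disutility exp(lam * C) over total cost C (lam > 0: risk averse).
def certain_equivalent(costs, probs, lam):
    v = sum(p * math.exp(lam * C) for C, p in zip(costs, probs))  # V(d)
    return math.log(v) / lam                                      # U^{-1}(V(d))

costs, probs = [1.0, 3.0], [0.5, 0.5]
c_bar = sum(p * C for C, p in zip(costs, probs))   # expected outcome: 2.0
c_d = certain_equivalent(costs, probs, lam=1.0)
# c_d > c_bar: the lottery is charged a premium over its expected cost,
# i.e. risk aversion; as lam -> 0 the certain equivalent approaches c_bar.
```

The same function with λ near 0 recovers the risk-neutral case C_d = C̄_d.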
The process, under at least the optimal policy, must be ergodic, i.e., all states can be reached from every other state, and averaging over time or over the state space is equivalent, i.e., an average performance μ can be calculated for the optimal policy [4]. If that is not the case, then μ is not unique for every state s ∈ S. Finally, optimal policies are stationary.

2) Indefinite Horizon: In the indefinite horizon scenario the utility function is

U(h) = lim_{N→∞} C^N = lim_{N→∞} Σ_{t=0}^{N} c_t.

Again, a limit condition must be considered, but here the process is supposed to end; if it does not, the expected utility diverges. A condition for a solution to this scenario to exist is that there exists a proper policy, i.e., a policy that reaches the absorbing state g with probability 1 [4].

3) Discount Factor: Both versions previously presented have two major drawbacks: (i) in the infinite horizon the process under the optimal policy must be ergodic, and (ii) in the indefinite horizon there must exist a proper policy. These drawbacks restrict the use of both types of horizon in general. The discounted-sum cost function allows the problem to be well-defined in both cases. In the discount scenario the utility function is given by

U(h) = lim_{N→∞} C^N_γ = lim_{N→∞} Σ_{t=0}^{N} γ^t c_t.
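The claim that the discounted criterion is well-defined regardless of ergodicity or properness can be illustrated with value iteration, which contracts for any γ ∈ [0, 1). The two-state MCP below is illustrative, not taken from the paper.

```python
# Sketch: value iteration for the discounted criterion. T[s][a][s2] is the
# transition probability, cost[s][a] the immediate cost; both are toy values.
def value_iteration(T, cost, gamma, iters=1000):
    n = len(T)
    v = [0.0] * n
    for _ in range(iters):
        v = [min(cost[s][a] + gamma * sum(T[s][a][s2] * v[s2] for s2 in range(n))
                 for a in range(len(T[s])))
             for s in range(n)]
    return v

# State 0: action 0 self-loops (unit cost), action 1 reaches the absorbing
# goal state 1 (unit cost). State 1 is absorbing with zero cost.
T = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
cost = [[1.0, 1.0], [0.0, 0.0]]
v = value_iteration(T, cost, gamma=0.9)   # v[0] -> 1.0, v[1] -> 0.0
```

Even if the self-loop action made every policy improper, the discounted values would remain bounded by max cost / (1 − γ).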
Although the discounted-sum cost function can work in both scenarios, infinite and indefinite horizon, it is not a panacea. As we show in section III, discount and risk-proneness walk side by side, and decision-makers may not desire such influence.

C. Risk Sensitive Markov Decision Process

Similar to an MDP, a Risk Sensitive Markov Decision Process (RSMDP) is an MCP associated with an objective function. However, RSMDPs consider an exponential utility function, and therefore allow risk-prone or risk-averse attitudes. The choice of a risk attitude is made through a factor λ: if λ > 0, the decision-maker is risk averse, and if λ < 0, the decision-maker is risk prone. By a limiting argument, it is possible to show that as λ → 0, the RSMDP becomes risk neutral. Again, we consider three scenarios: infinite horizon, indefinite horizon, and the discount scenario.

1) Infinite Horizon: In the infinite horizon scenario the utility function is

U(h) = sgn(λ) lim_{N→∞} exp(λ C^N / N).

Here, we have the same restriction regarding ergodicity as in an MDP.

2) Indefinite Horizon: In the indefinite horizon scenario the utility function is

U(h) = sgn(λ) lim_{N→∞} exp(λ C^N).

If we compare the treatment of both scenarios in MDPs and RSMDPs, they are very similar, so we can expect the same kind of problems. In fact, the existence of a proper policy is necessary, but not sufficient: the proper policy must also be λ-feasible. A policy ρ is λ-feasible if the probability of not being in an absorbing state vanishes faster than the exponential accumulated cost [15], i.e.,

lim_{t→∞} ((D^ρ)^λ T^ρ)^t = 0,   (1)

where T^ρ is a matrix over (S \ {g}) × (S \ {g}) with elements T^ρ_{i,j} = T(i, ρ(i), j), and D^ρ is a diagonal matrix with elements D^ρ_{i,i} = exp(c(i, ρ(i))). Note that the λ-feasibility condition in equation (1) depends on λ; a proper policy only guarantees lim_{t→∞} (T^ρ)^t = 0.
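The λ-feasibility condition can be made concrete in the simplest possible case, a proper policy with a single transient state. This is a sketch under that assumption: with self-loop probability p and unit cost, the matrix (D^ρ)^λ T^ρ reduces to the scalar exp(λ)·p, and its powers vanish iff that scalar is below 1.

```python
import math

# Sketch of condition (1) for one transient state: self-loop probability p,
# unit immediate cost. ((D^rho)^lambda T^rho)^t -> 0 iff exp(lam) * p < 1.
def is_lambda_feasible(p, lam, cost=1.0):
    return math.exp(lam * cost) * p < 1.0

p = 0.5
lam_threshold = math.log(1.0 / p)   # feasible exactly for lam < ln(1/p)
```

So even a proper policy (p < 1) stops being λ-feasible once λ exceeds ln(1/p): the admissible risk-aversion level is bounded by the problem itself.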
Note that, since costs are always positive, unless there is a policy whose trajectory length is bounded by some constant L, for any proper policy ρ there exists a λ0 > 0 such that ρ is not λ-feasible for any λ > λ0; i.e., the risk-averse attitude cannot be arbitrarily large.

3) Discount Factor: Again, the drawbacks of the infinite and indefinite scenarios can be addressed by replacing C^N with C^N_γ. In the discount scenario the utility function is

U(h) = sgn(λ) lim_{N→∞} exp(λ C^N_γ) = sgn(λ) lim_{N→∞} exp(λ Σ_{t=0}^{N} γ^t c(s_t, a_t)).

All the previously analyzed criteria admit stationary optimal policies. The tempting solution of a discounted version of RSMDPs introduces non-stationarity into optimal policies. Define the immediate utility function:

q(s, a) = exp(λ c(s, a)).   (2)

In the discount scenario, the immediate utility function is no longer stationary. Considering the discount factor γ, we have:

q_γ(s, a, t) = exp(γ^t λ c(s, a)).   (3)

Comparing equations (2) and (3), the latter presents a non-stationary risk factor γ^t λ, inducing a non-stationary policy. Another problem is that the solution of a discounted-cost RSMDP is given by the system of equations, for all s ∈ S and t ∈ N:

v(g, θ) = sign(λ), for every θ,
v(s, θ) = min_a Σ_{s'∈S} T(s, a, s') exp(θ c(s, a)) v(s', θγ),
π*(s, t) = argmin_a Σ_{s'∈S} T(s, a, s') exp(γ^t λ c(s, a)) v(s', γ^{t+1} λ),

and must be solved for θ ∈ {γ^t λ : t ∈ N}, which requires approximation techniques. There exist other alternatives for combining a risk formulation with discount in which optimal policies are stationary; in these cases, however, expected utility theory cannot be used to analyze them [8], [11]. In these alternatives the use of a discount factor also hinders the risk-averse attitude; in the next sections we prove this result.

III. DISCOUNT AND RISK-PRONE ATTITUDE IN UTILITY-BASED DECISION CRITERIA

In [13], it was proved that, in an indefinite horizon with discount and constant immediate cost, MDPs present a risk-prone attitude.
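The non-stationarity that discounting introduces into the risk-sensitive criterion can be seen directly in the effective risk factor γ^t λ of the immediate utility q_γ(s, a, t) = exp(γ^t λ c(s, a)); a small sketch with illustrative values of λ and γ:

```python
# Sketch: the effective risk factor gamma**t * lam decays toward 0 (risk
# neutrality) as t grows, which is why a stationary policy cannot be optimal
# under the discounted risk-sensitive criterion. Values are illustrative.
lam, gamma = 1.0, 0.9
factors = [gamma ** t * lam for t in range(5)]   # 1.0, 0.9, 0.81, 0.729, 0.6561
```

An agent that starts strongly risk averse is thus driven toward risk neutrality as time passes.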
Here, we extend that result to any immediate-cost function in MDPs, and show how to construct problems where a risk-sensitive MCP with discount does not present a risk-averse attitude. Remember that a necessary condition for a utility function to be risk averse is C_d ≥ C̄_d for all d ∈ D. To prove our results we follow two steps: (i) define an MCP problem; and (ii) show that the risk-averse condition cannot be assured for every problem. In fact, we create a single parametrized MCP problem and show that an adequate parametrization of it cannot assure the risk-averse condition.

Definition 2: (MCP General Problem) Consider the simple MCP in Figure 1, where immediate costs are constant (unitary) at every step. There are only four policies to choose from: ρ_A takes for sure a path that reaches the goal in 2 steps (cost 2); ρ_B takes a path that reaches the goal in 1 step (cost 1) with probability 1 − ε_N and in N + 1 steps (cost N + 1) with probability ε_N; ρ_C takes a path that reaches the goal in 1 step (cost 1) with probability 1 − ε_γ and never reaches the goal (infinite cost) with probability ε_γ; and
ρ_D takes a path that never reaches the goal. Set ε_N = 1/N, so that under the indefinite horizon MDP V(ρ_A) = V(ρ_B) and C̄_A = C̄_B = 2; and set ε_γ = 1 − γ, so that under the MDP with discount factor γ, V(ρ_A) = V(ρ_C).

Fig. 1. A simple MCP problem. The cost is constant and unitary, except in the absorbing state g. The start state is s0, where there are four options for actions: A, B, C, and D; every other state has only one option for acting. N, λ, and γ are free variables; transition parameters are set as ε_N = 1/N and ε_γ = 1 − γ.

Our analysis consists in comparing C_i and C̄_i. If C_i < C̄_i, then the agent is not risk averse. We use policy ρ_A only to contrast the choice of attitudes.

A. Risk-averse Attitude and Proper Policies

Note that only ρ_A and ρ_B are proper policies. Suppose N is big (N ≫ 2); the choice is between: (i) putting up with a sure cost of 2, or (ii) taking the risk of having a small cost of 1, but with a very small chance ε_N of incurring a very big cost. If risk attitude is not considered, an agent should strive for both policies equally, since both were set to have the same expected cost, i.e., C̄_A = C̄_B = 2. But policies ρ_A and ρ_B show completely different behaviors. A risk attitude is necessary in order to choose between such policies: a risk-averse attitude chooses policy ρ_A, and a risk-prone attitude chooses policy ρ_B.

Even if we do not consider risk attitudes, we may prefer proper policies over non-proper ones [16]. In the MDP with discount scenario, the values of policy ρ_A and policy ρ_C are the same, but the first policy is proper, whereas the second one is not. Besides, the best case of one is only half the best case of the other (in the best case, policy ρ_A incurs total cost 2 and policy ρ_B incurs total cost 1); in fact, it is easy to design a problem where the difference between best cases is as small as we want. Finally, in the MDP with discount scenario, if γ is set appropriately we have V(ρ_A) = V(ρ_D), but ρ_D never reaches the goal. It is easy to construct similar scenarios where policies ρ_C and ρ_D are equated to, or even preferred over, policy ρ_A in the RSMDP framework with discount. In the next sections we consider only policies ρ_A and ρ_B in our analysis.

B. Markov Decision Process with Discount

In the case of the MDP with discount, we prove the following theorem.

Theorem 1: In the indefinite horizon, an MDP with discount factor presents a risk-prone attitude.

Proof 1: Remember that C^N_γ = Σ_{t=0}^{N} γ^t c_t. If the immediate cost is constant and unitary, then:

C^N_γ = Σ_{t=0}^{N} γ^t = (1 − γ^{N+1}) / (1 − γ),

and, if γ < 1, then ln γ < 0 and the utility function is convex in the trajectory length, since

U(c_0, c_1, ..., c_N) = −(1 − γ^{N+1}) / (1 − γ) = K_1 + K_2 exp((N + 1) ln γ),

with constants K_1 = −1/(1 − γ) and K_2 = 1/(1 − γ). Now consider an arbitrary history of immediate costs h = (c_0, c_1, ..., c_N). In the MDP framework with discount, this sequence is equivalent to the sequence h' = (c'_0 = C^N_γ, c'_1 = 0, ..., c'_N = 0), i.e., U(h) = U(h'), and the decision maker is indifferent between the sequence h and the sequence h', where the value C^N_γ is paid at once and no discount is applied. But, if γ < 1, then C^N_γ < C^N and h' accumulates less cost than h, characterizing a risk-prone attitude.

If we apply the previous theorem to the simple problem in Figure 1, we have V(ρ_A) < V(ρ_B) for any γ < 1 and N > 2, even though C̄_A = C̄_B.

C. Risk Sensitive Markov Decision Process with Discount

We use again the example in Figure 1. Following policy ρ_B, consider the utility of the lengthiest trajectory in the RSMDP framework with discount:

U(C^N_γ) = sign(λ) exp(λ (1 − γ^{N+1}) / (1 − γ)),   (4)

and

V(ρ_B) = (1 − ε_N) exp(λ) + ε_N exp(λ (1 − γ^{N+1}) / (1 − γ)) = (1 − 1/N) exp(λ) + (1/N) exp(λ (1 − γ^{N+1}) / (1 − γ)).

Now, we can state the following theorem.

Theorem 2: In the indefinite horizon, in an RSMDP with discount factor and constant immediate cost, there exist problems where the agent does not present a risk-averse attitude even if λ > 0.
Proof 2: We use the example in Figure 1 and show that there exists an N_0 such that for all N > N_0 the agent presents a risk-prone attitude. Take the value of policy ρ_B as N → ∞:

lim_{N→∞} V(ρ_B) = lim_{N→∞} [(1 − 1/N) exp(λ) + (1/N) exp(λ (1 − γ^{N+1}) / (1 − γ))] = exp(λ).

Then, for every ɛ > 0 there exists N_0 such that V(ρ_B) ≤ exp(λ) + ɛ for all N > N_0. Since exp(λ) < exp(λ(1 + γ)) = V(ρ_A) for any λ > 0 and γ > 0, choosing ɛ < V(ρ_A) − exp(λ) yields V(ρ_B) < V(ρ_A), or equivalently, C_B < C̄_B = C̄_A = C_A, i.e., the decision is risk prone.

Here, differently from MDPs, it is not the case that the agent is risk prone in every problem. This is because the agent changes her attitude according to the amount of accumulated cost. Figure 2 shows the disutility function (the negative of the utility function) for different values of γ and λ, as given by equation (4); note the difference in scales between the graphs and the change from convex (risk averse) to concave (risk prone) in each configuration. In our simple example, considering γ = 0.99, if λ = 1 then N_0 > 10^4, whereas if λ = 0.1 then N_0 > 10^5. However, if γ = 0.9 and λ = 0.1, then N_0 = 5 works.

Fig. 2. Disutility of the constant cost c_t = 1 in the RSMDP framework with discount, for different combinations of λ and γ.

IV. NON-UTILITY DECISION CRITERIA WITH RISK AND DISCOUNT

In the previous section we analyzed two alternatives based on utility functions. In this section we consider two alternatives where risk is taken into account together with discount. Instead of considering a utility function and deriving a system of equations, the decision criterion starts directly from a fixed-point equation, from which a general analytical utility function cannot be derived.

A. Discount over Certain Equivalent

The first alternative considers a system of equations based on the exponential function and is therefore similar to the RSMDP discussed before. However, discounts are applied over the certain equivalent, instead of over the cost itself [8].
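Theorems 1 and 2 can be checked numerically on the example of Figure 1. The sketch below uses reconstructed parameters (ρ_A: sure cost 2 over two steps; ρ_B: cost 1 with probability 1 − 1/N, cost N + 1 with probability 1/N); these are assumptions based on the text, not the authors' exact figures.

```python
import math

# Discounted MDP values of the two proper policies (Theorem 1).
def mdp_value_A(gamma):
    return 1.0 + gamma                      # unit cost at t = 0 and t = 1

def mdp_value_B(gamma, N):
    long_path = (1.0 - gamma ** (N + 1)) / (1.0 - gamma)
    return (1.0 - 1.0 / N) * 1.0 + (1.0 / N) * long_path

# Discounted RSMDP values for lam > 0, where lower is better (Theorem 2).
def rs_value_A(lam, gamma):
    return math.exp(lam * (1.0 + gamma))

def rs_value_B(lam, gamma, N):
    long_path = (1.0 - gamma ** (N + 1)) / (1.0 - gamma)
    return (1.0 - 1.0 / N) * math.exp(lam) + (1.0 / N) * math.exp(lam * long_path)

def first_risk_prone_N(lam, gamma, n_max=10**6):
    """Smallest N for which the risky policy rho_B gets the lower value."""
    for N in range(2, n_max):
        if rs_value_B(lam, gamma, N) < rs_value_A(lam, gamma):
            return N
    return None

# Theorem 1: undiscounted expected costs are tied at 2, yet discounting
# strictly favors the risky policy: mdp_value_B(0.9, 10) < mdp_value_A(0.9).
# Theorem 2: even with risk-averse lam > 0, large enough N makes rho_B
# preferred; the threshold grows with lam.
N0_small = first_risk_prone_N(lam=0.1, gamma=0.9)
N0_large = first_risk_prone_N(lam=1.0, gamma=0.9)
```

Under these reconstructed parameters the thresholds do not necessarily match the paper's reported N_0 values, but the qualitative effect (a finite N_0 exists for every λ > 0, and it grows with λ) is the same.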
Then, for λ > 0, the system of equations to be solved is:

exp(λ v(s)) = min_a Σ_{s'∈S} T(s, a, s') exp(λ c(s, a)) exp(γ λ v(s')),

and an optimal stationary policy is obtained from:

ρ*(s) = argmin_a Σ_{s'∈S} T(s, a, s') exp(λ c(s, a)) exp(γ λ v(s')).

Here, v(s) denotes a value closer in semantics to the certain equivalent of state s than to the expected utility. We again use the simple example in Figure 1, comparing policies ρ_A and ρ_B. Note that in this example the only risk to be resolved is at the start state, after which all we have is determinism and no recurrence. In this case, the same analysis done for the RSMDP with discount applies here.

Theorem 3: In the indefinite horizon with discount over the certain equivalent, there exist problems where the agent does not present a risk-averse attitude even if λ > 0.

Proof 3: The proof is the same as in Theorem 2.

B. Discount with Piecewise-Linear Risk Attitude

The second alternative also considers a system of equations, but this time based on a piecewise-linear function [11]. Consider the function U_λ : R × (−1, 1) → R defined by:

U_λ(x) = U(x, λ) = (1 − λ)x if x > 0, and (1 + λ)x otherwise.

Then, the following system of equations must be solved to obtain the value of a stationary policy ρ:

0 = Σ_{s'∈S} T(s, ρ(s), s') U_λ[−c(s, ρ(s)) + γ v^ρ(s') − v^ρ(s)],

and the optimal policy is the one that maximizes the value function for all states, i.e., ρ* is optimal if and only if v^{ρ*}(s) ≥ v^ρ(s) for all s ∈ S and all stationary policies ρ. Here, λ is a risk factor, limited to (−1, 1), with the same semantics as before: if λ > 0 we have a risk-averse attitude, and if λ < 0 we have a risk-prone attitude. Using again our simple problem and considering the lack of risk after the first step, the value of state s0 under policy ρ_A is:

v^{ρ_A}(s0) = −(1 + γ),
whereas the value of policy ρ_B can be found by solving:

0 = (1 − ε_N) U_λ(−1 − v(s0)) + ε_N U_λ(−(1 − γ^{N+1})/(1 − γ) − v(s0))
  = (1 − 1/N)(1 − λ)(−1 − v(s0)) + (1/N)(1 + λ)(−(1 − γ^{N+1})/(1 − γ) − v(s0)).   (5)

Theorem 4: In the indefinite horizon with discount and piecewise-linear risk attitude, there exist problems where the agent does not present a risk-averse attitude even if λ > 0.

Proof 4: Consider equation (5) as N → ∞; then we have:

0 = (1 − λ)(−1 − v(s0)) ⟹ v^{ρ_B}(s0) = −1 > −(1 + γ) = v^{ρ_A}(s0).

By the same argument used in Theorem 2, there exists N_0 such that for any N > N_0 the agent presents a risk-prone attitude.

V. CONCLUSION

We have shown that using a discount factor to take decisions in sequential decision problems in stochastic environments may produce undesired effects: the decision-maker behaves with a risk-prone attitude. Even in frameworks where a risk-averse attitude can be set arbitrarily, the use of discount may end in risk-prone attitudes. Although ignored by much research in the literature, we showed that problems can be designed such that undesirable decisions may be taken if risk attitude is not considered. Decisions may be indifferent between proper policies and non-proper ones and, in fact, the risk-neutral attitude has been used without taking this result into consideration. In the case of RSMDPs with discount, if parameters are set appropriately, then the conversion to risk-proneness can be avoided. However, choosing agents with large risk-averse parameters may also produce undesirable effects, such as paying too much attention to worst-case scenarios. Balancing both effects is a challenge that remains to be tackled.

REFERENCES

[1] Mausam and A. Kolobov, Planning with Markov Decision Processes: An AI Perspective, ser. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool Publishers, 2012.
[2] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge Univ Press, 1998.
[3] R. L. Keeney and H. Raiffa, Decisions with Multiple Objectives: Preferences and Value Tradeoffs. New York: Wiley, 1976.
[4] R. A. Howard and J. E. Matheson, "Risk-sensitive Markov decision processes," Management Science, vol. 18, no. 7, 1972.
[5] J. von Neumann and O. Morgenstern, The Theory of Games and Economic Behaviour, 2nd ed. Princeton: Princeton University Press, 1947.
[6] Y. Liu and S. Koenig, "Probabilistic planning with nonlinear utility functions," in ICAPS, 2006.
[7] Y. Liu and S. Koenig, "An exact algorithm for solving MDPs under risk-sensitive planning objectives with one-switch utility functions," in AAMAS, 2008.
[8] K.-J. Chung and M. J. Sobel, "Discounted MDP's: distribution functions and exponential utility maximization," SIAM J. Control Optim., vol. 25, pp. 49–62, January 1987.
[9] S. Mannor, D. Simester, P. Sun, and J. N. Tsitsiklis, "Bias and variance in value function estimation," in ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning. New York, NY, USA: ACM, 2004.
[10] S. Mannor and J. N. Tsitsiklis, "Mean-variance optimization in Markov decision processes," in ICML, 2011.
[11] O. Mihatsch and R. Neuneier, "Risk-sensitive reinforcement learning," Mach. Learn., vol. 49, November 2002.
[12] X.-R. Cao, "A sensitivity view of Markov decision processes and reinforcement learning," in Modeling, Control and Optimization of Complex Systems.
[13] R. Minami and V. F. da Silva, "Shortest stochastic path with risk sensitive evaluation," in 11th Mexican International Conference on Artificial Intelligence, MICAI 2012, ser. Lecture Notes in Artificial Intelligence, I. Batyrshin and M. G. Mendoza, Eds. San Luis Potosí, Mexico: Springer, 2012.
[14] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed. New York, NY: John Wiley and Sons, 1994.
[15] S. D. Patek, "On terminating Markov decision processes with a risk averse objective function," Automatica, vol. 37, 2001.
[16] S. d. L. Pereira, L. N. Barros, and F. G. Cozman, "Strong probabilistic planning," in MICAI '08: Proceedings of the 7th Mexican International Conference on Artificial Intelligence. Berlin, Heidelberg: Springer-Verlag, 2008.
Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:
More informationQ-Learning for Markov Decision Processes*
McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of
More informationReinforcement Learning
Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha
More informationOptimal Convergence in Multi-Agent MDPs
Optimal Convergence in Multi-Agent MDPs Peter Vrancx 1, Katja Verbeeck 2, and Ann Nowé 1 1 {pvrancx, ann.nowe}@vub.ac.be, Computational Modeling Lab, Vrije Universiteit Brussel 2 k.verbeeck@micc.unimaas.nl,
More informationInternet Monetization
Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition
More informationHomework #6 (10/18/2017)
Homework #6 (0/8/207). Let G be the set of compound gambles over a finite set of deterministic payoffs {a, a 2,...a n } R +. A decision maker s preference relation over compound gambles can be represented
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and
More informationReinforcement Learning. Introduction
Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control
More informationBasics of reinforcement learning
Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system
More informationToday s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes
Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks
More informationBias-Variance Error Bounds for Temporal Difference Updates
Bias-Variance Bounds for Temporal Difference Updates Michael Kearns AT&T Labs mkearns@research.att.com Satinder Singh AT&T Labs baveja@research.att.com Abstract We give the first rigorous upper bounds
More informationAn Empirical Algorithm for Relative Value Iteration for Average-cost MDPs
2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter
More informationLecture 3: Markov Decision Processes
Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov
More informationReinforcement Learning: An Introduction
Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is
More informationProcedia Computer Science 00 (2011) 000 6
Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More informationCourse 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016
Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the
More informationPreference Elicitation for Sequential Decision Problems
Preference Elicitation for Sequential Decision Problems Kevin Regan University of Toronto Introduction 2 Motivation Focus: Computational approaches to sequential decision making under uncertainty These
More informationMarkov Decision Processes
Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour
More information1 Problem Formulation
Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning
More informationPrioritized Sweeping Converges to the Optimal Value Function
Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science
More informationUsing Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *
Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
More informationMS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction
MS&E338 Reinforcement Learning Lecture 1 - April 2, 2018 Introduction Lecturer: Ben Van Roy Scribe: Gabriel Maher 1 Reinforcement Learning Introduction In reinforcement learning (RL) we consider an agent
More informationExponential Moving Average Based Multiagent Reinforcement Learning Algorithms
Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca
More informationSome notes on Markov Decision Theory
Some notes on Markov Decision Theory Nikolaos Laoutaris laoutaris@di.uoa.gr January, 2004 1 Markov Decision Theory[1, 2, 3, 4] provides a methodology for the analysis of probabilistic sequential decision
More informationCMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro
CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING
More informationTemporal difference learning
Temporal difference learning AI & Agents for IET Lecturer: S Luz http://www.scss.tcd.ie/~luzs/t/cs7032/ February 4, 2014 Recall background & assumptions Environment is a finite MDP (i.e. A and S are finite).
More informationFinal Exam December 12, 2017
Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes
More informationState Space Abstractions for Reinforcement Learning
State Space Abstractions for Reinforcement Learning Rowan McAllister and Thang Bui MLG RCC 6 November 24 / 24 Outline Introduction Markov Decision Process Reinforcement Learning State Abstraction 2 Abstraction
More informationOpen Theoretical Questions in Reinforcement Learning
Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem
More information16.4 Multiattribute Utility Functions
285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate
More information(Deep) Reinforcement Learning
Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 17 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationSequential Decisions
Sequential Decisions A Basic Theorem of (Bayesian) Expected Utility Theory: If you can postpone a terminal decision in order to observe, cost free, an experiment whose outcome might change your terminal
More informationCSE250A Fall 12: Discussion Week 9
CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.
More informationControl Theory : Course Summary
Control Theory : Course Summary Author: Joshua Volkmann Abstract There are a wide range of problems which involve making decisions over time in the face of uncertainty. Control theory draws from the fields
More informationThe Reinforcement Learning Problem
The Reinforcement Learning Problem Slides based on the book Reinforcement Learning by Sutton and Barto Formalizing Reinforcement Learning Formally, the agent and environment interact at each of a sequence
More informationChapter 3: The Reinforcement Learning Problem
Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationElements of Reinforcement Learning
Elements of Reinforcement Learning Policy: way learning algorithm behaves (mapping from state to action) Reward function: Mapping of state action pair to reward or cost Value function: long term reward,
More informationOptimal Control of Partiality Observable Markov. Processes over a Finite Horizon
Optimal Control of Partiality Observable Markov Processes over a Finite Horizon Report by Jalal Arabneydi 04/11/2012 Taken from Control of Partiality Observable Markov Processes over a finite Horizon by
More informationLoss Bounds for Uncertain Transition Probabilities in Markov Decision Processes
Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes Andrew Mastin and Patrick Jaillet Abstract We analyze losses resulting from uncertain transition probabilities in Markov
More informationReinforcement Learning. Spring 2018 Defining MDPs, Planning
Reinforcement Learning Spring 2018 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state
More informationReinforcement Learning. Yishay Mansour Tel-Aviv University
Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak
More informationChapter 16 Planning Based on Markov Decision Processes
Lecture slides for Automated Planning: Theory and Practice Chapter 16 Planning Based on Markov Decision Processes Dana S. Nau University of Maryland 12:48 PM February 29, 2012 1 Motivation c a b Until
More information3 Intertemporal Risk Aversion
3 Intertemporal Risk Aversion 3. Axiomatic Characterization This section characterizes the invariant quantity found in proposition 2 axiomatically. The axiomatic characterization below is for a decision
More informationIntertemporal Risk Aversion, Stationarity, and Discounting
Traeger, CES ifo 10 p. 1 Intertemporal Risk Aversion, Stationarity, and Discounting Christian Traeger Department of Agricultural & Resource Economics, UC Berkeley Introduce a more general preference representation
More informationChapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS
Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS 63 2.1 Introduction In this chapter we describe the analytical tools used in this thesis. They are Markov Decision Processes(MDP), Markov Renewal process
More informationMarkov Decision Processes Infinite Horizon Problems
Markov Decision Processes Infinite Horizon Problems Alan Fern * * Based in part on slides by Craig Boutilier and Daniel Weld 1 What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T)
More informationReinforcement Learning as Classification Leveraging Modern Classifiers
Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions
More informationAd Network Optimization: Evaluating Linear Relaxations
Ad Network Optimization: Evaluating Linear Relaxations Flávio Sales Truzzi, Valdinei Freire da Silva, Anna Helena Reali Costa and Fabio Gagliardi Cozman Escola Politécnica - Universidade de São Paulo (USP)
More informationReview: TD-Learning. TD (SARSA) Learning for Q-values. Bellman Equations for Q-values. P (s, a, s )[R(s, a, s )+ Q (s, (s ))]
Review: TD-Learning function TD-Learning(mdp) returns a policy Class #: Reinforcement Learning, II 8s S, U(s) =0 set start-state s s 0 choose action a, using -greedy policy based on U(s) U(s) U(s)+ [r
More informationSome AI Planning Problems
Course Logistics CS533: Intelligent Agents and Decision Making M, W, F: 1:00 1:50 Instructor: Alan Fern (KEC2071) Office hours: by appointment (see me after class or send email) Emailing me: include CS533
More informationFinal Exam December 12, 2017
Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes
More informationFood delivered. Food obtained S 3
Press lever Enter magazine * S 0 Initial state S 1 Food delivered * S 2 No reward S 2 No reward S 3 Food obtained Supplementary Figure 1 Value propagation in tree search, after 50 steps of learning the
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationIntroduction to Reinforcement Learning
CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.
More informationarxiv: v1 [cs.lg] 23 Oct 2017
Accelerated Reinforcement Learning K. Lakshmanan Department of Computer Science and Engineering Indian Institute of Technology (BHU), Varanasi, India Email: lakshmanank.cse@itbhu.ac.in arxiv:1710.08070v1
More informationSolving Multi-Objective MDP with Lexicographic Preference: An application to stochastic planning with multiple quantile objective
arxiv:1705.03597v1 [cs.ai] 10 May 2017 Solving Multi-Objective MDP with Lexicographic Preference: An application to stochastic planning with multiple quantile objective Yan Li 1 and Zhaohan Sun 2 1 School
More informationCS 598 Statistical Reinforcement Learning. Nan Jiang
CS 598 Statistical Reinforcement Learning Nan Jiang Overview What s this course about? A grad-level seminar course on theory of RL 3 What s this course about? A grad-level seminar course on theory of RL
More informationMarkov Decision Processes With Delays and Asynchronous Cost Collection
568 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 48, NO 4, APRIL 2003 Markov Decision Processes With Delays and Asynchronous Cost Collection Konstantinos V Katsikopoulos, Member, IEEE, and Sascha E Engelbrecht
More informationValue Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes
Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes RAÚL MONTES-DE-OCA Departamento de Matemáticas Universidad Autónoma Metropolitana-Iztapalapa San Rafael
More informationLecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation
Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free
More informationChoice under Uncertainty
In the Name of God Sharif University of Technology Graduate School of Management and Economics Microeconomics 2 44706 (1394-95 2 nd term) Group 2 Dr. S. Farshad Fatemi Chapter 6: Choice under Uncertainty
More informationReinforcement Learning with Function Approximation. Joseph Christian G. Noel
Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is
More information1 Introduction 2. 4 Q-Learning The Q-value The Temporal Difference The whole Q-Learning process... 5
Table of contents 1 Introduction 2 2 Markov Decision Processes 2 3 Future Cumulative Reward 3 4 Q-Learning 4 4.1 The Q-value.............................................. 4 4.2 The Temporal Difference.......................................
More informationReinforcement Learning
1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision
More informationarxiv: v1 [cs.ai] 1 Jul 2015
arxiv:507.00353v [cs.ai] Jul 205 Harm van Seijen harm.vanseijen@ualberta.ca A. Rupam Mahmood ashique@ualberta.ca Patrick M. Pilarski patrick.pilarski@ualberta.ca Richard S. Sutton sutton@cs.ualberta.ca
More informationA Gentle Introduction to Reinforcement Learning
A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,
More informationConsumer theory Topics in consumer theory. Microeconomics. Joana Pais. Fall Joana Pais
Microeconomics Fall 2016 Indirect utility and expenditure Properties of consumer demand The indirect utility function The relationship among prices, incomes, and the maximised value of utility can be summarised
More informationBasis Construction from Power Series Expansions of Value Functions
Basis Construction from Power Series Expansions of Value Functions Sridhar Mahadevan Department of Computer Science University of Massachusetts Amherst, MA 3 mahadeva@cs.umass.edu Bo Liu Department of
More informationCoarticulation in Markov Decision Processes
Coarticulation in Markov Decision Processes Khashayar Rohanimanesh Department of Computer Science University of Massachusetts Amherst, MA 01003 khash@cs.umass.edu Sridhar Mahadevan Department of Computer
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More information