Some notes on Markov Decision Theory
Nikolaos Laoutaris, January
Markov Decision Theory [1, 2, 3, 4] provides a methodology for the analysis of probabilistic sequential decision processes over an infinite or finite planning horizon.

- Queueing Theory + Markov Processes: model a probabilistic system in order to evaluate its performance.
- Markov Decision Theory: goes one step further; design the operation of a probabilistic system so as to optimize its performance.
Probabilistic: there exists an environment which cannot be described in full detail, so it is taken to be random (stochastic), following some probability law that best describes its nature. It is in this environment that our agent operates.

Sequential: the theory aims at providing tools for optimizing behaviors, i.e., sequences of decisions, not single decisions.

Planning horizon: when deciding our behavior we must take into consideration the length of our intended activity. A poker player behaves differently when opening a new game than when finishing one (having secured a profit or suffered a loss). Behaviors depend on whether we will participate in an activity for a finite amount of time and then abandon it, or have decided to be involved in it permanently.

Queueing theory vs. decision theory: in a queueing model (say an M/M/1 queue) everything is fixed. The (stochastic) arrival process and the (stochastic) service process are parts of the environment in which we study some performance metric of interest (e.g., the expected queueing delay). An application of decision theory to queues (an M/X/1 queue, where X is an unknown service policy that we want to design and optimize) would consider the arrival process as the only element of the environment. Decision theory provides the tools to design and optimize the service process so as to achieve a desired goal (e.g., avoid underflows or overflows).
A Markov Decision Process (MDP) is a discrete Markov process {I_n}, n ≥ 0, characterized by a tuple (S, A, P, C):

- S is the set of possible states. {I_n} is in state i at time n iff I_n = i, 0 ≤ i ≤ M.
- A is the set of possible actions (or decisions). Following an observation, an action k is taken from the finite action space A, k = 0, 1, ..., K.
- P : S × A × S → [0, 1] is the state transition function, specifying the probability P{j | i, k} = p_ij(k) of observing a transition to state j ∈ S after taking action k ∈ A in state i ∈ S.
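To make the tuple concrete, here is a minimal sketch (Python, not part of the original notes) of how S, A, P can be held as arrays; the two-state, two-action MDP and all its numbers are invented for illustration:

```python
import numpy as np

# Invented MDP with M + 1 = 2 states and K + 1 = 2 actions.
# P[k][i][j] = p_ij(k): probability of moving to state j when
# action k is taken in state i; each row must sum to 1.
P = np.array([
    [[0.9, 0.1],    # transitions under action k = 0
     [0.4, 0.6]],
    [[0.2, 0.8],    # transitions under action k = 1
     [0.5, 0.5]],
])

# C[i][k] = c_i(k): cost of taking action k in state i (invented numbers;
# the cost component C of the tuple is defined formally below).
C = np.array([[1.0, 3.0],
              [2.0, 0.5]])

# Sanity check: every transition row is a probability distribution.
assert np.allclose(P.sum(axis=2), 1.0)
print(P.shape, C.shape)
```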
Observation instances

The process can be either continuous or discrete time. We focus on discrete-time MDPs. Discrete-time MDPs come in two flavors:

- Discrete MDPs that are time-less (we do not model the time between observation instances, e.g., n → n + 1, or take it to be constant). These are in a way generalizations of Markov chains (discrete MDPs of this kind are sometimes called controllable Markov chains).
- Discrete MDPs that allow for time to pass between observation instances. These are much like generalized versions of semi-Markov processes. In such cases the Markov property holds only at observation instances (and not at arbitrary instances), and we may exploit them to optimize a decision system that acts upon observation instances.

To have a continuous-time MDP would require the Markov property to hold at all time instances. This is rather restrictive, as most processes of interest possess the Markov property at selected times of interest and not in general.

The action space can be either homogeneous or non-homogeneous:

- Homogeneous: there is a common action space, A, from which we choose actions according to the current state.
- Non-homogeneous: each state i is associated with a potentially different set of actions, A_i, among which a decision must be made.
- C : S × A → R is a function specifying the cost c_i(k) of taking action k ∈ A in state i ∈ S; c_i(k) must depend only on the current state-action pair.

A policy R = (d_0, d_1, ..., d_M) prescribes an action for each possible state: d_i(R) = k means that under policy R, action k is taken when the process is in state i.

π(R) = (π_0, π_1, ..., π_M) is the limiting distribution of {I_n} under policy R, satisfying π(R) = π(R) P(R).

The objective is to find the optimal policy R_opt that minimizes some cost criterion which considers both immediate and subsequent costs from the future evolution of {I_n}.
Costs are used to drive the agent towards the desired behavior (defined by an objective function). Costs go with minimization objectives; alternatively, we may use rewards in conjunction with maximization objectives.

Be careful when defining costs: costs must depend only on the current state-action pair. This is a source of errors; be particularly careful because in many cases the cost may also depend on the next state. In such cases, average over all possible transitions to obtain a legitimate MDP cost (one that depends only on the current state).
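That averaging step can be sketched as follows (Python, invented numbers): suppose a raw cost c(i, k, j) depends on the landing state j; the legitimate MDP cost is its expectation over the transitions, c_i(k) = Σ_j p_ij(k) c(i, k, j).

```python
import numpy as np

# p_ij(k) for an invented 2-state, 2-action example;
# P[k][i][j] is the probability of landing in j after action k in state i.
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])

# Raw cost that (illegitimately, for an MDP) depends on the next state j:
# C_raw[k][i][j] = c(i, k, j).
C_raw = np.array([[[0.0, 10.0], [0.0, 10.0]],
                  [[1.0,  1.0], [1.0,  1.0]]])

# Average over the next state: c_i(k) = sum_j p_ij(k) * c(i, k, j).
C = np.einsum('kij,kij->ik', P, C_raw)   # C[i][k] = c_i(k)

print(C)   # e.g. c_0(0) = 0.9*0 + 0.1*10 = 1.0
```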
Types of policies:

- Stationary policies always follow the same rule for choosing an action in each possible state, independently of the current time n.
- Non-stationary policies behave differently as time evolves.
- Deterministic policies always take the same action when in the same state: d_i(R) = k with probability 1.
- Randomized policies map to each state a probability distribution over the possible actions: d_i(R) = k with probability ρ_i(k), where Σ_{k∈A} ρ_i(k) = 1.
Policy space: {Stationary, Non-stationary} × {Randomized, Deterministic}

- Deterministic policy: 1 state → 1 action
- Randomized policy: 1 state → 1 probability distribution over the action space
Stationarity and randomization are concepts that belong to different levels of characterizing a policy. They do not compare directly (e.g., stationary is NOT the opposite of randomized)!

- Stationarity is all about whether the rule for choosing decisions is affected by time.
- Randomization refers to the way decisions are made for particular states (under a stationary or non-stationary policy).

Stationary policies arise when optimizing over an infinite horizon. This makes sense intuitively: it is boundary (time) conditions that might prompt a change of behavior (given a fixed environment). A poker player who is already losing money might bet more aggressively towards the end of the game in a final attempt to recover. Similarly, a winning (rational!) player might avoid excessive risks towards the end of the game (to protect his winnings).

Non-stationary policies arise from finite planning horizons. Behavioral changes due to approaching time boundaries appear also in the domain of Game Theory (a game involves at least two interacting agents (players), whereas the decision theory discussed here involves only one agent, whose aim is to adapt to his environment rather than compete with another rational entity). An interesting discussion of such behavioral issues appears in the context of the Iterated Prisoner's Dilemma and other games.¹

¹ William Poundstone, Prisoner's Dilemma: John Von Neumann, Game Theory and the Puzzle of the Bomb, Anchor Books (highly recommended!)
Cost criteria and planning horizon

Finite-horizon undiscounted-cost problems require the minimization of the total expected accumulated cost over a finite planning horizon of W observations (transitions):

E_i{c}(R) = E{ Σ_{n=1}^{W} c_{I_n}(d_{I_n}(R)) | I_0 = i }

Infinite-horizon problems consider an infinite planning horizon. They are appropriate for systems that are expected to operate continuously, or for systems that have an unknown stopping time.
To understand E_i{c}(R), the expected cost when starting from state i and operating for W time units under policy R, remember the following definitions:

- I_0 is the initial state of the process (at n = 0)
- I_n is the state of the process at time n
- d_{I_n}(R) is the decision taken in state I_n under policy R
- c_{I_n}(d_{I_n}(R)) is the cost incurred by taking decision d_{I_n}(R) in state I_n
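A quick way to check such a finite-horizon cost numerically is simulation. The sketch below (Python; the MDP numbers and the policy are invented) estimates E_i{c}(R) by averaging many sample paths of W transitions under a fixed deterministic policy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 2-state, 2-action MDP: P[k][i][j] = p_ij(k), C[i][k] = c_i(k).
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])
C = np.array([[1.0, 3.0],
              [2.0, 0.5]])
policy = np.array([0, 1])   # d_i(R): action taken in state i

def sample_cost(i0, W):
    """Accumulated cost of one W-step sample path from I_0 = i0."""
    i, total = i0, 0.0
    for _ in range(W):
        # Transition first (to I_n), then incur c_{I_n}(d_{I_n}(R)),
        # matching the sum over n = 1, ..., W conditioned on I_0 = i0.
        i = rng.choice(2, p=P[policy[i], i])
        total += C[i, policy[i]]
    return total

W = 20
estimate = np.mean([sample_cost(0, W) for _ in range(5000)])
print(estimate)   # Monte Carlo estimate of E_0{c}(R)
```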
Discounted-cost problems (finite/infinite horizon)

Attach a discount factor α, 0 < α < 1, to each immediate cost c_i(k), thus affecting the relative importance of immediate costs over future costs:

E_i{c}(R) = E{ Σ_{n=1}^{∞} α^n c_{I_n}(d_{I_n}(R)) | I_0 = i }

When α → 1, future costs tend to count as much as immediate costs. Otherwise, future costs tend to be heavily discounted, so, for optimal performance, more attention must be given to the minimization of immediate costs.
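For a fixed policy R the infinite-horizon discounted cost satisfies a linear fixed-point equation and can be computed exactly by solving a linear system. A sketch (Python; MDP numbers invented, and using the common fixed-point convention v_i = c_i(d_i) + α Σ_j p_ij(d_i) v_j, which discounts from the first transition on):

```python
import numpy as np

alpha = 0.9   # discount factor, 0 < alpha < 1

# Invented 2-state, 2-action MDP: P[k][i][j] = p_ij(k), C[i][k] = c_i(k).
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])
C = np.array([[1.0, 3.0],
              [2.0, 0.5]])
policy = np.array([0, 1])

# Transition matrix and cost vector induced by the fixed policy.
P_R = np.array([P[policy[i], i] for i in range(2)])
c_R = np.array([C[i, policy[i]] for i in range(2)])

# Solve v = c_R + alpha * P_R v  <=>  (I - alpha * P_R) v = c_R.
v = np.linalg.solve(np.eye(2) - alpha * P_R, c_R)
print(v)   # discounted cost from each starting state
```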
Average cost optimality (goes with infinite horizon)

Requires the minimization of the expected average cost per unit of time:

E_i{c}(R) = E{ lim_{n→∞} (1/n) Σ_{h=1}^{n} c_{I_h}(d_{I_h}(R)) | I_0 = i }    (1)

As n → ∞, P{I_n = j | I_0 = i}(R) → π_j(R), independently of the initial state I_0 = i, thus:

E{c}(R) = Σ_{j∈S} π_j(R) c_j(d_j(R))    (2)
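Equation (2) suggests a direct way to evaluate a fixed policy: find the stationary distribution π(R) of the induced chain and take the π-weighted average of the per-state costs. A sketch (Python, invented numbers):

```python
import numpy as np

# Invented 2-state, 2-action MDP: P[k][i][j] = p_ij(k), C[i][k] = c_i(k).
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])
C = np.array([[1.0, 3.0],
              [2.0, 0.5]])
policy = np.array([0, 1])

# Chain induced by the policy.
P_R = np.array([P[policy[i], i] for i in range(2)])
c_R = np.array([C[i, policy[i]] for i in range(2)])

# Stationary distribution: solve pi = pi P_R with sum(pi) = 1, replacing
# one (redundant) balance equation with the normalization constraint.
A = np.vstack([(P_R.T - np.eye(2))[:-1], np.ones(2)])
pi = np.linalg.solve(A, np.array([0.0, 1.0]))

avg_cost = pi @ c_R   # E{c}(R) = sum_j pi_j(R) c_j(d_j(R)), as in (2)
print(pi, avg_cost)
```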
Derivation of (1) → (2): the limiting probability π_j(R) can be written as follows:

π_j(R) = lim_{n→∞} (1/n) Σ_{h=1}^{n} P{I_h = j | I_0 = i}(R)    (3)
Thus, starting from (1):

E_i{c}(R) = E{ lim_{n→∞} (1/n) Σ_{h=1}^{n} c_{I_h}(d_{I_h}(R)) | I_0 = i }
          = lim_{n→∞} (1/n) Σ_{h=1}^{n} E{ c_{I_h}(d_{I_h}(R)) | I_0 = i }
          = lim_{n→∞} (1/n) Σ_{h=1}^{n} Σ_{j∈S} c_j(d_j(R)) P{I_h = j | I_0 = i}(R)
          = Σ_{j∈S} c_j(d_j(R)) lim_{n→∞} (1/n) Σ_{h=1}^{n} P{I_h = j | I_0 = i}(R)
          = Σ_{j∈S} c_j(d_j(R)) π_j(R)    (substituting from (3))
          = (2)
The optimal policy R_opt is the one that incurs the smallest cost: E{c}(R_opt) ≤ E{c}(R) over all policies R : S → A.

1. An optimal policy does not always exist under the average cost criterion.
2. If an optimal policy does exist, it is not guaranteed to be stationary.
3. If S is finite and every stationary policy gives rise to an irreducible Markov chain, then the stationarity of the optimal policy is guaranteed (and it is non-randomized). See S.M. Ross [2] for more details.
Finding the optimal policy

1. Exhaustive enumeration. Suitable only for tiny problems due to O(|A|^|S|) complexity.
2. Linear Programming (LP)
3. Policy improvement algorithm
4. Value iteration algorithm
Solving the MDP via Linear Programming

The optimal policy can be identified efficiently by transforming the MDP formulation into a linear program.

Denote D_ik = P{action = k | state = i}. The D_ik's completely define a policy.

Also denote y_ik = P{action = k and state = i}.
Clearly the two are related via:

y_ik = π_i D_ik    (4)

Also:

π_i = Σ_{k=0}^{K} y_ik    (5)

From (4), (5):

D_ik = y_ik / π_i = y_ik / Σ_{k'=0}^{K} y_ik'    (6)
There are several constraints on the y_ik's:

1. Σ_{i=0}^{M} π_i = 1, i.e., Σ_{i=0}^{M} Σ_{k=0}^{K} y_ik = 1
2. the balance equations π_j = Σ_{i=0}^{M} π_i p_ij translate to Σ_{k=0}^{K} y_jk = Σ_{i=0}^{M} Σ_{k=0}^{K} y_ik p_ij(k), for all j
3. y_ik ≥ 0, for all i, k

The steady-state average cost per unit time is:

E(C) = Σ_{i=0}^{M} Σ_{k=0}^{K} π_i c_ik D_ik = Σ_{i=0}^{M} Σ_{k=0}^{K} c_ik y_ik
The y_ik's are obtained from the following LP:

Minimize

z = E{c} = Σ_{i=0}^{M} Σ_{k=0}^{K} c_ik y_ik    (7)

Subject to

Σ_{k=0}^{K} y_jk = Σ_{i=0}^{M} Σ_{k=0}^{K} y_ik p_ij(k), for all j    (8)

Σ_{i=0}^{M} Σ_{k=0}^{K} y_ik = 1    (9)

y_ik ≥ 0, for all i, k    (10)
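A sketch of this LP in Python using scipy.optimize.linprog (the MDP numbers are invented; the variable vector stacks the y_ik's row by row). One of the balance constraints (8) is redundant given (9), which the solver tolerates:

```python
import numpy as np
from scipy.optimize import linprog

# Invented 2-state (M = 1), 2-action (K = 1) MDP.
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])   # P[k][i][j] = p_ij(k)
C = np.array([[1.0, 3.0],
              [2.0, 0.5]])                  # C[i][k] = c_ik
M1, K1 = 2, 2                               # M + 1 states, K + 1 actions

# Decision variables y_ik, flattened so that y[i * K1 + k] = y_ik.
cost = C.flatten()                          # objective (7)

# Balance constraints (8): sum_k y_jk - sum_i sum_k y_ik p_ij(k) = 0,
# plus normalization (9) as the last equality row.
A_eq = np.zeros((M1 + 1, M1 * K1))
for j in range(M1):
    for i in range(M1):
        for k in range(K1):
            A_eq[j, i * K1 + k] -= P[k, i, j]
    for k in range(K1):
        A_eq[j, j * K1 + k] += 1.0
A_eq[M1, :] = 1.0
b_eq = np.append(np.zeros(M1), 1.0)

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
y = res.x.reshape(M1, K1)

# Recover D_ik via equation (6); at the optimum the D_ik's are 0 or 1.
D = y / y.sum(axis=1, keepdims=True)
print(D, res.fun)
```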
The D_ik's are then readily available by using equation (6).

The D_ik's of the optimal solution are either 0 or 1, i.e., the optimal policy is non-randomized. This is because the aforementioned LP has a totally unimodular constraint matrix and integer constants. These two properties guarantee that an optimal solution to the LP through the Simplex method will be integral (in the current case consisting of 0's and 1's). See [5] for more on unimodularity.
A policy improvement algorithm

- Very efficient for large problems.
- Starts with an arbitrary policy, which is progressively improved at each iteration of the algorithm, until the optimal policy is reached.
- Convergence after a finite number of iterations is guaranteed for problems with finite state and action sets S, A.
The theory of policy improvement

Let v_i^n(R) denote the total expected cost of starting from state i and operating for n periods under policy R; with k = d_i(R):

v_i^n(R) = c_i(k) + Σ_{j=0}^{M} p_ij(k) v_j^{n−1}(R)    (11)

The expected average cost is independent of the initial state i:

E{c}(R) = Σ_{j=0}^{M} π_j c_j(k)    (12)

For large n we have:

v_i^n(R) ≈ n E{c}(R) + v_i(R)    (13)

v_i(R) captures the effect of starting from state i on the total expected cost v_i^n(R), thus:

v_i^n(R) − v_j^n(R) ≈ v_i(R) − v_j(R)    (14)
Substituting v_i^n(R) ≈ n E{c}(R) + v_i(R) and v_j^{n−1}(R) ≈ (n − 1) E{c}(R) + v_j(R) into equation (11), we obtain:

E{c}(R) + v_i(R) = c_i(k) + Σ_{j=0}^{M} p_ij(k) v_j(R), for i = 0, 1, ..., M    (15)

The system of equations (15) has M + 2 unknowns (E{c}(R) and the v_i(R)'s) and M + 1 equations. By setting v_M(R) = 0 we can find the v_i(R)'s and the cost associated with a particular policy.

Theoretically, the recursive equation could be used for an exhaustive search for the optimal policy, but this is not computationally efficient.
The policy improvement algorithm

Initialization. Select an arbitrary initial policy R_0.

Iteration n. Perform the following steps:

Value determination. For policy R_n, solve the system of M + 1 equations

E{c}(R_n) = c_i(k) + Σ_{j=0}^{M} p_ij(k) v_j(R_n) − v_i(R_n), for 0 ≤ i ≤ M

(with k = d_i(R_n) and v_M(R_n) = 0) for the M + 1 unknown values E{c}(R_n), v_0(R_n), v_1(R_n), ..., v_{M−1}(R_n).
Policy improvement. Using the values of v_i(R_n) computed for policy R_n, find an improved policy R_{n+1} such that for each state i, d_i(R_{n+1}) = k is the decision that minimizes

c_i(k) + Σ_{j=0}^{M} p_ij(k) v_j(R_n) − v_i(R_n), for 0 ≤ i ≤ M    (16)

i.e., for each state i, minimize (16) over k and set d_i(R_{n+1}) equal to the minimizing value of k. This procedure defines a new policy R_{n+1} with E{c}(R_{n+1}) ≤ E{c}(R_n) (see Theorem 3.2 in [3]).

Optimality test. If the current policy R_{n+1} is identical to the previous R_n, then it is the optimal policy. Otherwise set n = n + 1 and perform another iteration.
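The whole loop can be sketched compactly (Python; the MDP numbers are invented, and the value-determination step fixes v_M = 0 as described above; the −v_i term in (16) is constant per state, so it does not affect which k minimizes):

```python
import numpy as np

# Invented 2-state, 2-action MDP: P[k][i][j] = p_ij(k), C[i][k] = c_i(k).
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])
C = np.array([[1.0, 3.0],
              [2.0, 0.5]])
M1 = 2   # number of states, M + 1

def value_determination(policy):
    """Solve g + v_i = c_i(d_i) + sum_j p_ij(d_i) v_j with v_M = 0.
    Unknown vector x = (g, v_0, ..., v_{M-1}), where g = E{c}(R)."""
    A = np.zeros((M1, M1))
    b = np.zeros(M1)
    for i in range(M1):
        k = policy[i]
        A[i, 0] = 1.0                 # coefficient of g
        coeff = -P[k, i].copy()       # -p_ij(k) multiplies v_j ...
        coeff[i] += 1.0               # ... plus the v_i on the left-hand side
        A[i, 1:] = coeff[:M1 - 1]     # v_M = 0, so its column is dropped
        b[i] = C[i, k]
    x = np.linalg.solve(A, b)
    return x[0], np.append(x[1:], 0.0)    # g, v

def policy_iteration(policy):
    while True:
        g, v = value_determination(policy)
        # Improvement step: minimize c_i(k) + sum_j p_ij(k) v_j over k.
        new = np.array([min(range(2), key=lambda k: C[i, k] + P[k, i] @ v)
                        for i in range(M1)])
        if np.array_equal(new, policy):
            return policy, g          # optimality test passed
        policy = new

opt_policy, opt_cost = policy_iteration(np.array([0, 0]))
print(opt_policy, opt_cost)   # optimal policy and its average cost
```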
References

[1] H. Mine and S. Osaki, Markovian Decision Processes, Elsevier, Amsterdam.
[2] Sheldon M. Ross, Applied Probability Models with Optimization Applications, Dover Publications, New York.
[3] Henk C. Tijms, Stochastic Modelling and Analysis: A Computational Approach, John Wiley & Sons.
[4] Frederick S. Hillier and Gerald J. Lieberman, Introduction to Operations Research, McGraw-Hill.
[5] Christos H. Papadimitriou and Kenneth Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Dover Publications, New York.
The Complexity of Ergodic Mean-payoff Games, Krishnendu Chatterjee Rasmus Ibsen-Jensen Abstract We study two-player (zero-sum) concurrent mean-payoff games played on a finite-state graph. We focus on the
More informationCS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability
CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators)
More informationExperts in a Markov Decision Process
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2004 Experts in a Markov Decision Process Eyal Even-Dar Sham Kakade University of Pennsylvania Yishay Mansour Follow
More informationStrategic Dynamic Jockeying Between Two Parallel Queues
Strategic Dynamic Jockeying Between Two Parallel Queues Amin Dehghanian 1 and Jeffrey P. Kharoufeh 2 Department of Industrial Engineering University of Pittsburgh 1048 Benedum Hall 3700 O Hara Street Pittsburgh,
More informationLearning Equilibrium as a Generalization of Learning to Optimize
Learning Equilibrium as a Generalization of Learning to Optimize Dov Monderer and Moshe Tennenholtz Faculty of Industrial Engineering and Management Technion Israel Institute of Technology Haifa 32000,
More informationZero-Sum Stochastic Games An algorithmic review
Zero-Sum Stochastic Games An algorithmic review Emmanuel Hyon LIP6/Paris Nanterre with N Yemele and L Perrotin Rosario November 2017 Final Meeting Dygame Dygame Project Amstic Outline 1 Introduction Static
More informationRisk-Sensitive and Average Optimality in Markov Decision Processes
Risk-Sensitive and Average Optimality in Markov Decision Processes Karel Sladký Abstract. This contribution is devoted to the risk-sensitive optimality criteria in finite state Markov Decision Processes.
More informationPlanning in Markov Decision Processes
Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov
More informationNear-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement
Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement Jefferson Huang March 21, 2018 Reinforcement Learning for Processing Networks Seminar Cornell University Performance
More informationSuggested solutions for the exam in SF2863 Systems Engineering. December 19,
Suggested solutions for the exam in SF863 Systems Engineering. December 19, 011 14.00 19.00 Examiner: Per Enqvist, phone: 790 6 98 1. We can think of the support center as a Jackson network. The reception
More informationReinforcement Learning and Deep Reinforcement Learning
Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q
More informationLecture December 2009 Fall 2009 Scribe: R. Ring In this lecture we will talk about
0368.4170: Cryptography and Game Theory Ran Canetti and Alon Rosen Lecture 7 02 December 2009 Fall 2009 Scribe: R. Ring In this lecture we will talk about Two-Player zero-sum games (min-max theorem) Mixed
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of
More informationMarkov Decision Processes Infinite Horizon Problems
Markov Decision Processes Infinite Horizon Problems Alan Fern * * Based in part on slides by Craig Boutilier and Daniel Weld 1 What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T)
More informationStochastic Optimization
Chapter 27 Page 1 Stochastic Optimization Operations research has been particularly successful in two areas of decision analysis: (i) optimization of problems involving many variables when the outcome
More informationA monotonic property of the optimal admission control to an M/M/1 queue under periodic observations with average cost criterion
A monotonic property of the optimal admission control to an M/M/1 queue under periodic observations with average cost criterion Cao, Jianhua; Nyberg, Christian Published in: Seventeenth Nordic Teletraffic
More informationAnswers to selected exercises
Answers to selected exercises A First Course in Stochastic Models, Henk C. Tijms 1.1 ( ) 1.2 (a) Let waiting time if passengers already arrived,. Then,, (b) { (c) Long-run fraction for is (d) Let waiting
More informationDiscrete-Time Markov Decision Processes
CHAPTER 6 Discrete-Time Markov Decision Processes 6.0 INTRODUCTION In the previous chapters we saw that in the analysis of many operational systems the concepts of a state of a system and a state transition
More informationComputation and Dynamic Programming
Computation and Dynamic Programming Huseyin Topaloglu School of Operations Research and Information Engineering, Cornell University, Ithaca, New York 14853, USA topaloglu@orie.cornell.edu June 25, 2010
More informationNotation. state space
Notation S state space i, j, w, z states Φ parameter space modelling decision rules and perturbations φ elements of parameter space δ decision rule and stationary, deterministic policy δ(i) decision rule
More informationLecture Notes 7 Random Processes. Markov Processes Markov Chains. Random Processes
Lecture Notes 7 Random Processes Definition IID Processes Bernoulli Process Binomial Counting Process Interarrival Time Process Markov Processes Markov Chains Classification of States Steady State Probabilities
More informationRestless Bandit Index Policies for Solving Constrained Sequential Estimation Problems
Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems Sofía S. Villar Postdoctoral Fellow Basque Center for Applied Mathemathics (BCAM) Lancaster University November 5th,
More informationReinforcement Learning. Yishay Mansour Tel-Aviv University
Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak
More informationA Simple Solution for the M/D/c Waiting Time Distribution
A Simple Solution for the M/D/c Waiting Time Distribution G.J.Franx, Universiteit van Amsterdam November 6, 998 Abstract A surprisingly simple and explicit expression for the waiting time distribution
More informationTime Reversibility and Burke s Theorem
Queuing Analysis: Time Reversibility and Burke s Theorem Hongwei Zhang http://www.cs.wayne.edu/~hzhang Acknowledgement: this lecture is partially based on the slides of Dr. Yannis A. Korilis. Outline Time-Reversal
More informationLinear and Integer Programming - ideas
Linear and Integer Programming - ideas Paweł Zieliński Institute of Mathematics and Computer Science, Wrocław University of Technology, Poland http://www.im.pwr.wroc.pl/ pziel/ Toulouse, France 2012 Literature
More informationA general algorithm to compute the steady-state solution of product-form cooperating Markov chains
A general algorithm to compute the steady-state solution of product-form cooperating Markov chains Università Ca Foscari di Venezia Dipartimento di Informatica Italy 2009 Presentation outline 1 Product-form
More information1 Random Walks and Electrical Networks
CME 305: Discrete Mathematics and Algorithms Random Walks and Electrical Networks Random walks are widely used tools in algorithm design and probabilistic analysis and they have numerous applications.
More information
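The M/X/1 design problem described in these notes — treating the arrival process as the environment and optimizing the queue's control policy — can be made concrete with a small value-iteration sketch. The example below solves an admission-control variant: arrivals may be admitted (earning a reward) or rejected, with a holding cost per queued customer. All parameters (lam, mu, R, c, gamma, K) are illustrative choices, not values from the notes, and the continuous-time queue is turned into a discrete-time MDP by uniformization.

```python
# Hypothetical admission-control example for an M/M/1-type queue,
# solved by value iteration on the uniformized discrete-time MDP.
lam, mu = 0.6, 1.0            # arrival and service rates (assumed)
Lam = lam + mu                # uniformization constant
p_arr, p_dep = lam / Lam, mu / Lam
K = 20                        # buffer size (state space truncation)
R = 5.0                       # reward per admitted customer (assumed)
c = 1.0                       # holding cost per customer per step
gamma = 0.95                  # discount factor

V = [0.0] * (K + 1)           # V[n]: value of having n customers in system
for _ in range(1000):         # value-iteration sweeps until convergence
    newV = [0.0] * (K + 1)
    for n in range(K + 1):
        down = V[max(n - 1, 0)]               # departure (self-loop at 0)
        reject = p_arr * V[n] + p_dep * down  # arrival turned away
        if n < K:
            # action "admit": collect R, queue grows by one
            admit = p_arr * (R + V[n + 1]) + p_dep * down
            best = max(admit, reject)
        else:
            best = reject                     # buffer full: must reject
        newV[n] = -c * n + gamma * best
    V = newV

# Extract the optimal policy: admit iff the reward outweighs the
# marginal congestion cost R + V[n+1] >= V[n].
policy = ["admit" if n < K and R + V[n + 1] >= V[n] else "reject"
          for n in range(K + 1)]
```

With these numbers the computed policy is of threshold type: arrivals are admitted while the queue is short and rejected once the marginal holding cost of one more customer exceeds the admission reward, which is exactly the kind of designed service behavior the notes contrast with a fixed M/M/1 model.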