Some notes on Markov Decision Theory



Nikolaos Laoutaris, January

Markov Decision Theory [1, 2, 3, 4] provides a methodology for analyzing probabilistic sequential decision processes over a finite or infinite planning horizon.

Queueing Theory + Markov Processes: model a probabilistic system in order to evaluate its performance.

Markov Decision Theory: goes one step further; design the operation of a probabilistic system so as to optimize its performance.

Probabilistic: there exists an environment that cannot be described in full detail and is therefore taken to be random (stochastic), following some probability law that best describes its nature. It is in this environment that our agent is to operate.

Sequential: the theory aims at providing the tools for optimizing behaviors, i.e., sequences of decisions, not single decisions.

Planning horizon: when deciding on a behavior we must take into account the length of our intended activity. A poker player behaves differently when opening a new game than when finishing one (having secured a profit or suffered a loss). Behaviors depend on whether we will participate in an activity for a finite amount of time and then abandon it, or have decided to be involved in it permanently.

Queueing theory vs. decision theory: in a queueing model (say an M/M/1 queue) everything is fixed. The (stochastic) arrival process and the (stochastic) service process are parts of the environment in which we study some performance metric of interest (e.g., the expected queueing delay). An application of decision theory to queues (an M/X/1 queue, where X is an unknown service policy that we want to design and optimize) would consider the arrival process as the only element of the environment. Decision theory would provide the tools to design and optimize the service process so as to achieve a desired goal (e.g., avoid underflows or overflows).

A Markov Decision Process (MDP) is a discrete Markov process $\{I_n\}_{n>0}$ characterized by a tuple $\langle S, A, P, C \rangle$:

$S$ is the set of possible states. $\{I_n\}$ is in state $i$ at time $n$ iff $I_n = i$, $0 \le i \le M$.

$A$ is the set of possible actions (or decisions). Following an observation, an action $k$ is taken from the finite action space $A$, $k = 0, 1, \ldots, K$.

$P : S \times A \times S \to [0, 1]$ is the state transition function specifying the probability $P\{j \mid i, k\} \equiv p_{ij}(k)$ of observing a transition to state $j \in S$ after taking action $k \in A$ in state $i \in S$.
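As an illustration of the tuple above, one possible way to hold a small MDP in memory is sketched below. The two-state, two-action numbers, the names `STATES`, `ACTIONS`, `P`, `C` and the helper `is_valid_mdp` are all hypothetical and not part of the original notes; the cost table `C` anticipates the cost function of the tuple.

```python
# Hypothetical 2-state, 2-action MDP; every number here is made up for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]

# P[k][i][j] = p_ij(k): probability of moving from state i to state j under action k.
P = [
    [[0.9, 0.1], [0.5, 0.5]],  # action 0
    [[0.2, 0.8], [0.1, 0.9]],  # action 1
]

# C[i][k] = c_i(k): cost of taking action k in state i.
C = [[1.0, 3.0], [5.0, 2.0]]


def is_valid_mdp(states, actions, P, C, tol=1e-9):
    """Check that each row p_i.(k) is a probability distribution and C covers every (i, k)."""
    for k in actions:
        for i in states:
            row = P[k][i]
            if len(row) != len(states):
                return False
            if any(p < 0.0 or p > 1.0 for p in row):
                return False
            if abs(sum(row) - 1.0) > tol:
                return False
    return all(len(C[i]) == len(actions) for i in states)


print(is_valid_mdp(STATES, ACTIONS, P, C))
```

The nested-list layout `P[k][i][j]` is only one convenient choice; a dictionary keyed by `(i, k)` would serve equally well for non-homogeneous action sets.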

Observation instances

The process can be either continuous or discrete time. We focus on discrete-time MDPs, which come in two flavors:

Discrete MDPs that are time-less (we do not model the time between observation instances, e.g., $n \to n + 1$, or take it to be constant). These are in a way generalizations of Markov chains (discrete MDPs of this kind are sometimes called controllable Markov chains).

Discrete MDPs that allow for time to pass between observation instances. These are much like generalized versions of semi-Markov processes. In such cases the Markov property holds only at observation instances (and not at arbitrary instants), and we may exploit these instances to optimize a decision system that acts upon observation instances.

A continuous-time MDP would require the Markov property to hold at all time instants. This is rather restrictive, as most processes of interest possess the Markov property only at selected times of interest and not in general.

The action space can be either homogeneous or non-homogeneous.

Homogeneous: there is a common action space, $A$, from which we choose actions according to the current state.

Non-homogeneous: each state $i$ is associated with a potentially different set of actions, $A_i$, among which a decision must be made.

$C : S \times A \to \mathbb{R}$ is a function specifying the cost $c_i(k)$ of taking action $k \in A$ at state $i \in S$; $c_i(k)$ must depend only on the current state-action pair.

A policy $R = (d_0, d_1, \ldots, d_M)$ prescribes an action for each possible state. $d_i(R) = k$ means that under policy $R$, action $k$ is taken when the process is in state $i$.

$\pi(R) = (\pi_0, \pi_1, \ldots, \pi_M)$ is the limiting distribution of $\{I_n\}$ under policy $R$:

$$\pi(R) = \pi(R) \, P(R)$$

The objective is to find the optimal policy $R_{opt}$ that minimizes some cost criterion which considers both immediate costs and subsequent costs from the future evolution of $\{I_n\}$.

Costs are used to drive the agent towards the desired behavior (defined by an objective function). Costs go with minimization objectives; alternatively, we may use rewards in conjunction with maximization objectives.

Be careful when defining costs: costs must depend only on the current state (and the action taken in it). This is a common source of errors. Be particularly careful because in many cases the cost may also depend on the next state. In such cases, average over all possible transitions to obtain a legitimate MDP cost (one that depends only on the current state).
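The averaging step can be written as $c_i(k) = \sum_j p_{ij}(k)\, c(i, k, j)$, where $c(i, k, j)$ is a transition-dependent cost. The following minimal sketch shows this for one fixed state-action pair; the function name `expected_cost` and the numbers are illustrative assumptions, not taken from the notes.

```python
def expected_cost(p_row, cost_row):
    """Return c_i(k) = sum_j p_ij(k) * c(i, k, j) for one fixed pair (i, k).

    p_row[j]    -- p_ij(k), the transition probabilities out of state i under action k
    cost_row[j] -- c(i, k, j), a cost that depends on the next state j
    """
    return sum(p * c for p, c in zip(p_row, cost_row))


# Hypothetical numbers: from (i, k) we reach state 0 w.p. 0.25 at cost 4
# and state 1 w.p. 0.75 at cost 8.
p_row = [0.25, 0.75]
cost_row = [4.0, 8.0]
c_ik = expected_cost(p_row, cost_row)
print(c_ik)  # 0.25 * 4 + 0.75 * 8 = 7.0
```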

Types of policies:

Stationary policies always follow the same rule for choosing an action for each possible state, independently of the current time $n$.

Non-stationary policies behave differently as time evolves.

Deterministic policies always take the same action when at the same state: $d_i(R) = k$ with probability 1.

Randomized policies map each state to a probability distribution over the possible actions: $d_i(R) = k$ with probability $\rho_i(k)$, where $\sum_{k \in A} \rho_i(k) = 1$.
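A stationary policy, deterministic or randomized, turns the MDP into an ordinary Markov chain with transition matrix $P(R)$, whose entries are $\sum_k \rho_i(k)\, p_{ij}(k)$. The sketch below builds that matrix; the helper name `induced_chain` and the example numbers are assumptions made for illustration. A deterministic policy is simply the special case where each $\rho_i$ puts probability 1 on a single action.

```python
def induced_chain(P, rho):
    """Return P(R) with entries sum_k rho[i][k] * P[k][i][j].

    P[k][i][j] -- p_ij(k)
    rho[i][k]  -- probability of choosing action k in state i
    """
    n_states = len(rho)
    n_actions = len(rho[0])
    return [
        [sum(rho[i][k] * P[k][i][j] for k in range(n_actions)) for j in range(n_states)]
        for i in range(n_states)
    ]


# Hypothetical 2-state, 2-action transition probabilities.
P = [
    [[0.9, 0.1], [0.5, 0.5]],  # action 0
    [[0.2, 0.8], [0.1, 0.9]],  # action 1
]

deterministic = [[1.0, 0.0], [0.0, 1.0]]  # d_0 = 0, d_1 = 1
randomized = [[0.5, 0.5], [0.0, 1.0]]     # state 0 mixes both actions evenly

P_det = induced_chain(P, deterministic)   # rows: P[0][0] and P[1][1]
P_rand = induced_chain(P, randomized)     # row 0 is the average of P[0][0] and P[1][0]
```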

Policy space: policies are classified along two axes, stationary vs. non-stationary and randomized vs. deterministic.

Deterministic policy: 1 state → 1 action.

Randomized policy: 1 state → 1 probability distribution over the action space.

Stationarity and randomization are concepts that belong to different levels of characterizing a policy. They do not compare directly (e.g., stationary is NOT the opposite of randomized)!

Stationarity is all about whether the rule for choosing decisions is affected by time.

Randomization refers to the way decisions are made for particular states (under a stationary or non-stationary policy).

Stationary policies arise when optimizing over an infinite horizon. This makes sense intuitively: it is boundary (time) conditions that might prompt a change of behavior (given a fixed environment). A poker player who is already losing money might bet more aggressively towards the end of the game in a final attempt to recover. Similarly, a winning (rational!) player might avoid excessive risks towards the end of the game (to protect his winnings). Non-stationary policies arise from finite planning horizons.

Behavioral changes due to approaching time boundaries also appear in the domain of Game Theory (a game involves at least two interacting agents (players), whereas the decision theory discussed here involves only one agent, whose aim is to adapt to his environment rather than compete with another rational entity). An interesting discussion of such behavioral issues appears in the context of the Iterated Prisoner's Dilemma, and other games (see footnote 1).

Footnote 1: William Poundstone, Prisoner's Dilemma: John Von Neumann, Game Theory and the Puzzle of the Bomb, Anchor Books (highly recommended!).

Cost criteria and planning horizon

Finite horizon undiscounted-cost problems require the minimization of the total expected accumulated cost over a finite planning horizon of $W$ observations (transitions):

$$E_i\{c\}(R) = E\left\{ \sum_{n=1}^{W} c_{I_n}\big(d_{I_n}(R)\big) \,\Big|\, I_0 = i \right\}$$

Infinite horizon problems consider an infinite planning horizon. They are appropriate for systems that are expected to operate continuously or for systems that have an unknown stopping time.

To understand $E_i\{c\}(R)$, the expected cost when starting from state $i$ and operating for $W$ time units under policy $R$, recall the following definitions:

$I_0$ is the initial state of the process (at $n = 0$).

$I_n$ is the state of the process at time $n$.

$d_{I_n}(R)$ is the decision taken in state $I_n$ under policy $R$.

$c_{I_n}(d_{I_n}(R))$ is the cost incurred by taking decision $d_{I_n}(R)$ in state $I_n$.

Discounted-cost problems (finite/infinite horizon)

Attach a discount factor $\alpha$, $0 < \alpha < 1$, to each immediate cost $c_i(k)$, thus affecting the relative importance of immediate costs over future costs:

$$E_i\{c\}(R) = E\left\{ \sum_{n=1}^{\infty} \alpha^n \, c_{I_n}\big(d_{I_n}(R)\big) \,\Big|\, I_0 = i \right\}$$

When $\alpha \to 1$, future costs tend to count as much as immediate costs. Otherwise, future costs tend to be heavily discounted, so, for optimal performance, more attention must be given to the minimization of immediate costs.
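For a fixed stationary deterministic policy, the infinite-horizon discounted cost above satisfies the fixed-point relation $u_i = \alpha \sum_j p_{ij}\,(c_j + u_j)$, because the sum starts at $n = 1$ and therefore charges the cost of the state reached after each transition. A minimal sketch that iterates this relation follows; the function name `discounted_cost`, the chain `P_R`, the costs `c_R` and $\alpha = 0.5$ are illustrative assumptions.

```python
def discounted_cost(P_R, c_R, alpha, iters=200):
    """Approximate u_i = E{ sum_{n>=1} alpha^n c_{I_n} | I_0 = i } for a fixed policy.

    P_R[i][j] -- transition probabilities of the chain induced by the policy
    c_R[j]    -- cost incurred in state j under the policy
    The update u <- alpha * P_R (c_R + u) is a contraction with factor alpha,
    so 200 sweeps are far more than enough for alpha = 0.5.
    """
    n = len(c_R)
    u = [0.0] * n
    for _ in range(iters):
        u = [alpha * sum(P_R[i][j] * (c_R[j] + u[j]) for j in range(n)) for i in range(n)]
    return u


# Hypothetical two-state chain and costs.
P_R = [[0.5, 0.5], [0.25, 0.75]]
c_R = [2.0, 5.0]
u = discounted_cost(P_R, c_R, alpha=0.5)
print(u)  # approximately [26/7, 29/7]
```

Solving the two linear equations by hand gives $u_0 = 26/7$ and $u_1 = 29/7$ for these numbers, which the iteration reproduces.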

Average cost optimality (goes with infinite horizon)

Requires the minimization of the expected average cost per unit of time:

$$E_i\{c\}(R) = E\left\{ \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} c_{I_h}\big(d_{I_h}(R)\big) \,\Big|\, I_0 = i \right\} \qquad (1)$$

As $n \to \infty$, $P\{I_n = j \mid I_0 = i\}(R) \to \pi_j(R)$, independently of the initial state $I_0 = i$, thus:

$$E\{c\}(R) = \sum_{j \in S} \pi_j(R) \, c_j\big(d_j(R)\big) \qquad (2)$$

Derivation (1) $\Rightarrow$ (2): The limiting probability $\pi_j(R)$ can be written as follows:

$$\pi_j(R) = \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} P\{I_h = j \mid I_0 = i\}(R) \qquad (3)$$

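Thus, starting from (1):

$$
\begin{aligned}
E_i\{c\}(R) &= E\left\{ \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} c_{I_h}\big(d_{I_h}(R)\big) \,\Big|\, I_0 = i \right\} \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} E\left\{ c_{I_h}\big(d_{I_h}(R)\big) \,\Big|\, I_0 = i \right\} \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} \sum_{j \in S} c_j\big(d_j(R)\big) \, P\{I_h = j \mid I_0 = i\}(R) \\
&= \sum_{j \in S} c_j\big(d_j(R)\big) \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} P\{I_h = j \mid I_0 = i\}(R) \\
&= \sum_{j \in S} c_j\big(d_j(R)\big) \, \pi_j(R) \qquad \text{(substituting from (3))} \\
&= (2)
\end{aligned}
$$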
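Equation (2) can be checked numerically on a small chain. The sketch below approximates $\pi(R)$ by repeatedly applying $\pi \leftarrow \pi P(R)$ and then forms $\sum_j \pi_j c_j$. It assumes the induced chain is irreducible and aperiodic so that the iteration converges; the function names and the numbers are illustrative, not from the notes.

```python
def stationary_distribution(P_R, iters=200):
    """Approximate pi = pi * P_R by power iteration (assumes an ergodic chain)."""
    n = len(P_R)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P_R[i][j] for i in range(n)) for j in range(n)]
    return pi


def average_cost(P_R, c_R):
    """Equation (2): E{c}(R) = sum_j pi_j(R) * c_j(d_j(R))."""
    pi = stationary_distribution(P_R)
    return sum(p * c for p, c in zip(pi, c_R))


# Hypothetical chain induced by some fixed policy R, and its per-state costs.
P_R = [[0.5, 0.5], [0.25, 0.75]]
c_R = [2.0, 5.0]

pi = stationary_distribution(P_R)  # approximately [1/3, 2/3]
g = average_cost(P_R, c_R)         # approximately (1/3) * 2 + (2/3) * 5 = 4
print(pi, g)
```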

The optimal policy $R_{opt}$ is the one that incurs the smallest cost: $E\{c\}(R_{opt}) \le E\{c\}(R)$ over all $R \in S \times A$.

1. An optimal policy does not always exist under the average cost criterion.
2. If an optimal policy does exist, it is not guaranteed to be stationary.
3. If $S$ is finite and every stationary policy gives rise to an irreducible Markov chain, then the stationarity of the optimal policy is guaranteed (and it is non-randomized); see S.M. Ross [2] for more details.

Finding the optimal policy

1. Exhaustive enumeration. Suitable only for tiny problems due to its $O(\mathrm{size}(A)^{\mathrm{size}(S)})$ complexity.
2. Linear Programming (LP).
3. Policy improvement algorithm.
4. Value iteration algorithm.
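To make the first option concrete, here is a minimal sketch of exhaustive enumeration over all stationary deterministic policies under the average cost criterion. It assumes every such policy induces an ergodic chain (so that power iteration converges), and the example MDP and all function names are hypothetical. The loop visits $\mathrm{size}(A)^{\mathrm{size}(S)}$ policies, which is why the approach only suits tiny problems.

```python
from itertools import product


def stationary_distribution(P_R, iters=1000):
    """Approximate pi = pi * P_R by power iteration (assumes an ergodic chain)."""
    n = len(P_R)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P_R[i][j] for i in range(n)) for j in range(n)]
    return pi


def average_cost(P, C, policy):
    """Average cost per unit time of the deterministic policy (d_0, ..., d_M)."""
    n = len(policy)
    P_R = [P[policy[i]][i] for i in range(n)]  # row i uses the action chosen in state i
    pi = stationary_distribution(P_R)
    return sum(pi[i] * C[i][policy[i]] for i in range(n))


def enumerate_policies(P, C):
    """Try every deterministic policy and keep the cheapest one."""
    n_states, n_actions = len(C), len(C[0])
    best = None
    for policy in product(range(n_actions), repeat=n_states):
        cost = average_cost(P, C, policy)
        if best is None or cost < best[1]:
            best = (policy, cost)
    return best


# Hypothetical MDP: P[k][i][j] = p_ij(k), C[i][k] = c_i(k).
P = [
    [[0.9, 0.1], [0.5, 0.5]],  # action 0
    [[0.2, 0.8], [0.1, 0.9]],  # action 1
]
C = [[1.0, 3.0], [5.0, 2.0]]

best_policy, best_cost = enumerate_policies(P, C)
print(best_policy, best_cost)  # policy (0, 1) with average cost close to 1.5
```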

Solving the MDP via Linear Programming

The optimal policy can be identified efficiently by transforming the MDP formulation into a linear program.

Denote: $D_{ik} = P\{\text{action} = k \mid \text{state} = i\}$. The $D_{ik}$'s completely define a policy.

Also denote: $y_{ik} = P\{\text{action} = k \text{ and state} = i\}$.

Clearly the two are related via:

$$y_{ik} = \pi_i D_{ik} \qquad (4)$$

Also:

$$\pi_i = \sum_{k=0}^{K} y_{ik} \qquad (5)$$

From (4), (5):

$$D_{ik} = \frac{y_{ik}}{\pi_i} = \frac{y_{ik}}{\sum_{k=0}^{K} y_{ik}} \qquad (6)$$

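There are several constraints on the $y_{ik}$'s:

1. $\sum_{i=0}^{M} \pi_i = 1 \;\Rightarrow\; \sum_{i=0}^{M} \sum_{k=0}^{K} y_{ik} = 1$
2. $\pi_j = \sum_{i=0}^{M} \pi_i p_{ij} \;\Rightarrow\; \sum_{k=0}^{K} y_{jk} = \sum_{i=0}^{M} \sum_{k=0}^{K} y_{ik} \, p_{ij}(k), \quad \forall j$
3. $y_{ik} \ge 0, \quad \forall i, k$

The steady-state average cost per unit time is:

$$E(C) = \sum_{i=0}^{M} \sum_{k=0}^{K} \pi_i \, c_{ik} \, D_{ik} = \sum_{i=0}^{M} \sum_{k=0}^{K} c_{ik} \, y_{ik}$$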

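The $y_{ik}$'s are obtained from the following LP:

Minimize

$$z = E\{c\} = \sum_{i=0}^{M} \sum_{k=0}^{K} c_{ik} \, y_{ik} \qquad (7)$$

Subject to

$$\forall j: \quad \sum_{k=0}^{K} y_{jk} = \sum_{i=0}^{M} \sum_{k=0}^{K} y_{ik} \, p_{ij}(k) \qquad (8)$$

$$\sum_{i=0}^{M} \sum_{k=0}^{K} y_{ik} = 1 \qquad (9)$$

$$\forall i, k: \quad y_{ik} \ge 0 \qquad (10)$$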

The $D_{ik}$'s are then readily available by using equation (6).

The $D_{ik}$'s of the optimal solution are either 0 or 1, i.e., the optimal policy is non-randomized.

This is because the aforementioned LP has a totally unimodular constraint matrix and integer constants. These two properties guarantee that an optimal solution to the LP obtained through the Simplex method is integral (in the current case consisting of 0's and 1's). See [5] for more on unimodularity.
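The sketch below does not solve the LP; it only illustrates how the variables fit together. It builds $y_{ik} = \pi_i D_{ik}$ for one deterministic policy on a hypothetical two-state MDP, checks constraints (8) to (10), evaluates objective (7), and recovers $D_{ik}$ through equation (6). The stationary distribution `pi` is supplied by hand (the induced chain is symmetric, so it is uniform), and all names and numbers are assumptions made for illustration.

```python
def lp_residuals(P, y):
    """Return (balance residuals of (8) for each j, residual of normalization (9))."""
    n_states, n_actions = len(y), len(y[0])
    balance = [
        sum(y[j])
        - sum(y[i][k] * P[k][i][j] for i in range(n_states) for k in range(n_actions))
        for j in range(n_states)
    ]
    normalization = sum(sum(row) for row in y) - 1.0
    return balance, normalization


def objective(C, y):
    """Objective (7): sum_i sum_k c_ik * y_ik."""
    return sum(C[i][k] * y[i][k] for i in range(len(y)) for k in range(len(y[0])))


def recover_policy(y):
    """Equation (6): D_ik = y_ik / sum_k y_ik (assumes every state has pi_i > 0)."""
    return [[y_ik / sum(row) for y_ik in row] for row in y]


# Hypothetical MDP: P[k][i][j] = p_ij(k), C[i][k] = c_i(k).
P = [
    [[0.9, 0.1], [0.5, 0.5]],  # action 0
    [[0.2, 0.8], [0.1, 0.9]],  # action 1
]
C = [[1.0, 3.0], [5.0, 2.0]]

# Policy d_0 = 0, d_1 = 1 induces the chain [[0.9, 0.1], [0.1, 0.9]], whose
# stationary distribution is uniform by symmetry.
D = [[1.0, 0.0], [0.0, 1.0]]
pi = [0.5, 0.5]
y = [[pi[i] * D[i][k] for k in range(2)] for i in range(2)]  # equation (4)

balance, normalization = lp_residuals(P, y)
z = objective(C, y)
print(balance, normalization, z, recover_policy(y))
```

In practice the LP itself would be handed to a solver; the feasibility check above is only meant to show what a candidate solution has to satisfy.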

A policy improvement algorithm

Very efficient for large problems.

Starts with an arbitrary policy, which is progressively improved at each iteration of the algorithm until the optimal policy is reached.

Convergence after a finite number of iterations is guaranteed for problems with finite state and action sets $S$, $A$.

The theory of policy improvement

$v_i^n(R)$: the total expected cost of starting from state $i$ and operating for $n$ periods under policy $R$ (here $k$ denotes the action that $R$ prescribes in the state concerned):

$$v_i^n(R) = c_i(k) + \sum_{j=0}^{M} p_{ij}(k) \, v_j^{n-1}(R) \qquad (11)$$

The expected average cost is independent of the initial state $i$:

$$E\{c\}(R) = \sum_{j=0}^{M} \pi_j \, c_j(k) \qquad (12)$$

For large $n$ we have:

$$v_i^n(R) \approx n \, E\{c\}(R) + v_i(R) \qquad (13)$$

$v_i(R)$ captures the effect of starting from state $i$ on the total expected cost $v_i^n(R)$, thus:

$$v_i^n(R) - v_j^n(R) \approx v_i(R) - v_j(R) \qquad (14)$$
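Recursion (11) and the asymptotic relations (13) and (14) can be observed numerically. The minimal sketch below iterates (11) for a fixed policy on a hypothetical two-state chain, starting from $v^0 = 0$; the function name `total_cost` and the numbers are illustrative assumptions. For this chain the average cost is 4 and, with $v_1(R) = 0$, the relative value of state 0 is $-4$.

```python
def total_cost(P_R, c_R, n):
    """Iterate recursion (11): v_i^n = c_i + sum_j p_ij * v_j^(n-1), with v^0 = 0."""
    v = [0.0] * len(c_R)
    for _ in range(n):
        v = [
            c_R[i] + sum(P_R[i][j] * v[j] for j in range(len(v)))
            for i in range(len(v))
        ]
    return v


# Hypothetical chain induced by a fixed policy R, and its per-state costs.
P_R = [[0.5, 0.5], [0.25, 0.75]]
c_R = [2.0, 5.0]

v49 = total_cost(P_R, c_R, 49)
v50 = total_cost(P_R, c_R, 50)

growth = v50[1] - v49[1]   # per-period growth, approaches E{c}(R) = 4 as in (13)
offset = v50[0] - v50[1]   # approaches v_0(R) - v_1(R) = -4 as in (14)
print(growth, offset)
```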

Substituting $v_i^n(R) \approx n \, E\{c\}(R) + v_i(R)$ and $v_j^{n-1}(R) \approx (n-1) \, E\{c\}(R) + v_j(R)$ into equation (11), and using $\sum_{j=0}^{M} p_{ij}(k) = 1$, we obtain:

$$E\{c\}(R) + v_i(R) = c_i(k) + \sum_{j=0}^{M} p_{ij}(k) \, v_j(R), \quad \text{for } i = 0, 1, \ldots, M \qquad (15)$$

The system of equations (15) has $M + 2$ unknowns ($E\{c\}(R)$ and the $v_i(R)$'s) and $M + 1$ equations. By setting $v_M(R) = 0$ we can find the $v_i(R)$'s and the cost associated with a particular policy.

Theoretically, the recursive equation could be used for an exhaustive search for the optimal policy, but this is not computationally efficient.

The policy improvement algorithm

Initialization: Select an arbitrary initial policy $R_0$.

Iteration $n$: Perform the following steps.

Value determination: For policy $R_n$ solve the system of $M + 1$ equations

$$E\{c\}(R_n) = c_i(k) + \sum_{j=0}^{M} p_{ij}(k) \, v_j(R_n) - v_i(R_n), \quad \text{for } 0 \le i \le M$$

for the $M + 1$ unknown values $E\{c\}(R_n), v_0(R_n), v_1(R_n), \ldots, v_{M-1}(R_n)$, with $v_M(R_n) = 0$.

Policy improvement: Using the values of $v_i(R_n)$ computed for policy $R_n$, find an improved policy $R_{n+1}$ such that for each state $i$, $d_i(R_{n+1}) = k$ is the decision that minimizes:

$$c_i(k) + \sum_{j=0}^{M} p_{ij}(k) \, v_j(R_n) - v_i(R_n), \quad \text{for } 0 \le i \le M \qquad (16)$$

i.e., for each state $i$ minimize (16) and set $d_i(R_{n+1})$ equal to the minimizing value of $k$. This procedure defines a new policy $R_{n+1}$ with $E\{c\}(R_{n+1}) \le E\{c\}(R_n)$ (see Theorem 3.2 in [3]).

Optimality test: If the current policy $R_{n+1}$ is identical to the previous $R_n$, then it is the optimal policy. Otherwise set $n = n + 1$ and perform another iteration.
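The three steps (value determination, policy improvement, optimality test) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the example MDP is hypothetical, the linear system is solved with a small hand-written Gaussian elimination, every policy is assumed to induce an irreducible chain so the system is non-singular, and ties are broken in favor of the current action so that the loop terminates.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def value_determination(P, C, policy):
    """Solve g + v_i - sum_j p_ij(k) v_j = c_i(k) with the last v fixed to 0."""
    n = len(policy)
    A, b = [], []
    for i in range(n):
        k = policy[i]
        # Unknowns are (g, v_0, ..., v_{n-2}); the last relative value is 0.
        A.append([1.0] + [(1.0 if j == i else 0.0) - P[k][i][j] for j in range(n - 1)])
        b.append(C[i][k])
    x = solve_linear(A, b)
    return x[0], x[1:] + [0.0]


def policy_improvement(P, C, policy):
    """Iterate value determination and improvement until the policy repeats."""
    n_states, n_actions = len(C), len(C[0])
    policy = list(policy)
    while True:
        g, v = value_determination(P, C, policy)
        new_policy = []
        for i in range(n_states):
            # The term -v_i in (16) does not depend on k, so it is omitted here.
            q = [C[i][k] + sum(P[k][i][j] * v[j] for j in range(n_states))
                 for k in range(n_actions)]
            best = min(range(n_actions), key=lambda k: q[k])
            if q[policy[i]] <= q[best] + 1e-12:
                best = policy[i]  # keep the current action on ties
            new_policy.append(best)
        if new_policy == policy:  # optimality test
            return tuple(policy), g
        policy = new_policy


# Hypothetical MDP: P[k][i][j] = p_ij(k), C[i][k] = c_i(k).
P = [
    [[0.9, 0.1], [0.5, 0.5]],  # action 0
    [[0.2, 0.8], [0.1, 0.9]],  # action 1
]
C = [[1.0, 3.0], [5.0, 2.0]]

optimal_policy, optimal_cost = policy_improvement(P, C, (1, 0))
print(optimal_policy, optimal_cost)  # policy (0, 1) with average cost close to 1.5
```

Starting from the poor policy (1, 0), the first improvement step already yields (0, 1), and the second iteration confirms it through the optimality test.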

References

[1] H. Mine and S. Osaki, Markovian Decision Processes, Elsevier, Amsterdam.
[2] Sheldon M. Ross, Applied Probability Models with Optimization Applications, Dover Publications, New York.
[3] Henk C. Tijms, Stochastic Modelling and Analysis: A Computational Approach, John Wiley & Sons.
[4] Frederick S. Hillier and Gerald J. Lieberman, Introduction to Operations Research, McGraw-Hill.
[5] Christos H. Papadimitriou and Kenneth Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Dover Publications, New York.


A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan Sandeep Manjanna (*Based on: Convergence of

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information



More information

A POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation

A POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation A POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation Karim G. Seddik and Amr A. El-Sherif 2 Electronics and Communications Engineering Department, American University in Cairo, New

More information

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti 1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early

More information


INTRODUCTION TO MARKOV DECISION PROCESSES INTRODUCTION TO MARKOV DECISION PROCESSES Balázs Csanád Csáji Research Fellow, The University of Melbourne Signals & Systems Colloquium, 29 April 2010 Department of Electrical and Electronic Engineering,

More information

Lecture 1. Evolution of Market Concentration

Lecture 1. Evolution of Market Concentration Lecture 1 Evolution of Market Concentration Take a look at : Doraszelski and Pakes, A Framework for Applied Dynamic Analysis in IO, Handbook of I.O. Chapter. (see link at syllabus). Matt Shum s notes are

More information

Markov Decision Processes and Solving Finite Problems. February 8, 2017

Markov Decision Processes and Solving Finite Problems. February 8, 2017 Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:

More information

Stochastic Processes. Theory for Applications. Robert G. Gallager CAMBRIDGE UNIVERSITY PRESS

Stochastic Processes. Theory for Applications. Robert G. Gallager CAMBRIDGE UNIVERSITY PRESS Stochastic Processes Theory for Applications Robert G. Gallager CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv Swgg&sfzoMj ybr zmjfr%cforj owf fmdy xix Acknowledgements xxi 1 Introduction and review

More information

Procedia Computer Science 00 (2011) 000 6

Procedia Computer Science 00 (2011) 000 6 Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-

More information

Notes on Coursera s Game Theory

Notes on Coursera s Game Theory Notes on Coursera s Game Theory Manoel Horta Ribeiro Week 01: Introduction and Overview Game theory is about self interested agents interacting within a specific set of rules. Self-Interested Agents have

More information

The Complexity of Ergodic Mean-payoff Games,

The Complexity of Ergodic Mean-payoff Games, The Complexity of Ergodic Mean-payoff Games, Krishnendu Chatterjee Rasmus Ibsen-Jensen Abstract We study two-player (zero-sum) concurrent mean-payoff games played on a finite-state graph. We focus on the

More information

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators)

More information

Experts in a Markov Decision Process

Experts in a Markov Decision Process University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2004 Experts in a Markov Decision Process Eyal Even-Dar Sham Kakade University of Pennsylvania Yishay Mansour Follow

More information

Strategic Dynamic Jockeying Between Two Parallel Queues

Strategic Dynamic Jockeying Between Two Parallel Queues Strategic Dynamic Jockeying Between Two Parallel Queues Amin Dehghanian 1 and Jeffrey P. Kharoufeh 2 Department of Industrial Engineering University of Pittsburgh 1048 Benedum Hall 3700 O Hara Street Pittsburgh,

More information

Learning Equilibrium as a Generalization of Learning to Optimize

Learning Equilibrium as a Generalization of Learning to Optimize Learning Equilibrium as a Generalization of Learning to Optimize Dov Monderer and Moshe Tennenholtz Faculty of Industrial Engineering and Management Technion Israel Institute of Technology Haifa 32000,

More information

Zero-Sum Stochastic Games An algorithmic review

Zero-Sum Stochastic Games An algorithmic review Zero-Sum Stochastic Games An algorithmic review Emmanuel Hyon LIP6/Paris Nanterre with N Yemele and L Perrotin Rosario November 2017 Final Meeting Dygame Dygame Project Amstic Outline 1 Introduction Static

More information

Risk-Sensitive and Average Optimality in Markov Decision Processes

Risk-Sensitive and Average Optimality in Markov Decision Processes Risk-Sensitive and Average Optimality in Markov Decision Processes Karel Sladký Abstract. This contribution is devoted to the risk-sensitive optimality criteria in finite state Markov Decision Processes.

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement

Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement Jefferson Huang March 21, 2018 Reinforcement Learning for Processing Networks Seminar Cornell University Performance

More information

Suggested solutions for the exam in SF2863 Systems Engineering. December 19,

Suggested solutions for the exam in SF2863 Systems Engineering. December 19, Suggested solutions for the exam in SF863 Systems Engineering. December 19, 011 14.00 19.00 Examiner: Per Enqvist, phone: 790 6 98 1. We can think of the support center as a Jackson network. The reception

More information

Reinforcement Learning and Deep Reinforcement Learning

Reinforcement Learning and Deep Reinforcement Learning Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q

More information

Lecture December 2009 Fall 2009 Scribe: R. Ring In this lecture we will talk about

Lecture December 2009 Fall 2009 Scribe: R. Ring In this lecture we will talk about 0368.4170: Cryptography and Game Theory Ran Canetti and Alon Rosen Lecture 7 02 December 2009 Fall 2009 Scribe: R. Ring In this lecture we will talk about Two-Player zero-sum games (min-max theorem) Mixed

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of

More information

Markov Decision Processes Infinite Horizon Problems

Markov Decision Processes Infinite Horizon Problems Markov Decision Processes Infinite Horizon Problems Alan Fern * * Based in part on slides by Craig Boutilier and Daniel Weld 1 What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T)

More information

Stochastic Optimization

Stochastic Optimization Chapter 27 Page 1 Stochastic Optimization Operations research has been particularly successful in two areas of decision analysis: (i) optimization of problems involving many variables when the outcome

More information

A monotonic property of the optimal admission control to an M/M/1 queue under periodic observations with average cost criterion

A monotonic property of the optimal admission control to an M/M/1 queue under periodic observations with average cost criterion A monotonic property of the optimal admission control to an M/M/1 queue under periodic observations with average cost criterion Cao, Jianhua; Nyberg, Christian Published in: Seventeenth Nordic Teletraffic

More information

Answers to selected exercises

Answers to selected exercises Answers to selected exercises A First Course in Stochastic Models, Henk C. Tijms 1.1 ( ) 1.2 (a) Let waiting time if passengers already arrived,. Then,, (b) { (c) Long-run fraction for is (d) Let waiting

More information

Discrete-Time Markov Decision Processes

Discrete-Time Markov Decision Processes CHAPTER 6 Discrete-Time Markov Decision Processes 6.0 INTRODUCTION In the previous chapters we saw that in the analysis of many operational systems the concepts of a state of a system and a state transition

More information

Computation and Dynamic Programming

Computation and Dynamic Programming Computation and Dynamic Programming Huseyin Topaloglu School of Operations Research and Information Engineering, Cornell University, Ithaca, New York 14853, USA June 25, 2010

More information

Notation. state space

Notation. state space Notation S state space i, j, w, z states Φ parameter space modelling decision rules and perturbations φ elements of parameter space δ decision rule and stationary, deterministic policy δ(i) decision rule

More information

Lecture Notes 7 Random Processes. Markov Processes Markov Chains. Random Processes

Lecture Notes 7 Random Processes. Markov Processes Markov Chains. Random Processes Lecture Notes 7 Random Processes Definition IID Processes Bernoulli Process Binomial Counting Process Interarrival Time Process Markov Processes Markov Chains Classification of States Steady State Probabilities

More information

Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems

Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems Sofía S. Villar Postdoctoral Fellow Basque Center for Applied Mathemathics (BCAM) Lancaster University November 5th,

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

A Simple Solution for the M/D/c Waiting Time Distribution

A Simple Solution for the M/D/c Waiting Time Distribution A Simple Solution for the M/D/c Waiting Time Distribution G.J.Franx, Universiteit van Amsterdam November 6, 998 Abstract A surprisingly simple and explicit expression for the waiting time distribution

More information

Time Reversibility and Burke s Theorem

Time Reversibility and Burke s Theorem Queuing Analysis: Time Reversibility and Burke s Theorem Hongwei Zhang Acknowledgement: this lecture is partially based on the slides of Dr. Yannis A. Korilis. Outline Time-Reversal

More information

Linear and Integer Programming - ideas

Linear and Integer Programming - ideas Linear and Integer Programming - ideas Paweł Zieliński Institute of Mathematics and Computer Science, Wrocław University of Technology, Poland pziel/ Toulouse, France 2012 Literature

More information

A general algorithm to compute the steady-state solution of product-form cooperating Markov chains

A general algorithm to compute the steady-state solution of product-form cooperating Markov chains A general algorithm to compute the steady-state solution of product-form cooperating Markov chains Università Ca Foscari di Venezia Dipartimento di Informatica Italy 2009 Presentation outline 1 Product-form

More information

1 Random Walks and Electrical Networks

1 Random Walks and Electrical Networks CME 305: Discrete Mathematics and Algorithms Random Walks and Electrical Networks Random walks are widely used tools in algorithm design and probabilistic analysis and they have numerous applications.

More information