Infinite-Horizon Average Reward Markov Decision Processes
Dan Zhang, Leeds School of Business, University of Colorado at Boulder, Spring 2012
Outline
- The average reward
- Classification of MDPs
- Optimality equations
- Value iteration in unichain models
- Policy iteration in unichain models
- Linear programming in unichain models
Average Reward Criterion
Let $\pi = (d_1, d_2, \ldots) \in \Pi^{HR}$. Starting at a state $s$, using policy $\pi$ leads to a sequence of state-action pairs $\{(X_t, Y_t)\}$. The sequence of rewards is given by $\{R_t \equiv r_t(X_t, Y_t) : t = 1, 2, \ldots\}$.
The average reward (or gain) from policy $\pi \in \Pi^{HR}$ starting in state $s$ is given by
\[ g^\pi(s) \equiv \lim_{N \to \infty} \frac{1}{N} E_s^\pi\left[ \sum_{t=1}^{N} r(X_t, Y_t) \right]. \]
The limit above may not exist, in which case we define
\[ g_-^\pi(s) \equiv \liminf_{N \to \infty} \frac{1}{N} E_s^\pi\left[ \sum_{t=1}^{N} r(X_t, Y_t) \right], \qquad g_+^\pi(s) \equiv \limsup_{N \to \infty} \frac{1}{N} E_s^\pi\left[ \sum_{t=1}^{N} r(X_t, Y_t) \right]. \]
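As a rough illustration of the definition, the finite-horizon average $\frac{1}{N} E_s^\pi[\sum_{t=1}^N r(X_t, Y_t)]$ can be computed exactly for a fixed stationary policy by propagating the state distribution forward. The sketch below assumes a hypothetical two-state chain `P_d` with rewards `r_d`; these numbers and the function name are illustrative and do not come from the slides.

```python
# Hypothetical two-state chain induced by a fixed stationary policy
# (illustrative numbers, not from the slides).
P_d = [[0.5, 0.5],
       [0.25, 0.75]]   # P_d[i][j] = probability of moving from state i to j
r_d = [0.0, 3.75]      # r_d[i] = one-step reward collected in state i


def finite_horizon_average(P, r, start, horizon):
    """Return (1/N) * E[sum of rewards for t = 1..N] when X_1 = start."""
    n = len(r)
    dist = [1.0 if i == start else 0.0 for i in range(n)]  # law of X_t
    total = 0.0
    for _ in range(horizon):
        total += sum(dist[i] * r[i] for i in range(n))      # E[r(X_t)]
        # Push the distribution one step forward: dist <- dist * P.
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return total / horizon


avg_from_0 = finite_horizon_average(P_d, r_d, start=0, horizon=5000)
avg_from_1 = finite_horizon_average(P_d, r_d, start=1, horizon=5000)
print(avg_from_0, avg_from_1)
```

For this chain the stationary distribution is (1/3, 2/3), so both averages approach 3.75 × 2/3 = 2.5 as the horizon grows, regardless of the starting state.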
Optimality Criteria
When $g^\pi(s)$ exists for all $s \in S$ and $\pi \in \Pi^{HR}$, a policy $\pi^*$ is average optimal if
\[ g^{\pi^*}(s) \ge g^\pi(s), \quad \forall s \in S, \ \pi \in \Pi^{HR}. \]
The value (or optimal gain) is defined by
\[ g^*(s) \equiv \sup_{\pi \in \Pi^{HR}} g^\pi(s), \quad s \in S. \]
Let $\pi^*$ be an average optimal policy; then $g^{\pi^*}(s) = g^*(s)$ for all $s \in S$.
Markov Policies
Theorem. Suppose $\pi \in \Pi^{HR}$. For each $s \in S$, there exists a $\pi' \in \Pi^{MR}$ (which possibly varies with $s$) for which
\[ g_+^{\pi'}(s) = g_+^{\pi}(s), \qquad g_-^{\pi'}(s) = g_-^{\pi}(s), \]
and $g^{\pi'}(s) = g^{\pi}(s)$ whenever $g_+^{\pi}(s) = g_-^{\pi}(s)$.
Assumptions
- Stationary rewards and transition probabilities: $r(s, a)$ and $p(j \mid s, a)$ do not vary with time
- Bounded rewards: $|r(s, a)| \le M < \infty$
- Finite state space
- Unichain: the transition matrix corresponding to every deterministic stationary policy is unichain (i.e., it consists of a single recurrent class plus a possibly empty set of transient states).
The Average Reward Optimality Equation: Unichain Models
For unichain models, it can be shown that every stationary policy has constant gain $g$.
Optimality equations:
\[ 0 = \max_{a \in A_s} \Big\{ r(s, a) - g + \sum_{j \in S} p(j \mid s, a) h(j) - h(s) \Big\}. \]
In matrix notation:
\[ 0 = \max_{d \in D} \{ r_d - g e + (P_d - I) h \} \equiv B(g, h). \]
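To make the operator $B(g, h)$ concrete, the following sketch evaluates it componentwise on a small model. The two-state, two-action data `r` and `P` are hypothetical and chosen only for illustration; for this model the pair $g = 2.5$, $h = (0, 5)$ satisfies $B(g, h) = 0$, while increasing $g$ makes every component negative.

```python
# Hypothetical 2-state, 2-action model (illustrative numbers only).
# r[s][a] is the reward; P[s][a][j] is the probability of moving from s to j.
r = [[0.0, 1.0], [4.0, 3.75]]
P = [[[0.5, 0.5], [0.75, 0.25]],
     [[0.5, 0.5], [0.25, 0.75]]]


def bellman_residual(g, h):
    """Return the vector B(g, h) with components
    max_a { r(s,a) - g + sum_j p(j|s,a) h(j) - h(s) }."""
    n = len(r)
    return [
        max(r[s][a] - g + sum(P[s][a][j] * h[j] for j in range(n)) - h[s]
            for a in range(len(r[s])))
        for s in range(n)
    ]


residual_at_solution = bellman_residual(2.5, [0.0, 5.0])   # solves the equation
residual_g_too_large = bellman_residual(3.0, [0.0, 5.0])   # g above the optimum
print(residual_at_solution, residual_g_too_large)
```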
The Average Reward Optimality Equation: Unichain Models
Theorem. Suppose $S$ is countable.
(i) If there exists a scalar $g$ and an $h \in V$ which satisfy $B(g, h) \le 0$, then $g e \ge g_+^*$;
(ii) If there exists a scalar $g$ and an $h \in V$ which satisfy $B(g, h) \ge 0$, then $g e \le \sup_{d \in D^{MD}} g_-^{d^\infty} \le g_-^*$;
(iii) If there exists a scalar $g$ and an $h \in V$ which satisfy $B(g, h) = 0$, then $g e = g^* = g_+^* = g_-^*$.
Existence of Solutions to the Optimality Equation: Unichain Models
Theorem. Suppose $S$ and $A_s$ are finite, $|r(s, a)| \le M < \infty$ for all $s, a$, and the model is unichain.
(i) There exists a $g \in R^1$ and an $h \in V$ for which
\[ 0 = \max_{d \in D} \{ r_d - g e + (P_d - I) h \}; \]
(ii) If $(g', h')$ is any other solution of the average reward optimality equation, then $g = g'$.
Existence of Optimal Policies: Unichain Models
A decision rule $d_h$ is $h$-improving if
\[ d_h \in \arg\max_{d \in D} \{ r_d + P_d h \}. \]
Theorem. Suppose there exists a scalar $g^*$ and an $h^* \in V$ for which $B(g^*, h^*) = 0$. Then if $d^*$ is $h^*$-improving, $(d^*)^\infty$ is average optimal.
Existence of Optimal Policies: Unichain Models
Theorem. Suppose $S$ and $A_s$ are finite, $r(s, a)$ is bounded, and the model is unichain. Then
(i) there exists a stationary average optimal policy;
(ii) there exists a scalar $g^*$ and an $h^* \in V$ for which $B(g^*, h^*) = 0$;
(iii) any stationary policy derived from an $h^*$-improving decision rule is average optimal;
(iv) $g^* e = g_+^* = g_-^*$.
Value Iteration
1. Select $v^0 \in V$, specify $\epsilon > 0$, and set $n = 0$.
2. For each $s \in S$, compute $v^{n+1}(s)$ by
\[ v^{n+1}(s) = \max_{a \in A_s} \Big\{ r(s, a) + \sum_{j \in S} p(j \mid s, a) v^n(j) \Big\}. \]
3. If $sp(v^{n+1} - v^n) < \epsilon$, go to step 4. Otherwise, increment $n$ by 1 and return to step 2.
4. For each $s \in S$, choose
\[ d_\epsilon(s) \in \arg\max_{a \in A_s} \Big\{ r(s, a) + \sum_{j \in S} p(j \mid s, a) v^{n+1}(j) \Big\} \]
and stop.
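The four steps above can be sketched in a few lines of Python. The model data `r` and `P` are hypothetical (a two-state, two-action example in which every stationary policy induces an aperiodic, irreducible chain), and `sp(v)` is taken to be the span, max(v) minus min(v). The returned gain estimate, the midpoint of the smallest and largest components of $v^{n+1} - v^n$, is an addition for illustration and is not part of the algorithm as stated on the slide.

```python
# Hypothetical 2-state, 2-action model (illustrative numbers only).
r = [[0.0, 1.0], [4.0, 3.75]]                 # r[s][a]
P = [[[0.5, 0.5], [0.75, 0.25]],              # P[s][a][j] = p(j | s, a)
     [[0.5, 0.5], [0.25, 0.75]]]


def span(v):
    """Span seminorm sp(v) = max(v) - min(v)."""
    return max(v) - min(v)


def backup(v):
    """Apply the max operator once; return new values and greedy actions."""
    n = len(r)
    values, actions = [], []
    for s in range(n):
        q = [r[s][a] + sum(P[s][a][j] * v[j] for j in range(n))
             for a in range(len(r[s]))]
        best = max(range(len(q)), key=q.__getitem__)
        values.append(q[best])
        actions.append(best)
    return values, actions


def value_iteration(eps=1e-8, max_iter=100000):
    v = [0.0] * len(r)                         # step 1: v^0 = 0
    for _ in range(max_iter):
        v_next, _ = backup(v)                  # step 2
        diff = [a - b for a, b in zip(v_next, v)]
        if span(diff) < eps:                   # step 3: span stopping rule
            _, policy = backup(v_next)         # step 4: greedy w.r.t. v^{n+1}
            gain_estimate = 0.5 * (max(diff) + min(diff))
            return policy, gain_estimate
        v = v_next
    raise RuntimeError("span criterion not met within max_iter iterations")


vi_policy, vi_gain = value_iteration()
print(vi_policy, vi_gain)
```

On this example the iterates themselves grow roughly linearly in $n$, while their successive differences settle near the optimal gain of 2.5, which is why the stopping rule is stated in terms of the span rather than a norm.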
Relative Value Iteration
1. Select $u^0 \in V$, choose $s^* \in S$, specify $\epsilon > 0$, set $w^0 = u^0 - u^0(s^*) e$, and set $n = 0$.
2. For each $s \in S$, compute $u^{n+1}(s)$ by
\[ u^{n+1}(s) = \max_{a \in A_s} \Big\{ r(s, a) + \sum_{j \in S} p(j \mid s, a) w^n(j) \Big\}. \]
Let $w^{n+1} = u^{n+1} - u^{n+1}(s^*) e$.
3. If $sp(u^{n+1} - u^n) < \epsilon$, go to step 4. Otherwise, increment $n$ by 1 and return to step 2.
4. For each $s \in S$, choose
\[ d_\epsilon(s) \in \arg\max_{a \in A_s} \Big\{ r(s, a) + \sum_{j \in S} p(j \mid s, a) u^n(j) \Big\} \]
and stop.
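A minimal sketch of the relative scheme follows, again on hypothetical two-state data. Because the reference component is subtracted at every step, the iterates stay bounded; on this example $u^{n+1}(s^*)$ settles near the gain and $w^{n+1}$ near a solution $h$ with $h(s^*) = 0$. Returning those two quantities is an illustrative addition, not something the slide prescribes.

```python
# Hypothetical 2-state, 2-action model (illustrative numbers only).
r = [[0.0, 1.0], [4.0, 3.75]]                 # r[s][a]
P = [[[0.5, 0.5], [0.75, 0.25]],              # P[s][a][j] = p(j | s, a)
     [[0.5, 0.5], [0.25, 0.75]]]


def backup(v):
    """Apply the max operator once; return new values and greedy actions."""
    n = len(r)
    values, actions = [], []
    for s in range(n):
        q = [r[s][a] + sum(P[s][a][j] * v[j] for j in range(n))
             for a in range(len(r[s]))]
        best = max(range(len(q)), key=q.__getitem__)
        values.append(q[best])
        actions.append(best)
    return values, actions


def relative_value_iteration(ref=0, eps=1e-8, max_iter=100000):
    """`ref` plays the role of the reference state s* (an arbitrary choice)."""
    u = [0.0] * len(r)                          # step 1: u^0 = 0
    w = [x - u[ref] for x in u]                 # w^0 = u^0 - u^0(s*) e
    for _ in range(max_iter):
        u_next, _ = backup(w)                   # step 2: u^{n+1} from w^n
        w_next = [x - u_next[ref] for x in u_next]
        diff = [a - b for a, b in zip(u_next, u)]
        if max(diff) - min(diff) < eps:         # step 3: sp(u^{n+1} - u^n)
            _, policy = backup(u)               # step 4: greedy w.r.t. u^n
            return policy, u_next[ref], w_next
        u, w = u_next, w_next
    raise RuntimeError("span criterion not met within max_iter iterations")


rvi_policy, rvi_gain, rvi_h = relative_value_iteration()
print(rvi_policy, rvi_gain, rvi_h)
```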
Policy Iteration
1. Set $n = 0$ and select an arbitrary decision rule $d_0 \in D$.
2. (Policy evaluation) Obtain a scalar $g_n$ and an $h_n \in V$ by solving
\[ 0 = r_{d_n} - g e + (P_{d_n} - I) h. \]
3. (Policy improvement) Choose $d_{n+1}$ to satisfy
\[ d_{n+1} \in \arg\max_{d \in D} [ r_d + P_d h_n ], \]
setting $d_{n+1} = d_n$ if possible.
4. If $d_{n+1} = d_n$, stop and set $d^* = d_n$. Otherwise increment $n$ by 1 and return to step 2.
Practical consideration: set $h_n(s_0) = 0$ for some fixed $s_0 \in S$.
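The following sketch walks through the evaluation and improvement steps on hypothetical two-state data. It assumes $s_0$ is state 0, so the unknowns of the evaluation system are $(g, h(1), \ldots, h(n-1))$, and it solves that system with a small hand-written Gaussian elimination to stay dependency-free. The improvement step changes an action only when another one is strictly better (up to a tolerance), which is one way to realize "setting $d_{n+1} = d_n$ if possible".

```python
# Hypothetical 2-state, 2-action model (illustrative numbers only).
r = [[0.0, 1.0], [4.0, 3.75]]                 # r[s][a]
P = [[[0.5, 0.5], [0.75, 0.25]],              # P[s][a][j] = p(j | s, a)
     [[0.5, 0.5], [0.25, 0.75]]]


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            factor = M[i][col] / M[col][col]
            for k in range(col, n + 1):
                M[i][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def evaluate(d):
    """Solve 0 = r_d - g e + (P_d - I) h with h(0) = 0; return (g, h)."""
    n = len(r)
    # Row s encodes g + h(s) - sum_j p(j|s,d(s)) h(j) = r(s,d(s)),
    # with column 0 holding the coefficient of g because h(0) is fixed at 0.
    A = [[1.0] + [(1.0 if s == j else 0.0) - P[s][d[s]][j]
                  for j in range(1, n)] for s in range(n)]
    b = [r[s][d[s]] for s in range(n)]
    x = solve_linear(A, b)
    return x[0], [0.0] + x[1:]


def improve(d, h, tol=1e-9):
    """Pick an h-improving rule, keeping d(s) unless strictly beaten."""
    n = len(r)
    new_d = []
    for s in range(n):
        q = [r[s][a] + sum(P[s][a][j] * h[j] for j in range(n))
             for a in range(len(r[s]))]
        best = d[s]
        for a in range(len(q)):
            if q[a] > q[best] + tol:
                best = a
        new_d.append(best)
    return new_d


def policy_iteration(d0):
    d = list(d0)
    while True:
        g, h = evaluate(d)            # step 2
        d_next = improve(d, h)        # step 3
        if d_next == d:               # step 4
            return d, g, h
        d = d_next


pi_policy, pi_gain, pi_h = policy_iteration([0, 0])
print(pi_policy, pi_gain, pi_h)
```

Starting from the rule that picks action 0 in both states, the first evaluation gives gain 2 and the run stops after one improvement at the rule (0, 1) with gain 2.5 and $h = (0, 5)$.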
Linear Programming
The primal linear program is given by
\[ \min_{g, h} \ g \]
subject to
\[ g + h(s) - \sum_{j \in S} p(j \mid s, a) h(j) \ge r(s, a), \quad s \in S, \ a \in A_s. \]
The dual linear program is given by
\[ \max_{x} \ \sum_{s \in S} \sum_{a \in A_s} r(s, a) x(s, a) \]
subject to
\[ \sum_{a \in A_j} x(j, a) - \sum_{s \in S} \sum_{a \in A_s} p(j \mid s, a) x(s, a) = 0, \quad j \in S, \]
\[ \sum_{s \in S} \sum_{a \in A_s} x(s, a) = 1, \]
\[ x(s, a) \ge 0, \quad s \in S, \ a \in A_s. \]
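Rather than calling an LP solver, the sketch below checks a candidate primal-dual pair by hand on hypothetical two-state data: the pair $(g, h) = (2.5, (0, 5))$ is tested for primal feasibility, and the candidate $x$, which places the stationary distribution (1/3, 2/3) of the rule (0, 1) on the actions that rule selects, is tested for dual feasibility. By LP duality, a feasible pair with equal objective values is optimal for both programs. The model data and the candidate solutions are assumptions made for illustration.

```python
# Hypothetical 2-state, 2-action model (illustrative numbers only).
r = [[0.0, 1.0], [4.0, 3.75]]                 # r[s][a]
P = [[[0.5, 0.5], [0.75, 0.25]],              # P[s][a][j] = p(j | s, a)
     [[0.5, 0.5], [0.25, 0.75]]]

# Candidate primal solution (g, h) and candidate dual solution x[s][a].
g, h = 2.5, [0.0, 5.0]
x = [[1.0 / 3.0, 0.0],
     [0.0, 2.0 / 3.0]]

n = len(r)
pairs = [(s, a) for s in range(n) for a in range(len(r[s]))]

# Primal constraints: g + h(s) - sum_j p(j|s,a) h(j) - r(s,a) >= 0.
primal_slack = {
    (s, a): g + h[s] - sum(P[s][a][j] * h[j] for j in range(n)) - r[s][a]
    for (s, a) in pairs
}

# Dual balance constraints: sum_a x(j,a) - sum_{s,a} p(j|s,a) x(s,a) = 0.
balance = [
    sum(x[j]) - sum(P[s][a][j] * x[s][a] for (s, a) in pairs)
    for j in range(n)
]

total_mass = sum(x[s][a] for (s, a) in pairs)            # must equal 1
dual_objective = sum(r[s][a] * x[s][a] for (s, a) in pairs)
primal_objective = g

print(primal_slack, balance, total_mass, dual_objective, primal_objective)
```

In this example $x(s, a)$ is positive only where the corresponding primal constraint is tight, which is the complementary slackness pattern one would expect at an optimum.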