Introduction to Approximate Dynamic Programming
1 Introduction to Approximate Dynamic Programming
Dan Zhang
Leeds School of Business, University of Colorado at Boulder
Spring 2012
2 Key References
- Bertsekas, D.P. Chapter 6, Approximate Dynamic Programming, in Dynamic Programming and Optimal Control, 3rd Edition, Volume II. Available online.
- Bertsekas, D.P., and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.
- Powell, W. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley-Interscience.
3 Outline
- Setup: infinite-horizon discounted MDPs
- Policy evaluation via Monte Carlo simulation
- Q-learning
- Linear programming based approximate dynamic programming
- Approximate policy iteration
- Rollout policies
- State aggregation
4 Setup
- Infinite-horizon discounted MDP
- Transition probabilities $p_{ij}(u)$
- Cost: $g(i, u, j)$ with $|g(i, u, j)| < \infty$
- Discounting: $\alpha \in [0, 1)$
- Discrete state space: $S$ is finite
- Action space $U$
5 MDP Model
Let $\pi \in \Pi^{MD}$. The states visited under policy $\pi$ follow a Markov chain with transition probabilities $P_\pi = \{p_{ij}(\pi(i)) : i, j \in S\}$.
Total discounted cost:
$$J_\pi(i) = \lim_{T \to \infty} E\left[ \sum_{t=0}^{T} \alpha^t g(i_t^\pi, \pi(i_t^\pi), i_{t+1}^\pi) \right].$$
An optimal policy can be computed by solving the optimality equations:
$$J(i) = \min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J(j) \right].$$
6 Value Iteration
The optimality equation can be written as $J = TJ$, where
$$[TJ](i) = \min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J(j) \right].$$
The optimal value function $J^*$ is a fixed point of the operator $T$. Under suitable technical conditions, $J^* = \lim_{k \to \infty} T^k J_1$ for any initial vector $J_1$.
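A minimal sketch of value iteration, in Python with NumPy. The three-state, two-action MDP with random transition and cost data is purely illustrative (not from the slides); the loop applies the operator $T$ until the sup-norm change is negligible.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions, random data for illustration.
alpha = 0.9
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# p[u, i, j]: probability of moving from i to j under action u
p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)

# g[u, i, j]: one-stage cost of the transition (i, u) -> j
g = rng.random((n_actions, n_states, n_states))

# Value iteration: repeatedly apply T until (near) convergence.
J = np.zeros(n_states)
for _ in range(10_000):
    Q = np.einsum('uij,uij->ui', p, g + alpha * J)  # expected cost of each (u, i)
    TJ = Q.min(axis=0)                              # [TJ](i) = min_u Q[u, i]
    if np.max(np.abs(TJ - J)) < 1e-10:
        break
    J = TJ

print("Approximate fixed point J*:", J)
```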
7 Policy Iteration
Define
$$[T_\pi J](i) = \sum_{j \in S} p_{ij}(\pi(i)) \left[ g(i, \pi(i), j) + \alpha J(j) \right].$$
It can be shown that the discounted cost $J_\pi$ incurred by policy $\pi$ is a fixed point of the operator $T_\pi$; i.e., $J_\pi = T_\pi J_\pi$.
8 Policy Iteration (Continued)
Policy evaluation: Compute the discounted cost $J_{\pi_k}$ incurred by policy $\pi_k$, possibly by solving the system of linear equations
$$J_{\pi_k} = g_{\pi_k} + \alpha P_{\pi_k} J_{\pi_k}.$$
Policy improvement: For all $i \in S$, define policy $\pi_{k+1}$ by
$$\pi_{k+1}(i) = \operatorname*{argmin}_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J_{\pi_k}(j) \right].$$
Stop if $\pi_{k+1} = \pi_k$.
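A sketch of both steps on the same kind of illustrative toy MDP as above (all data hypothetical). Policy evaluation solves the linear system exactly; improvement is a greedy one-step lookahead.

```python
import numpy as np

# Same hypothetical toy MDP as in the value iteration sketch.
alpha, n_states, n_actions = 0.9, 3, 2
rng = np.random.default_rng(0)
p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)
g = rng.random((n_actions, n_states, n_states))

policy = np.zeros(n_states, dtype=int)  # initial policy: always action 0
while True:
    # Policy evaluation: solve (I - alpha * P_pi) J = g_pi exactly.
    idx = np.arange(n_states)
    P_pi = p[policy, idx]                               # P_pi[i, j]
    g_pi = np.einsum('ij,ij->i', P_pi, g[policy, idx])  # expected one-stage cost
    J = np.linalg.solve(np.eye(n_states) - alpha * P_pi, g_pi)

    # Policy improvement: greedy one-step lookahead against J.
    Q = np.einsum('uij,uij->ui', p, g + alpha * J)
    new_policy = Q.argmin(axis=0)
    if np.array_equal(new_policy, policy):  # stop if pi_{k+1} = pi_k
        break
    policy = new_policy

print("Optimal policy:", policy, "with cost-to-go:", J)
```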
9 Three Curses of Dimensionality (Powell, 2007)
- State space is large: computing and storing the value function $v(\cdot)$ can be difficult for both the value iteration and policy iteration algorithms.
- Action space is large: the minimization over actions in both value iteration and policy improvement becomes difficult.
- Computing expectations with respect to the transition probability matrix can be difficult when the system dynamics are complex.
10 Policy Evaluation with Monte Carlo Simulation
Policy evaluation requires solving linear equations of the form
$$J_\pi = g_\pi + \alpha P_\pi J_\pi,$$
which is difficult when $P_\pi$ is large or unknown.
Idea: Simulate the sequence of states $\{i_0, i_1, \ldots\}$ by following a particular policy $\pi$. An approximation $\tilde{J}$ of $J_\pi$ can be updated as follows:
$$\tilde{J}(i_k) \leftarrow (1 - \gamma) \tilde{J}(i_k) + \gamma \left[ g(i_k, \pi(i_k), i_{k+1}) + \alpha \tilde{J}(i_{k+1}) \right].$$
Why does this make sense?
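A tabular sketch of this update on the illustrative toy MDP (all data and the step size $\gamma$ are hypothetical choices, not from the slides):

```python
import numpy as np

# Hypothetical toy MDP (as in the earlier sketches) and a fixed policy pi.
alpha, n_states, n_actions = 0.9, 3, 2
rng = np.random.default_rng(0)
p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)
g = rng.random((n_actions, n_states, n_states))
pi = np.zeros(n_states, dtype=int)       # policy to evaluate: always action 0

J = np.zeros(n_states)                   # tabular approximation of J_pi
gamma = 0.01                             # step size (the slide's gamma)
i = 0
for _ in range(200_000):
    u = pi[i]
    j = rng.choice(n_states, p=p[u, i])  # simulate one transition under pi
    # Smoothed update toward the sampled one-step target.
    J[i] = (1 - gamma) * J[i] + gamma * (g[u, i, j] + alpha * J[j])
    i = j

print("Simulation-based estimate of J_pi:", J)
```

As to why this makes sense: in expectation over the sampled successor, the bracketed target equals $[T_\pi \tilde{J}](i_k)$, so the update is a stochastic approximation to the fixed-point iteration $J = T_\pi J$.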
11 Policy Evaluation with Monte Carlo Simulation (Continued)
The procedure requires storing $\tilde{J}$, which can be problematic when the state space $S$ is large.
Idea: Use a parameterized approximation architecture:
$$\tilde{J}(i, r) = \sum_{l=1}^{L} r_l \phi_l(i),$$
where the $\phi_l(\cdot)$'s are pre-specified functions ("feature functions") and $r$ is a vector of adjustable parameters.
How to update $r$?
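The slide leaves the question open; one standard answer, covered in Bertsekas and Tsitsiklis (1996), is temporal-difference learning, TD(0), with a linear architecture. A minimal sketch under the same hypothetical toy MDP (the random feature matrix `phi` is an illustrative assumption):

```python
import numpy as np

# Hypothetical toy MDP and fixed policy, as before.
alpha, n_states, n_actions = 0.9, 3, 2
rng = np.random.default_rng(0)
p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)
g = rng.random((n_actions, n_states, n_states))
pi = np.zeros(n_states, dtype=int)

L = 2
phi = rng.random((n_states, L))  # hypothetical feature matrix, phi[i, l]
r = np.zeros(L)                  # adjustable parameters

i = 0
for k in range(1, 300_000):
    u = pi[i]
    j = rng.choice(n_states, p=p[u, i])
    # Temporal-difference error of the sampled transition, with J(i, r) = phi(i) @ r.
    delta = g[u, i, j] + alpha * phi[j] @ r - phi[i] @ r
    r += (1.0 / k) * delta * phi[i]  # diminishing step size
    i = j

print("Fitted parameters r:", r, "=> J(i, r):", phi @ r)
```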
12 Q-Learning
Q-learning is based on an alternative representation of the optimality equation:
$$J(i) = \min_{u \in U(i)} Q(i, u), \qquad Q(i, u) = \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J(j) \right].$$
What is the interpretation of $Q(i, u)$?
13 Q-Learning (Continued)
Key idea: Evaluate $Q(i, u)$ by using a stochastic approximation iteration:
$$Q(i_k, u_k) \leftarrow (1 - \gamma) Q(i_k, u_k) + \gamma \left[ g(i_k, u_k, s_k) + \alpha \min_{v \in U(s_k)} Q(s_k, v) \right],$$
where the successor state $s_k$ is sampled according to the probabilities $\{p_{i_k j}(u_k) : j \in S\}$.
What is the benefit of Q-learning?
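A sketch of the iteration on the illustrative toy MDP. The uniform exploration rule and per-pair diminishing step size are common hypothetical choices, not prescribed by the slides:

```python
import numpy as np

# Hypothetical toy MDP, as before.
alpha, n_states, n_actions = 0.9, 3, 2
rng = np.random.default_rng(0)
p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)
g = rng.random((n_actions, n_states, n_states))

Q = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))
i = 0
for _ in range(500_000):
    u = rng.integers(n_actions)          # explore so every (i, u) is visited
    s = rng.choice(n_states, p=p[u, i])  # successor sampled from p_{i.}(u)
    visits[i, u] += 1
    gamma = 1.0 / visits[i, u]           # per-pair diminishing step size
    target = g[u, i, s] + alpha * Q[s].min()
    Q[i, u] = (1 - gamma) * Q[i, u] + gamma * target
    i = s

print("Greedy policy from Q:", Q.argmin(axis=1))
```

Note that the update itself uses only sampled transitions, and the greedy policy is read directly off $Q$ without computing any expectations over $p_{ij}(u)$, which hints at the slide's closing question.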
14 Linear Programming Approach
Let $\theta(i)$ be positive scalars such that $\sum_{i \in S} \theta(i) = 1$. The linear programming formulation is given by
$$\max_J \; \sum_{i \in S} \theta(i) J(i)$$
$$\text{s.t.} \quad J(i) - \alpha \sum_{j \in S} p_{ij}(u) J(j) \le \sum_{j \in S} p_{ij}(u) g(i, u, j), \quad i \in S, \; u \in U(i).$$
The dual linear program is given by
$$\min_x \; \sum_{i \in S} \sum_{u \in U(i)} \sum_{j \in S} p_{ij}(u) g(i, u, j) \, x(i, u)$$
$$\text{s.t.} \quad \sum_{u \in U(i)} x(i, u) - \alpha \sum_{j \in S} \sum_{u \in U(j)} p_{ji}(u) x(j, u) = \theta(i), \quad i \in S,$$
$$x(i, u) \ge 0, \quad i \in S, \; u \in U(i).$$
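A sketch of the primal LP on the illustrative toy MDP, solved with SciPy's `linprog` (the uniform state-relevance weights `theta` are a hypothetical choice):

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy MDP, as before.
alpha, n_states, n_actions = 0.9, 3, 2
rng = np.random.default_rng(0)
p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)
g = rng.random((n_actions, n_states, n_states))
theta = np.full(n_states, 1.0 / n_states)  # state-relevance weights

# One inequality per (i, u): J(i) - alpha * sum_j p_ij(u) J(j) <= E[g | i, u].
rows, rhs = [], []
for i in range(n_states):
    for u in range(n_actions):
        row = -alpha * p[u, i]
        row[i] += 1.0
        rows.append(row)
        rhs.append(p[u, i] @ g[u, i])

# linprog minimizes, so maximize theta @ J by minimizing -theta @ J;
# J is sign-unrestricted, hence bounds=(None, None).
res = linprog(c=-theta, A_ub=np.array(rows), b_ub=np.array(rhs),
              bounds=(None, None))
print("LP solution J*:", res.x)
```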
15 Linear Programming Approach (Continued)
The size of the LP can be reduced by using a parameterized approximation architecture (Schweitzer and Seidmann, 1985):
$$\tilde{J}(i, r) = \sum_{l=1}^{L} r_l \phi_l(i).$$
Two solution approaches:
- Simulation-based approach (de Farias and Van Roy, 2003; also Powell, 2007)
- Mathematical programming-based approach (Adelman, 2003; Adelman, 2007)
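A sketch of the reduction: substituting $J = \Phi r$ into the primal LP above leaves the same constraints but only $L$ variables. All data here are hypothetical; the constant feature column is a common device (it helps keep the reduced LP feasible) rather than something stated on the slides.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy MDP and features, as before.
alpha, n_states, n_actions, L = 0.9, 3, 2, 2
rng = np.random.default_rng(0)
p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)
g = rng.random((n_actions, n_states, n_states))
theta = np.full(n_states, 1.0 / n_states)
phi = np.column_stack([np.ones(n_states), rng.random(n_states)])  # constant feature included

# Substitute J = phi @ r: one row per (i, u), but only L variables r.
A_ub = np.array([phi[i] - alpha * p[u, i] @ phi
                 for i in range(n_states) for u in range(n_actions)])
b_ub = np.array([p[u, i] @ g[u, i]
                 for i in range(n_states) for u in range(n_actions)])
res = linprog(c=-(theta @ phi), A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
print("ALP parameters r:", res.x, "=> J(i, r):", phi @ res.x)
```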
16 Approximate Policy Iteration
Exact policy evaluation can be difficult when the problem is large.
Idea: Carry out policy iteration approximately.
17 Approximate Policy Iteration (Continued)
Approximate policy evaluation: Simulate a sequence of states $\{i_0, i_1, \ldots\}$ by following policy $\pi_k$. Let $C_l = \sum_{t=l}^{\infty} \alpha^{t-l} g(i_t, \pi_k(i_t), i_{t+1})$ for all $l = 0, 1, \ldots$. Let $r_k$ be the solution to the regression problem
$$\min_r \sum_{t=0}^{\infty} \left[ \tilde{J}(i_t, r) - C_t \right]^2 = \min_r \sum_{t=0}^{\infty} \left[ \sum_{l=1}^{L} r_l \phi_l(i_t) - C_t \right]^2.$$
Approximate policy improvement: For all $i \in S$, define policy $\pi_{k+1}$ by
$$\pi_{k+1}(i) = \operatorname*{argmin}_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha \tilde{J}(j, r_k) \right].$$
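A sketch of the evaluation step on the illustrative toy MDP. Truncating the infinite discounted sums at a horizon `T` and the trajectory length `N` are hypothetical implementation choices; the least-squares fit uses `np.linalg.lstsq`.

```python
import numpy as np

# Hypothetical toy MDP, policy pi_k, and features, as before.
alpha, n_states, n_actions, L = 0.9, 3, 2, 2
rng = np.random.default_rng(0)
p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)
g = rng.random((n_actions, n_states, n_states))
pi_k = np.zeros(n_states, dtype=int)
phi = rng.random((n_states, L))

# Simulate one long trajectory under pi_k; truncate the infinite sums at T.
N, T = 2_000, 200
states, costs = [0], []
for _ in range(N + T):
    i = states[-1]
    u = pi_k[i]
    j = rng.choice(n_states, p=p[u, i])
    costs.append(g[u, i, j])
    states.append(j)

# C_l: (truncated) discounted cost-to-go observed from step l onward.
discounts = alpha ** np.arange(T)
C = np.array([discounts @ costs[l:l + T] for l in range(N)])

# Least-squares fit of phi(i_t) @ r to the observed costs C_t.
Phi = phi[np.array(states[:N])]
r_k, *_ = np.linalg.lstsq(Phi, C, rcond=None)
print("Fitted r_k:", r_k)
```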
18 Rollout Policy
Idea: Improve the performance of a given policy.
Given policy $\pi$, let $\pi'$ be defined such that for each $i \in S$,
$$\pi'(i) = \operatorname*{argmin}_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J_\pi(j) \right].$$
It can then be shown that $\pi'$ is at least as good as $\pi$. Simulation may be used to estimate $J_\pi$.
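A sketch of a rollout on the illustrative toy MDP, estimating $J_\pi$ by Monte Carlo as the slide suggests (the rollout counts and truncation horizon are hypothetical choices):

```python
import numpy as np

# Hypothetical toy MDP and base policy pi, as before.
alpha, n_states, n_actions = 0.9, 3, 2
rng = np.random.default_rng(0)
p = rng.random((n_actions, n_states, n_states))
p /= p.sum(axis=2, keepdims=True)
g = rng.random((n_actions, n_states, n_states))
pi = np.zeros(n_states, dtype=int)

def estimate_J_pi(i, n_rollouts=200, horizon=100):
    """Monte Carlo estimate of the discounted cost of the base policy from state i."""
    total = 0.0
    for _ in range(n_rollouts):
        s, acc = i, 0.0
        for t in range(horizon):
            j = rng.choice(n_states, p=p[pi[s], s])
            acc += alpha ** t * g[pi[s], s, j]
            s = j
        total += acc
    return total / n_rollouts

def rollout_action(i):
    # One-step lookahead against the simulated cost-to-go of the base policy.
    J_hat = np.array([estimate_J_pi(j) for j in range(n_states)])
    q = [p[u, i] @ (g[u, i] + alpha * J_hat) for u in range(n_actions)]
    return int(np.argmin(q))

print("Rollout action in state 0:", rollout_action(0))
```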
19 State Aggregation
Idea: Partition the state space into a number of subsets and assume the value function is constant over each subset.
State aggregation can be combined with other approximation methods.
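A minimal sketch of how aggregation fits the parameterized architecture used throughout: the feature functions become indicators of a (hypothetical) partition, so $\tilde{J}(i, r)$ is constant on each subset.

```python
import numpy as np

# Hypothetical partition of 6 states into 3 groups.
n_states = 6
groups = np.array([0, 0, 1, 1, 2, 2])  # group label of each state
n_groups = groups.max() + 1

# phi[i, l] = 1 iff state i belongs to group l, so J(i, r) = r[groups[i]]
# is constant over each subset of the partition.
phi = np.eye(n_groups)[groups]
print(phi)
```

Plugging this `phi` into any of the parameterized sketches above (TD(0), the reduced LP, or the regression step) yields the aggregated version of that method.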
20 Discussion
- ADP aims to alleviate the computational effort required to solve large-scale dynamic programs.
- The area is still in its infancy; a commonly accepted definition of ADP does not seem to exist.
- Open problem: How to specify feature functions? Research to date usually assumes they are fixed in advance. Recent advances: Klabjan and Adelman (2007).
- Developing efficient solution approaches for practical applications could be valuable.
More information