Abstract Dynamic Programming

Abstract Dynamic Programming
Dimitri P. Bertsekas
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
Overview of the Research Monograph "Abstract Dynamic Programming," Athena Scientific, 2013

Abstract Dynamic Programming

Main Objective
Unification of the core theory and algorithms of total cost sequential decision problems
Simultaneous treatment of a variety of problems: MDP, sequential games, sequential minimax, multiplicative cost, risk-sensitive, etc.

Methodology
Define a problem by its "mathematical signature": the mapping defining the optimality equation
The structure of this mapping (contraction, monotonicity, etc.) determines the analytical and algorithmic theory of the problem
Fixed point theory: an important connection

Three Main Classes of Total Cost DP Problems

Discounted: Discount factor α < 1 and bounded cost per stage
Dates to the 50s (Bellman, Shapley)
Nicest results

Undiscounted (Positive and Negative DP): N-step horizon costs increase or decrease monotonically with N
Dates to the 60s (Blackwell, Strauch)
Results not nearly as powerful as in the discounted case

Stochastic Shortest Path (SSP): Also known as first passage or transient programming
Aim is to reach a termination state at minimum expected cost
Dates to the 60s (Eaton-Zadeh, Derman, Pallu de la Barrière)
Results are almost as strong as for the discounted case (under appropriate conditions)

Corresponding Abstract Models

Contractive: Patterned after discounted problems
The DP mapping is a sup-norm contraction (Denardo 1967)

Monotone Increasing/Decreasing: Patterned after positive and negative DP
No reliance on contraction properties, just monotonicity (Bertsekas 1977)

Semicontractive: Patterned after stochastic shortest path
Some policies are "regular"/contractive; others are not, but assumptions are imposed so that optimal "regular" policies exist
New research, inspired by SSP, where the "regular" policies are the "proper" ones (the ones that terminate w.p.1)

Outline
1. Problem Formulation
2. Results Overview
3. Semicontractive Models
4. Affine Monotonic/Risk-Sensitive Models

Abstract DP Mappings

State and control spaces: X, U
Control constraint: u ∈ U(x)
Stationary policies: µ : X → U, with µ(x) ∈ U(x) for all x

Monotone Mappings
Abstract monotone mapping H : X × U × E(X) → ℜ:
J ≤ J' ⇒ H(x, u, J) ≤ H(x, u, J'), ∀ x, u
where E(X) is the set of functions J : X → [−∞, ∞]

Mappings T_µ and T
(T_µ J)(x) = H(x, µ(x), J), ∀ x ∈ X, J ∈ R(X)
(T J)(x) = inf_µ (T_µ J)(x) = inf_{u ∈ U(x)} H(x, u, J), ∀ x ∈ X, J ∈ R(X)

Stochastic Optimal Control - MDP example:
(T J)(x) = inf_{u ∈ U(x)} E{ g(x, u, w) + α J(f(x, u, w)) }
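
As a concrete illustration (my own sketch, not from the talk), the abstract mappings H, T_µ, and T can be coded directly for a small finite MDP; the transition matrices, stage costs, and discount factor below are hypothetical.

```python
import numpy as np

# A tiny hypothetical MDP: 3 states, 2 controls, discount alpha.
# P[u] is the transition matrix under control u, g[u] the expected stage cost.
alpha = 0.9
P = {0: np.array([[0.8, 0.2, 0.0],
                  [0.1, 0.7, 0.2],
                  [0.0, 0.3, 0.7]]),
     1: np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.2, 0.2, 0.6]])}
g = {0: np.array([2.0, 1.0, 3.0]),
     1: np.array([0.5, 2.5, 1.5])}

def H(x, u, J):
    """Abstract mapping H(x, u, J) = E[ g(x,u,w) + alpha * J(f(x,u,w)) ]."""
    return g[u][x] + alpha * P[u][x] @ J

def T_mu(J, mu):
    """(T_mu J)(x) = H(x, mu(x), J) for a stationary policy mu: X -> U."""
    return np.array([H(x, mu[x], J) for x in range(len(J))])

def T(J):
    """(T J)(x) = min_u H(x, u, J)."""
    return np.array([min(H(x, u, J) for u in P) for x in range(len(J))])

J = np.zeros(3)                    # J-bar = 0 for stochastic optimal control
print(T(J))                        # one application of the Bellman operator
print(T_mu(J, mu=[0, 1, 1]))       # one application of T_mu for a fixed policy
```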

Abstract Problem Formulation

Abstract Optimization Problem
Given an initial function J̄ ∈ R(X) and a policy µ, define
J_µ(x) = limsup_{N→∞} (T_µ^N J̄)(x), ∀ x ∈ X
Find J*(x) = inf_µ J_µ(x) and an optimal µ attaining the infimum

Notes
Theory revolves around fixed point properties of the mappings T_µ and T:
J_µ = T_µ J_µ,  J* = T J*
These are generalized forms of Bellman's equation
Algorithms are special cases of fixed point algorithms
We restrict attention (initially) to issues involving only stationary policies
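
A minimal self-contained sketch (hypothetical two-state chain and numbers) of how J_µ arises as the limit of repeated applications of T_µ to J̄, and of the fixed point property J_µ = T_µ J_µ:

```python
import numpy as np

# Hypothetical 2-state discounted chain under a fixed policy mu:
# (T_mu J)(x) = g_mu(x) + alpha * sum_y P_mu[x, y] * J(y)
alpha = 0.95
P_mu = np.array([[0.9, 0.1],
                 [0.4, 0.6]])
g_mu = np.array([1.0, 2.0])

def T_mu(J):
    return g_mu + alpha * P_mu @ J

# J_mu(x) = lim_{N -> inf} (T_mu^N J_bar)(x), here with J_bar = 0
J = np.zeros(2)
for _ in range(2000):
    J = T_mu(J)

# The limit satisfies Bellman's equation for the policy: J_mu = T_mu J_mu
print(J)
print(np.max(np.abs(T_mu(J) - J)))   # ~0: J is (numerically) a fixed point

# For this linear T_mu the fixed point is also available in closed form
J_exact = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)
print(J_exact)
```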

Examples With a Dynamic System x_{k+1} = f(x_k, µ(x_k), w_k)

Stochastic Optimal Control
J̄(x) ≡ 0,  (T_µ J)(x) = E_w{ g(x, µ(x), w) + α J(f(x, µ(x), w)) }
J_µ(x_0) = lim_{N→∞} E_{w_0,w_1,...}{ Σ_{k=0}^N α^k g(x_k, µ(x_k), w_k) }

Minimax - Sequential Games
J̄(x) ≡ 0,  (T_µ J)(x) = sup_{w ∈ W(x)} { g(x, µ(x), w) + α J(f(x, µ(x), w)) }
J_µ(x_0) = lim_{N→∞} sup_{w_0,w_1,...} Σ_{k=0}^N α^k g(x_k, µ(x_k), w_k)

Multiplicative Cost Problems
J̄(x) ≡ 1,  (T_µ J)(x) = E_w{ g(x, µ(x), w) J(f(x, µ(x), w)) }
J_µ(x_0) = lim_{N→∞} E_{w_0,w_1,...}{ Π_{k=0}^N g(x_k, µ(x_k), w_k) }
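
To illustrate how only H changes across these examples, here is a hypothetical sketch with a toy controlled system (made-up dynamics f, cost g, and disturbance distribution) specialized to the stochastic, minimax, and multiplicative cases:

```python
import numpy as np

# Toy setup: states 0..2, disturbance w in {0, 1} with probs (0.5, 0.5),
# next state f(x, u, w) and stage cost g(x, u, w) given by small formulas.
alpha = 0.9
W_PROBS = [0.5, 0.5]
f = lambda x, u, w: (x + u + w) % 3          # hypothetical dynamics
g = lambda x, u, w: 1.0 + 0.5 * x + u + w    # hypothetical cost (>= 1)

def H_stochastic(x, u, J):
    # E_w[ g + alpha * J(f) ]
    return sum(p * (g(x, u, w) + alpha * J[f(x, u, w)])
               for w, p in enumerate(W_PROBS))

def H_minimax(x, u, J):
    # sup_w [ g + alpha * J(f) ]
    return max(g(x, u, w) + alpha * J[f(x, u, w)] for w in (0, 1))

def H_multiplicative(x, u, J):
    # E_w[ g * J(f) ], used with J_bar = 1
    return sum(p * g(x, u, w) * J[f(x, u, w)] for w, p in enumerate(W_PROBS))

def T(H, J):
    return np.array([min(H(x, u, J) for u in (0, 1)) for x in range(3)])

print(T(H_stochastic, np.zeros(3)))       # J_bar = 0
print(T(H_minimax, np.zeros(3)))          # J_bar = 0
print(T(H_multiplicative, np.ones(3)))    # J_bar = 1
```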

Examples With a Markov Chain: Transition Probs. p_{i_k i_{k+1}}(u_k)

Finite-State Markov and Semi-Markov Decision Processes
J̄(x) ≡ 0,  (T_µ J)(i) = Σ_{j=1}^n p_{ij}(µ(i)) ( g(i, µ(i), j) + α_{ij}(µ(i)) J(j) )
J_µ(i_0) = limsup_{N→∞} E{ Σ_{k=0}^N ( α_{i_0 i_1}(µ(i_0)) ⋯ α_{i_{k-1} i_k}(µ(i_{k-1})) ) g(i_k, µ(i_k), i_{k+1}) }
where the α_{ij}(u) are state- and control-dependent discount factors

Undiscounted Exponential Cost
J̄(x) ≡ 1,  (T_µ J)(i) = Σ_{j=1}^n p_{ij}(µ(i)) e^{h(i, µ(i), j)} J(j)
J_µ(i_0) = limsup_{N→∞} E{ e^{h(i_0, µ(i_0), i_1)} ⋯ e^{h(i_N, µ(i_N), i_{N+1})} }

Contractive (C) Models
All T_µ are contractions within the set of bounded functions B(X), w.r.t. a common (weighted) sup-norm and a common contraction modulus (e.g., discounted problems)

Monotone Increasing (I) and Monotone Decreasing (D)
(I): J̄ ≤ T_µ J̄ (e.g., negative DP problems)
(D): J̄ ≥ T_µ J̄ (e.g., positive DP problems)

Semicontractive (SC)
T_µ has "contraction-like" properties for some µ - to be discussed (e.g., SSP problems)

Semicontractive Nonnegative (SC+)
Semicontractive, and in addition J̄ ≥ 0 and J ≥ 0 ⇒ H(x, u, J) ≥ 0, ∀ x, u (e.g., affine monotonic, exponential/risk-sensitive problems)
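
A quick numerical check (toy random data, my own illustration) that a discounted T is a sup-norm contraction with modulus α:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n, m = 0.9, 4, 3                      # discount, #states, #controls
P = rng.dirichlet(np.ones(n), size=(m, n))   # P[u, x, :] transition probs
g = rng.uniform(0, 5, size=(m, n))           # g[u, x] expected stage costs

def T(J):
    # (T J)(x) = min_u [ g(x,u) + alpha * sum_y p_xy(u) J(y) ]
    return (g + alpha * P @ J).min(axis=0)

for _ in range(5):
    J1, J2 = rng.normal(size=n), rng.normal(size=n)
    lhs = np.max(np.abs(T(J1) - T(J2)))      # ||T J1 - T J2||_inf
    rhs = alpha * np.max(np.abs(J1 - J2))    # alpha * ||J1 - J2||_inf
    print(lhs <= rhs + 1e-12, lhs, rhs)
```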

2. Results Overview

Bellman's Equation

Optimality/Bellman's Equation
J* = T J* always holds under our assumptions

Bellman's Equation for Policies: Cases (C), (I), and (D)
J_µ = T_µ J_µ always holds

Bellman's Equation for Policies: Case (SC)
J_µ = T_µ J_µ holds only for "regular" µ
J_µ may take infinite values for "irregular" µ

Uniqueness of Solution of Bellman's Equations

Case (C)
T is a contraction within B(X) and J* is its unique fixed point

Cases (I), (D)
T has multiple fixed points (some partial results hold)

Case (SC)
J* is the unique fixed point of T within a subset of R(X) with "regular" behavior

Optimality Conditions (A Complicated Story)

Cases (C), (I), and (SC - under one set of assumptions)
µ* is optimal if and only if T_µ* J* = T J*

Case (SC - under another set of assumptions)
A "regular" µ* is optimal if and only if T_µ* J* = T J*

Case (D)
µ* is optimal if and only if T_µ* J_µ* = T J_µ*

Asynchronous Convergence of Value Iteration: J_{k+1} = T J_k

Case (C): T^k J → J* for all J ∈ B(X)
Case (D): T^k J̄ → J*
Case (I): T^k J̄ → J* under additional "compactness" conditions
Case (SC): T^k J → J* for all J ∈ R(X) within a set of "regular" behavior
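
A minimal value iteration sketch for the contractive case (hypothetical finite MDP data); in case (C) the iterates J_{k+1} = T J_k converge to J* from any starting J:

```python
import numpy as np

alpha = 0.9
# Hypothetical MDP: P[u, x, y] transition probs, g[u, x] expected stage costs
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
g = np.array([[1.0, 3.0],
              [2.0, 0.5]])

def T(J):
    return (g + alpha * P @ J).min(axis=0)

def value_iteration(J, tol=1e-10, max_iter=10_000):
    """Iterate J_{k+1} = T J_k until the sup-norm change is below tol."""
    for _ in range(max_iter):
        TJ = T(J)
        if np.max(np.abs(TJ - J)) < tol:
            return TJ
        J = TJ
    return J

J_star = value_iteration(np.zeros(2))
print(J_star)                              # approximate fixed point of T
print(np.max(np.abs(T(J_star) - J_star)))  # Bellman residual ~ 0
```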

Policy Iteration: T_{µ^{k+1}} J_{µ^k} = T J_{µ^k} (A Complicated Story)

Classical Form of Exact PI
(C): Convergence starting from any µ
(SC): Convergence starting from a "regular" µ (not if "irregular" µ arise)
(I), (D): Convergence fails

Optimistic/Modified PI (Combination of VI and PI)
(C): Convergence starting from any µ
(SC): Convergence starting from any µ after a major modification of the policy evaluation step: solving an "optimal stopping" problem instead of a linear equation
(D): Convergence starting from initial condition J̄
(I): Convergence may fail (special conditions required)

Asynchronous Optimistic/Modified PI (Combination of VI and PI)
(C): Fails in the standard form; works after a major modification
(SC): Works after a major modification
(D), (I): Convergence may fail (special conditions required)
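
A sketch of the classical form of exact PI for the contractive case (same hypothetical MDP layout as in the value iteration sketch): policy evaluation solves the linear system J_µ = T_µ J_µ, and policy improvement picks a minimizing control at each state.

```python
import numpy as np

alpha = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[u, x, y], hypothetical
              [[0.5, 0.5], [0.7, 0.3]]])
g = np.array([[1.0, 3.0],                   # g[u, x], hypothetical
              [2.0, 0.5]])
n = g.shape[1]

def evaluate(mu):
    """Policy evaluation: solve J_mu = g_mu + alpha * P_mu J_mu."""
    P_mu = P[mu, np.arange(n)]              # rows P[mu(x), x, :]
    g_mu = g[mu, np.arange(n)]
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

def improve(J):
    """Policy improvement: mu'(x) in argmin_u H(x, u, J)."""
    return np.argmin(g + alpha * P @ J, axis=0)

mu = np.zeros(n, dtype=int)                 # arbitrary starting policy
while True:
    J_mu = evaluate(mu)
    mu_new = improve(J_mu)                  # T_{mu_new} J_mu = T J_mu
    if np.array_equal(mu_new, mu):
        break
    mu = mu_new

print(mu, evaluate(mu))                     # optimal policy and J*
```

For a finite MDP in case (C), this loop terminates after finitely many policy improvements, consistent with the convergence claim above.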

Results Overview: Approximate DP

Approximate J_µ and J* within a subspace spanned by basis functions
Aim for approximate versions of value iteration, policy iteration, and linear programming
Simulation-based algorithms are common
No mathematical model is necessary (a computer simulator of the controlled system is sufficient)
Very large and complex problems have been addressed

Case (C)
A wide variety of results, thanks to the underlying contraction property
Approximate value iteration and Q-learning
Approximate policy iteration, pure and optimistic/modified

Cases (I), (D), (SC)
Hardly any results available

3. Semicontractive Models

Semicontractive Models: Formulation

Key idea: Introduce a domain of "regularity," S ⊂ E(X)

[Figure: for an S-regular policy µ, T_µ^k J → J_µ for every J in the set S; for an S-irregular policy, the iterates T_µ^k J need not converge to J_µ]

Definition: A policy µ is S-regular if J_µ ∈ S and J_µ is the only fixed point of T_µ within S, and the starting function J does not affect J_µ, i.e., T_µ^k J → J_µ for all J ∈ S

Typical Assumptions in Semicontractive Models

1st Set of Assumptions (Plus Additional Technicalities)
There exists an S-regular policy, and irregular policies are "bad": for each irregular µ and J ∈ S, there is at least one x ∈ X such that
limsup_{k→∞} (T_µ^k J)(x) = ∞

2nd Set of Assumptions (Plus Additional Technicalities)
There exists an *optimal* S-regular policy

Perturbation-Type Assumptions (Plus Additional Technicalities)
There exists an *optimal* S-regular policy µ*
If H is perturbed by an additive δ > 0, each S-regular policy is also δ-S-regular (i.e., regular for the δ-perturbed problem), and every δ-S-irregular policy µ is "bad", i.e., there is at least one x ∈ X such that
J_{µ,δ}(x) = limsup_{k→∞} (T_{µ,δ}^k J̄)(x) = ∞

Semicontractive Example: Shortest Paths with Exponential Cost

[Figure: graph with nodes 2 and 1 and destination 0; arc 1 → 0 of length b, arcs 1 → 2 and 2 → 1 of length a]

Bellman equation:
J(1) = min{ exp(b), exp(a) J(2) },  J(2) = exp(a) J(1)

J̄ ≡ 1;  S = {J | J ≥ 0} or S = {J | J > 0} or S = {J | J ≥ J̄}

Two policies:

Noncyclic µ: 2 → 1 → 0 (S-regular, except when S = {J | J ≥ J̄} and b < 0)
(T_µ J)(1) = exp(b),  (T_µ J)(2) = exp(a) J(1)
J_µ(1) = exp(b),  J_µ(2) = exp(a + b)

Cyclic µ′: 2 → 1 → 2 (S-irregular, except when S = {J | J ≥ 0} and a < 0)
(T_µ′ J)(1) = exp(a) J(2),  (T_µ′ J)(2) = exp(a) J(1)
J_µ′(1) = J_µ′(2) = lim_{k→∞} (exp(a))^k
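
This two-state example is easy to work out numerically; the sketch below (my own illustration) applies T_µ repeatedly from J̄ ≡ 1 for both policies, for a sample choice of a and b:

```python
import math

a, b = 0.5, 1.0        # sample arc lengths; try other signs of a and b
# J is stored as (J(1), J(2))

def T_noncyclic(J):
    # 2 -> 1 -> 0: (T_mu J)(1) = e^b, (T_mu J)(2) = e^a * J(1)
    return (math.exp(b), math.exp(a) * J[0])

def T_cyclic(J):
    # 2 -> 1 -> 2: (T_mu' J)(1) = e^a * J(2), (T_mu' J)(2) = e^a * J(1)
    return (math.exp(a) * J[1], math.exp(a) * J[0])

def iterate(T, J=(1.0, 1.0), k=200):      # start from J_bar = 1
    for _ in range(k):
        J = T(J)
    return J

print(iterate(T_noncyclic))  # -> (e^b, e^{a+b}) = J_mu
print(iterate(T_cyclic))     # (e^a)^k: grows for a > 0, -> 1 for a = 0, -> 0 for a < 0
```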

Five Special Cases (Each Covered by a Different Theorem!)

[Figure: the same shortest path graph as on the previous slide]

Bellman equation:
J(1) = min{ exp(b), exp(a) J(2) },  J(2) = exp(a) J(1)

a > 0: J*(1) = exp(b), J*(2) = exp(a + b); J* is the unique fixed point with J > 0
(1st set of assumptions applies with S = {J | J > 0})
Set of fixed points of T is {J*} ∪ { J | J(1) = J(2) = 0 }

a = 0, b > 0: J*(1) = J*(2) = 1 (perturbation assumptions apply)
Set of fixed points of T is { J | J(1) = J(2) ≤ exp(b) }

a = 0, b = 0: J*(1) = J*(2) = 1 (2nd set of assumptions applies with S = {J | J ≥ J̄})
Set of fixed points of T is { J | J(1) = J(2) ≤ 1 }

a = 0, b < 0: J*(1) = J*(2) = exp(b) (perturbation assumptions apply)
Set of fixed points of T is { J | J(1) = J(2) ≤ exp(b) }

a < 0: J*(1) = J*(2) = 0 is the unique fixed point of T (contractive case)
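
The five regimes can be explored by running value iteration J_{k+1} = T J_k from J̄ ≡ 1 for representative (a, b) pairs and comparing the limits with the J* values above (my own illustration):

```python
import math

def T(J, a, b):
    # Bellman operator of the two-state example, with J = (J(1), J(2)):
    # J(1) <- min{ e^b, e^a J(2) },  J(2) <- e^a J(1)
    return (min(math.exp(b), math.exp(a) * J[1]), math.exp(a) * J[0])

def value_iteration(a, b, J=(1.0, 1.0), k=500):   # start from J_bar = 1
    for _ in range(k):
        J = T(J, a, b)
    return J

for a, b in [(0.5, 1.0), (0.0, 1.0), (0.0, 0.0), (0.0, -1.0), (-0.5, 1.0)]:
    J1, J2 = value_iteration(a, b)
    print(f"a={a:+.1f}, b={b:+.1f}:  J(1)={J1:.4f}, J(2)={J2:.4f}")
```

In the last case (a < 0) the operator is contractive and the iterates shrink to 0, matching the unique fixed point listed above.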

4. Affine Monotonic/Risk-Sensitive Models

An Example: Affine Monotonic/Risk-Sensitive Models

T_µ is linear, of the form T_µ J = A_µ J + b_µ, with b_µ ≥ 0 and J ≥ 0 ⇒ A_µ J ≥ 0
S = {J | 0 ≤ J} or S = {J | 0 < J} or S: J bounded above and away from 0

Special case I: Negative DP model, J̄(x) ≡ 0, A_µ: transition probability matrix

Special case II: Multiplicative model with termination state 0, J̄(x) ≡ 1
H(x, u, J) = p_{x0}(u) g(x, u, 0) + Σ_{y ∈ X} p_{xy}(u) g(x, u, y) J(y)
A_µ(x, y) = p_{xy}(µ(x)) g(x, µ(x), y),  b_µ(x) = p_{x0}(µ(x)) g(x, µ(x), 0)

Special case III: Exponential cost with termination state 0, J̄(x) ≡ 1
A_µ(x, y) = p_{xy}(µ(x)) exp(h(x, µ(x), y)),  b_µ(x) = p_{x0}(µ(x)) exp(h(x, µ(x), 0))
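
A sketch (hypothetical transition probabilities and arc lengths) of how the exponential-cost special case III fits the affine form T_µ J = A_µ J + b_µ:

```python
import numpy as np

# Hypothetical 2-state problem with termination state 0 under a fixed policy mu:
# p[x, y] = transition prob from x to y in {1, 2}, p0[x] = prob of reaching 0,
# h[x, y] / h0[x] = arc lengths entering the exponential cost.
p = np.array([[0.3, 0.5],
              [0.6, 0.2]])
p0 = np.array([0.2, 0.2])            # rows of (p0 | p) sum to 1
h = np.array([[0.10, 0.20],
              [0.05, 0.15]])
h0 = np.array([0.3, 0.7])

# Special case III: A_mu(x, y) = p_xy exp(h(x, y)),  b_mu(x) = p_x0 exp(h(x, 0))
A_mu = p * np.exp(h)
b_mu = p0 * np.exp(h0)

def T_mu(J):
    return A_mu @ J + b_mu           # affine monotonic: linear part + nonnegative offset

print(T_mu(np.ones(2)))              # one application to J_bar = 1
```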

SC Assumptions Translated to Affine Monotonic

µ is S-regular if and only if
lim_{k→∞} (A_µ^k J)(x) = 0 and Σ_{m=0}^∞ (A_µ^m b_µ)(x) < ∞, ∀ x ∈ X, J ∈ S

The 1st Set of Assumptions
There exists an S-regular policy; also inf_{µ: S-regular} J_µ ∈ S
If µ is S-irregular, there is at least one x ∈ X such that Σ_{m=0}^∞ (A_µ^m b_µ)(x) = ∞
Compactness and continuity conditions hold

Notes:
Value and (modified) policy iteration algorithms are valid
State and control spaces need not be finite
Related (but different) results are possible under alternative conditions
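
Continuing that sketch in self-contained form (same hypothetical numbers), S-regularity of µ can be checked via the spectral radius of A_µ: if ρ(A_µ) < 1 then A_µ^k J → 0 and the series Σ_m A_µ^m b_µ converges to J_µ = (I − A_µ)^{-1} b_µ.

```python
import numpy as np

A_mu = np.array([[0.3, 0.5],
                 [0.6, 0.2]]) * np.exp([[0.10, 0.20],
                                        [0.05, 0.15]])
b_mu = np.array([0.2, 0.2]) * np.exp([0.3, 0.7])

rho = max(abs(np.linalg.eigvals(A_mu)))
print("spectral radius:", rho)       # < 1 => A_mu^k -> 0, so mu is S-regular here

if rho < 1:
    # J_mu = sum_{m>=0} A_mu^m b_mu = (I - A_mu)^{-1} b_mu
    J_mu = np.linalg.solve(np.eye(2) - A_mu, b_mu)
    print("J_mu:", J_mu)
    print("residual:", np.max(np.abs(A_mu @ J_mu + b_mu - J_mu)))  # ~0
else:
    print("rho >= 1: mu need not be S-regular (A_mu^k J need not vanish)")
```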

Concluding Remarks

Abstract DP is based on the connections of DP with fixed point theory
It aims at unification and insight through abstraction
Semicontractive models fill a conspicuous gap in the theory from the 60s-70s
Affine monotonic is a natural and useful model
Abstract DP models with approximations require more research
Abstract DP models with restrictions, such as measurability of policies, require more research

Thank you!