Chapter 8

Stochastic Shortest Path Problems

(This version: November 22, 2009.)

In this chapter we study a stochastic version of the shortest path problem of chapter 2, where only the probabilities of transitions along the different arcs can be controlled, and the objective is to minimize the expected length of the path. We discuss Bellman's equation and value and policy iteration for the case of a finite state space. Technical difficulties arise because the DP operator is not necessarily a contraction, as it is in the discounted cost problems of chapter 6.

8.1 Problem Formulation

The stochastic shortest path (SSP) problem of this chapter is an infinite horizon problem where

1. There is no discounting, i.e., $\alpha = 1$ in the notation of chapter 6.

2. There is a special absorbing state, denoted $t$, representing the destination: $p_{tt}(u) = 1$ for all $u$.

3. The state space $X = \{1, \dots, n, t\}$ and the control constraint sets $U(i)$ are finite.

4. The destination state is cost-free: $c(t, u) = 0$ for all $u$.

Clearly the cost of any policy starting from $t$ is $0$, hence we can omit $t$ from the notation and define the vectors $J, c_\mu \in \mathbb{R}^n$ and the matrix $P_\mu \in \mathbb{R}^{n \times n}$ as in section 7.1. The difference, however, is that $P_\mu$ is substochastic instead of stochastic. That is,
\[
\sum_{j=1}^{n} p_{ij}(u) = 1 - p_{it}(u) \le 1,
\]
with strict inequality if $p_{it}(u) > 0$.
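To fix this data layout concretely, the following minimal NumPy sketch (an added illustration with hypothetical numbers, not part of the formulation itself) stores a small SSP instance and builds the pair $(P_\mu, c_\mu)$ for a stationary policy, checking that $P_\mu$ is substochastic because the transition mass sent to $t$ leaves the matrix.

```python
import numpy as np

# Hypothetical SSP instance with n = 2 non-terminal states (indexed 0, 1) and the
# destination t.  p[i][u] lists transition probabilities to (state 0, state 1, t);
# c[i][u] is the one-stage cost c(i, u).
n = 2
p = {
    0: {"stop": np.array([0.0, 0.0, 1.0]),   # go straight to t
        "move": np.array([0.0, 1.0, 0.0])},  # go to state 1
    1: {"move": np.array([1.0, 0.0, 0.0])},  # go back to state 0
}
c = {0: {"stop": 2.0, "move": 1.0}, 1: {"move": 1.0}}

def policy_matrices(mu):
    """Return (P_mu, c_mu) restricted to the non-terminal states 0..n-1.
    Row i of P_mu sums to 1 - p_{it}(mu(i)), so P_mu is substochastic."""
    P_mu = np.array([p[i][mu[i]][:n] for i in range(n)])
    c_mu = np.array([c[i][mu[i]] for i in range(n)])
    return P_mu, c_mu

P_mu, c_mu = policy_matrices({0: "stop", 1: "move"})
print(P_mu.sum(axis=1))   # [0. 1.] -- row 0 has lost its probability mass to t
```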

The operators $T$ and $T_\mu$ are defined as in chapter 6, with $\alpha = 1$, i.e.,
\[
T_\mu J = c_\mu + P_\mu J, \qquad
TJ(i) = \min_{u \in U(i)} \Big[ c(i, u) + \sum_{j=1}^{n} p_{ij}(u) J(j) \Big].
\]

Bellman's Equation Need Not Hold for SSP

In general, even under the assumptions above, the DP operator for SSP is not a contraction. Without additional assumptions, the theory developed in chapter 6 for discounted cost problems runs into difficulties. Consider a problem with state space $\{1, 2, t\}$ and two policies $\mu$ and $\mu'$, with transition probabilities and costs as given in Fig. 8.1.

[Figure 8.1: Transition diagrams and costs for the two policies $\mu$ and $\mu'$ in a stochastic shortest path problem; the costs are written next to the arrows. From state 1 one can either move to $t$ at cost $-1$ (policy $\mu$) or move to state 2 at cost $0$ (policy $\mu'$); from state 2 the only transition is to state 1, at cost $0$.]

The equation $J = TJ$ can be written
\[
J(1) = \min\{-1, J(2)\}, \qquad J(2) = J(1),
\]
and is satisfied by any $J$ of the form $J(1) = J(2) = \delta$ with $\delta \le -1$. The policy $\mu$ is clearly optimal, with cost vector $J_\mu(1) = J_\mu(2) = -1$. The difficulty arises because the policy $\mu'$, starting from 1 or 2, never reaches the final state, yet has a finite (zero) cost.
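The multiple solutions of $J = TJ$ in this example are easy to verify numerically. The short check below is only an illustration of the example above, with the arc costs as described ($-1$ on the arc from state 1 to $t$, $0$ elsewhere).

```python
import numpy as np

# Bellman operator for the example: at state 1, choose between going to t (cost -1)
# and going to state 2 (cost 0); at state 2 the only move is to state 1 (cost 0).
def T(J):
    return np.array([min(-1.0, 0.0 + J[1]),   # state 1
                     0.0 + J[0]])             # state 2

# Every J(1) = J(2) = delta with delta <= -1 is a fixed point of T.
for delta in (-1.0, -5.0, -100.0):
    J = np.array([delta, delta])
    assert np.allclose(T(J), J)

# Value iteration started from J = 0 still reaches the optimal costs (-1, -1) here.
J = np.zeros(2)
for _ in range(10):
    J = T(J)
print(J)   # [-1. -1.]
```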

8.2 DP Theory

In view of the example of section 8.1, consider the following definition.

Definition 8.2.1. A stationary policy $\mu$ is said to be proper if, when using this policy, there is a positive probability that the destination will be reached after at most $n$ stages, regardless of the initial state. That is,
\[
\rho_\mu = \max_{i=1,\dots,n} P_\mu(x_n \ne t \mid x_0 = i) < 1. \tag{8.1}
\]

A stationary policy that is not proper is said to be improper. It is not hard to see that $\mu$ is a proper policy if and only if, in the transition graph of the corresponding Markov chain, each state is connected to the destination by some path. We make the following assumptions.

Assumption 8.2.1. There exists at least one proper policy.

Assumption 8.2.2. For every improper policy $\mu$, the corresponding cost $J_\mu(i)$ is infinite for at least one state $i$. That is, some component of $\sum_{k=0}^{N-1} P_\mu^k c_\mu$ diverges as $N \to \infty$.

In the case of a deterministic shortest path problem, assumption 8.2.1 is satisfied if and only if every node is connected to the destination by some path, and assumption 8.2.2 is satisfied if and only if each cycle that does not contain the destination has positive length. Assumption 8.2.2 is satisfied, for example, if $c(i, u)$ is positive for all $i \ne t$ and $u \in U(i)$. Another important case where the assumptions are verified is when all policies are proper.

Note that if a policy is proper, then almost surely the destination is reached after a finite number of steps. Indeed, a consequence of (8.1) is that
\[
P_\mu(x_{2n} \ne t \mid x_0 = i) = P_\mu(x_{2n} \ne t \mid x_n \ne t, x_0 = i)\, P_\mu(x_n \ne t \mid x_0 = i) \le \rho_\mu^2,
\]
and more generally
\[
P_\mu(x_k \ne t \mid x_0 = i) \le \rho_\mu^{\lfloor k/n \rfloor}, \qquad i = 1, \dots, n. \tag{8.2}
\]
We can then invoke the Borel-Cantelli lemma to conclude. Moreover, under a proper policy the cost is finite, since
\[
J_\mu(i) \le \lim_{N \to \infty} \sum_{k=0}^{N-1} \rho_\mu^{\lfloor k/n \rfloor} \max_{j=1,\dots,n} |c(j, \mu(j))| < \infty. \tag{8.3}
\]

Under assumptions 8.2.1 and 8.2.2, the dynamic programming theory is similar to the one for the discounted cost problems of chapter 6. Note that here the constant $\rho_\mu$ in a sense plays the role of the discount factor, see (8.3). Note in particular that (8.2) implies that for all $J$ we have $\lim_{k \to \infty} P_\mu^k J = 0$.
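Condition (8.1) can be checked directly from the substochastic matrix $P_\mu$: since the probability mass that reaches $t$ leaves the matrix, $(P_\mu^n \mathbf{1})_i = P_\mu(x_n \ne t \mid x_0 = i)$. The sketch below (an added illustration; the two matrices are the policies $\mu$ and $\mu'$ of the example of section 8.1) computes $\rho_\mu$ this way.

```python
import numpy as np

def is_proper(P_mu, tol=1e-12):
    """Definition 8.2.1: proper iff rho_mu = max_i P_mu(x_n != t | x_0 = i) < 1,
    where P_mu is the substochastic matrix over the n non-terminal states."""
    n = P_mu.shape[0]
    survive = np.linalg.matrix_power(P_mu, n) @ np.ones(n)   # P(x_n != t | x_0 = i)
    rho_mu = survive.max()
    return rho_mu < 1.0 - tol, rho_mu

P_proper   = np.array([[0.0, 0.0],    # policy mu : 1 -> t, 2 -> 1
                       [1.0, 0.0]])
P_improper = np.array([[0.0, 1.0],    # policy mu': 1 -> 2, 2 -> 1 (never reaches t)
                       [1.0, 0.0]])
print(is_proper(P_proper))    # proper:   rho_mu = 0
print(is_proper(P_improper))  # improper: rho_mu = 1
```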

Proposition 8.2.1.

1. For a proper policy $\mu$, the associated cost vector satisfies $\lim_{k \to \infty} T_\mu^k J = J_\mu$ for every vector $J$, and $J_\mu$ is the unique solution of the equation $J_\mu = T_\mu J_\mu$. Moreover, a policy $\mu$ such that $J \ge T_\mu J$ is satisfied for some vector $J$ is proper.

2. The optimal cost vector $J^*$ is the unique solution of Bellman's equation $J = TJ$, and we have $\lim_{k \to \infty} T^k J = J^*$ for every vector $J$. A stationary policy $\mu$ is optimal if and only if $T_\mu J^* = TJ^*$.

Proof. We have
\[
T_\mu^k J = P_\mu^k J + \sum_{m=0}^{k-1} P_\mu^m c_\mu,
\]
and so
\[
\lim_{k \to \infty} T_\mu^k J = \lim_{k \to \infty} \sum_{m=0}^{k-1} P_\mu^m c_\mu = J_\mu
\]
by (8.2) (the fact that the limit exists is also a consequence of (8.2)). Then, taking the limit as $k \to \infty$ in $T_\mu^{k+1} J = c_\mu + P_\mu T_\mu^k J$, we obtain $J_\mu = T_\mu J_\mu$. Finally, for uniqueness, suppose $J = T_\mu J$; then $J = T_\mu^k J$ for all $k$, and letting $k \to \infty$ we get $J = J_\mu$.

Now consider a policy $\mu$ for which $J \ge T_\mu J$ for some vector $J$. Then for all $k$,
\[
J \ge T_\mu^k J = P_\mu^k J + \sum_{m=0}^{k-1} P_\mu^m c_\mu,
\]
so no component of the sum can diverge, and so $\mu$ is not improper (hence proper) by assumption 8.2.2.

Next consider the second statement. Note first that because the control sets are finite, for any vector $J$ we can find a policy $\mu$ such that $TJ = T_\mu J$. Suppose Bellman's equation has two solutions $J$ and $J'$, with corresponding policies $\mu$ and $\mu'$ such that $J = T_\mu J$ and $J' = T_{\mu'} J'$. Then by the first statement $\mu$ and $\mu'$ must be proper, and so $J = J_\mu$ and $J' = J_{\mu'}$. Then $J = T^k J \le T_{\mu'}^k J$ for all $k$, so $J \le J_{\mu'} = J'$. Similarly $J' \le J$, so $T$ has at most one fixed point.

Now we show the existence of a fixed point for $T$. By assumption 8.2.1 there is at least one proper policy $\mu_0$, with cost $J_{\mu_0}$. We use this policy to start policy iteration. Assume that at stage $k$ we have proper policies $\mu_0, \dots, \mu_k$ such that
\[
J_{\mu_0} \ge TJ_{\mu_0} \ge J_{\mu_1} \ge \dots \ge J_{\mu_{k-1}} \ge TJ_{\mu_{k-1}} \ge J_{\mu_k}. \tag{8.4}
\]
Then we choose $\mu_{k+1}$ such that $TJ_{\mu_k} = T_{\mu_{k+1}} J_{\mu_k}$. Hence $T_{\mu_{k+1}} J_{\mu_k} \le T_{\mu_k} J_{\mu_k} = J_{\mu_k}$, so $\mu_{k+1}$ is proper by the first statement. Also, by monotonicity of $T_{\mu_{k+1}}$ we have
\[
J_{\mu_{k+1}} = \lim_{m \to \infty} T_{\mu_{k+1}}^m J_{\mu_k} \le T_{\mu_{k+1}} J_{\mu_k} = TJ_{\mu_k} \le J_{\mu_k},
\]

which completes the induction. Since the set of proper policies is finite, some policy $\mu$ must be repeated within the sequence $\{\mu_k\}$. By the inequalities (8.4) we get $J_\mu = TJ_\mu$ for this policy.

We still have to show that the fixed point $J_\mu$ of $T$ just constructed is the optimal cost vector $J^*$, and that $T^k J \to J_\mu = J^*$ for every $J$. We start with the second assertion. Let $\delta > 0$, $e = [1, \dots, 1]^T$, and let $\hat J$ be the vector satisfying $\hat J = T_\mu \hat J + \delta e$. There is a unique such vector $\hat J$, which is the cost of the proper policy $\mu$ with the cost $c_\mu$ replaced by $c_\mu + \delta e$. This interpretation also implies $J_\mu \le \hat J$, and so
\[
J_\mu = TJ_\mu \le T\hat J \le T_\mu \hat J = \hat J - \delta e \le \hat J.
\]
Using the monotonicity of $T$, we then have
\[
J_\mu = T^k J_\mu \le T^k \hat J \le T^{k-1} \hat J \le \dots \le \hat J, \qquad k \ge 1.
\]
Hence the sequence $T^k \hat J$ is nonincreasing and bounded below by $J_\mu$, so it converges to some vector $\tilde J$. The mapping $T$ is continuous, so
\[
T\tilde J = T\big(\lim_{k \to \infty} T^k \hat J\big) = \lim_{k \to \infty} T^{k+1} \hat J = \tilde J.
\]
By the uniqueness of the fixed point of $T$, we conclude $\tilde J = J_\mu$.

Now take any vector $J$. Recalling the interpretation of $\hat J$ above, we can always find $\delta > 0$ such that $J \le \hat J$, and in fact such that the following stronger condition holds:
\[
J_\mu - \delta e \le J \le \hat J. \tag{8.5}
\]
We already know that $\lim_{k \to \infty} T^k \hat J = J_\mu$. Moreover, since $P_\mu e \le e$ for any policy $\mu$, we have $T(J - \delta e) \ge TJ - \delta e$ for any vector $J$. Thus
\[
J_\mu - \delta e = TJ_\mu - \delta e \le T(J_\mu - \delta e) \le TJ_\mu = J_\mu.
\]
Hence $T^k(J_\mu - \delta e)$ is monotonically nondecreasing and bounded above, so as before we conclude that $\lim_{k \to \infty} T^k(J_\mu - \delta e) = J_\mu$. Finally, by monotonicity of $T$ and (8.5), we obtain $\lim_{k \to \infty} T^k J = J_\mu$.

We can now show that $J_\mu = J^*$. Consider any policy $\pi = \{\mu_0, \mu_1, \dots\}$, and let $J_0$ be the zero vector. Then
\[
T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{k-1}} J_0 \ge T^k J_0,
\]
and taking the limsup on both sides we obtain $J_\pi \ge J_\mu$; hence $J_\mu = J^*$.

Finally, if $\mu$ is optimal, then $J_\mu = J^*$ and by our assumptions $\mu$ must be proper. Hence $T_\mu J^* = T_\mu J_\mu = J_\mu = J^* = TJ^*$. Conversely, if $TJ^* = T_\mu J^*$, since we also have $J^* = TJ^*$, it follows from the first statement that $\mu$ is proper. By uniqueness of the fixed point of $T_\mu$, we get $J_\mu = J^*$, so $\mu$ is optimal.
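On instances satisfying assumptions 8.2.1 and 8.2.2, the conclusions of Proposition 8.2.1 are easy to observe numerically. The sketch below uses a hypothetical two-state instance with strictly positive costs (so assumption 8.2.2 holds automatically): value iteration converges to the same $J^*$ from several starting vectors, and the greedy policy at $J^*$ satisfies $T_\mu J^* = TJ^*$.

```python
import numpy as np

# Hypothetical SSP: probabilities are over the non-terminal states (0, 1) only;
# missing probability mass goes to the destination t.  All costs are positive.
controls = [
    [(2.0, np.array([0.0, 0.0])),    # state 0, control a: straight to t, cost 2
     (1.0, np.array([0.0, 1.0]))],   # state 0, control b: to state 1, cost 1
    [(1.0, np.array([0.5, 0.0]))],   # state 1: to state 0 w.p. 0.5, to t w.p. 0.5, cost 1
]

def T(J):
    """Bellman operator TJ(i) = min_u [ c(i,u) + sum_j p_ij(u) J(j) ]."""
    return np.array([min(c + p @ J for (c, p) in ctrl) for ctrl in controls])

def greedy(J):
    """A policy mu with T_mu J = T J (argmin over the finite control sets)."""
    return [min(range(len(ctrl)), key=lambda u: ctrl[u][0] + ctrl[u][1] @ J)
            for ctrl in controls]

# Value iteration converges to J* from every starting vector (Proposition 8.2.1).
for J0 in (np.zeros(2), np.full(2, 100.0), np.array([-7.0, 3.0])):
    J = J0
    for _ in range(200):
        J = T(J)
    print(J, greedy(J))   # J* = [2. 2.], optimal control at state 0 is 'a' (index 0)
```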

8.3 Value and Policy Iteration

Value Iteration

The convergence of the value iteration algorithm is shown in proposition 8.2.1. Moreover, there are error bounds similar to the ones discussed in section 7.1 for discounted cost problems, and the Gauss-Seidel value iteration method works and typically converges faster than the ordinary value iteration method.

Moreover, there are certain SSP problems for which we have better convergence results. Note, for example, that for deterministic shortest path problems, value iteration terminates in a finite number of steps. More generally, let us assume that the transition graph corresponding to some optimal policy $\mu^*$ is acyclic. This requires in particular that there are no positive self-transition probabilities $p_{ii}(\mu^*(i)) > 0$ for $i \ne t$. This last property can always be achieved, however, as follows. Define a new SSP problem with transition probabilities
\[
\tilde p_{ij}(u) =
\begin{cases}
0 & \text{if } j = i, \\
\dfrac{p_{ij}(u)}{1 - p_{ii}(u)} & \text{if } j \ne i,
\end{cases}
\qquad i = 1, \dots, n,
\]
and costs $\tilde c(i, u) = c(i, u)/(1 - p_{ii}(u))$, $i = 1, \dots, n$. This new SSP is equivalent to the original one, in the sense that it has the same optimal costs and policies. Moreover, it satisfies the assumption $\tilde p_{ii}(\mu^*(i)) = 0$.
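A small sketch of this self-transition elimination (an added illustration with hypothetical numbers; here the last entry of a probability row is the mass sent to $t$):

```python
import numpy as np

def eliminate_self_transition(p_row, cost, i):
    """Apply the transformation above to one (state i, control u) pair:
    set the self-transition probability to 0 and rescale the remaining
    probabilities and the one-stage cost by 1 / (1 - p_ii(u))."""
    p_ii = p_row[i]
    assert p_ii < 1.0, "a control with p_ii(u) = 1 never leaves state i"
    new_row = p_row / (1.0 - p_ii)
    new_row[i] = 0.0
    return new_row, cost / (1.0 - p_ii)

# State i = 0: stay put w.p. 0.5, go to state 1 w.p. 0.3, reach t w.p. 0.2, cost 4.
row, c = eliminate_self_transition(np.array([0.5, 0.3, 0.2]), 4.0, i=0)
print(row, c)   # [0.  0.6 0.4] 8.0
```

The rescaled cost $4/(1 - 0.5) = 8$ is the expected cost accumulated at state 0 until the state changes, which is why the transformed problem has the same cost vectors.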

Under the preceding acyclicity assumption, the value iteration method yields $J^*$ after at most $n$ iterations when started from the vector $J$ such that $J(i) = \infty$ for all $i = 1, \dots, n$. Hence such a vector could be a good choice to start VI in any case, even if the acyclicity property is not clear. Indeed, consider the sets of states
\[
S_0 = \{t\}, \qquad
S_{k+1} = \Big\{ i \;\Big|\; p_{ij}(\mu^*(i)) = 0 \text{ for all } j \notin \textstyle\bigcup_{m=0}^{k} S_m \Big\}, \qquad k = 0, 1, \dots
\]
Hence $S_{k+1}$ is the set of states from which, under $\mu^*$, we go to $\bigcup_{m=0}^{k} S_m$ at the next step with probability 1. A consequence of the acyclicity assumption is that these sets are nonempty until we reach $\bar k$ such that $\bigcup_{m=0}^{\bar k} S_m = \{1, \dots, n, t\}$. Indeed, suppose $S_{\bar k} = \emptyset$ for some $\bar k$ with $S := \bigcup_{m=0}^{\bar k - 1} S_m \ne \{1, \dots, n, t\}$, and take $i_0 \notin S$. Then there is an edge of the transition graph of $\mu^*$ starting from $i_0$ that does not enter $S$. Follow this edge to reach a new state $i_1 \notin S$, and repeat; since there are finitely many states outside $S$, the path must eventually revisit a state, a contradiction with acyclicity (note that because an optimal policy is proper, the path constructed cannot get stuck at some state other than $t$, and $t \in S$).

Then we show by induction that the value iteration method starting with the vector $J$ above yields $T^k J(i) = J^*(i)$ for all $i \in \bigcup_{m=0}^{k} S_m$, $i \ne t$. This is vacuously true for $k = 0$. Assuming it holds for $k$, we have by the monotonicity of $T$ that $J^* \le T^{k+1} J$, and for all $i \in \bigcup_{m=0}^{k+1} S_m$ with $i \ne t$ the transitions under $\mu^*$ lead only to states $j \in \bigcup_{m=0}^{k} S_m$, where $T^k J(j) = J^*(j)$ by the induction hypothesis, so that
\[
T^{k+1} J(i) \le c(i, \mu^*(i)) + \sum_{j \in \cup_{m=0}^{k} S_m} p_{ij}(\mu^*(i)) J^*(j) = J^*(i),
\]
and hence $T^{k+1} J(i) = J^*(i)$. This completes the induction. Hence in this case value iteration gives the optimal costs for all states after $\bar k$ iterations.

Policy Iteration

Policy iteration and approximate policy iteration work similarly to the discounted cost case.
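For completeness, here is a minimal policy iteration sketch in the spirit of the existence argument in the proof of Proposition 8.2.1 (an added illustration on the same hypothetical two-state instance as before; the starting policy is proper, so every policy evaluation is a finite linear solve).

```python
import numpy as np

# Same hypothetical instance as before: probabilities over the non-terminal states;
# missing mass goes to the destination t.
controls = [
    [(2.0, np.array([0.0, 0.0])),    # state 0, control a: straight to t, cost 2
     (1.0, np.array([0.0, 1.0]))],   # state 0, control b: to state 1, cost 1
    [(1.0, np.array([0.5, 0.0]))],   # state 1: to state 0 w.p. 0.5, to t w.p. 0.5, cost 1
]
n = len(controls)

def evaluate(mu):
    """Policy evaluation: solve J_mu = c_mu + P_mu J_mu (finite for a proper policy)."""
    P_mu = np.array([controls[i][mu[i]][1] for i in range(n)])
    c_mu = np.array([controls[i][mu[i]][0] for i in range(n)])
    return np.linalg.solve(np.eye(n) - P_mu, c_mu)

def improve(J):
    """Policy improvement: pick mu_{k+1} with T_{mu_{k+1}} J_{mu_k} = T J_{mu_k}."""
    return [min(range(len(ctrl)), key=lambda u: ctrl[u][0] + ctrl[u][1] @ J)
            for ctrl in controls]

mu = [0, 0]                      # a proper policy to start from (state 0 goes to t)
while True:
    J_mu = evaluate(mu)
    new_mu = improve(J_mu)
    if new_mu == mu:
        break
    mu = new_mu
print(mu, J_mu)                  # optimal policy [0, 0] with cost vector [2. 2.]
```

As in the proof, each iteration produces a proper policy with $J_{\mu_{k+1}} \le J_{\mu_k}$, and since there are finitely many policies the loop terminates at a policy satisfying $J_\mu = TJ_\mu$.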