MS&E338 Reinforcement Learning Lecture 2 - April 4, 2018

Algorithms for MDPs and Their Convergence

Lecturer: Ben Van Roy        Scribes: Matthew Creme and Kristen Kessel

1 Bellman operators

Recall from last lecture that we defined two Bellman operators. The first is

    (T V)(s) = \max_a \Big\{ R(s, a) + \sum_{s' \in S} P_{sa}(s') V(s') \Big\}.

The second,

    (T^\pi V)(s) = \sum_a \pi(a \mid s) \Big[ R(s, a) + \sum_{s' \in S} P_{sa}(s') V(s') \Big],

is associated with the policy \pi. We refer to V^* = T V^* and V^\pi = T^\pi V^\pi as the Bellman equations.

The Bellman operators T and T^\pi satisfy two properties: monotonicity and contraction.

Proposition 1 (monotonicity). For all V, \bar{V} with V \le \bar{V}, we have T V \le T \bar{V} and T^\pi V \le T^\pi \bar{V}.

Proof. We refer the reader to Lecture 1 for this proof.

In proving the second property, that T and T^\pi are contraction operators, we first introduce two key concepts: maximum survival time and weighted max-norm.

Definition 2. The maximum survival time of state s, denoted \tau(s), is defined as

    \tau(s) = \max_\pi E^\pi\Big[ \sum_{t=0}^{\infty} \mathbf{1}\{s_t \text{ is not terminal}\} \,\Big|\, s_0 = s \Big] < \infty.

Intuitively, \tau(s) can be thought of as how far we must look ahead in order to plan well in the worst case, given that we start at state s. Note that we can equivalently express \tau(s) as

    \tau(s) = 1 + \max_a \sum_{s'} P_{sa}(s') \tau(s').

We use this last equation in Propositions 4 and 5.

Definition 3. The weighted max-norm of V is defined as

    \|V\|_{\infty, 1/\tau} = \max_{s \in S} \frac{|V(s)|}{\tau(s)}.

Proposition 4. For all V, \bar{V},

    \|T V - T \bar{V}\|_{\infty, 1/\tau} \le \alpha \|V - \bar{V}\|_{\infty, 1/\tau},

where \alpha = \max_{s \in S} \frac{\tau(s) - 1}{\tau(s)} < 1.
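As a concrete illustration of the two operators, the following Python sketch evaluates (T V)(s) and (T^\pi V)(s) on a small MDP. The two-state MDP (the arrays P and R) is invented for illustration, not taken from the lecture; rows of P may sum to less than one, with the remaining probability mass going to a terminal state whose value is fixed at zero.

```python
import numpy as np

# Invented 2-state, 2-action MDP. P[a][s, s'] is the probability of moving
# from s to s' under action a; missing row mass goes to the terminal state.
P = [np.array([[0.5, 0.3],
               [0.2, 0.4]]),
     np.array([[0.1, 0.6],
               [0.3, 0.3]])]
R = np.array([[1.0, 0.5],   # R[s, a]
              [0.0, 2.0]])

def bellman_T(V):
    """(T V)(s) = max_a { R(s,a) + sum_s' P_sa(s') V(s') }."""
    Q = np.stack([R[:, a] + P[a] @ V for a in range(len(P))], axis=1)
    return Q.max(axis=1)

def bellman_T_pi(V, pi):
    """(T^pi V)(s) = sum_a pi(a|s) [ R(s,a) + sum_s' P_sa(s') V(s') ].

    pi[s, a] gives the probability of action a in state s."""
    Q = np.stack([R[:, a] + P[a] @ V for a in range(len(P))], axis=1)
    return (pi * Q).sum(axis=1)

# With V = 0, one application of T is just a greedy one-step lookahead.
print(bellman_T(np.zeros(2)))   # -> [1. 2.]
```

Applying T to V = 0 returns max_a R(s, a), and T^\pi with a uniformly random policy returns the average one-step reward, which matches the definitions above term by term.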
Proof. First recall from Definition 2 that the maximum survival time of state s, denoted \tau(s), satisfies

    \tau(s) = 1 + \max_a \sum_{s'} P_{sa}(s') \tau(s').

Rearranging this expression yields

    \max_a \sum_{s'} P_{sa}(s') \tau(s') = \tau(s) - 1,

and hence

    \max_a \sum_{s'} P_{sa}(s') \frac{\tau(s')}{\tau(s)} = \frac{\tau(s) - 1}{\tau(s)} \le \alpha,

by definition of \alpha. Then, by the definitions of T and \|\cdot\|_{\infty, 1/\tau}, we have

    \|T V - T \bar{V}\|_{\infty, 1/\tau}
      = \max_{s \in S} \frac{1}{\tau(s)} \Big| \max_a \Big\{ R(s,a) + \sum_{s'} P_{sa}(s') V(s') \Big\} - \max_a \Big\{ R(s,a) + \sum_{s'} P_{sa}(s') \bar{V}(s') \Big\} \Big|
      \le \max_{s} \frac{1}{\tau(s)} \max_a \Big| \sum_{s'} P_{sa}(s') \big( V(s') - \bar{V}(s') \big) \Big|
      \le \max_{s} \frac{1}{\tau(s)} \max_a \sum_{s'} P_{sa}(s') \big| V(s') - \bar{V}(s') \big| \frac{\tau(s')}{\tau(s')}
      \le \max_{s} \max_a \sum_{s'} P_{sa}(s') \frac{\tau(s')}{\tau(s)} \cdot \max_{s''} \frac{|V(s'') - \bar{V}(s'')|}{\tau(s'')}
      \le \alpha \, \|V - \bar{V}\|_{\infty, 1/\tau},

as required. (The first inequality uses the fact that a difference of maxima is bounded by the maximum of the differences.)

Proposition 5. For all V, \bar{V}, and all policies \pi,

    \|T^\pi V - T^\pi \bar{V}\|_{\infty, 1/\tau} \le \alpha \, \|V - \bar{V}\|_{\infty, 1/\tau},

where \alpha = \max_{s \in S} \frac{\tau(s) - 1}{\tau(s)} < 1.
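The contraction bound in Propositions 4 and 5 can be checked numerically. The sketch below, on an invented 3-state MDP (all numbers are illustrative assumptions), computes \tau as the fixed point of \tau = 1 + \max_a P_a \tau by simple iteration, forms \alpha = \max_s (\tau(s) - 1)/\tau(s), and verifies the inequality for a random pair of value functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 3-state, 2-action MDP; each row of P[a] sums to less than 1,
# with the remaining mass going to a terminal state, so tau is finite.
P = [np.array([[0.4, 0.3, 0.2],
               [0.1, 0.5, 0.2],
               [0.3, 0.1, 0.4]]),
     np.array([[0.2, 0.2, 0.5],
               [0.4, 0.1, 0.3],
               [0.1, 0.6, 0.1]])]
R = rng.standard_normal((3, 2))

def T(V):
    return np.stack([R[:, a] + P[a] @ V for a in range(2)], axis=1).max(axis=1)

# Maximum survival time: fixed point of tau = 1 + max_a P_a tau.
tau = np.ones(3)
for _ in range(1000):
    tau = 1 + np.stack([P[a] @ tau for a in range(2)], axis=1).max(axis=1)

alpha = ((tau - 1) / tau).max()   # contraction modulus, < 1

def wnorm(V):
    """Weighted max-norm ||V||_{inf, 1/tau}."""
    return np.max(np.abs(V) / tau)

# Empirically check ||TV - T Vbar||_{1/tau} <= alpha ||V - Vbar||_{1/tau}.
V, Vbar = rng.standard_normal(3), rng.standard_normal(3)
assert wnorm(T(V) - T(Vbar)) <= alpha * wnorm(V - Vbar) + 1e-12
```

The fixed-point iteration for \tau converges here because the map \tau \mapsto 1 + \max_a P_a \tau is itself a sup-norm contraction when every row of each P_a sums to less than one.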
where \alpha = \max_{s \in S} \frac{\tau(s) - 1}{\tau(s)} < 1.

Proof. The proof is analogous to the proof of Proposition 4 above.

Proposition 6. T has a unique fixed point.

Note that Proposition 6 is true for any contraction, and is known as the contraction mapping theorem, or the Banach fixed-point theorem.

Proof. Existence: Consider the sequence \{V_i\}, where V_i = T V_{i-1}, beginning with V_0 = V (i.e., the sequence V, TV, T^2 V, \ldots). By Proposition 4,

    \|T V - T^2 V\|_{\infty, 1/\tau} \le \alpha \|V - T V\|_{\infty, 1/\tau},
    \|T^2 V - T^3 V\|_{\infty, 1/\tau} \le \alpha \|T V - T^2 V\|_{\infty, 1/\tau} \le \alpha^2 \|V - T V\|_{\infty, 1/\tau},
    \vdots
    \|T^k V - T^{k+1} V\|_{\infty, 1/\tau} \le \alpha^k \|V - T V\|_{\infty, 1/\tau}.

Since \alpha < 1, the sequence V, TV, T^2 V, \ldots is a Cauchy sequence, and as it lies in a Euclidean space, which is complete, the sequence must converge. Since the sequence converges, its limit V^* satisfies V^* = T V^*.

Uniqueness: Suppose some other \hat{V} \ne V^* is a fixed point of T. We then have \hat{V} = T \hat{V} and

    \|V^* - \hat{V}\|_{\infty, 1/\tau} = \|T V^* - T \hat{V}\|_{\infty, 1/\tau} \le \alpha \|V^* - \hat{V}\|_{\infty, 1/\tau},

where the inequality follows from Proposition 4. As \alpha < 1, we must have V^* = \hat{V}, and therefore the fixed point is unique.

Proposition 7. For all \pi, T^\pi has a unique fixed point.

Proof. The proof is analogous to the proof of Proposition 6 above.

Note that Propositions 6 and 7 are not necessarily true for MDPs with infinite state spaces, as there can be value functions with infinite norm.

With these two properties, monotonicity and contraction of T and T^\pi, in hand, we can next prove that both policy iteration and value iteration converge.

2 Planning Algorithms

2.1 Value Iteration

Algorithm 1 Value Iteration
    Initialize V_0
    for k = 0, 1, 2, \ldots do
        V_{k+1} = T V_k

Proposition 8. V_k converges to V^*.

Proof. This is a corollary of Proposition 6.
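A minimal Python sketch of value iteration on a small invented two-state absorbing MDP (the arrays P and R are illustrative assumptions): iterate V_{k+1} = T V_k from V_0 = 0 until the updates stop changing, at which point the iterate satisfies the Bellman equation V = T V.

```python
import numpy as np

# Invented absorbing MDP: sub-stochastic rows, terminal value fixed at 0.
P = [np.array([[0.5, 0.3],
               [0.2, 0.4]]),
     np.array([[0.1, 0.6],
               [0.3, 0.3]])]
R = np.array([[1.0, 0.5],
              [0.0, 2.0]])

def T(V):
    """Bellman optimality operator for this MDP."""
    return np.stack([R[:, a] + P[a] @ V for a in range(2)], axis=1).max(axis=1)

# Algorithm 1: V_{k+1} = T V_k, starting from V_0 = 0.
V = np.zeros(2)
for k in range(500):
    V_next = T(V)
    if np.max(np.abs(V_next - V)) < 1e-12:
        break
    V = V_next

# At convergence, V satisfies the Bellman equation V = T V.
assert np.allclose(V, T(V))
```

The stopping test uses the plain max-norm for simplicity; by the contraction property, the gap \|V_k - V^*\| shrinks geometrically, so the loop terminates well before the iteration cap.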
2.2 Policy Iteration

Algorithm 2 Policy Iteration
    Initialize \pi_0
    for k = 0, 1, 2, \ldots do
        Solve V_k = T^{\pi_k} V_k
        Select \pi_{k+1} s.t. T^{\pi_{k+1}} V_k = T V_k

Note that if V_k = V^*, then \pi_{k+1} is optimal, and if \pi_k is optimal, then V_k = V^*.

Proposition 9. In policy iteration, there exists k_0 such that V_k = V^* for all k \ge k_0.

Proof. First, V_k = T^{\pi_k} V_k by Algorithm 2, and T^{\pi_k} V_k \le T V_k by the definitions of T^{\pi_k} and T. Furthermore, T V_k = T^{\pi_{k+1}} V_k by Algorithm 2, and therefore

    V_k \le T^{\pi_{k+1}} V_k \le T_{\pi_{k+1}}^2 V_k \le T_{\pi_{k+1}}^3 V_k \le \cdots \le V_{k+1},

where the intermediate inequalities follow from the monotonicity of T^{\pi_{k+1}}, and the last inequality follows from the fact that V_{k+1} is the fixed point of T^{\pi_{k+1}}, so the sequence T_{\pi_{k+1}}^j V_k must converge to V_{k+1}. Thus the sequence \{V_0, V_1, \ldots\} is a monotonically increasing sequence bounded above by V^*, and the sequence must therefore converge. As there are finitely many deterministic policies, this convergence must happen in finite time. Now suppose the sequence of V_k's has converged, so that V_{k+1} = V_k. We then have V_{k+1} = T^{\pi_{k+1}} V_{k+1} and T^{\pi_{k+1}} V_k = T V_k, which together imply V_k = T V_k, and therefore V_k = V^*. Thus the sequence \{V_0, V_1, \ldots\} converges to V^*, as required.

2.3 Linear Programming

We look to solve the following linear program (LP), which is the dual of the LP presented in Lecture 1:

    \min_V \sum_{s} \rho(s) V(s)
    \text{s.t. } V \ge T V,

where \rho(s) is the initial state distribution.

Proposition 10. If \rho(s) > 0 for all s \in S, then V^* is the unique optimum.

Note that since V^* does not depend on \rho(s), if necessary we can enforce \rho(s) > 0 for all s \in S by simply replacing \rho(s) in the objective with another strictly positive function, without changing the optimal V.

Proof. First, V^* is feasible, as V^* = T V^*, thus satisfying the constraint. Now, for any feasible V,

    V \ge T V \ge T^2 V \ge \cdots \ge T^k V \ge \cdots

by the monotonicity of T. Furthermore, since T is a contraction with fixed point V^* (i.e., T V^* = V^*), the sequence V, T V, \ldots must converge to V^*. Therefore every feasible V dominates V^* (i.e., V \ge V^*). Since \rho(s) > 0 for all s, it follows that V^* minimizes the objective \sum_s \rho(s) V(s), and V^* is the unique optimal solution, as required.
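Algorithm 2 can be sketched as follows, on the same kind of invented two-state absorbing MDP (all data illustrative). Policy evaluation solves V_k = T^{\pi_k} V_k exactly as the linear system (I - P_\pi) V = R_\pi, which is invertible here because the rows of P_\pi are sub-stochastic; the improvement step selects the greedy policy with respect to V_k.

```python
import numpy as np

P = [np.array([[0.5, 0.3],
               [0.2, 0.4]]),
     np.array([[0.1, 0.6],
               [0.3, 0.3]])]
R = np.array([[1.0, 0.5],
              [0.0, 2.0]])
n_states, n_actions = 2, 2

def evaluate(pi):
    """Solve V = T^pi V for a deterministic policy pi: V = (I - P_pi)^{-1} R_pi."""
    P_pi = np.stack([P[pi[s]][s] for s in range(n_states)])
    R_pi = np.array([R[s, pi[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - P_pi, R_pi)

def greedy(V):
    """Select pi_{k+1} such that T^{pi_{k+1}} V = T V."""
    Q = np.stack([R[:, a] + P[a] @ V for a in range(n_actions)], axis=1)
    return Q.argmax(axis=1)

pi = np.zeros(n_states, dtype=int)   # arbitrary initial policy pi_0
for k in range(20):                  # finitely many deterministic policies
    V = evaluate(pi)
    pi_next = greedy(V)
    if np.array_equal(pi_next, pi):  # policy stable => V_k = V*
        break
    pi = pi_next

# On termination, V satisfies the Bellman optimality equation V = T V.
TV = np.stack([R[:, a] + P[a] @ V for a in range(n_actions)], axis=1).max(axis=1)
assert np.allclose(V, TV)
```

Consistent with Proposition 9, the loop terminates in finitely many steps, since each iteration either strictly improves V_k or leaves the policy unchanged.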
2.4 Asynchronous Value Iteration

Algorithm 3 Asynchronous Value Iteration
    Initialize V_0
    for k = 0, 1, 2, \ldots do
        Select some state s_k \in S
        V(s_k) := (T V)(s_k)

Note that unlike in Algorithm 1, where we update every state synchronously, here we update only a single state at a time.

Proposition 11. If every s \in S appears in the sequence (s_0, s_1, \ldots) infinitely often, then V converges to V^*.

Proof. Homework.
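A sketch of Algorithm 3 on the same invented two-state MDP, selecting states round-robin so that every state trivially appears in the sequence (s_0, s_1, \ldots) infinitely often, as Proposition 11 requires:

```python
import numpy as np

P = [np.array([[0.5, 0.3],
               [0.2, 0.4]]),
     np.array([[0.1, 0.6],
               [0.3, 0.3]])]
R = np.array([[1.0, 0.5],
              [0.0, 2.0]])

def T(V):
    """Bellman optimality operator for this MDP."""
    return np.stack([R[:, a] + P[a] @ V for a in range(2)], axis=1).max(axis=1)

# Algorithm 3: update one state per step, visiting states round-robin.
V = np.zeros(2)
for k in range(2000):
    s_k = k % 2            # state selected at step k
    V[s_k] = T(V)[s_k]     # in-place, Gauss-Seidel-style update

assert np.allclose(V, T(V))   # V has converged to the fixed point V*
```

Because later updates immediately see the components updated earlier in the sweep, this Gauss-Seidel-style variant often converges in fewer sweeps than the synchronous Algorithm 1, at the cost of one full lookahead per single-state update in this naive sketch.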