Understanding (Exact) Dynamic Programming through Bellman Operators

Size: px

Start display at page:

Download "Understanding (Exact) Dynamic Programming through Bellman Operators"

Whitney Powers
5 years ago
Views:

1 Understanding (Exact) Dynamic Programming through Bellman Operators Ashwin Rao ICME, Stanford University January 15, 2019 Ashwin Rao (Stanford) Bellman Operators January 15, / 11

2 Overview 1 Value Functions as Vectors 2 Bellman Operators 3 Contraction and Monotonicity 4 Policy Evaluation 5 Policy Iteration 6 Value Iteration 7 Policy Optimality Ashwin Rao (Stanford) Bellman Operators January 15, / 11

3 Value Functions as Vectors Assume State pace S consists of n states: {s 1, s 2,..., s n } Assume Action space A consists of m actions {a 1, a 2,..., a m } This exposition extends easily to continuous state/action spaces too We denote a stochastic policy as π(a s) (probability of a given s ) Abusing notation, deterministic policy denoted as π(s) = a Consider n-dim space R n, each dim corresponding to a state in S Think of a Value Function (VF) v: S R as a vector in this space With coordinates [v(s 1 ), v(s 2 ),..., v(s n )] Value Function (VF) for a policy π is denoted as v π : S R Optimal VF denoted as v : S R such that for any s S, v (s) = max v π(s) π Ashwin Rao (Stanford) Bellman Operators January 15, / 11

4 Some more notation Denote R a s as the Expected Reward upon action a in state s Denote Ps,s a as the probability of transition s s upon action a Define R π (s) = π(a s) R a s a A P π (s, s ) = a A π(a s) P a s,s Denote R π as the vector [R π (s 1 ), R π (s 2 ),..., R π (s n )] Denote P π as the matrix [P π (s i, s i )], 1 i, i n Denote γ as the MDP discount factor Ashwin Rao (Stanford) Bellman Operators January 15, / 11

5 Bellman Operators B π and B We define operators that transform a VF vector to another VF vector Bellman Policy Operator B π (for policy π) operating on VF vector v: B π v = R π + γp π v B π is a linear operator with fixed point v π, meaning B π v π = v π Bellman Optimality Operator B operating on VF vector v: (B v)(s) = max a {Ra s + γ Ps,s a v(s )} s S B is a non-linear operator with fixed point v, meaning B v = v Define a function G mapping a VF v to a deterministic greedy policy G(v) as follows: G(v)(s) = arg max{r a s + γ Ps,s a v(s )} a s S B G(v) v = B v for any VF v (Policy G(v) achieves the max in B ) Ashwin Rao (Stanford) Bellman Operators January 15, / 11

6 Contraction and Monotonicity of Operators Both B π and B are γ-contraction operators in L norm, meaning: For any two VFs v 1 and v 2, B π v 1 B π v 2 γ v 1 v 2 B v 1 B v 2 γ v 1 v 2 So we can invoke Contraction Mapping Theorem to claim fixed point We use the notation v 1 v 2 for any two VFs v 1, v 2 to mean: v 1 (s) v 2 (s) for all s S Also, both B π and B are monotonic, meaning: For any two VFs v 1 and v 2, v 1 v 2 B π v 1 B π v 2 v 1 v 2 B v 1 B v 2 Ashwin Rao (Stanford) Bellman Operators January 15, / 11

7 Policy Evaluation B π satisfies the conditions of Contraction Mapping Theorem B π has a unique fixed point v π, meaning B π v π = v π This is a succinct representation of Bellman Expectation Equation Starting with any VF v and repeatedly applying B π, we will reach v π lim N BN π v = v π for any VF v This is a succinct representation of the Policy Evaluation Algorithm Ashwin Rao (Stanford) Bellman Operators January 15, / 11

8 Policy Improvement Let π k and v πk denote the Policy and the VF for the Policy in iteration k of Policy Iteration Policy Improvement Step is: π k+1 = G(v πk ), i.e. deterministic greedy Earlier we argued that B v = B G(v) v for any VF v. Therefore, B v πk = B G(vπk )v πk = B πk+1 v πk (1) We also know from operator definitions that B v B π v for all π, v Combining (1) and (2), we get: B v πk B πk v πk = v πk (2) Monotonicity of B πk+1 implies B πk+1 v πk v πk B N π k+1 v πk... B 2 π k+1 v πk B πk+1 v πk v πk v πk+1 = lim N BN π k+1 v πk v πk Ashwin Rao (Stanford) Bellman Operators January 15, / 11

9 Policy Iteration We have shown that in iteration k + 1 of Policy Iteration, v πk+1 v πk If v πk+1 = v πk, the above inequalities would hold as equalities So this would mean B v πk = v πk But B has a unique fixed point v So this would mean v πk = v Thus, at each iteration, Policy Iteration either strictly improves the VF or achieves the optimal VF v Ashwin Rao (Stanford) Bellman Operators January 15, / 11

10 Value Iteration B satisfies the conditions of Contraction Mapping Theorem B has a unique fixed point v, meaning B v = v This is a succinct representation of Bellman Optimality Equation Starting with any VF v and repeatedly applying B, we will reach v lim N BN v = v for any VF v This is a succinct representation of the Value Iteration Algorithm Ashwin Rao (Stanford) Bellman Operators January 15, / 11

11 Greedy Policy from Optimal VF is an Optimal Policy Earlier we argued that B G(v) v = B v for any VF v. Therefore, B G(v )v = B v But v is the fixed point of B, meaning B v = v. Therefore, B G(v )v = v But we know that B G(v ) has a unique fixed point v G(v ). Therefore, v = v G(v ) This says that simply following the deterministic greedy policy G(v ) (created from the Optimal VF v ) in fact achieves the Optimal VF v In other words, G(v ) is an Optimal (Deterministic) Policy Ashwin Rao (Stanford) Bellman Operators January 15, / 11

Reinforcement Learning

Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value