Markov Decision Processes and Solving Finite Problems. February 8, 2017

Size: px

Start display at page:

Download "Markov Decision Processes and Solving Finite Problems. February 8, 2017"

Susanna Griffith
6 years ago
Views:

1 Markov Decision Processes and Solving Finite Problems February 8, 2017

2 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15: Learning Q-functions: Q-learning, SARSA, and others Feb 22: Advanced Q-functions: replay buffers, target networks, double Q-learning next... Advanced model learning and imitation learning next... Advanced policy gradient methods, and the exploration problem

3 Overview for This Lecture This lecture assumes you have a known system with a finite number of states and actions. How to exactly solve for optimal policy Value iteration Policy iteration Modified policy iteration

4 How Does This Lecture Fit In? Value Iteration deep Q-network methods small updates + neural nets small updates + neural nets Policy Iteration deep policy gradient methods

5 Markov Decision Process Defined by the following components: S: state space, a set of states of the environment. A: action space, a set of actions, which the agent selects from at each timestep. P(r, s s, a): a transition probability distribution. Alternatively, P(s s, a) and one of R(s), R(s, a) or R(s, a, s )

6 Partially Observed MDPs Instead of observing full state s, agent observes y, with y P(y s). A MDP can be trivially mapped onto a POMDP A POMDP can be mapped onto an MDP: s 0 = {y 0 }, s 1 = {y 0, y 1 }, s 2 = {y 0, y 1, y 2 },...

7 Simple MDP: Frozen Lake Gym: FrozenLake-v0 START state, GOAL state, other locations are FROZEN (safe) or HOLE (unsafe). Episode terminates when GOAL or HOLE state is reached Receive reward=1 when entering GOAL, 0 otherwise 4 directions are actions, but you move in wrong direction with probability 0.5.

8 Policies Deterministic policies a = π(s) Stochastic policies a π(a s)

9 Problems Involving MDPs Policy optimization: maximize expected reward with respect to policy π [ ] maximize E r t π t=0 Policy evaluation: compute expected return for fixed policy π return := sum of future rewards in an episode (i.e., a trajectory) Discounted return: rt + γr t+1 + γ 2 r t Undiscounted return: r t + r t r T 1 + V (s T ) Performance of policy: η(π) = E [ t=0 γt r t ] State value function: V π (s) = E [ t=0 γt r t s 0 = s] State-action value function: Q π (s, a) = E [ t=0 γt r t s 0 = s, a 0 = a]

10 Value Iteration: Finite Horizon Case Problem: max π 0 max... max E [r 0 + r r T 1 + V T (s T )] π 1 π T 1 Swap maxes and expectations: [ max E π 0 r 0 + max E π 1 Solve innermost problem: for each s S [ r max π T 1 E [r T 1 + V T (s T )] π T 1 (s), V T 1 (s) = maximize E st [r T 1 + V T (s T )] a Original problem becomes max π 0 E r 0 + max E π 1 r max E [r T 1 + V T (s T )] π T 1 }{{} V T 1 (s) [ max E π 0 r 0 + max E π 1 ]] [ r max π T 2 E [r T 2 + V T 1 (s T 1 )] ]]

11 Value Iteration: Finite Horizon Case Algorithm 1 Finite Horizon Value Iteration for t = T 1, T 2,..., 0 do for s S do π t (s), V t (s) = maximize a E [r t + V t+1 (s t+1 )] end for end for

12 Discounted Setting Discount factor γ [0, 1), downweights future rewards Discounted return: r 0 + γr 1 + γ 2 r Coefficients (1, γ, γ 2,... ) informally, we re adding up 1 + γ + γ 2 + = 1/(1 γ) timesteps. Effective time horizon 1/(1 γ). Want to solve for policy that ll optimize discounted sum of rewards from each state. Discounted problem can be obtained by adding transitions to sink state (where agent gets stuck and receives zero reward) { P(s P(s s, a) with probability γ s, a) = sink state with probability 1 γ

13 Infinite-Horizon VI Via Finite-Horizon VI maximize E [ r 0 + γr 1 + γ 2 r ] π 0 π 1 π 2... Can rewrite with nested sum [ [ max E π 0 r 0 + γ max E π 1 r 1 + γ max E π 2 [ ]]] r Pretend there s finite horizon T, ignore r T, r T +1,... error ɛ rmax γ T /(1 γ) resulting nonstationary policy only suboptimal by ɛ π0, V 0 converges to optimal policy as T.

14 Infinite-Horizon VI Algorithm 2 Infinite-Horizon Value Iteration Initialize V (0) arbitrarily. for n = 0, 1, 2,... until termination condition do for s S do π (n+1) (s), V (n+1) (s) = maximize a E st [r T 1 + γv (n) (s T )] end for end for Note that V (n) is exactly V 0 in a finite-horizon problem with n timesteps.

15 Infinite-Horizon VI: Operator View V R S VI update is a function T : R S R S, called backup operator [T V ](s) = max E s a s,a [r + γv (s )] Algorithm 3 Infinite-Horizon Value Iteration (v2) Initialize V (0) arbitrarily. for n = 0, 1, 2,... until termination condition do V (n+1) = T V (n) end for

16 Contraction Mapping View Backup operator T is a contraction with modulus γ under -norm T V T W γ V W By contraction-mapping principle, B has a fixed point, called V, and iterates V, T V, T 2 V, V,γ.

17 Policy Evaluation Problem: how to evaluate fixed policy π: V π,γ (s) = E [ r 0 + γr 1 + γ 2 r s 0 = s ] Can consider finite-horizon problem E [r 0 + r r T 1 + v T (s T )] = E [r 0 + γe [r γe [r T 1 + V T (s T )]]] Backwards recursion involves a backup operation V t = T π V t+1, where [T π V ](s) = E s s,a=π(s) [r + γv (s )] T π is also a contraction with modulus γ, sequence V, T π V, (T π ) 2 V, V π,γ V = T π V is a linear equation that we can solve exactly: V (s) = s P(s s, a = π(s))[r(s, a, s ) + γv (s )]

18 Policy Iteration: Overview Alternate between 1. Evaluate policy π V π 2. Set new policy to be greedy policy for V π [ π(s) = arg max E s a s,a r + γv π (s ) ] Guaranteed to converge to optimal policy and value function in a finite number of iterations, when γ < 1 Value function converges faster than in value iteration M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons,

19 Policy Iteration: Operator Form Algorithm 4 Policy Iteration Initialize π (0). for n = 1, 2,... do V (n 1) = Solve[V = T π(n 1) V ] π (n) = GV π(n 1) end for

20 Policy Iteration: Convergence Policy sequence π (0), π (1), π (2),... is monotonically improving, with nondecreasing value function: V π(0) V π(1) V π(2).... Informal argument: Switch policy at first timestep from π (0) to π (1) Before: V (s0 ) = E s1 s 0,a 0=π(s 0) [r 0 + γv π (s 1 )] After: V (s0 ) = max a0 E s1 s 0,a 0=π(s 0) [r 0 + γv (s 1 )] Vπ (1) π (0) π (0) π (0)... V π (0) π (0) π (0) π (0)... Vπ (1) π (1) π (0) π (0)... V π (1) π (0) π (0) π (0) V π (1) π (1) π (1) π (1)... V π (0) π (0) π (0) π (0)... If the value function does not increase, then we re done: V π (n) = V π (n+1) V π (n) = T V π (n) π (n) = π.

21 Modified Policy Iteration Update π to be the greedy policy, then value function with k backups (k-step lookahead) Algorithm 5 Modified Policy Iteration Initialize V (0). for n = 1, 2,... do π (n+1) = GV (n) V (n+1) = (T π(n+1) ) k V (n), for integer k 1 end for k = 1: value iteration k = : policy iteration See Puterman s textbook 2 for more details M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons,

22 The End Homework: will be released later today or early tomorrow, due on Feb 22 Next time: policy gradient methods: infinitesimal policy iteration

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)