Reinforcement Learning and Optimal Control. ASU, CSE 691, Winter 2019

Reinforcement Learning and Optimal Control
ASU, CSE 691, Winter 2019
Dimitri P. Bertsekas
Lecture 8

Outline
1 Review of Infinite Horizon Problems
2 Exact Policy Iteration
3 Approximations with Policy Iteration

Stochastic DP Problems - Infinite Horizon

[Figure: random transitions x_{k+1} = f(x_k, u_k, w_k) and random stage costs g(x_k, u_k, w_k), continuing over an infinite horizon.]

Infinite number of stages, and stationary system and cost:
System x_{k+1} = f(x_k, u_k, w_k) with state, control, and random disturbance.
Policies π = {µ_0, µ_1, ...} with µ_k(x) ∈ U(x) for all x and k.
Special scalar α with 0 < α ≤ 1. If α < 1 the problem is called discounted.
Cost of stage k: α^k g(x_k, µ_k(x_k), w_k).
Cost of a policy: J_π(x_0) = lim_{N→∞} E_{w_k} { Σ_{k=0}^{N−1} α^k g(x_k, µ_k(x_k), w_k) }.
Optimal cost function: J*(x_0) = min_π J_π(x_0).
If α = 1 we assume a special cost-free termination state t. The objective is to reach t at minimum expected cost. The problem is called a stochastic shortest path (SSP) problem.
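
For the finite-state discussion that follows, it helps to fix a concrete encoding. Below is a minimal sketch (hypothetical example data, not from the lecture) of a finite-state, finite-control discounted problem as numpy arrays; the later sketches assume this same P/G/alpha layout, with every control available at every state for simplicity.

```python
import numpy as np

# Hypothetical finite-state discounted problem: n = 2 states, m = 2 controls.
n, m = 2, 2
alpha = 0.9                                   # discount factor, 0 < alpha < 1

# P[i, u, j] = p_ij(u): probability of moving from state i to j under control u.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])

# G[i, u, j] = g(i, u, j): cost of the transition from i to j under control u.
G = np.array([[[1.0, 2.0], [0.5, 3.0]],
              [[2.0, 0.0], [4.0, 1.0]]])

# Expected one-stage cost of the pair (i, u): sum_j p_ij(u) * g(i, u, j).
g_bar = np.einsum('iuj,iuj->iu', P, G)
print(g_bar)
```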

Main Results - Finite-State Notation - Discounted Problems

Convergence of VI: Given any initial conditions J_0(1), ..., J_0(n), the sequence {J_k(i)} generated by VI
    J_{k+1}(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J_k(j) ),   i = 1, ..., n,
converges to J*(i) for each i.

Bellman's equation: The optimal cost function J* = (J*(1), ..., J*(n)) satisfies the equation
    J*(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J*(j) ),   i = 1, ..., n,
and is the unique solution of this equation.

Optimality condition: A stationary policy µ is optimal if and only if for every state i, µ(i) attains the minimum in the Bellman equation.
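
A minimal sketch of the VI algorithm above, assuming the hypothetical P/G/alpha array layout introduced earlier; the greedy policy returned at the end attains the minimum in Bellman's equation for the final iterate.

```python
import numpy as np

def value_iteration(P, G, alpha, tol=1e-10, max_iter=10_000):
    """VI: J_{k+1}(i) = min_u sum_j p_ij(u) * (g(i,u,j) + alpha * J_k(j))."""
    n = P.shape[0]
    J = np.zeros(n)                                   # any initial conditions work
    for _ in range(max_iter):
        # Q[i, u] = sum_j p_ij(u) * (g(i, u, j) + alpha * J(j))
        Q = np.einsum('iuj,iuj->iu', P, G) + alpha * P @ J
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < tol:           # sup-norm stopping test
            J = J_new
            break
        J = J_new
    mu = Q.argmin(axis=1)                             # greedy policy for the final J
    return J, mu
```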

Additional Results: Bellman Equation and Value Iteration for Policies

Fix a policy µ with cost function J_µ. Change the problem so the only control available at i is just µ(i) [not the set U(i)]. Apply our Bellman equation and VI convergence results:

The VI algorithm (for policy µ),
    J_{k+1}(i) = Σ_{j=1}^n p_{ij}(µ(i)) ( g(i, µ(i), j) + α J_k(j) ),   i = 1, ..., n,
converges to the cost J_µ(i) for each i, for any initial conditions J_0(1), ..., J_0(n).

J_µ is the unique solution of the Bellman equation (of policy µ)
    J_µ(i) = Σ_{j=1}^n p_{ij}(µ(i)) ( g(i, µ(i), j) + α J_µ(j) ),   i = 1, ..., n.

Solving this linear system of n equations with n unknowns, the costs J_µ(i), is called evaluation of policy µ.
Evaluation of µ can be done by exact solution of the Bellman equation (e.g., Gaussian elimination), or iteratively with the VI algorithm (most likely for large n).
Similar results hold for SSP problems.
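
A minimal sketch of both evaluation options, again assuming the hypothetical P/G/alpha layout (the policy mu is an integer array with mu[i] the control used at state i): an exact linear solve of (I − α P_µ) J_µ = ḡ_µ, and the iterative VI variant preferred when n is large.

```python
import numpy as np

def evaluate_policy(P, G, alpha, mu):
    """Exact evaluation: solve (I - alpha * P_mu) J_mu = g_mu for J_mu."""
    n = P.shape[0]
    idx = np.arange(n)
    P_mu = P[idx, mu, :]                                    # P_mu[i, j] = p_ij(mu(i))
    g_mu = np.einsum('ij,ij->i', P_mu, G[idx, mu, :])       # expected stage cost under mu
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

def evaluate_policy_vi(P, G, alpha, mu, num_iters=1000):
    """Iterative evaluation: repeat J <- T_mu J (VI for the policy mu)."""
    n = P.shape[0]
    idx = np.arange(n)
    P_mu = P[idx, mu, :]
    g_mu = np.einsum('ij,ij->i', P_mu, G[idx, mu, :])
    J = np.zeros(n)                                         # any initial conditions work
    for _ in range(num_iters):
        J = g_mu + alpha * P_mu @ J
    return J
```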

Shorthand Notation

We introduce the DP operators
    (T_µ J)(i) = Σ_{j=1}^n p_{ij}(µ(i)) ( g(i, µ(i), j) + α J(j) ),   i = 1, ..., n,
    (T J)(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J(j) ),   i = 1, ..., n.

They provide convenience of notation AND a vehicle for unification.
T_µ and T form the mathematical "signature" of a DP problem, and serve to unify the DP theory (extensions to minimax, games, infinite spaces problems, etc.).
Their critical property is monotonicity (as J increases, so do T_µ J and TJ); see the "Abstract DP" book (DPB, 2018).
All the DP results/algorithms can be written in math shorthand using T and T_µ:
VI algorithm: J_{k+1} = T J_k, J_{k+1} = T_µ J_k, k = 0, 1, ...
Bellman equation: J* = T J*, J_µ = T_µ J_µ.
µ is optimal if and only if T J* = T_µ J*.
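
As a sketch of how the shorthand translates into code (hypothetical P/G/alpha layout as before), the two operators are just the mappings used inside VI and policy evaluation; VI is J ← T J, and J_µ solves J = T_µ J.

```python
import numpy as np

def T(J, P, G, alpha):
    """(T J)(i) = min_u sum_j p_ij(u) * (g(i,u,j) + alpha * J(j))."""
    Q = np.einsum('iuj,iuj->iu', P, G) + alpha * P @ J
    return Q.min(axis=1)

def T_mu(J, P, G, alpha, mu):
    """(T_mu J)(i) = sum_j p_ij(mu(i)) * (g(i,mu(i),j) + alpha * J(j))."""
    idx = np.arange(len(J))
    P_mu, G_mu = P[idx, mu, :], G[idx, mu, :]
    return np.einsum('ij,ij->i', P_mu, G_mu) + alpha * P_mu @ J

# In this shorthand: VI is J <- T(J, ...), policy evaluation solves J = T_mu(J, ...),
# and Bellman's equation is J* = T(J*, ...).
```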

Contraction Property of T and T_µ

    (T_µ J)(i) = Σ_{j=1}^n p_{ij}(µ(i)) ( g(i, µ(i), j) + α J(j) ),   i = 1, ..., n,
    (T J)(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J(j) ),   i = 1, ..., n.

In our discounted and SSP problems, T and T_µ are contractions.
Introduce a (weighted max) norm for the vectors J = (J(1), ..., J(n)):
    ||J|| = max_{i=1,...,n} |J(i)| / v(i),
where v(1), ..., v(n) are some positive scalars.
Definition: A mapping H that maps J = (J(1), ..., J(n)) to the vector HJ = ((HJ)(1), ..., (HJ)(n)) is a contraction if for some ρ with 0 < ρ < 1,
    ||HJ − HJ'|| ≤ ρ ||J − J'||,   for all J, J'.
For our discounted and SSP problems, under our assumptions, T and T_µ are contractions (in addition to being monotone). For the discounted problem, ρ = α and v(i) ≡ 1.
This is the mathematical reason why our problems are so nice!
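
As a quick check of the claim for the discounted case with v(i) ≡ 1 (the unweighted max norm), here is a sketch of the standard argument that T_µ is an α-contraction; the same bound for T follows because taking a pointwise minimum over u can only shrink the difference.

```latex
(T_\mu J)(i) - (T_\mu J')(i)
  = \alpha \sum_{j=1}^n p_{ij}\bigl(\mu(i)\bigr)\,\bigl(J(j) - J'(j)\bigr)
  \;\le\; \alpha \max_{j=1,\ldots,n} \bigl|J(j) - J'(j)\bigr|
  \;=\; \alpha\,\|J - J'\|
```

Exchanging the roles of J and J' gives the same bound for the absolute difference, so ||T_µ J − T_µ J'|| ≤ α ||J − J'||.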

Policy Iteration (PI) Algorithm: Generates a Sequence of Policies {µ^k}

[Figure: starting from an initial policy, PI cycles between policy cost evaluation (evaluate the cost function J_µ of the current policy µ) and policy improvement (generate an improved policy µ̄).]

Given the current policy µ^k, a PI iteration consists of two phases:
Policy evaluation computes J_{µ^k}(i), i = 1, ..., n, as the solution of the (linear) Bellman equation system
    J_{µ^k}(i) = Σ_{j=1}^n p_{ij}(µ^k(i)) ( g(i, µ^k(i), j) + α J_{µ^k}(j) ),   i = 1, ..., n.
Policy improvement then computes a new policy µ^{k+1} as
    µ^{k+1}(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J_{µ^k}(j) ),   i = 1, ..., n.
Compactly (in shorthand): PI is written as T_{µ^{k+1}} J_{µ^k} = T J_{µ^k}.
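
A minimal sketch of exact PI under the hypothetical P/G/alpha layout from earlier: policy evaluation by an exact linear solve, then greedy policy improvement, repeated until the policy stops changing (finite-step termination).

```python
import numpy as np

def policy_iteration(P, G, alpha, mu0=None):
    """Exact PI: alternate evaluation and improvement until the policy repeats."""
    n = P.shape[0]
    idx = np.arange(n)
    mu = np.zeros(n, dtype=int) if mu0 is None else mu0.copy()
    while True:
        # Policy evaluation: solve the linear Bellman system for J_mu.
        P_mu = P[idx, mu, :]
        g_mu = np.einsum('ij,ij->i', P_mu, G[idx, mu, :])
        J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        # Policy improvement: mu_new(i) attains min_u sum_j p_ij(u)(g(i,u,j) + alpha J_mu(j)).
        Q = np.einsum('iuj,iuj->iu', P, G) + alpha * P @ J_mu
        mu_new = Q.argmin(axis=1)
        if np.array_equal(mu_new, mu):     # no change: mu is optimal
            return J_mu, mu
        mu = mu_new
```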

Proof of Policy Improvement Property

PI finite-step convergence: PI generates an improving sequence of policies, i.e., J_{µ^{k+1}}(i) ≤ J_{µ^k}(i) for all i and k, and terminates with an optimal policy.

We will show that J_{µ̄} ≤ J_µ, where µ̄ is obtained from µ by PI.
Denote by J_N the cost function of a policy that applies µ̄ for the first N stages and applies µ thereafter.
We have the Bellman equation
    J_µ(i) = Σ_{j=1}^n p_{ij}(µ(i)) ( g(i, µ(i), j) + α J_µ(j) ),
so
    J_1(i) = Σ_{j=1}^n p_{ij}(µ̄(i)) ( g(i, µ̄(i), j) + α J_µ(j) ) ≤ J_µ(i)   (by the policy improvement equation).
From the definition of J_2 and J_1, monotonicity, and the preceding relation, we have
    J_2(i) = Σ_{j=1}^n p_{ij}(µ̄(i)) ( g(i, µ̄(i), j) + α J_1(j) ) ≤ Σ_{j=1}^n p_{ij}(µ̄(i)) ( g(i, µ̄(i), j) + α J_µ(j) ) = J_1(i),
so J_2(i) ≤ J_1(i) ≤ J_µ(i) for all i. Continuing similarly, we obtain J_{N+1}(i) ≤ J_N(i) ≤ J_µ(i) for all i and N.
Since J_N → J_{µ̄} (VI for µ̄ converges), it follows that J_{µ̄} ≤ J_µ.

Optimistic PI: Like Standard PI, but Policy Evaluation is Approximate, Based on a Finite Number of VI Iterations

Generates sequences of cost function approximations {J_k} and policies {µ^k}.
Given the typical function J_k:
Policy improvement computes a policy µ^k such that
    µ^k(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J_k(j) ),   i = 1, ..., n.
Optimistic policy evaluation starts with Ĵ_{k,0} = J_k, and uses m_k VI iterations for policy µ^k to compute Ĵ_{k,1}, ..., Ĵ_{k,m_k} according to
    Ĵ_{k,m+1}(i) = Σ_{j=1}^n p_{ij}(µ^k(i)) ( g(i, µ^k(i), j) + α Ĵ_{k,m}(j) ),   i = 1, ..., n,  m = 0, ..., m_k − 1,
and sets J_{k+1} = Ĵ_{k,m_k}.

Convergence (using a cost improvement argument similar to standard PI): For the optimistic PI algorithm, we have J_k → J* and J_{µ^k} → J*.
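
A minimal sketch of optimistic PI under the hypothetical P/G/alpha layout: each outer iteration does one greedy improvement based on the current approximation J_k, then m_k value-iteration sweeps for the new policy in place of an exact evaluation (the number of outer iterations and m_k are arbitrary illustration values).

```python
import numpy as np

def optimistic_pi(P, G, alpha, num_outer=100, m_k=5):
    """Optimistic PI: policies are evaluated approximately with m_k VI sweeps."""
    n = P.shape[0]
    idx = np.arange(n)
    J = np.zeros(n)                                   # J_0
    for _ in range(num_outer):
        # Policy improvement based on the current approximation J_k.
        Q = np.einsum('iuj,iuj->iu', P, G) + alpha * P @ J
        mu = Q.argmin(axis=1)
        # Approximate evaluation: m_k value iterations for policy mu.
        P_mu = P[idx, mu, :]
        g_mu = np.einsum('ij,ij->i', P_mu, G[idx, mu, :])
        for _ in range(m_k):
            J = g_mu + alpha * P_mu @ J               # J-hat recursion; J_{k+1} = J-hat_{k, m_k}
    return J, mu
```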

Multistep Policy Iteration: Policy Improvement with Multistep Lookahead

Motivation: It may yield a better policy µ^{k+1} than with one-step lookahead, at the expense of a more complex policy improvement operation.

Given the typical policy µ^k:
Policy evaluation computes J_{µ^k}(i), i = 1, ..., n, as the solution of the (linear) system of Bellman equations
    J_{µ^k}(i) = Σ_{j=1}^n p_{ij}(µ^k(i)) ( g(i, µ^k(i), j) + α J_{µ^k}(j) ),   i = 1, ..., n.
Policy improvement with ℓ-step lookahead then solves the ℓ-stage problem with terminal cost function J_{µ^k}. If {µ̂_0, ..., µ̂_{ℓ−1}} is the optimal policy of this problem, then the new policy µ^{k+1} is µ̂_0.

Convergence (using similar argument to standard PI): Exact multistep PI has the same solid convergence properties as its one-step lookahead counterpart.
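
A minimal sketch of the ℓ-step lookahead improvement step, assuming the hypothetical P/G/alpha layout from earlier: the ℓ-stage problem with terminal cost J_{µ^k} is solved by backward DP, i.e., ℓ − 1 applications of the operator T followed by one more minimization whose argmin is the first-stage policy µ̂_0.

```python
import numpy as np

def multistep_improvement(P, G, alpha, J_mu, ell):
    """l-step lookahead improvement: backward DP over ell stages with terminal
    cost J_mu, returning the first-stage policy (this becomes mu^{k+1})."""
    g_bar = np.einsum('iuj,iuj->iu', P, G)
    J = J_mu.copy()
    for _ in range(ell - 1):                    # compute T^{ell-1} J_mu
        J = (g_bar + alpha * P @ J).min(axis=1)
    Q = g_bar + alpha * P @ J                   # first (in time) lookahead stage
    return Q.argmin(axis=1)                     # mu-hat_0
```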

Policy Iteration for Q-Factors (Can be Used in Model-Free/Monte Carlo Contexts)

[Figure: starting from an initial policy, cycle between policy Q-factor evaluation (evaluate the Q-factor Q_µ of the current policy µ) and policy improvement (generate an improved policy µ̄).]

Given the typical policy µ^k:
Policy evaluation computes Q_{µ^k}(i, u), for all i = 1, ..., n and u ∈ U(i), as the solution of the (linear) system of equations
    Q_{µ^k}(i, u) = Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α Q_{µ^k}(j, µ^k(j)) ).
Policy improvement then computes a new policy µ^{k+1} as
    µ^{k+1}(i) ∈ arg min_{u ∈ U(i)} Q_{µ^k}(i, u),   i = 1, ..., n.
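
A minimal model-based sketch of this Q-factor version under the hypothetical P/G/alpha layout: the linear system is solved by noting Q_µ(j, µ(j)) = J_µ(j), so J_µ is obtained first and the remaining Q-factors follow by one expectation. In a model-free context Q_µ would instead be estimated by Monte Carlo simulation; note that the improvement step uses only Q-factors and needs no transition model.

```python
import numpy as np

def evaluate_q_factors(P, G, alpha, mu):
    """Solve Q_mu(i,u) = sum_j p_ij(u) * (g(i,u,j) + alpha * Q_mu(j, mu(j))) exactly."""
    n = P.shape[0]
    idx = np.arange(n)
    P_mu = P[idx, mu, :]
    g_mu = np.einsum('ij,ij->i', P_mu, G[idx, mu, :])
    J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)    # J_mu(j) = Q_mu(j, mu(j))
    Q = np.einsum('iuj,iuj->iu', P, G) + alpha * P @ J_mu      # all Q-factors of mu
    return Q

def q_improvement(Q):
    """Policy improvement from Q-factors: mu_new(i) in argmin_u Q(i, u)."""
    return Q.argmin(axis=1)
```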

A Working Break: Think About Approximate PI

[Figure: the PI cycle of policy cost evaluation (evaluate the cost function J_µ of the current policy µ) and policy improvement (generate an improved policy µ̄).]

How would you introduce approximations into PI? What would make sense for:
Approximation in policy evaluation?
Approximation in policy improvement?
Give examples (problem approximation, rollout, MPC, neural nets, ...).

Approximation in Value Space for Infinite Horizon Problems

One-step lookahead: at the current state i, apply a control attaining the (possibly approximate) minimum
    min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J̃(j) ),
where the first step is treated exactly and the future is summarized by an approximation J̃.
Approximations in the minimization: replace E{·} with nominal values (certainty equivalence), adaptive simulation, Monte Carlo tree search.
Computation of J̃: problem approximation, rollout, approximate PI, parametric approximation, aggregation.

We will focus on rollout, and particularly on approximate PI schemes, which operate as follows:
Several policies µ^0, µ^1, ..., µ^m are generated, starting with an initial policy µ^0.
Each policy µ^k is evaluated approximately, with a cost function J̃_{µ^k}, often with the use of a parametric approximation/neural network approach.
The next policy µ^{k+1} is generated by policy improvement based on J̃_{µ^k}.
The approximate evaluation J̃_{µ^m} of the last policy in the sequence is used as the lookahead approximation J̃ in a one-step or multistep lookahead minimization.
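
A minimal sketch of the one-step lookahead control selection with a generic cost approximation, under the hypothetical P/G/alpha layout; here J_tilde is simply an array of approximate state costs, however it was produced (problem approximation, rollout values, a trained neural network, etc.).

```python
import numpy as np

def one_step_lookahead_control(i, P, G, alpha, J_tilde):
    """At state i, return the control attaining
       min_u sum_j p_ij(u) * (g(i,u,j) + alpha * J_tilde(j))."""
    q = np.einsum('uj,uj->u', P[i], G[i]) + alpha * P[i] @ J_tilde
    return int(np.argmin(q))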

Rollout

The pure form of rollout: approximation in value space with J̃ = J_µ.
µ is called the base policy, and is usually evaluated by Monte Carlo simulation.
The rollout policy is the result of a single policy improvement using µ. So the rollout policy improves over the base policy.

[Figure: lookahead tree from the current state i_k over states i_{k+1}, i_{k+2}, with selective-depth rollout using policy µ and a terminal cost approximation.]

Variants of rollout (ℓ-step lookahead, truncated rollout, terminal cost approximation): ℓ-step lookahead, then rollout with policy µ for a limited number of steps, and finally a terminal cost approximation. This is a single optimistic policy iteration combined with multistep lookahead.
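
A minimal sketch of pure rollout at a single state, under the hypothetical P/G/alpha layout: the base-policy costs J_µ(j) are estimated by Monte Carlo over truncated trajectories (horizon and num_sims are arbitrary illustration values), and the rollout control is then the one-step lookahead minimizer with J̃ = J_µ.

```python
import numpy as np

def rollout_control(i, P, G, alpha, base_policy, horizon=200, num_sims=100, rng=None):
    """Pure rollout at state i: one-step lookahead with J_tilde = J_mu,
    where J_mu is estimated by Monte Carlo simulation of the base policy."""
    rng = np.random.default_rng() if rng is None else rng
    n = P.shape[0]

    def mc_cost(j):
        # Average discounted cost of the base policy over truncated trajectories from j.
        total = 0.0
        for _ in range(num_sims):
            s, discount = j, 1.0
            for _ in range(horizon):
                u = base_policy[s]
                s_next = rng.choice(n, p=P[s, u])
                total += discount * G[s, u, s_next]
                discount *= alpha
                s = s_next
        return total / num_sims

    J_mu = np.array([mc_cost(j) for j in range(n)])            # base-policy cost estimates
    q = np.einsum('uj,uj->u', P[i], G[i]) + alpha * P[i] @ J_mu
    return int(np.argmin(q))                                    # the rollout control at state i
```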

Approximate (Nonoptimistic) Policy Iteration - Error Bound - Stability

[Figure: as the PI index k increases, the costs J_{µ^k} oscillate within an error zone above J* of width (ε + 2αδ)/(1 − α)².]

Assume an approximate policy evaluation error satisfying
    max_{i=1,...,n} | J̃_{µ^k}(i) − J_{µ^k}(i) | ≤ δ,
and an approximate policy improvement error satisfying
    max_{i=1,...,n} | Σ_{j=1}^n p_{ij}(µ^{k+1}(i)) ( g(i, µ^{k+1}(i), j) + α J̃_{µ^k}(j) ) − min_{u ∈ U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J̃_{µ^k}(j) ) | ≤ ε.
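
For reference, a worked statement of the error bound depicted in the figure (the standard asymptotic bound for approximate PI in discounted problems, under the two assumptions above):

```latex
\limsup_{k \to \infty}\; \max_{i=1,\ldots,n} \bigl( J_{\mu^k}(i) - J^*(i) \bigr)
\;\le\; \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}
```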

Error Bound for the Case Where Policies Converge

[Figure: when the policy sequence converges, J_{µ^k} settles within an error zone above J* of width (ε + 2αδ)/(1 − α).]

A better error bound (by a factor 1 − α) holds if the generated policy sequence {µ^k} converges to some policy.
Convergence of policies is guaranteed in some cases; approximate PI using aggregation is one of them.

About the Next Lecture

We will cover:
PI with parametric approximation methods
Linear programming approach
Q-learning
Additional methods; temporal differences

PLEASE READ AS MUCH OF THE SECTIONS AS YOU CAN
PLEASE DOWNLOAD THE LATEST VERSIONS FROM MY WEBSITE
