APPROXIMATE DYNAMIC PROGRAMMING A SERIES OF LECTURES GIVEN AT TSINGHUA UNIVERSITY JUNE 2014 DIMITRI P. BERTSEKAS


1 APPROXIMATE DYNAMIC PROGRAMMING A SERIES OF LECTURES GIVEN AT TSINGHUA UNIVERSITY, JUNE 2014 DIMITRI P. BERTSEKAS
Based on the books:
(1) Neuro-Dynamic Programming, by DPB and J. N. Tsitsiklis, Athena Scientific, 1996
(2) Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming, by DPB, Athena Scientific, 2012
(3) Abstract Dynamic Programming, by DPB, Athena Scientific
For a fuller set of slides, see

2 APPROXIMATE DYNAMIC PROGRAMMING BRIEF OUTLINE I
Our subject: Large-scale DP based on approximations and in part on simulation.
This has been a research area of great interest for the last 25 years, known under various names (e.g., reinforcement learning, neuro-dynamic programming).
It emerged through an enormously fruitful cross-fertilization of ideas from artificial intelligence and optimization/control theory.
It deals with control of dynamic systems under uncertainty, but applies more broadly (e.g., discrete deterministic optimization).
A vast range of applications in control theory, operations research, artificial intelligence, and beyond...
The subject is broad, with a rich variety of theory/math, algorithms, and applications. Our focus will be mostly on algorithms... less on theory and modeling.

3 APPROXIMATE DYNAMIC PROGRAMMING BRIEF OUTLINE II
Our aim: A state-of-the-art account of some of the major topics at a graduate level. Show how to use approximation and simulation to address the dual curses of DP: dimensionality and modeling.
Our 6-lecture plan:
Two lectures on exact DP, with emphasis on infinite horizon problems and issues of large-scale computational methods
One lecture on general issues of approximation and simulation for large-scale problems
One lecture on approximate policy iteration based on temporal differences (TD)/projected equations/Galerkin approximation
One lecture on aggregation methods
One lecture on Q-learning and other methods, such as approximation in policy space

4 APPROXIMATE DYNAMIC PROGRAMMING LECTURE 1 LECTURE OUTLINE Introduction to DP and approximate DP Finite horizon problems The DP algorithm for finite horizon problems Infinite horizon problems Basic theory of discounted infinite horizon problems

5 DP AS AN OPTIMIZATION METHODOLOGY
Generic optimization problem: $\min_{u \in U} g(u)$, where $u$ is the optimization/decision variable, $g(u)$ is the cost function, and $U$ is the constraint set.
Categories of problems:
Discrete ($U$ is finite) or continuous
Linear ($g$ is linear and $U$ is polyhedral) or nonlinear
Stochastic or deterministic: In stochastic problems the cost involves a stochastic parameter $w$, which is averaged, i.e., it has the form $g(u) = E_w\{G(u,w)\}$, where $w$ is a random parameter.
DP deals with multistage stochastic problems:
Information about $w$ is revealed in stages
Decisions are also made in stages and make use of the available information
Its methodology is different

6 BASIC STRUCTURE OF STOCHASTIC DP
Discrete-time system: $x_{k+1} = f_k(x_k, u_k, w_k)$, $k = 0, 1, \ldots, N-1$
$k$: Discrete time
$x_k$: State; summarizes past information that is relevant for future optimization
$u_k$: Control; decision to be selected at time $k$ from a given set
$w_k$: Random parameter (also called disturbance or noise, depending on the context)
$N$: Horizon, or number of times control is applied
Cost function that is additive over time: $E\bigl\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \bigr\}$
Alternative system description: $P(x_{k+1} \mid x_k, u_k)$, i.e., $x_{k+1} = w_k$ with $P(w_k \mid x_k, u_k) = P(x_{k+1} \mid x_k, u_k)$

7 INVENTORY CONTROL EXAMPLE
Discrete-time system: $x_{k+1} = f_k(x_k, u_k, w_k) = x_k + u_k - w_k$
Cost function that is additive over time:
$E\bigl\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \bigr\} = E\bigl\{ \sum_{k=0}^{N-1} \bigl( c\,u_k + r(x_k + u_k - w_k) \bigr) \bigr\}$

8 ADDITIONAL ASSUMPTIONS
The probability distribution of $w_k$ does not depend on past values $w_{k-1}, \ldots, w_0$, but may depend on $x_k$ and $u_k$. Otherwise past values of $w$, $x$, or $u$ would be useful for future optimization.
The constraint set from which $u_k$ is chosen at time $k$ depends at most on $x_k$, not on prior $x$ or $u$.
Optimization over policies (also called feedback control laws): these are rules/functions $u_k = \mu_k(x_k)$, $k = 0, \ldots, N-1$, that map state/inventory to control/order (closed-loop optimization, use of feedback).
MAJOR DISTINCTION: We minimize over sequences of functions (mapping inventory to order) $\{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$, NOT over sequences of controls/orders $\{u_0, u_1, \ldots, u_{N-1}\}$.

9 GENERIC FINITE-HORIZON PROBLEM
System: $x_{k+1} = f_k(x_k, u_k, w_k)$, $k = 0, \ldots, N-1$
Control constraints: $u_k \in U_k(x_k)$
Probability distribution $P_k(\cdot \mid x_k, u_k)$ of $w_k$
Policies $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$, where $\mu_k$ maps states $x_k$ into controls $u_k = \mu_k(x_k)$ and is such that $\mu_k(x_k) \in U_k(x_k)$ for all $x_k$
Expected cost of $\pi$ starting at $x_0$: $J_\pi(x_0) = E\bigl\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) \bigr\}$
Optimal cost function: $J^*(x_0) = \min_\pi J_\pi(x_0)$
An optimal policy $\pi^*$ satisfies $J_{\pi^*}(x_0) = J^*(x_0)$. When produced by DP, $\pi^*$ is independent of $x_0$.

10 PRINCIPLE OF OPTIMALITY
Let $\pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\}$ be an optimal policy.
Consider the tail subproblem whereby we are at $x_k$ at time $k$ and wish to minimize the cost-to-go from time $k$ to time $N$,
$E\bigl\{ g_N(x_N) + \sum_{l=k}^{N-1} g_l\bigl(x_l, \mu_l(x_l), w_l\bigr) \bigr\}$,
and the tail policy $\{\mu_k^*, \mu_{k+1}^*, \ldots, \mu_{N-1}^*\}$.
(Figure: the tail subproblem starting at $x_k$, over times $k, \ldots, N$.)
Principle of optimality: The tail policy is optimal for the tail subproblem (optimization of the future does not depend on what we did in the past).
DP solves ALL the tail subproblems.
At the generic step, it solves ALL tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length.

11 DP ALGORITHM
Computes for all $k$ and states $x_k$: $J_k(x_k)$, the optimal cost of the tail problem starting at $x_k$.
Initial condition: $J_N(x_N) = g_N(x_N)$
Go backwards, $k = N-1, \ldots, 0$, using
$J_k(x_k) = \min_{u_k \in U_k(x_k)} E_{w_k}\bigl\{ g_k(x_k, u_k, w_k) + J_{k+1}\bigl( f_k(x_k, u_k, w_k) \bigr) \bigr\}$
To solve the tail subproblem at time $k$, minimize [$k$th-stage cost + optimal cost of the next tail problem starting from the next state at time $k+1$].
Then $J_0(x_0)$, generated at the last step, is equal to the optimal cost $J^*(x_0)$. Also, the policy $\pi^* = \{\mu_0^*, \ldots, \mu_{N-1}^*\}$, where $\mu_k^*(x_k)$ minimizes the right side above for each $x_k$ and $k$, is optimal.
Proof by induction
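To make the backward recursion concrete, the following is a minimal Python sketch of the DP algorithm for a small inventory-type problem; the horizon, spaces, costs, and demand distribution are illustrative assumptions, not data from the lectures.

```python
import numpy as np

# Finite-horizon DP sketch: x_{k+1} = f_k(x_k, u_k, w_k), additive cost, backward recursion.
# Inventory-like example with small integer spaces (all numbers below are illustrative).
N = 5                                     # horizon
states = range(11)                        # inventory levels 0..10
demand = [(0, 0.2), (1, 0.5), (2, 0.3)]   # (w, P(w))

def f(x, u, w):                           # system equation (inventory balance, clipped)
    return max(0, min(10, x + u - w))

def g(x, u, w):                           # stage cost: ordering + holding/shortage penalty
    return 1.0 * u + 2.0 * abs(x + u - w)

J = {x: 0.0 for x in states}              # J_N = g_N = 0 (terminal cost)
policy = [dict() for _ in range(N)]
for k in reversed(range(N)):              # k = N-1, ..., 0
    Jk = {}
    for x in states:
        best_u, best_val = None, np.inf
        for u in range(11 - x):           # orders keeping inventory <= 10
            val = sum(p * (g(x, u, w) + J[f(x, u, w)]) for w, p in demand)
            if val < best_val:
                best_u, best_val = u, val
        Jk[x], policy[k][x] = best_val, best_u
    J = Jk                                # J now holds J_k

print("J_0(5) =", J[5], "  mu_0(5) =", policy[0][5])
```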

12 PRACTICAL DIFFICULTIES OF DP
The curse of dimensionality:
Exponential growth of the computational and storage requirements as the number of state variables and control variables increases
Quick explosion of the number of states in combinatorial problems
The curse of modeling:
Sometimes a simulator of the system is easier to construct than a model
There may be real-time solution constraints:
A family of problems may be addressed. The data of the problem to be solved is given with little advance notice
The problem data may change as the system is controlled, creating a need for on-line replanning
All of the above are motivations for approximation and simulation

13 A MAJOR IDEA: COST APPROXIMATION
Use a policy computed from the DP equation where the optimal cost-to-go function $J_{k+1}$ is replaced by an approximation $\tilde J_{k+1}$. Apply $\bar\mu_k(x_k)$, which attains the minimum in
$\min_{u_k \in U_k(x_k)} E\bigl\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\bigl( f_k(x_k, u_k, w_k) \bigr) \bigr\}$
Some approaches:
(a) Problem Approximation: Use as $\tilde J_k$ a cost function derived from a related but simpler problem
(b) Parametric Cost-to-Go Approximation: Use as $\tilde J_k$ a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (we will mostly focus on this). This is a major portion of Reinforcement Learning/Neuro-Dynamic Programming
(c) Rollout Approach: Use as $\tilde J_k$ the cost of some suboptimal policy, which is calculated either analytically or by simulation

14 ROLLOUT ALGORITHMS
At each $k$ and state $x_k$, use the control $\bar\mu_k(x_k)$ that attains the minimum in
$\min_{u_k \in U_k(x_k)} E\bigl\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\bigl( f_k(x_k, u_k, w_k) \bigr) \bigr\}$,
where $\tilde J_{k+1}$ is the cost-to-go of some heuristic policy (called the base policy).
Cost improvement property: The rollout algorithm achieves no worse (and usually much better) cost than the base policy starting from the same state.
Main difficulty: Calculating $\tilde J_{k+1}(x)$ may be computationally intensive if the cost-to-go of the base policy cannot be analytically calculated. May involve Monte Carlo simulation if the problem is stochastic. Things improve in the deterministic case (an important application is discrete optimization).
Connection with Model Predictive Control (MPC).
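As a hedged illustration of the cost improvement property, here is a small Python sketch of rollout for a deterministic N-stage problem with a myopic base heuristic; the dynamics, costs, and heuristic are invented for the example and are not from the lectures.

```python
# Deterministic rollout sketch: at (k, x) pick the control minimizing
# stage cost + cost-to-go of a heuristic "base policy" (all model details illustrative).
N = 6
def c(k, x, u):                 # stage cost for control u in {0, 1}
    return (x - 3) ** 2 + 2 * u
def G(x):                       # terminal cost
    return 5 * abs(x - 4)

def base_policy(k, x):          # myopic heuristic: minimize the immediate cost only
    return min((0, 1), key=lambda u: c(k, x, u))

def heuristic_cost(k, x):       # cost of following the base policy from (k, x) to the end
    total = 0.0
    while k < N:
        u = base_policy(k, x)
        total, x, k = total + c(k, x, u), x + u, k + 1
    return total + G(x)

def rollout_policy(k, x):       # one-step lookahead using the base-policy cost-to-go
    return min((0, 1), key=lambda u: c(k, x, u) + heuristic_cost(k + 1, x + u))

for name, pol in [("base policy", base_policy), ("rollout", rollout_policy)]:
    x, total = 0, 0.0
    for k in range(N):
        u = pol(k, x)
        total, x = total + c(k, x, u), x + u
    print(name, "cost:", total + G(x))   # rollout is never worse than the base policy
```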

15 INFINITE HORIZON PROBLEMS
Same as the basic problem, but:
The number of stages is infinite.
The system is stationary.
Total cost problems: Minimize
$J_\pi(x_0) = \lim_{N\to\infty} E_{w_k,\,k=0,1,\ldots}\bigl\{ \sum_{k=0}^{N-1} \alpha^k g\bigl(x_k, \mu_k(x_k), w_k\bigr) \bigr\}$
Discounted problems ($\alpha < 1$, bounded $g$)
Stochastic shortest path problems ($\alpha = 1$, finite-state system with a termination state) - we will discuss these sparingly
Discounted and undiscounted problems with unbounded cost per stage - we will not cover
Average cost problems - we will not cover
Infinite horizon characteristics: Challenging analysis; elegance of solutions and algorithms.
Stationary policies $\pi = \{\mu, \mu, \ldots\}$ and stationary forms of DP play a special role.

16 DISCOUNTED PROBLEMS/BOUNDED COST
Stationary system: $x_{k+1} = f(x_k, u_k, w_k)$, $k = 0, 1, \ldots$
Cost of a policy $\pi = \{\mu_0, \mu_1, \ldots\}$:
$J_\pi(x_0) = \lim_{N\to\infty} E_{w_k,\,k=0,1,\ldots}\bigl\{ \sum_{k=0}^{N-1} \alpha^k g\bigl(x_k, \mu_k(x_k), w_k\bigr) \bigr\}$
with $\alpha < 1$, and $g$ bounded [for some $M$, we have $|g(x,u,w)| \le M$ for all $(x,u,w)$]
Optimal cost function: $J^*(x) = \min_\pi J_\pi(x)$
Boundedness of $g$ guarantees that all costs are well-defined and bounded: $|J_\pi(x)| \le \frac{M}{1-\alpha}$
All spaces are arbitrary - only boundedness of $g$ is important (there are math fine points, e.g., measurability, but they don't matter in practice)
Important special case: All underlying spaces finite; a (finite spaces) Markovian Decision Problem or MDP
All algorithms ultimately work with a finite spaces MDP approximating the original problem

17 SHORTHAND NOTATION FOR DP MAPPINGS
For any function $J$ of $x$, denote
$(TJ)(x) = \min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J\bigl( f(x,u,w) \bigr) \bigr\}$, for all $x$
$TJ$ is the optimal cost function for the one-stage problem with stage cost $g$ and terminal cost function $\alpha J$. $T$ operates on bounded functions of $x$ to produce other bounded functions of $x$.
For any stationary policy $\mu$, denote
$(T_\mu J)(x) = E_w\bigl\{ g\bigl(x,\mu(x),w\bigr) + \alpha J\bigl( f(x,\mu(x),w) \bigr) \bigr\}$, for all $x$
The critical structure of the problem is captured in $T$ and $T_\mu$.
The entire theory of discounted problems can be developed in shorthand using $T$ and $T_\mu$. True for many other DP problems.
$T$ and $T_\mu$ provide a powerful unifying framework for DP. This is the essence of the book Abstract Dynamic Programming.

18 FINITE-HORIZON COST EXPRESSIONS
Consider an $N$-stage policy $\pi_0^N = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$ with a terminal cost $J$:
$J_{\pi_0^N}(x_0) = E\bigl\{ \alpha^N J(x_N) + \sum_{l=0}^{N-1} \alpha^l g\bigl(x_l, \mu_l(x_l), w_l\bigr) \bigr\} = E\bigl\{ g\bigl(x_0, \mu_0(x_0), w_0\bigr) + \alpha J_{\pi_1^N}(x_1) \bigr\} = (T_{\mu_0} J_{\pi_1^N})(x_0)$
where $\pi_1^N = \{\mu_1, \mu_2, \ldots, \mu_{N-1}\}$
By induction we have $J_{\pi_0^N}(x) = (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} J)(x)$, for all $x$
For a stationary policy $\mu$, the $N$-stage cost function (with terminal cost $J$) is $J_{\pi_0^N} = T_\mu^N J$, where $T_\mu^N$ is the $N$-fold composition of $T_\mu$
Similarly the optimal $N$-stage cost function (with terminal cost $J$) is $T^N J$
$T^N J = T(T^{N-1} J)$ is just the DP algorithm

19 SHORTHAND THEORY A SUMMARY
Infinite horizon cost function expressions [with $J_0(x) \equiv 0$]:
$J_\pi(x) = \lim_{N\to\infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} J_0)(x)$, $\quad J_\mu(x) = \lim_{N\to\infty} (T_\mu^N J_0)(x)$
Bellman's equation: $J^* = TJ^*$, $\quad J_\mu = T_\mu J_\mu$
Optimality condition: $\mu$ optimal $\Longleftrightarrow$ $T_\mu J^* = TJ^*$
Value iteration: For any (bounded) $J$, $\; J^*(x) = \lim_{k\to\infty} (T^k J)(x)$, for all $x$
Policy iteration: Given $\mu^k$,
Policy evaluation: Find $J_{\mu^k}$ by solving $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$
Policy improvement: Find $\mu^{k+1}$ such that $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$

20 TWO KEY PROPERTIES
Monotonicity property: For any $J$ and $J'$ such that $J(x) \le J'(x)$ for all $x$, and any $\mu$,
$(TJ)(x) \le (TJ')(x)$, $\quad (T_\mu J)(x) \le (T_\mu J')(x)$, for all $x$.
Constant shift property: For any $J$, any scalar $r$, and any $\mu$,
$\bigl(T(J + re)\bigr)(x) = (TJ)(x) + \alpha r$, $\quad \bigl(T_\mu(J + re)\bigr)(x) = (T_\mu J)(x) + \alpha r$, for all $x$,
where $e$ is the unit function [$e(x) \equiv 1$].
Monotonicity is present in all DP models (undiscounted, etc.)
Constant shift is special to discounted models
Discounted problems have another property of major importance: $T$ and $T_\mu$ are contraction mappings (we will show this later)

21 CONVERGENCE OF VALUE ITERATION
For all bounded $J$, $\; J^*(x) = \lim_{k\to\infty} (T^k J)(x)$, for all $x$.
Proof: For simplicity we give the proof for $J \equiv 0$. For any initial state $x_0$ and policy $\pi = \{\mu_0, \mu_1, \ldots\}$,
$J_\pi(x_0) = E\bigl\{ \sum_{l=0}^{\infty} \alpha^l g\bigl(x_l, \mu_l(x_l), w_l\bigr) \bigr\} = E\bigl\{ \sum_{l=0}^{k-1} \alpha^l g\bigl(x_l, \mu_l(x_l), w_l\bigr) \bigr\} + E\bigl\{ \sum_{l=k}^{\infty} \alpha^l g\bigl(x_l, \mu_l(x_l), w_l\bigr) \bigr\}$
The tail portion satisfies
$\Bigl| E\bigl\{ \sum_{l=k}^{\infty} \alpha^l g\bigl(x_l, \mu_l(x_l), w_l\bigr) \bigr\} \Bigr| \le \frac{\alpha^k M}{1-\alpha}$,
where $M \ge |g(x,u,w)|$. Take the min over $\pi$ of both sides, then the limit as $k \to \infty$. Q.E.D.

22 BELLMAN'S EQUATION
The optimal cost function $J^*$ is a solution of Bellman's equation, $J^* = TJ^*$, i.e., for all $x$,
$J^*(x) = \min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J^*\bigl( f(x,u,w) \bigr) \bigr\}$
Proof: For all $x$ and $k$,
$J^*(x) - \frac{\alpha^k M}{1-\alpha} \le (T^k J_0)(x) \le J^*(x) + \frac{\alpha^k M}{1-\alpha}$,
where $J_0(x) \equiv 0$ and $M \ge |g(x,u,w)|$. Applying $T$ to this relation, and using Monotonicity and Constant Shift,
$(TJ^*)(x) - \frac{\alpha^{k+1} M}{1-\alpha} \le (T^{k+1} J_0)(x) \le (TJ^*)(x) + \frac{\alpha^{k+1} M}{1-\alpha}$
Taking the limit as $k \to \infty$ and using the fact $\lim_{k\to\infty} (T^{k+1} J_0)(x) = J^*(x)$, we obtain $J^* = TJ^*$. Q.E.D.

23 THE CONTRACTION PROPERTY
Contraction property: For any bounded functions $J$ and $J'$, and any $\mu$,
$\max_x |(TJ)(x) - (TJ')(x)| \le \alpha \max_x |J(x) - J'(x)|$,
$\max_x |(T_\mu J)(x) - (T_\mu J')(x)| \le \alpha \max_x |J(x) - J'(x)|$.
Proof: Denote $c = \max_{x} |J(x) - J'(x)|$. Then
$J(x) - c \le J'(x) \le J(x) + c$, for all $x$.
Apply $T$ to both sides, and use the Monotonicity and Constant Shift properties:
$(TJ)(x) - \alpha c \le (TJ')(x) \le (TJ)(x) + \alpha c$, for all $x$.
Hence $|(TJ)(x) - (TJ')(x)| \le \alpha c$ for all $x$. Q.E.D.
Note: This implies that $J^*$ is the unique solution of $J = TJ$, and $J_\mu$ is the unique solution of $J_\mu = T_\mu J_\mu$

24 NEC. AND SUFFICIENT OPT. CONDITION
A stationary policy $\mu$ is optimal if and only if $\mu(x)$ attains the minimum in Bellman's equation for each $x$; i.e., $TJ^* = T_\mu J^*$, or, equivalently, for all $x$,
$\mu(x) \in \arg\min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J^*\bigl( f(x,u,w) \bigr) \bigr\}$
Proof: If $TJ^* = T_\mu J^*$, then using Bellman's equation ($J^* = TJ^*$), we have $J^* = T_\mu J^*$, so by uniqueness of the fixed point of $T_\mu$, we obtain $J^* = J_\mu$; i.e., $\mu$ is optimal.
Conversely, if the stationary policy $\mu$ is optimal, we have $J^* = J_\mu$, so $J^* = T_\mu J^*$. Combining this with Bellman's equation ($J^* = TJ^*$), we obtain $TJ^* = T_\mu J^*$. Q.E.D.

25 APPROXIMATE DYNAMIC PROGRAMMING LECTURE 2 LECTURE OUTLINE Review of discounted problem theory Review of shorthand notation Algorithms for discounted DP Value iteration Various forms of policy iteration Optimistic policy iteration Q-factors and Q-learning Other DP models - Continuous space and time A more abstract view of DP Asynchronous algorithms

26 DISCOUNTED PROBLEMS/BOUNDED COST
Stationary system with arbitrary state space: $x_{k+1} = f(x_k, u_k, w_k)$, $k = 0, 1, \ldots$
Cost of a policy $\pi = \{\mu_0, \mu_1, \ldots\}$:
$J_\pi(x_0) = \lim_{N\to\infty} E_{w_k,\,k=0,1,\ldots}\bigl\{ \sum_{k=0}^{N-1} \alpha^k g\bigl(x_k, \mu_k(x_k), w_k\bigr) \bigr\}$
with $\alpha < 1$, and for some $M$, we have $|g(x,u,w)| \le M$ for all $(x,u,w)$
Shorthand notation for DP mappings (they operate on functions of state to produce other functions):
$(TJ)(x) = \min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J\bigl( f(x,u,w) \bigr) \bigr\}$, for all $x$
$TJ$ is the optimal cost function for the one-stage problem with stage cost $g$ and terminal cost $\alpha J$.
For any stationary policy $\mu$:
$(T_\mu J)(x) = E_w\bigl\{ g\bigl(x,\mu(x),w\bigr) + \alpha J\bigl( f(x,\mu(x),w) \bigr) \bigr\}$, for all $x$

27 SHORTHAND THEORY A SUMMARY
Bellman's equation: $J^* = TJ^*$, $J_\mu = T_\mu J_\mu$, or
$J^*(x) = \min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J^*\bigl( f(x,u,w) \bigr) \bigr\}$, for all $x$
$J_\mu(x) = E_w\bigl\{ g\bigl(x,\mu(x),w\bigr) + \alpha J_\mu\bigl( f(x,\mu(x),w) \bigr) \bigr\}$, for all $x$
Optimality condition: $\mu$ optimal $\Longleftrightarrow$ $T_\mu J^* = TJ^*$, i.e.,
$\mu(x) \in \arg\min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J^*\bigl( f(x,u,w) \bigr) \bigr\}$, for all $x$
Value iteration: For any (bounded) $J$, $\; J^*(x) = \lim_{k\to\infty} (T^k J)(x)$, for all $x$
Policy iteration: Given $\mu^k$,
Find $J_{\mu^k}$ from $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$ (policy evaluation); then
Find $\mu^{k+1}$ such that $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$ (policy improvement)

28 MAJOR PROPERTIES
Monotonicity property: For any functions $J$ and $J'$ on the state space $X$ such that $J(x) \le J'(x)$ for all $x \in X$, and any $\mu$,
$(TJ)(x) \le (TJ')(x)$, $\quad (T_\mu J)(x) \le (T_\mu J')(x)$, for all $x \in X$
Contraction property: For any bounded functions $J$ and $J'$, and any $\mu$,
$\max_x |(TJ)(x) - (TJ')(x)| \le \alpha \max_x |J(x) - J'(x)|$,
$\max_x |(T_\mu J)(x) - (T_\mu J')(x)| \le \alpha \max_x |J(x) - J'(x)|$
Compact contraction notation: $\|TJ - TJ'\| \le \alpha \|J - J'\|$, $\; \|T_\mu J - T_\mu J'\| \le \alpha \|J - J'\|$, where for any bounded function $J$ we denote by $\|J\|$ the sup-norm $\|J\| = \max_x |J(x)|$

29 THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any (bounded) $J$, $\; J^*(x) = \lim_{k\to\infty} (T^k J)(x)$, for all $x$
Policy iteration: Given $\mu^k$,
Policy evaluation: Find $J_{\mu^k}$ by solving
$J_{\mu^k}(x) = E_w\bigl\{ g\bigl(x,\mu^k(x),w\bigr) + \alpha J_{\mu^k}\bigl( f(x,\mu^k(x),w) \bigr) \bigr\}$, for all $x$, or $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$
Policy improvement: Let $\mu^{k+1}$ be such that
$\mu^{k+1}(x) \in \arg\min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J_{\mu^k}\bigl( f(x,u,w) \bigr) \bigr\}$, for all $x$, or $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$
For the case of $n$ states, policy evaluation is equivalent to solving an $n \times n$ linear system of equations: $J_\mu = g_\mu + \alpha P_\mu J_\mu$
For large $n$, exact PI is out of the question (even though it terminates finitely, as we will show)
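A minimal Python sketch of exact VI and exact PI for a small n-state discounted MDP; the randomly generated transition probabilities and costs are purely illustrative, and policy evaluation is carried out as the n x n linear solve mentioned above.

```python
import numpy as np

# Exact VI and PI on a small random n-state discounted MDP (illustrative data).
rng = np.random.default_rng(0)
n, m, alpha = 10, 3, 0.9
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)   # p_ij(u), indexed [u, i, j]
g = rng.random((m, n))                                         # expected stage cost g(i, u)

def T(J):                      # (TJ)(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J(j) ]
    return (g + alpha * P @ J).min(axis=0)

J = np.zeros(n)                # value iteration: J <- TJ
for _ in range(1000):
    J = T(J)

mu = np.zeros(n, dtype=int)    # policy iteration
for _ in range(100):
    P_mu, g_mu = P[mu, np.arange(n), :], g[mu, np.arange(n)]
    J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)     # evaluation: J_mu = g_mu + alpha P_mu J_mu
    mu_new = (g + alpha * P @ J_mu).argmin(axis=0)             # improvement
    if np.array_equal(mu_new, mu):
        break                                                  # terminates finitely
    mu = mu_new

print("max |J_VI - J_PI| =", np.abs(J - J_mu).max())
```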

30 JUSTIFICATION OF POLICY ITERATION
We can show that $J_{\mu^k} \ge J_{\mu^{k+1}}$ for all $k$.
Proof: For given $k$, we have
$J_{\mu^k} = T_{\mu^k} J_{\mu^k} \ge T J_{\mu^k} = T_{\mu^{k+1}} J_{\mu^k}$
Using the monotonicity property of DP,
$J_{\mu^k} \ge T_{\mu^{k+1}} J_{\mu^k} \ge T_{\mu^{k+1}}^2 J_{\mu^k} \ge \cdots \ge \lim_{N\to\infty} T_{\mu^{k+1}}^N J_{\mu^k}$
Since $\lim_{N\to\infty} T_{\mu^{k+1}}^N J_{\mu^k} = J_{\mu^{k+1}}$, we have $J_{\mu^k} \ge J_{\mu^{k+1}}$.
If $J_{\mu^k} = J_{\mu^{k+1}}$, all the above inequalities hold as equations, so $J_{\mu^k}$ solves Bellman's equation. Hence $J_{\mu^k} = J^*$.
Thus at iteration $k$ either the algorithm generates a strictly improved policy or it finds an optimal policy.
For a finite spaces MDP, the algorithm terminates with an optimal policy.
For infinite spaces MDP, convergence (in an infinite number of iterations) can be shown.

31 OPTIMISTIC POLICY ITERATION
Optimistic PI: This is PI, where policy evaluation is done approximately, with a finite number of VI
So we approximate the policy evaluation, $J_\mu \approx T_\mu^m J$, for some number $m \in [1, \infty)$ and initial $J$
Shorthand definition: For some integers $m_k$,
$T_{\mu^k} J_k = T J_k$, $\quad J_{k+1} = T_{\mu^k}^{m_k} J_k$, $\quad k = 0, 1, \ldots$
If $m_k \equiv 1$ it becomes VI
If $m_k = \infty$ it becomes PI
Converges for both finite and infinite spaces discounted problems (in an infinite number of iterations)
Typically works faster than VI and PI (for large problems)
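A short sketch of optimistic PI under the same illustrative random-MDP setup as in the VI/PI sketch above: exact policy evaluation is replaced by m applications of the mapping of the current policy.

```python
import numpy as np

# Optimistic PI sketch: evaluate each policy with only m applications of T_mu
# (same illustrative random-MDP construction as in the VI/PI sketch).
rng = np.random.default_rng(0)
n, num_u, alpha, m = 10, 3, 0.9, 5
P = rng.random((num_u, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((num_u, n))

J = np.zeros(n)
for _ in range(200):
    mu = (g + alpha * P @ J).argmin(axis=0)                  # T_mu J = T J  (improvement)
    P_mu, g_mu = P[mu, np.arange(n), :], g[mu, np.arange(n)]
    for _ in range(m):                                       # J <- T_mu^m J  (approximate evaluation)
        J = g_mu + alpha * P_mu @ J
print("optimistic PI: J(0) ~", round(J[0], 4))
```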

32 APPROXIMATE PI
Suppose that the policy evaluation is approximate,
$\|J_k - J_{\mu^k}\| \le \delta$, $\quad k = 0, 1, \ldots$
and policy improvement is approximate,
$\|T_{\mu^{k+1}} J_k - T J_k\| \le \epsilon$, $\quad k = 0, 1, \ldots$
where $\delta$ and $\epsilon$ are some positive scalars.
Error Bound I: The sequence $\{\mu^k\}$ generated by approximate policy iteration satisfies
$\limsup_{k\to\infty} \|J_{\mu^k} - J^*\| \le \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}$
Typical practical behavior: The method makes steady progress up to a point and then the iterates $J_{\mu^k}$ oscillate within a neighborhood of $J^*$.
Error Bound II: If in addition the sequence $\{\mu^k\}$ terminates at $\bar\mu$ (i.e., keeps generating $\bar\mu$),
$\|J_{\bar\mu} - J^*\| \le \frac{\epsilon + 2\alpha\delta}{1-\alpha}$

33 Q-FACTORS I
Optimal Q-factor of $(x,u)$:
$Q^*(x,u) = E\bigl\{ g(x,u,w) + \alpha J^*(\bar x) \bigr\}$, with $\bar x = f(x,u,w)$.
It is the cost of starting at $x$, applying $u$ in the 1st stage, and an optimal policy after the 1st stage.
We can write Bellman's equation as $J^*(x) = \min_{u \in U(x)} Q^*(x,u)$, for all $x$.
We can equivalently write the VI method as $J_{k+1}(x) = \min_{u \in U(x)} Q_{k+1}(x,u)$, for all $x$, where $Q_{k+1}$ is generated by
$Q_{k+1}(x,u) = E\bigl\{ g(x,u,w) + \alpha \min_{v \in U(\bar x)} Q_k(\bar x, v) \bigr\}$, with $\bar x = f(x,u,w)$

34 Q-FACTORS II
Q-factors are costs in an "augmented" problem where states are $(x,u)$
They satisfy a Bellman equation $Q = FQ$, where
$(FQ)(x,u) = E\bigl\{ g(x,u,w) + \alpha \min_{v \in U(\bar x)} Q(\bar x, v) \bigr\}$, with $\bar x = f(x,u,w)$
VI and PI for Q-factors are mathematically equivalent to VI and PI for costs
They require an equal amount of computation... they just need more storage
Having optimal Q-factors is convenient when implementing an optimal policy on-line by
$\mu^*(x) \in \arg\min_{u \in U(x)} Q^*(x,u)$
Once $Q^*(x,u)$ are known, the model [$g$ and $E\{\cdot\}$] is not needed. Model-free operation.
Q-Learning (to be discussed later) is a sampling method that calculates $Q^*(x,u)$ using a simulator of the system (no model needed)
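A minimal sketch of VI in Q-factor form, Q <- FQ, for an illustrative random finite MDP; note that no explicit J is maintained, and the greedy policy is read off from Q at the end.

```python
import numpy as np

# Q-factor value iteration sketch:
# (FQ)(i,u) = sum_j p_ij(u) [ g(i,u,j) + alpha * min_v Q(j,v) ]   (illustrative random MDP)
rng = np.random.default_rng(1)
n, m, alpha = 8, 3, 0.9
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)   # p_ij(u), indexed [u, i, j]
G = rng.random((m, n, n))                                      # g(i, u, j)

Q = np.zeros((m, n))                                           # Q(i, u), stored as [u, i]
for _ in range(1000):
    Jmin = Q.min(axis=0)                                       # J(j) = min_v Q(j, v)
    Q = (P * (G + alpha * Jmin[None, None, :])).sum(axis=2)    # Q <- FQ

print("mu*(i) = argmin_u Q*(i,u):", Q.argmin(axis=0))
print("J*(i)  = min_u Q*(i,u):   ", np.round(Q.min(axis=0), 3))
```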

35 OTHER DP MODELS We have looked so far at the (discrete or continuous spaces) discounted models for which the analysis is simplest and results are most powerful Other DP models include: Undiscounted problems (α = 1): They may include a special termination state (stochastic shortest path problems) Continuous-time finite-state MDP: The time between transitions is random and state-andcontrol-dependent (typical in queueing systems, called Semi-Markov MDP). These can be viewed as discounted problems with stateand-control-dependent discount factors Continuous-time, continuous-space models: Classical automatic control, process control, robotics Substantial differences from discrete-time Mathematically more complex theory (particularly for stochastic problems) Deterministic versions can be analyzed using classical optimal control theory Admit treatment by DP, based on time discretization

36 CONTINUOUS-TIME MODELS
System equation: $dx(t)/dt = f\bigl(x(t), u(t)\bigr)$
Cost function: $\int_0^\infty g\bigl(x(t), u(t)\bigr)\,dt$
Optimal cost starting from $x$: $J^*(x)$
$\delta$-discretization of time: $x_{k+1} = x_k + \delta f(x_k, u_k)$
Bellman equation for the $\delta$-discretized problem:
$J_\delta^*(x) = \min_u \bigl\{ \delta g(x,u) + J_\delta^*\bigl( x + \delta f(x,u) \bigr) \bigr\}$
Take $\delta \downarrow 0$ to obtain the Hamilton-Jacobi-Bellman equation [assuming $\lim_{\delta\to 0} J_\delta^*(x) = J^*(x)$]:
$0 = \min_u \bigl\{ g(x,u) + \nabla J^*(x)' f(x,u) \bigr\}$, for all $x$
Policy iteration (informally):
Policy evaluation: Given the current $\mu$, solve $0 = g\bigl(x, \mu(x)\bigr) + \nabla J_\mu(x)' f\bigl(x, \mu(x)\bigr)$, for all $x$
Policy improvement: Find $\bar\mu(x) \in \arg\min_u \bigl\{ g(x,u) + \nabla J_\mu(x)' f(x,u) \bigr\}$, for all $x$
Note: Need to learn $\nabla J_\mu(x)$, NOT $J_\mu(x)$

37 A MORE GENERAL/ABSTRACT VIEW OF DP
Let $Y$ be a real vector space with a norm $\|\cdot\|$
A function $F : Y \to Y$ is said to be a contraction mapping if for some $\rho \in (0,1)$, we have
$\|Fy - Fz\| \le \rho \|y - z\|$, for all $y, z \in Y$.
$\rho$ is called the modulus of contraction of $F$.
Important example: Let $X$ be a set (e.g., the state space in DP) and $v : X \to \Re$ a positive-valued function. Let $B(X)$ be the set of all functions $J : X \to \Re$ such that $J(x)/v(x)$ is bounded over $x$.
We define a norm on $B(X)$, called the weighted sup-norm, by
$\|J\| = \max_{x \in X} \frac{|J(x)|}{v(x)}$.
Important special case: the discounted problem mappings $T$ and $T_\mu$ [for $v(x) \equiv 1$, $\rho = \alpha$].

38 CONTRACTION MAPPINGS: AN EXAMPLE
Consider the extension from a finite to a countable state space, $X = \{1, 2, \ldots\}$, and a weighted sup-norm with respect to which the one-stage costs are bounded
Suppose that $T_\mu$ has the form
$(T_\mu J)(i) = b_i + \alpha \sum_{j \in X} a_{ij} J(j)$, $\quad i = 1, 2, \ldots$
where $b_i$ and $a_{ij}$ are some scalars. Then $T_\mu$ is a contraction with modulus $\rho$ if and only if
$\alpha \sum_{j \in X} |a_{ij}|\, v(j) \le \rho\, v(i)$, $\quad i = 1, 2, \ldots$
Consider $T$,
$(TJ)(i) = \min_{\mu \in M} (T_\mu J)(i)$, $\quad i = 1, 2, \ldots$
where for each $\mu \in M$, $T_\mu$ is a contraction mapping with modulus $\rho$. Then $T$ is a contraction mapping with modulus $\rho$
Allows extensions of the main DP results from bounded one-stage cost to interesting unbounded one-stage cost cases.

39 CONTRACTION MAPPING FIXED-POINT TH.
Contraction Mapping Fixed-Point Theorem: If $F : B(X) \to B(X)$ is a contraction with modulus $\rho \in (0,1)$, then there exists a unique $J^* \in B(X)$ such that $J^* = FJ^*$.
Furthermore, if $J$ is any function in $B(X)$, then $\{F^k J\}$ converges to $J^*$ and we have
$\|F^k J - J^*\| \le \rho^k \|J - J^*\|$, $\quad k = 1, 2, \ldots$
This is a special case of a general result for contraction mappings $F : Y \to Y$ over normed vector spaces $Y$ that are complete: every sequence $\{y_k\}$ that is Cauchy (satisfies $\|y_m - y_n\| \to 0$ as $m, n \to \infty$) converges.
The space $B(X)$ is complete (see the text for a proof).

40 ABSTRACT FORMS OF DP
We consider an abstract form of DP based on monotonicity and contraction
Abstract Mapping: Denote by $R(X)$ the set of real-valued functions $J : X \to \Re$, and let $H : X \times U \times R(X) \to \Re$ be a given mapping. We consider the mapping
$(TJ)(x) = \min_{u \in U(x)} H(x, u, J)$, $\quad x \in X$.
We assume that $(TJ)(x) > -\infty$ for all $x \in X$, so $T$ maps $R(X)$ into $R(X)$.
Abstract Policies: Let $M$ be the set of policies, i.e., functions $\mu$ such that $\mu(x) \in U(x)$ for all $x \in X$.
For each $\mu \in M$, we consider the mapping $T_\mu : R(X) \to R(X)$ defined by
$(T_\mu J)(x) = H\bigl(x, \mu(x), J\bigr)$, $\quad x \in X$.
Find a function $J^* \in R(X)$ such that
$J^*(x) = \min_{u \in U(x)} H(x, u, J^*)$, $\quad x \in X$

41 EXAMPLES
Discounted problems: $H(x,u,J) = E\bigl\{ g(x,u,w) + \alpha J\bigl( f(x,u,w) \bigr) \bigr\}$
Discounted discrete-state continuous-time Semi-Markov problems (e.g., queueing): $H(x,u,J) = G(x,u) + \sum_{y=1}^n m_{xy}(u) J(y)$, where $m_{xy}$ are "discounted" transition probabilities, defined by the distribution of transition times
Minimax problems/games: $H(x,u,J) = \max_{w \in W(x,u)} \bigl[ g(x,u,w) + \alpha J\bigl( f(x,u,w) \bigr) \bigr]$
Shortest path problems: $H(x,u,J) = a_{xu} + J(u)$ if $u \ne d$, and $H(x,u,J) = a_{xd}$ if $u = d$, where $d$ is the destination. There are stochastic and minimax versions of this problem

42 ASSUMPTIONS
Monotonicity: If $J, J' \in R(X)$ and $J \le J'$, then
$H(x,u,J) \le H(x,u,J')$, $\quad x \in X$, $u \in U(x)$
We can show all the standard analytical and computational results of discounted DP if monotonicity and the following assumption hold:
Contraction:
For every $J \in B(X)$, the functions $T_\mu J$ and $TJ$ belong to $B(X)$
For some $\alpha \in (0,1)$, and all $\mu$ and $J, J' \in B(X)$, we have $\|T_\mu J - T_\mu J'\| \le \alpha \|J - J'\|$
With just the monotonicity assumption (as in undiscounted problems), we can still show various forms of the basic results under appropriate conditions
A weaker substitute for the contraction assumption is semicontractiveness: (roughly) for some $\mu$, $T_\mu$ is a contraction and for others it is not; also the noncontractive $\mu$ are not optimal

43 RESULTS USING CONTRACTION
Proposition 1: The mappings $T_\mu$ and $T$ are weighted sup-norm contraction mappings with modulus $\alpha$ over $B(X)$, and have unique fixed points in $B(X)$, denoted $J_\mu$ and $J^*$, respectively (cf. Bellman's equation). Proof: From the contraction property of $H$.
Proposition 2: For any $J \in B(X)$ and $\mu \in M$,
$\lim_{k\to\infty} T_\mu^k J = J_\mu$, $\quad \lim_{k\to\infty} T^k J = J^*$
(cf. convergence of value iteration). Proof: From the contraction property of $T_\mu$ and $T$.
Proposition 3: We have $T_\mu J^* = TJ^*$ if and only if $J_\mu = J^*$ (cf. optimality condition). Proof: If $T_\mu J^* = TJ^*$, then $T_\mu J^* = J^*$, implying $J^* = J_\mu$. Conversely, if $J_\mu = J^*$, then $T_\mu J^* = T_\mu J_\mu = J_\mu = J^* = TJ^*$.

44 RESULTS USING MON. AND CONTRACTION
Optimality of the fixed point: $J^*(x) = \min_{\mu \in M} J_\mu(x)$, for all $x \in X$
Existence of a nearly optimal policy: For every $\epsilon > 0$, there exists $\mu_\epsilon \in M$ such that
$J^*(x) \le J_{\mu_\epsilon}(x) \le J^*(x) + \epsilon$, $\quad x \in X$
Nonstationary policies: Consider the set $\Pi$ of all sequences $\pi = \{\mu_0, \mu_1, \ldots\}$ with $\mu_k \in M$ for all $k$, and define
$J_\pi(x) = \liminf_{k\to\infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J)(x)$, $\quad x \in X$,
with $J$ being any function (the choice of $J$ does not matter)
We have $J^*(x) = \min_{\pi \in \Pi} J_\pi(x)$, for all $x \in X$

45 THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any (bounded) $J$, $\; J^*(x) = \lim_{k\to\infty} (T^k J)(x)$, for all $x$
Policy iteration: Given $\mu^k$,
Policy evaluation: Find $J_{\mu^k}$ by solving $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$
Policy improvement: Find $\mu^{k+1}$ such that $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$
Optimistic PI: This is PI, where policy evaluation is carried out by a finite number of VI
Shorthand definition: For some integers $m_k$,
$T_{\mu^k} J_k = T J_k$, $\quad J_{k+1} = T_{\mu^k}^{m_k} J_k$, $\quad k = 0, 1, \ldots$
If $m_k \equiv 1$ it becomes VI
If $m_k = \infty$ it becomes PI
For intermediate values of $m_k$, it is generally more efficient than either VI or PI

46 ASYNCHRONOUS ALGORITHMS
Motivation for asynchronous algorithms:
Faster convergence
Parallel and distributed computation
Simulation-based implementations
General framework: Partition $X$ into disjoint nonempty subsets $X_1, \ldots, X_m$, and use a separate processor $l$ updating $J(x)$ for $x \in X_l$
Let $J$ be partitioned as $J = (J_1, \ldots, J_m)$, where $J_l$ is the restriction of $J$ on the set $X_l$.
Synchronous VI algorithm: $J_l^{t+1}(x) = T(J_1^t, \ldots, J_m^t)(x)$, $\; x \in X_l$, $l = 1, \ldots, m$
Asynchronous VI algorithm: For some subsets of times $R_l$,
$J_l^{t+1}(x) = T\bigl(J_1^{\tau_{l1}(t)}, \ldots, J_m^{\tau_{lm}(t)}\bigr)(x)$ if $t \in R_l$, and $J_l^{t+1}(x) = J_l^t(x)$ if $t \notin R_l$,
where $t - \tau_{lj}(t)$ are communication "delays"

47 ONE-STATE-AT-A-TIME ITERATIONS
Important special case: Assume $n$ states, a separate processor for each state, and no delays
Generate a sequence of states $\{x^0, x^1, \ldots\}$, generated in some way, possibly by simulation (each state is generated infinitely often)
Asynchronous VI:
$J_l^{t+1} = T(J_1^t, \ldots, J_n^t)(l)$ if $l = x^t$, and $J_l^{t+1} = J_l^t$ if $l \ne x^t$,
where $T(J_1^t, \ldots, J_n^t)(l)$ denotes the $l$-th component of the vector $T(J_1^t, \ldots, J_n^t) = TJ^t$
The special case where $\{x^0, x^1, \ldots\} = \{1, \ldots, n, 1, \ldots, n, 1, \ldots\}$ is the Gauss-Seidel method
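A small sketch of this one-state-at-a-time (Gauss-Seidel) form of VI on an illustrative random MDP: each update of J(l) immediately uses the latest values of the other components.

```python
import numpy as np

# Gauss-Seidel (one-state-at-a-time) value iteration sketch on an illustrative random MDP.
rng = np.random.default_rng(2)
n, m, alpha = 10, 3, 0.9
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((m, n))

J = np.zeros(n)
for sweep in range(300):                                  # cyclic order 1,...,n,1,...,n,...
    for l in range(n):
        J[l] = (g[:, l] + alpha * P[:, l, :] @ J).min()   # update component l in place
# states could also be drawn randomly or by simulation, each visited infinitely often

print("Gauss-Seidel VI:", np.round(J, 3))
```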

48 ASYNCHRONOUS CONV. THEOREM I
KEY FACT: VI and also PI (with some modifications) still work when implemented asynchronously
Assume that for all $l, j = 1, \ldots, m$, $R_l$ is infinite and $\lim_{t\to\infty} \tau_{lj}(t) = \infty$
Proposition: Let $T$ have a unique fixed point $J^*$, and assume that there is a sequence of nonempty subsets $\{S(k)\} \subset R(X)$ with $S(k+1) \subset S(k)$ for all $k$, and with the following properties:
(1) Synchronous Convergence Condition: Every sequence $\{J^k\}$ with $J^k \in S(k)$ for each $k$ converges pointwise to $J^*$. Moreover, $TJ \in S(k+1)$ for all $J \in S(k)$, $k = 0, 1, \ldots$
(2) Box Condition: For all $k$, $S(k)$ is a Cartesian product of the form $S(k) = S_1(k) \times \cdots \times S_m(k)$, where $S_l(k)$ is a set of real-valued functions on $X_l$, $l = 1, \ldots, m$.
Then for every $J \in S(0)$, the sequence $\{J^t\}$ generated by the asynchronous algorithm converges pointwise to $J^*$.

49 ASYNCHRONOUS CONV. THEOREM II
Interpretation of assumptions:
(Figure: nested sets $S(0) \supset S(k) \supset S(k+1) \ni J^*$ in the space of $J = (J_1, J_2)$, with $S(k) = S_1(k) \times S_2(k)$.) A synchronous iteration from any $J$ in $S(k)$ moves into $S(k+1)$ (component-by-component)
Convergence mechanism:
(Figure: component-wise iterations of $J_1$ and $J_2$ moving through the nested sets toward $J^*$.)
Key: Independent component-wise improvement. An asynchronous component iteration from any $J$ in $S(k)$ moves into the corresponding component portion of $S(k+1)$

50 APPROXIMATE DYNAMIC PROGRAMMING LECTURE 3 LECTURE OUTLINE Review of discounted DP Introduction to approximate DP Approximation architectures Simulation-based approximate policy iteration Approximate policy evaluation Some general issues about approximation and simulation

51 REVIEW

52 DISCOUNTED PROBLEMS/BOUNDED COST Stationary system with arbitrary state space x k+1 = f(x k,u k,w k ), k = 0,1,... Cost of a policy π = {µ 0,µ 1,...} J π (x 0 ) = lim N E w k k=0,1,... { N 1 k=0 α k g ( x k,µ k (x k ),w k ) } withα < 1,andforsomeM,wehave g(x,u,w) M for all (x,u,w) Shorthand notation for DP mappings (operate on functions of state to produce other functions) (TJ)(x) = min u U(x) E w { g(x,u,w)+αj ( f(x,u,w) )}, x TJ is the optimal cost function for the one-stage problem with stage cost g and terminal cost αj For any stationary policy µ (T µ J)(x) = E w { g ( x,µ(x),w ) +αj ( f(x,µ(x),w) )}, x

53 MDP - TRANSITION PROBABILITY NOTATION
We will mostly assume the system is an $n$-state (controlled) Markov chain
We will often switch to Markov chain notation:
States $i = 1, \ldots, n$ (instead of $x$)
Transition probabilities $p_{i_k i_{k+1}}(u_k)$ [instead of $x_{k+1} = f(x_k, u_k, w_k)$]
Stage cost $g(i_k, u_k, i_{k+1})$ [instead of $g(x_k, u_k, w_k)$]
Cost functions $J = \bigl(J(1), \ldots, J(n)\bigr)$ (vectors in $\Re^n$)
Cost of a policy $\pi = \{\mu_0, \mu_1, \ldots\}$:
$J_\pi(i) = \lim_{N\to\infty} E_{i_k,\,k=1,2,\ldots}\bigl\{ \sum_{k=0}^{N-1} \alpha^k g\bigl(i_k, \mu_k(i_k), i_{k+1}\bigr) \mid i_0 = i \bigr\}$
Shorthand notation for DP mappings:
$(TJ)(i) = \min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl( g(i,u,j) + \alpha J(j) \bigr)$, $\quad i = 1, \ldots, n$
$(T_\mu J)(i) = \sum_{j=1}^n p_{ij}\bigl(\mu(i)\bigr)\bigl( g\bigl(i,\mu(i),j\bigr) + \alpha J(j) \bigr)$, $\quad i = 1, \ldots, n$

54 SHORTHAND THEORY A SUMMARY Bellman s equation: J = TJ, J µ = T µ J µ or J (i) = min u U(i) n p ij (u) ( g(i,u,j)+αj (j) ), i j=1 J µ (i) = n ( )( ( ) p ij µ(i) g i,µ(i),j +αjµ (j) ), i j=1 Optimality condition: i.e., µ: optimal <==> T µ J = TJ µ(i) arg min u U(i) n p ij (u) ( g(i,u,j)+αj (j) ), i j=1

55 THE TWO MAIN ALGORITHMS: VI AND PI Value iteration: For any J R n J (i) = lim k (Tk J)(i), i = 1,...,n Policy iteration: Given µ k Policy evaluation: Find J µ k by solving J µ k(i) = n ( p ij µ k (i) )( g ( i,µ k (i),j ) +αj µ k(j) ), i = 1,...,n j=1 or J µ k = T µ kj µ k Policy improvement: Let µ k+1 be such that µ k+1 (i) arg min u U(i) n p ij (u) ( g(i,u,j)+αj µ k(j) ), i j=1 or T µ k+1j µ k = TJ µ k Policy evaluation is equivalent to solving an n n linear system of equations For large n, exact PI is out of the question. We use instead optimistic PI (policy evaluation with a few VIs)

56 APPROXIMATE DP

57 GENERAL ORIENTATION TO ADP ADP (late 80s - present) is a breakthrough methodology that allows the application of DP to problems with many or infinite number of states. Other names for ADP are: reinforcement learning (RL). neuro-dynamic programming (NDP). adaptive dynamic programming (ADP). We will mainly adopt an n-state discounted model (the easiest case - but think of HUGE n). Extensions to other DP models (continuous space, continuous-time, not discounted) are possible (but more quirky). We will set aside for later. There are many approaches: Problem approximation Simulation-based approaches (we will focus on these) Simulation-based methods are of three types: Rollout (we will not discuss further) Approximation in value space Approximation in policy space

58 WHY DO WE USE SIMULATION?
One reason: Computational complexity advantage in computing sums/expectations involving a very large number of terms
Any sum $\sum_{i=1}^n a_i$ can be written as an expected value:
$\sum_{i=1}^n a_i = \sum_{i=1}^n \xi_i \frac{a_i}{\xi_i} = E_\xi\Bigl\{ \frac{a_i}{\xi_i} \Bigr\}$,
where $\xi$ is any probability distribution over $\{1, \ldots, n\}$
It can be approximated by generating many samples $\{i_1, \ldots, i_k\}$ from $\{1, \ldots, n\}$, according to the distribution $\xi$, and Monte Carlo averaging:
$\sum_{i=1}^n a_i = E_\xi\Bigl\{ \frac{a_i}{\xi_i} \Bigr\} \approx \frac{1}{k} \sum_{t=1}^k \frac{a_{i_t}}{\xi_{i_t}}$
Simulation is also convenient when an analytical model of the system is unavailable, but a simulation/computer model is possible.
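A tiny numerical illustration of the point above: a large sum rewritten as an expectation and estimated by Monte Carlo averaging of sampled terms (the terms a_i and the sampling distribution below are made up).

```python
import numpy as np

# Estimate sum_i a_i = E_xi[ a_i / xi_i ] by sampling indices i ~ xi and averaging.
rng = np.random.default_rng(3)
n = 10**6
a = rng.standard_normal(n) / n          # a large collection of terms (illustrative)
xi = np.full(n, 1.0 / n)                # sampling distribution (here uniform)

k = 10**4
idx = rng.choice(n, size=k, p=xi)       # samples i_1, ..., i_k drawn from xi
print("exact sum  :", a.sum())
print("MC estimate:", np.mean(a[idx] / xi[idx]))
```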

59 APPROXIMATION IN VALUE AND POLICY SPACE

60 APPROXIMATION IN VALUE SPACE Approximate J or J µ from a parametric class J(i;r)whereiisthecurrentstateandr = (r 1,...,r m ) is a vector of tunable scalars weights Use J in place of J or J µ in various algorithms and computations Role of r: By adjusting r we can change the shape of J so that it is close to J or J µ Two key issues: The choice of parametric class J(i;r) (the approximation architecture) Method for tuning the weights ( training the architecture) Success depends strongly on how these issues are handled... also on insight about the problem A simulator may be used, particularly when there is no mathematical model of the system (but there is a computer model) We will focus on simulation, but this is not the only possibility We may also use parametric approximation for Q-factors or cost function differences

61 APPROXIMATION ARCHITECTURES
Divided in linear and nonlinear [i.e., linear or nonlinear dependence of $\tilde J(i;r)$ on $r$]
Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer
Computer chess example: Think of the board position as state and the move as control
Uses a feature-based position evaluator that assigns a score (or approximate Q-factor) to each position/move
(Figure: position evaluator pipeline - feature extraction of features such as material balance, mobility, safety, etc., followed by a weighting of features to produce a score.)
Relatively few special features and weights, and multistep lookahead

62 LINEAR APPROXIMATION ARCHITECTURES
Often, the features encode much of the nonlinearity inherent in the cost function being approximated
Then the approximation may be quite accurate without a complicated architecture (as an extreme example, the ideal feature is the true cost function)
With well-chosen features, we can use a linear architecture:
$\tilde J(i;r) = \phi(i)'r$, $\; i = 1, \ldots, n$, or $\; \tilde J(r) = \Phi r = \sum_{j=1}^s \Phi_j r_j$
$\Phi$: the matrix whose rows are $\phi(i)'$, $i = 1, \ldots, n$; $\Phi_j$ is the $j$th column of $\Phi$
(Figure: state $i$ → feature extraction mapping → feature vector $\phi(i)$ → linear mapping → linear cost approximator $\phi(i)'r$.)
This is approximation on the subspace $S = \{\Phi r \mid r \in \Re^s\}$ spanned by the columns of $\Phi$ (basis functions)
Many examples of feature types: polynomial approximation, radial basis functions, etc.

63 ILLUSTRATIONS: POLYNOMIAL TYPE
Polynomial approximation, e.g., a quadratic approximating function. Let the state be $i = (i_1, \ldots, i_q)$ (i.e., have $q$ "dimensions") and define
$\phi_0(i) = 1$, $\; \phi_k(i) = i_k$, $\; \phi_{km}(i) = i_k i_m$, $\quad k, m = 1, \ldots, q$
Linear approximation architecture:
$\tilde J(i;r) = r_0 + \sum_{k=1}^q r_k i_k + \sum_{k=1}^q \sum_{m=k}^q r_{km} i_k i_m$,
where $r$ has components $r_0$, $r_k$, and $r_{km}$.
Interpolation: A subset $I$ of special/representative states is selected, and the parameter vector $r$ has one component $r_i$ per state $i \in I$. The approximating function is
$\tilde J(i;r) = r_i$ for $i \in I$, and $\tilde J(i;r) =$ interpolation using the values at $i \in I$ for $i \notin I$
For example, piecewise constant, piecewise linear, or more general polynomial interpolations.
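A minimal sketch of the quadratic polynomial architecture above as code: a feature map phi(i) built from a q-dimensional state, with J_tilde(i; r) = phi(i)'r (the state dimension and weights are placeholders).

```python
import numpy as np

# Quadratic polynomial features phi(i) for a q-dimensional state i = (i_1, ..., i_q),
# and the linear architecture J_tilde(i; r) = phi(i)' r (illustrative q and weights).
def phi(i):
    i = np.asarray(i, dtype=float)
    q = len(i)
    feats = [1.0]                                                   # phi_0(i) = 1
    feats += list(i)                                                # phi_k(i) = i_k
    feats += [i[k] * i[m] for k in range(q) for m in range(k, q)]   # phi_km(i) = i_k * i_m
    return np.array(feats)

q = 3
r = np.ones(len(phi(np.zeros(q))))                                  # weight vector r
print("feature dimension:", len(r))
print("J_tilde((1,2,3); r) =", phi([1, 2, 3]) @ r)
```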

64 A DOMAIN SPECIFIC EXAMPLE
Tetris game (used as a testbed in competitions)
(Figure: a Tetris board position; the game ends at TERMINATION.)
$J^*(i)$: optimal score starting from position $i$
Number of states: astronomically large (for the standard board)
Success with just 22 features, readily recognized by Tetris players as capturing important aspects of the board position (heights of columns, etc.)

65 APPROX. PI - OPTION TO APPROX. $J_\mu$ OR $Q_\mu$
Use simulation to approximate the cost $J_\mu$ of the current policy $\mu$
Generate an "improved" policy $\bar\mu$ by minimizing in the (approximate) Bellman equation
(Figure: approximate PI loop - approximate policy evaluation of the cost $\tilde J_\mu(i,r)$ of the current policy, followed by policy improvement to generate the improved policy $\bar\mu$.)
Alternatively, approximate the Q-factors of $\mu$
(Figure: the same loop with approximate Q-factors $\tilde Q_\mu(i,u,r)$ and policy improvement $\bar\mu(i) = \arg\min_{u \in U(i)} \tilde Q_\mu(i,u,r)$.)

66 APPROXIMATING $J^*$ OR $Q^*$
Approximation of the optimal cost function $J^*$:
Q-Learning: Use a simulation algorithm to approximate the Q-factors
$Q^*(i,u) = g(i,u) + \alpha \sum_{j=1}^n p_{ij}(u) J^*(j)$
and the optimal costs $J^*(i) = \min_{u \in U(i)} Q^*(i,u)$
Bellman error approach: Find $r$ to
$\min_r E_i\bigl\{ \bigl( \tilde J(i;r) - (T\tilde J)(i;r) \bigr)^2 \bigr\}$
where $E_i\{\cdot\}$ is taken with respect to some distribution over the states
Approximate Linear Programming (we will not discuss here)
Q-learning can also be used with approximations
Q-learning and the Bellman error approach can also be used for policy evaluation

67 APPROXIMATION IN POLICY SPACE
A brief discussion; we will return to it later.
Use a parametrization $\tilde\mu(i;r)$ of policies with a vector $r = (r_1, \ldots, r_s)$. Examples:
Polynomial, e.g., $\tilde\mu(i;r) = r_1 + r_2 i + r_3 i^2$
Linear feature-based, $\tilde\mu(i;r) = \phi_1(i) r_1 + \phi_2(i) r_2$
Optimize the cost over $r$. For example:
Each value of $r$ defines a stationary policy, with cost starting at state $i$ denoted by $\tilde J(i;r)$.
Let $(p_1, \ldots, p_n)$ be some probability distribution over the states, and minimize over $r$: $\sum_{i=1}^n p_i \tilde J(i;r)$
Use a random search, gradient, or other method
A special case: The parametrization of the policies is indirect, through a cost approximation architecture $\hat J$, i.e.,
$\tilde\mu(i;r) \in \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl( g(i,u,j) + \alpha \hat J(j;r) \bigr)$

68 APPROXIMATE POLICY EVALUATION METHODS

69 DIRECT POLICY EVALUATION
Approximate the cost of the current policy by using least squares and simulation-generated cost samples
Amounts to projection of $J_\mu$ onto the approximation subspace
(Figure: direct method - projection $\Pi J_\mu$ of the cost vector $J_\mu$ onto the subspace $S = \{\Phi r \mid r \in \Re^s\}$.)
Solution by least squares methods
Regular and optimistic policy iteration
Nonlinear approximation architectures may also be used

70 DIRECT EVALUATION BY SIMULATION
Projection by Monte Carlo simulation: Compute the projection $\Pi J_\mu$ of $J_\mu$ on the subspace $S = \{\Phi r \mid r \in \Re^s\}$, with respect to a weighted Euclidean norm $\|\cdot\|_\xi$
Equivalently, find $\Phi r^*$, where
$r^* = \arg\min_{r \in \Re^s} \|\Phi r - J_\mu\|_\xi^2 = \arg\min_{r \in \Re^s} \sum_{i=1}^n \xi_i \bigl( \phi(i)'r - J_\mu(i) \bigr)^2$
Setting to 0 the gradient at $r^*$,
$r^* = \Bigl( \sum_{i=1}^n \xi_i\, \phi(i)\phi(i)' \Bigr)^{-1} \sum_{i=1}^n \xi_i\, \phi(i) J_\mu(i)$
Generate samples $\bigl\{ (i_1, J_\mu(i_1)), \ldots, (i_k, J_\mu(i_k)) \bigr\}$ using the distribution $\xi$
Approximate by Monte Carlo the two expected values with low-dimensional calculations:
$\hat r_k = \Bigl( \sum_{t=1}^k \phi(i_t)\phi(i_t)' \Bigr)^{-1} \sum_{t=1}^k \phi(i_t) J_\mu(i_t)$
Equivalent least squares alternative calculation:
$\hat r_k = \arg\min_{r \in \Re^s} \sum_{t=1}^k \bigl( \phi(i_t)'r - J_\mu(i_t) \bigr)^2$
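A hedged end-to-end sketch of direct policy evaluation: Monte Carlo cost samples of J_mu generated by simulating an illustrative small Markov chain, followed by the least-squares fit of a linear architecture Phi r, and a comparison with the exact J_mu.

```python
import numpy as np

# Direct policy evaluation sketch: sample states, generate Monte Carlo cost samples of
# J_mu from truncated trajectories, then least-squares fit Phi r to the samples
# (the chain, costs, and features below are illustrative).
rng = np.random.default_rng(4)
n, alpha = 20, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)    # chain of the policy mu
g = rng.random(n)                                            # expected stage cost under mu
Phi = np.column_stack([np.ones(n), np.arange(n), np.arange(n) ** 2])   # s = 3 features

samples_i, samples_y = [], []
for _ in range(2000):
    i = rng.integers(n)                                      # sampled initial state i_t
    i0, ret, disc = i, 0.0, 1.0
    for _ in range(100):                                     # alpha^100 is negligible
        ret += disc * g[i]
        disc *= alpha
        i = rng.choice(n, p=P[i])
    samples_i.append(i0); samples_y.append(ret)

idx, y = np.array(samples_i), np.array(samples_y)
r_hat, *_ = np.linalg.lstsq(Phi[idx], y, rcond=None)         # least-squares fit

J_mu = np.linalg.solve(np.eye(n) - alpha * P, g)             # exact J_mu, for comparison only
print("max |Phi r_hat - J_mu| =", np.abs(Phi @ r_hat - J_mu).max())
```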

71 INDIRECT POLICY EVALUATION
An example: Galerkin approximation
Solve the projected equation $\Phi r = \Pi T_\mu(\Phi r)$, where $\Pi$ is projection with respect to a suitable weighted Euclidean norm
(Figure: direct method - projection $\Pi J_\mu$ of the cost vector $J_\mu$; indirect method - solving the projected form $\Phi r = \Pi T_\mu(\Phi r)$ of Bellman's equation, both on the subspace $S = \{\Phi r \mid r \in \Re^s\}$.)
Solution methods that use simulation (to manage the calculation of $\Pi$):
TD(λ): Stochastic iterative algorithm for solving $\Phi r = \Pi T_\mu(\Phi r)$
LSTD(λ): Solves a simulation-based approximation with a standard solver
LSPE(λ): A simulation-based form of projected value iteration; essentially
$\Phi r_{k+1} = \Pi T_\mu(\Phi r_k) +$ simulation noise

72 BELLMAN EQUATION ERROR METHODS
Another example of indirect approximate policy evaluation:
$\min_r \|\Phi r - T_\mu(\Phi r)\|_\xi^2 \qquad (*)$
where $\|\cdot\|_\xi$ is a Euclidean norm, weighted with respect to some distribution $\xi$
It is closely related to the projected equation/Galerkin approach (with a special choice of projection norm)
There are several ways to implement the projected equation and Bellman error methods by simulation. They involve:
Generating many random samples of states $i_k$ using the distribution $\xi$
Generating many samples of transitions $(i_k, j_k)$ using the policy $\mu$
Forming a simulation-based approximation of the optimality condition for the projection problem or problem (*) (use sample averages in place of inner products)
Solving the Monte Carlo approximation of the optimality condition
Issues for indirect methods: How to generate the samples? How to calculate $r^*$ efficiently?

73 ANOTHER INDIRECT METHOD: AGGREGATION
A first idea: Group similar states together into "aggregate states" $x_1, \ldots, x_s$; assign a common cost value $r_i$ to each group $x_i$. Solve an "aggregate" DP problem, involving the aggregate states, to obtain $r = (r_1, \ldots, r_s)$. This is called hard aggregation
(Figure: a hard aggregation example in which the states are partitioned into aggregate states $x_1, x_2, x_3, x_4$; the corresponding $\Phi$ is the 0-1 matrix whose $i$th row has a single 1, in the column of the aggregate state containing $i$.)
More general/mathematical view: Solve $\Phi r = \Phi D T_\mu(\Phi r)$, where the rows of $D$ and $\Phi$ are probability distributions (e.g., $D$ and $\Phi$ "aggregate" rows and columns of the linear system $J = T_\mu J$)
Compare with the projected equation $\Phi r = \Pi T_\mu(\Phi r)$. Note: $\Phi D$ is a projection in some interesting cases

74 AGGREGATION AS PROBLEM APPROXIMATION
(Figure: original system states $i, j$ with transition probabilities $p_{ij}(u)$ and costs $g(i,u,j)$, linked to aggregate states $x, y$ through disaggregation probabilities $d_{xi}$ and aggregation probabilities $\phi_{jy}$.)
Aggregate transition probabilities and costs:
$\hat p_{xy}(u) = \sum_{i=1}^n d_{xi} \sum_{j=1}^n p_{ij}(u)\, \phi_{jy}$, $\qquad \hat g(x,u) = \sum_{i=1}^n d_{xi} \sum_{j=1}^n p_{ij}(u)\, g(i,u,j)$
Aggregation can be viewed as a systematic approach for problem approximation. Main elements:
Solve (exactly or approximately) the "aggregate" problem by any kind of VI or PI method (including simulation-based methods)
Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem
Because an exact PI algorithm is used to solve the approximate/aggregate problem, the method behaves more regularly than the projected equation approach

75 APPROXIMATE POLICY ITERATION ISSUES

76 THEORETICAL BASIS OF APPROXIMATE PI
If policies are approximately evaluated using an approximation architecture such that
$\max_i |\tilde J(i, r_k) - J_{\mu^k}(i)| \le \delta$, $\quad k = 0, 1, \ldots$
and if policy improvement is also approximate,
$\max_i |(T_{\mu^{k+1}} \tilde J)(i, r_k) - (T \tilde J)(i, r_k)| \le \epsilon$, $\quad k = 0, 1, \ldots$
Error bound: The sequence $\{\mu^k\}$ generated by approximate policy iteration satisfies
$\limsup_{k\to\infty} \max_i \bigl( J_{\mu^k}(i) - J^*(i) \bigr) \le \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}$
Typical practical behavior: The method makes steady progress up to a point and then the iterates $J_{\mu^k}$ oscillate within a neighborhood of $J^*$.
Oscillations are quite unpredictable. Some bad examples of oscillations have been constructed. In practice, oscillation between policies is probably not the major concern.

77 THE ISSUE OF EXPLORATION To evaluate a policy µ, we need to generate cost samples using that policy - this biases the simulation by underrepresenting states that are unlikely to occur under µ Cost-to-go estimates of underrepresented states may be highly inaccurate This seriously impacts the improved policy µ This is known as inadequate exploration - a particularly acute difficulty when the randomness embodied in the transition probabilities is relatively small (e.g., a deterministic system) Some remedies: Frequently restart the simulation and ensure that the initial states employed form a rich and representative subset Occasionally generate transitions that use a randomly selected control rather than the one dictated by the policy µ Other methods: Use two Markov chains (one is the chain of the policy and is used to generate the transition sequence, the other is used to generate the state sequence).

78 APPROXIMATING Q-FACTORS
Given $\tilde J(i;r)$, policy improvement requires a model [knowledge of $p_{ij}(u)$ for all controls $u \in U(i)$]
Model-free alternative: Approximate the Q-factors
$\tilde Q(i,u;r) \approx \sum_{j=1}^n p_{ij}(u)\bigl( g(i,u,j) + \alpha J_\mu(j) \bigr)$
and use for policy improvement the minimization
$\bar\mu(i) \in \arg\min_{u \in U(i)} \tilde Q(i,u;r)$
$r$ is an adjustable parameter vector and $\tilde Q(i,u;r)$ is a parametric architecture, such as
$\tilde Q(i,u;r) = \sum_{m=1}^s r_m \phi_m(i,u)$
We can adapt any of the cost approximation approaches, e.g., projected equations, aggregation
Use the Markov chain with states $(i,u)$, so $p_{ij}(\mu(i))$ is the transition probability to $(j, \mu(j))$, and 0 to other $(j, u')$
Major concern: Acutely diminished exploration

79 SOME GENERAL ISSUES

80 STOCHASTIC ALGORITHMS: GENERALITIES
Consider solution of a linear equation $x = b + Ax$ by using $m$ simulation samples $b + w_k$ and $A + W_k$, $k = 1, \ldots, m$, where $w_k, W_k$ are random, e.g., simulation noise
Think of $x = b + Ax$ as approximate policy evaluation (projected or aggregation equations)
Stochastic approximation (SA) approach: For $k = 1, \ldots, m$,
$x_{k+1} = (1 - \gamma_k) x_k + \gamma_k \bigl( (b + w_k) + (A + W_k) x_k \bigr)$
Monte Carlo estimation (MCE) approach: Form Monte Carlo estimates of $b$ and $A$,
$b_m = \frac{1}{m} \sum_{k=1}^m (b + w_k)$, $\qquad A_m = \frac{1}{m} \sum_{k=1}^m (A + W_k)$
Then solve $x = b_m + A_m x$ by matrix inversion, $x_m = (I - A_m)^{-1} b_m$, or iteratively
TD(λ) and Q-learning are SA methods
LSTD(λ) and LSPE(λ) are MCE methods
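A small numerical sketch contrasting the SA iteration and the MCE approach for x = b + Ax with noisy samples; the matrix A, vector b, noise levels, and stepsize gamma_k = 1/k are illustrative choices.

```python
import numpy as np

# Solve x = b + Ax from noisy samples b + w_k, A + W_k: stochastic approximation (SA)
# versus Monte Carlo estimation (MCE). All data and noise levels are illustrative.
rng = np.random.default_rng(5)
d, m = 4, 5000
A = 0.5 * np.eye(d) + 0.05 * rng.standard_normal((d, d))     # contraction-like A
b = rng.standard_normal(d)
x_star = np.linalg.solve(np.eye(d) - A, b)

x_sa = np.zeros(d)
b_sum, A_sum = np.zeros(d), np.zeros((d, d))
for k in range(1, m + 1):
    w, W = 0.1 * rng.standard_normal(d), 0.1 * rng.standard_normal((d, d))
    gamma = 1.0 / k                                          # diminishing stepsize
    x_sa = (1 - gamma) * x_sa + gamma * ((b + w) + (A + W) @ x_sa)   # SA iterate
    b_sum += b + w; A_sum += A + W
x_mce = np.linalg.solve(np.eye(d) - A_sum / m, b_sum / m)    # MCE: x_m = (I - A_m)^{-1} b_m

print("SA  error:", np.linalg.norm(x_sa - x_star))
print("MCE error:", np.linalg.norm(x_mce - x_star))
```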

81 COSTS OR COST DIFFERENCES?
Consider the exact policy improvement process. To compare two controls $u$ and $u'$ at $x$, we need
$E\bigl\{ g(x,u,w) - g(x,u',w) + \alpha \bigl( J_\mu(\bar x) - J_\mu(\bar x') \bigr) \bigr\}$
where $\bar x = f(x,u,w)$ and $\bar x' = f(x,u',w)$
Approximate $J_\mu(x)$ or $D_\mu(x,x') = J_\mu(x) - J_\mu(x')$?
Approximating $D_\mu(x,x')$ avoids "noise differencing". This can make a big difference
Important point: $D_\mu$ satisfies a Bellman equation for a system with state $(x,x')$:
$D_\mu(x,x') = E\bigl\{ G_\mu(x,x',w) + \alpha D_\mu(\bar x, \bar x') \bigr\}$
where $\bar x = f\bigl(x,\mu(x),w\bigr)$, $\bar x' = f\bigl(x',\mu(x'),w\bigr)$, and
$G_\mu(x,x',w) = g\bigl(x,\mu(x),w\bigr) - g\bigl(x',\mu(x'),w\bigr)$
$D_\mu$ can be "learned" by the standard methods (TD, LSTD, LSPE, Bellman error, aggregation, etc.). This is known as differential training.

82 AN EXAMPLE (FROM THE NDP TEXT)
System and cost per stage:
$x_{k+1} = x_k + \delta u_k$, $\qquad g(x,u) = \delta(x^2 + u^2)$
$\delta > 0$ is very small; think of a discretization of a continuous-time problem involving $dx(t)/dt = u(t)$
Consider the policy $\mu(x) = -2x$. Its cost function is
$J_\mu(x) = \frac{5x^2}{4}(1 + \delta) + O(\delta^2)$
and its Q-factor is
$Q_\mu(x,u) = \frac{5x^2}{4} + \delta\Bigl( \frac{9x^2}{4} + u^2 + \frac{5}{2}xu \Bigr) + O(\delta^2)$
The important part for policy improvement is $\delta\bigl( u^2 + \frac{5}{2}xu \bigr)$
When $J_\mu(x)$ [or $Q_\mu(x,u)$] is approximated by $\tilde J_\mu(x;r)$ [or by $\tilde Q_\mu(x,u;r)$], it will be dominated by $\frac{5x^2}{4}$ and will be "lost"

83 6.231 DYNAMIC PROGRAMMING LECTURE 4 LECTURE OUTLINE Review of approximation in value space Approximate VI and PI Projected Bellman equations Matrix form of the projected equation Simulation-based implementation LSTD and LSPE methods Optimistic versions Multistep projected Bellman equations Bias-variance tradeoff

84 REVIEW

85 DISCOUNTED MDP System: Controlled Markov chain with states i = 1,...,n, and finite control set U(i) at state i Transition probabilities: p ij (u) p ij (u) p ii (u) i j p jj (u) p ji (u) Cost of a policy π = {µ 0,µ 1,...} starting at state i: { N } J π (i) = lim N E k=0 α k g ( i k,µ k (i k ),i k+1 ) i0 = i with α [0,1) Shorthand notation for DP mappings (TJ)(i) = min u U(i) n p ij (u) ( g(i,u,j)+αj(j) ), i = 1,...,n, j=1 (T µ J)(i) = n ( )( ( ) ) p ij µ(i) g i,µ(i),j +αj(j), i = 1,...,n j=1

86 SHORTHAND THEORY A SUMMARY Bellman s equation: J = TJ, J µ = T µ J µ or J (i) = min u U(i) n p ij (u) ( g(i,u,j)+αj (j) ), i j=1 J µ (i) = n ( )( ( ) p ij µ(i) g i,µ(i),j +αjµ (j) ), i j=1 Optimality condition: i.e., µ: optimal <==> T µ J = TJ µ(i) arg min u U(i) n p ij (u) ( g(i,u,j)+αj (j) ), i j=1

87 THE TWO MAIN ALGORITHMS: VI AND PI Value iteration: For any J R n J (i) = lim k (Tk J)(i), i = 1,...,n Policy iteration: Given µ k Policy evaluation: Find J µ k by solving J µ k(i) = n ( p ij µ k (i) )( g ( i,µ k (i),j ) +αj µ k(j) ), i = 1,...,n j=1 or J µ k = T µ kj µ k Policy improvement: Let µ k+1 be such that µ k+1 (i) arg min u U(i) n p ij (u) ( g(i,u,j)+αj µ k(j) ), i j=1 or T µ k+1j µ k = TJ µ k Policy evaluation is equivalent to solving an n n linear system of equations For large n, exact PI is out of the question (even though it terminates finitely)

88 APPROXIMATION IN VALUE SPACE Approximate J or J µ from a parametric class J(i;r),whereiisthecurrentstateandr = (r 1,...,r s ) is a vector of tunable scalars weights Think n: HUGE, s: (Relatively) SMALL Many types of approximation architectures [i.e., parametric classes J(i;r)] to select from Any r R s defines a (suboptimal) one-step lookahead policy µ(i) = arg min u U(i) n p ij (u) ( g(i,u,j)+α J(j;r) ), i j=1 We want to find a good r We will focus mostly on linear architectures J(r) = Φr where Φ is an n s matrix whose columns are viewed as basis functions

89 LINEAR APPROXIMATION ARCHITECTURES We have J(i;r) = φ(i) r, i = 1,...,n where φ(i), i = 1,...,n is the ith row of Φ, or s J(r) = Φr = Φ j r j j=1 where Φ j is the jth column of Φ State i Feature Extraction Mapping Feature Vector φ(i) Linear Mapping Linear Cost Approximator φ(i) r This is approximation on the subspace S = {Φr r R s } spanned by the columns of Φ (basis functions) Many examples of feature types: Polynomial approximation, radial basis functions, etc Instead of computing J µ or J, which is hugedimensional, we compute the low-dimensional r = (r 1,...,r s ) using low-dimensional calculations

90 APPROXIMATE VALUE ITERATION

91 APPROXIMATE (FITTED) VI
Approximates sequentially $J_k(i) = (T^k J_0)(i)$, $k = 1, 2, \ldots$, with $\tilde J_k(i; r_k)$
The starting function $J_0$ is given (e.g., $J_0 \equiv 0$)
Approximate (Fitted) Value Iteration: A sequential "fit" to produce $\tilde J_{k+1}$ from $\tilde J_k$, i.e., $\tilde J_{k+1} \approx T \tilde J_k$ or (for a single policy $\mu$) $\tilde J_{k+1} \approx T_\mu \tilde J_k$
(Figure: fitted value iteration - successive iterates $\tilde J_1, \tilde J_2, \tilde J_3, \ldots$ obtained by fitting $T\tilde J_k$ on the subspace $S = \{\Phi r \mid r \in \Re^s\}$.)
After a large enough number $N$ of steps, $\tilde J_N(i; r_N)$ is used as the approximation $\tilde J(i;r)$ to $J^*(i)$
Possibly use (approximate) projection $\Pi$ with respect to some projection norm, $\tilde J_{k+1} \approx \Pi T \tilde J_k$

92 WEIGHTED EUCLIDEAN PROJECTIONS
Consider a weighted Euclidean norm
$\|J\|_\xi = \sqrt{ \sum_{i=1}^n \xi_i \bigl( J(i) \bigr)^2 }$,
where $\xi = (\xi_1, \ldots, \xi_n)$ is a positive distribution ($\xi_i > 0$ for all $i$).
Let $\Pi$ denote the projection operation onto $S = \{\Phi r \mid r \in \Re^s\}$ with respect to this norm, i.e., for any $J \in \Re^n$, $\Pi J = \Phi r^*$, where
$r^* = \arg\min_{r \in \Re^s} \|\Phi r - J\|_\xi^2$
Recall that weighted Euclidean projection can be implemented by simulation and least squares, i.e., sampling $J(i)$ according to $\xi$ and solving
$\min_{r \in \Re^s} \sum_{t=1}^k \bigl( \phi(i_t)'r - J(i_t) \bigr)^2$

93 FITTED VI - NAIVE IMPLEMENTATION
Select/sample a small subset $I_k$ of representative states
For each $i \in I_k$, given $\tilde J_k$, compute
$(T\tilde J_k)(i) = \min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl( g(i,u,j) + \alpha \tilde J_k(j; r) \bigr)$
"Fit" the function $\tilde J_{k+1}(i; r_{k+1})$ to the "small" set of values $(T\tilde J_k)(i)$, $i \in I_k$ (for example, use some form of approximate projection)
Simulation can be used for "model-free" implementation
Error bound: If the fit is uniformly accurate within $\delta > 0$, i.e.,
$\max_i |\tilde J_{k+1}(i) - (T\tilde J_k)(i)| \le \delta$,
then
$\limsup_{k\to\infty} \max_{i=1,\ldots,n} \bigl( \tilde J_k(i, r_k) - J^*(i) \bigr) \le \frac{2\alpha\delta}{(1-\alpha)^2}$
But there is a potential problem!

94 AN EXAMPLE OF FAILURE
Consider a two-state discounted MDP with states 1 and 2, and a single policy.
Deterministic transitions: $1 \to 2$ and $2 \to 2$
Transition costs $\equiv 0$, so $J^*(1) = J^*(2) = 0$.
Consider the (exact) fitted VI scheme that approximates cost functions within $S = \{ (r, 2r) \mid r \in \Re \}$ with a weighted least squares fit; here $\Phi = (1, 2)'$
Given $\tilde J_k = (r_k, 2r_k)$, we find $\tilde J_{k+1} = (r_{k+1}, 2r_{k+1})$, where $\tilde J_{k+1} = \Pi_\xi(T\tilde J_k)$, with weights $\xi = (\xi_1, \xi_2)$:
$r_{k+1} = \arg\min_r \Bigl[ \xi_1 \bigl( r - (T\tilde J_k)(1) \bigr)^2 + \xi_2 \bigl( 2r - (T\tilde J_k)(2) \bigr)^2 \Bigr]$
With a straightforward calculation,
$r_{k+1} = \alpha\beta r_k$, where $\beta = \frac{2(\xi_1 + 2\xi_2)}{\xi_1 + 4\xi_2} > 1$
So if $\alpha > 1/\beta$ (e.g., $\xi_1 = \xi_2 = 1$), the sequence $\{r_k\}$ diverges and so does $\{\tilde J_k\}$.
The difficulty is that $T$ is a contraction, but $\Pi_\xi T$ (= least squares fit composed with $T$) is not.
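A quick numerical check of this divergence example (exact fitted VI over S = {(r, 2r)}, weights xi_1 = xi_2 = 1 and alpha = 0.9, so that alpha*beta = 1.08 > 1):

```python
import numpy as np

# Two-state failure example: exact fitted VI within S = {(r, 2r)} with weights xi = (1, 1).
alpha = 0.9
Xi = np.diag([1.0, 1.0])         # weight matrix diag(xi_1, xi_2)
Phi = np.array([[1.0], [2.0]])

def T(J):                        # deterministic transitions 1 -> 2, 2 -> 2, zero cost
    return alpha * np.array([J[1], J[1]])

r = np.array([1.0])              # any nonzero start
for k in range(20):
    target = T(Phi @ r)          # value iterate
    # weighted least-squares fit of Phi r to the value iterate:
    r = np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi @ target)
print("r_20 =", r[0])            # grows like (alpha*beta)^k with beta = 6/5, i.e., diverges
```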

95 NORM MISMATCH PROBLEM
For the method to converge, we need $\Pi_\xi T$ to be a contraction; the contraction property of $T$ is not enough
(Figure: fitted value iteration with projection - iterates $\tilde J_{k+1} = \Pi_\xi(T\tilde J_k)$ on the subspace $S = \{\Phi r \mid r \in \Re^s\}$.)
We need a vector of weights $\xi$ such that $T$ is a contraction with respect to the weighted Euclidean norm $\|\cdot\|_\xi$
Then we can show that $\Pi_\xi T$ is a contraction with respect to $\|\cdot\|_\xi$
We will come back to this issue

96 APPROXIMATE POLICY ITERATION

97 APPROXIMATE PI
(Figure: approximate PI loop - approximate policy evaluation of the cost $\tilde J_\mu(i,r)$ of the current policy, followed by policy improvement to generate the improved policy $\bar\mu$.)
Evaluation of a typical policy $\mu$: Linear cost function approximation $\tilde J_\mu(r) = \Phi r$, where $\Phi$ is a full-rank $n \times s$ matrix with columns the basis functions, and $i$th row denoted $\phi(i)'$.
Policy improvement to generate $\bar\mu$:
$\bar\mu(i) \in \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl( g(i,u,j) + \alpha \phi(j)'r \bigr)$
Error bound (same as approximate VI): If
$\max_i |\tilde J_{\mu^k}(i, r_k) - J_{\mu^k}(i)| \le \delta$, $\quad k = 0, 1, \ldots$,
the sequence $\{\mu^k\}$ satisfies
$\limsup_{k\to\infty} \max_i \bigl( J_{\mu^k}(i) - J^*(i) \bigr) \le \frac{2\alpha\delta}{(1-\alpha)^2}$

98 POLICY EVALUATION
Let's consider approximate evaluation of the cost of the current policy by using simulation.
Direct policy evaluation - cost samples generated by simulation, and optimization by least squares
Indirect policy evaluation - solving the projected equation $\Phi r = \Pi T_\mu(\Phi r)$, where $\Pi$ is projection with respect to a suitable weighted Euclidean norm
(Figure: direct method - projection $\Pi J_\mu$ of the cost vector $J_\mu$; indirect method - solving the projected form $\Phi r = \Pi T_\mu(\Phi r)$ of Bellman's equation, both on the subspace $S = \{\Phi r \mid r \in \Re^s\}$.)
Recall that projection can be implemented by simulation and least squares

99 PI WITH INDIRECT POLICY EVALUATION
(Figure: approximate PI loop - approximate policy evaluation of the cost $\tilde J_\mu(i,r)$ of the current policy, followed by policy improvement to generate the improved policy $\bar\mu$.)
Given the current policy $\mu$:
We solve the projected Bellman equation $\Phi r = \Pi T_\mu(\Phi r)$
We approximate the solution $J_\mu$ of Bellman's equation $J = T_\mu J$ with the projected equation solution $\tilde J_\mu(r)$

100 KEY QUESTIONS AND RESULTS
Does the projected equation have a solution?
Under what conditions is the mapping $\Pi T_\mu$ a contraction, so that $\Pi T_\mu$ has a unique fixed point?
Assumption: The Markov chain corresponding to $\mu$ has a single recurrent class and no transient states, i.e., it has steady-state probabilities that are positive:
$\xi_j = \lim_{N\to\infty} \frac{1}{N} \sum_{k=1}^N P(i_k = j \mid i_0 = i) > 0$
Note that $\xi_j$ is the long-term frequency of state $j$.
Proposition (Norm Matching Property): Assume that the projection $\Pi$ is with respect to $\|\cdot\|_\xi$, where $\xi = (\xi_1, \ldots, \xi_n)$ is the steady-state probability vector. Then:
(a) $\Pi T_\mu$ is a contraction of modulus $\alpha$ with respect to $\|\cdot\|_\xi$.
(b) The unique fixed point $\Phi r^*$ of $\Pi T_\mu$ satisfies
$\|J_\mu - \Phi r^*\|_\xi \le \frac{1}{\sqrt{1-\alpha^2}} \|J_\mu - \Pi J_\mu\|_\xi$

101 PRELIMINARIES: PROJECTION PROPERTIES
Important property of the projection $\Pi$ on $S$ with weighted Euclidean norm $\|\cdot\|_\xi$: for all $J \in \Re^n$ and $\Phi r \in S$, the Pythagorean Theorem holds:
$\|J - \Phi r\|_\xi^2 = \|J - \Pi J\|_\xi^2 + \|\Pi J - \Phi r\|_\xi^2$
The Pythagorean Theorem implies that the projection is nonexpansive, i.e.,
$\|\Pi J - \Pi \bar J\|_\xi \le \|J - \bar J\|_\xi$, for all $J, \bar J \in \Re^n$.
To see this, note that
$\|\Pi(J - \bar J)\|_\xi^2 \le \|\Pi(J - \bar J)\|_\xi^2 + \|(I - \Pi)(J - \bar J)\|_\xi^2 = \|J - \bar J\|_\xi^2$

102 PROOF OF CONTRACTION PROPERTY
Lemma: If $P$ is the transition matrix of $\mu$, then $\|Pz\|_\xi \le \|z\|_\xi$ for all $z \in \Re^n$
Proof: Let $p_{ij}$ be the components of $P$. For all $z \in \Re^n$, we have
$\|Pz\|_\xi^2 = \sum_{i=1}^n \xi_i \Bigl( \sum_{j=1}^n p_{ij} z_j \Bigr)^2 \le \sum_{i=1}^n \xi_i \sum_{j=1}^n p_{ij} z_j^2 = \sum_{j=1}^n \sum_{i=1}^n \xi_i p_{ij} z_j^2 = \sum_{j=1}^n \xi_j z_j^2 = \|z\|_\xi^2$,
where the inequality follows from the convexity of the quadratic function, and the next-to-last equality follows from the defining property $\sum_{i=1}^n \xi_i p_{ij} = \xi_j$ of the steady-state probabilities.
Using the lemma, the nonexpansiveness of $\Pi$, and the definition $T_\mu J = g + \alpha P J$, we have
$\|\Pi T_\mu J - \Pi T_\mu \bar J\|_\xi \le \|T_\mu J - T_\mu \bar J\|_\xi = \alpha \|P(J - \bar J)\|_\xi \le \alpha \|J - \bar J\|_\xi$
for all $J, \bar J \in \Re^n$. Hence $\Pi T_\mu$ is a contraction of modulus $\alpha$.

103 PROOF OF ERROR BOUND
Let $\Phi r^*$ be the fixed point of $\Pi T$. We have
$\|J_\mu - \Phi r^*\|_\xi \le \frac{1}{\sqrt{1-\alpha^2}} \|J_\mu - \Pi J_\mu\|_\xi$.
Proof: We have
$\|J_\mu - \Phi r^*\|_\xi^2 = \|J_\mu - \Pi J_\mu\|_\xi^2 + \|\Pi J_\mu - \Phi r^*\|_\xi^2 = \|J_\mu - \Pi J_\mu\|_\xi^2 + \|\Pi T J_\mu - \Pi T(\Phi r^*)\|_\xi^2 \le \|J_\mu - \Pi J_\mu\|_\xi^2 + \alpha^2 \|J_\mu - \Phi r^*\|_\xi^2$,
where
The first equality uses the Pythagorean Theorem
The second equality holds because $J_\mu$ is the fixed point of $T$ and $\Phi r^*$ is the fixed point of $\Pi T$
The inequality uses the contraction property of $\Pi T$. Q.E.D.

104 SIMULATION-BASED SOLUTION OF PROJECTED EQUATION

105 MATRIX FORM OF PROJECTED EQUATION
(Figure: the projected equation $\Phi r^* = \Pi_\xi T_\mu(\Phi r^*)$, with $T_\mu(\Phi r) = g + \alpha P \Phi r$, on the subspace $S = \{\Phi r \mid r \in \Re^s\}$.)
The solution $\Phi r^*$ satisfies the orthogonality condition: the error $\Phi r^* - (g + \alpha P \Phi r^*)$ is "orthogonal" to the subspace spanned by the columns of $\Phi$.
This is written as
$\Phi' \Xi \bigl( \Phi r^* - (g + \alpha P \Phi r^*) \bigr) = 0$,
where $\Xi$ is the diagonal matrix with the steady-state probabilities $\xi_1, \ldots, \xi_n$ along the diagonal.
Equivalently, $C r^* = d$, where
$C = \Phi' \Xi (I - \alpha P) \Phi$, $\qquad d = \Phi' \Xi g$
but computing $C$ and $d$ is HARD (high-dimensional inner products).

106 SOLUTION OF PROJECTED EQUATION
Solve $Cr^* = d$ by matrix inversion: $r^* = C^{-1} d$
Projected Value Iteration (PVI) method:
$\Phi r_{k+1} = \Pi T(\Phi r_k) = \Pi(g + \alpha P \Phi r_k)$
Converges to $r^*$ because $\Pi T$ is a contraction.
(Figure: the value iterate $T(\Phi r_k) = g + \alpha P \Phi r_k$ is projected onto the subspace $S$ spanned by the basis functions to obtain $\Phi r_{k+1}$.)
PVI can be written as
$r_{k+1} = \arg\min_{r \in \Re^s} \|\Phi r - (g + \alpha P \Phi r_k)\|_\xi^2$
By setting to 0 the gradient with respect to $r$, which yields $\Phi' \Xi \bigl( \Phi r_{k+1} - (g + \alpha P \Phi r_k) \bigr) = 0$, we obtain
$r_{k+1} = r_k - (\Phi' \Xi \Phi)^{-1} (C r_k - d)$

107 SIMULATION-BASED IMPLEMENTATIONS

- Key idea: calculate simulation-based approximations based on k samples:
      C_k ≈ C,   d_k ≈ d
- Matrix inversion r* = C⁻¹d is approximated by
      r̂_k = C_k⁻¹ d_k
  This is the LSTD (Least Squares Temporal Differences) method.
- The PVI method r_{k+1} = r_k − (Φ'ΞΦ)⁻¹(Cr_k − d) is approximated by
      r_{k+1} = r_k − G_k(C_k r_k − d_k),   where G_k ≈ (Φ'ΞΦ)⁻¹
  This is the LSPE (Least Squares Policy Evaluation) method.
- Key fact: C_k, d_k, and G_k can be computed with low-dimensional linear algebra (of order s, the number of basis functions).

108 SIMULATION MECHANICS

- We generate an infinitely long trajectory (i_0, i_1, ...) of the Markov chain, so states i and transitions (i,j) appear with long-term frequencies ξ_i and p_ij.
- After generating each transition (i_t, i_{t+1}), we compute the row φ(i_t)' of Φ and the cost component g(i_t, i_{t+1}).
- We form
      d_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) g(i_t, i_{t+1}) ≈ Σ_{i,j} ξ_i p_ij φ(i) g(i,j) = Φ'Ξg = d
      C_k = (1/(k+1)) Σ_{t=0}^k φ(i_t)(φ(i_t) − αφ(i_{t+1}))' ≈ Φ'Ξ(I − αP)Φ = C
- Also, in the case of LSPE,
      G_k = (1/(k+1)) Σ_{t=0}^k φ(i_t)φ(i_t)' ≈ Φ'ΞΦ
- Convergence based on the law of large numbers.
- C_k, d_k, and G_k can be formed incrementally. They can also be written using the formalism of temporal differences (this is just a matter of style).
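A minimal simulation sketch of these mechanics is given below. The chain, basis, and transition costs are hypothetical (not the book's code); it forms C_k, d_k, G_k incrementally from one trajectory, and applies both the LSTD formula and occasional LSPE-style updates.

```python
# Simulation-based formation of C_k, d_k, G_k, with LSTD and LSPE estimates.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.3, 0.4],
              [0.2, 0.5, 0.3]])              # chain of the evaluated policy mu (hypothetical)

def g(i, j):                                  # hypothetical transition cost g(i, j)
    return float(i == 0) + 0.5 * j

Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
s = Phi.shape[1]

C = np.zeros((s, s)); d = np.zeros(s); G = np.zeros((s, s))
r = np.zeros(s)                               # LSPE iterate
i = 0
K = 20000
for t in range(K):
    j = rng.choice(3, p=P[i])                 # simulate one transition (i_t, i_{t+1})
    phi_i, phi_j = Phi[i], Phi[j]
    # running averages of the sample-based approximations C_k, d_k, G_k
    C += (np.outer(phi_i, phi_i - alpha * phi_j) - C) / (t + 1)
    d += (phi_i * g(i, j) - d) / (t + 1)
    G += (np.outer(phi_i, phi_i) - G) / (t + 1)
    if t >= 100 and t % 100 == 0:             # occasional LSPE updates r <- r - G_k^{-1}(C_k r - d_k)
        r = r - np.linalg.solve(G, C @ r - d)
    i = j

print(np.linalg.solve(C, d))                  # LSTD estimate r_hat_k = C_k^{-1} d_k
print(r)                                      # LSPE iterate; typically close to the LSTD estimate
```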

109 OPTIMISTIC VERSIONS

- Instead of calculating nearly exact approximations C_k ≈ C and d_k ≈ d, we do a less accurate approximation, based on few simulation samples
- Evaluate (coarsely) the current policy µ, then do a policy improvement
- This often leads to faster computation (as optimistic methods often do)
- Very complex behavior (see the subsequent discussion on oscillations)
- The matrix inversion/LSTD method has serious problems due to large simulation noise (because of limited sampling), particularly if the C matrix is ill-conditioned
- LSPE tends to cope better because of its iterative nature (this is true of other iterative methods as well)
- A stepsize γ ∈ (0,1] in LSPE may be useful to damp the effect of simulation noise:
      r_{k+1} = r_k − γG_k(C_k r_k − d_k)

110 MULTISTEP PROJECTED EQUATIONS

111 MULTISTEP METHODS

- Introduce a multistep version of Bellman's equation J = T^(λ) J, where for λ ∈ [0,1),
      T^(λ) = (1 − λ) Σ_{l=0}^∞ λ^l T^{l+1}
  a geometrically weighted sum of powers of T.
- Note that T^l is a contraction with modulus α^l, with respect to the weighted Euclidean norm ‖·‖_ξ, where ξ is the steady-state probability vector of the Markov chain.
- Hence T^(λ) is a contraction with modulus
      α_λ = (1 − λ) Σ_{l=0}^∞ α^{l+1} λ^l = α(1 − λ)/(1 − αλ)
  Note that α_λ → 0 as λ → 1.
- T^l and T^(λ) have the same fixed point J_µ, and
      ‖J_µ − Φr*_λ‖_ξ ≤ (1/√(1 − α_λ²)) ‖J_µ − ΠJ_µ‖_ξ
  where Φr*_λ is the fixed point of ΠT^(λ).
- The fixed point Φr*_λ depends on λ.
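A short numerical illustration (not from the text) makes the effect of λ on the contraction modulus and on the error-bound factor concrete; α = 0.95 is an arbitrary choice.

```python
# Contraction modulus alpha_lambda = alpha(1 - lambda)/(1 - alpha*lambda)
# and the error-bound factor 1 / sqrt(1 - alpha_lambda^2), for several lambda.
import numpy as np

alpha = 0.95
for lam in [0.0, 0.5, 0.9, 0.99]:
    a_lam = alpha * (1 - lam) / (1 - alpha * lam)
    factor = 1.0 / np.sqrt(1 - a_lam ** 2)
    print(f"lambda={lam:4.2f}  alpha_lambda={a_lam:5.3f}  bound factor={factor:6.3f}")
```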

112 BIAS-VARIANCE TRADEOFF

[Figure: Φr*_λ, the solution of the projected equation Φr = ΠT^(λ)(Φr), moves along the subspace S = {Φr | r ∈ R^s} from a biased point (λ = 0) toward ΠJ_µ (λ = 1); the bias shrinks while the simulation error grows as λ increases]

- Error bound: ‖J_µ − Φr*_λ‖_ξ ≤ (1/√(1 − α_λ²)) ‖J_µ − ΠJ_µ‖_ξ
- As λ ↑ 1, we have α_λ ↓ 0, so the error bound (and the quality of approximation) improves as λ ↑ 1. In fact
      lim_{λ→1} Φr*_λ = ΠJ_µ
- But the simulation noise in approximating
      T^(λ) = (1 − λ) Σ_{l=0}^∞ λ^l T^{l+1}
  increases.
- Choice of λ is usually based on trial and error.

113 MULTISTEP PROJECTED EQ. METHODS

- The projected Bellman equation is Φr = ΠT^(λ)(Φr)
- In matrix form: C^(λ) r = d^(λ), where
      C^(λ) = Φ'Ξ(I − αP^(λ))Φ,   d^(λ) = Φ'Ξg^(λ),
  with
      P^(λ) = (1 − λ) Σ_{l=0}^∞ α^l λ^l P^{l+1},   g^(λ) = Σ_{l=0}^∞ α^l λ^l P^l g
- The LSTD(λ) method is
      r̂_k = (C_k^(λ))⁻¹ d_k^(λ),
  where C_k^(λ) and d_k^(λ) are simulation-based approximations of C^(λ) and d^(λ).
- The LSPE(λ) method is
      r_{k+1} = r_k − γG_k(C_k^(λ) r_k − d_k^(λ))
  where G_k is a simulation-based approximation to (Φ'ΞΦ)⁻¹
- TD(λ): an important simpler/slower iteration [similar to LSPE(λ) with G_k = I - see the text].

114 MORE ON MULTISTEP METHODS

- The simulation process to obtain C_k^(λ) and d_k^(λ) is similar to the case λ = 0 (a single simulation trajectory i_0, i_1, ..., but more complex formulas):
      C_k^(λ) = (1/(k+1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} (φ(i_m) − αφ(i_{m+1}))'
      d_k^(λ) = (1/(k+1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} g_{i_m}
- In the context of approximate policy iteration, we can use optimistic versions (few samples between policy updates). Many different versions (see the text).
- Note the λ-tradeoffs:
  - As λ ↑ 1, C_k^(λ) and d_k^(λ) contain more simulation noise, so more samples are needed for a close approximation of r_λ (the solution of the projected equation)
  - The error bound ‖J_µ − Φr_λ‖_ξ becomes smaller
  - As λ ↑ 1, ΠT^(λ) becomes a contraction for an arbitrary projection norm
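Swapping the order of the two sums above gives an eligibility-trace recursion z_m = αλ z_{m−1} + φ(i_m), so C_k^(λ) and d_k^(λ) can still be formed in a single pass over the trajectory. The sketch below (hypothetical chain and per-transition cost samples, not the book's code) implements LSTD(λ) this way.

```python
# LSTD(lambda) via the eligibility-trace form of C_k^(lambda) and d_k^(lambda).
import numpy as np

rng = np.random.default_rng(0)
alpha, lam = 0.9, 0.7
P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.3, 0.4],
              [0.2, 0.5, 0.3]])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
s = Phi.shape[1]

def cost(i, j):                               # hypothetical per-transition cost sample
    return float(i == 0) + 0.5 * j

C_lam = np.zeros((s, s)); d_lam = np.zeros(s)
z = np.zeros(s)                               # eligibility trace z_m = alpha*lam*z_{m-1} + phi(i_m)
i = 0
K = 50000
for m in range(K):
    j = rng.choice(3, p=P[i])
    z = alpha * lam * z + Phi[i]
    C_lam += (np.outer(z, Phi[i] - alpha * Phi[j]) - C_lam) / (m + 1)
    d_lam += (z * cost(i, j) - d_lam) / (m + 1)
    i = j

print(np.linalg.solve(C_lam, d_lam))          # LSTD(lambda) estimate of r*_lambda
```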

115 6.231 DYNAMIC PROGRAMMING LECTURE 5 LECTURE OUTLINE

- Review of approximate PI based on projected Bellman equations
- Issues of policy improvement
- Exploration enhancement in policy evaluation
- Oscillations in approximate PI
- Aggregation: an alternative to the projected equation/Galerkin approach
- Examples of aggregation
- Simulation-based aggregation
- Relation between aggregation and projected equations

116 REVIEW

117 DISCOUNTED MDP

- System: controlled Markov chain with states i = 1, ..., n and finite set of controls u ∈ U(i)
- Transition probabilities: p_ij(u)

[Figure: two-state transition diagram with probabilities p_ii(u), p_ij(u), p_ji(u), p_jj(u)]

- Cost of a policy π = {µ_0, µ_1, ...} starting at state i:
      J_π(i) = lim_{N→∞} E{ Σ_{k=0}^N α^k g(i_k, µ_k(i_k), i_{k+1}) | i = i_0 },   with α ∈ [0,1)
- Shorthand notation for the DP mappings:
      (TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u)(g(i,u,j) + αJ(j)),   i = 1, ..., n,
      (T_µ J)(i) = Σ_{j=1}^n p_ij(µ(i))(g(i,µ(i),j) + αJ(j)),   i = 1, ..., n
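A tabular implementation of these mappings is straightforward. The sketch below (hypothetical 2-state, 2-control data) defines T and T_µ and runs value iteration with T; it is illustrative only, not code from the text.

```python
# Tabular DP mappings T and T_mu for a small hypothetical discounted MDP.
import numpy as np

alpha = 0.9
# p[u, i, j]: transition probability; g[u, i, j]: transition cost, for controls u = 0, 1
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[[1.0, 2.0], [0.0, 3.0]],
              [[2.0, 0.5], [1.0, 1.0]]])

def T(J):
    """Bellman mapping: minimize expected one-stage cost plus discounted cost-to-go."""
    Q = (p * (g + alpha * J[None, None, :])).sum(axis=2)   # Q[u, i]
    return Q.min(axis=0)

def T_mu(J, mu):
    """Policy mapping for a stationary policy mu (one control index per state)."""
    n = J.shape[0]
    return np.array([(p[mu[i], i] * (g[mu[i], i] + alpha * J)).sum() for i in range(n)])

J = np.zeros(2)
for _ in range(200):                 # value iteration: J converges to the optimal cost
    J = T(J)
mu = (p * (g + alpha * J[None, None, :])).sum(axis=2).argmin(axis=0)   # greedy policy
print(J, mu, T_mu(J, mu))            # at convergence, T_mu J is approximately equal to J
```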

118 APPROXIMATE PI

[Figure: the approximate policy iteration loop - Initial Policy, Approximate Policy Evaluation (evaluate approximate cost J̃_µ(i,r)), Policy Improvement (generate improved policy µ̄)]

- Evaluation of the typical policy µ: linear cost function approximation
      J̃_µ(r) = Φr
  where Φ is a full-rank n × s matrix with columns the basis functions, and ith row denoted φ(i)'.
- Policy improvement to generate µ̄:
      µ̄(i) = arg min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u)(g(i,u,j) + αφ(j)'r)
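The policy improvement step needs only the coefficient vector r and the model. The following sketch (hypothetical MDP data, basis, and coefficients) computes µ̄ by minimizing the approximate Q-factors formed with φ(j)'r.

```python
# Approximate policy improvement: mu_bar(i) = argmin_u sum_j p_ij(u)(g(i,u,j) + alpha*phi(j)'r).
import numpy as np

alpha = 0.9
p = np.array([[[0.8, 0.2], [0.3, 0.7]],      # p[u, i, j] (hypothetical)
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[[1.0, 2.0], [0.0, 3.0]],      # g[u, i, j] (hypothetical)
              [[2.0, 0.5], [1.0, 1.0]]])
Phi = np.array([[1.0, 0.0], [1.0, 1.0]])     # n x s basis matrix, rows phi(i)'
r = np.array([0.8, -0.3])                    # coefficients from the evaluation phase (stand-in)

J_tilde = Phi @ r                            # approximate cost-to-go phi(j)' r
Q = (p * (g + alpha * J_tilde[None, None, :])).sum(axis=2)   # approximate Q-factors Q[u, i]
mu_bar = Q.argmin(axis=0)                    # improved policy, one control per state
print(mu_bar)
```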

119 EVALUATION BY PROJECTED EQUATIONS

- Approximate policy evaluation by solving
      Φr = ΠT_µ(Φr)
  Π: weighted Euclidean projection; note the special nature of the steady-state distribution weighting.
- Implementation by simulation (a single long trajectory using the current policy - important to make ΠT_µ a contraction). LSTD, LSPE methods.
- Multistep option: solve Φr = ΠT_µ^(λ)(Φr) with
      T_µ^(λ) = (1 − λ) Σ_{l=0}^∞ λ^l T_µ^{l+1},   0 ≤ λ < 1
  - As λ ↑ 1, ΠT_µ^(λ) becomes a contraction for any projection norm (allows changes in Π)
  - Bias-variance tradeoff

[Figure: bias-variance tradeoff - the solution of the projected equation Φr = ΠT^(λ)(Φr) moves from a biased point (λ = 0) toward ΠJ_µ (λ = 1) along the subspace S = {Φr | r ∈ R^s}, while simulation error grows]
