APPROXIMATE DYNAMIC PROGRAMMING A SERIES OF LECTURES GIVEN AT TSINGHUA UNIVERSITY JUNE 2014 DIMITRI P. BERTSEKAS


1 APPROXIMATE DYNAMIC PROGRAMMING A SERIES OF LECTURES GIVEN AT TSINGHUA UNIVERSITY, JUNE 2014 DIMITRI P. BERTSEKAS
Based on the books:
(1) Neuro-Dynamic Programming, by DPB and J. N. Tsitsiklis, Athena Scientific, 1996
(2) Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming, by DPB, Athena Scientific, 2012
(3) Abstract Dynamic Programming, by DPB, Athena Scientific
For a fuller set of slides, see

2 APPROXIMATE DYNAMIC PROGRAMMING BRIEF OUTLINE I
Our subject: Large-scale DP based on approximations and in part on simulation.
This has been a research area of great interest for the last 25 years, known under various names (e.g., reinforcement learning, neuro-dynamic programming).
It emerged through an enormously fruitful cross-fertilization of ideas from artificial intelligence and optimization/control theory.
It deals with control of dynamic systems under uncertainty, but applies more broadly (e.g., discrete deterministic optimization).
A vast range of applications in control theory, operations research, artificial intelligence, and beyond...
The subject is broad, with a rich variety of theory/math, algorithms, and applications. Our focus will be mostly on algorithms... less on theory and modeling.

3 APPROXIMATE DYNAMIC PROGRAMMING BRIEF OUTLINE II
Our aim: A state-of-the-art account of some of the major topics at a graduate level. Show how to use approximation and simulation to address the dual curses of DP: dimensionality and modeling.
Our 6-lecture plan:
Two lectures on exact DP, with emphasis on infinite horizon problems and issues of large-scale computational methods
One lecture on general issues of approximation and simulation for large-scale problems
One lecture on approximate policy iteration based on temporal differences (TD)/projected equations/Galerkin approximation
One lecture on aggregation methods
One lecture on Q-learning and other methods, such as approximation in policy space

4 APPROXIMATE DYNAMIC PROGRAMMING LECTURE 1 LECTURE OUTLINE Introduction to DP and approximate DP Finite horizon problems The DP algorithm for finite horizon problems Infinite horizon problems Basic theory of discounted infinite horizon problems

5 DP AS AN OPTIMIZATION METHODOLOGY
Generic optimization problem: $\min_{u \in U} g(u)$, where $u$ is the optimization/decision variable, $g(u)$ is the cost function, and $U$ is the constraint set.
Categories of problems:
Discrete ($U$ is finite) or continuous
Linear ($g$ is linear and $U$ is polyhedral) or nonlinear
Stochastic or deterministic: In stochastic problems the cost involves a stochastic parameter $w$, which is averaged, i.e., it has the form $g(u) = E_w\{G(u,w)\}$, where $w$ is a random parameter.
DP deals with multistage stochastic problems:
Information about $w$ is revealed in stages
Decisions are also made in stages and make use of the available information
Its methodology is different

6 BASIC STRUCTURE OF STOCHASTIC DP
Discrete-time system: $x_{k+1} = f_k(x_k, u_k, w_k)$, $k = 0, 1, \ldots, N-1$
$k$: Discrete time
$x_k$: State; summarizes past information that is relevant for future optimization
$u_k$: Control; decision to be selected at time $k$ from a given set
$w_k$: Random parameter (also called disturbance or noise, depending on the context)
$N$: Horizon, or number of times control is applied
Cost function that is additive over time: $E\bigl\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \bigr\}$
Alternative system description: $P(x_{k+1} \mid x_k, u_k)$, i.e., $x_{k+1} = w_k$ with $P(w_k \mid x_k, u_k) = P(x_{k+1} \mid x_k, u_k)$

7 INVENTORY CONTROL EXAMPLE
Discrete-time system: $x_{k+1} = f_k(x_k, u_k, w_k) = x_k + u_k - w_k$
Cost function that is additive over time:
$E\bigl\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \bigr\} = E\bigl\{ \sum_{k=0}^{N-1} \bigl( c\,u_k + r(x_k + u_k - w_k) \bigr) \bigr\}$

8 ADDITIONAL ASSUMPTIONS
The probability distribution of $w_k$ does not depend on past values $w_{k-1}, \ldots, w_0$, but may depend on $x_k$ and $u_k$. Otherwise past values of $w$, $x$, or $u$ would be useful for future optimization.
The constraint set from which $u_k$ is chosen at time $k$ depends at most on $x_k$, not on prior $x$ or $u$.
Optimization over policies (also called feedback control laws): these are rules/functions $u_k = \mu_k(x_k)$, $k = 0, \ldots, N-1$, that map state/inventory to control/order (closed-loop optimization, use of feedback).
MAJOR DISTINCTION: We minimize over sequences of functions (mapping inventory to order) $\{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$, NOT over sequences of controls/orders $\{u_0, u_1, \ldots, u_{N-1}\}$.

9 GENERIC FINITE-HORIZON PROBLEM
System: $x_{k+1} = f_k(x_k, u_k, w_k)$, $k = 0, \ldots, N-1$
Control constraints: $u_k \in U_k(x_k)$
Probability distribution $P_k(\cdot \mid x_k, u_k)$ of $w_k$
Policies $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$, where $\mu_k$ maps states $x_k$ into controls $u_k = \mu_k(x_k)$ and is such that $\mu_k(x_k) \in U_k(x_k)$ for all $x_k$
Expected cost of $\pi$ starting at $x_0$: $J_\pi(x_0) = E\bigl\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) \bigr\}$
Optimal cost function: $J^*(x_0) = \min_\pi J_\pi(x_0)$
An optimal policy $\pi^*$ satisfies $J_{\pi^*}(x_0) = J^*(x_0)$. When produced by DP, $\pi^*$ is independent of $x_0$.

10 PRINCIPLE OF OPTIMALITY
Let $\pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\}$ be an optimal policy.
Consider the tail subproblem whereby we are at $x_k$ at time $k$ and wish to minimize the cost-to-go from time $k$ to time $N$,
$E\bigl\{ g_N(x_N) + \sum_{l=k}^{N-1} g_l\bigl(x_l, \mu_l(x_l), w_l\bigr) \bigr\}$,
and the tail policy $\{\mu_k^*, \mu_{k+1}^*, \ldots, \mu_{N-1}^*\}$.
(Figure: the tail subproblem starting at $x_k$, over times $k, \ldots, N$.)
Principle of optimality: The tail policy is optimal for the tail subproblem (optimization of the future does not depend on what we did in the past).
DP solves ALL the tail subproblems.
At the generic step, it solves ALL tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length.

11 DP ALGORITHM
Computes for all $k$ and states $x_k$: $J_k(x_k)$, the optimal cost of the tail problem starting at $x_k$.
Initial condition: $J_N(x_N) = g_N(x_N)$
Go backwards, $k = N-1, \ldots, 0$, using
$J_k(x_k) = \min_{u_k \in U_k(x_k)} E_{w_k}\bigl\{ g_k(x_k, u_k, w_k) + J_{k+1}\bigl( f_k(x_k, u_k, w_k) \bigr) \bigr\}$
To solve the tail subproblem at time $k$, minimize [$k$th-stage cost + optimal cost of the next tail problem starting from the next state at time $k+1$].
Then $J_0(x_0)$, generated at the last step, is equal to the optimal cost $J^*(x_0)$. Also, the policy $\pi^* = \{\mu_0^*, \ldots, \mu_{N-1}^*\}$, where $\mu_k^*(x_k)$ minimizes the right side above for each $x_k$ and $k$, is optimal.
Proof by induction
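To make the backward recursion concrete, the following is a minimal Python sketch of the DP algorithm for a small inventory-type problem; the horizon, spaces, costs, and demand distribution are illustrative assumptions, not data from the lectures.

```python
import numpy as np

# Finite-horizon DP sketch: x_{k+1} = f_k(x_k, u_k, w_k), additive cost, backward recursion.
# Inventory-like example with small integer spaces (all numbers below are illustrative).
N = 5                                     # horizon
states = range(11)                        # inventory levels 0..10
demand = [(0, 0.2), (1, 0.5), (2, 0.3)]   # (w, P(w))

def f(x, u, w):                           # system equation (inventory balance, clipped)
    return max(0, min(10, x + u - w))

def g(x, u, w):                           # stage cost: ordering + holding/shortage penalty
    return 1.0 * u + 2.0 * abs(x + u - w)

J = {x: 0.0 for x in states}              # J_N = g_N = 0 (terminal cost)
policy = [dict() for _ in range(N)]
for k in reversed(range(N)):              # k = N-1, ..., 0
    Jk = {}
    for x in states:
        best_u, best_val = None, np.inf
        for u in range(11 - x):           # orders keeping inventory <= 10
            val = sum(p * (g(x, u, w) + J[f(x, u, w)]) for w, p in demand)
            if val < best_val:
                best_u, best_val = u, val
        Jk[x], policy[k][x] = best_val, best_u
    J = Jk                                # J now holds J_k

print("J_0(5) =", J[5], "  mu_0(5) =", policy[0][5])
```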

12 PRACTICAL DIFFICULTIES OF DP
The curse of dimensionality:
Exponential growth of the computational and storage requirements as the number of state variables and control variables increases
Quick explosion of the number of states in combinatorial problems
The curse of modeling:
Sometimes a simulator of the system is easier to construct than a model
There may be real-time solution constraints:
A family of problems may be addressed. The data of the problem to be solved is given with little advance notice
The problem data may change as the system is controlled, creating a need for on-line replanning
All of the above are motivations for approximation and simulation

13 A MAJOR IDEA: COST APPROXIMATION
Use a policy computed from the DP equation where the optimal cost-to-go function $J_{k+1}$ is replaced by an approximation $\tilde J_{k+1}$. Apply $\bar\mu_k(x_k)$, which attains the minimum in
$\min_{u_k \in U_k(x_k)} E\bigl\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\bigl( f_k(x_k, u_k, w_k) \bigr) \bigr\}$
Some approaches:
(a) Problem Approximation: Use as $\tilde J_k$ a cost function derived from a related but simpler problem
(b) Parametric Cost-to-Go Approximation: Use as $\tilde J_k$ a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (we will mostly focus on this). This is a major portion of Reinforcement Learning/Neuro-Dynamic Programming
(c) Rollout Approach: Use as $\tilde J_k$ the cost of some suboptimal policy, which is calculated either analytically or by simulation

14 ROLLOUT ALGORITHMS
At each $k$ and state $x_k$, use the control $\bar\mu_k(x_k)$ that attains the minimum in
$\min_{u_k \in U_k(x_k)} E\bigl\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\bigl( f_k(x_k, u_k, w_k) \bigr) \bigr\}$,
where $\tilde J_{k+1}$ is the cost-to-go of some heuristic policy (called the base policy).
Cost improvement property: The rollout algorithm achieves no worse (and usually much better) cost than the base policy starting from the same state.
Main difficulty: Calculating $\tilde J_{k+1}(x)$ may be computationally intensive if the cost-to-go of the base policy cannot be analytically calculated. May involve Monte Carlo simulation if the problem is stochastic. Things improve in the deterministic case (an important application is discrete optimization).
Connection with Model Predictive Control (MPC).
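As a hedged illustration of the cost improvement property, here is a small Python sketch of rollout for a deterministic N-stage problem with a myopic base heuristic; the dynamics, costs, and heuristic are invented for the example and are not from the lectures.

```python
# Deterministic rollout sketch: at (k, x) pick the control minimizing
# stage cost + cost-to-go of a heuristic "base policy" (all model details illustrative).
N = 6
def c(k, x, u):                 # stage cost for control u in {0, 1}
    return (x - 3) ** 2 + 2 * u
def G(x):                       # terminal cost
    return 5 * abs(x - 4)

def base_policy(k, x):          # myopic heuristic: minimize the immediate cost only
    return min((0, 1), key=lambda u: c(k, x, u))

def heuristic_cost(k, x):       # cost of following the base policy from (k, x) to the end
    total = 0.0
    while k < N:
        u = base_policy(k, x)
        total, x, k = total + c(k, x, u), x + u, k + 1
    return total + G(x)

def rollout_policy(k, x):       # one-step lookahead using the base-policy cost-to-go
    return min((0, 1), key=lambda u: c(k, x, u) + heuristic_cost(k + 1, x + u))

for name, pol in [("base policy", base_policy), ("rollout", rollout_policy)]:
    x, total = 0, 0.0
    for k in range(N):
        u = pol(k, x)
        total, x = total + c(k, x, u), x + u
    print(name, "cost:", total + G(x))   # rollout is never worse than the base policy
```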

15 INFINITE HORIZON PROBLEMS
Same as the basic problem, but:
The number of stages is infinite.
The system is stationary.
Total cost problems: Minimize
$J_\pi(x_0) = \lim_{N\to\infty} E_{w_k,\,k=0,1,\ldots}\bigl\{ \sum_{k=0}^{N-1} \alpha^k g\bigl(x_k, \mu_k(x_k), w_k\bigr) \bigr\}$
Discounted problems ($\alpha < 1$, bounded $g$)
Stochastic shortest path problems ($\alpha = 1$, finite-state system with a termination state) - we will discuss these sparingly
Discounted and undiscounted problems with unbounded cost per stage - we will not cover
Average cost problems - we will not cover
Infinite horizon characteristics: Challenging analysis; elegance of solutions and algorithms.
Stationary policies $\pi = \{\mu, \mu, \ldots\}$ and stationary forms of DP play a special role.

16 DISCOUNTED PROBLEMS/BOUNDED COST
Stationary system: $x_{k+1} = f(x_k, u_k, w_k)$, $k = 0, 1, \ldots$
Cost of a policy $\pi = \{\mu_0, \mu_1, \ldots\}$:
$J_\pi(x_0) = \lim_{N\to\infty} E_{w_k,\,k=0,1,\ldots}\bigl\{ \sum_{k=0}^{N-1} \alpha^k g\bigl(x_k, \mu_k(x_k), w_k\bigr) \bigr\}$
with $\alpha < 1$, and $g$ bounded [for some $M$, we have $|g(x,u,w)| \le M$ for all $(x,u,w)$]
Optimal cost function: $J^*(x) = \min_\pi J_\pi(x)$
Boundedness of $g$ guarantees that all costs are well-defined and bounded: $|J_\pi(x)| \le \frac{M}{1-\alpha}$
All spaces are arbitrary - only boundedness of $g$ is important (there are math fine points, e.g., measurability, but they don't matter in practice)
Important special case: All underlying spaces finite; a (finite spaces) Markovian Decision Problem or MDP
All algorithms ultimately work with a finite spaces MDP approximating the original problem

17 SHORTHAND NOTATION FOR DP MAPPINGS
For any function $J$ of $x$, denote
$(TJ)(x) = \min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J\bigl( f(x,u,w) \bigr) \bigr\}$, for all $x$
$TJ$ is the optimal cost function for the one-stage problem with stage cost $g$ and terminal cost function $\alpha J$. $T$ operates on bounded functions of $x$ to produce other bounded functions of $x$.
For any stationary policy $\mu$, denote
$(T_\mu J)(x) = E_w\bigl\{ g\bigl(x,\mu(x),w\bigr) + \alpha J\bigl( f(x,\mu(x),w) \bigr) \bigr\}$, for all $x$
The critical structure of the problem is captured in $T$ and $T_\mu$.
The entire theory of discounted problems can be developed in shorthand using $T$ and $T_\mu$. True for many other DP problems.
$T$ and $T_\mu$ provide a powerful unifying framework for DP. This is the essence of the book Abstract Dynamic Programming.

18 FINITE-HORIZON COST EXPRESSIONS
Consider an $N$-stage policy $\pi_0^N = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$ with a terminal cost $J$:
$J_{\pi_0^N}(x_0) = E\bigl\{ \alpha^N J(x_N) + \sum_{l=0}^{N-1} \alpha^l g\bigl(x_l, \mu_l(x_l), w_l\bigr) \bigr\} = E\bigl\{ g\bigl(x_0, \mu_0(x_0), w_0\bigr) + \alpha J_{\pi_1^N}(x_1) \bigr\} = (T_{\mu_0} J_{\pi_1^N})(x_0)$
where $\pi_1^N = \{\mu_1, \mu_2, \ldots, \mu_{N-1}\}$
By induction we have $J_{\pi_0^N}(x) = (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} J)(x)$, for all $x$
For a stationary policy $\mu$, the $N$-stage cost function (with terminal cost $J$) is $J_{\pi_0^N} = T_\mu^N J$, where $T_\mu^N$ is the $N$-fold composition of $T_\mu$
Similarly the optimal $N$-stage cost function (with terminal cost $J$) is $T^N J$
$T^N J = T(T^{N-1} J)$ is just the DP algorithm

19 SHORTHAND THEORY A SUMMARY
Infinite horizon cost function expressions [with $J_0(x) \equiv 0$]:
$J_\pi(x) = \lim_{N\to\infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} J_0)(x)$, $\quad J_\mu(x) = \lim_{N\to\infty} (T_\mu^N J_0)(x)$
Bellman's equation: $J^* = TJ^*$, $\quad J_\mu = T_\mu J_\mu$
Optimality condition: $\mu$ optimal $\Longleftrightarrow$ $T_\mu J^* = TJ^*$
Value iteration: For any (bounded) $J$, $\; J^*(x) = \lim_{k\to\infty} (T^k J)(x)$, for all $x$
Policy iteration: Given $\mu^k$,
Policy evaluation: Find $J_{\mu^k}$ by solving $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$
Policy improvement: Find $\mu^{k+1}$ such that $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$

20 TWO KEY PROPERTIES
Monotonicity property: For any $J$ and $J'$ such that $J(x) \le J'(x)$ for all $x$, and any $\mu$,
$(TJ)(x) \le (TJ')(x)$, $\quad (T_\mu J)(x) \le (T_\mu J')(x)$, for all $x$.
Constant shift property: For any $J$, any scalar $r$, and any $\mu$,
$\bigl(T(J + re)\bigr)(x) = (TJ)(x) + \alpha r$, $\quad \bigl(T_\mu(J + re)\bigr)(x) = (T_\mu J)(x) + \alpha r$, for all $x$,
where $e$ is the unit function [$e(x) \equiv 1$].
Monotonicity is present in all DP models (undiscounted, etc.)
Constant shift is special to discounted models
Discounted problems have another property of major importance: $T$ and $T_\mu$ are contraction mappings (we will show this later)

21 CONVERGENCE OF VALUE ITERATION
For all bounded $J$, $\; J^*(x) = \lim_{k\to\infty} (T^k J)(x)$, for all $x$.
Proof: For simplicity we give the proof for $J \equiv 0$. For any initial state $x_0$ and policy $\pi = \{\mu_0, \mu_1, \ldots\}$,
$J_\pi(x_0) = E\bigl\{ \sum_{l=0}^{\infty} \alpha^l g\bigl(x_l, \mu_l(x_l), w_l\bigr) \bigr\} = E\bigl\{ \sum_{l=0}^{k-1} \alpha^l g\bigl(x_l, \mu_l(x_l), w_l\bigr) \bigr\} + E\bigl\{ \sum_{l=k}^{\infty} \alpha^l g\bigl(x_l, \mu_l(x_l), w_l\bigr) \bigr\}$
The tail portion satisfies
$\Bigl| E\bigl\{ \sum_{l=k}^{\infty} \alpha^l g\bigl(x_l, \mu_l(x_l), w_l\bigr) \bigr\} \Bigr| \le \frac{\alpha^k M}{1-\alpha}$,
where $M \ge |g(x,u,w)|$. Take the min over $\pi$ of both sides, then the limit as $k \to \infty$. Q.E.D.

22 BELLMAN'S EQUATION
The optimal cost function $J^*$ is a solution of Bellman's equation, $J^* = TJ^*$, i.e., for all $x$,
$J^*(x) = \min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J^*\bigl( f(x,u,w) \bigr) \bigr\}$
Proof: For all $x$ and $k$,
$J^*(x) - \frac{\alpha^k M}{1-\alpha} \le (T^k J_0)(x) \le J^*(x) + \frac{\alpha^k M}{1-\alpha}$,
where $J_0(x) \equiv 0$ and $M \ge |g(x,u,w)|$. Applying $T$ to this relation, and using Monotonicity and Constant Shift,
$(TJ^*)(x) - \frac{\alpha^{k+1} M}{1-\alpha} \le (T^{k+1} J_0)(x) \le (TJ^*)(x) + \frac{\alpha^{k+1} M}{1-\alpha}$
Taking the limit as $k \to \infty$ and using the fact $\lim_{k\to\infty} (T^{k+1} J_0)(x) = J^*(x)$, we obtain $J^* = TJ^*$. Q.E.D.

23 THE CONTRACTION PROPERTY
Contraction property: For any bounded functions $J$ and $J'$, and any $\mu$,
$\max_x |(TJ)(x) - (TJ')(x)| \le \alpha \max_x |J(x) - J'(x)|$,
$\max_x |(T_\mu J)(x) - (T_\mu J')(x)| \le \alpha \max_x |J(x) - J'(x)|$.
Proof: Denote $c = \max_{x} |J(x) - J'(x)|$. Then
$J(x) - c \le J'(x) \le J(x) + c$, for all $x$.
Apply $T$ to both sides, and use the Monotonicity and Constant Shift properties:
$(TJ)(x) - \alpha c \le (TJ')(x) \le (TJ)(x) + \alpha c$, for all $x$.
Hence $|(TJ)(x) - (TJ')(x)| \le \alpha c$ for all $x$. Q.E.D.
Note: This implies that $J^*$ is the unique solution of $J = TJ$, and $J_\mu$ is the unique solution of $J_\mu = T_\mu J_\mu$

24 NEC. AND SUFFICIENT OPT. CONDITION
A stationary policy $\mu$ is optimal if and only if $\mu(x)$ attains the minimum in Bellman's equation for each $x$; i.e., $TJ^* = T_\mu J^*$, or, equivalently, for all $x$,
$\mu(x) \in \arg\min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J^*\bigl( f(x,u,w) \bigr) \bigr\}$
Proof: If $TJ^* = T_\mu J^*$, then using Bellman's equation ($J^* = TJ^*$), we have $J^* = T_\mu J^*$, so by uniqueness of the fixed point of $T_\mu$, we obtain $J^* = J_\mu$; i.e., $\mu$ is optimal.
Conversely, if the stationary policy $\mu$ is optimal, we have $J^* = J_\mu$, so $J^* = T_\mu J^*$. Combining this with Bellman's equation ($J^* = TJ^*$), we obtain $TJ^* = T_\mu J^*$. Q.E.D.

25 APPROXIMATE DYNAMIC PROGRAMMING LECTURE 2 LECTURE OUTLINE Review of discounted problem theory Review of shorthand notation Algorithms for discounted DP Value iteration Various forms of policy iteration Optimistic policy iteration Q-factors and Q-learning Other DP models - Continuous space and time A more abstract view of DP Asynchronous algorithms

26 DISCOUNTED PROBLEMS/BOUNDED COST
Stationary system with arbitrary state space: $x_{k+1} = f(x_k, u_k, w_k)$, $k = 0, 1, \ldots$
Cost of a policy $\pi = \{\mu_0, \mu_1, \ldots\}$:
$J_\pi(x_0) = \lim_{N\to\infty} E_{w_k,\,k=0,1,\ldots}\bigl\{ \sum_{k=0}^{N-1} \alpha^k g\bigl(x_k, \mu_k(x_k), w_k\bigr) \bigr\}$
with $\alpha < 1$, and for some $M$, we have $|g(x,u,w)| \le M$ for all $(x,u,w)$
Shorthand notation for DP mappings (they operate on functions of state to produce other functions):
$(TJ)(x) = \min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J\bigl( f(x,u,w) \bigr) \bigr\}$, for all $x$
$TJ$ is the optimal cost function for the one-stage problem with stage cost $g$ and terminal cost $\alpha J$.
For any stationary policy $\mu$:
$(T_\mu J)(x) = E_w\bigl\{ g\bigl(x,\mu(x),w\bigr) + \alpha J\bigl( f(x,\mu(x),w) \bigr) \bigr\}$, for all $x$

27 SHORTHAND THEORY A SUMMARY
Bellman's equation: $J^* = TJ^*$, $J_\mu = T_\mu J_\mu$, or
$J^*(x) = \min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J^*\bigl( f(x,u,w) \bigr) \bigr\}$, for all $x$
$J_\mu(x) = E_w\bigl\{ g\bigl(x,\mu(x),w\bigr) + \alpha J_\mu\bigl( f(x,\mu(x),w) \bigr) \bigr\}$, for all $x$
Optimality condition: $\mu$ optimal $\Longleftrightarrow$ $T_\mu J^* = TJ^*$, i.e.,
$\mu(x) \in \arg\min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J^*\bigl( f(x,u,w) \bigr) \bigr\}$, for all $x$
Value iteration: For any (bounded) $J$, $\; J^*(x) = \lim_{k\to\infty} (T^k J)(x)$, for all $x$
Policy iteration: Given $\mu^k$,
Find $J_{\mu^k}$ from $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$ (policy evaluation); then
Find $\mu^{k+1}$ such that $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$ (policy improvement)

28 MAJOR PROPERTIES
Monotonicity property: For any functions $J$ and $J'$ on the state space $X$ such that $J(x) \le J'(x)$ for all $x \in X$, and any $\mu$,
$(TJ)(x) \le (TJ')(x)$, $\quad (T_\mu J)(x) \le (T_\mu J')(x)$, for all $x \in X$
Contraction property: For any bounded functions $J$ and $J'$, and any $\mu$,
$\max_x |(TJ)(x) - (TJ')(x)| \le \alpha \max_x |J(x) - J'(x)|$,
$\max_x |(T_\mu J)(x) - (T_\mu J')(x)| \le \alpha \max_x |J(x) - J'(x)|$
Compact contraction notation: $\|TJ - TJ'\| \le \alpha \|J - J'\|$, $\; \|T_\mu J - T_\mu J'\| \le \alpha \|J - J'\|$, where for any bounded function $J$ we denote by $\|J\|$ the sup-norm $\|J\| = \max_x |J(x)|$

29 THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any (bounded) $J$, $\; J^*(x) = \lim_{k\to\infty} (T^k J)(x)$, for all $x$
Policy iteration: Given $\mu^k$,
Policy evaluation: Find $J_{\mu^k}$ by solving
$J_{\mu^k}(x) = E_w\bigl\{ g\bigl(x,\mu^k(x),w\bigr) + \alpha J_{\mu^k}\bigl( f(x,\mu^k(x),w) \bigr) \bigr\}$, for all $x$, or $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$
Policy improvement: Let $\mu^{k+1}$ be such that
$\mu^{k+1}(x) \in \arg\min_{u \in U(x)} E_w\bigl\{ g(x,u,w) + \alpha J_{\mu^k}\bigl( f(x,u,w) \bigr) \bigr\}$, for all $x$, or $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$
For the case of $n$ states, policy evaluation is equivalent to solving an $n \times n$ linear system of equations: $J_\mu = g_\mu + \alpha P_\mu J_\mu$
For large $n$, exact PI is out of the question (even though it terminates finitely, as we will show)
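A minimal Python sketch of exact VI and exact PI for a small n-state discounted MDP; the randomly generated transition probabilities and costs are purely illustrative, and policy evaluation is carried out as the n x n linear solve mentioned above.

```python
import numpy as np

# Exact VI and PI on a small random n-state discounted MDP (illustrative data).
rng = np.random.default_rng(0)
n, m, alpha = 10, 3, 0.9
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)   # p_ij(u), indexed [u, i, j]
g = rng.random((m, n))                                         # expected stage cost g(i, u)

def T(J):                      # (TJ)(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J(j) ]
    return (g + alpha * P @ J).min(axis=0)

J = np.zeros(n)                # value iteration: J <- TJ
for _ in range(1000):
    J = T(J)

mu = np.zeros(n, dtype=int)    # policy iteration
for _ in range(100):
    P_mu, g_mu = P[mu, np.arange(n), :], g[mu, np.arange(n)]
    J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)     # evaluation: J_mu = g_mu + alpha P_mu J_mu
    mu_new = (g + alpha * P @ J_mu).argmin(axis=0)             # improvement
    if np.array_equal(mu_new, mu):
        break                                                  # terminates finitely
    mu = mu_new

print("max |J_VI - J_PI| =", np.abs(J - J_mu).max())
```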

30 JUSTIFICATION OF POLICY ITERATION
We can show that $J_{\mu^k} \ge J_{\mu^{k+1}}$ for all $k$.
Proof: For given $k$, we have
$J_{\mu^k} = T_{\mu^k} J_{\mu^k} \ge T J_{\mu^k} = T_{\mu^{k+1}} J_{\mu^k}$
Using the monotonicity property of DP,
$J_{\mu^k} \ge T_{\mu^{k+1}} J_{\mu^k} \ge T_{\mu^{k+1}}^2 J_{\mu^k} \ge \cdots \ge \lim_{N\to\infty} T_{\mu^{k+1}}^N J_{\mu^k}$
Since $\lim_{N\to\infty} T_{\mu^{k+1}}^N J_{\mu^k} = J_{\mu^{k+1}}$, we have $J_{\mu^k} \ge J_{\mu^{k+1}}$.
If $J_{\mu^k} = J_{\mu^{k+1}}$, all the above inequalities hold as equations, so $J_{\mu^k}$ solves Bellman's equation. Hence $J_{\mu^k} = J^*$.
Thus at iteration $k$ either the algorithm generates a strictly improved policy or it finds an optimal policy.
For a finite spaces MDP, the algorithm terminates with an optimal policy.
For infinite spaces MDP, convergence (in an infinite number of iterations) can be shown.

31 OPTIMISTIC POLICY ITERATION
Optimistic PI: This is PI, where policy evaluation is done approximately, with a finite number of VI
So we approximate the policy evaluation, $J_\mu \approx T_\mu^m J$, for some number $m \in [1, \infty)$ and initial $J$
Shorthand definition: For some integers $m_k$,
$T_{\mu^k} J_k = T J_k$, $\quad J_{k+1} = T_{\mu^k}^{m_k} J_k$, $\quad k = 0, 1, \ldots$
If $m_k \equiv 1$ it becomes VI
If $m_k = \infty$ it becomes PI
Converges for both finite and infinite spaces discounted problems (in an infinite number of iterations)
Typically works faster than VI and PI (for large problems)
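A short sketch of optimistic PI under the same illustrative random-MDP setup as in the VI/PI sketch above: exact policy evaluation is replaced by m applications of the mapping of the current policy.

```python
import numpy as np

# Optimistic PI sketch: evaluate each policy with only m applications of T_mu
# (same illustrative random-MDP construction as in the VI/PI sketch).
rng = np.random.default_rng(0)
n, num_u, alpha, m = 10, 3, 0.9, 5
P = rng.random((num_u, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((num_u, n))

J = np.zeros(n)
for _ in range(200):
    mu = (g + alpha * P @ J).argmin(axis=0)                  # T_mu J = T J  (improvement)
    P_mu, g_mu = P[mu, np.arange(n), :], g[mu, np.arange(n)]
    for _ in range(m):                                       # J <- T_mu^m J  (approximate evaluation)
        J = g_mu + alpha * P_mu @ J
print("optimistic PI: J(0) ~", round(J[0], 4))
```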

32 APPROXIMATE PI
Suppose that the policy evaluation is approximate,
$\|J_k - J_{\mu^k}\| \le \delta$, $\quad k = 0, 1, \ldots$
and policy improvement is approximate,
$\|T_{\mu^{k+1}} J_k - T J_k\| \le \epsilon$, $\quad k = 0, 1, \ldots$
where $\delta$ and $\epsilon$ are some positive scalars.
Error Bound I: The sequence $\{\mu^k\}$ generated by approximate policy iteration satisfies
$\limsup_{k\to\infty} \|J_{\mu^k} - J^*\| \le \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}$
Typical practical behavior: The method makes steady progress up to a point and then the iterates $J_{\mu^k}$ oscillate within a neighborhood of $J^*$.
Error Bound II: If in addition the sequence $\{\mu^k\}$ terminates at $\bar\mu$ (i.e., keeps generating $\bar\mu$),
$\|J_{\bar\mu} - J^*\| \le \frac{\epsilon + 2\alpha\delta}{1-\alpha}$

33 Q-FACTORS I
Optimal Q-factor of $(x,u)$:
$Q^*(x,u) = E\bigl\{ g(x,u,w) + \alpha J^*(\bar x) \bigr\}$, with $\bar x = f(x,u,w)$.
It is the cost of starting at $x$, applying $u$ in the 1st stage, and an optimal policy after the 1st stage.
We can write Bellman's equation as $J^*(x) = \min_{u \in U(x)} Q^*(x,u)$, for all $x$.
We can equivalently write the VI method as $J_{k+1}(x) = \min_{u \in U(x)} Q_{k+1}(x,u)$, for all $x$, where $Q_{k+1}$ is generated by
$Q_{k+1}(x,u) = E\bigl\{ g(x,u,w) + \alpha \min_{v \in U(\bar x)} Q_k(\bar x, v) \bigr\}$, with $\bar x = f(x,u,w)$

34 Q-FACTORS II
Q-factors are costs in an "augmented" problem where states are $(x,u)$
They satisfy a Bellman equation $Q = FQ$, where
$(FQ)(x,u) = E\bigl\{ g(x,u,w) + \alpha \min_{v \in U(\bar x)} Q(\bar x, v) \bigr\}$, with $\bar x = f(x,u,w)$
VI and PI for Q-factors are mathematically equivalent to VI and PI for costs
They require an equal amount of computation... they just need more storage
Having optimal Q-factors is convenient when implementing an optimal policy on-line by
$\mu^*(x) \in \arg\min_{u \in U(x)} Q^*(x,u)$
Once $Q^*(x,u)$ are known, the model [$g$ and $E\{\cdot\}$] is not needed. Model-free operation.
Q-Learning (to be discussed later) is a sampling method that calculates $Q^*(x,u)$ using a simulator of the system (no model needed)
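A minimal sketch of VI in Q-factor form, Q <- FQ, for an illustrative random finite MDP; note that no explicit J is maintained, and the greedy policy is read off from Q at the end.

```python
import numpy as np

# Q-factor value iteration sketch:
# (FQ)(i,u) = sum_j p_ij(u) [ g(i,u,j) + alpha * min_v Q(j,v) ]   (illustrative random MDP)
rng = np.random.default_rng(1)
n, m, alpha = 8, 3, 0.9
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)   # p_ij(u), indexed [u, i, j]
G = rng.random((m, n, n))                                      # g(i, u, j)

Q = np.zeros((m, n))                                           # Q(i, u), stored as [u, i]
for _ in range(1000):
    Jmin = Q.min(axis=0)                                       # J(j) = min_v Q(j, v)
    Q = (P * (G + alpha * Jmin[None, None, :])).sum(axis=2)    # Q <- FQ

print("mu*(i) = argmin_u Q*(i,u):", Q.argmin(axis=0))
print("J*(i)  = min_u Q*(i,u):   ", np.round(Q.min(axis=0), 3))
```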

35 OTHER DP MODELS We have looked so far at the (discrete or continuous spaces) discounted models for which the analysis is simplest and results are most powerful Other DP models include: Undiscounted problems (α = 1): They may include a special termination state (stochastic shortest path problems) Continuous-time finite-state MDP: The time between transitions is random and state-andcontrol-dependent (typical in queueing systems, called Semi-Markov MDP). These can be viewed as discounted problems with stateand-control-dependent discount factors Continuous-time, continuous-space models: Classical automatic control, process control, robotics Substantial differences from discrete-time Mathematically more complex theory (particularly for stochastic problems) Deterministic versions can be analyzed using classical optimal control theory Admit treatment by DP, based on time discretization

36 CONTINUOUS-TIME MODELS
System equation: $dx(t)/dt = f\bigl(x(t), u(t)\bigr)$
Cost function: $\int_0^\infty g\bigl(x(t), u(t)\bigr)\,dt$
Optimal cost starting from $x$: $J^*(x)$
$\delta$-discretization of time: $x_{k+1} = x_k + \delta f(x_k, u_k)$
Bellman equation for the $\delta$-discretized problem:
$J_\delta^*(x) = \min_u \bigl\{ \delta g(x,u) + J_\delta^*\bigl( x + \delta f(x,u) \bigr) \bigr\}$
Take $\delta \downarrow 0$ to obtain the Hamilton-Jacobi-Bellman equation [assuming $\lim_{\delta\to 0} J_\delta^*(x) = J^*(x)$]:
$0 = \min_u \bigl\{ g(x,u) + \nabla J^*(x)' f(x,u) \bigr\}$, for all $x$
Policy iteration (informally):
Policy evaluation: Given the current $\mu$, solve $0 = g\bigl(x, \mu(x)\bigr) + \nabla J_\mu(x)' f\bigl(x, \mu(x)\bigr)$, for all $x$
Policy improvement: Find $\bar\mu(x) \in \arg\min_u \bigl\{ g(x,u) + \nabla J_\mu(x)' f(x,u) \bigr\}$, for all $x$
Note: Need to learn $\nabla J_\mu(x)$, NOT $J_\mu(x)$

37 A MORE GENERAL/ABSTRACT VIEW OF DP
Let $Y$ be a real vector space with a norm $\|\cdot\|$
A function $F : Y \to Y$ is said to be a contraction mapping if for some $\rho \in (0,1)$, we have
$\|Fy - Fz\| \le \rho \|y - z\|$, for all $y, z \in Y$.
$\rho$ is called the modulus of contraction of $F$.
Important example: Let $X$ be a set (e.g., the state space in DP) and $v : X \to \Re$ a positive-valued function. Let $B(X)$ be the set of all functions $J : X \to \Re$ such that $J(x)/v(x)$ is bounded over $x$.
We define a norm on $B(X)$, called the weighted sup-norm, by
$\|J\| = \max_{x \in X} \frac{|J(x)|}{v(x)}$.
Important special case: the discounted problem mappings $T$ and $T_\mu$ [for $v(x) \equiv 1$, $\rho = \alpha$].

38 CONTRACTION MAPPINGS: AN EXAMPLE
Consider the extension from a finite to a countable state space, $X = \{1, 2, \ldots\}$, and a weighted sup-norm with respect to which the one-stage costs are bounded
Suppose that $T_\mu$ has the form
$(T_\mu J)(i) = b_i + \alpha \sum_{j \in X} a_{ij} J(j)$, $\quad i = 1, 2, \ldots$
where $b_i$ and $a_{ij}$ are some scalars. Then $T_\mu$ is a contraction with modulus $\rho$ if and only if
$\alpha \sum_{j \in X} |a_{ij}|\, v(j) \le \rho\, v(i)$, $\quad i = 1, 2, \ldots$
Consider $T$,
$(TJ)(i) = \min_{\mu \in M} (T_\mu J)(i)$, $\quad i = 1, 2, \ldots$
where for each $\mu \in M$, $T_\mu$ is a contraction mapping with modulus $\rho$. Then $T$ is a contraction mapping with modulus $\rho$
Allows extensions of the main DP results from bounded one-stage cost to interesting unbounded one-stage cost cases.

39 CONTRACTION MAPPING FIXED-POINT TH.
Contraction Mapping Fixed-Point Theorem: If $F : B(X) \to B(X)$ is a contraction with modulus $\rho \in (0,1)$, then there exists a unique $J^* \in B(X)$ such that $J^* = FJ^*$.
Furthermore, if $J$ is any function in $B(X)$, then $\{F^k J\}$ converges to $J^*$ and we have
$\|F^k J - J^*\| \le \rho^k \|J - J^*\|$, $\quad k = 1, 2, \ldots$
This is a special case of a general result for contraction mappings $F : Y \to Y$ over normed vector spaces $Y$ that are complete: every sequence $\{y_k\}$ that is Cauchy (satisfies $\|y_m - y_n\| \to 0$ as $m, n \to \infty$) converges.
The space $B(X)$ is complete (see the text for a proof).

40 ABSTRACT FORMS OF DP
We consider an abstract form of DP based on monotonicity and contraction
Abstract Mapping: Denote by $R(X)$ the set of real-valued functions $J : X \to \Re$, and let $H : X \times U \times R(X) \to \Re$ be a given mapping. We consider the mapping
$(TJ)(x) = \min_{u \in U(x)} H(x, u, J)$, $\quad x \in X$.
We assume that $(TJ)(x) > -\infty$ for all $x \in X$, so $T$ maps $R(X)$ into $R(X)$.
Abstract Policies: Let $M$ be the set of policies, i.e., functions $\mu$ such that $\mu(x) \in U(x)$ for all $x \in X$.
For each $\mu \in M$, we consider the mapping $T_\mu : R(X) \to R(X)$ defined by
$(T_\mu J)(x) = H\bigl(x, \mu(x), J\bigr)$, $\quad x \in X$.
Find a function $J^* \in R(X)$ such that
$J^*(x) = \min_{u \in U(x)} H(x, u, J^*)$, $\quad x \in X$

41 EXAMPLES
Discounted problems: $H(x,u,J) = E\bigl\{ g(x,u,w) + \alpha J\bigl( f(x,u,w) \bigr) \bigr\}$
Discounted discrete-state continuous-time Semi-Markov problems (e.g., queueing): $H(x,u,J) = G(x,u) + \sum_{y=1}^n m_{xy}(u) J(y)$, where $m_{xy}$ are "discounted" transition probabilities, defined by the distribution of transition times
Minimax problems/games: $H(x,u,J) = \max_{w \in W(x,u)} \bigl[ g(x,u,w) + \alpha J\bigl( f(x,u,w) \bigr) \bigr]$
Shortest path problems: $H(x,u,J) = a_{xu} + J(u)$ if $u \ne d$, and $H(x,u,J) = a_{xd}$ if $u = d$, where $d$ is the destination. There are stochastic and minimax versions of this problem

42 ASSUMPTIONS
Monotonicity: If $J, J' \in R(X)$ and $J \le J'$, then
$H(x,u,J) \le H(x,u,J')$, $\quad x \in X$, $u \in U(x)$
We can show all the standard analytical and computational results of discounted DP if monotonicity and the following assumption hold:
Contraction:
For every $J \in B(X)$, the functions $T_\mu J$ and $TJ$ belong to $B(X)$
For some $\alpha \in (0,1)$, and all $\mu$ and $J, J' \in B(X)$, we have $\|T_\mu J - T_\mu J'\| \le \alpha \|J - J'\|$
With just the monotonicity assumption (as in undiscounted problems), we can still show various forms of the basic results under appropriate conditions
A weaker substitute for the contraction assumption is semicontractiveness: (roughly) for some $\mu$, $T_\mu$ is a contraction and for others it is not; also the noncontractive $\mu$ are not optimal

43 RESULTS USING CONTRACTION
Proposition 1: The mappings $T_\mu$ and $T$ are weighted sup-norm contraction mappings with modulus $\alpha$ over $B(X)$, and have unique fixed points in $B(X)$, denoted $J_\mu$ and $J^*$, respectively (cf. Bellman's equation). Proof: From the contraction property of $H$.
Proposition 2: For any $J \in B(X)$ and $\mu \in M$,
$\lim_{k\to\infty} T_\mu^k J = J_\mu$, $\quad \lim_{k\to\infty} T^k J = J^*$
(cf. convergence of value iteration). Proof: From the contraction property of $T_\mu$ and $T$.
Proposition 3: We have $T_\mu J^* = TJ^*$ if and only if $J_\mu = J^*$ (cf. optimality condition). Proof: If $T_\mu J^* = TJ^*$, then $T_\mu J^* = J^*$, implying $J^* = J_\mu$. Conversely, if $J_\mu = J^*$, then $T_\mu J^* = T_\mu J_\mu = J_\mu = J^* = TJ^*$.

44 RESULTS USING MON. AND CONTRACTION
Optimality of the fixed point: $J^*(x) = \min_{\mu \in M} J_\mu(x)$, for all $x \in X$
Existence of a nearly optimal policy: For every $\epsilon > 0$, there exists $\mu_\epsilon \in M$ such that
$J^*(x) \le J_{\mu_\epsilon}(x) \le J^*(x) + \epsilon$, $\quad x \in X$
Nonstationary policies: Consider the set $\Pi$ of all sequences $\pi = \{\mu_0, \mu_1, \ldots\}$ with $\mu_k \in M$ for all $k$, and define
$J_\pi(x) = \liminf_{k\to\infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J)(x)$, $\quad x \in X$,
with $J$ being any function (the choice of $J$ does not matter)
We have $J^*(x) = \min_{\pi \in \Pi} J_\pi(x)$, for all $x \in X$

45 THE TWO MAIN ALGORITHMS: VI AND PI
Value iteration: For any (bounded) $J$, $\; J^*(x) = \lim_{k\to\infty} (T^k J)(x)$, for all $x$
Policy iteration: Given $\mu^k$,
Policy evaluation: Find $J_{\mu^k}$ by solving $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$
Policy improvement: Find $\mu^{k+1}$ such that $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$
Optimistic PI: This is PI, where policy evaluation is carried out by a finite number of VI
Shorthand definition: For some integers $m_k$,
$T_{\mu^k} J_k = T J_k$, $\quad J_{k+1} = T_{\mu^k}^{m_k} J_k$, $\quad k = 0, 1, \ldots$
If $m_k \equiv 1$ it becomes VI
If $m_k = \infty$ it becomes PI
For intermediate values of $m_k$, it is generally more efficient than either VI or PI

46 ASYNCHRONOUS ALGORITHMS
Motivation for asynchronous algorithms:
Faster convergence
Parallel and distributed computation
Simulation-based implementations
General framework: Partition $X$ into disjoint nonempty subsets $X_1, \ldots, X_m$, and use a separate processor $l$ updating $J(x)$ for $x \in X_l$
Let $J$ be partitioned as $J = (J_1, \ldots, J_m)$, where $J_l$ is the restriction of $J$ on the set $X_l$.
Synchronous VI algorithm: $J_l^{t+1}(x) = T(J_1^t, \ldots, J_m^t)(x)$, $\; x \in X_l$, $l = 1, \ldots, m$
Asynchronous VI algorithm: For some subsets of times $R_l$,
$J_l^{t+1}(x) = T\bigl(J_1^{\tau_{l1}(t)}, \ldots, J_m^{\tau_{lm}(t)}\bigr)(x)$ if $t \in R_l$, and $J_l^{t+1}(x) = J_l^t(x)$ if $t \notin R_l$,
where $t - \tau_{lj}(t)$ are communication "delays"

47 ONE-STATE-AT-A-TIME ITERATIONS
Important special case: Assume $n$ states, a separate processor for each state, and no delays
Generate a sequence of states $\{x^0, x^1, \ldots\}$, generated in some way, possibly by simulation (each state is generated infinitely often)
Asynchronous VI:
$J_l^{t+1} = T(J_1^t, \ldots, J_n^t)(l)$ if $l = x^t$, and $J_l^{t+1} = J_l^t$ if $l \ne x^t$,
where $T(J_1^t, \ldots, J_n^t)(l)$ denotes the $l$-th component of the vector $T(J_1^t, \ldots, J_n^t) = TJ^t$
The special case where $\{x^0, x^1, \ldots\} = \{1, \ldots, n, 1, \ldots, n, 1, \ldots\}$ is the Gauss-Seidel method
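A small sketch of this one-state-at-a-time (Gauss-Seidel) form of VI on an illustrative random MDP: each update of J(l) immediately uses the latest values of the other components.

```python
import numpy as np

# Gauss-Seidel (one-state-at-a-time) value iteration sketch on an illustrative random MDP.
rng = np.random.default_rng(2)
n, m, alpha = 10, 3, 0.9
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((m, n))

J = np.zeros(n)
for sweep in range(300):                                  # cyclic order 1,...,n,1,...,n,...
    for l in range(n):
        J[l] = (g[:, l] + alpha * P[:, l, :] @ J).min()   # update component l in place
# states could also be drawn randomly or by simulation, each visited infinitely often

print("Gauss-Seidel VI:", np.round(J, 3))
```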

48 ASYNCHRONOUS CONV. THEOREM I
KEY FACT: VI and also PI (with some modifications) still work when implemented asynchronously
Assume that for all $l, j = 1, \ldots, m$, $R_l$ is infinite and $\lim_{t\to\infty} \tau_{lj}(t) = \infty$
Proposition: Let $T$ have a unique fixed point $J^*$, and assume that there is a sequence of nonempty subsets $\{S(k)\} \subset R(X)$ with $S(k+1) \subset S(k)$ for all $k$, and with the following properties:
(1) Synchronous Convergence Condition: Every sequence $\{J^k\}$ with $J^k \in S(k)$ for each $k$ converges pointwise to $J^*$. Moreover, $TJ \in S(k+1)$ for all $J \in S(k)$, $k = 0, 1, \ldots$
(2) Box Condition: For all $k$, $S(k)$ is a Cartesian product of the form $S(k) = S_1(k) \times \cdots \times S_m(k)$, where $S_l(k)$ is a set of real-valued functions on $X_l$, $l = 1, \ldots, m$.
Then for every $J \in S(0)$, the sequence $\{J^t\}$ generated by the asynchronous algorithm converges pointwise to $J^*$.

49 ASYNCHRONOUS CONV. THEOREM II
Interpretation of assumptions:
(Figure: nested sets $S(0) \supset S(k) \supset S(k+1) \ni J^*$ in the space of $J = (J_1, J_2)$, with $S(k) = S_1(k) \times S_2(k)$.) A synchronous iteration from any $J$ in $S(k)$ moves into $S(k+1)$ (component-by-component)
Convergence mechanism:
(Figure: component-wise iterations of $J_1$ and $J_2$ moving through the nested sets toward $J^*$.)
Key: Independent component-wise improvement. An asynchronous component iteration from any $J$ in $S(k)$ moves into the corresponding component portion of $S(k+1)$

50 APPROXIMATE DYNAMIC PROGRAMMING LECTURE 3 LECTURE OUTLINE Review of discounted DP Introduction to approximate DP Approximation architectures Simulation-based approximate policy iteration Approximate policy evaluation Some general issues about approximation and simulation

51 REVIEW

52 DISCOUNTED PROBLEMS/BOUNDED COST Stationary system with arbitrary state space x k+1 = f(x k,u k,w k ), k = 0,1,... Cost of a policy π = {µ 0,µ 1,...} J π (x 0 ) = lim N E w k k=0,1,... { N 1 k=0 α k g ( x k,µ k (x k ),w k ) } withα < 1,andforsomeM,wehave g(x,u,w) M for all (x,u,w) Shorthand notation for DP mappings (operate on functions of state to produce other functions) (TJ)(x) = min u U(x) E w { g(x,u,w)+αj ( f(x,u,w) )}, x TJ is the optimal cost function for the one-stage problem with stage cost g and terminal cost αj For any stationary policy µ (T µ J)(x) = E w { g ( x,µ(x),w ) +αj ( f(x,µ(x),w) )}, x

53 MDP - TRANSITION PROBABILITY NOTATION
We will mostly assume the system is an $n$-state (controlled) Markov chain
We will often switch to Markov chain notation:
States $i = 1, \ldots, n$ (instead of $x$)
Transition probabilities $p_{i_k i_{k+1}}(u_k)$ [instead of $x_{k+1} = f(x_k, u_k, w_k)$]
Stage cost $g(i_k, u_k, i_{k+1})$ [instead of $g(x_k, u_k, w_k)$]
Cost functions $J = \bigl(J(1), \ldots, J(n)\bigr)$ (vectors in $\Re^n$)
Cost of a policy $\pi = \{\mu_0, \mu_1, \ldots\}$:
$J_\pi(i) = \lim_{N\to\infty} E_{i_k,\,k=1,2,\ldots}\bigl\{ \sum_{k=0}^{N-1} \alpha^k g\bigl(i_k, \mu_k(i_k), i_{k+1}\bigr) \mid i_0 = i \bigr\}$
Shorthand notation for DP mappings:
$(TJ)(i) = \min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl( g(i,u,j) + \alpha J(j) \bigr)$, $\quad i = 1, \ldots, n$
$(T_\mu J)(i) = \sum_{j=1}^n p_{ij}\bigl(\mu(i)\bigr)\bigl( g\bigl(i,\mu(i),j\bigr) + \alpha J(j) \bigr)$, $\quad i = 1, \ldots, n$

54 SHORTHAND THEORY A SUMMARY Bellman s equation: J = TJ, J µ = T µ J µ or J (i) = min u U(i) n p ij (u) ( g(i,u,j)+αj (j) ), i j=1 J µ (i) = n ( )( ( ) p ij µ(i) g i,µ(i),j +αjµ (j) ), i j=1 Optimality condition: i.e., µ: optimal <==> T µ J = TJ µ(i) arg min u U(i) n p ij (u) ( g(i,u,j)+αj (j) ), i j=1

55 THE TWO MAIN ALGORITHMS: VI AND PI Value iteration: For any J R n J (i) = lim k (Tk J)(i), i = 1,...,n Policy iteration: Given µ k Policy evaluation: Find J µ k by solving J µ k(i) = n ( p ij µ k (i) )( g ( i,µ k (i),j ) +αj µ k(j) ), i = 1,...,n j=1 or J µ k = T µ kj µ k Policy improvement: Let µ k+1 be such that µ k+1 (i) arg min u U(i) n p ij (u) ( g(i,u,j)+αj µ k(j) ), i j=1 or T µ k+1j µ k = TJ µ k Policy evaluation is equivalent to solving an n n linear system of equations For large n, exact PI is out of the question. We use instead optimistic PI (policy evaluation with a few VIs)

56 APPROXIMATE DP

57 GENERAL ORIENTATION TO ADP ADP (late 80s - present) is a breakthrough methodology that allows the application of DP to problems with many or infinite number of states. Other names for ADP are: reinforcement learning (RL). neuro-dynamic programming (NDP). adaptive dynamic programming (ADP). We will mainly adopt an n-state discounted model (the easiest case - but think of HUGE n). Extensions to other DP models (continuous space, continuous-time, not discounted) are possible (but more quirky). We will set aside for later. There are many approaches: Problem approximation Simulation-based approaches (we will focus on these) Simulation-based methods are of three types: Rollout (we will not discuss further) Approximation in value space Approximation in policy space

58 WHY DO WE USE SIMULATION?
One reason: Computational complexity advantage in computing sums/expectations involving a very large number of terms
Any sum $\sum_{i=1}^n a_i$ can be written as an expected value:
$\sum_{i=1}^n a_i = \sum_{i=1}^n \xi_i \frac{a_i}{\xi_i} = E_\xi\Bigl\{ \frac{a_i}{\xi_i} \Bigr\}$,
where $\xi$ is any probability distribution over $\{1, \ldots, n\}$
It can be approximated by generating many samples $\{i_1, \ldots, i_k\}$ from $\{1, \ldots, n\}$, according to the distribution $\xi$, and Monte Carlo averaging:
$\sum_{i=1}^n a_i = E_\xi\Bigl\{ \frac{a_i}{\xi_i} \Bigr\} \approx \frac{1}{k} \sum_{t=1}^k \frac{a_{i_t}}{\xi_{i_t}}$
Simulation is also convenient when an analytical model of the system is unavailable, but a simulation/computer model is possible.
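A tiny numerical illustration of the point above: a large sum rewritten as an expectation and estimated by Monte Carlo averaging of sampled terms (the terms a_i and the sampling distribution below are made up).

```python
import numpy as np

# Estimate sum_i a_i = E_xi[ a_i / xi_i ] by sampling indices i ~ xi and averaging.
rng = np.random.default_rng(3)
n = 10**6
a = rng.standard_normal(n) / n          # a large collection of terms (illustrative)
xi = np.full(n, 1.0 / n)                # sampling distribution (here uniform)

k = 10**4
idx = rng.choice(n, size=k, p=xi)       # samples i_1, ..., i_k drawn from xi
print("exact sum  :", a.sum())
print("MC estimate:", np.mean(a[idx] / xi[idx]))
```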

59 APPROXIMATION IN VALUE AND POLICY SPACE

60 APPROXIMATION IN VALUE SPACE Approximate J or J µ from a parametric class J(i;r)whereiisthecurrentstateandr = (r 1,...,r m ) is a vector of tunable scalars weights Use J in place of J or J µ in various algorithms and computations Role of r: By adjusting r we can change the shape of J so that it is close to J or J µ Two key issues: The choice of parametric class J(i;r) (the approximation architecture) Method for tuning the weights ( training the architecture) Success depends strongly on how these issues are handled... also on insight about the problem A simulator may be used, particularly when there is no mathematical model of the system (but there is a computer model) We will focus on simulation, but this is not the only possibility We may also use parametric approximation for Q-factors or cost function differences

61 APPROXIMATION ARCHITECTURES
Divided in linear and nonlinear [i.e., linear or nonlinear dependence of $\tilde J(i;r)$ on $r$]
Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer
Computer chess example: Think of the board position as state and the move as control
Uses a feature-based position evaluator that assigns a score (or approximate Q-factor) to each position/move
(Figure: position evaluator pipeline - feature extraction of features such as material balance, mobility, safety, etc., followed by a weighting of features to produce a score.)
Relatively few special features and weights, and multistep lookahead

62 LINEAR APPROXIMATION ARCHITECTURES
Often, the features encode much of the nonlinearity inherent in the cost function being approximated
Then the approximation may be quite accurate without a complicated architecture (as an extreme example, the ideal feature is the true cost function)
With well-chosen features, we can use a linear architecture:
$\tilde J(i;r) = \phi(i)'r$, $\; i = 1, \ldots, n$, or $\; \tilde J(r) = \Phi r = \sum_{j=1}^s \Phi_j r_j$
$\Phi$: the matrix whose rows are $\phi(i)'$, $i = 1, \ldots, n$; $\Phi_j$ is the $j$th column of $\Phi$
(Figure: state $i$ → feature extraction mapping → feature vector $\phi(i)$ → linear mapping → linear cost approximator $\phi(i)'r$.)
This is approximation on the subspace $S = \{\Phi r \mid r \in \Re^s\}$ spanned by the columns of $\Phi$ (basis functions)
Many examples of feature types: polynomial approximation, radial basis functions, etc.

63 ILLUSTRATIONS: POLYNOMIAL TYPE
Polynomial approximation, e.g., a quadratic approximating function. Let the state be $i = (i_1, \ldots, i_q)$ (i.e., have $q$ "dimensions") and define
$\phi_0(i) = 1$, $\; \phi_k(i) = i_k$, $\; \phi_{km}(i) = i_k i_m$, $\quad k, m = 1, \ldots, q$
Linear approximation architecture:
$\tilde J(i;r) = r_0 + \sum_{k=1}^q r_k i_k + \sum_{k=1}^q \sum_{m=k}^q r_{km} i_k i_m$,
where $r$ has components $r_0$, $r_k$, and $r_{km}$.
Interpolation: A subset $I$ of special/representative states is selected, and the parameter vector $r$ has one component $r_i$ per state $i \in I$. The approximating function is
$\tilde J(i;r) = r_i$ for $i \in I$, and $\tilde J(i;r) =$ interpolation using the values at $i \in I$ for $i \notin I$
For example, piecewise constant, piecewise linear, or more general polynomial interpolations.
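A minimal sketch of the quadratic polynomial architecture above as code: a feature map phi(i) built from a q-dimensional state, with J_tilde(i; r) = phi(i)'r (the state dimension and weights are placeholders).

```python
import numpy as np

# Quadratic polynomial features phi(i) for a q-dimensional state i = (i_1, ..., i_q),
# and the linear architecture J_tilde(i; r) = phi(i)' r (illustrative q and weights).
def phi(i):
    i = np.asarray(i, dtype=float)
    q = len(i)
    feats = [1.0]                                                   # phi_0(i) = 1
    feats += list(i)                                                # phi_k(i) = i_k
    feats += [i[k] * i[m] for k in range(q) for m in range(k, q)]   # phi_km(i) = i_k * i_m
    return np.array(feats)

q = 3
r = np.ones(len(phi(np.zeros(q))))                                  # weight vector r
print("feature dimension:", len(r))
print("J_tilde((1,2,3); r) =", phi([1, 2, 3]) @ r)
```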

64 A DOMAIN SPECIFIC EXAMPLE
Tetris game (used as a testbed in competitions)
(Figure: a Tetris board position; the game ends at TERMINATION.)
$J^*(i)$: optimal score starting from position $i$
Number of states: astronomically large (for the standard board)
Success with just 22 features, readily recognized by Tetris players as capturing important aspects of the board position (heights of columns, etc.)

65 APPROX. PI - OPTION TO APPROX. $J_\mu$ OR $Q_\mu$
Use simulation to approximate the cost $J_\mu$ of the current policy $\mu$
Generate an "improved" policy $\bar\mu$ by minimizing in the (approximate) Bellman equation
(Figure: approximate PI loop - approximate policy evaluation of the cost $\tilde J_\mu(i,r)$ of the current policy, followed by policy improvement to generate the improved policy $\bar\mu$.)
Alternatively, approximate the Q-factors of $\mu$
(Figure: the same loop with approximate Q-factors $\tilde Q_\mu(i,u,r)$ and policy improvement $\bar\mu(i) = \arg\min_{u \in U(i)} \tilde Q_\mu(i,u,r)$.)

66 APPROXIMATING $J^*$ OR $Q^*$
Approximation of the optimal cost function $J^*$:
Q-Learning: Use a simulation algorithm to approximate the Q-factors
$Q^*(i,u) = g(i,u) + \alpha \sum_{j=1}^n p_{ij}(u) J^*(j)$
and the optimal costs $J^*(i) = \min_{u \in U(i)} Q^*(i,u)$
Bellman error approach: Find $r$ to
$\min_r E_i\bigl\{ \bigl( \tilde J(i;r) - (T\tilde J)(i;r) \bigr)^2 \bigr\}$
where $E_i\{\cdot\}$ is taken with respect to some distribution over the states
Approximate Linear Programming (we will not discuss here)
Q-learning can also be used with approximations
Q-learning and the Bellman error approach can also be used for policy evaluation

67 APPROXIMATION IN POLICY SPACE
A brief discussion; we will return to it later.
Use a parametrization $\tilde\mu(i;r)$ of policies with a vector $r = (r_1, \ldots, r_s)$. Examples:
Polynomial, e.g., $\tilde\mu(i;r) = r_1 + r_2 i + r_3 i^2$
Linear feature-based, $\tilde\mu(i;r) = \phi_1(i) r_1 + \phi_2(i) r_2$
Optimize the cost over $r$. For example:
Each value of $r$ defines a stationary policy, with cost starting at state $i$ denoted by $\tilde J(i;r)$.
Let $(p_1, \ldots, p_n)$ be some probability distribution over the states, and minimize over $r$: $\sum_{i=1}^n p_i \tilde J(i;r)$
Use a random search, gradient, or other method
A special case: The parametrization of the policies is indirect, through a cost approximation architecture $\hat J$, i.e.,
$\tilde\mu(i;r) \in \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl( g(i,u,j) + \alpha \hat J(j;r) \bigr)$

68 APPROXIMATE POLICY EVALUATION METHODS

69 DIRECT POLICY EVALUATION
Approximate the cost of the current policy by using least squares and simulation-generated cost samples
Amounts to projection of $J_\mu$ onto the approximation subspace
(Figure: direct method - projection $\Pi J_\mu$ of the cost vector $J_\mu$ onto the subspace $S = \{\Phi r \mid r \in \Re^s\}$.)
Solution by least squares methods
Regular and optimistic policy iteration
Nonlinear approximation architectures may also be used

70 DIRECT EVALUATION BY SIMULATION
Projection by Monte Carlo simulation: Compute the projection $\Pi J_\mu$ of $J_\mu$ on the subspace $S = \{\Phi r \mid r \in \Re^s\}$, with respect to a weighted Euclidean norm $\|\cdot\|_\xi$
Equivalently, find $\Phi r^*$, where
$r^* = \arg\min_{r \in \Re^s} \|\Phi r - J_\mu\|_\xi^2 = \arg\min_{r \in \Re^s} \sum_{i=1}^n \xi_i \bigl( \phi(i)'r - J_\mu(i) \bigr)^2$
Setting to 0 the gradient at $r^*$,
$r^* = \Bigl( \sum_{i=1}^n \xi_i\, \phi(i)\phi(i)' \Bigr)^{-1} \sum_{i=1}^n \xi_i\, \phi(i) J_\mu(i)$
Generate samples $\bigl\{ (i_1, J_\mu(i_1)), \ldots, (i_k, J_\mu(i_k)) \bigr\}$ using the distribution $\xi$
Approximate by Monte Carlo the two expected values with low-dimensional calculations:
$\hat r_k = \Bigl( \sum_{t=1}^k \phi(i_t)\phi(i_t)' \Bigr)^{-1} \sum_{t=1}^k \phi(i_t) J_\mu(i_t)$
Equivalent least squares alternative calculation:
$\hat r_k = \arg\min_{r \in \Re^s} \sum_{t=1}^k \bigl( \phi(i_t)'r - J_\mu(i_t) \bigr)^2$
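A hedged end-to-end sketch of direct policy evaluation: Monte Carlo cost samples of J_mu generated by simulating an illustrative small Markov chain, followed by the least-squares fit of a linear architecture Phi r, and a comparison with the exact J_mu.

```python
import numpy as np

# Direct policy evaluation sketch: sample states, generate Monte Carlo cost samples of
# J_mu from truncated trajectories, then least-squares fit Phi r to the samples
# (the chain, costs, and features below are illustrative).
rng = np.random.default_rng(4)
n, alpha = 20, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)    # chain of the policy mu
g = rng.random(n)                                            # expected stage cost under mu
Phi = np.column_stack([np.ones(n), np.arange(n), np.arange(n) ** 2])   # s = 3 features

samples_i, samples_y = [], []
for _ in range(2000):
    i = rng.integers(n)                                      # sampled initial state i_t
    i0, ret, disc = i, 0.0, 1.0
    for _ in range(100):                                     # alpha^100 is negligible
        ret += disc * g[i]
        disc *= alpha
        i = rng.choice(n, p=P[i])
    samples_i.append(i0); samples_y.append(ret)

idx, y = np.array(samples_i), np.array(samples_y)
r_hat, *_ = np.linalg.lstsq(Phi[idx], y, rcond=None)         # least-squares fit

J_mu = np.linalg.solve(np.eye(n) - alpha * P, g)             # exact J_mu, for comparison only
print("max |Phi r_hat - J_mu| =", np.abs(Phi @ r_hat - J_mu).max())
```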

71 INDIRECT POLICY EVALUATION
An example: Galerkin approximation
Solve the projected equation $\Phi r = \Pi T_\mu(\Phi r)$, where $\Pi$ is projection with respect to a suitable weighted Euclidean norm
(Figure: direct method - projection $\Pi J_\mu$ of the cost vector $J_\mu$; indirect method - solving the projected form $\Phi r = \Pi T_\mu(\Phi r)$ of Bellman's equation, both on the subspace $S = \{\Phi r \mid r \in \Re^s\}$.)
Solution methods that use simulation (to manage the calculation of $\Pi$):
TD(λ): Stochastic iterative algorithm for solving $\Phi r = \Pi T_\mu(\Phi r)$
LSTD(λ): Solves a simulation-based approximation with a standard solver
LSPE(λ): A simulation-based form of projected value iteration; essentially
$\Phi r_{k+1} = \Pi T_\mu(\Phi r_k) +$ simulation noise

72 BELLMAN EQUATION ERROR METHODS
Another example of indirect approximate policy evaluation:
$\min_r \|\Phi r - T_\mu(\Phi r)\|_\xi^2 \qquad (*)$
where $\|\cdot\|_\xi$ is a Euclidean norm, weighted with respect to some distribution $\xi$
It is closely related to the projected equation/Galerkin approach (with a special choice of projection norm)
There are several ways to implement the projected equation and Bellman error methods by simulation. They involve:
Generating many random samples of states $i_k$ using the distribution $\xi$
Generating many samples of transitions $(i_k, j_k)$ using the policy $\mu$
Forming a simulation-based approximation of the optimality condition for the projection problem or problem (*) (use sample averages in place of inner products)
Solving the Monte Carlo approximation of the optimality condition
Issues for indirect methods: How to generate the samples? How to calculate $r^*$ efficiently?

73 ANOTHER INDIRECT METHOD: AGGREGATION
A first idea: Group similar states together into "aggregate states" $x_1, \ldots, x_s$; assign a common cost value $r_i$ to each group $x_i$. Solve an "aggregate" DP problem, involving the aggregate states, to obtain $r = (r_1, \ldots, r_s)$. This is called hard aggregation
(Figure: a hard aggregation example in which the states are partitioned into aggregate states $x_1, x_2, x_3, x_4$; the corresponding $\Phi$ is the 0-1 matrix whose $i$th row has a single 1, in the column of the aggregate state containing $i$.)
More general/mathematical view: Solve $\Phi r = \Phi D T_\mu(\Phi r)$, where the rows of $D$ and $\Phi$ are probability distributions (e.g., $D$ and $\Phi$ "aggregate" rows and columns of the linear system $J = T_\mu J$)
Compare with the projected equation $\Phi r = \Pi T_\mu(\Phi r)$. Note: $\Phi D$ is a projection in some interesting cases

74 AGGREGATION AS PROBLEM APPROXIMATION
(Figure: original system states $i, j$ with transition probabilities $p_{ij}(u)$ and costs $g(i,u,j)$, linked to aggregate states $x, y$ through disaggregation probabilities $d_{xi}$ and aggregation probabilities $\phi_{jy}$.)
Aggregate transition probabilities and costs:
$\hat p_{xy}(u) = \sum_{i=1}^n d_{xi} \sum_{j=1}^n p_{ij}(u)\, \phi_{jy}$, $\qquad \hat g(x,u) = \sum_{i=1}^n d_{xi} \sum_{j=1}^n p_{ij}(u)\, g(i,u,j)$
Aggregation can be viewed as a systematic approach for problem approximation. Main elements:
Solve (exactly or approximately) the "aggregate" problem by any kind of VI or PI method (including simulation-based methods)
Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem
Because an exact PI algorithm is used to solve the approximate/aggregate problem, the method behaves more regularly than the projected equation approach

75 APPROXIMATE POLICY ITERATION ISSUES

76 THEORETICAL BASIS OF APPROXIMATE PI
If policies are approximately evaluated using an approximation architecture such that
$\max_i |\tilde J(i, r_k) - J_{\mu^k}(i)| \le \delta$, $\quad k = 0, 1, \ldots$
and if policy improvement is also approximate,
$\max_i |(T_{\mu^{k+1}} \tilde J)(i, r_k) - (T \tilde J)(i, r_k)| \le \epsilon$, $\quad k = 0, 1, \ldots$
Error bound: The sequence $\{\mu^k\}$ generated by approximate policy iteration satisfies
$\limsup_{k\to\infty} \max_i \bigl( J_{\mu^k}(i) - J^*(i) \bigr) \le \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}$
Typical practical behavior: The method makes steady progress up to a point and then the iterates $J_{\mu^k}$ oscillate within a neighborhood of $J^*$.
Oscillations are quite unpredictable. Some bad examples of oscillations have been constructed. In practice, oscillation between policies is probably not the major concern.

77 THE ISSUE OF EXPLORATION To evaluate a policy µ, we need to generate cost samples using that policy - this biases the simulation by underrepresenting states that are unlikely to occur under µ Cost-to-go estimates of underrepresented states may be highly inaccurate This seriously impacts the improved policy µ This is known as inadequate exploration - a particularly acute difficulty when the randomness embodied in the transition probabilities is relatively small (e.g., a deterministic system) Some remedies: Frequently restart the simulation and ensure that the initial states employed form a rich and representative subset Occasionally generate transitions that use a randomly selected control rather than the one dictated by the policy µ Other methods: Use two Markov chains (one is the chain of the policy and is used to generate the transition sequence, the other is used to generate the state sequence).

78 APPROXIMATING Q-FACTORS
Given $\tilde J(i;r)$, policy improvement requires a model [knowledge of $p_{ij}(u)$ for all controls $u \in U(i)$]
Model-free alternative: Approximate the Q-factors
$\tilde Q(i,u;r) \approx \sum_{j=1}^n p_{ij}(u)\bigl( g(i,u,j) + \alpha J_\mu(j) \bigr)$
and use for policy improvement the minimization
$\bar\mu(i) \in \arg\min_{u \in U(i)} \tilde Q(i,u;r)$
$r$ is an adjustable parameter vector and $\tilde Q(i,u;r)$ is a parametric architecture, such as
$\tilde Q(i,u;r) = \sum_{m=1}^s r_m \phi_m(i,u)$
We can adapt any of the cost approximation approaches, e.g., projected equations, aggregation
Use the Markov chain with states $(i,u)$, so $p_{ij}(\mu(i))$ is the transition probability to $(j, \mu(j))$, and 0 to other $(j, u')$
Major concern: Acutely diminished exploration

79 SOME GENERAL ISSUES

80 STOCHASTIC ALGORITHMS: GENERALITIES
Consider solution of a linear equation $x = b + Ax$ by using $m$ simulation samples $b + w_k$ and $A + W_k$, $k = 1, \ldots, m$, where $w_k, W_k$ are random, e.g., simulation noise
Think of $x = b + Ax$ as approximate policy evaluation (projected or aggregation equations)
Stochastic approximation (SA) approach: For $k = 1, \ldots, m$,
$x_{k+1} = (1 - \gamma_k) x_k + \gamma_k \bigl( (b + w_k) + (A + W_k) x_k \bigr)$
Monte Carlo estimation (MCE) approach: Form Monte Carlo estimates of $b$ and $A$,
$b_m = \frac{1}{m} \sum_{k=1}^m (b + w_k)$, $\qquad A_m = \frac{1}{m} \sum_{k=1}^m (A + W_k)$
Then solve $x = b_m + A_m x$ by matrix inversion, $x_m = (I - A_m)^{-1} b_m$, or iteratively
TD(λ) and Q-learning are SA methods
LSTD(λ) and LSPE(λ) are MCE methods
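A small numerical sketch contrasting the SA iteration and the MCE approach for x = b + Ax with noisy samples; the matrix A, vector b, noise levels, and stepsize gamma_k = 1/k are illustrative choices.

```python
import numpy as np

# Solve x = b + Ax from noisy samples b + w_k, A + W_k: stochastic approximation (SA)
# versus Monte Carlo estimation (MCE). All data and noise levels are illustrative.
rng = np.random.default_rng(5)
d, m = 4, 5000
A = 0.5 * np.eye(d) + 0.05 * rng.standard_normal((d, d))     # contraction-like A
b = rng.standard_normal(d)
x_star = np.linalg.solve(np.eye(d) - A, b)

x_sa = np.zeros(d)
b_sum, A_sum = np.zeros(d), np.zeros((d, d))
for k in range(1, m + 1):
    w, W = 0.1 * rng.standard_normal(d), 0.1 * rng.standard_normal((d, d))
    gamma = 1.0 / k                                          # diminishing stepsize
    x_sa = (1 - gamma) * x_sa + gamma * ((b + w) + (A + W) @ x_sa)   # SA iterate
    b_sum += b + w; A_sum += A + W
x_mce = np.linalg.solve(np.eye(d) - A_sum / m, b_sum / m)    # MCE: x_m = (I - A_m)^{-1} b_m

print("SA  error:", np.linalg.norm(x_sa - x_star))
print("MCE error:", np.linalg.norm(x_mce - x_star))
```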

81 COSTS OR COST DIFFERENCES?
Consider the exact policy improvement process. To compare two controls $u$ and $u'$ at $x$, we need
$E\bigl\{ g(x,u,w) - g(x,u',w) + \alpha \bigl( J_\mu(\bar x) - J_\mu(\bar x') \bigr) \bigr\}$
where $\bar x = f(x,u,w)$ and $\bar x' = f(x,u',w)$
Approximate $J_\mu(x)$ or $D_\mu(x,x') = J_\mu(x) - J_\mu(x')$?
Approximating $D_\mu(x,x')$ avoids "noise differencing". This can make a big difference
Important point: $D_\mu$ satisfies a Bellman equation for a system with state $(x,x')$:
$D_\mu(x,x') = E\bigl\{ G_\mu(x,x',w) + \alpha D_\mu(\bar x, \bar x') \bigr\}$
where $\bar x = f\bigl(x,\mu(x),w\bigr)$, $\bar x' = f\bigl(x',\mu(x'),w\bigr)$, and
$G_\mu(x,x',w) = g\bigl(x,\mu(x),w\bigr) - g\bigl(x',\mu(x'),w\bigr)$
$D_\mu$ can be "learned" by the standard methods (TD, LSTD, LSPE, Bellman error, aggregation, etc.). This is known as differential training.

82 AN EXAMPLE (FROM THE NDP TEXT)
System and cost per stage:
$x_{k+1} = x_k + \delta u_k$, $\qquad g(x,u) = \delta(x^2 + u^2)$
$\delta > 0$ is very small; think of a discretization of a continuous-time problem involving $dx(t)/dt = u(t)$
Consider the policy $\mu(x) = -2x$. Its cost function is
$J_\mu(x) = \frac{5x^2}{4}(1 + \delta) + O(\delta^2)$
and its Q-factor is
$Q_\mu(x,u) = \frac{5x^2}{4} + \delta\Bigl( \frac{9x^2}{4} + u^2 + \frac{5}{2}xu \Bigr) + O(\delta^2)$
The important part for policy improvement is $\delta\bigl( u^2 + \frac{5}{2}xu \bigr)$
When $J_\mu(x)$ [or $Q_\mu(x,u)$] is approximated by $\tilde J_\mu(x;r)$ [or by $\tilde Q_\mu(x,u;r)$], it will be dominated by $\frac{5x^2}{4}$ and will be "lost"

83 6.231 DYNAMIC PROGRAMMING LECTURE 4 LECTURE OUTLINE Review of approximation in value space Approximate VI and PI Projected Bellman equations Matrix form of the projected equation Simulation-based implementation LSTD and LSPE methods Optimistic versions Multistep projected Bellman equations Bias-variance tradeoff

84 REVIEW

85 DISCOUNTED MDP System: Controlled Markov chain with states i = 1,...,n, and finite control set U(i) at state i Transition probabilities: p ij (u) p ij (u) p ii (u) i j p jj (u) p ji (u) Cost of a policy π = {µ 0,µ 1,...} starting at state i: { N } J π (i) = lim N E k=0 α k g ( i k,µ k (i k ),i k+1 ) i0 = i with α [0,1) Shorthand notation for DP mappings (TJ)(i) = min u U(i) n p ij (u) ( g(i,u,j)+αj(j) ), i = 1,...,n, j=1 (T µ J)(i) = n ( )( ( ) ) p ij µ(i) g i,µ(i),j +αj(j), i = 1,...,n j=1

86 SHORTHAND THEORY A SUMMARY Bellman s equation: J = TJ, J µ = T µ J µ or J (i) = min u U(i) n p ij (u) ( g(i,u,j)+αj (j) ), i j=1 J µ (i) = n ( )( ( ) p ij µ(i) g i,µ(i),j +αjµ (j) ), i j=1 Optimality condition: i.e., µ: optimal <==> T µ J = TJ µ(i) arg min u U(i) n p ij (u) ( g(i,u,j)+αj (j) ), i j=1

87 THE TWO MAIN ALGORITHMS: VI AND PI Value iteration: For any J R n J (i) = lim k (Tk J)(i), i = 1,...,n Policy iteration: Given µ k Policy evaluation: Find J µ k by solving J µ k(i) = n ( p ij µ k (i) )( g ( i,µ k (i),j ) +αj µ k(j) ), i = 1,...,n j=1 or J µ k = T µ kj µ k Policy improvement: Let µ k+1 be such that µ k+1 (i) arg min u U(i) n p ij (u) ( g(i,u,j)+αj µ k(j) ), i j=1 or T µ k+1j µ k = TJ µ k Policy evaluation is equivalent to solving an n n linear system of equations For large n, exact PI is out of the question (even though it terminates finitely)

88 APPROXIMATION IN VALUE SPACE Approximate J or J µ from a parametric class J(i;r),whereiisthecurrentstateandr = (r 1,...,r s ) is a vector of tunable scalars weights Think n: HUGE, s: (Relatively) SMALL Many types of approximation architectures [i.e., parametric classes J(i;r)] to select from Any r R s defines a (suboptimal) one-step lookahead policy µ(i) = arg min u U(i) n p ij (u) ( g(i,u,j)+α J(j;r) ), i j=1 We want to find a good r We will focus mostly on linear architectures J(r) = Φr where Φ is an n s matrix whose columns are viewed as basis functions

89 LINEAR APPROXIMATION ARCHITECTURES We have J(i;r) = φ(i) r, i = 1,...,n where φ(i), i = 1,...,n is the ith row of Φ, or s J(r) = Φr = Φ j r j j=1 where Φ j is the jth column of Φ State i Feature Extraction Mapping Feature Vector φ(i) Linear Mapping Linear Cost Approximator φ(i) r This is approximation on the subspace S = {Φr r R s } spanned by the columns of Φ (basis functions) Many examples of feature types: Polynomial approximation, radial basis functions, etc Instead of computing J µ or J, which is hugedimensional, we compute the low-dimensional r = (r 1,...,r s ) using low-dimensional calculations

90 APPROXIMATE VALUE ITERATION

91 APPROXIMATE (FITTED) VI
Approximates sequentially $J_k(i) = (T^k J_0)(i)$, $k = 1, 2, \ldots$, with $\tilde J_k(i; r_k)$
The starting function $J_0$ is given (e.g., $J_0 \equiv 0$)
Approximate (Fitted) Value Iteration: A sequential "fit" to produce $\tilde J_{k+1}$ from $\tilde J_k$, i.e., $\tilde J_{k+1} \approx T \tilde J_k$ or (for a single policy $\mu$) $\tilde J_{k+1} \approx T_\mu \tilde J_k$
(Figure: fitted value iteration - successive iterates $\tilde J_1, \tilde J_2, \tilde J_3, \ldots$ obtained by fitting $T\tilde J_k$ on the subspace $S = \{\Phi r \mid r \in \Re^s\}$.)
After a large enough number $N$ of steps, $\tilde J_N(i; r_N)$ is used as the approximation $\tilde J(i;r)$ to $J^*(i)$
Possibly use (approximate) projection $\Pi$ with respect to some projection norm, $\tilde J_{k+1} \approx \Pi T \tilde J_k$

92 WEIGHTED EUCLIDEAN PROJECTIONS
Consider a weighted Euclidean norm
$\|J\|_\xi = \sqrt{ \sum_{i=1}^n \xi_i \bigl( J(i) \bigr)^2 }$,
where $\xi = (\xi_1, \ldots, \xi_n)$ is a positive distribution ($\xi_i > 0$ for all $i$).
Let $\Pi$ denote the projection operation onto $S = \{\Phi r \mid r \in \Re^s\}$ with respect to this norm, i.e., for any $J \in \Re^n$, $\Pi J = \Phi r^*$, where
$r^* = \arg\min_{r \in \Re^s} \|\Phi r - J\|_\xi^2$
Recall that weighted Euclidean projection can be implemented by simulation and least squares, i.e., sampling $J(i)$ according to $\xi$ and solving
$\min_{r \in \Re^s} \sum_{t=1}^k \bigl( \phi(i_t)'r - J(i_t) \bigr)^2$

93 FITTED VI - NAIVE IMPLEMENTATION
Select/sample a small subset $I_k$ of representative states
For each $i \in I_k$, given $\tilde J_k$, compute
$(T\tilde J_k)(i) = \min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl( g(i,u,j) + \alpha \tilde J_k(j; r) \bigr)$
"Fit" the function $\tilde J_{k+1}(i; r_{k+1})$ to the "small" set of values $(T\tilde J_k)(i)$, $i \in I_k$ (for example, use some form of approximate projection)
Simulation can be used for "model-free" implementation
Error bound: If the fit is uniformly accurate within $\delta > 0$, i.e.,
$\max_i |\tilde J_{k+1}(i) - (T\tilde J_k)(i)| \le \delta$,
then
$\limsup_{k\to\infty} \max_{i=1,\ldots,n} \bigl( \tilde J_k(i, r_k) - J^*(i) \bigr) \le \frac{2\alpha\delta}{(1-\alpha)^2}$
But there is a potential problem!

94 AN EXAMPLE OF FAILURE
Consider a two-state discounted MDP with states 1 and 2, and a single policy.
Deterministic transitions: $1 \to 2$ and $2 \to 2$
Transition costs $\equiv 0$, so $J^*(1) = J^*(2) = 0$.
Consider the (exact) fitted VI scheme that approximates cost functions within $S = \{ (r, 2r) \mid r \in \Re \}$ with a weighted least squares fit; here $\Phi = (1, 2)'$
Given $\tilde J_k = (r_k, 2r_k)$, we find $\tilde J_{k+1} = (r_{k+1}, 2r_{k+1})$, where $\tilde J_{k+1} = \Pi_\xi(T\tilde J_k)$, with weights $\xi = (\xi_1, \xi_2)$:
$r_{k+1} = \arg\min_r \Bigl[ \xi_1 \bigl( r - (T\tilde J_k)(1) \bigr)^2 + \xi_2 \bigl( 2r - (T\tilde J_k)(2) \bigr)^2 \Bigr]$
With a straightforward calculation,
$r_{k+1} = \alpha\beta r_k$, where $\beta = \frac{2(\xi_1 + 2\xi_2)}{\xi_1 + 4\xi_2} > 1$
So if $\alpha > 1/\beta$ (e.g., $\xi_1 = \xi_2 = 1$), the sequence $\{r_k\}$ diverges and so does $\{\tilde J_k\}$.
The difficulty is that $T$ is a contraction, but $\Pi_\xi T$ (= least squares fit composed with $T$) is not.
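A quick numerical check of this divergence example (exact fitted VI over S = {(r, 2r)}, weights xi_1 = xi_2 = 1 and alpha = 0.9, so that alpha*beta = 1.08 > 1):

```python
import numpy as np

# Two-state failure example: exact fitted VI within S = {(r, 2r)} with weights xi = (1, 1).
alpha = 0.9
Xi = np.diag([1.0, 1.0])         # weight matrix diag(xi_1, xi_2)
Phi = np.array([[1.0], [2.0]])

def T(J):                        # deterministic transitions 1 -> 2, 2 -> 2, zero cost
    return alpha * np.array([J[1], J[1]])

r = np.array([1.0])              # any nonzero start
for k in range(20):
    target = T(Phi @ r)          # value iterate
    # weighted least-squares fit of Phi r to the value iterate:
    r = np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi @ target)
print("r_20 =", r[0])            # grows like (alpha*beta)^k with beta = 6/5, i.e., diverges
```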

95 NORM MISMATCH PROBLEM
For the method to converge, we need $\Pi_\xi T$ to be a contraction; the contraction property of $T$ is not enough
(Figure: fitted value iteration with projection - iterates $\tilde J_{k+1} = \Pi_\xi(T\tilde J_k)$ on the subspace $S = \{\Phi r \mid r \in \Re^s\}$.)
We need a vector of weights $\xi$ such that $T$ is a contraction with respect to the weighted Euclidean norm $\|\cdot\|_\xi$
Then we can show that $\Pi_\xi T$ is a contraction with respect to $\|\cdot\|_\xi$
We will come back to this issue

96 APPROXIMATE POLICY ITERATION

97 APPROXIMATE PI
(Figure: approximate PI loop - approximate policy evaluation of the cost $\tilde J_\mu(i,r)$ of the current policy, followed by policy improvement to generate the improved policy $\bar\mu$.)
Evaluation of a typical policy $\mu$: Linear cost function approximation $\tilde J_\mu(r) = \Phi r$, where $\Phi$ is a full-rank $n \times s$ matrix with columns the basis functions, and $i$th row denoted $\phi(i)'$.
Policy improvement to generate $\bar\mu$:
$\bar\mu(i) \in \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl( g(i,u,j) + \alpha \phi(j)'r \bigr)$
Error bound (same as approximate VI): If
$\max_i |\tilde J_{\mu^k}(i, r_k) - J_{\mu^k}(i)| \le \delta$, $\quad k = 0, 1, \ldots$,
the sequence $\{\mu^k\}$ satisfies
$\limsup_{k\to\infty} \max_i \bigl( J_{\mu^k}(i) - J^*(i) \bigr) \le \frac{2\alpha\delta}{(1-\alpha)^2}$

98 POLICY EVALUATION
Let's consider approximate evaluation of the cost of the current policy by using simulation.
Direct policy evaluation - cost samples generated by simulation, and optimization by least squares
Indirect policy evaluation - solving the projected equation $\Phi r = \Pi T_\mu(\Phi r)$, where $\Pi$ is projection with respect to a suitable weighted Euclidean norm
(Figure: direct method - projection $\Pi J_\mu$ of the cost vector $J_\mu$; indirect method - solving the projected form $\Phi r = \Pi T_\mu(\Phi r)$ of Bellman's equation, both on the subspace $S = \{\Phi r \mid r \in \Re^s\}$.)
Recall that projection can be implemented by simulation and least squares

99 PI WITH INDIRECT POLICY EVALUATION
(Figure: approximate PI loop - approximate policy evaluation of the cost $\tilde J_\mu(i,r)$ of the current policy, followed by policy improvement to generate the improved policy $\bar\mu$.)
Given the current policy $\mu$:
We solve the projected Bellman equation $\Phi r = \Pi T_\mu(\Phi r)$
We approximate the solution $J_\mu$ of Bellman's equation $J = T_\mu J$ with the projected equation solution $\tilde J_\mu(r)$

100 KEY QUESTIONS AND RESULTS
Does the projected equation have a solution?
Under what conditions is the mapping $\Pi T_\mu$ a contraction, so that $\Pi T_\mu$ has a unique fixed point?
Assumption: The Markov chain corresponding to $\mu$ has a single recurrent class and no transient states, i.e., it has steady-state probabilities that are positive:
$\xi_j = \lim_{N\to\infty} \frac{1}{N} \sum_{k=1}^N P(i_k = j \mid i_0 = i) > 0$
Note that $\xi_j$ is the long-term frequency of state $j$.
Proposition (Norm Matching Property): Assume that the projection $\Pi$ is with respect to $\|\cdot\|_\xi$, where $\xi = (\xi_1, \ldots, \xi_n)$ is the steady-state probability vector. Then:
(a) $\Pi T_\mu$ is a contraction of modulus $\alpha$ with respect to $\|\cdot\|_\xi$.
(b) The unique fixed point $\Phi r^*$ of $\Pi T_\mu$ satisfies
$\|J_\mu - \Phi r^*\|_\xi \le \frac{1}{\sqrt{1-\alpha^2}} \|J_\mu - \Pi J_\mu\|_\xi$

101 PRELIMINARIES: PROJECTION PROPERTIES
Important property of the projection $\Pi$ on $S$ with weighted Euclidean norm $\|\cdot\|_\xi$: for all $J \in \Re^n$ and $\Phi r \in S$, the Pythagorean Theorem holds:
$\|J - \Phi r\|_\xi^2 = \|J - \Pi J\|_\xi^2 + \|\Pi J - \Phi r\|_\xi^2$
The Pythagorean Theorem implies that the projection is nonexpansive, i.e.,
$\|\Pi J - \Pi \bar J\|_\xi \le \|J - \bar J\|_\xi$, for all $J, \bar J \in \Re^n$.
To see this, note that
$\|\Pi(J - \bar J)\|_\xi^2 \le \|\Pi(J - \bar J)\|_\xi^2 + \|(I - \Pi)(J - \bar J)\|_\xi^2 = \|J - \bar J\|_\xi^2$

102 PROOF OF CONTRACTION PROPERTY
Lemma: If $P$ is the transition matrix of $\mu$, then $\|Pz\|_\xi \le \|z\|_\xi$ for all $z \in \Re^n$
Proof: Let $p_{ij}$ be the components of $P$. For all $z \in \Re^n$, we have
$\|Pz\|_\xi^2 = \sum_{i=1}^n \xi_i \Bigl( \sum_{j=1}^n p_{ij} z_j \Bigr)^2 \le \sum_{i=1}^n \xi_i \sum_{j=1}^n p_{ij} z_j^2 = \sum_{j=1}^n \sum_{i=1}^n \xi_i p_{ij} z_j^2 = \sum_{j=1}^n \xi_j z_j^2 = \|z\|_\xi^2$,
where the inequality follows from the convexity of the quadratic function, and the next-to-last equality follows from the defining property $\sum_{i=1}^n \xi_i p_{ij} = \xi_j$ of the steady-state probabilities.
Using the lemma, the nonexpansiveness of $\Pi$, and the definition $T_\mu J = g + \alpha P J$, we have
$\|\Pi T_\mu J - \Pi T_\mu \bar J\|_\xi \le \|T_\mu J - T_\mu \bar J\|_\xi = \alpha \|P(J - \bar J)\|_\xi \le \alpha \|J - \bar J\|_\xi$
for all $J, \bar J \in \Re^n$. Hence $\Pi T_\mu$ is a contraction of modulus $\alpha$.

103 PROOF OF ERROR BOUND
Let $\Phi r^*$ be the fixed point of $\Pi T$. We have
$\|J_\mu - \Phi r^*\|_\xi \le \frac{1}{\sqrt{1-\alpha^2}} \|J_\mu - \Pi J_\mu\|_\xi$.
Proof: We have
$\|J_\mu - \Phi r^*\|_\xi^2 = \|J_\mu - \Pi J_\mu\|_\xi^2 + \|\Pi J_\mu - \Phi r^*\|_\xi^2 = \|J_\mu - \Pi J_\mu\|_\xi^2 + \|\Pi T J_\mu - \Pi T(\Phi r^*)\|_\xi^2 \le \|J_\mu - \Pi J_\mu\|_\xi^2 + \alpha^2 \|J_\mu - \Phi r^*\|_\xi^2$,
where
The first equality uses the Pythagorean Theorem
The second equality holds because $J_\mu$ is the fixed point of $T$ and $\Phi r^*$ is the fixed point of $\Pi T$
The inequality uses the contraction property of $\Pi T$. Q.E.D.

104 SIMULATION-BASED SOLUTION OF PROJECTED EQUATION

105 MATRIX FORM OF PROJECTED EQUATION
(Figure: the projected equation $\Phi r^* = \Pi_\xi T_\mu(\Phi r^*)$, with $T_\mu(\Phi r) = g + \alpha P \Phi r$, on the subspace $S = \{\Phi r \mid r \in \Re^s\}$.)
The solution $\Phi r^*$ satisfies the orthogonality condition: the error $\Phi r^* - (g + \alpha P \Phi r^*)$ is "orthogonal" to the subspace spanned by the columns of $\Phi$.
This is written as
$\Phi' \Xi \bigl( \Phi r^* - (g + \alpha P \Phi r^*) \bigr) = 0$,
where $\Xi$ is the diagonal matrix with the steady-state probabilities $\xi_1, \ldots, \xi_n$ along the diagonal.
Equivalently, $C r^* = d$, where
$C = \Phi' \Xi (I - \alpha P) \Phi$, $\qquad d = \Phi' \Xi g$
but computing $C$ and $d$ is HARD (high-dimensional inner products).

106 SOLUTION OF PROJECTED EQUATION
Solve $Cr^* = d$ by matrix inversion: $r^* = C^{-1} d$
Projected Value Iteration (PVI) method:
$\Phi r_{k+1} = \Pi T(\Phi r_k) = \Pi(g + \alpha P \Phi r_k)$
Converges to $r^*$ because $\Pi T$ is a contraction.
(Figure: the value iterate $T(\Phi r_k) = g + \alpha P \Phi r_k$ is projected onto the subspace $S$ spanned by the basis functions to obtain $\Phi r_{k+1}$.)
PVI can be written as
$r_{k+1} = \arg\min_{r \in \Re^s} \|\Phi r - (g + \alpha P \Phi r_k)\|_\xi^2$
By setting to 0 the gradient with respect to $r$, which yields $\Phi' \Xi \bigl( \Phi r_{k+1} - (g + \alpha P \Phi r_k) \bigr) = 0$, we obtain
$r_{k+1} = r_k - (\Phi' \Xi \Phi)^{-1} (C r_k - d)$

107 SIMULATION-BASED IMPLEMENTATIONS

- Key idea: calculate simulation-based approximations based on k samples:
      C_k ≈ C,   d_k ≈ d
- Matrix inversion r* = C⁻¹d is approximated by
      r̂_k = C_k⁻¹ d_k
  This is the LSTD (Least Squares Temporal Differences) method.
- The PVI method r_{k+1} = r_k − (Φ'ΞΦ)⁻¹(Cr_k − d) is approximated by
      r_{k+1} = r_k − G_k(C_k r_k − d_k),   where G_k ≈ (Φ'ΞΦ)⁻¹
  This is the LSPE (Least Squares Policy Evaluation) method.
- Key fact: C_k, d_k, and G_k can be computed with low-dimensional linear algebra (of order s, the number of basis functions).

108 SIMULATION MECHANICS

- We generate an infinitely long trajectory (i_0, i_1, ...) of the Markov chain, so states i and transitions (i,j) appear with long-term frequencies ξ_i and p_ij.
- After generating each transition (i_t, i_{t+1}), we compute the row φ(i_t)' of Φ and the cost component g(i_t, i_{t+1}).
- We form
      d_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) g(i_t, i_{t+1}) ≈ Σ_{i,j} ξ_i p_ij φ(i) g(i,j) = Φ'Ξg = d
      C_k = (1/(k+1)) Σ_{t=0}^k φ(i_t)(φ(i_t) − αφ(i_{t+1}))' ≈ Φ'Ξ(I − αP)Φ = C
- Also, in the case of LSPE,
      G_k = (1/(k+1)) Σ_{t=0}^k φ(i_t)φ(i_t)' ≈ Φ'ΞΦ
- Convergence based on the law of large numbers.
- C_k, d_k, and G_k can be formed incrementally. They can also be written using the formalism of temporal differences (this is just a matter of style).
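A minimal simulation sketch of these mechanics is given below. The chain, basis, and transition costs are hypothetical (not the book's code); it forms C_k, d_k, G_k incrementally from one trajectory, and applies both the LSTD formula and occasional LSPE-style updates.

```python
# Simulation-based formation of C_k, d_k, G_k, with LSTD and LSPE estimates.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.3, 0.4],
              [0.2, 0.5, 0.3]])              # chain of the evaluated policy mu (hypothetical)

def g(i, j):                                  # hypothetical transition cost g(i, j)
    return float(i == 0) + 0.5 * j

Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
s = Phi.shape[1]

C = np.zeros((s, s)); d = np.zeros(s); G = np.zeros((s, s))
r = np.zeros(s)                               # LSPE iterate
i = 0
K = 20000
for t in range(K):
    j = rng.choice(3, p=P[i])                 # simulate one transition (i_t, i_{t+1})
    phi_i, phi_j = Phi[i], Phi[j]
    # running averages of the sample-based approximations C_k, d_k, G_k
    C += (np.outer(phi_i, phi_i - alpha * phi_j) - C) / (t + 1)
    d += (phi_i * g(i, j) - d) / (t + 1)
    G += (np.outer(phi_i, phi_i) - G) / (t + 1)
    if t >= 100 and t % 100 == 0:             # occasional LSPE updates r <- r - G_k^{-1}(C_k r - d_k)
        r = r - np.linalg.solve(G, C @ r - d)
    i = j

print(np.linalg.solve(C, d))                  # LSTD estimate r_hat_k = C_k^{-1} d_k
print(r)                                      # LSPE iterate; typically close to the LSTD estimate
```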

109 OPTIMISTIC VERSIONS

- Instead of calculating nearly exact approximations C_k ≈ C and d_k ≈ d, we do a less accurate approximation, based on few simulation samples
- Evaluate (coarsely) the current policy µ, then do a policy improvement
- This often leads to faster computation (as optimistic methods often do)
- Very complex behavior (see the subsequent discussion on oscillations)
- The matrix inversion/LSTD method has serious problems due to large simulation noise (because of limited sampling), particularly if the C matrix is ill-conditioned
- LSPE tends to cope better because of its iterative nature (this is true of other iterative methods as well)
- A stepsize γ ∈ (0,1] in LSPE may be useful to damp the effect of simulation noise:
      r_{k+1} = r_k − γG_k(C_k r_k − d_k)

110 MULTISTEP PROJECTED EQUATIONS

111 MULTISTEP METHODS

- Introduce a multistep version of Bellman's equation J = T^(λ) J, where for λ ∈ [0,1),
      T^(λ) = (1 − λ) Σ_{l=0}^∞ λ^l T^{l+1}
  a geometrically weighted sum of powers of T.
- Note that T^l is a contraction with modulus α^l, with respect to the weighted Euclidean norm ‖·‖_ξ, where ξ is the steady-state probability vector of the Markov chain.
- Hence T^(λ) is a contraction with modulus
      α_λ = (1 − λ) Σ_{l=0}^∞ α^{l+1} λ^l = α(1 − λ)/(1 − αλ)
  Note that α_λ → 0 as λ → 1.
- T^l and T^(λ) have the same fixed point J_µ, and
      ‖J_µ − Φr*_λ‖_ξ ≤ (1/√(1 − α_λ²)) ‖J_µ − ΠJ_µ‖_ξ
  where Φr*_λ is the fixed point of ΠT^(λ).
- The fixed point Φr*_λ depends on λ.
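A short numerical illustration (not from the text) makes the effect of λ on the contraction modulus and on the error-bound factor concrete; α = 0.95 is an arbitrary choice.

```python
# Contraction modulus alpha_lambda = alpha(1 - lambda)/(1 - alpha*lambda)
# and the error-bound factor 1 / sqrt(1 - alpha_lambda^2), for several lambda.
import numpy as np

alpha = 0.95
for lam in [0.0, 0.5, 0.9, 0.99]:
    a_lam = alpha * (1 - lam) / (1 - alpha * lam)
    factor = 1.0 / np.sqrt(1 - a_lam ** 2)
    print(f"lambda={lam:4.2f}  alpha_lambda={a_lam:5.3f}  bound factor={factor:6.3f}")
```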

112 BIAS-VARIANCE TRADEOFF

[Figure: Φr*_λ, the solution of the projected equation Φr = ΠT^(λ)(Φr), moves along the subspace S = {Φr | r ∈ R^s} from a biased point (λ = 0) toward ΠJ_µ (λ = 1); the bias shrinks while the simulation error grows as λ increases]

- Error bound: ‖J_µ − Φr*_λ‖_ξ ≤ (1/√(1 − α_λ²)) ‖J_µ − ΠJ_µ‖_ξ
- As λ ↑ 1, we have α_λ ↓ 0, so the error bound (and the quality of approximation) improves as λ ↑ 1. In fact
      lim_{λ→1} Φr*_λ = ΠJ_µ
- But the simulation noise in approximating
      T^(λ) = (1 − λ) Σ_{l=0}^∞ λ^l T^{l+1}
  increases.
- Choice of λ is usually based on trial and error.

113 MULTISTEP PROJECTED EQ. METHODS

- The projected Bellman equation is Φr = ΠT^(λ)(Φr)
- In matrix form: C^(λ) r = d^(λ), where
      C^(λ) = Φ'Ξ(I − αP^(λ))Φ,   d^(λ) = Φ'Ξg^(λ),
  with
      P^(λ) = (1 − λ) Σ_{l=0}^∞ α^l λ^l P^{l+1},   g^(λ) = Σ_{l=0}^∞ α^l λ^l P^l g
- The LSTD(λ) method is
      r̂_k = (C_k^(λ))⁻¹ d_k^(λ),
  where C_k^(λ) and d_k^(λ) are simulation-based approximations of C^(λ) and d^(λ).
- The LSPE(λ) method is
      r_{k+1} = r_k − γG_k(C_k^(λ) r_k − d_k^(λ))
  where G_k is a simulation-based approximation to (Φ'ΞΦ)⁻¹
- TD(λ): an important simpler/slower iteration [similar to LSPE(λ) with G_k = I - see the text].

114 MORE ON MULTISTEP METHODS

- The simulation process to obtain C_k^(λ) and d_k^(λ) is similar to the case λ = 0 (a single simulation trajectory i_0, i_1, ..., but more complex formulas):
      C_k^(λ) = (1/(k+1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} (φ(i_m) − αφ(i_{m+1}))'
      d_k^(λ) = (1/(k+1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} g_{i_m}
- In the context of approximate policy iteration, we can use optimistic versions (few samples between policy updates). Many different versions (see the text).
- Note the λ-tradeoffs:
  - As λ ↑ 1, C_k^(λ) and d_k^(λ) contain more simulation noise, so more samples are needed for a close approximation of r_λ (the solution of the projected equation)
  - The error bound ‖J_µ − Φr_λ‖_ξ becomes smaller
  - As λ ↑ 1, ΠT^(λ) becomes a contraction for an arbitrary projection norm
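Swapping the order of the two sums above gives an eligibility-trace recursion z_m = αλ z_{m−1} + φ(i_m), so C_k^(λ) and d_k^(λ) can still be formed in a single pass over the trajectory. The sketch below (hypothetical chain and per-transition cost samples, not the book's code) implements LSTD(λ) this way.

```python
# LSTD(lambda) via the eligibility-trace form of C_k^(lambda) and d_k^(lambda).
import numpy as np

rng = np.random.default_rng(0)
alpha, lam = 0.9, 0.7
P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.3, 0.4],
              [0.2, 0.5, 0.3]])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
s = Phi.shape[1]

def cost(i, j):                               # hypothetical per-transition cost sample
    return float(i == 0) + 0.5 * j

C_lam = np.zeros((s, s)); d_lam = np.zeros(s)
z = np.zeros(s)                               # eligibility trace z_m = alpha*lam*z_{m-1} + phi(i_m)
i = 0
K = 50000
for m in range(K):
    j = rng.choice(3, p=P[i])
    z = alpha * lam * z + Phi[i]
    C_lam += (np.outer(z, Phi[i] - alpha * Phi[j]) - C_lam) / (m + 1)
    d_lam += (z * cost(i, j) - d_lam) / (m + 1)
    i = j

print(np.linalg.solve(C_lam, d_lam))          # LSTD(lambda) estimate of r*_lambda
```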

115 6.231 DYNAMIC PROGRAMMING LECTURE 5 LECTURE OUTLINE

- Review of approximate PI based on projected Bellman equations
- Issues of policy improvement
- Exploration enhancement in policy evaluation
- Oscillations in approximate PI
- Aggregation: an alternative to the projected equation/Galerkin approach
- Examples of aggregation
- Simulation-based aggregation
- Relation between aggregation and projected equations

116 REVIEW

117 DISCOUNTED MDP

- System: controlled Markov chain with states i = 1, ..., n and finite set of controls u ∈ U(i)
- Transition probabilities: p_ij(u)

[Figure: two-state transition diagram with probabilities p_ii(u), p_ij(u), p_ji(u), p_jj(u)]

- Cost of a policy π = {µ_0, µ_1, ...} starting at state i:
      J_π(i) = lim_{N→∞} E{ Σ_{k=0}^N α^k g(i_k, µ_k(i_k), i_{k+1}) | i = i_0 },   with α ∈ [0,1)
- Shorthand notation for the DP mappings:
      (TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u)(g(i,u,j) + αJ(j)),   i = 1, ..., n,
      (T_µ J)(i) = Σ_{j=1}^n p_ij(µ(i))(g(i,µ(i),j) + αJ(j)),   i = 1, ..., n
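A tabular implementation of these mappings is straightforward. The sketch below (hypothetical 2-state, 2-control data) defines T and T_µ and runs value iteration with T; it is illustrative only, not code from the text.

```python
# Tabular DP mappings T and T_mu for a small hypothetical discounted MDP.
import numpy as np

alpha = 0.9
# p[u, i, j]: transition probability; g[u, i, j]: transition cost, for controls u = 0, 1
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[[1.0, 2.0], [0.0, 3.0]],
              [[2.0, 0.5], [1.0, 1.0]]])

def T(J):
    """Bellman mapping: minimize expected one-stage cost plus discounted cost-to-go."""
    Q = (p * (g + alpha * J[None, None, :])).sum(axis=2)   # Q[u, i]
    return Q.min(axis=0)

def T_mu(J, mu):
    """Policy mapping for a stationary policy mu (one control index per state)."""
    n = J.shape[0]
    return np.array([(p[mu[i], i] * (g[mu[i], i] + alpha * J)).sum() for i in range(n)])

J = np.zeros(2)
for _ in range(200):                 # value iteration: J converges to the optimal cost
    J = T(J)
mu = (p * (g + alpha * J[None, None, :])).sum(axis=2).argmin(axis=0)   # greedy policy
print(J, mu, T_mu(J, mu))            # at convergence, T_mu J is approximately equal to J
```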

118 APPROXIMATE PI

[Figure: the approximate policy iteration loop - Initial Policy, Approximate Policy Evaluation (evaluate approximate cost J̃_µ(i,r)), Policy Improvement (generate improved policy µ̄)]

- Evaluation of the typical policy µ: linear cost function approximation
      J̃_µ(r) = Φr
  where Φ is a full-rank n × s matrix with columns the basis functions, and ith row denoted φ(i)'.
- Policy improvement to generate µ̄:
      µ̄(i) = arg min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u)(g(i,u,j) + αφ(j)'r)
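The policy improvement step needs only the coefficient vector r and the model. The following sketch (hypothetical MDP data, basis, and coefficients) computes µ̄ by minimizing the approximate Q-factors formed with φ(j)'r.

```python
# Approximate policy improvement: mu_bar(i) = argmin_u sum_j p_ij(u)(g(i,u,j) + alpha*phi(j)'r).
import numpy as np

alpha = 0.9
p = np.array([[[0.8, 0.2], [0.3, 0.7]],      # p[u, i, j] (hypothetical)
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[[1.0, 2.0], [0.0, 3.0]],      # g[u, i, j] (hypothetical)
              [[2.0, 0.5], [1.0, 1.0]]])
Phi = np.array([[1.0, 0.0], [1.0, 1.0]])     # n x s basis matrix, rows phi(i)'
r = np.array([0.8, -0.3])                    # coefficients from the evaluation phase (stand-in)

J_tilde = Phi @ r                            # approximate cost-to-go phi(j)' r
Q = (p * (g + alpha * J_tilde[None, None, :])).sum(axis=2)   # approximate Q-factors Q[u, i]
mu_bar = Q.argmin(axis=0)                    # improved policy, one control per state
print(mu_bar)
```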

119 EVALUATION BY PROJECTED EQUATIONS

- Approximate policy evaluation by solving
      Φr = ΠT_µ(Φr)
  Π: weighted Euclidean projection; note the special nature of the steady-state distribution weighting.
- Implementation by simulation (a single long trajectory using the current policy - important to make ΠT_µ a contraction). LSTD, LSPE methods.
- Multistep option: solve Φr = ΠT_µ^(λ)(Φr) with
      T_µ^(λ) = (1 − λ) Σ_{l=0}^∞ λ^l T_µ^{l+1},   0 ≤ λ < 1
  - As λ ↑ 1, ΠT_µ^(λ) becomes a contraction for any projection norm (allows changes in Π)
  - Bias-variance tradeoff

[Figure: bias-variance tradeoff - the solution of the projected equation Φr = ΠT^(λ)(Φr) moves from a biased point (λ = 0) toward ΠJ_µ (λ = 1) along the subspace S = {Φr | r ∈ R^s}, while simulation error grows]
