Infinite-Horizon Dynamic Programming


1 Infinite-Horizon Dynamic Programming
Acknowledgement: these slides are based on Prof. Mengdi Wang's and Prof. Dimitri Bertsekas' lecture notes.

2 Homework
1) Read the following chapters of Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction:
   Chapter 2: Multi-armed Bandits
   Chapter 3: Finite Markov Decision Processes
   Chapter 4: Dynamic Programming
   Chapter 5: Monte Carlo Methods
   Chapter 6: Temporal-Difference Learning
   Chapter 9: On-policy Prediction with Approximation
   Chapter 10: On-policy Control with Approximation
   Chapter 13: Policy Gradient Methods
2) Understand at least three examples in each chapter; if a chapter includes programs, test or implement them.

3 Outline
1 Review of Finite-Horizon DP
2 Infinite-Horizon DP: Theory and Algorithms
3 A Primer on ADP
4 Dimension Reduction in RL: Approximation in value space; Approximation in policy space; State Aggregation
5 On-Policy Learning: Direct Projection; Bellman Error Minimization; Projected Bellman Equation Method; From On-Policy to Off-Policy

4 Abstract DP Model
Discrete-time system with state transition
    x_{k+1} = f_k(x_k, u_k, w_k),  k = 0, 1, ..., N-1
x_k: state, summarizing past information that is relevant for future optimization
u_k: control/action, a decision to be selected at time k from a given set U_k
w_k: random disturbance or noise
g_k(x_k, u_k, w_k): transition cost incurred at time k given current state x_k and control u_k
For every k and every x_k, we want an optimal action. We look for a mapping μ from states to actions.

5 Abstract DP Model
Objective: control the system to minimize the overall expected cost
    min_μ E[ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) ]
    s.t. u_k = μ(x_k),  k = 0, ..., N-1
We look for a policy/strategy μ, which is a mapping from states to actions.

6 Finite-Horizon DP: Theory and Algorithm
The value function is defined as the optimal revenue one can earn (by using the optimal policy onward) starting at time period t with inventory x:
    V_t(x) = max_{μ_t,...,μ_T} E[ Σ_{k=t}^{T} g_k(x_k, μ_k(x_k), w_k) | x_t = x ]
Bellman equation: for all x and t ≤ T − 1,
    V_t(x) = max_{μ_t} E[ g_t(x_t, μ_t(x_t), w_t) + V_{t+1}(x_{t+1}) | x_t = x ]
Tail optimality: a strategy μ_1, ..., μ_T is optimal if and only if every tail strategy μ_t, ..., μ_T is optimal for the tail problem starting at stage t.
DP algorithm: solve the Bellman equation directly by backward induction. This is value iteration.

7 Option Pricing: A Simple Binomial Model
We focus on American call options.
Strike price: K
Duration: T days
Stock price on the t-th day: S_t
Growth rate: u ∈ (1, ∞)
Diminish rate: d ∈ (0, 1)
Probability of growth: p ∈ [0, 1]
Binomial model of the stock price:
    S_{t+1} = u S_t with probability p,   S_{t+1} = d S_t with probability 1 − p
As the discretization of time becomes finer, the binomial model approaches the Brownian motion model.

8 Option Pricing: Bellman's Equation
Given S_0, T, K, u, d, p.
State: S_t, with a finite number of possible values.
Cost vector: V_t(S), the value of the option on the t-th day when the current stock price is S.
Bellman equation for the binomial option:
    V_t(S_t) = max{ S_t − K, p V_{t+1}(u S_t) + (1 − p) V_{t+1}(d S_t) }
    V_T(S_T) = max{ S_T − K, 0 }
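The backward induction below is a minimal sketch of this recursion on a recombining binomial tree. The function name and the parameter values in the call are hypothetical, and no discounting is applied, matching the Bellman equation as written on the slide.

```python
import numpy as np

def american_call_binomial(S0, K, T, u, d, p):
    """Backward induction for the binomial option model above.
    No discount factor is used, matching the slide's Bellman equation."""
    # Option values at day T: terminal condition V_T(S_T) = max(S_T - K, 0)
    values = np.array([max(S0 * u**j * d**(T - j) - K, 0.0) for j in range(T + 1)])
    for t in range(T - 1, -1, -1):
        prices = np.array([S0 * u**j * d**(t - j) for j in range(t + 1)])
        cont = p * values[1:t + 2] + (1 - p) * values[0:t + 1]   # continuation value
        values = np.maximum(prices - K, cont)                     # exercise vs. continue
    return values[0]

print(american_call_binomial(S0=100.0, K=100.0, T=50, u=1.02, d=0.98, p=0.55))
```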

9 Outline
1 Review of Finite-Horizon DP
2 Infinite-Horizon DP: Theory and Algorithms
3 A Primer on ADP
4 Dimension Reduction in RL: Approximation in value space; Approximation in policy space; State Aggregation
5 On-Policy Learning: Direct Projection; Bellman Error Minimization; Projected Bellman Equation Method; From On-Policy to Off-Policy

10 Infinite-Horizon Discounted Problems / Bounded Cost
Stationary system:
    x_{k+1} = f(x_k, u_k, w_k),  k = 0, 1, ...
Cost of a policy π = {μ_0, μ_1, ...}:
    J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} [ Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) ]
with α < 1, and g bounded [for some M, |g(x, u, w)| ≤ M for all (x, u, w)].
The optimal cost function is defined as J*(x) = min_π J_π(x).

11 Infinite-Horizon Discounted Problems / Bounded Cost
Boundedness of g guarantees that all costs are well defined and bounded:
    |J_π(x)| ≤ M / (1 − α)
All spaces are arbitrary; only boundedness of g is important (there are mathematical fine points, e.g. measurability, but they do not matter in practice).
Important special case with finite spaces: the Markovian Decision Problem (MDP).
All algorithms ultimately work with a finite-space MDP approximating the original problem.

12 Shorthand Notation for DP Mappings
For any function J of x, denote
    (TJ)(x) = min_{u ∈ U(x)} E_w{ g(x, u, w) + α J(f(x, u, w)) },  for all x
TJ is the optimal cost function for the one-stage problem with stage cost g and terminal cost function αJ.
T operates on bounded functions of x to produce other bounded functions of x.
For any stationary policy μ, denote
    (T_μ J)(x) = E_w{ g(x, μ(x), w) + α J(f(x, μ(x), w)) },  for all x
The critical structure of the problem is captured in T and T_μ.
The entire theory of discounted problems can be developed in shorthand using T and T_μ.
The same is true for many other DP problems. T and T_μ provide a powerful unifying framework for DP; this is the essence of the book Abstract Dynamic Programming.

13 Express Finite-Horizon Cost Using T
Consider an N-stage policy π_0^N = {μ_0, μ_1, ..., μ_{N-1}} with a terminal cost J, and let π_1^N = {μ_1, μ_2, ..., μ_{N-1}}. Then
    J_{π_0^N}(x_0) = E[ α^N J(x_N) + Σ_{l=0}^{N-1} α^l g(x_l, μ_l(x_l), w_l) ]
                   = E[ g(x_0, μ_0(x_0), w_0) + α J_{π_1^N}(x_1) ]
                   = (T_{μ_0} J_{π_1^N})(x_0)
By induction, we have
    J_{π_0^N}(x) = (T_{μ_0} T_{μ_1} ··· T_{μ_{N-1}} J)(x),  for all x
For a stationary policy μ, the N-stage cost function (with terminal cost J) is J_{π_0^N} = T_μ^N J, where T_μ^N is the N-fold composition of T_μ.
Similarly, the optimal N-stage cost function (with terminal cost J) is T^N J.
T^N J = T(T^{N-1} J) is just the DP algorithm.

14 Markov Chain
A Markov chain is a random process that takes values in the state space {1, ..., n}. The process evolves according to a transition probability matrix P ∈ R^{n×n}, where
    P(i_{k+1} = j | i_k = i, i_{k-1}, ..., i_0) = P(i_{k+1} = j | i_k = i) = P_{ij}
A Markov chain is memoryless: conditioned on the current state, the future evolution is independent of the past trajectory. This memoryless property is what "Markov" means.
A state i is recurrent if it is visited infinitely many times with probability 1.
A Markov chain is irreducible if its state space is a single communicating class; in other words, it is possible to get to any state from any state.
When states are modeled appropriately, all stochastic processes are Markov.

15 Markovian Decision Problem
We will mostly assume the system is an n-state (controlled) Markov chain:
States i = 1, ..., n (instead of x)
Transition probabilities p_{i_k i_{k+1}}(u_k) [instead of x_{k+1} = f(x_k, u_k, w_k)]
Stage cost g(i_k, u_k, i_{k+1}) [instead of g(x_k, u_k, w_k)]
Cost functions J = (J(1), ..., J(n)) (vectors in R^n)
Cost of a policy π = {μ_0, μ_1, ...}:
    J_π(i) = lim_{N→∞} E_{i_k, k=1,2,...} [ Σ_{k=0}^{N-1} α^k g(i_k, μ_k(i_k), i_{k+1}) | i_0 = i ]
The MDP is the most important problem in infinite-horizon DP.

16 Markovian Decision Problem
Shorthand notation for the DP mappings:
    (TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_{ij}(u) ( g(i, u, j) + α J(j) ),  i = 1, ..., n
    (T_μ J)(i) = Σ_{j=1}^{n} p_{ij}(μ(i)) ( g(i, μ(i), j) + α J(j) ),  i = 1, ..., n
Vector form of the DP mappings:
    TJ = min_μ { g_μ + α P_μ J },   T_μ J = g_μ + α P_μ J
where
    g_μ(i) = Σ_{j=1}^{n} p_{ij}(μ(i)) g(i, μ(i), j),   P_μ(i, j) = p_{ij}(μ(i))
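As a concrete illustration, here is a minimal sketch of the mappings T and T_μ for a small finite MDP. The array layout (transition tensor P[u, i, j] and cost tensor g[u, i, j]) and the function names are assumptions made for this example.

```python
import numpy as np

def bellman_T(J, P, g, alpha):
    """(TJ)(i) = min_u sum_j P[u,i,j] * (g[u,i,j] + alpha * J[j])."""
    # Q[u, i] = expected stage cost plus discounted continuation under action u at state i
    Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J)
    return Q.min(axis=0)

def bellman_T_mu(J, P, g, alpha, mu):
    """(T_mu J)(i) = g_mu(i) + alpha * (P_mu J)(i) for a stationary policy mu (action indices)."""
    n = len(J)
    P_mu = P[mu, np.arange(n), :]                              # row i is p_{i.}(mu(i))
    g_mu = np.einsum('ij,ij->i', P_mu, g[mu, np.arange(n), :]) # expected stage cost under mu
    return g_mu + alpha * P_mu @ J
```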

17 Two Key Properties
Monotonicity property: for any J and J′ such that J(x) ≤ J′(x) for all x, and any μ,
    (TJ)(x) ≤ (TJ′)(x),   (T_μ J)(x) ≤ (T_μ J′)(x),  for all x
Constant shift property: for any J, any scalar r, and any μ,
    (T(J + re))(x) = (TJ)(x) + αr,   (T_μ(J + re))(x) = (T_μ J)(x) + αr,  for all x
where e is the unit function [e(x) ≡ 1].
Monotonicity is present in all DP models (undiscounted, etc.).
Constant shift is special to discounted models.
Discounted problems have another property of major importance: T and T_μ are contraction mappings (we will show this later).

18 Convergence of Value Iteration
Theorem. For all bounded J_0, we have J*(x) = lim_{k→∞} (T^k J_0)(x), for all x.
Proof. For simplicity we give the proof for J_0 ≡ 0. For any initial state x_0 and policy π = {μ_0, μ_1, ...},
    J_π(x_0) = E[ Σ_{l=0}^{∞} α^l g(x_l, μ_l(x_l), w_l) ]
             = E[ Σ_{l=0}^{k-1} α^l g(x_l, μ_l(x_l), w_l) ] + E[ Σ_{l=k}^{∞} α^l g(x_l, μ_l(x_l), w_l) ]
The tail portion satisfies
    | E[ Σ_{l=k}^{∞} α^l g(x_l, μ_l(x_l), w_l) ] | ≤ α^k M / (1 − α)
where M ≥ |g(x, u, w)|. Take the minimum over π on both sides, then take the limit as k → ∞.

19 Proof of Bellman's Equation
Theorem. The optimal cost function J* is a solution of Bellman's equation J* = TJ*, i.e., for all x,
    J*(x) = min_{u ∈ U(x)} E_w{ g(x, u, w) + α J*(f(x, u, w)) }
Proof. For all x and k,
    J*(x) − α^k M / (1 − α) ≤ (T^k J_0)(x) ≤ J*(x) + α^k M / (1 − α)
where J_0(x) ≡ 0 and M ≥ |g(x, u, w)|. Applying T to this relation, and using monotonicity and constant shift,
    (TJ*)(x) − α^{k+1} M / (1 − α) ≤ (T^{k+1} J_0)(x) ≤ (TJ*)(x) + α^{k+1} M / (1 − α)
Taking the limit as k → ∞ and using the fact that lim_{k→∞} (T^{k+1} J_0)(x) = J*(x), we obtain J* = TJ*.

20 The Contraction Property
Contraction property: for any bounded functions J and J′, and any μ,
    max_x |(TJ)(x) − (TJ′)(x)| ≤ α max_x |J(x) − J′(x)|
    max_x |(T_μ J)(x) − (T_μ J′)(x)| ≤ α max_x |J(x) − J′(x)|
Proof. Denote c = max_{x ∈ S} |J(x) − J′(x)|. Then
    J(x) − c ≤ J′(x) ≤ J(x) + c,  for all x
Apply T to both sides, and use the monotonicity and constant shift properties:
    (TJ)(x) − αc ≤ (TJ′)(x) ≤ (TJ)(x) + αc,  for all x
Hence |(TJ)(x) − (TJ′)(x)| ≤ αc for all x.
This implies that T and T_μ have unique fixed points. Then J* is the unique solution of J = TJ, and J_μ is the unique solution of J_μ = T_μ J_μ.

21 Necessary and Sufficient Optimality Condition
Theorem. A stationary policy μ is optimal if and only if μ(x) attains the minimum in Bellman's equation for each x; i.e.,
    TJ* = T_μ J*
or, equivalently, for all x,
    μ(x) ∈ arg min_{u ∈ U(x)} E_w{ g(x, u, w) + α J*(f(x, u, w)) }

22 Proof of Optimality Condition
Proof. There are two directions.
If TJ* = T_μ J*, then using Bellman's equation (J* = TJ*), we have J* = T_μ J*, so by uniqueness of the fixed point of T_μ we obtain J* = J_μ; i.e., μ is optimal.
Conversely, if the stationary policy μ is optimal, we have J* = J_μ, so J* = T_μ J*. Combining this with Bellman's equation (J* = TJ*), we obtain TJ* = T_μ J*.

23 Two Main Algorithms
Value iteration: solve the Bellman equation J* = TJ* by iterating on the value functions, J_{k+1} = TJ_k, or
    J_{k+1}(i) = min_u Σ_{j=1}^{n} p_{ij}(u) ( g(i, u, j) + α J_k(j) ),  i = 1, ..., n
The program only needs to store the current value function J_k.
We have shown that J_k → J* as k → ∞.
Policy iteration: solve the Bellman equation J* = TJ* by iterating on the policies.
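A minimal value iteration sketch, under the same assumed array layout as before (P[u, i, j] transition probabilities, g[u, i, j] stage costs); the stopping tolerance is a choice made for the example.

```python
import numpy as np

def value_iteration(P, g, alpha, tol=1e-8):
    """Iterate J_{k+1} = T J_k until the sup-norm change is below tol."""
    n = P.shape[1]
    J = np.zeros(n)
    while True:
        Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J)
        J_new = Q.min(axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new, Q.argmin(axis=0)   # value function and a greedy policy
        J = J_new
```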

24 Policy Iteration (PI)
Given μ^k, the k-th policy iteration has two steps.
Policy evaluation: find J_{μ^k} satisfying J_{μ^k} = T_{μ^k} J_{μ^k}, i.e., solve
    J_{μ^k}(i) = Σ_{j=1}^{n} p_{ij}(μ^k(i)) ( g(i, μ^k(i), j) + α J_{μ^k}(j) ),  i = 1, ..., n
Policy improvement: let μ^{k+1} be such that T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}, or
    μ^{k+1}(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_{ij}(u) ( g(i, u, j) + α J_{μ^k}(j) )
Policy iteration is a method that updates the policy instead of the value function.

25 Policy Iteration (PI)
More abstractly, the k-th policy iteration has two steps.
Policy evaluation: find J_{μ^k} by solving J_{μ^k} = T_{μ^k} J_{μ^k} = g_{μ^k} + α P_{μ^k} J_{μ^k}
Policy improvement: let μ^{k+1} be such that T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
Comments:
Policy evaluation is equivalent to solving an n × n linear system of equations.
Policy improvement is equivalent to a 1-step lookahead using the evaluated value function.
For large n, exact PI is out of the question. We use instead optimistic PI (policy evaluation with a few VIs).
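A sketch of exact policy iteration under the same assumptions: evaluation solves the n × n linear system (I − αP_μ)J = g_μ directly, and improvement is the one-step lookahead.

```python
import numpy as np

def policy_iteration(P, g, alpha):
    """Exact PI for a finite MDP with P[u, i, j] and g[u, i, j]."""
    n = P.shape[1]
    mu = np.zeros(n, dtype=int)                  # arbitrary initial policy
    while True:
        P_mu = P[mu, np.arange(n), :]
        g_mu = np.einsum('ij,ij->i', P_mu, g[mu, np.arange(n), :])
        J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)      # policy evaluation
        Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J_mu)
        mu_new = Q.argmin(axis=0)                                    # policy improvement
        if np.array_equal(mu_new, mu):
            return mu, J_mu
        mu = mu_new
```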

26 Convergence of Policy Iteration
Theorem. Assume that the state and action spaces are finite. Policy iteration generates policies μ^k that converge to an optimal policy μ* in a finite number of steps.
Proof. We show that J_{μ^k} ≥ J_{μ^{k+1}} for all k. For given k, we have
    J_{μ^k} = T_{μ^k} J_{μ^k} ≥ T J_{μ^k} = T_{μ^{k+1}} J_{μ^k}
Using the monotonicity property of DP,
    J_{μ^k} ≥ T_{μ^{k+1}} J_{μ^k} ≥ T_{μ^{k+1}}^2 J_{μ^k} ≥ ... ≥ lim_{N→∞} T_{μ^{k+1}}^N J_{μ^k}
Since lim_{N→∞} T_{μ^{k+1}}^N J_{μ^k} = J_{μ^{k+1}}, we have J_{μ^k} ≥ J_{μ^{k+1}}.
If J_{μ^k} = J_{μ^{k+1}}, all the above inequalities hold as equalities, so J_{μ^k} solves Bellman's equation; hence J_{μ^k} = J*.
Thus at each iteration k the algorithm either generates a strictly improved policy or finds an optimal policy. For a finite-space MDP, the algorithm terminates with an optimal policy.

27 Shorthand Theory: A Summary
Infinite-horizon cost function expressions [with J_0(x) ≡ 0]:
    J_π(x) = lim_{N→∞} (T_{μ_0} T_{μ_1} ··· T_{μ_{N-1}} J_0)(x),   J_μ(x) = lim_{N→∞} (T_μ^N J_0)(x)
Bellman's equation: J* = TJ*,  J_μ = T_μ J_μ
Optimality condition: μ is optimal iff T_μ J* = TJ*
Value iteration: for any bounded J,
    J*(x) = lim_{k→∞} (T^k J)(x),  for all x
Policy iteration: given μ^k,
    Policy evaluation: find J_{μ^k} by solving J_{μ^k} = T_{μ^k} J_{μ^k}
    Policy improvement: let μ^{k+1} be such that T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}

28 Q-Factors
Optimal Q-factor of (x, u):
    Q*(x, u) = E{ g(x, u, w) + α J*(f(x, u, w)) }
It is the cost of starting at x, applying u in the first stage, and following an optimal policy after the first stage.
The optimal value function satisfies
    J*(x) = min_{u ∈ U(x)} Q*(x, u),  for all x
Q-factors are costs in an augmented problem where the states are the pairs (x, u).
Here (x, u) is a post-decision state.

29 VI in Q-Factors
We can equivalently write the VI method as
    J_{k+1}(x) = min_{u ∈ U(x)} Q_{k+1}(x, u),  for all x
where Q_{k+1} is generated by
    Q_{k+1}(x, u) = E[ g(x, u, w) + α min_{v ∈ U(f(x, u, w))} Q_k(f(x, u, w), v) ]
VI converges for Q-factors.
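A minimal sketch of Q-factor value iteration for the finite MDP case, again under the assumed P[u, i, j], g[u, i, j] layout; the fixed iteration count is a simplification for the example.

```python
import numpy as np

def q_value_iteration(P, g, alpha, iters=1000):
    """Q_{k+1}(i, u) = sum_j P[u,i,j] * (g[u,i,j] + alpha * min_v Q_k(j, v)).
    Returns Q with shape (n_actions, n_states)."""
    m, n, _ = P.shape
    Q = np.zeros((m, n))
    for _ in range(iters):
        J = Q.min(axis=0)                  # J_k(j) = min_v Q_k(j, v)
        Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J)
    return Q
```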

30 Q-Factors
VI and PI for Q-factors are mathematically equivalent to VI and PI for costs.
They require an equal amount of computation; they just need more storage.
Having the optimal Q-factors is convenient when implementing an optimal policy on-line by
    μ*(x) = arg min_{u ∈ U(x)} Q*(x, u)
Once the Q*(x, u) are known, the model [g and E{·}] is not needed: model-free operation.
Q-learning (to be discussed later) is a sampling method that calculates Q*(x, u) using a simulator of the system (no model needed).

31 MDP and Q-Factors
Optimal Q-factors: the function Q*(i, u) satisfies the Bellman equation
    Q*(i, u) = Σ_{j=1}^{n} p_{ij}(u) ( g(i, u, j) + α min_{v ∈ U(j)} Q*(j, v) )
or, in shorthand, Q* = FQ*.
Interpretation: Q-factors can be viewed as J values by considering (i, u) as the post-decision state.
DP algorithms for Q-values instead of J-values:
    Value iteration: Q_{k+1} = F Q_k
    Policy evaluation: Q_{μ^k} = F_{μ^k} Q_{μ^k}
    Policy improvement: F_{μ^{k+1}} Q_{μ^k} = F Q_{μ^k}
VI and PI are convergent for Q-values. Model-free.

32 Other DP Models
We have looked so far at the (discrete- or continuous-space) discounted models, for which the analysis is simplest and the results are most powerful. Other DP models include:
Undiscounted problems (α = 1): they may include a special termination state (stochastic shortest path problems).
Continuous-time finite-state MDPs: the time between transitions is random and state-and-control-dependent (typical in queueing systems; called semi-Markov MDPs). These can be viewed as discounted problems with state-and-control-dependent discount factors.
Continuous-time, continuous-space models: classical automatic control, process control, robotics. There are substantial differences from discrete time, and a mathematically more complex theory (particularly for stochastic problems); deterministic versions can be analyzed using classical optimal control theory.

33 Outline
1 Review of Finite-Horizon DP
2 Infinite-Horizon DP: Theory and Algorithms
3 A Primer on ADP
4 Dimension Reduction in RL: Approximation in value space; Approximation in policy space; State Aggregation
5 On-Policy Learning: Direct Projection; Bellman Error Minimization; Projected Bellman Equation Method; From On-Policy to Off-Policy

34 Practical Difficulties of DP
The curse of dimensionality:
    Exponential growth of the computational and storage requirements as the number of state and control variables increases.
    Quick explosion of the number of states in combinatorial problems.
The curse of modeling:
    Sometimes a simulator of the system is easier to construct than a model.
There may be real-time solution constraints:
    A family of problems may need to be addressed, with the data of the problem to be solved given with little advance notice.
    The problem data may change as the system is controlled, creating a need for on-line replanning.
All of the above are motivations for approximation and simulation.

35 General Orientation to ADP
ADP (late 1980s - present) is a breakthrough methodology that allows the application of DP to problems with many or an infinite number of states.
Other names for ADP are: reinforcement learning (RL), neuro-dynamic programming (NDP), adaptive dynamic programming.
We will mainly adopt an n-state discounted model (the easiest case, but think of HUGE n). Extensions to other DP models (continuous space, continuous time, undiscounted) are possible (but more quirky); we set these aside for later.
There are many approaches: problem approximation, and simulation-based approaches. Simulation-based methods are of three types: rollout (we will not discuss it further), approximation in value space, and approximation in policy space.

36 Why Do We Use Simulation?
One reason: a computational complexity advantage in computing sums/expectations involving a very large number of terms.
Any sum Σ_{i=1}^{n} a_i can be written as an expected value:
    Σ_{i=1}^{n} a_i = Σ_{i=1}^{n} ξ_i (a_i / ξ_i) = E_ξ[ a_i / ξ_i ]
where ξ is any probability distribution over {1, ..., n}.
It can be approximated by generating many samples {i_1, ..., i_k} from {1, ..., n} according to the distribution ξ, and Monte Carlo averaging:
    Σ_{i=1}^{n} a_i = E_ξ[ a_i / ξ_i ] ≈ (1/k) Σ_{t=1}^{k} a_{i_t} / ξ_{i_t}
Simulation is also convenient when an analytical model of the system is unavailable, but a simulation/computer model is possible.
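A small sketch of this Monte Carlo averaging idea; the terms a_i, the sampling distribution ξ, and the sample size are all arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
a = rng.standard_normal(n)              # terms a_1,...,a_n of the sum (hypothetical data)
xi = np.full(n, 1.0 / n)                # sampling distribution xi over {1,...,n} (uniform here)

k = 10_000
idx = rng.choice(n, size=k, p=xi)       # samples i_1,...,i_k drawn from xi
estimate = np.mean(a[idx] / xi[idx])    # (1/k) * sum_t a_{i_t} / xi_{i_t}
print(estimate, a.sum())                # Monte Carlo estimate vs. exact sum
```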

37 Solve DP via Simulation
Ideally, VI and PI solve the fixed-point equation: find J* such that
    J* = min_μ { g_μ + α P_μ J* }
Practically, we often wish to solve Bellman's equation without knowing P_μ and g_μ.
What we do have: a simulator that, starting from state i and given action a, generates random samples of the transition cost and the next state:
    g(i, i_next, a),  i_next
Example: optimize a trading policy to maximize profit. The current transaction has an unknown market impact; use the current order book as states/features.
Example: stochastic games, Tetris — hundreds of millions of states, captured using 22 features.

38 Outline
1 Review of Finite-Horizon DP
2 Infinite-Horizon DP: Theory and Algorithms
3 A Primer on ADP
4 Dimension Reduction in RL: Approximation in value space; Approximation in policy space; State Aggregation
5 On-Policy Learning: Direct Projection; Bellman Error Minimization; Projected Bellman Equation Method; From On-Policy to Off-Policy

39 Approximation in Value Space
Approximate J* or J_μ from a parametric class J̃(i; σ), where i is the current state and σ = (σ_1, ..., σ_m) is a vector of tunable scalars (weights).
Use J̃ in place of J* or J_μ in various algorithms and computations.
Role of σ: by adjusting σ we can change the "shape" of J̃ so that it is close to J* or J_μ.
A simulator may be used, particularly when there is no mathematical model of the system (but there is a computer model).
We will focus on simulation, but this is not the only possibility.
We may also use parametric approximation for Q-factors or cost function differences.

40 Approximation Architectures
Two key issues:
    The choice of the parametric class J̃(i; σ) (the approximation architecture).
    The method for tuning the weights ("training" the architecture).
Success depends strongly on how these issues are handled, and also on insight about the problem.
Architectures are divided into linear and nonlinear [i.e., linear or nonlinear dependence of J̃(i; σ) on σ].
Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer.

41 Computer Chess Example
Think of the board position as the state and the move as the control.
Uses a feature-based position evaluator that assigns a score (an approximate Q-factor) to each position/move.
Relatively few special features and weights, and multistep lookahead.

42 Linear Approximation Architectures
With well-chosen features, we can use a linear architecture:
    J̃(i; σ) = φ(i)′σ,  i = 1, ..., n,   or   J̃_σ = Φσ = Σ_{j=1}^{s} Φ_j σ_j
Φ: the matrix whose rows are φ(i)′, i = 1, ..., n; Φ_j is its j-th column.
This is approximation on the subspace
    S = { Φσ | σ ∈ R^s }
spanned by the columns of Φ (the basis functions).

43 Linear Approximation Architectures
Often, the features encode much of the nonlinearity inherent in the cost function being approximated.
Many examples of feature types: polynomial approximation, radial basis functions, etc.

44 Example: Polynomial Type
Polynomial approximation, e.g., a quadratic approximating function. Let the state be i = (i_1, ..., i_q) (i.e., it has q "dimensions") and define
    φ_0(i) = 1,  φ_k(i) = i_k,  φ_{km}(i) = i_k i_m,  k, m = 1, ..., q
Linear approximation architecture:
    J̃(i; σ) = σ_0 + Σ_{k=1}^{q} σ_k i_k + Σ_{k=1}^{q} Σ_{m=k}^{q} σ_{km} i_k i_m
where σ has components σ_0, σ_k, and σ_{km}.
Interpolation: a subset I of special/representative states is selected, and the parameter vector σ has one component σ_i per state i ∈ I. The approximating function is
    J̃(i; σ) = σ_i,  i ∈ I
    J̃(i; σ) = interpolation using the values at i ∈ I,  i ∉ I
For example, piecewise constant, piecewise linear, or more general polynomial interpolations.
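A minimal sketch of the quadratic feature map above; the function name is hypothetical and the state is taken to be a numeric vector of length q.

```python
import numpy as np

def quadratic_features(state):
    """Feature vector [1, i_1,...,i_q, i_k*i_m for k <= m] for a state i = (i_1,...,i_q)."""
    i = np.asarray(state, dtype=float)
    q = len(i)
    quad = [i[k] * i[m] for k in range(q) for m in range(k, q)]
    return np.concatenate(([1.0], i, quad))

sigma_dim = quadratic_features(np.zeros(3)).size   # 1 + q + q(q+1)/2 = 10 for q = 3
print(sigma_dim)
```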

45 Another Example: Tetris
J*(i): optimal score starting from position i.
The number of states is larger than 2^200 (for a 10 × 20 board).
Success with just 22 features, readily recognized by Tetris players as capturing important aspects of the board position (heights of columns, etc.).

46 Approximation in Policy Space
A brief discussion; we will return to it later.
Use a parametrization μ(i; σ) of policies with a vector σ = (σ_1, ..., σ_s). Examples:
    Polynomial, e.g., μ(i; σ) = σ_1 + σ_2 i + σ_3 i^2
    Linear feature-based, μ(i; σ) = φ_1(i) σ_1 + φ_2(i) σ_2

47 Approximation in Policy Space
Optimize the cost over σ. For example: each value of σ defines a stationary policy, with cost starting at state i denoted by J̃(i; σ). Let (p_1, ..., p_n) be some probability distribution over the states, and minimize over σ:
    Σ_{i=1}^{n} p_i J̃(i; σ)
Use a random search, gradient, or other method.
A special case: the parameterization of the policies is indirect, through a cost approximation architecture J̃, i.e.,
    μ(i; σ) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_{ij}(u) ( g(i, u, j) + α J̃(j; σ) )

48 Aggregation
A first idea: group similar states together into aggregate states x_1, ..., x_s; assign a common cost value σ_i to each group x_i.
Solve an aggregate DP problem, involving the aggregate states, to obtain σ = (σ_1, ..., σ_s).
This is called hard aggregation.

49 Aggregation
A more general/mathematical view: solve
    Φσ = Φ D T_μ(Φσ)
where the rows of D and Φ are probability distributions (e.g., D and Φ "aggregate" the rows and columns of the linear system J = T_μ J).
Compare with the projected equation Φσ = Π T_μ(Φσ).
Note: ΦD is a projection in some interesting cases.

50 Aggregation as Problem Abstraction
Aggregation can be viewed as a systematic approach for problem approximation. Main elements:
Solve (exactly or approximately) the "aggregate" problem by any kind of VI or PI method (including simulation-based methods).
Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem.

51 Aggregate System Description
The transition probability from aggregate state x to aggregate state y under control u is
    p̂_{xy}(u) = Σ_{i=1}^{n} d_{xi} Σ_{j=1}^{n} p_{ij}(u) φ_{jy},   or   P̂(u) = D P(u) Φ
where the rows of D and Φ are the disaggregation and aggregation probabilities.
The expected transition cost is
    ĝ(x, u) = Σ_{i=1}^{n} d_{xi} Σ_{j=1}^{n} p_{ij}(u) g(i, u, j),   or   ĝ = D P(u) g
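A minimal sketch of forming the aggregate model for one control u; the matrix layout (P_u and g_u as n × n arrays, D as s × n, Phi as n × s) and the function name are assumptions for this example.

```python
import numpy as np

def aggregate_model(P_u, g_u, D, Phi):
    """Aggregate dynamics for a fixed control u:
       P_hat = D @ P_u @ Phi (s x s),  g_hat[x] = sum_i d_xi sum_j p_ij(u) g(i,u,j)."""
    P_hat = D @ P_u @ Phi
    g_hat = D @ np.sum(P_u * g_u, axis=1)     # expected cost per original state, disaggregated
    return P_hat, g_hat
```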

52 Aggregate Bellman's Equation
The optimal cost function of the aggregate problem, denoted R̂, satisfies
    R̂(x) = min_{u ∈ U} [ ĝ(x, u) + α Σ_y p̂_{xy}(u) R̂(y) ],  for all x
This is Bellman's equation for the aggregate problem.
The optimal cost function J* of the original problem is approximated by J̃ given by
    J̃(j) = Σ_y φ_{jy} R̂(y),  for all j

53 Example I: Hard Aggregation
Group the original system states into subsets, and view each subset as an aggregate state.
Aggregation probabilities: φ_{jy} = 1 if j belongs to aggregate state y.
Disaggregation probabilities: there are many possibilities; e.g., all states i within aggregate state x have equal probability d_{xi}.
If the optimal cost vector J* is piecewise constant over the aggregate states/subsets, hard aggregation is exact. This suggests grouping states with roughly equal cost into aggregates.
A variant: soft aggregation (provides "soft boundaries" between aggregate states).

54 Example II: Feature-Based Aggregation
Important question: how do we group states together?
If we know good features, it makes sense to group together states that have similar features.
This gives a general approach for passing from a feature-based state representation to a hard aggregation-based architecture: essentially, discretize the features and generate a corresponding piecewise constant approximation to the optimal cost function.
An aggregation-based architecture is more powerful (it is nonlinear in the features), but it may require many more aggregate states to reach the same level of performance as the corresponding linear feature-based architecture.

55 Example III: Representative States / Coarse Grid
Choose a collection of "representative" original system states, and associate each of them with an aggregate state.
Disaggregation probabilities: d_{xi} = 1 if i is equal to representative state x.
Aggregation probabilities associate original system states with convex combinations of representative states:
    j ~ Σ_{y ∈ A} φ_{jy} y
Well suited for discretization of a Euclidean space.
Extends nicely to continuous state spaces, including the belief space of a POMDP.

56 Feature Extraction is Linear Approximation of a High-Dimensional Cost Vector

57 Outline
1 Review of Finite-Horizon DP
2 Infinite-Horizon DP: Theory and Algorithms
3 A Primer on ADP
4 Dimension Reduction in RL: Approximation in value space; Approximation in policy space; State Aggregation
5 On-Policy Learning: Direct Projection; Bellman Error Minimization; Projected Bellman Equation Method; From On-Policy to Off-Policy

58 Direct Policy Evaluation
Approximate the cost of the current policy by using least squares and simulation-generated cost samples.
This amounts to projection of J_μ onto the approximation subspace.
Solution by least squares methods.
Regular and optimistic policy iteration.
Nonlinear approximation architectures may also be used.

59 Direct Evaluation by Simulation
Projection by Monte Carlo simulation: compute the projection Π J_μ of J_μ on the subspace S = { Φσ | σ ∈ R^s }, with respect to a weighted Euclidean norm ‖·‖_ξ.
Equivalently, find Φσ*, where
    σ* = arg min_{σ ∈ R^s} ‖Φσ − J_μ‖_ξ^2 = arg min_{σ ∈ R^s} Σ_{i=1}^{n} ξ_i ( φ(i)′σ − J_μ(i) )^2
Setting the gradient at σ* to 0,
    σ* = ( Σ_{i=1}^{n} ξ_i φ(i) φ(i)′ )^{-1} Σ_{i=1}^{n} ξ_i φ(i) J_μ(i)

60 Direct Evaluation by Simulation
Generate samples {(i_1, J_μ(i_1)), ..., (i_k, J_μ(i_k))} using the distribution ξ.
Approximate by Monte Carlo the two expected values with low-dimensional calculations:
    σ̂_k = ( Σ_{t=1}^{k} φ(i_t) φ(i_t)′ )^{-1} Σ_{t=1}^{k} φ(i_t) J_μ(i_t)
Equivalent least squares alternative calculation:
    σ̂_k = arg min_{σ ∈ R^s} Σ_{t=1}^{k} ( φ(i_t)′σ − J_μ(i_t) )^2

61 Convergence of the Evaluated Policy
By the law of large numbers, we have
    (1/k) Σ_{t=1}^{k} φ(i_t) φ(i_t)′ → Σ_{i=1}^{n} ξ_i φ(i) φ(i)′   a.s.
    (1/k) Σ_{t=1}^{k} φ(i_t) J_μ(i_t) → Σ_{i=1}^{n} ξ_i φ(i) J_μ(i)   a.s.
Hence σ̂_k → σ* a.s., where Φσ* = Π_S J_μ.
As the number of samples increases, the estimated low-dimensional cost σ̂_k converges almost surely to (the weights of) the projected J_μ.
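A minimal sketch of this direct, least-squares evaluation from simulation-generated cost samples; the function name, the sample format, and the user-supplied feature_map are assumptions for the example.

```python
import numpy as np

def direct_evaluation(samples, feature_map):
    """Least-squares fit of sigma from samples [(state, observed cost J_mu(state)), ...].
    feature_map(state) returns the feature vector phi(state)."""
    Phi = np.array([feature_map(i) for i, _ in samples])   # k x s feature matrix
    y = np.array([c for _, c in samples])                  # sampled costs J_mu(i_t)
    sigma_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # argmin_sigma ||Phi sigma - y||^2
    return sigma_hat
```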

62 Indirect Policy Evaluation
Solve the projected equation Φσ = Π T_μ(Φσ), where Π is the projection with respect to a suitable weighted Euclidean norm.
Solution methods that use simulation (to manage the calculation of Π):
TD(λ): a stochastic iterative algorithm for solving Φσ = Π T_μ(Φσ).
LSTD(λ): solves a simulation-based approximation with a standard solver.
LSPE(λ): a simulation-based form of "projected value iteration"; essentially
    Φσ_{k+1} = Π T_μ(Φσ_k) + simulation noise
Almost sure convergence guarantees.

63 Bellman Error Minimization
Another example of indirect approximate policy evaluation:
    min_σ ‖Φσ − T_μ(Φσ)‖_ξ^2     (*)
where ‖·‖_ξ is a Euclidean norm, weighted with respect to some distribution ξ.
It is closely related to the projected equation/Galerkin approach (with a special choice of projection norm).
There are several ways to implement projected equation and Bellman error methods by simulation. They involve:
Generating many random samples of states i_k using the distribution ξ.
Generating many samples of transitions (i_k, j_k) using the policy μ.
Forming a simulation-based approximation of the optimality condition for the projection problem or problem (*) (using sample averages in place of inner products).
Solving the Monte Carlo approximation of the optimality condition.
Issues for indirect methods: how to generate the samples? how to calculate σ efficiently?

64 Cost Function Approximation via Projected Equations
Ideally, we want to solve the Bellman equation (for a fixed policy μ) J = T_μ J. In an MDP, the equation is n × n:
    J = g_μ + α P_μ J
We solve a projected version of this high-dimensional equation:
    J = Π( g_μ + α P_μ J )
Since the projection Π is onto the space spanned by the columns of Φ, the projected equation is equivalent to
    Φσ = Π( g_μ + α P_μ Φσ )
We fix the policy μ from now on, and omit mentioning it.

65 Matrix Form of the Projected Equation
The solution Φσ* satisfies the orthogonality condition: the error Φσ* − (g + αPΦσ*) is orthogonal to the subspace spanned by the columns of Φ. This is written as
    Φ′ Ξ ( Φσ* − (g + α P Φσ*) ) = 0
where Ξ is the diagonal matrix with the steady-state probabilities ξ_1, ..., ξ_n along the diagonal.
Equivalently, Cσ* = d, where
    C = Φ′ Ξ (I − αP) Φ,   d = Φ′ Ξ g
but computing C and d is HARD (high-dimensional inner products).

66 Simulation-Based Implementations
Key idea: calculate simulation-based approximations based on k samples:
    C_k ≈ C,   d_k ≈ d
The matrix inversion σ* = C^{-1} d is approximated by
    σ̂_k = C_k^{-1} d_k
This is the LSTD (Least Squares Temporal Differences) method.
Key fact: C_k and d_k can be computed with low-dimensional linear algebra (of order s, the number of basis functions).

67 Simulation Mechanics
We generate an infinitely long trajectory (i_0, i_1, ...) of the Markov chain, so that states i and transitions (i, j) appear with long-term frequencies ξ_i and p_{ij}.
After generating each transition (i_t, i_{t+1}), we compute the row φ(i_t)′ of Φ and the cost component g(i_t, i_{t+1}). We form
    d_k = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) g(i_t, i_{t+1}) ≈ Σ_{i,j} ξ_i p_{ij} φ(i) g(i, j) = Φ′ Ξ g = d
    C_k = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) ( φ(i_t) − α φ(i_{t+1}) )′ ≈ Φ′ Ξ (I − αP) Φ = C
Convergence is based on the law of large numbers: C_k → C and d_k → d almost surely.
As the sample size increases, σ̂_k converges a.s. to the solution of the projected Bellman equation.
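A minimal LSTD(0) sketch built directly from these formulas; the input format (a state trajectory, the observed transition costs, and a user-supplied feature_map) is an assumption for the example, and the 1/(k+1) normalization is dropped since it cancels in the linear solve.

```python
import numpy as np

def lstd(trajectory, costs, feature_map, alpha):
    """LSTD(0) estimate from one simulated trajectory.
    trajectory: states i_0,...,i_k; costs[t] = g(i_t, i_{t+1}); feature_map(i) = phi(i)."""
    s = len(feature_map(trajectory[0]))
    C_k = np.zeros((s, s))
    d_k = np.zeros(s)
    for t in range(len(trajectory) - 1):
        phi_t = feature_map(trajectory[t])
        phi_next = feature_map(trajectory[t + 1])
        C_k += np.outer(phi_t, phi_t - alpha * phi_next)
        d_k += phi_t * costs[t]
    return np.linalg.solve(C_k, d_k)    # sigma_hat_k = C_k^{-1} d_k (normalization cancels)
```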

68 Approximate PI via On-Policy Learning
Outer loop (off-policy RL): estimate the value function of the current policy μ_t using linear features, J_{μ_t} ≈ Φσ_t.
Inner loop (on-policy RL): generate state trajectories and estimate σ_t via Bellman error minimization (or direct projection, or the projected equation approach).
Update the policy by
    μ_{t+1}(i) = arg min_a Σ_j p_{ij}(a) ( g(i, a, j) + α φ(j)′σ_t )
Comments:
This requires knowledge of p_{ij} (suitable for computer games with known transitions).
The policy μ_{t+1} is parameterized by σ_t.
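A sketch of this approximate PI loop under stated assumptions: the evaluation step uses an LSTD-style projected-equation estimate from a trajectory simulated under the current policy, and the improvement step uses the known model P[u, i, j], g[u, i, j] with the feature matrix Phi (n × s). Function name, iteration counts, and trajectory length are all choices made for the example.

```python
import numpy as np

def approximate_pi(P, g, Phi, alpha, n_iters=20, traj_len=20_000, seed=0):
    """Approximate policy iteration: simulate under mu, fit sigma, then improve greedily."""
    rng = np.random.default_rng(seed)
    n = P.shape[1]
    s = Phi.shape[1]
    mu = np.zeros(n, dtype=int)
    for _ in range(n_iters):
        # Evaluation: accumulate C_k, d_k along a trajectory generated under mu
        C, d = np.zeros((s, s)), np.zeros(s)
        i = rng.integers(n)
        for _ in range(traj_len):
            j = rng.choice(n, p=P[mu[i], i])
            C += np.outer(Phi[i], Phi[i] - alpha * Phi[j])
            d += Phi[i] * g[mu[i], i, j]
            i = j
        sigma = np.linalg.solve(C, d)
        # Improvement: one-step lookahead with respect to J_tilde = Phi @ sigma
        J_tilde = Phi @ sigma
        Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J_tilde)
        mu = Q.argmin(axis=0)
    return mu, Phi @ sigma
```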

69 Approximate PI via On-Policy Learning
Use simulation to approximate the cost J_μ of the current policy μ.
Generate an "improved" policy μ̄ by minimizing in the (approximate) Bellman equation.
Alternatively, we can approximate the Q-factors of μ.

70 Theoretical Basis of Approximate PI
If the policies are approximately evaluated using an approximation architecture such that
    max_i | J̃(i, σ_k) − J_{μ^k}(i) | ≤ d,  k = 0, 1, ...
and if the policy improvement is also approximate,
    max_i | (T_{μ^{k+1}} J̃)(i, σ_k) − (T J̃)(i, σ_k) | ≤ ε,  k = 0, 1, ...
Error bound: the sequence {μ^k} generated by approximate policy iteration satisfies
    lim sup_{k→∞} max_i ( J_{μ^k}(i) − J*(i) ) ≤ (ε + 2αd) / (1 − α)^2
Typical practical behavior: the method makes steady progress up to a point, and then the iterates J_{μ^k} oscillate within a neighborhood of J*.
In practice, oscillation between policies is probably not the major concern.


More information

Reinforcement Learning II. George Konidaris

Reinforcement Learning II. George Konidaris Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2017 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.

More information

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study CS 287: Advanced Robotics Fall 2009 Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study Pieter Abbeel UC Berkeley EECS Assignment #1 Roll-out: nice example paper: X.

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Machine Learning I Reinforcement Learning

Machine Learning I Reinforcement Learning Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:

More information

Reinforcement Learning II. George Konidaris

Reinforcement Learning II. George Konidaris Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2018 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes

More information

Some AI Planning Problems

Some AI Planning Problems Course Logistics CS533: Intelligent Agents and Decision Making M, W, F: 1:00 1:50 Instructor: Alan Fern (KEC2071) Office hours: by appointment (see me after class or send email) Emailing me: include CS533

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

On Average Versus Discounted Reward Temporal-Difference Learning

On Average Versus Discounted Reward Temporal-Difference Learning Machine Learning, 49, 179 191, 2002 c 2002 Kluwer Academic Publishers. Manufactured in The Netherlands. On Average Versus Discounted Reward Temporal-Difference Learning JOHN N. TSITSIKLIS Laboratory for

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

Projected Equation Methods for Approximate Solution of Large Linear Systems 1

Projected Equation Methods for Approximate Solution of Large Linear Systems 1 June 2008 J. of Computational and Applied Mathematics (to appear) Projected Equation Methods for Approximate Solution of Large Linear Systems Dimitri P. Bertsekas 2 and Huizhen Yu 3 Abstract We consider

More information

University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming

University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming University of Warwick, EC9A0 Maths for Economists 1 of 63 University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming Peter J. Hammond Autumn 2013, revised 2014 University of

More information

Robotics. Control Theory. Marc Toussaint U Stuttgart

Robotics. Control Theory. Marc Toussaint U Stuttgart Robotics Control Theory Topics in control theory, optimal control, HJB equation, infinite horizon case, Linear-Quadratic optimal control, Riccati equations (differential, algebraic, discrete-time), controllability,

More information