Infinite-Horizon Dynamic Programming


1 Infinite-Horizon Dynamic Programming
Acknowledgement: these slides are based on Prof. Mengdi Wang's and Prof. Dimitri Bertsekas' lecture notes.

2 Homework
1) Read the following chapters of Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction:
   Chapter 2: Multi-armed Bandits
   Chapter 3: Finite Markov Decision Processes
   Chapter 4: Dynamic Programming
   Chapter 5: Monte Carlo Methods
   Chapter 6: Temporal-Difference Learning
   Chapter 9: On-policy Prediction with Approximation
   Chapter 10: On-policy Control with Approximation
   Chapter 13: Policy Gradient Methods
2) Understand at least three examples in each chapter; if a chapter includes programs, test or implement them.

3 Outline
1 Review of Finite-Horizon DP
2 Infinite-Horizon DP: Theory and Algorithms
3 A Primer on ADP
4 Dimension Reduction in RL: Approximation in value space; Approximation in policy space; State Aggregation
5 On-Policy Learning: Direct Projection; Bellman Error Minimization; Projected Bellman Equation Method; From On-Policy to Off-Policy

4 Abstract DP Model
Discrete-time system with state transition
    x_{k+1} = f_k(x_k, u_k, w_k),  k = 0, 1, ..., N-1
x_k: state, summarizing past information that is relevant for future optimization
u_k: control/action, a decision to be selected at time k from a given set U_k
w_k: random disturbance or noise
g_k(x_k, u_k, w_k): transition cost incurred at time k given current state x_k and control u_k
For every k and every x_k, we want an optimal action. We look for a mapping μ from states to actions.

5 Abstract DP Model
Objective: control the system to minimize the overall expected cost
    min_μ E[ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k) ]
    s.t. u_k = μ(x_k),  k = 0, ..., N-1
We look for a policy/strategy μ, which is a mapping from states to actions.

6 Finite-Horizon DP: Theory and Algorithm
The value function is defined as the optimal revenue one can earn (by using the optimal policy onward) starting at time period t with inventory x:
    V_t(x) = max_{μ_t,...,μ_T} E[ Σ_{k=t}^{T} g_k(x_k, μ_k(x_k), w_k) | x_t = x ]
Bellman equation: for all x and t ≤ T − 1,
    V_t(x) = max_{μ_t} E[ g_t(x_t, μ_t(x_t), w_t) + V_{t+1}(x_{t+1}) | x_t = x ]
Tail optimality: a strategy μ_1, ..., μ_T is optimal if and only if every tail strategy μ_t, ..., μ_T is optimal for the tail problem starting at stage t.
DP algorithm: solve the Bellman equation directly by backward induction. This is value iteration.

7 Option Pricing: A Simple Binomial Model
We focus on American call options.
Strike price: K
Duration: T days
Stock price on the t-th day: S_t
Growth rate: u ∈ (1, ∞)
Diminish rate: d ∈ (0, 1)
Probability of growth: p ∈ [0, 1]
Binomial model of the stock price:
    S_{t+1} = u S_t with probability p,   S_{t+1} = d S_t with probability 1 − p
As the discretization of time becomes finer, the binomial model approaches the Brownian motion model.

8 Option Pricing: Bellman's Equation
Given S_0, T, K, u, d, p.
State: S_t, with a finite number of possible values.
Cost vector: V_t(S), the value of the option on the t-th day when the current stock price is S.
Bellman equation for the binomial option:
    V_t(S_t) = max{ S_t − K, p V_{t+1}(u S_t) + (1 − p) V_{t+1}(d S_t) }
    V_T(S_T) = max{ S_T − K, 0 }
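The backward induction below is a minimal sketch of this recursion on a recombining binomial tree. The function name and the parameter values in the call are hypothetical, and no discounting is applied, matching the Bellman equation as written on the slide.

```python
import numpy as np

def american_call_binomial(S0, K, T, u, d, p):
    """Backward induction for the binomial option model above.
    No discount factor is used, matching the slide's Bellman equation."""
    # Option values at day T: terminal condition V_T(S_T) = max(S_T - K, 0)
    values = np.array([max(S0 * u**j * d**(T - j) - K, 0.0) for j in range(T + 1)])
    for t in range(T - 1, -1, -1):
        prices = np.array([S0 * u**j * d**(t - j) for j in range(t + 1)])
        cont = p * values[1:t + 2] + (1 - p) * values[0:t + 1]   # continuation value
        values = np.maximum(prices - K, cont)                     # exercise vs. continue
    return values[0]

print(american_call_binomial(S0=100.0, K=100.0, T=50, u=1.02, d=0.98, p=0.55))
```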

9 Outline
1 Review of Finite-Horizon DP
2 Infinite-Horizon DP: Theory and Algorithms
3 A Primer on ADP
4 Dimension Reduction in RL: Approximation in value space; Approximation in policy space; State Aggregation
5 On-Policy Learning: Direct Projection; Bellman Error Minimization; Projected Bellman Equation Method; From On-Policy to Off-Policy

10 Infinite-Horizon Discounted Problems / Bounded Cost
Stationary system:
    x_{k+1} = f(x_k, u_k, w_k),  k = 0, 1, ...
Cost of a policy π = {μ_0, μ_1, ...}:
    J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} [ Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) ]
with α < 1, and g bounded [for some M, |g(x, u, w)| ≤ M for all (x, u, w)].
The optimal cost function is defined as J*(x) = min_π J_π(x).

11 Infinite-Horizon Discounted Problems / Bounded Cost
Boundedness of g guarantees that all costs are well defined and bounded:
    |J_π(x)| ≤ M / (1 − α)
All spaces are arbitrary; only boundedness of g is important (there are mathematical fine points, e.g. measurability, but they do not matter in practice).
Important special case with finite spaces: the Markovian Decision Problem (MDP).
All algorithms ultimately work with a finite-space MDP approximating the original problem.

12 Shorthand Notation for DP Mappings
For any function J of x, denote
    (TJ)(x) = min_{u ∈ U(x)} E_w{ g(x, u, w) + α J(f(x, u, w)) },  for all x
TJ is the optimal cost function for the one-stage problem with stage cost g and terminal cost function αJ.
T operates on bounded functions of x to produce other bounded functions of x.
For any stationary policy μ, denote
    (T_μ J)(x) = E_w{ g(x, μ(x), w) + α J(f(x, μ(x), w)) },  for all x
The critical structure of the problem is captured in T and T_μ.
The entire theory of discounted problems can be developed in shorthand using T and T_μ.
The same is true for many other DP problems. T and T_μ provide a powerful unifying framework for DP; this is the essence of the book Abstract Dynamic Programming.

13 Express Finite-Horizon Cost Using T
Consider an N-stage policy π_0^N = {μ_0, μ_1, ..., μ_{N-1}} with a terminal cost J, and let π_1^N = {μ_1, μ_2, ..., μ_{N-1}}. Then
    J_{π_0^N}(x_0) = E[ α^N J(x_N) + Σ_{l=0}^{N-1} α^l g(x_l, μ_l(x_l), w_l) ]
                   = E[ g(x_0, μ_0(x_0), w_0) + α J_{π_1^N}(x_1) ]
                   = (T_{μ_0} J_{π_1^N})(x_0)
By induction, we have
    J_{π_0^N}(x) = (T_{μ_0} T_{μ_1} ··· T_{μ_{N-1}} J)(x),  for all x
For a stationary policy μ, the N-stage cost function (with terminal cost J) is J_{π_0^N} = T_μ^N J, where T_μ^N is the N-fold composition of T_μ.
Similarly, the optimal N-stage cost function (with terminal cost J) is T^N J.
T^N J = T(T^{N-1} J) is just the DP algorithm.

14 Markov Chain
A Markov chain is a random process that takes values in the state space {1, ..., n}. The process evolves according to a transition probability matrix P ∈ R^{n×n}, where
    P(i_{k+1} = j | i_k = i, i_{k-1}, ..., i_0) = P(i_{k+1} = j | i_k = i) = P_{ij}
A Markov chain is memoryless: conditioned on the current state, the future evolution is independent of the past trajectory. This memoryless property is what "Markov" means.
A state i is recurrent if it is visited infinitely many times with probability 1.
A Markov chain is irreducible if its state space is a single communicating class; in other words, it is possible to get to any state from any state.
When states are modeled appropriately, all stochastic processes are Markov.

15 Markovian Decision Problem
We will mostly assume the system is an n-state (controlled) Markov chain:
States i = 1, ..., n (instead of x)
Transition probabilities p_{i_k i_{k+1}}(u_k) [instead of x_{k+1} = f(x_k, u_k, w_k)]
Stage cost g(i_k, u_k, i_{k+1}) [instead of g(x_k, u_k, w_k)]
Cost functions J = (J(1), ..., J(n)) (vectors in R^n)
Cost of a policy π = {μ_0, μ_1, ...}:
    J_π(i) = lim_{N→∞} E_{i_k, k=1,2,...} [ Σ_{k=0}^{N-1} α^k g(i_k, μ_k(i_k), i_{k+1}) | i_0 = i ]
The MDP is the most important problem in infinite-horizon DP.

16 Markovian Decision Problem
Shorthand notation for the DP mappings:
    (TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_{ij}(u) ( g(i, u, j) + α J(j) ),  i = 1, ..., n
    (T_μ J)(i) = Σ_{j=1}^{n} p_{ij}(μ(i)) ( g(i, μ(i), j) + α J(j) ),  i = 1, ..., n
Vector form of the DP mappings:
    TJ = min_μ { g_μ + α P_μ J },   T_μ J = g_μ + α P_μ J
where
    g_μ(i) = Σ_{j=1}^{n} p_{ij}(μ(i)) g(i, μ(i), j),   P_μ(i, j) = p_{ij}(μ(i))
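As a concrete illustration, here is a minimal sketch of the mappings T and T_μ for a small finite MDP. The array layout (transition tensor P[u, i, j] and cost tensor g[u, i, j]) and the function names are assumptions made for this example.

```python
import numpy as np

def bellman_T(J, P, g, alpha):
    """(TJ)(i) = min_u sum_j P[u,i,j] * (g[u,i,j] + alpha * J[j])."""
    # Q[u, i] = expected stage cost plus discounted continuation under action u at state i
    Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J)
    return Q.min(axis=0)

def bellman_T_mu(J, P, g, alpha, mu):
    """(T_mu J)(i) = g_mu(i) + alpha * (P_mu J)(i) for a stationary policy mu (action indices)."""
    n = len(J)
    P_mu = P[mu, np.arange(n), :]                              # row i is p_{i.}(mu(i))
    g_mu = np.einsum('ij,ij->i', P_mu, g[mu, np.arange(n), :]) # expected stage cost under mu
    return g_mu + alpha * P_mu @ J
```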

17 Two Key Properties
Monotonicity property: for any J and J′ such that J(x) ≤ J′(x) for all x, and any μ,
    (TJ)(x) ≤ (TJ′)(x),   (T_μ J)(x) ≤ (T_μ J′)(x),  for all x
Constant shift property: for any J, any scalar r, and any μ,
    (T(J + re))(x) = (TJ)(x) + αr,   (T_μ(J + re))(x) = (T_μ J)(x) + αr,  for all x
where e is the unit function [e(x) ≡ 1].
Monotonicity is present in all DP models (undiscounted, etc.).
Constant shift is special to discounted models.
Discounted problems have another property of major importance: T and T_μ are contraction mappings (we will show this later).

18 Convergence of Value Iteration
Theorem. For all bounded J_0, we have J*(x) = lim_{k→∞} (T^k J_0)(x), for all x.
Proof. For simplicity we give the proof for J_0 ≡ 0. For any initial state x_0 and policy π = {μ_0, μ_1, ...},
    J_π(x_0) = E[ Σ_{l=0}^{∞} α^l g(x_l, μ_l(x_l), w_l) ]
             = E[ Σ_{l=0}^{k-1} α^l g(x_l, μ_l(x_l), w_l) ] + E[ Σ_{l=k}^{∞} α^l g(x_l, μ_l(x_l), w_l) ]
The tail portion satisfies
    | E[ Σ_{l=k}^{∞} α^l g(x_l, μ_l(x_l), w_l) ] | ≤ α^k M / (1 − α)
where M ≥ |g(x, u, w)|. Take the minimum over π on both sides, then take the limit as k → ∞.

19 Proof of Bellman's Equation
Theorem. The optimal cost function J* is a solution of Bellman's equation J* = TJ*, i.e., for all x,
    J*(x) = min_{u ∈ U(x)} E_w{ g(x, u, w) + α J*(f(x, u, w)) }
Proof. For all x and k,
    J*(x) − α^k M / (1 − α) ≤ (T^k J_0)(x) ≤ J*(x) + α^k M / (1 − α)
where J_0(x) ≡ 0 and M ≥ |g(x, u, w)|. Applying T to this relation, and using monotonicity and constant shift,
    (TJ*)(x) − α^{k+1} M / (1 − α) ≤ (T^{k+1} J_0)(x) ≤ (TJ*)(x) + α^{k+1} M / (1 − α)
Taking the limit as k → ∞ and using the fact that lim_{k→∞} (T^{k+1} J_0)(x) = J*(x), we obtain J* = TJ*.

20 The Contraction Property
Contraction property: for any bounded functions J and J′, and any μ,
    max_x |(TJ)(x) − (TJ′)(x)| ≤ α max_x |J(x) − J′(x)|
    max_x |(T_μ J)(x) − (T_μ J′)(x)| ≤ α max_x |J(x) − J′(x)|
Proof. Denote c = max_{x ∈ S} |J(x) − J′(x)|. Then
    J(x) − c ≤ J′(x) ≤ J(x) + c,  for all x
Apply T to both sides, and use the monotonicity and constant shift properties:
    (TJ)(x) − αc ≤ (TJ′)(x) ≤ (TJ)(x) + αc,  for all x
Hence |(TJ)(x) − (TJ′)(x)| ≤ αc for all x.
This implies that T and T_μ have unique fixed points. Then J* is the unique solution of J = TJ, and J_μ is the unique solution of J_μ = T_μ J_μ.

21 Necessary and Sufficient Optimality Condition
Theorem. A stationary policy μ is optimal if and only if μ(x) attains the minimum in Bellman's equation for each x; i.e.,
    TJ* = T_μ J*
or, equivalently, for all x,
    μ(x) ∈ arg min_{u ∈ U(x)} E_w{ g(x, u, w) + α J*(f(x, u, w)) }

22 Proof of Optimality Condition
Proof. There are two directions.
If TJ* = T_μ J*, then using Bellman's equation (J* = TJ*), we have J* = T_μ J*, so by uniqueness of the fixed point of T_μ we obtain J* = J_μ; i.e., μ is optimal.
Conversely, if the stationary policy μ is optimal, we have J* = J_μ, so J* = T_μ J*. Combining this with Bellman's equation (J* = TJ*), we obtain TJ* = T_μ J*.

23 Two Main Algorithms
Value iteration: solve the Bellman equation J* = TJ* by iterating on the value functions, J_{k+1} = TJ_k, or
    J_{k+1}(i) = min_u Σ_{j=1}^{n} p_{ij}(u) ( g(i, u, j) + α J_k(j) ),  i = 1, ..., n
The program only needs to store the current value function J_k.
We have shown that J_k → J* as k → ∞.
Policy iteration: solve the Bellman equation J* = TJ* by iterating on the policies.
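A minimal value iteration sketch, under the same assumed array layout as before (P[u, i, j] transition probabilities, g[u, i, j] stage costs); the stopping tolerance is a choice made for the example.

```python
import numpy as np

def value_iteration(P, g, alpha, tol=1e-8):
    """Iterate J_{k+1} = T J_k until the sup-norm change is below tol."""
    n = P.shape[1]
    J = np.zeros(n)
    while True:
        Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J)
        J_new = Q.min(axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new, Q.argmin(axis=0)   # value function and a greedy policy
        J = J_new
```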

24 Policy Iteration (PI)
Given μ^k, the k-th policy iteration has two steps.
Policy evaluation: find J_{μ^k} satisfying J_{μ^k} = T_{μ^k} J_{μ^k}, i.e., solve
    J_{μ^k}(i) = Σ_{j=1}^{n} p_{ij}(μ^k(i)) ( g(i, μ^k(i), j) + α J_{μ^k}(j) ),  i = 1, ..., n
Policy improvement: let μ^{k+1} be such that T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}, or
    μ^{k+1}(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_{ij}(u) ( g(i, u, j) + α J_{μ^k}(j) )
Policy iteration is a method that updates the policy instead of the value function.

25 Policy Iteration (PI)
More abstractly, the k-th policy iteration has two steps.
Policy evaluation: find J_{μ^k} by solving J_{μ^k} = T_{μ^k} J_{μ^k} = g_{μ^k} + α P_{μ^k} J_{μ^k}
Policy improvement: let μ^{k+1} be such that T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}
Comments:
Policy evaluation is equivalent to solving an n × n linear system of equations.
Policy improvement is equivalent to a 1-step lookahead using the evaluated value function.
For large n, exact PI is out of the question. We use instead optimistic PI (policy evaluation with a few VIs).
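A sketch of exact policy iteration under the same assumptions: evaluation solves the n × n linear system (I − αP_μ)J = g_μ directly, and improvement is the one-step lookahead.

```python
import numpy as np

def policy_iteration(P, g, alpha):
    """Exact PI for a finite MDP with P[u, i, j] and g[u, i, j]."""
    n = P.shape[1]
    mu = np.zeros(n, dtype=int)                  # arbitrary initial policy
    while True:
        P_mu = P[mu, np.arange(n), :]
        g_mu = np.einsum('ij,ij->i', P_mu, g[mu, np.arange(n), :])
        J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)      # policy evaluation
        Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J_mu)
        mu_new = Q.argmin(axis=0)                                    # policy improvement
        if np.array_equal(mu_new, mu):
            return mu, J_mu
        mu = mu_new
```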

26 Convergence of Policy Iteration
Theorem. Assume that the state and action spaces are finite. Policy iteration generates policies μ^k that converge to an optimal policy μ* in a finite number of steps.
Proof. We show that J_{μ^k} ≥ J_{μ^{k+1}} for all k. For given k, we have
    J_{μ^k} = T_{μ^k} J_{μ^k} ≥ T J_{μ^k} = T_{μ^{k+1}} J_{μ^k}
Using the monotonicity property of DP,
    J_{μ^k} ≥ T_{μ^{k+1}} J_{μ^k} ≥ T_{μ^{k+1}}^2 J_{μ^k} ≥ ... ≥ lim_{N→∞} T_{μ^{k+1}}^N J_{μ^k}
Since lim_{N→∞} T_{μ^{k+1}}^N J_{μ^k} = J_{μ^{k+1}}, we have J_{μ^k} ≥ J_{μ^{k+1}}.
If J_{μ^k} = J_{μ^{k+1}}, all the above inequalities hold as equalities, so J_{μ^k} solves Bellman's equation; hence J_{μ^k} = J*.
Thus at each iteration k the algorithm either generates a strictly improved policy or finds an optimal policy. For a finite-space MDP, the algorithm terminates with an optimal policy.

27 Shorthand Theory: A Summary
Infinite-horizon cost function expressions [with J_0(x) ≡ 0]:
    J_π(x) = lim_{N→∞} (T_{μ_0} T_{μ_1} ··· T_{μ_{N-1}} J_0)(x),   J_μ(x) = lim_{N→∞} (T_μ^N J_0)(x)
Bellman's equation: J* = TJ*,  J_μ = T_μ J_μ
Optimality condition: μ is optimal iff T_μ J* = TJ*
Value iteration: for any bounded J,
    J*(x) = lim_{k→∞} (T^k J)(x),  for all x
Policy iteration: given μ^k,
    Policy evaluation: find J_{μ^k} by solving J_{μ^k} = T_{μ^k} J_{μ^k}
    Policy improvement: let μ^{k+1} be such that T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}

28 Q-Factors
Optimal Q-factor of (x, u):
    Q*(x, u) = E{ g(x, u, w) + α J*(f(x, u, w)) }
It is the cost of starting at x, applying u in the first stage, and following an optimal policy after the first stage.
The optimal value function satisfies
    J*(x) = min_{u ∈ U(x)} Q*(x, u),  for all x
Q-factors are costs in an augmented problem where the states are the pairs (x, u).
Here (x, u) is a post-decision state.

29 VI in Q-Factors
We can equivalently write the VI method as
    J_{k+1}(x) = min_{u ∈ U(x)} Q_{k+1}(x, u),  for all x
where Q_{k+1} is generated by
    Q_{k+1}(x, u) = E[ g(x, u, w) + α min_{v ∈ U(f(x, u, w))} Q_k(f(x, u, w), v) ]
VI converges for Q-factors.
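A minimal sketch of Q-factor value iteration for the finite MDP case, again under the assumed P[u, i, j], g[u, i, j] layout; the fixed iteration count is a simplification for the example.

```python
import numpy as np

def q_value_iteration(P, g, alpha, iters=1000):
    """Q_{k+1}(i, u) = sum_j P[u,i,j] * (g[u,i,j] + alpha * min_v Q_k(j, v)).
    Returns Q with shape (n_actions, n_states)."""
    m, n, _ = P.shape
    Q = np.zeros((m, n))
    for _ in range(iters):
        J = Q.min(axis=0)                  # J_k(j) = min_v Q_k(j, v)
        Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J)
    return Q
```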

30 Q-Factors
VI and PI for Q-factors are mathematically equivalent to VI and PI for costs.
They require an equal amount of computation; they just need more storage.
Having the optimal Q-factors is convenient when implementing an optimal policy on-line by
    μ*(x) = arg min_{u ∈ U(x)} Q*(x, u)
Once the Q*(x, u) are known, the model [g and E{·}] is not needed: model-free operation.
Q-learning (to be discussed later) is a sampling method that calculates Q*(x, u) using a simulator of the system (no model needed).

31 MDP and Q-Factors
Optimal Q-factors: the function Q*(i, u) satisfies the Bellman equation
    Q*(i, u) = Σ_{j=1}^{n} p_{ij}(u) ( g(i, u, j) + α min_{v ∈ U(j)} Q*(j, v) )
or, in shorthand, Q* = FQ*.
Interpretation: Q-factors can be viewed as J values by considering (i, u) as the post-decision state.
DP algorithms for Q-values instead of J-values:
    Value iteration: Q_{k+1} = F Q_k
    Policy evaluation: Q_{μ^k} = F_{μ^k} Q_{μ^k}
    Policy improvement: F_{μ^{k+1}} Q_{μ^k} = F Q_{μ^k}
VI and PI are convergent for Q-values. Model-free.

32 Other DP Models
We have looked so far at the (discrete- or continuous-space) discounted models, for which the analysis is simplest and the results are most powerful. Other DP models include:
Undiscounted problems (α = 1): they may include a special termination state (stochastic shortest path problems).
Continuous-time finite-state MDPs: the time between transitions is random and state-and-control-dependent (typical in queueing systems; called semi-Markov MDPs). These can be viewed as discounted problems with state-and-control-dependent discount factors.
Continuous-time, continuous-space models: classical automatic control, process control, robotics. There are substantial differences from discrete time, and a mathematically more complex theory (particularly for stochastic problems); deterministic versions can be analyzed using classical optimal control theory.

33 Outline
1 Review of Finite-Horizon DP
2 Infinite-Horizon DP: Theory and Algorithms
3 A Primer on ADP
4 Dimension Reduction in RL: Approximation in value space; Approximation in policy space; State Aggregation
5 On-Policy Learning: Direct Projection; Bellman Error Minimization; Projected Bellman Equation Method; From On-Policy to Off-Policy

34 Practical Difficulties of DP
The curse of dimensionality:
    Exponential growth of the computational and storage requirements as the number of state and control variables increases.
    Quick explosion of the number of states in combinatorial problems.
The curse of modeling:
    Sometimes a simulator of the system is easier to construct than a model.
There may be real-time solution constraints:
    A family of problems may need to be addressed, with the data of the problem to be solved given with little advance notice.
    The problem data may change as the system is controlled, creating a need for on-line replanning.
All of the above are motivations for approximation and simulation.

35 General Orientation to ADP
ADP (late 1980s - present) is a breakthrough methodology that allows the application of DP to problems with many or an infinite number of states.
Other names for ADP are: reinforcement learning (RL), neuro-dynamic programming (NDP), adaptive dynamic programming.
We will mainly adopt an n-state discounted model (the easiest case, but think of HUGE n). Extensions to other DP models (continuous space, continuous time, undiscounted) are possible (but more quirky); we set these aside for later.
There are many approaches: problem approximation, and simulation-based approaches. Simulation-based methods are of three types: rollout (we will not discuss it further), approximation in value space, and approximation in policy space.

36 Why Do We Use Simulation?
One reason: a computational complexity advantage in computing sums/expectations involving a very large number of terms.
Any sum Σ_{i=1}^{n} a_i can be written as an expected value:
    Σ_{i=1}^{n} a_i = Σ_{i=1}^{n} ξ_i (a_i / ξ_i) = E_ξ[ a_i / ξ_i ]
where ξ is any probability distribution over {1, ..., n}.
It can be approximated by generating many samples {i_1, ..., i_k} from {1, ..., n} according to the distribution ξ, and Monte Carlo averaging:
    Σ_{i=1}^{n} a_i = E_ξ[ a_i / ξ_i ] ≈ (1/k) Σ_{t=1}^{k} a_{i_t} / ξ_{i_t}
Simulation is also convenient when an analytical model of the system is unavailable, but a simulation/computer model is possible.
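A small sketch of this Monte Carlo averaging idea; the terms a_i, the sampling distribution ξ, and the sample size are all arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
a = rng.standard_normal(n)              # terms a_1,...,a_n of the sum (hypothetical data)
xi = np.full(n, 1.0 / n)                # sampling distribution xi over {1,...,n} (uniform here)

k = 10_000
idx = rng.choice(n, size=k, p=xi)       # samples i_1,...,i_k drawn from xi
estimate = np.mean(a[idx] / xi[idx])    # (1/k) * sum_t a_{i_t} / xi_{i_t}
print(estimate, a.sum())                # Monte Carlo estimate vs. exact sum
```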

37 Solve DP via Simulation
Ideally, VI and PI solve the fixed-point equation: find J* such that
    J* = min_μ { g_μ + α P_μ J* }
Practically, we often wish to solve Bellman's equation without knowing P_μ and g_μ.
What we do have: a simulator that, starting from state i and given action a, generates random samples of the transition cost and the next state:
    g(i, i_next, a),  i_next
Example: optimize a trading policy to maximize profit. The current transaction has an unknown market impact; use the current order book as states/features.
Example: stochastic games, Tetris — hundreds of millions of states, captured using 22 features.

38 Outline
1 Review of Finite-Horizon DP
2 Infinite-Horizon DP: Theory and Algorithms
3 A Primer on ADP
4 Dimension Reduction in RL: Approximation in value space; Approximation in policy space; State Aggregation
5 On-Policy Learning: Direct Projection; Bellman Error Minimization; Projected Bellman Equation Method; From On-Policy to Off-Policy

39 Approximation in Value Space
Approximate J* or J_μ from a parametric class J̃(i; σ), where i is the current state and σ = (σ_1, ..., σ_m) is a vector of tunable scalars (weights).
Use J̃ in place of J* or J_μ in various algorithms and computations.
Role of σ: by adjusting σ we can change the "shape" of J̃ so that it is close to J* or J_μ.
A simulator may be used, particularly when there is no mathematical model of the system (but there is a computer model).
We will focus on simulation, but this is not the only possibility.
We may also use parametric approximation for Q-factors or cost function differences.

40 Approximation Architectures
Two key issues:
    The choice of the parametric class J̃(i; σ) (the approximation architecture).
    The method for tuning the weights ("training" the architecture).
Success depends strongly on how these issues are handled, and also on insight about the problem.
Architectures are divided into linear and nonlinear [i.e., linear or nonlinear dependence of J̃(i; σ) on σ].
Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer.

41 Computer Chess Example
Think of the board position as the state and the move as the control.
Uses a feature-based position evaluator that assigns a score (an approximate Q-factor) to each position/move.
Relatively few special features and weights, and multistep lookahead.

42 Linear Approximation Architectures
With well-chosen features, we can use a linear architecture:
    J̃(i; σ) = φ(i)′σ,  i = 1, ..., n,   or   J̃_σ = Φσ = Σ_{j=1}^{s} Φ_j σ_j
Φ: the matrix whose rows are φ(i)′, i = 1, ..., n; Φ_j is its j-th column.
This is approximation on the subspace
    S = { Φσ | σ ∈ R^s }
spanned by the columns of Φ (the basis functions).

43 Linear Approximation Architectures
Often, the features encode much of the nonlinearity inherent in the cost function being approximated.
Many examples of feature types: polynomial approximation, radial basis functions, etc.

44 Example: Polynomial Type
Polynomial approximation, e.g., a quadratic approximating function. Let the state be i = (i_1, ..., i_q) (i.e., it has q "dimensions") and define
    φ_0(i) = 1,  φ_k(i) = i_k,  φ_{km}(i) = i_k i_m,  k, m = 1, ..., q
Linear approximation architecture:
    J̃(i; σ) = σ_0 + Σ_{k=1}^{q} σ_k i_k + Σ_{k=1}^{q} Σ_{m=k}^{q} σ_{km} i_k i_m
where σ has components σ_0, σ_k, and σ_{km}.
Interpolation: a subset I of special/representative states is selected, and the parameter vector σ has one component σ_i per state i ∈ I. The approximating function is
    J̃(i; σ) = σ_i,  i ∈ I
    J̃(i; σ) = interpolation using the values at i ∈ I,  i ∉ I
For example, piecewise constant, piecewise linear, or more general polynomial interpolations.
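A minimal sketch of the quadratic feature map above; the function name is hypothetical and the state is taken to be a numeric vector of length q.

```python
import numpy as np

def quadratic_features(state):
    """Feature vector [1, i_1,...,i_q, i_k*i_m for k <= m] for a state i = (i_1,...,i_q)."""
    i = np.asarray(state, dtype=float)
    q = len(i)
    quad = [i[k] * i[m] for k in range(q) for m in range(k, q)]
    return np.concatenate(([1.0], i, quad))

sigma_dim = quadratic_features(np.zeros(3)).size   # 1 + q + q(q+1)/2 = 10 for q = 3
print(sigma_dim)
```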

45 Another Example: Tetris
J*(i): optimal score starting from position i.
The number of states is larger than 2^200 (for a 10 × 20 board).
Success with just 22 features, readily recognized by Tetris players as capturing important aspects of the board position (heights of columns, etc.).

46 Approximation in Policy Space
A brief discussion; we will return to it later.
Use a parametrization μ(i; σ) of policies with a vector σ = (σ_1, ..., σ_s). Examples:
    Polynomial, e.g., μ(i; σ) = σ_1 + σ_2 i + σ_3 i^2
    Linear feature-based, μ(i; σ) = φ_1(i) σ_1 + φ_2(i) σ_2

47 Approximation in Policy Space
Optimize the cost over σ. For example: each value of σ defines a stationary policy, with cost starting at state i denoted by J̃(i; σ). Let (p_1, ..., p_n) be some probability distribution over the states, and minimize over σ:
    Σ_{i=1}^{n} p_i J̃(i; σ)
Use a random search, gradient, or other method.
A special case: the parameterization of the policies is indirect, through a cost approximation architecture J̃, i.e.,
    μ(i; σ) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_{ij}(u) ( g(i, u, j) + α J̃(j; σ) )

48 Aggregation
A first idea: group similar states together into aggregate states x_1, ..., x_s; assign a common cost value σ_i to each group x_i.
Solve an aggregate DP problem, involving the aggregate states, to obtain σ = (σ_1, ..., σ_s).
This is called hard aggregation.

49 Aggregation
A more general/mathematical view: solve
    Φσ = Φ D T_μ(Φσ)
where the rows of D and Φ are probability distributions (e.g., D and Φ "aggregate" the rows and columns of the linear system J = T_μ J).
Compare with the projected equation Φσ = Π T_μ(Φσ).
Note: ΦD is a projection in some interesting cases.

50 Aggregation as Problem Abstraction
Aggregation can be viewed as a systematic approach for problem approximation. Main elements:
Solve (exactly or approximately) the "aggregate" problem by any kind of VI or PI method (including simulation-based methods).
Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem.

51 Aggregate System Description
The transition probability from aggregate state x to aggregate state y under control u is
    p̂_{xy}(u) = Σ_{i=1}^{n} d_{xi} Σ_{j=1}^{n} p_{ij}(u) φ_{jy},   or   P̂(u) = D P(u) Φ
where the rows of D and Φ are the disaggregation and aggregation probabilities.
The expected transition cost is
    ĝ(x, u) = Σ_{i=1}^{n} d_{xi} Σ_{j=1}^{n} p_{ij}(u) g(i, u, j),   or   ĝ = D P(u) g
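A minimal sketch of forming the aggregate model for one control u; the matrix layout (P_u and g_u as n × n arrays, D as s × n, Phi as n × s) and the function name are assumptions for this example.

```python
import numpy as np

def aggregate_model(P_u, g_u, D, Phi):
    """Aggregate dynamics for a fixed control u:
       P_hat = D @ P_u @ Phi (s x s),  g_hat[x] = sum_i d_xi sum_j p_ij(u) g(i,u,j)."""
    P_hat = D @ P_u @ Phi
    g_hat = D @ np.sum(P_u * g_u, axis=1)     # expected cost per original state, disaggregated
    return P_hat, g_hat
```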

52 Aggregate Bellman's Equation
The optimal cost function of the aggregate problem, denoted R̂, satisfies
    R̂(x) = min_{u ∈ U} [ ĝ(x, u) + α Σ_y p̂_{xy}(u) R̂(y) ],  for all x
This is Bellman's equation for the aggregate problem.
The optimal cost function J* of the original problem is approximated by J̃ given by
    J̃(j) = Σ_y φ_{jy} R̂(y),  for all j

53 Example I: Hard Aggregation
Group the original system states into subsets, and view each subset as an aggregate state.
Aggregation probabilities: φ_{jy} = 1 if j belongs to aggregate state y.
Disaggregation probabilities: there are many possibilities; e.g., all states i within aggregate state x have equal probability d_{xi}.
If the optimal cost vector J* is piecewise constant over the aggregate states/subsets, hard aggregation is exact. This suggests grouping states with roughly equal cost into aggregates.
A variant: soft aggregation (provides "soft boundaries" between aggregate states).

54 Example II: Feature-Based Aggregation
Important question: how do we group states together?
If we know good features, it makes sense to group together states that have similar features.
This gives a general approach for passing from a feature-based state representation to a hard aggregation-based architecture: essentially, discretize the features and generate a corresponding piecewise constant approximation to the optimal cost function.
An aggregation-based architecture is more powerful (it is nonlinear in the features), but it may require many more aggregate states to reach the same level of performance as the corresponding linear feature-based architecture.

55 Example III: Representative States / Coarse Grid
Choose a collection of "representative" original system states, and associate each of them with an aggregate state.
Disaggregation probabilities: d_{xi} = 1 if i is equal to representative state x.
Aggregation probabilities associate original system states with convex combinations of representative states:
    j ~ Σ_{y ∈ A} φ_{jy} y
Well suited for discretization of a Euclidean space.
Extends nicely to continuous state spaces, including the belief space of a POMDP.

56 Feature Extraction is Linear Approximation of a High-Dimensional Cost Vector

57 Outline
1 Review of Finite-Horizon DP
2 Infinite-Horizon DP: Theory and Algorithms
3 A Primer on ADP
4 Dimension Reduction in RL: Approximation in value space; Approximation in policy space; State Aggregation
5 On-Policy Learning: Direct Projection; Bellman Error Minimization; Projected Bellman Equation Method; From On-Policy to Off-Policy

58 Direct Policy Evaluation
Approximate the cost of the current policy by using least squares and simulation-generated cost samples.
This amounts to projection of J_μ onto the approximation subspace.
Solution by least squares methods.
Regular and optimistic policy iteration.
Nonlinear approximation architectures may also be used.

59 Direct Evaluation by Simulation
Projection by Monte Carlo simulation: compute the projection Π J_μ of J_μ on the subspace S = { Φσ | σ ∈ R^s }, with respect to a weighted Euclidean norm ‖·‖_ξ.
Equivalently, find Φσ*, where
    σ* = arg min_{σ ∈ R^s} ‖Φσ − J_μ‖_ξ^2 = arg min_{σ ∈ R^s} Σ_{i=1}^{n} ξ_i ( φ(i)′σ − J_μ(i) )^2
Setting the gradient at σ* to 0,
    σ* = ( Σ_{i=1}^{n} ξ_i φ(i) φ(i)′ )^{-1} Σ_{i=1}^{n} ξ_i φ(i) J_μ(i)

60 Direct Evaluation by Simulation
Generate samples {(i_1, J_μ(i_1)), ..., (i_k, J_μ(i_k))} using the distribution ξ.
Approximate by Monte Carlo the two expected values with low-dimensional calculations:
    σ̂_k = ( Σ_{t=1}^{k} φ(i_t) φ(i_t)′ )^{-1} Σ_{t=1}^{k} φ(i_t) J_μ(i_t)
Equivalent least squares alternative calculation:
    σ̂_k = arg min_{σ ∈ R^s} Σ_{t=1}^{k} ( φ(i_t)′σ − J_μ(i_t) )^2

61 Convergence of the Evaluated Policy
By the law of large numbers, we have
    (1/k) Σ_{t=1}^{k} φ(i_t) φ(i_t)′ → Σ_{i=1}^{n} ξ_i φ(i) φ(i)′   a.s.
    (1/k) Σ_{t=1}^{k} φ(i_t) J_μ(i_t) → Σ_{i=1}^{n} ξ_i φ(i) J_μ(i)   a.s.
Hence σ̂_k → σ* a.s., where Φσ* = Π_S J_μ.
As the number of samples increases, the estimated low-dimensional cost σ̂_k converges almost surely to (the weights of) the projected J_μ.
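A minimal sketch of this direct, least-squares evaluation from simulation-generated cost samples; the function name, the sample format, and the user-supplied feature_map are assumptions for the example.

```python
import numpy as np

def direct_evaluation(samples, feature_map):
    """Least-squares fit of sigma from samples [(state, observed cost J_mu(state)), ...].
    feature_map(state) returns the feature vector phi(state)."""
    Phi = np.array([feature_map(i) for i, _ in samples])   # k x s feature matrix
    y = np.array([c for _, c in samples])                  # sampled costs J_mu(i_t)
    sigma_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # argmin_sigma ||Phi sigma - y||^2
    return sigma_hat
```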

62 Indirect Policy Evaluation
Solve the projected equation Φσ = Π T_μ(Φσ), where Π is the projection with respect to a suitable weighted Euclidean norm.
Solution methods that use simulation (to manage the calculation of Π):
TD(λ): a stochastic iterative algorithm for solving Φσ = Π T_μ(Φσ).
LSTD(λ): solves a simulation-based approximation with a standard solver.
LSPE(λ): a simulation-based form of "projected value iteration"; essentially
    Φσ_{k+1} = Π T_μ(Φσ_k) + simulation noise
Almost sure convergence guarantees.

63 Bellman Error Minimization
Another example of indirect approximate policy evaluation:
    min_σ ‖Φσ − T_μ(Φσ)‖_ξ^2     (*)
where ‖·‖_ξ is a Euclidean norm, weighted with respect to some distribution ξ.
It is closely related to the projected equation/Galerkin approach (with a special choice of projection norm).
There are several ways to implement projected equation and Bellman error methods by simulation. They involve:
Generating many random samples of states i_k using the distribution ξ.
Generating many samples of transitions (i_k, j_k) using the policy μ.
Forming a simulation-based approximation of the optimality condition for the projection problem or problem (*) (using sample averages in place of inner products).
Solving the Monte Carlo approximation of the optimality condition.
Issues for indirect methods: how to generate the samples? how to calculate σ efficiently?

64 Cost Function Approximation via Projected Equations
Ideally, we want to solve the Bellman equation (for a fixed policy μ) J = T_μ J. In an MDP, the equation is n × n:
    J = g_μ + α P_μ J
We solve a projected version of this high-dimensional equation:
    J = Π( g_μ + α P_μ J )
Since the projection Π is onto the space spanned by the columns of Φ, the projected equation is equivalent to
    Φσ = Π( g_μ + α P_μ Φσ )
We fix the policy μ from now on, and omit mentioning it.

65 Matrix Form of the Projected Equation
The solution Φσ* satisfies the orthogonality condition: the error Φσ* − (g + αPΦσ*) is orthogonal to the subspace spanned by the columns of Φ. This is written as
    Φ′ Ξ ( Φσ* − (g + α P Φσ*) ) = 0
where Ξ is the diagonal matrix with the steady-state probabilities ξ_1, ..., ξ_n along the diagonal.
Equivalently, Cσ* = d, where
    C = Φ′ Ξ (I − αP) Φ,   d = Φ′ Ξ g
but computing C and d is HARD (high-dimensional inner products).

66 Simulation-Based Implementations
Key idea: calculate simulation-based approximations based on k samples:
    C_k ≈ C,   d_k ≈ d
The matrix inversion σ* = C^{-1} d is approximated by
    σ̂_k = C_k^{-1} d_k
This is the LSTD (Least Squares Temporal Differences) method.
Key fact: C_k and d_k can be computed with low-dimensional linear algebra (of order s, the number of basis functions).

67 Simulation Mechanics
We generate an infinitely long trajectory (i_0, i_1, ...) of the Markov chain, so that states i and transitions (i, j) appear with long-term frequencies ξ_i and p_{ij}.
After generating each transition (i_t, i_{t+1}), we compute the row φ(i_t)′ of Φ and the cost component g(i_t, i_{t+1}). We form
    d_k = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) g(i_t, i_{t+1}) ≈ Σ_{i,j} ξ_i p_{ij} φ(i) g(i, j) = Φ′ Ξ g = d
    C_k = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) ( φ(i_t) − α φ(i_{t+1}) )′ ≈ Φ′ Ξ (I − αP) Φ = C
Convergence is based on the law of large numbers: C_k → C and d_k → d almost surely.
As the sample size increases, σ̂_k converges a.s. to the solution of the projected Bellman equation.
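A minimal LSTD(0) sketch built directly from these formulas; the input format (a state trajectory, the observed transition costs, and a user-supplied feature_map) is an assumption for the example, and the 1/(k+1) normalization is dropped since it cancels in the linear solve.

```python
import numpy as np

def lstd(trajectory, costs, feature_map, alpha):
    """LSTD(0) estimate from one simulated trajectory.
    trajectory: states i_0,...,i_k; costs[t] = g(i_t, i_{t+1}); feature_map(i) = phi(i)."""
    s = len(feature_map(trajectory[0]))
    C_k = np.zeros((s, s))
    d_k = np.zeros(s)
    for t in range(len(trajectory) - 1):
        phi_t = feature_map(trajectory[t])
        phi_next = feature_map(trajectory[t + 1])
        C_k += np.outer(phi_t, phi_t - alpha * phi_next)
        d_k += phi_t * costs[t]
    return np.linalg.solve(C_k, d_k)    # sigma_hat_k = C_k^{-1} d_k (normalization cancels)
```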

68 Approximate PI via On-Policy Learning
Outer loop (off-policy RL): estimate the value function of the current policy μ_t using linear features, J_{μ_t} ≈ Φσ_t.
Inner loop (on-policy RL): generate state trajectories and estimate σ_t via Bellman error minimization (or direct projection, or the projected equation approach).
Update the policy by
    μ_{t+1}(i) = arg min_a Σ_j p_{ij}(a) ( g(i, a, j) + α φ(j)′σ_t )
Comments:
This requires knowledge of p_{ij} (suitable for computer games with known transitions).
The policy μ_{t+1} is parameterized by σ_t.
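A sketch of this approximate PI loop under stated assumptions: the evaluation step uses an LSTD-style projected-equation estimate from a trajectory simulated under the current policy, and the improvement step uses the known model P[u, i, j], g[u, i, j] with the feature matrix Phi (n × s). Function name, iteration counts, and trajectory length are all choices made for the example.

```python
import numpy as np

def approximate_pi(P, g, Phi, alpha, n_iters=20, traj_len=20_000, seed=0):
    """Approximate policy iteration: simulate under mu, fit sigma, then improve greedily."""
    rng = np.random.default_rng(seed)
    n = P.shape[1]
    s = Phi.shape[1]
    mu = np.zeros(n, dtype=int)
    for _ in range(n_iters):
        # Evaluation: accumulate C_k, d_k along a trajectory generated under mu
        C, d = np.zeros((s, s)), np.zeros(s)
        i = rng.integers(n)
        for _ in range(traj_len):
            j = rng.choice(n, p=P[mu[i], i])
            C += np.outer(Phi[i], Phi[i] - alpha * Phi[j])
            d += Phi[i] * g[mu[i], i, j]
            i = j
        sigma = np.linalg.solve(C, d)
        # Improvement: one-step lookahead with respect to J_tilde = Phi @ sigma
        J_tilde = Phi @ sigma
        Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J_tilde)
        mu = Q.argmin(axis=0)
    return mu, Phi @ sigma
```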

69 Approximate PI via On-Policy Learning
Use simulation to approximate the cost J_μ of the current policy μ.
Generate an "improved" policy μ̄ by minimizing in the (approximate) Bellman equation.
Alternatively, we can approximate the Q-factors of μ.

70 Theoretical Basis of Approximate PI
If the policies are approximately evaluated using an approximation architecture such that
    max_i | J̃(i, σ_k) − J_{μ^k}(i) | ≤ d,  k = 0, 1, ...
and if the policy improvement is also approximate,
    max_i | (T_{μ^{k+1}} J̃)(i, σ_k) − (T J̃)(i, σ_k) | ≤ ε,  k = 0, 1, ...
Error bound: the sequence {μ^k} generated by approximate policy iteration satisfies
    lim sup_{k→∞} max_i ( J_{μ^k}(i) − J*(i) ) ≤ (ε + 2αd) / (1 − α)^2
Typical practical behavior: the method makes steady progress up to a point, and then the iterates J_{μ^k} oscillate within a neighborhood of J*.
In practice, oscillation between policies is probably not the major concern.


More information

Reinforcement Learning II. George Konidaris

Reinforcement Learning II. George Konidaris Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2017 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.

More information

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study CS 287: Advanced Robotics Fall 2009 Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study Pieter Abbeel UC Berkeley EECS Assignment #1 Roll-out: nice example paper: X.

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Machine Learning I Reinforcement Learning

Machine Learning I Reinforcement Learning Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:

More information

Reinforcement Learning II. George Konidaris

Reinforcement Learning II. George Konidaris Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2018 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes

More information

Some AI Planning Problems

Some AI Planning Problems Course Logistics CS533: Intelligent Agents and Decision Making M, W, F: 1:00 1:50 Instructor: Alan Fern (KEC2071) Office hours: by appointment (see me after class or send email) Emailing me: include CS533

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

On Average Versus Discounted Reward Temporal-Difference Learning

On Average Versus Discounted Reward Temporal-Difference Learning Machine Learning, 49, 179 191, 2002 c 2002 Kluwer Academic Publishers. Manufactured in The Netherlands. On Average Versus Discounted Reward Temporal-Difference Learning JOHN N. TSITSIKLIS Laboratory for

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

Projected Equation Methods for Approximate Solution of Large Linear Systems 1

Projected Equation Methods for Approximate Solution of Large Linear Systems 1 June 2008 J. of Computational and Applied Mathematics (to appear) Projected Equation Methods for Approximate Solution of Large Linear Systems Dimitri P. Bertsekas 2 and Huizhen Yu 3 Abstract We consider

More information

University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming

University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming University of Warwick, EC9A0 Maths for Economists 1 of 63 University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming Peter J. Hammond Autumn 2013, revised 2014 University of

More information

Robotics. Control Theory. Marc Toussaint U Stuttgart

Robotics. Control Theory. Marc Toussaint U Stuttgart Robotics Control Theory Topics in control theory, optimal control, HJB equation, infinite horizon case, Linear-Quadratic optimal control, Riccati equations (differential, algebraic, discrete-time), controllability,

More information