Dynamic Programming and Reinforcement Learning


1 Dynamic Programming and Reinforcement Learning

2 Reinforcement learning setting  $\langle A, S, R, P\rangle$
The agent interacts with the environment: it takes an action/decision, and the environment returns a reward and the next state.
Action space: $A$.  State space: $S$.
Reward: $R : S \times A \times S \to \mathbb{R}$.  Transition: $P : S \times A \to S$.

3 Reinforcement learning setting  $\langle A, S, R, P\rangle$
Same setting as above, plus the agent itself:
Policy: $\pi : S \times A \to \mathbb{R}$, with $\sum_{a \in A} \pi(a \mid s) = 1$.
Deterministic policy: $\pi : S \to A$.

4 Reinforcement learning setting  $\langle A, S, R, P\rangle$
Agent's view of the interaction: $s_0, a_1, r_1, s_1, a_2, r_2, s_2, a_3, r_3, s_3, \ldots$, where $a_{t+1} = \pi(s_t)$.

5 Reinforcement learning setting  $\langle A, S, R, P\rangle$
Agent's goal: learn a policy that maximizes the long-term total reward.
T-step: $\sum_{t=1}^{T} r_t$;   discounted: $\sum_{t=1}^{\infty} \gamma^t r_t$.
All RL tasks can be defined as maximizing total reward.
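To make the interaction loop concrete, here is a minimal Python sketch that rolls out one episode and accumulates the discounted total reward. The env object (with reset() and step()) and the policy function are hypothetical placeholders for whatever environment and policy are at hand; they are not part of the slides.

```python
def run_episode(env, policy, gamma=0.99, max_steps=100):
    """Roll out one episode and accumulate the discounted return.

    Assumes a Gym-style interface: env.reset() -> state and
    env.step(action) -> (next_state, reward, done); policy maps state -> action.
    Both env and policy are placeholders supplied by the caller.
    """
    s = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)                 # a_{t+1} = pi(s_t)
        s, r, done = env.step(a)      # environment returns reward and next state
        total += discount * r         # accumulate gamma^t * r_{t+1}
        discount *= gamma
        if done:
            break
    return total
```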

6 Difference between RL and planning?
What if we use planning/search methods to find actions that maximize total reward?
Planning: find an optimal solution. RL: find an optimal policy from samples.
Planning: shortest path. RL: a shortest-path policy without knowing the graph.

7 Difference between RL and SL?
Supervised learning also learns a model...
Supervised learning: the environment produces labeled data $(x, y), (x, y), \ldots$, which an algorithm turns into a model. Learning from labeled data: open loop, passive data.
Reinforcement learning: the environment produces trajectories $(s, a, r, s', a', r', \ldots)$, which an algorithm turns into a model. Learning from delayed reward: closed loop, the agent explores the environment.

8 Introduction and Examples  Dynamic Programming
One powerful tool for solving certain types of optimization problems is dynamic programming (DP). The idea is similar to recursion.
To illustrate recursion, suppose you want to compute $f(n) = n!$. You know $n! = n \cdot (n-1)!$, so if you know $f(n-1)$ you can compute $f(n)$ easily; thus you only need to focus on computing $f(n-1)$.
To compute $f(n-1)$, apply the same idea: you only need to compute $f(n-2)$, and so on.
And you know $f(1) = 1$.
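As a minimal illustration of the recursion described above, the following snippet computes $f(n) = n!$ directly from $f(n) = n \cdot f(n-1)$ with base case $f(1) = 1$.

```python
def factorial(n):
    """f(n) = n! via the recursion f(n) = n * f(n-1), base case f(1) = 1."""
    if n <= 1:
        return 1
    return n * factorial(n - 1)

print(factorial(5))  # 120
```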

9 Introduction and Examples  Dynamic Programming
The basic idea: we want to solve a big problem that is hard to attack directly.
We first solve a few smaller but similar problems; if those can be solved, the solution to the big problem is easy to get.
To solve each of those smaller problems we use the same idea, first solving even smaller problems. Continuing in this way, we eventually reach problems we know how to solve.
Dynamic programming has the same structure; the difference is that at each step there may be some optimization involved.

10 Introduction and Examples  Shortest Path Problem
We have a graph and want to find the shortest path from $s$ to $t$.
Here $d_{ij}$ denotes the distance between node $i$ and node $j$.

11 Introduction and Examples  DP formulation for the shortest path problem
Let $V_i$ denote the shortest distance from $i$ to $t$. Eventually we want to compute $V_s$.
It is hard to compute $V_i$ directly in general. However, we can look one step ahead: if the first step is to move from $i$ to $j$, the shortest distance we can achieve is $d_{ij} + V_j$. To minimize the total distance, we choose $j$ to minimize $d_{ij} + V_j$.
Written as a formula: $V_i = \min_j \{ d_{ij} + V_j \}$.

12 Introduction and Examples  DP for the shortest path problem
We call this the recursion formula: $V_i = \min_j \{ d_{ij} + V_j \}$ for all $i$.
We also know that if we are already at the destination, the remaining distance is 0, i.e., $V_t = 0$.
These two equations are the DP formulation of the problem.

13 Introduction and Examples  Solve the DP
Given the formulation, how do we solve the DP? We use backward induction:
$V_i = \min_j \{ d_{ij} + V_j \}$ for all $i$, with $V_t = 0$.
Starting from the last node (whose value we know), we solve for the values $V_i$ backwards.

14 Introduction and Examples  Example
We have $V_t = 0$. Then $V_f = \min_{(f,j)\ \text{is an arc}} \{ d_{fj} + V_j \}$.
Here there is only one outgoing arc, so $V_f = d_{ft} + V_t = 5 + 0 = 5$. Similarly, $V_g = 2$.

15 Introduction and Examples  Example continued
We have $V_t = 0$, $V_f = 5$ and $V_g = 2$. Now consider $c$, $d$, $e$.
For $c$ and $e$ there is only one outgoing arc: $V_c = d_{cf} + V_f = 7$, $V_e = d_{eg} + V_g = 5$.
For $d$: $V_d = \min_{(d,j)\ \text{is an arc}} \{ d_{dj} + V_j \} = \min\{ d_{df} + V_f,\ d_{dg} + V_g \} = 10$.
The optimal choice at $d$ is to go to $g$.

16 Introduction and Examples  Example continued
We got $V_c = 7$, $V_d = 10$, $V_e = 5$. Now we compute $V_a$ and $V_b$:
$V_a = \min\{ d_{ac} + V_c,\ d_{ad} + V_d \} = \min\{ 3 + 7,\ \ldots \} = 10$
$V_b = \min\{ d_{bd} + V_d,\ d_{be} + V_e \} = \min\{ 1 + 10,\ 2 + 5 \} = 7$
The optimal choice at $a$ is to go to $c$, and the optimal choice at $b$ is to go to $e$.

17 Introduction and Examples  Example continued
Finally, $V_s = \min\{ d_{sa} + V_a,\ d_{sb} + V_b \} = \min\{ 1 + 10,\ 9 + 7 \} = 11$, and the optimal choice at $s$ is to go to $a$.
Therefore the shortest distance is 11, and by connecting the optimal choices we obtain the optimal path $s \to a \to c \to f \to t$.
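The following Python sketch reproduces the backward recursion $V_i = \min_j \{ d_{ij} + V_j \}$ on this example. Arc lengths not printed on the slides are inferred from the computed values ($d_{cf} = 2$, $d_{eg} = 3$, $d_{dg} = 8$), and the two arcs whose lengths cannot be recovered ($a \to d$ and $d \to f$) are omitted; this does not change any of the values $V_i$.

```python
# Arcs of the example graph: graph[i] is a list of (j, d_ij) pairs.
# Lengths marked 'inferred' are backed out from the V values on the slides.
graph = {
    's': [('a', 1), ('b', 9)],
    'a': [('c', 3)],            # arc a->d omitted (length not recoverable)
    'b': [('d', 1), ('e', 2)],
    'c': [('f', 2)],            # inferred from V_c = 7, V_f = 5
    'd': [('g', 8)],            # inferred from V_d = 10 via g; arc d->f omitted
    'e': [('g', 3)],            # inferred from V_e = 5, V_g = 2
    'f': [('t', 5)],
    'g': [('t', 2)],
    't': [],
}

V = {'t': 0.0}                  # boundary condition V_t = 0

def shortest(i):
    """Bellman recursion V_i = min_j { d_ij + V_j }, memoized in V."""
    if i not in V:
        V[i] = min(d + shortest(j) for j, d in graph[i])
    return V[i]

print(shortest('s'))            # 11.0, matching the slides
```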

18 Introduction and Examples  Summary of the example
In the example we had the quantities $V_i$, the shortest distance from $i$ to $t$. We call $V$ the value function.
We also had the nodes $s, a, b, \ldots, g, t$; we call them the states of the problem. The value function is a function of the state.
The recursion formula $V_i = \min_j \{ d_{ij} + V_j \}$ for all $i$ connects the value function at different states. It is known as the Bellman equation.

19 Introduction and Examples  Dynamic Programming Framework
The above is the general framework of dynamic programming. To formulate a problem as a DP, the first step is to define the states.
The state variables should be in some order (usually either time order or geographical order). In the shortest path example, the state is simply each node.
We call the set of all states the state space; in this example, the state space is the set of all nodes. Defining the appropriate state space is the most important step in DP.
Definition: a state is a collection of variables that summarizes all the (historical) information that is useful for (future) decisions. Conditioned on the state, the problem becomes Markov (i.e., memoryless).

20 Introduction and Examples  DP: Actions
There is an action set at each state; at state $x$, we denote it by $A(x)$.
For each action $a \in A(x)$, there is an immediate cost $r(x, a)$. If one takes action $a$, the system moves to the next state $s(x, a)$.
In the shortest path problem, the actions at each state are the arcs one can take; the arc length is the immediate cost, and after taking an arc the system is at the next node.

21 Introduction and Examples  DP: Value Functions
There is a value function $V(\cdot)$ at each state. The value function gives the optimal value achievable if you choose the optimal action from this state onward.
In the shortest path example, the value function is the shortest distance from the current state to the destination.
A recursion formula links the value functions:
$V(x) = \min_{a \in A(x)} \{ r(x, a) + V(s(x, a)) \}$
To solve the DP, one also has to know some terminal (boundary) values of $V$; in the shortest path example, $V_t = 0$.
The recursion for value functions is known as the Bellman equation.

22 Introduction and Examples  A more general framework
The framework above minimizes total cost. In some cases one wants to maximize total profit; then $r(x, a)$ is viewed as the immediate reward, and the DP becomes
$V(x) = \max_{a \in A(x)} \{ r(x, a) + V(s(x, a)) \}$
with some boundary conditions.

23 Introduction and Examples  Stochastic DP
In some cases, when you choose action $a$ at $x$, the next state is not certain (e.g., you decide a price, but the demand is random). Let $p(x, y, a)$ be the probability of moving from $x$ to $y$ when choosing action $a \in A(x)$. The recursion formula becomes
$V(x) = \min_{a \in A(x)} \{ r(x, a) + \sum_y p(x, y, a)\, V(y) \}$
or, using expectation notation,
$V(x) = \min_{a \in A(x)} \{ r(x, a) + \mathbb{E}\, V(s(x, a)) \}$.

24 Introduction and Examples  Example: Stochastic Shortest Path Problem
Stochastic setting: one no longer controls exactly which node to jump to next. Instead, one chooses between different actions $a \in A$; each action $a$ is associated with transition probabilities $p(j \mid i; a)$ for all $i, j \in S$, and the arc length $w_{ija}$ may be random.
Objective: decide on the action for every possible current node. In other words, find a policy or strategy that maps from $S$ to $A$.
Bellman equation for the stochastic SSP:
$V(i) = \min_a \sum_{j \in S} p(j \mid i; a) \big( w_{ija} + V(j) \big), \quad i \in S$.

25 Introduction and Examples  Bellman equation continued
We rewrite the Bellman equation
$V(i) = \min_a \sum_{j \in S} p(j \mid i; a) \big( w_{ija} + V(j) \big), \quad i \in S$
in vector form:
$V = \min_{\mu : S \to A} \{ g_\mu + P_\mu V \}$
where the average transition cost starting from $i$ is $g_\mu(i) = \sum_{j \in S} p(j \mid i, \mu(i))\, w_{ij, \mu(i)}$ and the transition probabilities are $P_\mu(i, j) = p(j \mid i, \mu(i))$.
$V(i)$ is the expected length of the stochastic shortest path starting from $i$, also known as the value function or cost-to-go vector. $V$ is the optimal value function (cost-to-go vector) that solves the Bellman equation.

26 Introduction and Examples  Example: Game of picking coins
Assume there is a row of $n$ coins with values $v_1, \ldots, v_n$, where $n$ is even. You and your opponent take turns picking either the first or the last coin in the row and obtaining its value; that coin is then removed from the row.
Both players want to maximize their total value.

27 Introduction and Examples  Example
Example: 1, 2, 10, 8.
If you choose 8, your opponent will choose 10, then you choose 2 and your opponent chooses 1. You get 10 and your opponent gets 11.
If you choose 1, your opponent will choose 8, you choose 10, and your opponent chooses 2. You get 11 and your opponent gets 10.
When there are many coins, it is not easy to find the optimal strategy. As this example shows, the greedy strategy doesn't work.

28 Introduction and Examples  Dynamic programming formulation
Given a sequence of values $v_1, \ldots, v_n$, let the state be the range of positions that are yet to be chosen. That is, the state is a pair $(i, j)$ with $i \le j$, meaning the remaining coins are $v_i, v_{i+1}, \ldots, v_j$.
The value function $V(i, j)$ is the maximum value you can collect if the game starts with coins $v_i, v_{i+1}, \ldots, v_j$ and you move first.
The action at state $(i, j)$ is simple: take either $i$ or $j$. The immediate reward: if you take $i$ you get $v_i$; if you take $j$ you get $v_j$.
What will the next state be if you choose $i$ at state $(i, j)$?

29 Introduction and Examples  Example continued
Consider the current state $(i, j)$ and suppose you picked $i$. Your opponent will choose either $i+1$ or $j$: if he chooses $i+1$ the state becomes $(i+2, j)$; if he chooses $j$ the state becomes $(i+1, j-1)$. He will choose to take the most value from the remaining coins, i.e., leave you with the least valuable remaining state. A similar argument applies if you take $j$. Therefore the recursion formula is
$V(i, j) = \max\{\, v_i + \min\{ V(i+2, j),\ V(i+1, j-1) \},\ \ v_j + \min\{ V(i+1, j-1),\ V(i, j-2) \} \,\}$
and we know $V(i, i+1) = \max\{ v_i, v_{i+1} \}$ (if only two coins remain, you pick the larger one).

30 Introduction and Examples  Dynamic programming formulation
Therefore, the DP for this problem is
$V(i, j) = \max\{\, v_i + \min\{ V(i+2, j),\ V(i+1, j-1) \},\ \ v_j + \min\{ V(i+1, j-1),\ V(i, j-2) \} \,\}$
with $V(i, i+1) = \max\{ v_i, v_{i+1} \}$ for all $i = 1, \ldots, n-1$.
How to solve this DP? We know $V(i, j)$ when $j - i = 1$. Then we can easily solve $V(i, j)$ for any pair with $j - i = 3$, and so on, all the way until we solve the initial problem. A memoized implementation is sketched below.
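A minimal memoized implementation of this recursion, assuming the coin values are given as a Python sequence of even length:

```python
from functools import lru_cache

def best_total(v):
    """Maximum total the first player can collect from the coin row v."""
    n = len(v)

    @lru_cache(maxsize=None)
    def V(i, j):
        # State (i, j): remaining coins are v[i], ..., v[j] and it is our turn.
        if j - i == 1:                        # two coins left: take the larger
            return max(v[i], v[j])
        take_left = v[i] + min(V(i + 2, j), V(i + 1, j - 1))
        take_right = v[j] + min(V(i + 1, j - 1), V(i, j - 2))
        return max(take_left, take_right)

    return V(0, n - 1)

print(best_total((1, 2, 10, 8)))  # 11, as in the example
```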

31 Introduction and Examples  Summary of this example
The state variable is a two-dimensional vector indicating the remaining range; this is one typical way to set up state variables. We used DP to solve for the optimal strategy of a two-person game.
In fact, DP is the main algorithmic idea behind computer game playing. For example, to solve a Go or chess position, one can define the state space as the configuration of the pieces/stones. The value function of a state is simply whether it is a winning or losing state.
At a given state, the computer considers all actions; the next state is the most adversarial state the opponent can leave you in after your move and his move. The boundary states are the checkmate states.

32 Introduction and Examples  DP for games
DP is roughly how computers play games. In fact, if you just code up the DP, the code will probably be no more than a page (excluding the interface, of course).
However, there is one major obstacle: the state space is huge. For chess, each piece can occupy one of the 64 squares (or have been removed), giving an astronomically large number of states. For Go, each of the 361 points on the board can be occupied by white, black, or neither, giving on the order of $3^{361}$ states.
It is impossible for current computers to solve problems of that size exactly. This is called the curse of dimensionality. To address it, people have developed approximate dynamic programming techniques to obtain good approximate solutions (both an approximate value function and an approximate optimal strategy).

33 Infinite-Horizon DP: Theory and Algorithms  Markov Decision Process
Stationary system: $x_{k+1} = f(x_k, u_k, w_k)$, $k = 0, 1, \ldots$
Cost of a policy $\pi = \{\mu_0, \mu_1, \ldots\}$:
$J_\pi(x_0) = \lim_{N \to \infty} \mathbb{E}_{w_k,\, k = 0,1,\ldots} \Big\{ \sum_{k=0}^{N-1} \alpha^k g(x_k, \mu_k(x_k), w_k) \Big\}$
with $\alpha < 1$, and $g$ bounded [for some $M$, we have $|g(x, u, w)| \le M$ for all $(x, u, w)$].
$x_k$: state, summarizing past information that is relevant for future optimization
$u_k$: control/action, selected at time $k$ from a given set $U_k$
$w_k$: random disturbance or noise
$g(x_k, u_k, w_k)$: stage (transition) cost incurred at time $k$ at state $x_k$
The optimal cost function is defined as $J^*(x) = \min_\pi J_\pi(x)$.

34 Infinite-Horizon DP: Theory and Algorithms  Markov Decision Process
Shorthand notation for DP mappings: for any function $J$ of $x$, denote
$(TJ)(x) = \min_{u \in U(x)} \mathbb{E}_w \{ g(x, u, w) + \alpha J(f(x, u, w)) \}, \quad \forall x$
$TJ$ is the optimal cost function for the one-stage problem with stage cost $g$ and terminal cost function $\alpha J$. $T$ operates on bounded functions of $x$ to produce other bounded functions of $x$.
For any stationary policy $\mu$, denote
$(T_\mu J)(x) = \mathbb{E}_w \{ g(x, \mu(x), w) + \alpha J(f(x, \mu(x), w)) \}, \quad \forall x$
The critical structure of the problem is captured in $T$ and $T_\mu$. The entire theory of discounted problems can be developed in shorthand using $T$ and $T_\mu$; the same is true for many other DP problems. $T$ and $T_\mu$ provide a powerful unifying framework for DP.

35 Infinite-Horizon DP: Theory and Algorithms  Markov Decision Process
Boundedness of $g$ guarantees that all costs are well-defined and bounded: $|J_\pi(x)| \le \frac{M}{1 - \alpha}$.
An important special case is the finite-space Markovian Decision Problem. All algorithms ultimately work with a finite-space MDP approximating the original problem.

36 Infinite-Horizon DP: Theory and Algorithms  Markov Decision Process
Markov Chain: a Markov chain is a random process taking values in the state space $\{1, \ldots, n\}$. The process evolves according to a transition probability matrix $P \in \mathbb{R}^{n \times n}$, where
$P(i_{k+1} = j \mid i_k, i_{k-1}, \ldots, i_0) = P(i_{k+1} = j \mid i_k = i) = P_{ij}$
A Markov chain is memoryless: conditioned on the current state, the future evolution is independent of the past trajectory. The memoryless property is equivalent to the Markov property.
A state $i$ is recurrent if it is visited infinitely many times with probability 1. A Markov chain is irreducible if its state space is a single communicating class; in other words, if it is possible to get to any state from any state.
When states are modeled appropriately, all stochastic processes are Markov.

37 Infinite-Horizon DP: Theory and Algorithms  Markov Decision Process
Markovian Decision Problem: we will mostly assume the system is an $n$-state (controlled) Markov chain.
States $i = 1, \ldots, n$
Transition probabilities $p_{i_k i_{k+1}}(u_k)$
Stage cost $g(i_k, u_k, i_{k+1})$
Cost functions $J = (J(1), \ldots, J(n))$ (vectors in $\mathbb{R}^n$)
Cost of a policy $\pi = \{\mu_0, \mu_1, \ldots\}$:
$J_\pi(i) = \lim_{N \to \infty} \mathbb{E}_{i_k,\, k=1,2,\ldots} \Big\{ \sum_{k=0}^{N-1} \alpha^k g(i_k, \mu_k(i_k), i_{k+1}) \,\Big|\, i_0 = i \Big\}$

38 Infinite-Horizon DP: Theory and Algorithms  Markov Decision Process
Markovian Decision Problem: shorthand notation for the DP mappings
$(TJ)(i) = \min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) \big( g(i, u, j) + \alpha J(j) \big), \quad i = 1, \ldots, n$
$(T_\mu J)(i) = \sum_{j=1}^n p_{ij}(\mu(i)) \big( g(i, \mu(i), j) + \alpha J(j) \big), \quad i = 1, \ldots, n$
Vector form of the DP mappings:
$TJ = \min_\mu \{ g_\mu + \alpha P_\mu J \}, \qquad T_\mu J = g_\mu + \alpha P_\mu J$
where $g_\mu(i) = \sum_{j=1}^n p_{ij}(\mu(i))\, g(i, \mu(i), j)$ and $P_\mu(i, j) = p_{ij}(\mu(i))$.
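As an illustration, the following sketch builds the mappings $T$ and $T_\mu$ for a finite MDP stored as NumPy arrays p[u, i, j] and g[u, i, j]; this array layout is an assumption made here for concreteness, not something fixed by the slides.

```python
import numpy as np

def bellman_ops(p, g, alpha):
    """Return the DP mappings T and T_mu for a finite MDP.

    p[u, i, j]: transition probability from i to j under action u
    g[u, i, j]: stage cost of that transition
    """
    # Expected one-stage cost for each (action, state): sum_j p_ij(u) g(i,u,j)
    g_bar = (p * g).sum(axis=2)                 # shape (num_actions, n)

    def T(J):
        # (T J)(i) = min_u [ g_bar(u,i) + alpha * sum_j p_ij(u) J(j) ]
        return (g_bar + alpha * p @ J).min(axis=0)

    def T_mu(J, mu):
        # (T_mu J)(i) = g_bar(mu(i),i) + alpha * sum_j p_ij(mu(i)) J(j)
        idx = np.arange(p.shape[1])
        return g_bar[mu, idx] + alpha * (p[mu, idx, :] @ J)

    return T, T_mu
```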

39 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
Two key properties.
Monotonicity: for any $J$ and $J'$ such that $J(x) \le J'(x)$ for all $x$, and any $\mu$,
$(TJ)(x) \le (TJ')(x)$ and $(T_\mu J)(x) \le (T_\mu J')(x)$ for all $x$.
Constant shift: for any $J$, any scalar $r$, and any $\mu$,
$(T(J + re))(x) = (TJ)(x) + \alpha r$ and $(T_\mu(J + re))(x) = (T_\mu J)(x) + \alpha r$ for all $x$,
where $e$ is the unit function [$e(x) \equiv 1$].
Monotonicity is present in all DP models (undiscounted, etc.); the constant shift property is special to discounted models.
Discounted problems have another property of major importance: $T$ and $T_\mu$ are contraction mappings (we will show this later).

40 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
Convergence of Value Iteration. Theorem: for all bounded $J_0$, we have $J^*(x) = \lim_{k \to \infty} (T^k J_0)(x)$ for all $x$.
Proof. For simplicity we give the proof for $J_0 \equiv 0$. For any initial state $x_0$ and policy $\pi = \{\mu_0, \mu_1, \ldots\}$,
$J_\pi(x_0) = \mathbb{E} \Big\{ \sum_{l=0}^{\infty} \alpha^l g(x_l, \mu_l(x_l), w_l) \Big\} = \mathbb{E} \Big\{ \sum_{l=0}^{k-1} \alpha^l g(x_l, \mu_l(x_l), w_l) \Big\} + \mathbb{E} \Big\{ \sum_{l=k}^{\infty} \alpha^l g(x_l, \mu_l(x_l), w_l) \Big\}$
The tail portion satisfies
$\Big| \mathbb{E} \Big\{ \sum_{l=k}^{\infty} \alpha^l g(x_l, \mu_l(x_l), w_l) \Big\} \Big| \le \frac{\alpha^k M}{1 - \alpha}$
where $M \ge |g(x, u, w)|$. Take the minimum over $\pi$ on both sides, then take the limit as $k \to \infty$.

41 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
Proof of Bellman's equation. Theorem: the optimal cost function $J^*$ is a solution of Bellman's equation $J^* = TJ^*$, i.e., for all $x$,
$J^*(x) = \min_{u \in U(x)} \mathbb{E}_w \{ g(x, u, w) + \alpha J^*(f(x, u, w)) \}$
Proof. For all $x$ and $k$,
$J^*(x) - \frac{\alpha^k M}{1 - \alpha} \le (T^k J_0)(x) \le J^*(x) + \frac{\alpha^k M}{1 - \alpha}$
where $J_0(x) \equiv 0$ and $M \ge |g(x, u, w)|$. Applying $T$ to this relation, and using monotonicity and constant shift,
$(TJ^*)(x) - \frac{\alpha^{k+1} M}{1 - \alpha} \le (T^{k+1} J_0)(x) \le (TJ^*)(x) + \frac{\alpha^{k+1} M}{1 - \alpha}$
Taking the limit as $k \to \infty$ and using the fact $\lim_{k \to \infty} (T^{k+1} J_0)(x) = J^*(x)$, we obtain $J^* = TJ^*$.

42 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
The contraction property: for any bounded functions $J$ and $J'$, and any $\mu$,
$\max_x |(TJ)(x) - (TJ')(x)| \le \alpha \max_x |J(x) - J'(x)|$
$\max_x |(T_\mu J)(x) - (T_\mu J')(x)| \le \alpha \max_x |J(x) - J'(x)|$
Proof. Denote $c = \max_{x \in S} |J(x) - J'(x)|$. Then
$J(x) - c \le J'(x) \le J(x) + c, \quad \forall x$
Apply $T$ to both sides and use the monotonicity and constant shift properties:
$(TJ)(x) - \alpha c \le (TJ')(x) \le (TJ)(x) + \alpha c, \quad \forall x$
Hence $|(TJ)(x) - (TJ')(x)| \le \alpha c$ for all $x$ (the argument for $T_\mu$ is identical).
This implies that $T$ and $T_\mu$ have unique fixed points: $J^*$ is the unique solution of $J = TJ$, and $J_\mu$ is the unique solution of $J = T_\mu J$.

43 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
Necessary and sufficient optimality condition. Theorem: a stationary policy $\mu$ is optimal if and only if $\mu(x)$ attains the minimum in Bellman's equation for each $x$, i.e.,
$TJ^* = T_\mu J^*$
or, equivalently, for all $x$,
$\mu(x) \in \arg\min_{u \in U(x)} \mathbb{E}_w \{ g(x, u, w) + \alpha J^*(f(x, u, w)) \}$

44 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
Proof of the optimality condition. There are two directions.
If $TJ^* = T_\mu J^*$, then using Bellman's equation ($J^* = TJ^*$) we have $J^* = T_\mu J^*$, so by uniqueness of the fixed point of $T_\mu$ we obtain $J^* = J_\mu$; i.e., $\mu$ is optimal.
Conversely, if the stationary policy $\mu$ is optimal, we have $J^* = J_\mu$, so $J^* = T_\mu J^*$. Combining this with Bellman's equation ($J^* = TJ^*$), we obtain $TJ^* = T_\mu J^*$.

45 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
Two main algorithms for MDP.
Value Iteration (VI): solve the Bellman equation $J^* = TJ^*$ by iterating on the value functions, $J_{k+1} = TJ_k$, i.e.,
$J_{k+1}(i) = \min_u \sum_{j=1}^n p_{ij}(u) \big( g(i, u, j) + \alpha J_k(j) \big), \quad i = 1, \ldots, n$
The program only needs to store the current value function $J_k$. We have shown that $J_k \to J^*$ as $k \to \infty$. A sketch follows below.
Policy Iteration (PI): solve the Bellman equation $J^* = TJ^*$ by iterating on the policies.
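A minimal NumPy sketch of value iteration under the same assumed array layout p[u, i, j], g[u, i, j]; the tolerance and iteration cap are arbitrary choices.

```python
import numpy as np

def value_iteration(p, g, alpha, tol=1e-8, max_iters=10_000):
    """Iterate J_{k+1} = T J_k until the sup-norm change is below tol."""
    num_actions, n, _ = p.shape
    g_bar = (p * g).sum(axis=2)                      # expected stage cost per (u, i)
    J = np.zeros(n)
    for _ in range(max_iters):
        J_new = (g_bar + alpha * p @ J).min(axis=0)  # componentwise (T J)(i)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    policy = (g_bar + alpha * p @ J).argmin(axis=0)  # greedy policy w.r.t. the final J
    return J, policy
```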

46 Infinite-Horizon DP: Theory and Algorithms  Policy Iteration
Policy Iteration (PI): given $\mu^k$, the $k$-th policy iteration has two steps.
Policy evaluation: find $J_{\mu^k}$ by solving
$J_{\mu^k}(i) = \sum_{j=1}^n p_{ij}(\mu^k(i)) \big( g(i, \mu^k(i), j) + \alpha J_{\mu^k}(j) \big), \quad i = 1, \ldots, n$
i.e., $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$.
Policy improvement: let $\mu^{k+1}$ be such that
$\mu^{k+1}(i) \in \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) \big( g(i, u, j) + \alpha J_{\mu^k}(j) \big), \quad \forall i$
i.e., $T_{\mu^{k+1}} J_{\mu^k} = TJ_{\mu^k}$.
Policy iteration updates the policy instead of the value function.

47 Infinite-Horizon DP: Theory and Algorithms  Policy Iteration
Policy Iteration (PI), more abstractly: the $k$-th policy iteration has two steps.
Policy evaluation: find $J_{\mu^k}$ by solving $J_{\mu^k} = T_{\mu^k} J_{\mu^k} = g_{\mu^k} + \alpha P_{\mu^k} J_{\mu^k}$
Policy improvement: let $\mu^{k+1}$ be such that $T_{\mu^{k+1}} J_{\mu^k} = TJ_{\mu^k}$.
Comments: policy evaluation is equivalent to solving an $n \times n$ linear system of equations; policy improvement is equivalent to a 1-step lookahead using the evaluated value function. For large $n$, exact PI is out of the question; we use instead optimistic PI (policy evaluation with a few VIs). A sketch of exact PI follows below.
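A sketch of exact policy iteration, again assuming the MDP is given as arrays p[u, i, j] and g[u, i, j]; policy evaluation is done by solving the n-by-n linear system $(I - \alpha P_\mu) J = g_\mu$.

```python
import numpy as np

def policy_iteration(p, g, alpha):
    """Exact policy iteration for a finite discounted MDP."""
    num_actions, n, _ = p.shape
    g_bar = (p * g).sum(axis=2)                       # expected stage cost per (u, i)
    mu = np.zeros(n, dtype=int)                       # arbitrary initial policy
    idx = np.arange(n)
    while True:
        # Policy evaluation: solve (I - alpha P_mu) J = g_mu
        P_mu = p[mu, idx, :]
        g_mu = g_bar[mu, idx]
        J = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        # Policy improvement: one-step lookahead with the evaluated J
        mu_new = (g_bar + alpha * p @ J).argmin(axis=0)
        if np.array_equal(mu_new, mu):                # no change: optimal policy found
            return J, mu
        mu = mu_new
```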

48 Infinite-Horizon DP: Theory and Algorithms  Policy Iteration
Convergence of Policy Iteration. Theorem: assume that the state and action spaces are finite. Policy iteration generates a sequence $\mu^k$ that reaches an optimal policy $\mu^*$ in a finite number of steps.
Proof. We show that $J_{\mu^k} \ge J_{\mu^{k+1}}$ for all $k$. For a given $k$, we have
$J_{\mu^k} = T_{\mu^k} J_{\mu^k} \ge TJ_{\mu^k} = T_{\mu^{k+1}} J_{\mu^k}$
Using the monotonicity property of DP,
$J_{\mu^k} \ge T_{\mu^{k+1}} J_{\mu^k} \ge T_{\mu^{k+1}}^2 J_{\mu^k} \ge \cdots \ge \lim_{N \to \infty} T_{\mu^{k+1}}^N J_{\mu^k}$
Since $\lim_{N \to \infty} T_{\mu^{k+1}}^N J_{\mu^k} = J_{\mu^{k+1}}$, we have $J_{\mu^k} \ge J_{\mu^{k+1}}$.
If $J_{\mu^k} = J_{\mu^{k+1}}$, all the inequalities above hold as equalities, so $J_{\mu^k}$ solves Bellman's equation and hence $J_{\mu^k} = J^*$.
Thus at iteration $k$ the algorithm either generates a strictly improved policy or finds an optimal policy. For a finite-space MDP, the algorithm terminates with an optimal policy.

49 Infinite-Horizon DP: Theory and Algorithms  Policy Iteration
Shorthand theory, a summary.
Infinite horizon cost function expressions [with $J_0(x) \equiv 0$]:
$J_\pi(x) = \lim_{N \to \infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_N} J_0)(x), \qquad J_\mu(x) = \lim_{N \to \infty} (T_\mu^N J_0)(x)$
Bellman's equation: $J^* = TJ^*$, $J_\mu = T_\mu J_\mu$
Optimality condition: $\mu$ is optimal $\iff$ $T_\mu J^* = TJ^*$
Value iteration: for any bounded $J$, $J^*(x) = \lim_{k \to \infty} (T^k J)(x)$ for all $x$
Policy iteration: given $\mu^k$,
Policy evaluation: find $J_{\mu^k}$ by solving $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$
Policy improvement: find $\mu^{k+1}$ such that $T_{\mu^{k+1}} J_{\mu^k} = TJ_{\mu^k}$

50 Infinite-Horizon DP: Theory and Algorithms  Q-factors
Optimal Q-factor of $(x, u)$:
$Q^*(x, u) = \mathbb{E} \{ g(x, u, w) + \alpha J^*(f(x, u, w)) \}$
It is the cost of starting at $x$, applying $u$ in the first stage, and following an optimal policy after the first stage.
The optimal value function satisfies $J^*(x) = \min_{u \in U(x)} Q^*(x, u)$ for all $x$.
Q-factors are costs in an augmented problem whose states are the pairs $(x, u)$; here $(x, u)$ is a post-decision state.

51 Infinite-Horizon DP: Theory and Algorithms  Q-factors
VI in Q-factors: we can equivalently write the VI method as
$J_{k+1}(x) = \min_{u \in U(x)} Q_{k+1}(x, u), \quad \forall x$
where $Q_{k+1}$ is generated by
$Q_{k+1}(x, u) = \mathbb{E} \Big\{ g(x, u, w) + \alpha \min_{v \in U(f(x, u, w))} Q_k(f(x, u, w), v) \Big\}$
VI converges for Q-factors.

52 Infinite-Horizon DP: Theory and Algorithms  Q-factors
Q-Factors: VI and PI for Q-factors are mathematically equivalent to VI and PI for costs. They require an equal amount of computation; they just need more storage.
Having the optimal Q-factors is convenient when implementing an optimal policy on-line, via
$\mu^*(x) \in \arg\min_{u \in U(x)} Q^*(x, u)$
Once $Q^*(x, u)$ is known, the model [$g$ and $\mathbb{E}\{\cdot\}$] is not needed: model-free operation.
Q-learning (to be discussed later) is a sampling method that calculates $Q^*(x, u)$ using a simulator of the system (no model needed).

53 Infinite-Horizon DP: Theory and Algorithms  Q-factors
MDP and Q-Factors: the optimal Q-factors form the function $Q^*(i, u)$ that satisfies the Bellman equation
$Q^*(i, u) = \sum_{j=1}^n p_{ij}(u) \Big( g(i, u, j) + \alpha \min_{v \in U(j)} Q^*(j, v) \Big)$
or, in short, $Q^* = FQ^*$.
Interpretation: Q-factors can be viewed as $J$ values obtained by considering $(i, u)$ as a post-decision state.
DP algorithms for Q-values instead of J-values:
Value iteration: $Q_{k+1} = FQ_k$ (sketched below)
Policy iteration: $F_{\mu^{k+1}} Q_{\mu^k} = FQ_{\mu^k}$
VI and PI are convergent for Q-values, and model-free.
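A minimal sketch of value iteration on Q-factors, $Q_{k+1} = FQ_k$, under the same assumed array layout p[u, i, j], g[u, i, j] used in the earlier sketches.

```python
import numpy as np

def q_value_iteration(p, g, alpha, tol=1e-8):
    """Iterate Q_{k+1}(i,u) = sum_j p_ij(u) ( g(i,u,j) + alpha * min_v Q_k(j,v) )."""
    num_actions, n, _ = p.shape
    g_bar = (p * g).sum(axis=2)                     # expected stage cost, shape (u, i)
    Q = np.zeros((num_actions, n))
    while True:
        J = Q.min(axis=0)                           # min_v Q_k(j, v), shape (n,)
        Q_new = g_bar + alpha * p @ J               # (F Q_k)(i, u)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```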

54 A Primer on ADP  Practical Difficulties of DP
The curse of dimensionality: exponential growth of the computational and storage requirements as the number of state and control variables increases; quick explosion of the number of states in combinatorial problems.
The curse of modeling: sometimes a simulator of the system is easier to construct than a model.
There may be real-time solution constraints: a family of problems may need to be addressed, with the data of the problem to be solved given with little advance notice; the problem data may change as the system is controlled, creating a need for on-line replanning.
All of the above motivate approximation and simulation.

55 A Primer on ADP  Why do we use Simulation?
One reason: a computational complexity advantage in computing sums/expectations involving a very large number of terms. Any sum $\sum_{i=1}^n a_i$ can be written as an expected value:
$\sum_{i=1}^n a_i = \sum_{i=1}^n \xi_i \frac{a_i}{\xi_i} = \mathbb{E}_\xi \Big\{ \frac{a_i}{\xi_i} \Big\}$
where $\xi$ is any probability distribution over $\{1, \ldots, n\}$. It can be approximated by generating many samples $\{i_1, \ldots, i_k\}$ from $\{1, \ldots, n\}$ according to the distribution $\xi$, and Monte Carlo averaging:
$\sum_{i=1}^n a_i = \mathbb{E}_\xi \Big\{ \frac{a_i}{\xi_i} \Big\} \approx \frac{1}{k} \sum_{t=1}^k \frac{a_{i_t}}{\xi_{i_t}}$
Simulation is also convenient when an analytical model of the system is unavailable, but a simulation/computer model is possible.
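A small sketch of this Monte Carlo averaging idea; the uniform choice of $\xi$ below is only for illustration, and any positive distribution over the indices would do (a well-chosen $\xi$ reduces the variance of the estimate).

```python
import numpy as np

def mc_sum(a, k, seed=0):
    """Estimate sum_i a_i by sampling indices from a distribution xi (uniform here)."""
    rng = np.random.default_rng(seed)
    n = len(a)
    xi = np.full(n, 1.0 / n)                     # sampling distribution over {0, ..., n-1}
    idx = rng.choice(n, size=k, p=xi)            # samples i_1, ..., i_k ~ xi
    return np.mean(a[idx] / xi[idx])             # Monte Carlo average of a_i / xi_i

a = np.random.default_rng(1).normal(size=10_000)
print(mc_sum(a, k=2_000), a.sum())               # the two numbers should be close
```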

56 Dimension Reduction in RL  Approximation in Value Space
Approximation in value space: approximate $J^*$ or $J_\mu$ from a parametric class $\tilde J(i; r)$, where $i$ is the current state and $r = (r_1, \ldots, r_m)$ is a vector of tunable scalar weights.
Use $\tilde J$ in place of $J^*$ or $J_\mu$ in various algorithms and computations.
Role of $r$: by adjusting $r$ we can change the shape of $\tilde J$ so that it is close to $J^*$ or $J_\mu$.
A simulator may be used, particularly when there is no mathematical model of the system (but there is a computer model). We will focus on simulation, but this is not the only possibility.
We may also use parametric approximation for Q-factors or cost function differences.

57 Dimension Reduction in RL  Approximation in Value Space
Approximation architectures. Two key issues:
The choice of the parametric class $\tilde J(i; r)$ (the approximation architecture)
The method for tuning the weights (training the architecture)
Success depends strongly on how these issues are handled, and also on insight about the problem.
Architectures are divided into linear and nonlinear [i.e., linear or nonlinear dependence of $\tilde J(i; r)$ on $r$]. Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer.

58 Dimension Reduction in RL  Approximation in Value Space
Computer chess example: think of the board position as the state and the move as the control. It uses a feature-based position evaluator that assigns a score (or approximate Q-factor) to each position/move: relatively few special features and weights, combined with multistep lookahead.

59 Dimension Reduction in RL  Approximation in Value Space
Linear approximation architectures: with well-chosen features, we can use a linear architecture
$\tilde J(i; r) = \phi(i)' r, \quad i = 1, \ldots, n, \qquad \text{or} \qquad \tilde J(r) = \Phi r = \sum_{j=1}^s \Phi_j r_j$
where $\Phi$ is the matrix whose rows are $\phi(i)'$, $i = 1, \ldots, n$, and $\Phi_j$ is its $j$-th column.
This is approximation on the subspace $S = \{ \Phi r \mid r \in \mathbb{R}^s \}$ spanned by the columns of $\Phi$ (the basis functions).

60 Dimension Reduction in RL  Approximation in Value Space
Often, the features encode much of the nonlinearity inherent in the cost function being approximated.
Many examples of feature types: polynomial approximation, radial basis functions, etc.

61 Dimension Reduction in RL  Approximation in Value Space
Example: polynomial type. Polynomial approximation, e.g., a quadratic approximating function. Let the state be $i = (i_1, \ldots, i_q)$ (i.e., it has $q$ dimensions) and define
$\phi_0(i) = 1, \quad \phi_k(i) = i_k, \quad \phi_{km}(i) = i_k i_m, \quad k, m = 1, \ldots, q$
Linear approximation architecture:
$\tilde J(i; r) = r_0 + \sum_{k=1}^q r_k i_k + \sum_{k=1}^q \sum_{m=k}^q r_{km} i_k i_m$
where $r$ has components $r_0$, $r_k$, and $r_{km}$ (a sketch follows below).
Interpolation: a subset $I$ of special/representative states is selected, and the parameter vector $r$ has one component $r_i$ per state $i \in I$. The approximating function is $\tilde J(i; r) = r_i$ for $i \in I$, and $\tilde J(i; r)$ = interpolation using the values at $i \in I$ for $i \notin I$.
For example: piecewise constant, piecewise linear, or more general polynomial interpolations.
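A sketch of the quadratic feature map and the resulting linear architecture $\tilde J(i; r) = \phi(i)' r$; the example state and weight vector at the bottom are hypothetical.

```python
import numpy as np

def quadratic_features(i):
    """phi(i) = (1, i_1, ..., i_q, i_k * i_m for k <= m) for a state i in R^q."""
    i = np.asarray(i, dtype=float)
    q = len(i)
    cross = [i[k] * i[m] for k in range(q) for m in range(k, q)]
    return np.concatenate(([1.0], i, cross))

def J_tilde(i, r):
    """Linear architecture: approximate cost as the inner product phi(i)' r."""
    return quadratic_features(i) @ r

phi = quadratic_features([2.0, 3.0])       # -> [1, 2, 3, 4, 6, 9]
r = np.zeros(len(phi)); r[0] = 5.0         # hypothetical weight vector
print(J_tilde([2.0, 3.0], r))              # 5.0
```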

62 Dimension Reduction in RL  Approximation in Value Space
Another example (Tetris): $J^*(i)$ is the optimal score starting from position $i$. The number of states is astronomically large for a full board.
Success has been achieved with just 22 features, readily recognized by Tetris players as capturing important aspects of the board position (heights of columns, etc.).

63 Dimension Reduction in RL  Approximation in Policy Space
Approximation in policy space (a brief discussion; we will return to it later): use a parametrization $\tilde\mu(i; r)$ of policies with a vector $r = (r_1, \ldots, r_s)$. Examples:
Polynomial, e.g., $\tilde\mu(i; r) = r_1 + r_2 i + r_3 i^2$
Linear feature-based, $\tilde\mu(i; r) = \phi_1(i) r_1 + \phi_2(i) r_2$

64 Dimension Reduction in RL  Approximation in Policy Space
Approximation in policy space: optimize the cost over $r$. For example, each value of $r$ defines a stationary policy, with cost starting at state $i$ denoted by $\tilde J(i; r)$. Let $(p_1, \ldots, p_n)$ be some probability distribution over the states, and minimize over $r$
$\sum_{i=1}^n p_i \tilde J(i; r)$
using a random search, gradient, or other method.
A special case: the parameterization of the policies is indirect, through a cost approximation architecture $\hat J$, i.e.,
$\tilde\mu(i; r) \in \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) \big( g(i, u, j) + \alpha \hat J(j; r) \big)$

65 Dimension Reduction in RL  State Aggregation
Aggregation, a first idea: group similar states together into aggregate states $x_1, \ldots, x_s$; assign a common cost value $r_i$ to each group $x_i$. Solve an aggregate DP problem involving the aggregate states to obtain $r = (r_1, \ldots, r_s)$. This is called hard aggregation.

66 Dimension Reduction in RL  State Aggregation
A more general/mathematical view: solve
$\Phi r = \Phi D\, T_\mu(\Phi r)$
where the rows of $D$ and $\Phi$ are probability distributions (e.g., $D$ and $\Phi$ aggregate rows and columns of the linear system $J = T_\mu J$).
Compare with the projected equation $\Phi r = \Pi T_\mu(\Phi r)$. Note: $\Phi D$ is a projection in some interesting cases.

67 Dimension Reduction in RL  State Aggregation
Aggregation as problem abstraction: aggregation can be viewed as a systematic approach to problem approximation. Main elements:
Solve (exactly or approximately) the aggregate problem by any kind of VI or PI method
Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem

68 Dimension Reduction in RL  State Aggregation
Aggregate system description: the transition probability from aggregate state $x$ to aggregate state $y$ under control $u$ is
$\hat p_{xy}(u) = \sum_{i=1}^n d_{xi} \sum_{j=1}^n p_{ij}(u)\, \phi_{jy}, \qquad \text{or} \qquad \hat P(u) = D P(u) \Phi$
where the rows of $D$ and $\Phi$ are the disaggregation and aggregation probabilities.
The expected transition cost is
$\hat g(x, u) = \sum_{i=1}^n d_{xi} \sum_{j=1}^n p_{ij}(u)\, g(i, u, j), \qquad \text{or} \qquad \hat g = D P(u) g$

69 Dimension Reduction in RL  State Aggregation
Aggregate Bellman's equation: the optimal cost function of the aggregate problem, denoted $\hat R$, satisfies
$\hat R(x) = \min_{u \in U} \Big[ \hat g(x, u) + \alpha \sum_y \hat p_{xy}(u) \hat R(y) \Big], \quad \forall x$
which is Bellman's equation for the aggregate problem.
The optimal cost function $J^*$ of the original problem is then approximated by $\tilde J$ given by
$\tilde J(j) = \sum_y \phi_{jy} \hat R(y), \quad \forall j$
A sketch of the construction follows below.
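A sketch of how the aggregate quantities $\hat p$ and $\hat g$ can be assembled from $D$, $\Phi$, and the original MDP arrays p[u, i, j], g[u, i, j] (the array layout is an assumption carried over from the earlier sketches).

```python
import numpy as np

def aggregate_mdp(p, g, D, Phi):
    """Build the aggregate MDP quantities p_hat(u) = D P(u) Phi and g_hat(x, u).

    p[u, i, j], g[u, i, j]: original MDP.
    D (s x n): disaggregation probabilities; Phi (n x s): aggregation probabilities.
    """
    num_actions = p.shape[0]
    # p_hat_xy(u) = sum_i d_xi sum_j p_ij(u) phi_jy
    p_hat = np.stack([D @ p[u] @ Phi for u in range(num_actions)])
    # g_hat(x, u) = sum_i d_xi sum_j p_ij(u) g(i, u, j)  (already an expected cost)
    g_hat = np.stack([D @ (p[u] * g[u]).sum(axis=1) for u in range(num_actions)])
    # Solve R(x) = min_u [ g_hat(x,u) + alpha sum_y p_hat_xy(u) R(y) ] by any VI/PI
    # variant, then approximate the original costs by J_tilde = Phi @ R_hat.
    return p_hat, g_hat
```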

70 Dimension Reduction in RL  State Aggregation
Example I: hard aggregation. Group the original system states into subsets, and view each subset as an aggregate state.
Aggregation probabilities: $\phi_{jy} = 1$ if $j$ belongs to aggregate state $y$.
Disaggregation probabilities: there are many possibilities, e.g., all states $i$ within aggregate state $x$ have equal probability $d_{xi}$.
If the optimal cost vector $J^*$ is piecewise constant over the aggregate states/subsets, hard aggregation is exact. This suggests grouping states with roughly equal cost into aggregates.
A variant: soft aggregation (provides soft boundaries between aggregate states).

71 Dimension Reduction in RL  State Aggregation
Example II: feature-based aggregation. Important question: how do we group states together? If we know good features, it makes sense to group together states that have similar features.
This gives a general approach for passing from a feature-based state representation to a hard-aggregation-based architecture: essentially discretize the features and generate a corresponding piecewise constant approximation to the optimal cost function.
The aggregation-based architecture is more powerful (it is nonlinear in the features), but it may require many more aggregate states to reach the same level of performance as the corresponding linear feature-based architecture.

72 Dimension Reduction in RL  State Aggregation
Example III: representative states / coarse grid. Choose a collection of representative original system states, and associate each of them with an aggregate state.
Disaggregation probabilities: $d_{xi} = 1$ if $i$ is equal to representative state $x$.
Aggregation probabilities associate each original system state $j$ with a convex combination $\sum_y \phi_{jy}\, y$ of representative states.
Well suited for discretization of a Euclidean space; extends nicely to continuous state spaces, including the belief space of a POMDP.

73 On-Policy Learning  Direct Projection
Direct policy evaluation: approximate the cost of the current policy by using least squares and simulation-generated cost samples. This amounts to projection of $J_\mu$ onto the approximation subspace.
Solution by least-squares methods; used within regular and optimistic policy iteration. Nonlinear approximation architectures may also be used.

74 On-Policy Learning  Direct Projection
Direct evaluation by simulation: projection by Monte Carlo simulation. Compute the projection $\Pi J_\mu$ of $J_\mu$ on the subspace $S = \{ \Phi r \mid r \in \mathbb{R}^s \}$, with respect to a weighted Euclidean norm $\|\cdot\|_\xi$.
Equivalently, find $\Phi r^*$, where
$r^* = \arg\min_{r \in \mathbb{R}^s} \|\Phi r - J_\mu\|_\xi^2 = \arg\min_{r \in \mathbb{R}^s} \sum_{i=1}^n \xi_i \big( \phi(i)' r - J_\mu(i) \big)^2$
Setting the gradient at $r^*$ to 0,
$r^* = \Big( \sum_{i=1}^n \xi_i\, \phi(i) \phi(i)' \Big)^{-1} \sum_{i=1}^n \xi_i\, \phi(i) J_\mu(i)$

75 On-Policy Learning  Direct Projection
Direct evaluation by simulation: generate samples $\{ (i_1, J_\mu(i_1)), \ldots, (i_k, J_\mu(i_k)) \}$ using the distribution $\xi$. Approximate the two expected values by Monte Carlo, with low-dimensional calculations:
$\hat r_k = \Big( \sum_{t=1}^k \phi(i_t) \phi(i_t)' \Big)^{-1} \sum_{t=1}^k \phi(i_t) J_\mu(i_t)$
Equivalent least-squares alternative calculation:
$\hat r_k = \arg\min_{r \in \mathbb{R}^s} \sum_{t=1}^k \big( \phi(i_t)' r - J_\mu(i_t) \big)^2$
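A minimal sketch of this direct fit: given sampled feature rows $\phi(i_t)$ and cost samples $J_\mu(i_t)$ (both assumed supplied by the simulator), solve the least-squares problem for $\hat r_k$.

```python
import numpy as np

def direct_fit(features, costs):
    """Least-squares fit of Phi r to sampled costs.

    features: array of shape (k, s), row t is phi(i_t)'
    costs:    array of shape (k,),   entry t is a sampled cost J_mu(i_t)
    """
    # r_hat = (sum_t phi phi')^{-1} sum_t phi J_mu, computed stably via lstsq
    r_hat, *_ = np.linalg.lstsq(features, costs, rcond=None)
    return r_hat
```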

76 On-Policy Learning  Direct Projection
Convergence of the evaluated policy: by the law of large numbers,
$\frac{1}{k} \sum_{t=1}^k \phi(i_t) \phi(i_t)' \xrightarrow{a.s.} \sum_{i=1}^n \xi_i\, \phi(i) \phi(i)'$ and $\frac{1}{k} \sum_{t=1}^k \phi(i_t) J_\mu(i_t) \xrightarrow{a.s.} \sum_{i=1}^n \xi_i\, \phi(i) J_\mu(i)$
Hence $\hat r_k \xrightarrow{a.s.} r^*$, with $\Phi r^* = \Pi_S J_\mu$. As the number of samples increases, the estimated low-dimensional vector $\hat r_k$ converges almost surely to the coefficient vector of the projection of $J_\mu$.

77 On-Policy Learning  Direct Projection
Indirect policy evaluation: solve the projected equation $\Phi r = \Pi T_\mu(\Phi r)$, where $\Pi$ is the projection with respect to a suitable weighted Euclidean norm.
Solution methods that use simulation (to manage the calculation of $\Pi$):
TD($\lambda$): a stochastic iterative algorithm for solving $\Phi r = \Pi T_\mu(\Phi r)$
LSTD($\lambda$): solves a simulation-based approximation with a standard solver
LSPE($\lambda$): a simulation-based form of projected value iteration; essentially $\Phi r_{k+1} = \Pi T_\mu(\Phi r_k)$ plus simulation noise
Almost-sure convergence guarantee.

78 On-Policy Learning  Bellman Error Minimization
Bellman equation error methods: another example of indirect approximate policy evaluation,
$\min_r \|\Phi r - T_\mu(\Phi r)\|_\xi^2 \qquad (*)$
where $\|\cdot\|_\xi$ is a Euclidean norm weighted with respect to some distribution $\xi$. It is closely related to the projected equation/Galerkin approach (with a special choice of projection norm).
There are several ways to implement projected equation and Bellman error methods by simulation. They involve: generating many random samples of states $i_k$ using the distribution $\xi$; generating many samples of transitions $(i_k, j_k)$ using the policy $\mu$; forming a simulation-based approximation of the optimality condition for the projection problem or problem (*) (using sample averages in place of inner products); and solving the Monte Carlo approximation of the optimality condition.
Issues for indirect methods: how to generate the samples? How to calculate $r$ efficiently?

79 On-Policy Learning  Projected Bellman Equation Method
Cost function approximation via projected equations: ideally, we want to solve the Bellman equation (for a fixed policy $\mu$). In an MDP, the equation is $n \times n$:
$J = T_\mu J = g_\mu + \alpha P_\mu J$
We solve a projected version of this high-dimensional equation:
$J = \Pi(g_\mu + \alpha P_\mu J)$
Since the projection $\Pi$ is onto the space spanned by the columns of $\Phi$, the projected equation is equivalent to
$\Phi r = \Pi(g_\mu + \alpha P_\mu \Phi r)$
We fix the policy $\mu$ from now on, and omit mentioning it.

80 On-Policy Learning  Projected Bellman Equation Method
Matrix form of the projected equation: the solution $\Phi r^*$ satisfies the orthogonality condition, namely the error $\Phi r^* - (g + \alpha P \Phi r^*)$ is orthogonal to the subspace spanned by the columns of $\Phi$. This is written as
$\Phi' \Xi \big( \Phi r^* - (g + \alpha P \Phi r^*) \big) = 0$
where $\Xi$ is the diagonal matrix with the steady-state probabilities $\xi_1, \ldots, \xi_n$ along the diagonal. Equivalently,
$C r^* = d, \qquad C = \Phi' \Xi (I - \alpha P) \Phi, \qquad d = \Phi' \Xi g$
but computing $C$ and $d$ directly is HARD (high-dimensional inner products).

81 On-Policy Learning  Projected Bellman Equation Method
Simulation-based implementations. Key idea: calculate simulation-based approximations based on $k$ samples,
$C_k \approx C, \qquad d_k \approx d$
The matrix inversion $r^* = C^{-1} d$ is approximated by $\hat r_k = C_k^{-1} d_k$.
This is the LSTD (Least Squares Temporal Differences) method.
Key fact: $C_k$ and $d_k$ can be computed with low-dimensional linear algebra (of order $s$, the number of basis functions).

82 On-Policy Learning  Projected Bellman Equation Method
Simulation mechanics: we generate an infinitely long trajectory $(i_0, i_1, \ldots)$ of the Markov chain, so that states $i$ and transitions $(i, j)$ appear with long-term frequencies $\xi_i$ and $p_{ij}$. After generating each transition $(i_t, i_{t+1})$, we compute the row $\phi(i_t)'$ of $\Phi$ and the cost component $g(i_t, i_{t+1})$. We form
$d_k = \frac{1}{k+1} \sum_{t=0}^k \phi(i_t)\, g(i_t, i_{t+1}) \;\approx\; \sum_{i,j} \xi_i p_{ij}\, \phi(i) g(i, j) = \Phi' \Xi g = d$
$C_k = \frac{1}{k+1} \sum_{t=0}^k \phi(i_t) \big( \phi(i_t) - \alpha \phi(i_{t+1}) \big)' \;\approx\; \Phi' \Xi (I - \alpha P) \Phi = C$
Convergence is based on the law of large numbers: $C_k \xrightarrow{a.s.} C$, $d_k \xrightarrow{a.s.} d$. As the sample size increases, $\hat r_k$ converges almost surely to the solution of the projected Bellman equation. A sketch follows below.
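A sketch of LSTD built from a single simulated trajectory; the trajectory, the per-transition costs, and the feature map phi are assumed to be supplied by the caller.

```python
import numpy as np

def lstd(states, costs, phi, alpha):
    """LSTD estimate from one trajectory i_0, i_1, ..., i_k.

    states: sequence of visited states; costs[t] = g(i_t, i_{t+1});
    phi(i) returns the s-dimensional feature vector of state i.
    """
    s = len(phi(states[0]))
    C = np.zeros((s, s))
    d = np.zeros(s)
    k = len(states) - 1
    for t in range(k):
        f_t, f_next = phi(states[t]), phi(states[t + 1])
        C += np.outer(f_t, f_t - alpha * f_next)   # sample version of Phi' Xi (I - alpha P) Phi
        d += f_t * costs[t]                        # sample version of Phi' Xi g
    C /= k
    d /= k
    return np.linalg.solve(C, d)                   # r_hat = C_k^{-1} d_k
```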

83 On-Policy Learning  From On-Policy to Off-Policy
Outer loop (off-policy RL), approximate PI via on-policy learning: estimate the value function of the current policy $\mu_t$ using linear features, $J_{\mu_t} \approx \Phi r_t$.
Inner loop (on-policy RL): generate state trajectories and estimate $r_t$ via Bellman error minimization (or direct projection, or the projected equation approach).
Update the policy by
$\mu_{t+1}(i) = \arg\min_a \sum_j p_{ij}(a) \big( g(i, a, j) + \alpha\, \phi(j)' r_t \big), \quad \forall i$
Comments: this requires knowledge of $p_{ij}$ (suitable for computer games with known transitions); the policy $\mu_{t+1}$ is parameterized by $r_t$.

84 On-Policy Learning  From On-Policy to Off-Policy
Approximate PI via on-policy learning: use simulation to approximate the cost $J_\mu$ of the current policy $\mu$; generate an improved policy $\bar\mu$ by minimizing in the (approximate) Bellman equation. Alternatively, we can approximate the Q-factors of $\mu$.

85 On-Policy Learning  From On-Policy to Off-Policy
Theoretical basis of approximate PI: if policies are approximately evaluated using an approximation architecture such that
$\max_i |\tilde J(i, r_k) - J_{\mu^k}(i)| \le \delta, \quad k = 0, 1, \ldots$
and if the policy improvement is also approximate,
$\max_i |(T_{\mu^{k+1}} \tilde J)(i, r_k) - (T \tilde J)(i, r_k)| \le \epsilon, \quad k = 0, 1, \ldots$
then the following error bound holds: the sequence $\{\mu^k\}$ generated by approximate policy iteration satisfies
$\limsup_{k \to \infty} \max_i \big( J_{\mu^k}(i) - J^*(i) \big) \le \frac{\epsilon + 2\alpha\delta}{(1 - \alpha)^2}$
Typical practical behavior: the method makes steady progress up to a point, and then the iterates $J_{\mu^k}$ oscillate within a neighborhood of $J^*$. In practice, oscillation between policies is probably not the major concern.

86 Off-Policy RL via Linear Duality
Look at the Bellman equation again. Consider an MDP model with:
States $i = 1, \ldots, n$
Probability transition matrix under policy $\mu$: $P_\mu \in \mathbb{R}^{n \times n}$
Reward of a transition: $g_\mu \in \mathbb{R}^n$
The Bellman equation is
$J = \max_\mu \{ g_\mu + P_\mu J \}$
This is a nonlinear system of equations. Note: the right-hand side is the supremum of a number of linear mappings of $J$!

87 Off-Policy RL via Linear Duality
DP is a special case of LP. Theorem: every finite-state DP problem is an LP problem. We construct the following LP:
minimize $J(1) + \cdots + J(n)$
subject to $J(i) \ge \sum_{j=1}^n p_{ij}(u)\, g_{iju} + \sum_{j=1}^n p_{ij}(u) J(j), \quad \forall i,\ \forall u \in A$
or, more compactly,
minimize $e' J$  subject to $J \ge g_\mu + P_\mu J, \ \forall \mu$.
The variables are $J(i)$, $i = 1, \ldots, n$. For each state-action pair $(i, u)$ there is one inequality constraint.
Dimension of the LP: $n$ variables and $n \cdot |A|$ constraints (here we assume a finite-state, finite-action problem).

88 Off-Policy RL via Linear Duality
DP is a special case of LP. Theorem: the solution of the constructed LP
minimize $e' J$  subject to $J \ge g_\mu + P_\mu J, \ \forall \mu$
is exactly the solution of the Bellman equation $J = \max_\mu \{ g_\mu + P_\mu J \}$.
Proof. The solution $J^*$ of the Bellman equation is clearly a feasible solution of the LP. Conversely, any feasible $J$ satisfies $J \ge g_\mu + P_\mu J$ for all $\mu$, and iterating this inequality gives $J \ge J_\mu$ for all $\mu$, hence $J \ge J^*$. Since the Bellman equation has a unique solution, the LP optimum is attained exactly at $J^*$.
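A small sketch of the LP formulation using scipy.optimize.linprog, for an MDP given as arrays p[u, i, j], g[u, i, j]. A discount factor alpha < 1 is added here (an assumption not written on this slide) so that the LP is bounded.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(p, g, alpha):
    """Solve J(i) = max_u sum_j p_ij(u) ( g(i,u,j) + alpha J(j) ) as an LP.

    minimize  sum_i J(i)
    s.t.      J(i) >= sum_j p_ij(u) g(i,u,j) + alpha sum_j p_ij(u) J(j)   for all (i, u)
    """
    num_actions, n, _ = p.shape
    g_bar = (p * g).sum(axis=2)                   # expected reward per (u, i)
    A_ub, b_ub = [], []
    for u in range(num_actions):
        for i in range(n):
            row = alpha * p[u, i, :].copy()       # alpha * p_ij(u) terms on J(j)
            row[i] -= 1.0                         # rearranged as: alpha p J - J(i) <= -g_bar
            A_ub.append(row)
            b_ub.append(-g_bar[u, i])
    res = linprog(c=np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)      # J is unrestricted in sign
    return res.x                                  # the optimal value function J*
```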

89 Off-Policy RL via Linear Duality
ADP via approximate linear programming: the constructed LP
minimize $e' J$  subject to $J \ge g_\mu + P_\mu J, \ \forall \mu$
is of huge scale. Approximate LP:
We may approximate $J$ by adding the constraint $J = \Phi r$, so the variable dimension becomes smaller.
We may sample a subset of all constraints, so the constraint dimension becomes smaller.
The LP and the approximate LP can be solved by simulation/online.

90 Final Remarks  Tetris
Board height: 12, width: 7.
Rotate and move the falling shape; gravity speed is related to the current height; score when eliminating an entire level; game over when the stack reaches the ceiling.

91 Final Remarks  DP Model of Tetris
State: the current board, the current falling tile, and predictions of future tiles
Termination state: when the tiles reach the ceiling, the game is over with no more future reward
Action: rotation and shift
System dynamics: the next board is deterministically determined by the current board and the player's placement of the current tile; the future tiles are generated randomly
Uncertainty: randomness in future tiles
Transition cost $g$: if a level is cleared by the current action, score 1; otherwise score 0
Objective: expectation of the total score

92 Final Remarks  Interesting facts about Tetris
First released in 1984 by Alexey Pajitnov from the Soviet Union.
Tetris has been proved to be NP-complete, and the game ends with probability 1.
For a 12 x 7 board, the number of possible states is on the order of $10^{25}$.
Highest score achieved by a human: about 1 million. Highest score achieved by an algorithm: about 35 million (average performance).

93 Final Remarks  Solve Tetris by Approximate Dynamic Programming
Curse of dimensionality: the exact cost-to-go vector $V$ has dimension equal to the state space size ($\approx 10^{25}$). The transition matrix is $10^{25} \times 10^{25}$, although sparse. There is no way to solve such a Bellman equation exactly.
Cost function approximation: instead of characterizing the entire board configuration, we use features $\phi_1, \ldots, \phi_d$ to describe the current state, and represent the cost-to-go vector $V$ as a linear combination of a small number of features:
$V \approx \beta_1 \phi_1 + \cdots + \beta_d \phi_d$
We approximate the high-dimensional space of $V$ using a low-dimensional feature space of $\beta$.

94 Final Remarks Features of Tetris The earliest tetris learning algorithm uses 3 features The latest best algorithm uses 22 features The mapping from state space to feature space can be viewed as state aggregation: different states/boards with similar height/holes/structures are treated as a single meta-state.

95 Final Remarks  Learning in Tetris
Ideally, we still want to solve the Bellman equation
$V = \max_\mu \{ g_\mu + P_\mu V \}$
We consider an approximate solution $\tilde V$ to the Bellman equation of the form $\tilde V = \Phi\beta$, and apply policy iteration with approximate policy evaluation:
Outer loop: keep improving the policies $\mu_k$ using one-step lookahead
Inner loop: find the cost-to-go $V_{\mu_k}$ associated with the current policy $\mu_k$, by taking samples and using the feature representation
We want to avoid high-dimensional algebraic operations, as well as storing high-dimensional quantities.

96 Final Remarks Algorithms tested on tetris More examples of learning methods: Temporal difference learning, TD(λ), LSTD, cross-entropy method, actor-critic, active learning, etc... These algorithms are essentially all combinations of DP, sampling, parametric models.

97 Final Remarks Tetris World Record: Human vs. AI

98 AlphaGo
Networks: a rollout policy, an SL policy network trained on human expert positions, an RL policy network trained on self-play positions, and a value network; the policy networks output $p(a \mid s)$ and the value network outputs an evaluation of the position $s'$.
Input features of the neural networks (feature, number of planes, description):
Stone colour, 3, player stone / opponent stone / empty
Ones, 1, a constant plane filled with 1
Turns since, 8, how many turns since a move was played
Liberties, 8, number of liberties (empty adjacent points)
Capture size, 8, how many opponent stones would be captured
Self-atari size, 8, how many of own stones would be captured
Liberties after move, 8, number of liberties after this move is played
Ladder capture, 1, whether a move at this point is a successful ladder capture
Ladder escape, 1, whether a move at this point is a successful ladder escape
Sensibleness, 1, whether a move is legal and does not fill its own eyes
Zeros, 1, a constant plane filled with 0
Player color, 1, whether the current player is black

99 Books
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction
Masashi Sugiyama, Statistical Reinforcement Learning: Modern Machine Learning Approaches
Marco Wiering and Martijn van Otterlo (eds.), Reinforcement Learning: State-of-the-Art
Also in MDP books: Mykel J. Kochenderfer, Decision Making Under Uncertainty: Theory and Application


More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

Lecture 23: Reinforcement Learning

Lecture 23: Reinforcement Learning Lecture 23: Reinforcement Learning MDPs revisited Model-based learning Monte Carlo value function estimation Temporal-difference (TD) learning Exploration November 23, 2006 1 COMP-424 Lecture 23 Recall:

More information

9 Improved Temporal Difference

9 Improved Temporal Difference 9 Improved Temporal Difference Methods with Linear Function Approximation DIMITRI P. BERTSEKAS and ANGELIA NEDICH Massachusetts Institute of Technology Alphatech, Inc. VIVEK S. BORKAR Tata Institute of

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

Optimal sequential decision making for complex problems agents. Damien Ernst University of Liège

Optimal sequential decision making for complex problems agents. Damien Ernst University of Liège Optimal sequential decision making for complex problems agents Damien Ernst University of Liège Email: dernst@uliege.be 1 About the class Regular lectures notes about various topics on the subject with

More information

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study CS 287: Advanced Robotics Fall 2009 Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study Pieter Abbeel UC Berkeley EECS Assignment #1 Roll-out: nice example paper: X.

More information

A System Theoretic Perspective of Learning and Optimization

A System Theoretic Perspective of Learning and Optimization A System Theoretic Perspective of Learning and Optimization Xi-Ren Cao* Hong Kong University of Science and Technology Clear Water Bay, Kowloon, Hong Kong eecao@ee.ust.hk Abstract Learning and optimization

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning Introduction to Reinforcement Learning Rémi Munos SequeL project: Sequential Learning http://researchers.lille.inria.fr/ munos/ INRIA Lille - Nord Europe Machine Learning Summer School, September 2011,

More information

Hidden Markov Models (HMM) and Support Vector Machine (SVM)

Hidden Markov Models (HMM) and Support Vector Machine (SVM) Hidden Markov Models (HMM) and Support Vector Machine (SVM) Professor Joongheon Kim School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea 1 Hidden Markov Models (HMM)

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

Optimal Stopping Problems

Optimal Stopping Problems 2.997 Decision Making in Large Scale Systems March 3 MIT, Spring 2004 Handout #9 Lecture Note 5 Optimal Stopping Problems In the last lecture, we have analyzed the behavior of T D(λ) for approximating

More information

Reinforcement learning an introduction

Reinforcement learning an introduction Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

15-780: LinearProgramming

15-780: LinearProgramming 15-780: LinearProgramming J. Zico Kolter February 1-3, 2016 1 Outline Introduction Some linear algebra review Linear programming Simplex algorithm Duality and dual simplex 2 Outline Introduction Some linear

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

Linear Programming Methods

Linear Programming Methods Chapter 11 Linear Programming Methods 1 In this chapter we consider the linear programming approach to dynamic programming. First, Bellman s equation can be reformulated as a linear program whose solution

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

Procedia Computer Science 00 (2011) 000 6

Procedia Computer Science 00 (2011) 000 6 Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

Q-Learning in Continuous State Action Spaces

Q-Learning in Continuous State Action Spaces Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental

More information

Introduction to Reinforcement Learning. Part 6: Core Theory II: Bellman Equations and Dynamic Programming

Introduction to Reinforcement Learning. Part 6: Core Theory II: Bellman Equations and Dynamic Programming Introduction to Reinforcement Learning Part 6: Core Theory II: Bellman Equations and Dynamic Programming Bellman Equations Recursive relationships among values that can be used to compute values The tree

More information

Overview Example (TD-Gammon) Admission Why approximate RL is hard TD() Fitted value iteration (collocation) Example (k-nn for hill-car)

Overview Example (TD-Gammon) Admission Why approximate RL is hard TD() Fitted value iteration (collocation) Example (k-nn for hill-car) Function Approximation in Reinforcement Learning Gordon Geo ggordon@cs.cmu.edu November 5, 999 Overview Example (TD-Gammon) Admission Why approximate RL is hard TD() Fitted value iteration (collocation)

More information

Deep Reinforcement Learning. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017

Deep Reinforcement Learning. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017 Deep Reinforcement Learning STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017 Outline Introduction to Reinforcement Learning AlphaGo (Deep RL for Computer Go)

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques

More information

WE consider finite-state Markov decision processes

WE consider finite-state Markov decision processes IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 54, NO. 7, JULY 2009 1515 Convergence Results for Some Temporal Difference Methods Based on Least Squares Huizhen Yu and Dimitri P. Bertsekas Abstract We consider

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari

More information

INTRODUCTION TO MARKOV DECISION PROCESSES

INTRODUCTION TO MARKOV DECISION PROCESSES INTRODUCTION TO MARKOV DECISION PROCESSES Balázs Csanád Csáji Research Fellow, The University of Melbourne Signals & Systems Colloquium, 29 April 2010 Department of Electrical and Electronic Engineering,

More information

Section Notes 9. Midterm 2 Review. Applied Math / Engineering Sciences 121. Week of December 3, 2018

Section Notes 9. Midterm 2 Review. Applied Math / Engineering Sciences 121. Week of December 3, 2018 Section Notes 9 Midterm 2 Review Applied Math / Engineering Sciences 121 Week of December 3, 2018 The following list of topics is an overview of the material that was covered in the lectures and sections

More information

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396 Machine Learning Reinforcement learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1396 1 / 32 Table of contents 1 Introduction

More information

5. Solving the Bellman Equation

5. Solving the Bellman Equation 5. Solving the Bellman Equation In the next two lectures, we will look at several methods to solve Bellman s Equation (BE) for the stochastic shortest path problem: Value Iteration, Policy Iteration and

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course Value Iteration: the Idea 1. Let V 0 be any vector in R N A. LAZARIC Reinforcement

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Markov Decision Processes

Markov Decision Processes Markov Decision Processes Lecture notes for the course Games on Graphs B. Srivathsan Chennai Mathematical Institute, India 1 Markov Chains We will define Markov chains in a manner that will be useful to

More information

The Markov Decision Process (MDP) model

The Markov Decision Process (MDP) model Decision Making in Robots and Autonomous Agents The Markov Decision Process (MDP) model Subramanian Ramamoorthy School of Informatics 25 January, 2013 In the MAB Model We were in a single casino and the

More information

CS 4100 // artificial intelligence. Recap/midterm review!

CS 4100 // artificial intelligence. Recap/midterm review! CS 4100 // artificial intelligence instructor: byron wallace Recap/midterm review! Attribution: many of these slides are modified versions of those distributed with the UC Berkeley CS188 materials Thanks

More information

arxiv: v1 [cs.lg] 23 Oct 2017

arxiv: v1 [cs.lg] 23 Oct 2017 Accelerated Reinforcement Learning K. Lakshmanan Department of Computer Science and Engineering Indian Institute of Technology (BHU), Varanasi, India Email: lakshmanank.cse@itbhu.ac.in arxiv:1710.08070v1

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Basis Function Adaptation Methods for Cost Approximation in MDP

Basis Function Adaptation Methods for Cost Approximation in MDP Basis Function Adaptation Methods for Cost Approximation in MDP Huizhen Yu Department of Computer Science and HIIT University of Helsinki Helsinki 00014 Finland Email: janeyyu@cshelsinkifi Dimitri P Bertsekas

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Daniel Hennes 19.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Eligibility traces n-step TD returns Forward and backward view Function

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo Marc Toussaint University of

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Reinforcement Learning Part 2

Reinforcement Learning Part 2 Reinforcement Learning Part 2 Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ From previous tutorial Reinforcement Learning Exploration No supervision Agent-Reward-Environment

More information

Markov Decision Processes and Solving Finite Problems. February 8, 2017

Markov Decision Processes and Solving Finite Problems. February 8, 2017 Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:

More information

Deep Reinforcement Learning

Deep Reinforcement Learning Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 50 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

1 Problem Formulation

1 Problem Formulation Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning

More information

CS885 Reinforcement Learning Lecture 7a: May 23, 2018

CS885 Reinforcement Learning Lecture 7a: May 23, 2018 CS885 Reinforcement Learning Lecture 7a: May 23, 2018 Policy Gradient Methods [SutBar] Sec. 13.1-13.3, 13.7 [SigBuf] Sec. 5.1-5.2, [RusNor] Sec. 21.5 CS885 Spring 2018 Pascal Poupart 1 Outline Stochastic

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

Stochastic Primal-Dual Methods for Reinforcement Learning

Stochastic Primal-Dual Methods for Reinforcement Learning Stochastic Primal-Dual Methods for Reinforcement Learning Alireza Askarian 1 Amber Srivastava 1 1 Department of Mechanical Engineering University of Illinois at Urbana Champaign Big Data Optimization,

More information