Dynamic Programming and Reinforcement Learning


1 Dynamic Programming and Reinforcement Learning

2 Reinforcement learning setting  $\langle A, S, R, P\rangle$
The agent interacts with the environment: it takes an action/decision, and the environment returns a reward and the next state.
Action space: $A$.  State space: $S$.
Reward: $R : S \times A \times S \to \mathbb{R}$.  Transition: $P : S \times A \to S$.

3 Reinforcement learning setting  $\langle A, S, R, P\rangle$
Same setting as above, plus the agent itself:
Policy: $\pi : S \times A \to \mathbb{R}$, with $\sum_{a \in A} \pi(a \mid s) = 1$.
Deterministic policy: $\pi : S \to A$.

4 Reinforcement learning setting  $\langle A, S, R, P\rangle$
Agent's view of the interaction: $s_0, a_1, r_1, s_1, a_2, r_2, s_2, a_3, r_3, s_3, \ldots$, where $a_{t+1} = \pi(s_t)$.

5 Reinforcement learning setting  $\langle A, S, R, P\rangle$
Agent's goal: learn a policy that maximizes the long-term total reward.
T-step: $\sum_{t=1}^{T} r_t$;   discounted: $\sum_{t=1}^{\infty} \gamma^t r_t$.
All RL tasks can be defined as maximizing total reward.
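To make the interaction loop concrete, here is a minimal Python sketch that rolls out one episode and accumulates the discounted total reward. The env object (with reset() and step()) and the policy function are hypothetical placeholders for whatever environment and policy are at hand; they are not part of the slides.

```python
def run_episode(env, policy, gamma=0.99, max_steps=100):
    """Roll out one episode and accumulate the discounted return.

    Assumes a Gym-style interface: env.reset() -> state and
    env.step(action) -> (next_state, reward, done); policy maps state -> action.
    Both env and policy are placeholders supplied by the caller.
    """
    s = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)                 # a_{t+1} = pi(s_t)
        s, r, done = env.step(a)      # environment returns reward and next state
        total += discount * r         # accumulate gamma^t * r_{t+1}
        discount *= gamma
        if done:
            break
    return total
```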

6 Difference between RL and planning?
What if we use planning/search methods to find actions that maximize total reward?
Planning: find an optimal solution. RL: find an optimal policy from samples.
Planning: shortest path. RL: a shortest-path policy without knowing the graph.

7 Difference between RL and SL?
Supervised learning also learns a model...
Supervised learning: the environment produces labeled data $(x, y), (x, y), \ldots$, which an algorithm turns into a model. Learning from labeled data: open loop, passive data.
Reinforcement learning: the environment produces trajectories $(s, a, r, s', a', r', \ldots)$, which an algorithm turns into a model. Learning from delayed reward: closed loop, the agent explores the environment.

8 Introduction and Examples  Dynamic Programming
One powerful tool for solving certain types of optimization problems is dynamic programming (DP). The idea is similar to recursion.
To illustrate recursion, suppose you want to compute $f(n) = n!$. You know $n! = n \cdot (n-1)!$, so if you know $f(n-1)$ you can compute $f(n)$ easily; thus you only need to focus on computing $f(n-1)$.
To compute $f(n-1)$, apply the same idea: you only need to compute $f(n-2)$, and so on.
And you know $f(1) = 1$.
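As a minimal illustration of the recursion described above, the following snippet computes $f(n) = n!$ directly from $f(n) = n \cdot f(n-1)$ with base case $f(1) = 1$.

```python
def factorial(n):
    """f(n) = n! via the recursion f(n) = n * f(n-1), base case f(1) = 1."""
    if n <= 1:
        return 1
    return n * factorial(n - 1)

print(factorial(5))  # 120
```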

9 Introduction and Examples  Dynamic Programming
The basic idea: we want to solve a big problem that is hard to attack directly.
We first solve a few smaller but similar problems; if those can be solved, the solution to the big problem is easy to get.
To solve each of those smaller problems we use the same idea, first solving even smaller problems. Continuing in this way, we eventually reach problems we know how to solve.
Dynamic programming has the same structure; the difference is that at each step there may be some optimization involved.

10 Introduction and Examples  Shortest Path Problem
We have a graph and want to find the shortest path from $s$ to $t$.
Here $d_{ij}$ denotes the distance between node $i$ and node $j$.

11 Introduction and Examples  DP formulation for the shortest path problem
Let $V_i$ denote the shortest distance from $i$ to $t$. Eventually we want to compute $V_s$.
It is hard to compute $V_i$ directly in general. However, we can look one step ahead: if the first step is to move from $i$ to $j$, the shortest distance we can achieve is $d_{ij} + V_j$. To minimize the total distance, we choose $j$ to minimize $d_{ij} + V_j$.
Written as a formula: $V_i = \min_j \{ d_{ij} + V_j \}$.

12 Introduction and Examples  DP for the shortest path problem
We call this the recursion formula: $V_i = \min_j \{ d_{ij} + V_j \}$ for all $i$.
We also know that if we are already at the destination, the remaining distance is 0, i.e., $V_t = 0$.
These two equations are the DP formulation of the problem.

13 Introduction and Examples  Solve the DP
Given the formulation, how do we solve the DP? We use backward induction:
$V_i = \min_j \{ d_{ij} + V_j \}$ for all $i$, with $V_t = 0$.
Starting from the last node (whose value we know), we solve for the values $V_i$ backwards.

14 Introduction and Examples  Example
We have $V_t = 0$. Then $V_f = \min_{(f,j)\ \text{is an arc}} \{ d_{fj} + V_j \}$.
Here there is only one outgoing arc, so $V_f = d_{ft} + V_t = 5 + 0 = 5$. Similarly, $V_g = 2$.

15 Introduction and Examples  Example continued
We have $V_t = 0$, $V_f = 5$ and $V_g = 2$. Now consider $c$, $d$, $e$.
For $c$ and $e$ there is only one outgoing arc: $V_c = d_{cf} + V_f = 7$, $V_e = d_{eg} + V_g = 5$.
For $d$: $V_d = \min_{(d,j)\ \text{is an arc}} \{ d_{dj} + V_j \} = \min\{ d_{df} + V_f,\ d_{dg} + V_g \} = 10$.
The optimal choice at $d$ is to go to $g$.

16 Introduction and Examples  Example continued
We got $V_c = 7$, $V_d = 10$, $V_e = 5$. Now we compute $V_a$ and $V_b$:
$V_a = \min\{ d_{ac} + V_c,\ d_{ad} + V_d \} = \min\{ 3 + 7,\ \ldots \} = 10$
$V_b = \min\{ d_{bd} + V_d,\ d_{be} + V_e \} = \min\{ 1 + 10,\ 2 + 5 \} = 7$
The optimal choice at $a$ is to go to $c$, and the optimal choice at $b$ is to go to $e$.

17 Introduction and Examples  Example continued
Finally, $V_s = \min\{ d_{sa} + V_a,\ d_{sb} + V_b \} = \min\{ 1 + 10,\ 9 + 7 \} = 11$, and the optimal choice at $s$ is to go to $a$.
Therefore the shortest distance is 11, and by connecting the optimal choices we obtain the optimal path $s \to a \to c \to f \to t$.
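The following Python sketch reproduces the backward recursion $V_i = \min_j \{ d_{ij} + V_j \}$ on this example. Arc lengths not printed on the slides are inferred from the computed values ($d_{cf} = 2$, $d_{eg} = 3$, $d_{dg} = 8$), and the two arcs whose lengths cannot be recovered ($a \to d$ and $d \to f$) are omitted; this does not change any of the values $V_i$.

```python
# Arcs of the example graph: graph[i] is a list of (j, d_ij) pairs.
# Lengths marked 'inferred' are backed out from the V values on the slides.
graph = {
    's': [('a', 1), ('b', 9)],
    'a': [('c', 3)],            # arc a->d omitted (length not recoverable)
    'b': [('d', 1), ('e', 2)],
    'c': [('f', 2)],            # inferred from V_c = 7, V_f = 5
    'd': [('g', 8)],            # inferred from V_d = 10 via g; arc d->f omitted
    'e': [('g', 3)],            # inferred from V_e = 5, V_g = 2
    'f': [('t', 5)],
    'g': [('t', 2)],
    't': [],
}

V = {'t': 0.0}                  # boundary condition V_t = 0

def shortest(i):
    """Bellman recursion V_i = min_j { d_ij + V_j }, memoized in V."""
    if i not in V:
        V[i] = min(d + shortest(j) for j, d in graph[i])
    return V[i]

print(shortest('s'))            # 11.0, matching the slides
```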

18 Introduction and Examples  Summary of the example
In the example we had the quantities $V_i$, the shortest distance from $i$ to $t$. We call $V$ the value function.
We also had the nodes $s, a, b, \ldots, g, t$; we call them the states of the problem. The value function is a function of the state.
The recursion formula $V_i = \min_j \{ d_{ij} + V_j \}$ for all $i$ connects the value function at different states. It is known as the Bellman equation.

19 Introduction and Examples  Dynamic Programming Framework
The above is the general framework of dynamic programming. To formulate a problem as a DP, the first step is to define the states.
The state variables should be in some order (usually either time order or geographical order). In the shortest path example, the state is simply each node.
We call the set of all states the state space; in this example, the state space is the set of all nodes. Defining the appropriate state space is the most important step in DP.
Definition: a state is a collection of variables that summarizes all the (historical) information that is useful for (future) decisions. Conditioned on the state, the problem becomes Markov (i.e., memoryless).

20 Introduction and Examples  DP: Actions
There is an action set at each state; at state $x$, we denote it by $A(x)$.
For each action $a \in A(x)$, there is an immediate cost $r(x, a)$. If one takes action $a$, the system moves to the next state $s(x, a)$.
In the shortest path problem, the actions at each state are the arcs one can take; the arc length is the immediate cost, and after taking an arc the system is at the next node.

21 Introduction and Examples  DP: Value Functions
There is a value function $V(\cdot)$ at each state. The value function gives the optimal value achievable if you choose the optimal action from this state onward.
In the shortest path example, the value function is the shortest distance from the current state to the destination.
A recursion formula links the value functions:
$V(x) = \min_{a \in A(x)} \{ r(x, a) + V(s(x, a)) \}$
To solve the DP, one also has to know some terminal (boundary) values of $V$; in the shortest path example, $V_t = 0$.
The recursion for value functions is known as the Bellman equation.

22 Introduction and Examples  A more general framework
The framework above minimizes total cost. In some cases one wants to maximize total profit; then $r(x, a)$ is viewed as the immediate reward, and the DP becomes
$V(x) = \max_{a \in A(x)} \{ r(x, a) + V(s(x, a)) \}$
with some boundary conditions.

23 Introduction and Examples  Stochastic DP
In some cases, when you choose action $a$ at $x$, the next state is not certain (e.g., you decide a price, but the demand is random). Let $p(x, y, a)$ be the probability of moving from $x$ to $y$ when choosing action $a \in A(x)$. The recursion formula becomes
$V(x) = \min_{a \in A(x)} \{ r(x, a) + \sum_y p(x, y, a)\, V(y) \}$
or, using expectation notation,
$V(x) = \min_{a \in A(x)} \{ r(x, a) + \mathbb{E}\, V(s(x, a)) \}$.

24 Introduction and Examples  Example: Stochastic Shortest Path Problem
Stochastic setting: one no longer controls exactly which node to jump to next. Instead, one chooses between different actions $a \in A$; each action $a$ is associated with transition probabilities $p(j \mid i; a)$ for all $i, j \in S$, and the arc length $w_{ija}$ may be random.
Objective: decide on the action for every possible current node. In other words, find a policy or strategy that maps from $S$ to $A$.
Bellman equation for the stochastic SSP:
$V(i) = \min_a \sum_{j \in S} p(j \mid i; a) \big( w_{ija} + V(j) \big), \quad i \in S$.

25 Introduction and Examples  Bellman equation continued
We rewrite the Bellman equation
$V(i) = \min_a \sum_{j \in S} p(j \mid i; a) \big( w_{ija} + V(j) \big), \quad i \in S$
in vector form:
$V = \min_{\mu : S \to A} \{ g_\mu + P_\mu V \}$
where the average transition cost starting from $i$ is $g_\mu(i) = \sum_{j \in S} p(j \mid i, \mu(i))\, w_{ij, \mu(i)}$ and the transition probabilities are $P_\mu(i, j) = p(j \mid i, \mu(i))$.
$V(i)$ is the expected length of the stochastic shortest path starting from $i$, also known as the value function or cost-to-go vector. $V$ is the optimal value function (cost-to-go vector) that solves the Bellman equation.

26 Introduction and Examples  Example: Game of picking coins
Assume there is a row of $n$ coins with values $v_1, \ldots, v_n$, where $n$ is even. You and your opponent take turns picking either the first or the last coin in the row and obtaining its value; that coin is then removed from the row.
Both players want to maximize their total value.

27 Introduction and Examples  Example
Example: 1, 2, 10, 8.
If you choose 8, your opponent will choose 10, then you choose 2 and your opponent chooses 1. You get 10 and your opponent gets 11.
If you choose 1, your opponent will choose 8, you choose 10, and your opponent chooses 2. You get 11 and your opponent gets 10.
When there are many coins, it is not easy to find the optimal strategy. As this example shows, the greedy strategy doesn't work.

28 Introduction and Examples  Dynamic programming formulation
Given a sequence of values $v_1, \ldots, v_n$, let the state be the range of positions that are yet to be chosen. That is, the state is a pair $(i, j)$ with $i \le j$, meaning the remaining coins are $v_i, v_{i+1}, \ldots, v_j$.
The value function $V(i, j)$ is the maximum value you can collect if the game starts with coins $v_i, v_{i+1}, \ldots, v_j$ and you move first.
The action at state $(i, j)$ is simple: take either $i$ or $j$. The immediate reward: if you take $i$ you get $v_i$; if you take $j$ you get $v_j$.
What will the next state be if you choose $i$ at state $(i, j)$?

29 Introduction and Examples  Example continued
Consider the current state $(i, j)$ and suppose you picked $i$. Your opponent will choose either $i+1$ or $j$: if he chooses $i+1$ the state becomes $(i+2, j)$; if he chooses $j$ the state becomes $(i+1, j-1)$. He will choose to take the most value from the remaining coins, i.e., leave you with the least valuable remaining state. A similar argument applies if you take $j$. Therefore the recursion formula is
$V(i, j) = \max\{\, v_i + \min\{ V(i+2, j),\ V(i+1, j-1) \},\ \ v_j + \min\{ V(i+1, j-1),\ V(i, j-2) \} \,\}$
and we know $V(i, i+1) = \max\{ v_i, v_{i+1} \}$ (if only two coins remain, you pick the larger one).

30 Introduction and Examples  Dynamic programming formulation
Therefore, the DP for this problem is
$V(i, j) = \max\{\, v_i + \min\{ V(i+2, j),\ V(i+1, j-1) \},\ \ v_j + \min\{ V(i+1, j-1),\ V(i, j-2) \} \,\}$
with $V(i, i+1) = \max\{ v_i, v_{i+1} \}$ for all $i = 1, \ldots, n-1$.
How to solve this DP? We know $V(i, j)$ when $j - i = 1$. Then we can easily solve $V(i, j)$ for any pair with $j - i = 3$, and so on, all the way until we solve the initial problem. A memoized implementation is sketched below.
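A minimal memoized implementation of this recursion, assuming the coin values are given as a Python sequence of even length:

```python
from functools import lru_cache

def best_total(v):
    """Maximum total the first player can collect from the coin row v."""
    n = len(v)

    @lru_cache(maxsize=None)
    def V(i, j):
        # State (i, j): remaining coins are v[i], ..., v[j] and it is our turn.
        if j - i == 1:                        # two coins left: take the larger
            return max(v[i], v[j])
        take_left = v[i] + min(V(i + 2, j), V(i + 1, j - 1))
        take_right = v[j] + min(V(i + 1, j - 1), V(i, j - 2))
        return max(take_left, take_right)

    return V(0, n - 1)

print(best_total((1, 2, 10, 8)))  # 11, as in the example
```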

31 Introduction and Examples  Summary of this example
The state variable is a two-dimensional vector indicating the remaining range; this is one typical way to set up state variables. We used DP to solve for the optimal strategy of a two-person game.
In fact, DP is the main algorithmic idea behind computer game playing. For example, to solve a Go or chess position, one can define the state space as the configuration of the pieces/stones. The value function of a state is simply whether it is a winning or losing state.
At a given state, the computer considers all actions; the next state is the most adversarial state the opponent can leave you in after your move and his move. The boundary states are the checkmate states.

32 Introduction and Examples  DP for games
DP is roughly how computers play games. In fact, if you just code up the DP, the code will probably be no more than a page (excluding the interface, of course).
However, there is one major obstacle: the state space is huge. For chess, each piece can occupy one of the 64 squares (or have been removed), giving an astronomically large number of states. For Go, each of the 361 points on the board can be occupied by white, black, or neither, giving on the order of $3^{361}$ states.
It is impossible for current computers to solve problems of that size exactly. This is called the curse of dimensionality. To address it, people have developed approximate dynamic programming techniques to obtain good approximate solutions (both an approximate value function and an approximate optimal strategy).

33 Infinite-Horizon DP: Theory and Algorithms  Markov Decision Process
Stationary system: $x_{k+1} = f(x_k, u_k, w_k)$, $k = 0, 1, \ldots$
Cost of a policy $\pi = \{\mu_0, \mu_1, \ldots\}$:
$J_\pi(x_0) = \lim_{N \to \infty} \mathbb{E}_{w_k,\, k = 0,1,\ldots} \Big\{ \sum_{k=0}^{N-1} \alpha^k g(x_k, \mu_k(x_k), w_k) \Big\}$
with $\alpha < 1$, and $g$ bounded [for some $M$, we have $|g(x, u, w)| \le M$ for all $(x, u, w)$].
$x_k$: state, summarizing past information that is relevant for future optimization
$u_k$: control/action, selected at time $k$ from a given set $U_k$
$w_k$: random disturbance or noise
$g(x_k, u_k, w_k)$: stage (transition) cost incurred at time $k$ at state $x_k$
The optimal cost function is defined as $J^*(x) = \min_\pi J_\pi(x)$.

34 Infinite-Horizon DP: Theory and Algorithms  Markov Decision Process
Shorthand notation for DP mappings: for any function $J$ of $x$, denote
$(TJ)(x) = \min_{u \in U(x)} \mathbb{E}_w \{ g(x, u, w) + \alpha J(f(x, u, w)) \}, \quad \forall x$
$TJ$ is the optimal cost function for the one-stage problem with stage cost $g$ and terminal cost function $\alpha J$. $T$ operates on bounded functions of $x$ to produce other bounded functions of $x$.
For any stationary policy $\mu$, denote
$(T_\mu J)(x) = \mathbb{E}_w \{ g(x, \mu(x), w) + \alpha J(f(x, \mu(x), w)) \}, \quad \forall x$
The critical structure of the problem is captured in $T$ and $T_\mu$. The entire theory of discounted problems can be developed in shorthand using $T$ and $T_\mu$; the same is true for many other DP problems. $T$ and $T_\mu$ provide a powerful unifying framework for DP.

35 Infinite-Horizon DP: Theory and Algorithms  Markov Decision Process
Boundedness of $g$ guarantees that all costs are well-defined and bounded: $|J_\pi(x)| \le \frac{M}{1 - \alpha}$.
An important special case is the finite-space Markovian Decision Problem. All algorithms ultimately work with a finite-space MDP approximating the original problem.

36 Infinite-Horizon DP: Theory and Algorithms  Markov Decision Process
Markov Chain: a Markov chain is a random process taking values in the state space $\{1, \ldots, n\}$. The process evolves according to a transition probability matrix $P \in \mathbb{R}^{n \times n}$, where
$P(i_{k+1} = j \mid i_k, i_{k-1}, \ldots, i_0) = P(i_{k+1} = j \mid i_k = i) = P_{ij}$
A Markov chain is memoryless: conditioned on the current state, the future evolution is independent of the past trajectory. The memoryless property is equivalent to the Markov property.
A state $i$ is recurrent if it is visited infinitely many times with probability 1. A Markov chain is irreducible if its state space is a single communicating class; in other words, if it is possible to get to any state from any state.
When states are modeled appropriately, all stochastic processes are Markov.

37 Infinite-Horizon DP: Theory and Algorithms  Markov Decision Process
Markovian Decision Problem: we will mostly assume the system is an $n$-state (controlled) Markov chain.
States $i = 1, \ldots, n$
Transition probabilities $p_{i_k i_{k+1}}(u_k)$
Stage cost $g(i_k, u_k, i_{k+1})$
Cost functions $J = (J(1), \ldots, J(n))$ (vectors in $\mathbb{R}^n$)
Cost of a policy $\pi = \{\mu_0, \mu_1, \ldots\}$:
$J_\pi(i) = \lim_{N \to \infty} \mathbb{E}_{i_k,\, k=1,2,\ldots} \Big\{ \sum_{k=0}^{N-1} \alpha^k g(i_k, \mu_k(i_k), i_{k+1}) \,\Big|\, i_0 = i \Big\}$

38 Infinite-Horizon DP: Theory and Algorithms  Markov Decision Process
Markovian Decision Problem: shorthand notation for the DP mappings
$(TJ)(i) = \min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) \big( g(i, u, j) + \alpha J(j) \big), \quad i = 1, \ldots, n$
$(T_\mu J)(i) = \sum_{j=1}^n p_{ij}(\mu(i)) \big( g(i, \mu(i), j) + \alpha J(j) \big), \quad i = 1, \ldots, n$
Vector form of the DP mappings:
$TJ = \min_\mu \{ g_\mu + \alpha P_\mu J \}, \qquad T_\mu J = g_\mu + \alpha P_\mu J$
where $g_\mu(i) = \sum_{j=1}^n p_{ij}(\mu(i))\, g(i, \mu(i), j)$ and $P_\mu(i, j) = p_{ij}(\mu(i))$.
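As an illustration, the following sketch builds the mappings $T$ and $T_\mu$ for a finite MDP stored as NumPy arrays p[u, i, j] and g[u, i, j]; this array layout is an assumption made here for concreteness, not something fixed by the slides.

```python
import numpy as np

def bellman_ops(p, g, alpha):
    """Return the DP mappings T and T_mu for a finite MDP.

    p[u, i, j]: transition probability from i to j under action u
    g[u, i, j]: stage cost of that transition
    """
    # Expected one-stage cost for each (action, state): sum_j p_ij(u) g(i,u,j)
    g_bar = (p * g).sum(axis=2)                 # shape (num_actions, n)

    def T(J):
        # (T J)(i) = min_u [ g_bar(u,i) + alpha * sum_j p_ij(u) J(j) ]
        return (g_bar + alpha * p @ J).min(axis=0)

    def T_mu(J, mu):
        # (T_mu J)(i) = g_bar(mu(i),i) + alpha * sum_j p_ij(mu(i)) J(j)
        idx = np.arange(p.shape[1])
        return g_bar[mu, idx] + alpha * (p[mu, idx, :] @ J)

    return T, T_mu
```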

39 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
Two key properties.
Monotonicity: for any $J$ and $J'$ such that $J(x) \le J'(x)$ for all $x$, and any $\mu$,
$(TJ)(x) \le (TJ')(x)$ and $(T_\mu J)(x) \le (T_\mu J')(x)$ for all $x$.
Constant shift: for any $J$, any scalar $r$, and any $\mu$,
$(T(J + re))(x) = (TJ)(x) + \alpha r$ and $(T_\mu(J + re))(x) = (T_\mu J)(x) + \alpha r$ for all $x$,
where $e$ is the unit function [$e(x) \equiv 1$].
Monotonicity is present in all DP models (undiscounted, etc.); the constant shift property is special to discounted models.
Discounted problems have another property of major importance: $T$ and $T_\mu$ are contraction mappings (we will show this later).

40 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
Convergence of Value Iteration. Theorem: for all bounded $J_0$, we have $J^*(x) = \lim_{k \to \infty} (T^k J_0)(x)$ for all $x$.
Proof. For simplicity we give the proof for $J_0 \equiv 0$. For any initial state $x_0$ and policy $\pi = \{\mu_0, \mu_1, \ldots\}$,
$J_\pi(x_0) = \mathbb{E} \Big\{ \sum_{l=0}^{\infty} \alpha^l g(x_l, \mu_l(x_l), w_l) \Big\} = \mathbb{E} \Big\{ \sum_{l=0}^{k-1} \alpha^l g(x_l, \mu_l(x_l), w_l) \Big\} + \mathbb{E} \Big\{ \sum_{l=k}^{\infty} \alpha^l g(x_l, \mu_l(x_l), w_l) \Big\}$
The tail portion satisfies
$\Big| \mathbb{E} \Big\{ \sum_{l=k}^{\infty} \alpha^l g(x_l, \mu_l(x_l), w_l) \Big\} \Big| \le \frac{\alpha^k M}{1 - \alpha}$
where $M \ge |g(x, u, w)|$. Take the minimum over $\pi$ on both sides, then take the limit as $k \to \infty$.

41 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
Proof of Bellman's equation. Theorem: the optimal cost function $J^*$ is a solution of Bellman's equation $J^* = TJ^*$, i.e., for all $x$,
$J^*(x) = \min_{u \in U(x)} \mathbb{E}_w \{ g(x, u, w) + \alpha J^*(f(x, u, w)) \}$
Proof. For all $x$ and $k$,
$J^*(x) - \frac{\alpha^k M}{1 - \alpha} \le (T^k J_0)(x) \le J^*(x) + \frac{\alpha^k M}{1 - \alpha}$
where $J_0(x) \equiv 0$ and $M \ge |g(x, u, w)|$. Applying $T$ to this relation, and using monotonicity and constant shift,
$(TJ^*)(x) - \frac{\alpha^{k+1} M}{1 - \alpha} \le (T^{k+1} J_0)(x) \le (TJ^*)(x) + \frac{\alpha^{k+1} M}{1 - \alpha}$
Taking the limit as $k \to \infty$ and using the fact $\lim_{k \to \infty} (T^{k+1} J_0)(x) = J^*(x)$, we obtain $J^* = TJ^*$.

42 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
The contraction property: for any bounded functions $J$ and $J'$, and any $\mu$,
$\max_x |(TJ)(x) - (TJ')(x)| \le \alpha \max_x |J(x) - J'(x)|$
$\max_x |(T_\mu J)(x) - (T_\mu J')(x)| \le \alpha \max_x |J(x) - J'(x)|$
Proof. Denote $c = \max_{x \in S} |J(x) - J'(x)|$. Then
$J(x) - c \le J'(x) \le J(x) + c, \quad \forall x$
Apply $T$ to both sides and use the monotonicity and constant shift properties:
$(TJ)(x) - \alpha c \le (TJ')(x) \le (TJ)(x) + \alpha c, \quad \forall x$
Hence $|(TJ)(x) - (TJ')(x)| \le \alpha c$ for all $x$ (the argument for $T_\mu$ is identical).
This implies that $T$ and $T_\mu$ have unique fixed points: $J^*$ is the unique solution of $J = TJ$, and $J_\mu$ is the unique solution of $J = T_\mu J$.

43 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
Necessary and sufficient optimality condition. Theorem: a stationary policy $\mu$ is optimal if and only if $\mu(x)$ attains the minimum in Bellman's equation for each $x$, i.e.,
$TJ^* = T_\mu J^*$
or, equivalently, for all $x$,
$\mu(x) \in \arg\min_{u \in U(x)} \mathbb{E}_w \{ g(x, u, w) + \alpha J^*(f(x, u, w)) \}$

44 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
Proof of the optimality condition. There are two directions.
If $TJ^* = T_\mu J^*$, then using Bellman's equation ($J^* = TJ^*$) we have $J^* = T_\mu J^*$, so by uniqueness of the fixed point of $T_\mu$ we obtain $J^* = J_\mu$; i.e., $\mu$ is optimal.
Conversely, if the stationary policy $\mu$ is optimal, we have $J^* = J_\mu$, so $J^* = T_\mu J^*$. Combining this with Bellman's equation ($J^* = TJ^*$), we obtain $TJ^* = T_\mu J^*$.

45 Infinite-Horizon DP: Theory and Algorithms  Optimality Condition and Value Iteration
Two main algorithms for MDP.
Value Iteration (VI): solve the Bellman equation $J^* = TJ^*$ by iterating on the value functions, $J_{k+1} = TJ_k$, i.e.,
$J_{k+1}(i) = \min_u \sum_{j=1}^n p_{ij}(u) \big( g(i, u, j) + \alpha J_k(j) \big), \quad i = 1, \ldots, n$
The program only needs to store the current value function $J_k$. We have shown that $J_k \to J^*$ as $k \to \infty$. A sketch follows below.
Policy Iteration (PI): solve the Bellman equation $J^* = TJ^*$ by iterating on the policies.
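A minimal NumPy sketch of value iteration under the same assumed array layout p[u, i, j], g[u, i, j]; the tolerance and iteration cap are arbitrary choices.

```python
import numpy as np

def value_iteration(p, g, alpha, tol=1e-8, max_iters=10_000):
    """Iterate J_{k+1} = T J_k until the sup-norm change is below tol."""
    num_actions, n, _ = p.shape
    g_bar = (p * g).sum(axis=2)                      # expected stage cost per (u, i)
    J = np.zeros(n)
    for _ in range(max_iters):
        J_new = (g_bar + alpha * p @ J).min(axis=0)  # componentwise (T J)(i)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    policy = (g_bar + alpha * p @ J).argmin(axis=0)  # greedy policy w.r.t. the final J
    return J, policy
```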

46 Infinite-Horizon DP: Theory and Algorithms  Policy Iteration
Policy Iteration (PI): given $\mu^k$, the $k$-th policy iteration has two steps.
Policy evaluation: find $J_{\mu^k}$ by solving
$J_{\mu^k}(i) = \sum_{j=1}^n p_{ij}(\mu^k(i)) \big( g(i, \mu^k(i), j) + \alpha J_{\mu^k}(j) \big), \quad i = 1, \ldots, n$
i.e., $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$.
Policy improvement: let $\mu^{k+1}$ be such that
$\mu^{k+1}(i) \in \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) \big( g(i, u, j) + \alpha J_{\mu^k}(j) \big), \quad \forall i$
i.e., $T_{\mu^{k+1}} J_{\mu^k} = TJ_{\mu^k}$.
Policy iteration updates the policy instead of the value function.

47 Infinite-Horizon DP: Theory and Algorithms  Policy Iteration
Policy Iteration (PI), more abstractly: the $k$-th policy iteration has two steps.
Policy evaluation: find $J_{\mu^k}$ by solving $J_{\mu^k} = T_{\mu^k} J_{\mu^k} = g_{\mu^k} + \alpha P_{\mu^k} J_{\mu^k}$
Policy improvement: let $\mu^{k+1}$ be such that $T_{\mu^{k+1}} J_{\mu^k} = TJ_{\mu^k}$.
Comments: policy evaluation is equivalent to solving an $n \times n$ linear system of equations; policy improvement is equivalent to a 1-step lookahead using the evaluated value function. For large $n$, exact PI is out of the question; we use instead optimistic PI (policy evaluation with a few VIs). A sketch of exact PI follows below.
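A sketch of exact policy iteration, again assuming the MDP is given as arrays p[u, i, j] and g[u, i, j]; policy evaluation is done by solving the n-by-n linear system $(I - \alpha P_\mu) J = g_\mu$.

```python
import numpy as np

def policy_iteration(p, g, alpha):
    """Exact policy iteration for a finite discounted MDP."""
    num_actions, n, _ = p.shape
    g_bar = (p * g).sum(axis=2)                       # expected stage cost per (u, i)
    mu = np.zeros(n, dtype=int)                       # arbitrary initial policy
    idx = np.arange(n)
    while True:
        # Policy evaluation: solve (I - alpha P_mu) J = g_mu
        P_mu = p[mu, idx, :]
        g_mu = g_bar[mu, idx]
        J = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        # Policy improvement: one-step lookahead with the evaluated J
        mu_new = (g_bar + alpha * p @ J).argmin(axis=0)
        if np.array_equal(mu_new, mu):                # no change: optimal policy found
            return J, mu
        mu = mu_new
```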

48 Infinite-Horizon DP: Theory and Algorithms  Policy Iteration
Convergence of Policy Iteration. Theorem: assume that the state and action spaces are finite. Policy iteration generates a sequence $\mu^k$ that reaches an optimal policy $\mu^*$ in a finite number of steps.
Proof. We show that $J_{\mu^k} \ge J_{\mu^{k+1}}$ for all $k$. For a given $k$, we have
$J_{\mu^k} = T_{\mu^k} J_{\mu^k} \ge TJ_{\mu^k} = T_{\mu^{k+1}} J_{\mu^k}$
Using the monotonicity property of DP,
$J_{\mu^k} \ge T_{\mu^{k+1}} J_{\mu^k} \ge T_{\mu^{k+1}}^2 J_{\mu^k} \ge \cdots \ge \lim_{N \to \infty} T_{\mu^{k+1}}^N J_{\mu^k}$
Since $\lim_{N \to \infty} T_{\mu^{k+1}}^N J_{\mu^k} = J_{\mu^{k+1}}$, we have $J_{\mu^k} \ge J_{\mu^{k+1}}$.
If $J_{\mu^k} = J_{\mu^{k+1}}$, all the inequalities above hold as equalities, so $J_{\mu^k}$ solves Bellman's equation and hence $J_{\mu^k} = J^*$.
Thus at iteration $k$ the algorithm either generates a strictly improved policy or finds an optimal policy. For a finite-space MDP, the algorithm terminates with an optimal policy.

49 Infinite-Horizon DP: Theory and Algorithms  Policy Iteration
Shorthand theory, a summary.
Infinite horizon cost function expressions [with $J_0(x) \equiv 0$]:
$J_\pi(x) = \lim_{N \to \infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_N} J_0)(x), \qquad J_\mu(x) = \lim_{N \to \infty} (T_\mu^N J_0)(x)$
Bellman's equation: $J^* = TJ^*$, $J_\mu = T_\mu J_\mu$
Optimality condition: $\mu$ is optimal $\iff$ $T_\mu J^* = TJ^*$
Value iteration: for any bounded $J$, $J^*(x) = \lim_{k \to \infty} (T^k J)(x)$ for all $x$
Policy iteration: given $\mu^k$,
Policy evaluation: find $J_{\mu^k}$ by solving $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$
Policy improvement: find $\mu^{k+1}$ such that $T_{\mu^{k+1}} J_{\mu^k} = TJ_{\mu^k}$

50 Infinite-Horizon DP: Theory and Algorithms  Q-factors
Optimal Q-factor of $(x, u)$:
$Q^*(x, u) = \mathbb{E} \{ g(x, u, w) + \alpha J^*(f(x, u, w)) \}$
It is the cost of starting at $x$, applying $u$ in the first stage, and following an optimal policy after the first stage.
The optimal value function satisfies $J^*(x) = \min_{u \in U(x)} Q^*(x, u)$ for all $x$.
Q-factors are costs in an augmented problem whose states are the pairs $(x, u)$; here $(x, u)$ is a post-decision state.

51 Infinite-Horizon DP: Theory and Algorithms  Q-factors
VI in Q-factors: we can equivalently write the VI method as
$J_{k+1}(x) = \min_{u \in U(x)} Q_{k+1}(x, u), \quad \forall x$
where $Q_{k+1}$ is generated by
$Q_{k+1}(x, u) = \mathbb{E} \Big\{ g(x, u, w) + \alpha \min_{v \in U(f(x, u, w))} Q_k(f(x, u, w), v) \Big\}$
VI converges for Q-factors.

52 Infinite-Horizon DP: Theory and Algorithms  Q-factors
Q-Factors: VI and PI for Q-factors are mathematically equivalent to VI and PI for costs. They require an equal amount of computation; they just need more storage.
Having the optimal Q-factors is convenient when implementing an optimal policy on-line, via
$\mu^*(x) \in \arg\min_{u \in U(x)} Q^*(x, u)$
Once $Q^*(x, u)$ is known, the model [$g$ and $\mathbb{E}\{\cdot\}$] is not needed: model-free operation.
Q-learning (to be discussed later) is a sampling method that calculates $Q^*(x, u)$ using a simulator of the system (no model needed).

53 Infinite-Horizon DP: Theory and Algorithms  Q-factors
MDP and Q-Factors: the optimal Q-factors form the function $Q^*(i, u)$ that satisfies the Bellman equation
$Q^*(i, u) = \sum_{j=1}^n p_{ij}(u) \Big( g(i, u, j) + \alpha \min_{v \in U(j)} Q^*(j, v) \Big)$
or, in short, $Q^* = FQ^*$.
Interpretation: Q-factors can be viewed as $J$ values obtained by considering $(i, u)$ as a post-decision state.
DP algorithms for Q-values instead of J-values:
Value iteration: $Q_{k+1} = FQ_k$ (sketched below)
Policy iteration: $F_{\mu^{k+1}} Q_{\mu^k} = FQ_{\mu^k}$
VI and PI are convergent for Q-values, and model-free.
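A minimal sketch of value iteration on Q-factors, $Q_{k+1} = FQ_k$, under the same assumed array layout p[u, i, j], g[u, i, j] used in the earlier sketches.

```python
import numpy as np

def q_value_iteration(p, g, alpha, tol=1e-8):
    """Iterate Q_{k+1}(i,u) = sum_j p_ij(u) ( g(i,u,j) + alpha * min_v Q_k(j,v) )."""
    num_actions, n, _ = p.shape
    g_bar = (p * g).sum(axis=2)                     # expected stage cost, shape (u, i)
    Q = np.zeros((num_actions, n))
    while True:
        J = Q.min(axis=0)                           # min_v Q_k(j, v), shape (n,)
        Q_new = g_bar + alpha * p @ J               # (F Q_k)(i, u)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```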

54 A Primer on ADP  Practical Difficulties of DP
The curse of dimensionality: exponential growth of the computational and storage requirements as the number of state and control variables increases; quick explosion of the number of states in combinatorial problems.
The curse of modeling: sometimes a simulator of the system is easier to construct than a model.
There may be real-time solution constraints: a family of problems may need to be addressed, with the data of the problem to be solved given with little advance notice; the problem data may change as the system is controlled, creating a need for on-line replanning.
All of the above motivate approximation and simulation.

55 A Primer on ADP  Why do we use Simulation?
One reason: a computational complexity advantage in computing sums/expectations involving a very large number of terms. Any sum $\sum_{i=1}^n a_i$ can be written as an expected value:
$\sum_{i=1}^n a_i = \sum_{i=1}^n \xi_i \frac{a_i}{\xi_i} = \mathbb{E}_\xi \Big\{ \frac{a_i}{\xi_i} \Big\}$
where $\xi$ is any probability distribution over $\{1, \ldots, n\}$. It can be approximated by generating many samples $\{i_1, \ldots, i_k\}$ from $\{1, \ldots, n\}$ according to the distribution $\xi$, and Monte Carlo averaging:
$\sum_{i=1}^n a_i = \mathbb{E}_\xi \Big\{ \frac{a_i}{\xi_i} \Big\} \approx \frac{1}{k} \sum_{t=1}^k \frac{a_{i_t}}{\xi_{i_t}}$
Simulation is also convenient when an analytical model of the system is unavailable, but a simulation/computer model is possible.
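A small sketch of this Monte Carlo averaging idea; the uniform choice of $\xi$ below is only for illustration, and any positive distribution over the indices would do (a well-chosen $\xi$ reduces the variance of the estimate).

```python
import numpy as np

def mc_sum(a, k, seed=0):
    """Estimate sum_i a_i by sampling indices from a distribution xi (uniform here)."""
    rng = np.random.default_rng(seed)
    n = len(a)
    xi = np.full(n, 1.0 / n)                     # sampling distribution over {0, ..., n-1}
    idx = rng.choice(n, size=k, p=xi)            # samples i_1, ..., i_k ~ xi
    return np.mean(a[idx] / xi[idx])             # Monte Carlo average of a_i / xi_i

a = np.random.default_rng(1).normal(size=10_000)
print(mc_sum(a, k=2_000), a.sum())               # the two numbers should be close
```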

56 Dimension Reduction in RL  Approximation in Value Space
Approximation in value space: approximate $J^*$ or $J_\mu$ from a parametric class $\tilde J(i; r)$, where $i$ is the current state and $r = (r_1, \ldots, r_m)$ is a vector of tunable scalar weights.
Use $\tilde J$ in place of $J^*$ or $J_\mu$ in various algorithms and computations.
Role of $r$: by adjusting $r$ we can change the shape of $\tilde J$ so that it is close to $J^*$ or $J_\mu$.
A simulator may be used, particularly when there is no mathematical model of the system (but there is a computer model). We will focus on simulation, but this is not the only possibility.
We may also use parametric approximation for Q-factors or cost function differences.

57 Dimension Reduction in RL  Approximation in Value Space
Approximation architectures. Two key issues:
The choice of the parametric class $\tilde J(i; r)$ (the approximation architecture)
The method for tuning the weights (training the architecture)
Success depends strongly on how these issues are handled, and also on insight about the problem.
Architectures are divided into linear and nonlinear [i.e., linear or nonlinear dependence of $\tilde J(i; r)$ on $r$]. Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer.

58 Dimension Reduction in RL  Approximation in Value Space
Computer chess example: think of the board position as the state and the move as the control. It uses a feature-based position evaluator that assigns a score (or approximate Q-factor) to each position/move: relatively few special features and weights, combined with multistep lookahead.

59 Dimension Reduction in RL  Approximation in Value Space
Linear approximation architectures: with well-chosen features, we can use a linear architecture
$\tilde J(i; r) = \phi(i)' r, \quad i = 1, \ldots, n, \qquad \text{or} \qquad \tilde J(r) = \Phi r = \sum_{j=1}^s \Phi_j r_j$
where $\Phi$ is the matrix whose rows are $\phi(i)'$, $i = 1, \ldots, n$, and $\Phi_j$ is its $j$-th column.
This is approximation on the subspace $S = \{ \Phi r \mid r \in \mathbb{R}^s \}$ spanned by the columns of $\Phi$ (the basis functions).

60 Dimension Reduction in RL  Approximation in Value Space
Often, the features encode much of the nonlinearity inherent in the cost function being approximated.
Many examples of feature types: polynomial approximation, radial basis functions, etc.

61 Dimension Reduction in RL  Approximation in Value Space
Example: polynomial type. Polynomial approximation, e.g., a quadratic approximating function. Let the state be $i = (i_1, \ldots, i_q)$ (i.e., it has $q$ dimensions) and define
$\phi_0(i) = 1, \quad \phi_k(i) = i_k, \quad \phi_{km}(i) = i_k i_m, \quad k, m = 1, \ldots, q$
Linear approximation architecture:
$\tilde J(i; r) = r_0 + \sum_{k=1}^q r_k i_k + \sum_{k=1}^q \sum_{m=k}^q r_{km} i_k i_m$
where $r$ has components $r_0$, $r_k$, and $r_{km}$ (a sketch follows below).
Interpolation: a subset $I$ of special/representative states is selected, and the parameter vector $r$ has one component $r_i$ per state $i \in I$. The approximating function is $\tilde J(i; r) = r_i$ for $i \in I$, and $\tilde J(i; r)$ = interpolation using the values at $i \in I$ for $i \notin I$.
For example: piecewise constant, piecewise linear, or more general polynomial interpolations.
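A sketch of the quadratic feature map and the resulting linear architecture $\tilde J(i; r) = \phi(i)' r$; the example state and weight vector at the bottom are hypothetical.

```python
import numpy as np

def quadratic_features(i):
    """phi(i) = (1, i_1, ..., i_q, i_k * i_m for k <= m) for a state i in R^q."""
    i = np.asarray(i, dtype=float)
    q = len(i)
    cross = [i[k] * i[m] for k in range(q) for m in range(k, q)]
    return np.concatenate(([1.0], i, cross))

def J_tilde(i, r):
    """Linear architecture: approximate cost as the inner product phi(i)' r."""
    return quadratic_features(i) @ r

phi = quadratic_features([2.0, 3.0])       # -> [1, 2, 3, 4, 6, 9]
r = np.zeros(len(phi)); r[0] = 5.0         # hypothetical weight vector
print(J_tilde([2.0, 3.0], r))              # 5.0
```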

62 Dimension Reduction in RL  Approximation in Value Space
Another example (Tetris): $J^*(i)$ is the optimal score starting from position $i$. The number of states is astronomically large for a full board.
Success has been achieved with just 22 features, readily recognized by Tetris players as capturing important aspects of the board position (heights of columns, etc.).

63 Dimension Reduction in RL  Approximation in Policy Space
Approximation in policy space (a brief discussion; we will return to it later): use a parametrization $\tilde\mu(i; r)$ of policies with a vector $r = (r_1, \ldots, r_s)$. Examples:
Polynomial, e.g., $\tilde\mu(i; r) = r_1 + r_2 i + r_3 i^2$
Linear feature-based, $\tilde\mu(i; r) = \phi_1(i) r_1 + \phi_2(i) r_2$

64 Dimension Reduction in RL  Approximation in Policy Space
Approximation in policy space: optimize the cost over $r$. For example, each value of $r$ defines a stationary policy, with cost starting at state $i$ denoted by $\tilde J(i; r)$. Let $(p_1, \ldots, p_n)$ be some probability distribution over the states, and minimize over $r$
$\sum_{i=1}^n p_i \tilde J(i; r)$
using a random search, gradient, or other method.
A special case: the parameterization of the policies is indirect, through a cost approximation architecture $\hat J$, i.e.,
$\tilde\mu(i; r) \in \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) \big( g(i, u, j) + \alpha \hat J(j; r) \big)$

65 Dimension Reduction in RL  State Aggregation
Aggregation, a first idea: group similar states together into aggregate states $x_1, \ldots, x_s$; assign a common cost value $r_i$ to each group $x_i$. Solve an aggregate DP problem involving the aggregate states to obtain $r = (r_1, \ldots, r_s)$. This is called hard aggregation.

66 Dimension Reduction in RL  State Aggregation
A more general/mathematical view: solve
$\Phi r = \Phi D\, T_\mu(\Phi r)$
where the rows of $D$ and $\Phi$ are probability distributions (e.g., $D$ and $\Phi$ aggregate rows and columns of the linear system $J = T_\mu J$).
Compare with the projected equation $\Phi r = \Pi T_\mu(\Phi r)$. Note: $\Phi D$ is a projection in some interesting cases.

67 Dimension Reduction in RL  State Aggregation
Aggregation as problem abstraction: aggregation can be viewed as a systematic approach to problem approximation. Main elements:
Solve (exactly or approximately) the aggregate problem by any kind of VI or PI method
Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem

68 Dimension Reduction in RL  State Aggregation
Aggregate system description: the transition probability from aggregate state $x$ to aggregate state $y$ under control $u$ is
$\hat p_{xy}(u) = \sum_{i=1}^n d_{xi} \sum_{j=1}^n p_{ij}(u)\, \phi_{jy}, \qquad \text{or} \qquad \hat P(u) = D P(u) \Phi$
where the rows of $D$ and $\Phi$ are the disaggregation and aggregation probabilities.
The expected transition cost is
$\hat g(x, u) = \sum_{i=1}^n d_{xi} \sum_{j=1}^n p_{ij}(u)\, g(i, u, j), \qquad \text{or} \qquad \hat g = D P(u) g$

69 Dimension Reduction in RL  State Aggregation
Aggregate Bellman's equation: the optimal cost function of the aggregate problem, denoted $\hat R$, satisfies
$\hat R(x) = \min_{u \in U} \Big[ \hat g(x, u) + \alpha \sum_y \hat p_{xy}(u) \hat R(y) \Big], \quad \forall x$
which is Bellman's equation for the aggregate problem.
The optimal cost function $J^*$ of the original problem is then approximated by $\tilde J$ given by
$\tilde J(j) = \sum_y \phi_{jy} \hat R(y), \quad \forall j$
A sketch of the construction follows below.
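A sketch of how the aggregate quantities $\hat p$ and $\hat g$ can be assembled from $D$, $\Phi$, and the original MDP arrays p[u, i, j], g[u, i, j] (the array layout is an assumption carried over from the earlier sketches).

```python
import numpy as np

def aggregate_mdp(p, g, D, Phi):
    """Build the aggregate MDP quantities p_hat(u) = D P(u) Phi and g_hat(x, u).

    p[u, i, j], g[u, i, j]: original MDP.
    D (s x n): disaggregation probabilities; Phi (n x s): aggregation probabilities.
    """
    num_actions = p.shape[0]
    # p_hat_xy(u) = sum_i d_xi sum_j p_ij(u) phi_jy
    p_hat = np.stack([D @ p[u] @ Phi for u in range(num_actions)])
    # g_hat(x, u) = sum_i d_xi sum_j p_ij(u) g(i, u, j)  (already an expected cost)
    g_hat = np.stack([D @ (p[u] * g[u]).sum(axis=1) for u in range(num_actions)])
    # Solve R(x) = min_u [ g_hat(x,u) + alpha sum_y p_hat_xy(u) R(y) ] by any VI/PI
    # variant, then approximate the original costs by J_tilde = Phi @ R_hat.
    return p_hat, g_hat
```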

70 Dimension Reduction in RL  State Aggregation
Example I: hard aggregation. Group the original system states into subsets, and view each subset as an aggregate state.
Aggregation probabilities: $\phi_{jy} = 1$ if $j$ belongs to aggregate state $y$.
Disaggregation probabilities: there are many possibilities, e.g., all states $i$ within aggregate state $x$ have equal probability $d_{xi}$.
If the optimal cost vector $J^*$ is piecewise constant over the aggregate states/subsets, hard aggregation is exact. This suggests grouping states with roughly equal cost into aggregates.
A variant: soft aggregation (provides soft boundaries between aggregate states).

71 Dimension Reduction in RL  State Aggregation
Example II: feature-based aggregation. Important question: how do we group states together? If we know good features, it makes sense to group together states that have similar features.
This gives a general approach for passing from a feature-based state representation to a hard-aggregation-based architecture: essentially discretize the features and generate a corresponding piecewise constant approximation to the optimal cost function.
The aggregation-based architecture is more powerful (it is nonlinear in the features), but it may require many more aggregate states to reach the same level of performance as the corresponding linear feature-based architecture.

72 Dimension Reduction in RL  State Aggregation
Example III: representative states / coarse grid. Choose a collection of representative original system states, and associate each of them with an aggregate state.
Disaggregation probabilities: $d_{xi} = 1$ if $i$ is equal to representative state $x$.
Aggregation probabilities associate each original system state $j$ with a convex combination $\sum_y \phi_{jy}\, y$ of representative states.
Well suited for discretization of a Euclidean space; extends nicely to continuous state spaces, including the belief space of a POMDP.

73 On-Policy Learning  Direct Projection
Direct policy evaluation: approximate the cost of the current policy by using least squares and simulation-generated cost samples. This amounts to projection of $J_\mu$ onto the approximation subspace.
Solution by least-squares methods; used within regular and optimistic policy iteration. Nonlinear approximation architectures may also be used.

74 On-Policy Learning  Direct Projection
Direct evaluation by simulation: projection by Monte Carlo simulation. Compute the projection $\Pi J_\mu$ of $J_\mu$ on the subspace $S = \{ \Phi r \mid r \in \mathbb{R}^s \}$, with respect to a weighted Euclidean norm $\|\cdot\|_\xi$.
Equivalently, find $\Phi r^*$, where
$r^* = \arg\min_{r \in \mathbb{R}^s} \|\Phi r - J_\mu\|_\xi^2 = \arg\min_{r \in \mathbb{R}^s} \sum_{i=1}^n \xi_i \big( \phi(i)' r - J_\mu(i) \big)^2$
Setting the gradient at $r^*$ to 0,
$r^* = \Big( \sum_{i=1}^n \xi_i\, \phi(i) \phi(i)' \Big)^{-1} \sum_{i=1}^n \xi_i\, \phi(i) J_\mu(i)$

75 On-Policy Learning  Direct Projection
Direct evaluation by simulation: generate samples $\{ (i_1, J_\mu(i_1)), \ldots, (i_k, J_\mu(i_k)) \}$ using the distribution $\xi$. Approximate the two expected values by Monte Carlo, with low-dimensional calculations:
$\hat r_k = \Big( \sum_{t=1}^k \phi(i_t) \phi(i_t)' \Big)^{-1} \sum_{t=1}^k \phi(i_t) J_\mu(i_t)$
Equivalent least-squares alternative calculation:
$\hat r_k = \arg\min_{r \in \mathbb{R}^s} \sum_{t=1}^k \big( \phi(i_t)' r - J_\mu(i_t) \big)^2$
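A minimal sketch of this direct fit: given sampled feature rows $\phi(i_t)$ and cost samples $J_\mu(i_t)$ (both assumed supplied by the simulator), solve the least-squares problem for $\hat r_k$.

```python
import numpy as np

def direct_fit(features, costs):
    """Least-squares fit of Phi r to sampled costs.

    features: array of shape (k, s), row t is phi(i_t)'
    costs:    array of shape (k,),   entry t is a sampled cost J_mu(i_t)
    """
    # r_hat = (sum_t phi phi')^{-1} sum_t phi J_mu, computed stably via lstsq
    r_hat, *_ = np.linalg.lstsq(features, costs, rcond=None)
    return r_hat
```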

76 On-Policy Learning  Direct Projection
Convergence of the evaluated policy: by the law of large numbers,
$\frac{1}{k} \sum_{t=1}^k \phi(i_t) \phi(i_t)' \xrightarrow{a.s.} \sum_{i=1}^n \xi_i\, \phi(i) \phi(i)'$ and $\frac{1}{k} \sum_{t=1}^k \phi(i_t) J_\mu(i_t) \xrightarrow{a.s.} \sum_{i=1}^n \xi_i\, \phi(i) J_\mu(i)$
Hence $\hat r_k \xrightarrow{a.s.} r^*$, with $\Phi r^* = \Pi_S J_\mu$. As the number of samples increases, the estimated low-dimensional vector $\hat r_k$ converges almost surely to the coefficient vector of the projection of $J_\mu$.

77 On-Policy Learning  Direct Projection
Indirect policy evaluation: solve the projected equation $\Phi r = \Pi T_\mu(\Phi r)$, where $\Pi$ is the projection with respect to a suitable weighted Euclidean norm.
Solution methods that use simulation (to manage the calculation of $\Pi$):
TD($\lambda$): a stochastic iterative algorithm for solving $\Phi r = \Pi T_\mu(\Phi r)$
LSTD($\lambda$): solves a simulation-based approximation with a standard solver
LSPE($\lambda$): a simulation-based form of projected value iteration; essentially $\Phi r_{k+1} = \Pi T_\mu(\Phi r_k)$ plus simulation noise
Almost-sure convergence guarantee.

78 On-Policy Learning  Bellman Error Minimization
Bellman equation error methods: another example of indirect approximate policy evaluation,
$\min_r \|\Phi r - T_\mu(\Phi r)\|_\xi^2 \qquad (*)$
where $\|\cdot\|_\xi$ is a Euclidean norm weighted with respect to some distribution $\xi$. It is closely related to the projected equation/Galerkin approach (with a special choice of projection norm).
There are several ways to implement projected equation and Bellman error methods by simulation. They involve: generating many random samples of states $i_k$ using the distribution $\xi$; generating many samples of transitions $(i_k, j_k)$ using the policy $\mu$; forming a simulation-based approximation of the optimality condition for the projection problem or problem (*) (using sample averages in place of inner products); and solving the Monte Carlo approximation of the optimality condition.
Issues for indirect methods: how to generate the samples? How to calculate $r$ efficiently?

79 On-Policy Learning  Projected Bellman Equation Method
Cost function approximation via projected equations: ideally, we want to solve the Bellman equation (for a fixed policy $\mu$). In an MDP, the equation is $n \times n$:
$J = T_\mu J = g_\mu + \alpha P_\mu J$
We solve a projected version of this high-dimensional equation:
$J = \Pi(g_\mu + \alpha P_\mu J)$
Since the projection $\Pi$ is onto the space spanned by the columns of $\Phi$, the projected equation is equivalent to
$\Phi r = \Pi(g_\mu + \alpha P_\mu \Phi r)$
We fix the policy $\mu$ from now on, and omit mentioning it.

80 On-Policy Learning  Projected Bellman Equation Method
Matrix form of the projected equation: the solution $\Phi r^*$ satisfies the orthogonality condition, namely the error $\Phi r^* - (g + \alpha P \Phi r^*)$ is orthogonal to the subspace spanned by the columns of $\Phi$. This is written as
$\Phi' \Xi \big( \Phi r^* - (g + \alpha P \Phi r^*) \big) = 0$
where $\Xi$ is the diagonal matrix with the steady-state probabilities $\xi_1, \ldots, \xi_n$ along the diagonal. Equivalently,
$C r^* = d, \qquad C = \Phi' \Xi (I - \alpha P) \Phi, \qquad d = \Phi' \Xi g$
but computing $C$ and $d$ directly is HARD (high-dimensional inner products).

81 On-Policy Learning  Projected Bellman Equation Method
Simulation-based implementations. Key idea: calculate simulation-based approximations based on $k$ samples,
$C_k \approx C, \qquad d_k \approx d$
The matrix inversion $r^* = C^{-1} d$ is approximated by $\hat r_k = C_k^{-1} d_k$.
This is the LSTD (Least Squares Temporal Differences) method.
Key fact: $C_k$ and $d_k$ can be computed with low-dimensional linear algebra (of order $s$, the number of basis functions).

82 On-Policy Learning  Projected Bellman Equation Method
Simulation mechanics: we generate an infinitely long trajectory $(i_0, i_1, \ldots)$ of the Markov chain, so that states $i$ and transitions $(i, j)$ appear with long-term frequencies $\xi_i$ and $p_{ij}$. After generating each transition $(i_t, i_{t+1})$, we compute the row $\phi(i_t)'$ of $\Phi$ and the cost component $g(i_t, i_{t+1})$. We form
$d_k = \frac{1}{k+1} \sum_{t=0}^k \phi(i_t)\, g(i_t, i_{t+1}) \;\approx\; \sum_{i,j} \xi_i p_{ij}\, \phi(i) g(i, j) = \Phi' \Xi g = d$
$C_k = \frac{1}{k+1} \sum_{t=0}^k \phi(i_t) \big( \phi(i_t) - \alpha \phi(i_{t+1}) \big)' \;\approx\; \Phi' \Xi (I - \alpha P) \Phi = C$
Convergence is based on the law of large numbers: $C_k \xrightarrow{a.s.} C$, $d_k \xrightarrow{a.s.} d$. As the sample size increases, $\hat r_k$ converges almost surely to the solution of the projected Bellman equation. A sketch follows below.
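A sketch of LSTD built from a single simulated trajectory; the trajectory, the per-transition costs, and the feature map phi are assumed to be supplied by the caller.

```python
import numpy as np

def lstd(states, costs, phi, alpha):
    """LSTD estimate from one trajectory i_0, i_1, ..., i_k.

    states: sequence of visited states; costs[t] = g(i_t, i_{t+1});
    phi(i) returns the s-dimensional feature vector of state i.
    """
    s = len(phi(states[0]))
    C = np.zeros((s, s))
    d = np.zeros(s)
    k = len(states) - 1
    for t in range(k):
        f_t, f_next = phi(states[t]), phi(states[t + 1])
        C += np.outer(f_t, f_t - alpha * f_next)   # sample version of Phi' Xi (I - alpha P) Phi
        d += f_t * costs[t]                        # sample version of Phi' Xi g
    C /= k
    d /= k
    return np.linalg.solve(C, d)                   # r_hat = C_k^{-1} d_k
```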

83 On-Policy Learning  From On-Policy to Off-Policy
Outer loop (off-policy RL), approximate PI via on-policy learning: estimate the value function of the current policy $\mu_t$ using linear features, $J_{\mu_t} \approx \Phi r_t$.
Inner loop (on-policy RL): generate state trajectories and estimate $r_t$ via Bellman error minimization (or direct projection, or the projected equation approach).
Update the policy by
$\mu_{t+1}(i) = \arg\min_a \sum_j p_{ij}(a) \big( g(i, a, j) + \alpha\, \phi(j)' r_t \big), \quad \forall i$
Comments: this requires knowledge of $p_{ij}$ (suitable for computer games with known transitions); the policy $\mu_{t+1}$ is parameterized by $r_t$.

84 On-Policy Learning  From On-Policy to Off-Policy
Approximate PI via on-policy learning: use simulation to approximate the cost $J_\mu$ of the current policy $\mu$; generate an improved policy $\bar\mu$ by minimizing in the (approximate) Bellman equation. Alternatively, we can approximate the Q-factors of $\mu$.

85 On-Policy Learning  From On-Policy to Off-Policy
Theoretical basis of approximate PI: if policies are approximately evaluated using an approximation architecture such that
$\max_i |\tilde J(i, r_k) - J_{\mu^k}(i)| \le \delta, \quad k = 0, 1, \ldots$
and if the policy improvement is also approximate,
$\max_i |(T_{\mu^{k+1}} \tilde J)(i, r_k) - (T \tilde J)(i, r_k)| \le \epsilon, \quad k = 0, 1, \ldots$
then the following error bound holds: the sequence $\{\mu^k\}$ generated by approximate policy iteration satisfies
$\limsup_{k \to \infty} \max_i \big( J_{\mu^k}(i) - J^*(i) \big) \le \frac{\epsilon + 2\alpha\delta}{(1 - \alpha)^2}$
Typical practical behavior: the method makes steady progress up to a point, and then the iterates $J_{\mu^k}$ oscillate within a neighborhood of $J^*$. In practice, oscillation between policies is probably not the major concern.

86 Off-Policy RL via Linear Duality
Look at the Bellman equation again. Consider an MDP model with:
States $i = 1, \ldots, n$
Probability transition matrix under policy $\mu$: $P_\mu \in \mathbb{R}^{n \times n}$
Reward of a transition: $g_\mu \in \mathbb{R}^n$
The Bellman equation is
$J = \max_\mu \{ g_\mu + P_\mu J \}$
This is a nonlinear system of equations. Note: the right-hand side is the supremum of a number of linear mappings of $J$!

87 Off-Policy RL via Linear Duality
DP is a special case of LP. Theorem: every finite-state DP problem is an LP problem. We construct the following LP:
minimize $J(1) + \cdots + J(n)$
subject to $J(i) \ge \sum_{j=1}^n p_{ij}(u)\, g_{iju} + \sum_{j=1}^n p_{ij}(u) J(j), \quad \forall i,\ \forall u \in A$
or, more compactly,
minimize $e' J$  subject to $J \ge g_\mu + P_\mu J, \ \forall \mu$.
The variables are $J(i)$, $i = 1, \ldots, n$. For each state-action pair $(i, u)$ there is one inequality constraint.
Dimension of the LP: $n$ variables and $n \cdot |A|$ constraints (here we assume a finite-state, finite-action problem).

88 Off-Policy RL via Linear Duality
DP is a special case of LP. Theorem: the solution of the constructed LP
minimize $e' J$  subject to $J \ge g_\mu + P_\mu J, \ \forall \mu$
is exactly the solution of the Bellman equation $J = \max_\mu \{ g_\mu + P_\mu J \}$.
Proof. The solution $J^*$ of the Bellman equation is clearly a feasible solution of the LP. Conversely, any feasible $J$ satisfies $J \ge g_\mu + P_\mu J$ for all $\mu$, and iterating this inequality gives $J \ge J_\mu$ for all $\mu$, hence $J \ge J^*$. Since the Bellman equation has a unique solution, the LP optimum is attained exactly at $J^*$.
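A small sketch of the LP formulation using scipy.optimize.linprog, for an MDP given as arrays p[u, i, j], g[u, i, j]. A discount factor alpha < 1 is added here (an assumption not written on this slide) so that the LP is bounded.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(p, g, alpha):
    """Solve J(i) = max_u sum_j p_ij(u) ( g(i,u,j) + alpha J(j) ) as an LP.

    minimize  sum_i J(i)
    s.t.      J(i) >= sum_j p_ij(u) g(i,u,j) + alpha sum_j p_ij(u) J(j)   for all (i, u)
    """
    num_actions, n, _ = p.shape
    g_bar = (p * g).sum(axis=2)                   # expected reward per (u, i)
    A_ub, b_ub = [], []
    for u in range(num_actions):
        for i in range(n):
            row = alpha * p[u, i, :].copy()       # alpha * p_ij(u) terms on J(j)
            row[i] -= 1.0                         # rearranged as: alpha p J - J(i) <= -g_bar
            A_ub.append(row)
            b_ub.append(-g_bar[u, i])
    res = linprog(c=np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)      # J is unrestricted in sign
    return res.x                                  # the optimal value function J*
```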

89 Off-Policy RL via Linear Duality
ADP via approximate linear programming: the constructed LP
minimize $e' J$  subject to $J \ge g_\mu + P_\mu J, \ \forall \mu$
is of huge scale. Approximate LP:
We may approximate $J$ by adding the constraint $J = \Phi r$, so the variable dimension becomes smaller.
We may sample a subset of all constraints, so the constraint dimension becomes smaller.
The LP and the approximate LP can be solved by simulation/online.

90 Final Remarks  Tetris
Board height: 12, width: 7.
Rotate and move the falling shape; gravity speed is related to the current height; score when eliminating an entire level; game over when the stack reaches the ceiling.

91 Final Remarks  DP Model of Tetris
State: the current board, the current falling tile, and predictions of future tiles
Termination state: when the tiles reach the ceiling, the game is over with no more future reward
Action: rotation and shift
System dynamics: the next board is deterministically determined by the current board and the player's placement of the current tile; the future tiles are generated randomly
Uncertainty: randomness in future tiles
Transition cost $g$: if a level is cleared by the current action, score 1; otherwise score 0
Objective: expectation of the total score

92 Final Remarks  Interesting facts about Tetris
First released in 1984 by Alexey Pajitnov from the Soviet Union.
Tetris has been proved to be NP-complete, and the game ends with probability 1.
For a 12 x 7 board, the number of possible states is on the order of $10^{25}$.
Highest score achieved by a human: about 1 million. Highest score achieved by an algorithm: about 35 million (average performance).

93 Final Remarks  Solve Tetris by Approximate Dynamic Programming
Curse of dimensionality: the exact cost-to-go vector $V$ has dimension equal to the state space size ($\approx 10^{25}$). The transition matrix is $10^{25} \times 10^{25}$, although sparse. There is no way to solve such a Bellman equation exactly.
Cost function approximation: instead of characterizing the entire board configuration, we use features $\phi_1, \ldots, \phi_d$ to describe the current state, and represent the cost-to-go vector $V$ as a linear combination of a small number of features:
$V \approx \beta_1 \phi_1 + \cdots + \beta_d \phi_d$
We approximate the high-dimensional space of $V$ using a low-dimensional feature space of $\beta$.

94 Final Remarks Features of Tetris The earliest tetris learning algorithm uses 3 features The latest best algorithm uses 22 features The mapping from state space to feature space can be viewed as state aggregation: different states/boards with similar height/holes/structures are treated as a single meta-state.

95 Final Remarks  Learning in Tetris
Ideally, we still want to solve the Bellman equation
$V = \max_\mu \{ g_\mu + P_\mu V \}$
We consider an approximate solution $\tilde V$ to the Bellman equation of the form $\tilde V = \Phi\beta$, and apply policy iteration with approximate policy evaluation:
Outer loop: keep improving the policies $\mu_k$ using one-step lookahead
Inner loop: find the cost-to-go $V_{\mu_k}$ associated with the current policy $\mu_k$, by taking samples and using the feature representation
We want to avoid high-dimensional algebraic operations, as well as storing high-dimensional quantities.

96 Final Remarks Algorithms tested on tetris More examples of learning methods: Temporal difference learning, TD(λ), LSTD, cross-entropy method, actor-critic, active learning, etc... These algorithms are essentially all combinations of DP, sampling, parametric models.

97 Final Remarks Tetris World Record: Human vs. AI

98 AlphaGo
Networks: a rollout policy, an SL policy network trained on human expert positions, an RL policy network trained on self-play positions, and a value network; the policy networks output $p(a \mid s)$ and the value network outputs an evaluation of the position $s'$.
Input features of the neural networks (feature, number of planes, description):
Stone colour, 3, player stone / opponent stone / empty
Ones, 1, a constant plane filled with 1
Turns since, 8, how many turns since a move was played
Liberties, 8, number of liberties (empty adjacent points)
Capture size, 8, how many opponent stones would be captured
Self-atari size, 8, how many of own stones would be captured
Liberties after move, 8, number of liberties after this move is played
Ladder capture, 1, whether a move at this point is a successful ladder capture
Ladder escape, 1, whether a move at this point is a successful ladder escape
Sensibleness, 1, whether a move is legal and does not fill its own eyes
Zeros, 1, a constant plane filled with 0
Player color, 1, whether the current player is black

99 Books
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction
Masashi Sugiyama, Statistical Reinforcement Learning: Modern Machine Learning Approaches
Marco Wiering and Martijn van Otterlo (eds.), Reinforcement Learning: State-of-the-Art
Also in MDP books: Mykel J. Kochenderfer, Decision Making Under Uncertainty: Theory and Application


More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

Lecture 23: Reinforcement Learning

Lecture 23: Reinforcement Learning Lecture 23: Reinforcement Learning MDPs revisited Model-based learning Monte Carlo value function estimation Temporal-difference (TD) learning Exploration November 23, 2006 1 COMP-424 Lecture 23 Recall:

More information

9 Improved Temporal Difference

9 Improved Temporal Difference 9 Improved Temporal Difference Methods with Linear Function Approximation DIMITRI P. BERTSEKAS and ANGELIA NEDICH Massachusetts Institute of Technology Alphatech, Inc. VIVEK S. BORKAR Tata Institute of

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

Optimal sequential decision making for complex problems agents. Damien Ernst University of Liège

Optimal sequential decision making for complex problems agents. Damien Ernst University of Liège Optimal sequential decision making for complex problems agents Damien Ernst University of Liège Email: dernst@uliege.be 1 About the class Regular lectures notes about various topics on the subject with

More information

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study CS 287: Advanced Robotics Fall 2009 Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study Pieter Abbeel UC Berkeley EECS Assignment #1 Roll-out: nice example paper: X.

More information

A System Theoretic Perspective of Learning and Optimization

A System Theoretic Perspective of Learning and Optimization A System Theoretic Perspective of Learning and Optimization Xi-Ren Cao* Hong Kong University of Science and Technology Clear Water Bay, Kowloon, Hong Kong eecao@ee.ust.hk Abstract Learning and optimization

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning Introduction to Reinforcement Learning Rémi Munos SequeL project: Sequential Learning http://researchers.lille.inria.fr/ munos/ INRIA Lille - Nord Europe Machine Learning Summer School, September 2011,

More information

Hidden Markov Models (HMM) and Support Vector Machine (SVM)

Hidden Markov Models (HMM) and Support Vector Machine (SVM) Hidden Markov Models (HMM) and Support Vector Machine (SVM) Professor Joongheon Kim School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea 1 Hidden Markov Models (HMM)

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

Optimal Stopping Problems

Optimal Stopping Problems 2.997 Decision Making in Large Scale Systems March 3 MIT, Spring 2004 Handout #9 Lecture Note 5 Optimal Stopping Problems In the last lecture, we have analyzed the behavior of T D(λ) for approximating

More information

Reinforcement learning an introduction

Reinforcement learning an introduction Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

15-780: LinearProgramming

15-780: LinearProgramming 15-780: LinearProgramming J. Zico Kolter February 1-3, 2016 1 Outline Introduction Some linear algebra review Linear programming Simplex algorithm Duality and dual simplex 2 Outline Introduction Some linear

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

Linear Programming Methods

Linear Programming Methods Chapter 11 Linear Programming Methods 1 In this chapter we consider the linear programming approach to dynamic programming. First, Bellman s equation can be reformulated as a linear program whose solution

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

Procedia Computer Science 00 (2011) 000 6

Procedia Computer Science 00 (2011) 000 6 Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

Q-Learning in Continuous State Action Spaces

Q-Learning in Continuous State Action Spaces Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental

More information

Introduction to Reinforcement Learning. Part 6: Core Theory II: Bellman Equations and Dynamic Programming

Introduction to Reinforcement Learning. Part 6: Core Theory II: Bellman Equations and Dynamic Programming Introduction to Reinforcement Learning Part 6: Core Theory II: Bellman Equations and Dynamic Programming Bellman Equations Recursive relationships among values that can be used to compute values The tree

More information

Overview Example (TD-Gammon) Admission Why approximate RL is hard TD() Fitted value iteration (collocation) Example (k-nn for hill-car)

Overview Example (TD-Gammon) Admission Why approximate RL is hard TD() Fitted value iteration (collocation) Example (k-nn for hill-car) Function Approximation in Reinforcement Learning Gordon Geo ggordon@cs.cmu.edu November 5, 999 Overview Example (TD-Gammon) Admission Why approximate RL is hard TD() Fitted value iteration (collocation)

More information

Deep Reinforcement Learning. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017

Deep Reinforcement Learning. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017 Deep Reinforcement Learning STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017 Outline Introduction to Reinforcement Learning AlphaGo (Deep RL for Computer Go)

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques

More information

WE consider finite-state Markov decision processes

WE consider finite-state Markov decision processes IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 54, NO. 7, JULY 2009 1515 Convergence Results for Some Temporal Difference Methods Based on Least Squares Huizhen Yu and Dimitri P. Bertsekas Abstract We consider

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari

More information

INTRODUCTION TO MARKOV DECISION PROCESSES

INTRODUCTION TO MARKOV DECISION PROCESSES INTRODUCTION TO MARKOV DECISION PROCESSES Balázs Csanád Csáji Research Fellow, The University of Melbourne Signals & Systems Colloquium, 29 April 2010 Department of Electrical and Electronic Engineering,

More information

Section Notes 9. Midterm 2 Review. Applied Math / Engineering Sciences 121. Week of December 3, 2018

Section Notes 9. Midterm 2 Review. Applied Math / Engineering Sciences 121. Week of December 3, 2018 Section Notes 9 Midterm 2 Review Applied Math / Engineering Sciences 121 Week of December 3, 2018 The following list of topics is an overview of the material that was covered in the lectures and sections

More information

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396 Machine Learning Reinforcement learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1396 1 / 32 Table of contents 1 Introduction

More information

5. Solving the Bellman Equation

5. Solving the Bellman Equation 5. Solving the Bellman Equation In the next two lectures, we will look at several methods to solve Bellman s Equation (BE) for the stochastic shortest path problem: Value Iteration, Policy Iteration and

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course Value Iteration: the Idea 1. Let V 0 be any vector in R N A. LAZARIC Reinforcement

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Markov Decision Processes

Markov Decision Processes Markov Decision Processes Lecture notes for the course Games on Graphs B. Srivathsan Chennai Mathematical Institute, India 1 Markov Chains We will define Markov chains in a manner that will be useful to

More information

The Markov Decision Process (MDP) model

The Markov Decision Process (MDP) model Decision Making in Robots and Autonomous Agents The Markov Decision Process (MDP) model Subramanian Ramamoorthy School of Informatics 25 January, 2013 In the MAB Model We were in a single casino and the

More information

CS 4100 // artificial intelligence. Recap/midterm review!

CS 4100 // artificial intelligence. Recap/midterm review! CS 4100 // artificial intelligence instructor: byron wallace Recap/midterm review! Attribution: many of these slides are modified versions of those distributed with the UC Berkeley CS188 materials Thanks

More information

arxiv: v1 [cs.lg] 23 Oct 2017

arxiv: v1 [cs.lg] 23 Oct 2017 Accelerated Reinforcement Learning K. Lakshmanan Department of Computer Science and Engineering Indian Institute of Technology (BHU), Varanasi, India Email: lakshmanank.cse@itbhu.ac.in arxiv:1710.08070v1

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Basis Function Adaptation Methods for Cost Approximation in MDP

Basis Function Adaptation Methods for Cost Approximation in MDP Basis Function Adaptation Methods for Cost Approximation in MDP Huizhen Yu Department of Computer Science and HIIT University of Helsinki Helsinki 00014 Finland Email: janeyyu@cshelsinkifi Dimitri P Bertsekas

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Daniel Hennes 19.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Eligibility traces n-step TD returns Forward and backward view Function

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo Marc Toussaint University of

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Reinforcement Learning Part 2

Reinforcement Learning Part 2 Reinforcement Learning Part 2 Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ From previous tutorial Reinforcement Learning Exploration No supervision Agent-Reward-Environment

More information

Markov Decision Processes and Solving Finite Problems. February 8, 2017

Markov Decision Processes and Solving Finite Problems. February 8, 2017 Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:

More information

Deep Reinforcement Learning

Deep Reinforcement Learning Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 50 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

1 Problem Formulation

1 Problem Formulation Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning

More information

CS885 Reinforcement Learning Lecture 7a: May 23, 2018

CS885 Reinforcement Learning Lecture 7a: May 23, 2018 CS885 Reinforcement Learning Lecture 7a: May 23, 2018 Policy Gradient Methods [SutBar] Sec. 13.1-13.3, 13.7 [SigBuf] Sec. 5.1-5.2, [RusNor] Sec. 21.5 CS885 Spring 2018 Pascal Poupart 1 Outline Stochastic

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

Stochastic Primal-Dual Methods for Reinforcement Learning

Stochastic Primal-Dual Methods for Reinforcement Learning Stochastic Primal-Dual Methods for Reinforcement Learning Alireza Askarian 1 Amber Srivastava 1 1 Department of Mechanical Engineering University of Illinois at Urbana Champaign Big Data Optimization,

More information