Introduction to Approximate Dynamic Programming


Dan Zhang
Leeds School of Business, University of Colorado at Boulder
Spring 2012

Key References

Bertsekas, D.P. Chapter 6: Approximate Dynamic Programming. Dynamic Programming and Optimal Control, 3rd Edition, Volume II. Available online.
Bertsekas, D.P., J.N. Tsitsiklis. 1996. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.
Powell, W. 2007. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley-Interscience.

Outline

Setup: infinite-horizon discounted MDPs
Policy evaluation via Monte Carlo simulation
Q-learning
Linear programming based approximate dynamic programming
Approximate policy iteration
Rollout policies
State aggregation

Setup

Infinite-horizon discounted MDP
Discrete state space: $S$ is finite
Action space $U$; $U(i)$ denotes the set of actions available in state $i$
Transition probabilities $p_{ij}(u)$
Costs $g(i, u, j)$, with $|g(i, u, j)| < \infty$
Discount factor $\alpha \in [0, 1)$

MDP Model

Let $\pi \in \Pi^{MD}$, the set of Markovian deterministic policies. The states visited under policy $\pi$ follow a Markov chain with transition probabilities $P_\pi = \{p_{ij}(\pi(i)) : i, j \in S\}$.

Total discounted cost:
$$J_\pi(i) = \lim_{T \to \infty} E\left[ \sum_{t=0}^{T} \alpha^t g\big(i_t^\pi, \pi(i_t^\pi), i_{t+1}^\pi\big) \right].$$

An optimal policy can be computed by solving the optimality equations:
$$J(i) = \min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \big[ g(i, u, j) + \alpha J(j) \big].$$

Value Iteration

The optimality equation can be written as $J = TJ$, where
$$[TJ](i) = \min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \big[ g(i, u, j) + \alpha J(j) \big].$$
The value function $J$ is a fixed point of the operator $T$. Under suitable technical conditions, $J = \lim_{k \to \infty} T^k J_1$ for any initial vector $J_1$.
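A minimal value iteration sketch for a small finite MDP, assuming the model is given as arrays: p[u] is the transition matrix under action u, g[u] the corresponding stage costs, and alpha the discount factor (these names and the tolerance are illustrative, not from the slides).

```python
import numpy as np

def value_iteration(p, g, alpha, tol=1e-8, max_iter=10_000):
    """p[u][i, j]: transition probability; g[u][i, j]: stage cost; alpha: discount factor."""
    n_actions, n_states = len(p), p[0].shape[0]
    J = np.zeros(n_states)
    for _ in range(max_iter):
        # Apply the operator T: expected cost-to-go for each action, then minimize over actions.
        Q = np.array([(p[u] * (g[u] + alpha * J)).sum(axis=1) for u in range(n_actions)])
        J_new = Q.min(axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            break
        J = J_new
    return J_new, Q.argmin(axis=0)   # value estimate and a greedy policy
```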

Policy Iteration

Define
$$[T_\pi J](i) = \sum_{j \in S} p_{ij}(\pi(i)) \big[ g(i, \pi(i), j) + \alpha J(j) \big].$$
It can be shown that the discounted cost $J_\pi$ incurred by policy $\pi$ is a fixed point of the operator $T_\pi$; i.e., $J_\pi = T_\pi J_\pi$.

Policy Iteration (Continued)

Policy evaluation: Compute the discounted cost $J_{\pi_k}$ incurred by policy $\pi_k$, possibly by solving the system of linear equations
$$J_{\pi_k} = g_{\pi_k} + \alpha P_{\pi_k} J_{\pi_k}.$$

Policy improvement: For all $i \in S$, define the policy $\pi_{k+1}$ by
$$\pi_{k+1}(i) = \arg\min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \big[ g(i, u, j) + \alpha J_{\pi_k}(j) \big].$$

Stop if $\pi_k = \pi_{k+1}$.
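A sketch of exact policy iteration under the same assumed model conventions as the value iteration sketch above; the evaluation step solves the linear system on the slide directly.

```python
import numpy as np

def policy_iteration(p, g, alpha):
    n_actions, n_states = len(p), p[0].shape[0]
    pi = np.zeros(n_states, dtype=int)                      # arbitrary initial policy
    while True:
        # Policy evaluation: solve J = g_pi + alpha * P_pi J as a linear system.
        P_pi = np.array([p[pi[i]][i, :] for i in range(n_states)])
        g_pi = np.array([(p[pi[i]][i, :] * g[pi[i]][i, :]).sum() for i in range(n_states)])
        J = np.linalg.solve(np.eye(n_states) - alpha * P_pi, g_pi)
        # Policy improvement: greedy policy with respect to J.
        Q = np.array([(p[u] * (g[u] + alpha * J)).sum(axis=1) for u in range(n_actions)])
        pi_new = Q.argmin(axis=0)
        if np.array_equal(pi_new, pi):                      # stop if pi_{k+1} = pi_k
            return J, pi
        pi = pi_new
```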

Three Curses of Dimensionality (Powell, 2007)

The state space is large: computing and storing the value function $J(\cdot)$ is difficult for both the value iteration and policy iteration algorithms.
The action space is large: the minimization over actions in value iteration and policy improvement becomes difficult.
Computing expectations with respect to the transition probability matrix is difficult when the system dynamics are complex.

Policy Evaluation with Monte Carlo Simulation

Policy evaluation requires solving linear equations of the form
$$J_\pi = g_\pi + \alpha P_\pi J_\pi,$$
which is difficult when $P_\pi$ is large or unknown.

Idea: Simulate a sequence of states $\{i_0, i_1, \dots\}$ by following a particular policy $\pi$. An approximation $\tilde{J}$ of $J_\pi$ can be updated as follows:
$$\tilde{J}(i_k) \leftarrow (1 - \gamma) \tilde{J}(i_k) + \gamma \big[ g(i_k, \pi(i_k), i_{k+1}) + \alpha \tilde{J}(i_{k+1}) \big].$$

Why does this make sense?
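A minimal sketch of this simulation-based update, assuming a hypothetical one-step simulator step(i, u) that returns the next state and realized cost, and a tabular policy pi mapping states to actions; the step size gamma and run length are illustrative choices.

```python
import numpy as np

def mc_policy_evaluation(step, pi, n_states, alpha, gamma=0.1, n_steps=100_000, i0=0):
    """Tabular simulation-based policy evaluation with the update on the slide."""
    J = np.zeros(n_states)
    i = i0
    for _ in range(n_steps):
        j, cost = step(i, pi[i])                            # simulate one transition under pi
        J[i] = (1 - gamma) * J[i] + gamma * (cost + alpha * J[j])
        i = j
    return J
```

In practice the step size gamma is often decreased over time rather than held fixed, which is what stochastic approximation convergence results typically require.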

Policy Evaluation with Monte Carlo Simulation (Continued)

The procedure requires storing $\tilde{J}$, which can be problematic when the state space $S$ is large.

Idea: Use a parameterized approximation architecture
$$\tilde{J}(i, r) = \sum_{l=1}^{L} r_l \phi_l(i),$$
where the $\phi_l(\cdot)$'s are pre-specified functions ("feature functions") and $r$ is a vector of adjustable parameters.

How do we update $r$?
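One standard answer to the question above, not spelled out on the slide, is a temporal-difference style stochastic gradient step on the simulated trajectory. A minimal sketch, reusing the assumed step simulator and an assumed feature map phi(i) that returns a length-L array:

```python
import numpy as np

def td0_linear(step, pi, phi, L, alpha, gamma=0.05, n_steps=100_000, i0=0):
    """Update r along a trajectory simulated under pi, using the linear architecture J(i, r) = r . phi(i)."""
    r = np.zeros(L)
    i = i0
    for _ in range(n_steps):
        j, cost = step(i, pi[i])
        # Temporal-difference error for the linear approximation.
        delta = cost + alpha * r @ phi(j) - r @ phi(i)
        r += gamma * delta * phi(i)                         # gradient of J(i, r) w.r.t. r is phi(i)
        i = j
    return r
```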

Q-Learning

Q-learning is based on an alternative representation of the optimality equation:
$$J(i) = \min_{u \in U(i)} Q(i, u), \qquad Q(i, u) = \sum_{j \in S} p_{ij}(u) \big[ g(i, u, j) + \alpha J(j) \big].$$

What is the interpretation of $Q(i, u)$?

Q-Learning (Continued)

Key idea: Evaluate $Q(i, u)$ by using a stochastic approximation iteration
$$Q(i_k, u_k) \leftarrow (1 - \gamma) Q(i_k, u_k) + \gamma \Big[ g(i_k, u_k, s_k) + \alpha \min_{v \in U(s_k)} Q(s_k, v) \Big],$$
where the successor state $s_k$ is sampled according to the probabilities $\{p_{i_k j}(u_k) : j \in S\}$.

What is the benefit of Q-learning?
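A minimal tabular Q-learning sketch matching the iteration above. The slide does not specify how the state-action pairs (i_k, u_k) are chosen; epsilon-greedy exploration is used here as one common, illustrative choice, and step(i, u) is the same assumed simulator as before.

```python
import numpy as np

def q_learning(step, n_states, n_actions, alpha, gamma=0.1, eps=0.1,
               n_steps=200_000, i0=0, rng=np.random.default_rng(0)):
    Q = np.zeros((n_states, n_actions))
    i = i0
    for _ in range(n_steps):
        # Epsilon-greedy action selection (costs are minimized, so greedy = argmin).
        u = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[i].argmin())
        s, cost = step(i, u)                                # sample successor state and cost
        Q[i, u] = (1 - gamma) * Q[i, u] + gamma * (cost + alpha * Q[s].min())
        i = s
    return Q
```

Note that the update needs only sampled transitions, not the transition probabilities themselves, which is the usual argument for Q-learning's benefit.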

Linear Programming Approach

Let $\theta(s)$ be positive scalars such that $\sum_{s \in S} \theta(s) = 1$. The linear programming formulation is
$$\max_{J} \; \sum_{i \in S} \theta(i) J(i)$$
$$\text{s.t.} \quad J(i) - \sum_{j \in S} \alpha p_{ij}(u) J(j) \le \sum_{j \in S} p_{ij}(u) g(i, u, j), \quad i \in S,\ u \in U(i).$$

The dual linear program is
$$\min_{x} \; \sum_{i \in S} \sum_{u \in U(i)} \sum_{j \in S} p_{ij}(u) g(i, u, j)\, x(i, u)$$
$$\text{s.t.} \quad \sum_{u \in U(i)} x(i, u) - \sum_{j \in S} \sum_{u \in U(j)} \alpha p_{ji}(u)\, x(j, u) = \theta(i), \quad i \in S,$$
$$x(i, u) \ge 0, \quad i \in S,\ u \in U(i).$$
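A sketch of the primal LP solved with scipy, assuming the same small-model arrays p, g as earlier and a state-relevance vector theta; this is workable only for small instances, since there is one constraint per state-action pair.

```python
import numpy as np
from scipy.optimize import linprog

def lp_value_function(p, g, alpha, theta):
    n_actions, n_states = len(p), p[0].shape[0]
    A_ub, b_ub = [], []
    for u in range(n_actions):
        for i in range(n_states):
            row = -alpha * p[u][i, :].copy()
            row[i] += 1.0                                   # J(i) - alpha * sum_j p_ij(u) J(j)
            A_ub.append(row)
            b_ub.append(float((p[u][i, :] * g[u][i, :]).sum()))
    # linprog minimizes, so negate the objective to maximize sum_i theta(i) J(i); J is free.
    res = linprog(c=-np.asarray(theta), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)
    return res.x
```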

Linear Programming Approach (Continued)

The size of the LP can be reduced by using a parameterized approximation architecture (Schweitzer and Seidmann, 1985):
$$\tilde{J}(i, r) = \sum_{l=1}^{L} r_l \phi_l(i).$$

Two solution approaches:
Simulation-based approach (de Farias and Van Roy, 2003; also Powell, 2007)
Mathematical programming-based approach (Adelman, 2003; Adelman, 2007)
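A sketch of the resulting approximate LP obtained by substituting $J(i) = \sum_l r_l \phi_l(i)$ into the exact LP above: the decision variables shrink from $|S|$ to $L$, while the constraint set is unchanged (which is why constraint sampling or column/row generation is used in practice). Phi is an assumed $|S| \times L$ feature matrix; the rest follows the earlier conventions.

```python
import numpy as np
from scipy.optimize import linprog

def approximate_lp(p, g, alpha, theta, Phi):
    n_actions, n_states = len(p), p[0].shape[0]
    A_ub, b_ub = [], []
    for u in range(n_actions):
        for i in range(n_states):
            row = Phi[i, :] - alpha * p[u][i, :] @ Phi      # (e_i - alpha * p_i(u))' Phi, times r
            A_ub.append(row)
            b_ub.append(float((p[u][i, :] * g[u][i, :]).sum()))
    res = linprog(c=-(np.asarray(theta) @ Phi), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * Phi.shape[1])
    return res.x, Phi @ res.x                               # weights r and the approximation Phi r
```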

Approximate Policy Iteration

The policy evaluation step can be difficult when the problem is large.
Idea: Carry out policy iteration approximately.

Approximate Policy Iteration (Continued)

Approximate policy evaluation: Simulate a sequence of states $\{i_0, i_1, \dots\}$ by following policy $\pi_k$. Let
$$C_l = \sum_{t=l}^{\infty} \alpha^{t-l} g(i_t, \pi_k(i_t), i_{t+1}), \quad l = 0, 1, \dots$$
Let $r^k$ be the solution to the regression problem
$$\min_{r} \sum_{t=0}^{\infty} \big[ \tilde{J}(i_t, r) - C_t \big]^2 = \min_{r} \sum_{t=0}^{\infty} \Big[ \sum_{l=1}^{L} r_l \phi_l(i_t) - C_t \Big]^2.$$

Approximate policy improvement: For all $i \in S$, define the policy $\pi_{k+1}$ by
$$\pi_{k+1}(i) = \arg\min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \big[ g(i, u, j) + \alpha \tilde{J}(j, r^k) \big].$$
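A sketch of the approximate policy evaluation step: simulate a finite trajectory under $\pi_k$, form the discounted cost-to-go targets $C_l$ by a backward recursion, and fit $r$ by least squares. A finite horizon truncates the infinite sums; step and phi are the same assumed simulator and feature map as before.

```python
import numpy as np

def approx_policy_evaluation(step, pi, phi, L, alpha, n_steps=5_000, i0=0):
    states, costs = [i0], []
    for _ in range(n_steps):
        j, cost = step(states[-1], pi[states[-1]])
        states.append(j)
        costs.append(cost)
    # Backward recursion C_l = g_l + alpha * C_{l+1} approximates the discounted cost-to-go.
    C = np.zeros(n_steps)
    acc = 0.0
    for t in reversed(range(n_steps)):
        acc = costs[t] + alpha * acc
        C[t] = acc
    Phi = np.array([phi(i) for i in states[:-1]])           # n_steps x L feature matrix
    r, *_ = np.linalg.lstsq(Phi, C, rcond=None)
    return r
```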

Rollout Policy

Idea: Improve the performance of a given policy. Given a policy $\pi$, define $\pi'$ by
$$\pi'(i) = \arg\min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \big[ g(i, u, j) + \alpha J_\pi(j) \big], \quad i \in S.$$
Then it can be shown that $\pi'$ is at least as good as $\pi$. Simulation may be involved to estimate $J_\pi$.
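A sketch of computing the rollout action at a single state: $J_\pi(j)$ is estimated by simulating truncated trajectories under the base policy $\pi$, and the one-step lookahead uses the transition model for the first step. The number of rollouts and the truncation horizon are illustrative choices.

```python
import numpy as np

def rollout_cost(step, pi, j, alpha, n_rollouts=20, horizon=50):
    """Monte Carlo estimate of J_pi(j) from truncated simulated trajectories."""
    total = 0.0
    for _ in range(n_rollouts):
        i, disc = j, 1.0
        for _ in range(horizon):
            i_next, cost = step(i, pi[i])
            total += disc * cost
            disc *= alpha
            i = i_next
    return total / n_rollouts

def rollout_action(step, pi, i, actions, p, g, alpha):
    """One-step lookahead with simulated evaluation of the base policy pi."""
    best_u, best_val = None, np.inf
    for u in actions:
        val = sum(p[u][i, j] * (g[u][i, j] + alpha * rollout_cost(step, pi, j, alpha))
                  for j in range(p[u].shape[1]))
        if val < best_val:
            best_u, best_val = u, val
    return best_u
```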

State Aggregation

Idea: Partition the state space into a number of subsets and assume the value function is constant over each subset.
State aggregation can be combined with other approximation methods.
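State aggregation can be viewed as the linear architecture above with indicator feature functions: $\phi_l(i) = 1$ if state $i$ belongs to subset $l$ and $0$ otherwise, so $\tilde{J}(i, r) = r_{\text{group}(i)}$. A minimal sketch, assuming a hypothetical array group mapping each state to its subset index:

```python
import numpy as np

def aggregation_features(group, n_groups):
    """Indicator features for a given partition of the state space."""
    def phi(i):
        e = np.zeros(n_groups)
        e[group[i]] = 1.0
        return e
    return phi

# This phi can be passed to td0_linear or approx_policy_evaluation above,
# which then learn one value per subset instead of one value per state.
```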

Discussion

ADP aims to alleviate the computational effort required to solve large-scale dynamic programs.
The area is still in its infancy; a commonly accepted definition of ADP does not seem to exist.
Open problem: How should feature functions be specified? Research to date usually assumes they are fixed in advance. Recent advances: Klabjan and Adelman (2007).
Developing efficient solution approaches for practical applications could be valuable.
