Linear Programming Methods

Chapter 11 Linear Programming Methods

(This version: November 16, 2009.)

In this chapter we consider the linear programming approach to dynamic programming. First, Bellman's equation can be reformulated as a linear program whose solution is the optimal value function, offering a computational alternative to the value and policy iteration methods. Next, the dual of this linear program gives a new point of view on the optimal control problem, in which one solves a dynamic optimization problem by optimizing over vectors of state-action frequencies, which correspond to randomized Markov stationary policies. This formulation is particularly useful for treating multi-objective or constrained dynamic programming problems, as well as Markov games. LP methods are also useful for sensitivity analysis. Finally, they can be combined with approximation methods, as discussed in Chapter 20.

11.1 Linear Programming Formulations

We consider in this chapter finite Markov decision processes, i.e., problems with a finite number of states and controls. We first consider discounted cost problems as in Chapter 6. Denote the state space $X = \{1, \dots, n\}$. We know that starting from any cost vector $J \in \mathbb{R}^n$ we have $\lim_{k \to \infty} T^k J = J^*$, where $J^*$ is the optimal value function; see Theorem 6.5.1. Suppose $J \le TJ$. Then by the monotonicity of the dynamic programming operator $T$, we get
$$J \le TJ \le T^2 J \le \dots \le \lim_{k \to \infty} T^k J = J^* = TJ^*.$$
Hence $J \le TJ$ implies $J \le J^*$, and so $J^*$ is the largest vector that satisfies the constraint $J \le TJ$.
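To make the operator $T$ and the monotone convergence $T^k J \to J^*$ concrete, here is a minimal numerical sketch that is not part of the original notes: the two-state, two-action MDP below (arrays `c`, `P`, discount `alpha`) is made-up data used purely for illustration, and is reused in the later sketches of this chapter.

```python
# Toy discounted MDP (made-up data): 2 states, 2 controls per state.
import numpy as np

alpha = 0.9                       # discount factor
c = np.array([[2.0, 0.5],         # c[i, u]: stage cost in state i under control u
              [1.0, 3.0]])
P = np.array([[[0.8, 0.2],        # P[u, i, j]: transition probability i -> j under control u = 0
               [0.3, 0.7]],
              [[0.1, 0.9],        # control u = 1
               [0.6, 0.4]]])

def T(J):
    """Bellman operator: (TJ)(i) = min_u [ c(i,u) + alpha * sum_j p_ij(u) J(j) ]."""
    return np.min(c + alpha * np.einsum('uij,j->iu', P, J), axis=1)

J = np.zeros(2)                   # since c >= 0, J = 0 satisfies J <= TJ
for _ in range(500):              # the iterates increase monotonically toward J* = lim T^k J
    J = T(J)
print(J)                          # numerical approximation of the optimal value function J*
```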

Moreover, the nonlinear constraints
$$J(i) \le \min_{u \in U(i)} \Big\{ c(i,u) + \alpha \sum_{j=1}^n p_{ij}(u) J(j) \Big\}, \quad i = 1, \dots, n,$$
can be rewritten as a system of linear inequalities on the variables $J(i)$, $i = 1, \dots, n$:
$$J(i) \le c(i,u) + \alpha \sum_{j=1}^n p_{ij}(u) J(j), \quad i = 1, \dots, n, \; u \in U(i).$$
Let the discount factor $\alpha \in (0,1)$. From the preceding discussion we see immediately that the vector $J^*$ is the solution of any linear program of the form
$$\begin{aligned}
\text{maximize} \quad & (1-\alpha) \sum_{i=1}^n \nu_i J_i \qquad (11.1) \\
\text{subject to} \quad & J_i \le c(i,u) + \alpha \sum_{j=1}^n p_{ij}(u) J_j, \quad i = 1, \dots, n, \; u \in U(i),
\end{aligned}$$
where $\nu = [\nu_1, \dots, \nu_n]^T$ is a set of positive weights, $\nu_i > 0$, $i = 1, \dots, n$, and the decision variables are $J_i$, $i = 1, \dots, n$. The normalization factor $(1-\alpha)$ can be omitted, but it is convenient for our purposes. Note that if in addition we choose $\nu$ to belong to the interior of the probability simplex, $\nu \in \Delta_{n-1} = \{x \in \mathbb{R}^n : x_i > 0, \; \sum_{i=1}^n x_i = 1\}$, then $\nu_i$ can be interpreted as the probability of starting in state $i \in X$, and $\nu^T J^* = \sum_{i=1}^n \nu_i J^*(i)$ is the optimal expected cost for the optimal control problem when the initial state is distributed according to $\nu$. If we wanted to compute $J^*(i)$ only for a subset of states $\tilde{X} \subset X$, we could take the coefficients $\nu_i$ equal to $0$ for $i \notin \tilde{X}$, but it is not clear why one would want to do this, since the computational complexity of the problem is not reduced.

The linear program (11.1) has $n$ variables and up to $\sum_{i=1}^n |U(i)|$ constraints, which makes the method impractical for large state and control spaces, even when large-scale linear programming techniques are used. In this case, the linear program (11.1) can be combined with approximation architectures; see Chapter 20.
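As an illustration, the primal LP (11.1) for the toy MDP above can be handed to an off-the-shelf LP solver. The following is only a sketch, assuming the arrays `c`, `P`, and `alpha` from the previous snippet are in scope, and using `scipy.optimize.linprog`; since `linprog` minimizes, the objective is negated.

```python
# Primal LP (11.1) for the toy MDP: maximize (1 - alpha) * nu^T J
# subject to J_i <= c(i,u) + alpha * sum_j p_ij(u) J_j for all i, u.
import numpy as np
from scipy.optimize import linprog

n, m = 2, 2                                   # number of states and of controls per state
nu = np.array([0.5, 0.5])                     # positive initial distribution (assumed)

A_ub, b_ub = [], []
for i in range(n):
    for u in range(m):
        row = np.zeros(n)
        row[i] = 1.0
        row -= alpha * P[u, i, :]             # J_i - alpha * sum_j p_ij(u) J_j <= c(i, u)
        A_ub.append(row)
        b_ub.append(c[i, u])

obj = -(1 - alpha) * nu                       # linprog minimizes, so negate the objective
res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n)      # J is a free variable (no sign constraint)
print(res.x)                                  # should match J* from the value-iteration sketch
```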

Dual Linear Program

The dual program of (11.1) is perhaps more interesting for applications. It can be written
$$\begin{aligned}
\min \quad & \sum_{x \in X} \sum_{u \in U(x)} c(x,u)\, \rho(x,u) \qquad (11.2) \\
\text{s.t.} \quad & \sum_{u \in U(x)} \rho(x,u) - \alpha \sum_{x' \in X} \sum_{u \in U(x')} p_{x'x}(u)\, \rho(x',u) = (1-\alpha)\, \nu_x, \quad x \in X, \qquad (11.3) \\
& \rho(x,u) \ge 0, \quad x \in X, \; u \in U(x).
\end{aligned}$$

It is sometimes stated that this dual program is computationally preferable because it has only $|X|$ constraints (not counting the nonnegativity constraints) [Put94, p. 224], but since it has more variables than (11.1), this argument is not entirely convincing.

Remark. For the initial distribution, in the following we use the notations $\nu_x$ and $\nu(x)$ interchangeably. Also, an equivalent way of writing (11.3) is
$$\sum_{x' \in X} \sum_{u \in U(x')} \big( \delta_x(x') - \alpha\, p_{x'x}(u) \big)\, \rho(x',u) = (1-\alpha)\, \nu(x), \quad x \in X.$$
If we sum these constraints over $x$, we see that
$$\sum_{x \in X} \sum_{u \in U(x)} \rho(x,u) = 1, \qquad (11.4)$$
and so the set of feasible solutions is a polytope (i.e., a bounded polyhedron).

11.2 State-Action Frequencies

The quantities $\{\rho(x,u)\}_{x,u}$, if they satisfy the constraints (11.3), can be interpreted as state-action frequencies: $\rho(x,u)$ is the (discounted) proportion of time for which the system is in state $x$ and action $u$ is taken, under a certain stationary policy. In fact, randomized Markov stationary policies are in correspondence with the feasible solutions of (11.2). (A Markov policy is just a policy that depends only on the current state; we have not encountered any other kind of policy in this course, nor the need for them. Here we are adding the possibility of randomizing the controls.) The quantity $\rho(x) := \sum_{u \in U(x)} \rho(x,u)$ appearing in the LP (11.2) then has the interpretation of a state frequency. Note that by (11.4), $\{\rho(x)\}_x$ forms a probability distribution over $X$, i.e., $\sum_{x \in X} \rho(x) = 1$. Similarly, $\{\rho(x,u)\}_{x,u}$ is a probability measure on the space of state-action pairs, again by (11.4).

Consider a Markov policy $\pi$, not necessarily stationary, and possibly randomized. Together with the initial distribution $\nu$, this defines a probability measure $P^\pi_\nu$ on the sample paths of the state-action pairs. We define the state-action frequencies corresponding to $\nu$ and $\pi$ as
$$\rho^\pi_\nu(x,u) := (1-\alpha) \sum_{x' \in X} \nu(x') \sum_{k=0}^\infty \alpha^k P^\pi(X_k = x, U_k = u \mid X_0 = x') = (1-\alpha) \sum_{k=0}^\infty \alpha^k P^\pi_\nu(X_k = x, U_k = u). \qquad (11.5)$$

Clearly $\rho^\pi_\nu$ forms a probability measure on the space of state-action pairs. It turns out that there is always a stationary policy $\mu$ which achieves the same vector of state-action frequencies. To see this, consider the stationary randomized policy $\mu$ defined by
$$\mu(u \mid x) = \frac{\rho^\pi_\nu(x,u)}{\rho^\pi_\nu(x)},$$
whenever the denominator is non-zero. When it is zero, choose $\mu(\cdot \mid x)$ arbitrarily. Here $\mu(u \mid x)$ represents the probability of choosing control $u \in U(x)$ when the state is $x$. We have
$$\begin{aligned}
\rho^\pi_\nu(x) &= (1-\alpha)\nu(x) + (1-\alpha)\alpha \sum_{k=0}^\infty \alpha^k P^\pi_\nu(X_{k+1} = x) \\
&= (1-\alpha)\Big\{ \nu(x) + \alpha \sum_{k=0}^\infty \alpha^k \sum_{x' \in X} \sum_{u \in U(x')} P(X_{k+1} = x \mid X_k = x', U_k = u)\, P^\pi_\nu(X_k = x', U_k = u) \Big\} \\
&= (1-\alpha)\nu(x) + \alpha \sum_{x' \in X} \sum_{u \in U(x')} p_{x'x}(u)\, \rho^\pi_\nu(x',u) \qquad (11.6) \\
&= (1-\alpha)\nu(x) + \alpha \sum_{x' \in X} \rho^\pi_\nu(x') \sum_{u \in U(x')} p_{x'x}(u)\, \mu(u \mid x') \\
&= (1-\alpha)\nu(x) + \alpha \sum_{x' \in X} \rho^\pi_\nu(x')\, P_\mu(x',x).
\end{aligned}$$
In matrix notation, this can be written
$$(\rho^\pi_\nu)^T = (1-\alpha)\,\nu^T (I - \alpha P_\mu)^{-1} = (1-\alpha)\,\nu^T \Big( \sum_{k=0}^\infty \alpha^k P_\mu^k \Big).$$
In other words,
$$\rho^\pi_\nu(x) = (1-\alpha) \sum_{k=0}^\infty \alpha^k P^\mu_\nu(X_k = x) = \rho^\mu_\nu(x). \qquad (11.7)$$
Moreover, by definition of $\mu(u \mid x)$, we have
$$\rho^\mu_\nu(x,u) = (1-\alpha) \sum_{k=0}^\infty \alpha^k P^\mu_\nu(X_k = x)\, \mu(u \mid x).$$
Using the relation (11.7), we get
$$\rho^\mu_\nu(x,u) = (1-\alpha) \sum_{k=0}^\infty \alpha^k P^\mu_\nu(X_k = x)\, \frac{\rho^\pi_\nu(x,u)}{\rho^\pi_\nu(x)} = \rho^\mu_\nu(x)\,\frac{\rho^\pi_\nu(x,u)}{\rho^\pi_\nu(x)} = \rho^\pi_\nu(x,u).$$
Note that the constraints (11.6) satisfied by the state-action frequencies (of any Markov policy) are exactly the constraints (11.3) of the linear program.
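The matrix identity $(\rho^\pi_\nu)^T = (1-\alpha)\nu^T(I - \alpha P_\mu)^{-1}$ and the constraints (11.3)-(11.4) are easy to check numerically. The following sketch, again on the made-up toy MDP (with `c`, `P`, `alpha` from the earlier snippet assumed in scope, and an arbitrarily chosen randomized stationary policy `mu`), computes the state-action frequencies of `mu` and verifies that they are feasible for the dual LP, anticipating part 1 of the theorem below.

```python
# State-action frequencies of a randomized stationary policy mu, computed via
# rho^T = (1 - alpha) * nu^T (I - alpha * P_mu)^{-1}, then checked against (11.3)-(11.4).
import numpy as np

mu = np.array([[0.3, 0.7],                    # mu[x, u] = probability of control u in state x (made up)
               [0.6, 0.4]])
nu = np.array([0.5, 0.5])                     # positive initial distribution

P_mu = np.einsum('xu,uxy->xy', mu, P)         # transition matrix under mu
rho_x = (1 - alpha) * nu @ np.linalg.inv(np.eye(2) - alpha * P_mu)   # state frequencies rho(x)
rho_xu = rho_x[:, None] * mu                  # rho(x, u) = rho(x) * mu(u | x)

# Constraint (11.3): rho(x) - alpha * sum_{x', u} p_{x'x}(u) rho(x', u) = (1 - alpha) * nu(x)
lhs = rho_x - alpha * np.einsum('uxy,xu->y', P, rho_xu)
print(np.allclose(lhs, (1 - alpha) * nu))     # True: the frequencies are feasible for (11.2)
print(rho_xu.sum())                           # 1.0, as in (11.4)
```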

We have the following theorem.

Theorem 11.2.1.

1. For each Markov stationary randomized policy $\mu$ and positive initial distribution $\nu$, $\{\rho^\mu_\nu(x,u)\}_{x,u}$ is a feasible solution to the linear program (11.2).

2. Suppose $\{\rho(x,u)\}_{x,u}$ is a feasible solution to the problem (11.2). Then, for each $x \in X$, we have $\rho(x) := \sum_{u \in U(x)} \rho(x,u) > 0$. Define the Markov randomized stationary policy $\mu$ by
$$\mu(u \mid x) = \frac{\rho(x,u)}{\sum_{u' \in U(x)} \rho(x,u')} = \frac{\rho(x,u)}{\rho(x)}. \qquad (11.8)$$
Then for this policy $\mu$ we can define the state-action frequencies $\rho^\mu_\nu$ by (11.5), and we have $\rho^\mu_\nu(x,u) = \rho(x,u)$ for all $x \in X$ and $u \in U(x)$.

Proof. The first statement was proved above; see (11.6). Consider now statement 2. The positivity of $\rho(x)$ follows from that of $\nu(x)$ and the nonnegativity of $\rho(x,u)$ in (11.3). Then note that $\rho(x,u)$ satisfies the constraint (11.6), and so we obtain immediately $\rho(x,u) = \rho^\mu_\nu(x,u)$ for all $x, u$ by following the steps above.

Recall now that our objective is to optimize a function of the form
$$J_\pi(\nu) := (1-\alpha) \sum_{x \in X} \nu(x) J_\pi(x) = E^\pi_\nu \Big[ \sum_{k=0}^\infty \alpha^k (1-\alpha)\, c(X_k, U_k) \Big], \qquad (11.9)$$
where $E^\pi_\nu$ is the expectation operator over paths of the Markov chain obtained once the policy $\pi$ and the initial distribution $\nu$ are fixed. Here we generalize our earlier notation, with $J_\pi(\nu)$ being the cost of the policy $\pi$ when the initial state is distributed according to $\nu$. We can rewrite this objective as
$$J_\pi(\nu) = E^\pi_\nu \Big[ \sum_{x \in X} \sum_{u \in U(x)} \sum_{k=0}^\infty \alpha^k (1-\alpha)\, c(x,u)\, \mathbf{1}\{X_k = x, U_k = u\} \Big] = \sum_{x \in X} \sum_{u \in U(x)} c(x,u)\, \rho^\pi_\nu(x,u),$$
where the interchange of summation and expectation is allowed by the dominated convergence theorem. This last expression is exactly the objective of the linear program (11.2). By Theorem 11.2.1, assuming $\nu$ is positive for simplicity, there is a bijection between Markov stationary randomized policies and feasible solutions of the LP (11.2): it maps a stationary policy to its state-action frequencies, and in the reverse direction defines a stationary policy by (11.8). (Positivity of $\nu(x)$ guarantees $\rho(x) > 0$; if $\rho(x) = 0$, we have a choice in the definition of the stationary policy for the states that are almost surely never visited, see the beginning of this section.)
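To see this bijection at work, the dual LP (11.2) for the toy MDP can be solved directly and a stationary policy recovered via (11.8). This is again a sketch under the same assumptions as before (`c`, `P`, `alpha` in scope); since LP solvers normally return a vertex (basic feasible) solution, the recovered policy should come out deterministic, anticipating Proposition 11.2.2 below.

```python
# Dual LP (11.2): minimize sum_{x,u} c(x,u) rho(x,u) over rho >= 0 satisfying (11.3),
# then recover the policy mu(u | x) = rho(x, u) / rho(x) as in (11.8).
import numpy as np
from scipy.optimize import linprog

n, m = 2, 2
nu = np.array([0.5, 0.5])

# Variables rho(x, u), flattened with index x * m + u.
A_eq = np.zeros((n, n * m))
for x in range(n):                            # one equality constraint (11.3) per state x
    for xp in range(n):
        for u in range(m):
            A_eq[x, xp * m + u] = (1.0 if x == xp else 0.0) - alpha * P[u, xp, x]
b_eq = (1 - alpha) * nu

res = linprog(c.flatten(), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (n * m))
rho = res.x.reshape(n, m)
mu_star = rho / rho.sum(axis=1, keepdims=True)    # policy recovered via (11.8)
print(mu_star)                                    # typically one entry per row equals 1: deterministic
print(res.fun)                                    # equals (1 - alpha) * nu^T J* by strong duality
```

The printed optimal value can be compared with $(1-\alpha)\,\nu^T J^*$ from the primal sketch, illustrating the strong-duality relation stated below.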

Recall that a basic feasible solution of an LP is a feasible solution that cannot be expressed as a nontrivial convex combination of other feasible solutions of the LP; it corresponds to an extreme point of the feasible region. A key property of basic feasible solutions is that when an LP with $m$ rows has a bounded optimal solution, any basic feasible solution has at most $m$ positive components. The next proposition establishes a one-to-one correspondence between stationary deterministic policies and extreme points (basic feasible solutions) of the LP (11.2). Since we know that the optimal solution of (11.2) is attained at one of these extreme points, this gives us a proof that to find optimal policies, it is sufficient to consider deterministic policies. Moreover, the discussion at the beginning of this section tells us that non-stationary policies do not provide better solutions. Finally, the LP geometry shows that in certain cases several deterministic policies can be optimal, in which case randomizing between these policies gives a convex set of optimal randomized policies.

Proposition 11.2.2. Let $\rho$ be a basic feasible solution to the LP (11.2). Then $\mu$ defined by (11.8) is deterministic. Conversely, the state-action frequency vector of a Markov deterministic policy is a basic feasible solution to the LP (11.2).

Proof. We know that for all $x$, $\rho(x) = \sum_{u \in U(x)} \rho(x,u) > 0$. Moreover, the LP (11.2) has $|X|$ rows. Hence in a basic feasible solution we must have $\rho(x,u) > 0$ for exactly one $u$ at each state $x$, and we conclude that $\mu$ defined by (11.8) is deterministic. For the converse, assume $\rho^\mu_\nu$ is feasible but not basic. Then there are two distinct feasible solutions $\rho_1$ and $\rho_2$ and $0 < \theta < 1$ such that $\rho^\mu_\nu = (1-\theta)\rho_1 + \theta\rho_2$. Because $\rho_1$ and $\rho_2$ are distinct, there are some $x \in X$ and $u_1 \neq u_2$ such that $\rho_1(x,u_1) > 0$ and $\rho_2(x,u_2) > 0$ (otherwise $\rho_1$ and $\rho_2$ would be supported on the same deterministic policy, and the constraints (11.3) would then determine them uniquely, contradicting their distinctness). But then $\rho^\mu_\nu(x,u_1) > 0$ and $\rho^\mu_\nu(x,u_2) > 0$, so $\mu$ must be randomized.

We summarize the results in the following theorem.

Theorem 11.2.3. There exists a bounded optimal basic feasible solution $\rho$ to the LP (11.2).
- If $\rho$ is an optimal solution to (11.2), then $\mu$ defined by (11.8) is an optimal policy.
- If $\rho$ is an optimal basic solution to (11.2), then $\mu$ defined by (11.8) is an optimal deterministic policy.
- If $\mu$ is an optimal policy, then $\rho^\mu_\nu$ is an optimal solution for (11.2).
- If $\mu$ is an optimal deterministic policy, then $\rho^\mu_\nu$ is an optimal basic solution for (11.2).
- For any positive vector $\nu$, the LP (11.2) has the same optimal basis (the columns of the constraint matrix that determine the basic feasible solution). Hence the optimal policy does not depend on $\nu$.

Note that if $\rho^*$ and $J^*$ are optimal solutions of the LPs (11.2) and (11.1) respectively, then strong duality holds:
$$J^*(\nu) = (1-\alpha) \sum_{x \in X} \nu(x) J^*(x) = \sum_{x \in X} \sum_{u \in U(x)} \rho^*(x,u)\, c(x,u).$$

Remark. Solving the LP (11.2) by the simplex method with block pivoting is equivalent to policy iteration.

11.3 Constrained Dynamic Programming

In view of (11.9), it is easy to add constraints to the model. Let $(x,u) \mapsto d(x,u)$ be another cost function, and suppose we wish to solve the initial optimal control problem for $c(x,u)$ with the additional constraint that the expected total discounted cost for $d$ does not exceed some constant $D$:
$$E^\pi_\nu \Big[ \sum_{k=0}^\infty \alpha^k d(X_k, U_k) \Big] \le D.$$
In terms of state-action frequencies, this constraint can be rewritten as
$$\sum_{x \in X} \sum_{u \in U(x)} \rho^\mu_\nu(x,u)\, d(x,u) \le (1-\alpha) D,$$
and can be directly added to the LP (11.2); a small numerical sketch is given at the end of this chapter. This cuts the polytope corresponding to the feasible region of the unconstrained DP problem, and in general, optimal policies must then include some randomization, with more randomization necessary as more constraints are added; see the proof of Proposition 11.2.2. To deal with such constraints in the more standard dynamic programming formulation, we would have to introduce Lagrange multipliers.

11.4 Markov Games

For a future version.
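Returning to the constrained problem of Section 11.3, here is the promised sketch. It reuses the dual-LP data from the previous snippet (`c`, `P`, `alpha`, `A_eq`, `b_eq` assumed in scope) and adds a single inequality row for a made-up auxiliary cost `d` with an illustrative budget `D`; both are assumptions, not data from the notes. When the budget binds, the optimal basic solution has more than one positive action at some state, so the recovered policy randomizes.

```python
# Constrained discounted problem: minimize the expected c-cost subject to
# sum_{x,u} rho(x,u) d(x,u) <= (1 - alpha) * D, added on top of the dual LP (11.2).
import numpy as np
from scipy.optimize import linprog

d = np.array([[0.0, 4.0],                     # made-up auxiliary cost d(x, u)
              [2.0, 0.0]])
D = 1.5                                       # illustrative budget on E[sum_k alpha^k d(X_k, U_k)]

res = linprog(c.flatten(),
              A_ub=d.flatten()[None, :], b_ub=[(1 - alpha) * D],
              A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * A_eq.shape[1])
rho = res.x.reshape(d.shape)
mu_c = rho / rho.sum(axis=1, keepdims=True)   # policy recovered via (11.8)
print(mu_c)                                   # when the budget binds, some row is a genuine mixture
```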