Robust Modified Policy Iteration


David L. Kaufman
Department of Industrial and Operations Engineering, University of Michigan
1205 Beal Avenue, Ann Arbor, MI 48109, USA
8davidlk8umich.edu (remove 8s)

Andrew J. Schaefer
Department of Industrial Engineering, University of Pittsburgh
1048 Benedum Hall, Pittsburgh, PA 15261, USA
schaefer@ie.pitt.edu

September 30, 2011

Robust dynamic programming (robust DP) mitigates the effects of ambiguity in transition probabilities on the solutions of Markov decision problems. We consider the computation of robust DP solutions for discrete-stage, infinite-horizon, discounted problems with finite state and action spaces. We present robust modified policy iteration (RMPI) and demonstrate its convergence. RMPI encompasses both of the previously known algorithms, robust value iteration and robust policy iteration. In addition to proposing exact RMPI, in which the inner problem is solved precisely, we propose inexact RMPI, in which the inner problem is solved to within a specified tolerance. We also introduce new stopping criteria based on the span seminorm. Finally, we demonstrate through some numerical studies that RMPI can significantly reduce computation time.

Key words: control, control processes, Markov processes, optimization, programming: dynamic

1. Introduction

Markov decision models typically assume that state transition probabilities are known with certainty. In practice, however, these probabilities are estimated from data. It may be the case that data are scarce for some state-action pairs; that is, the true transition probability measures may not be known. Unfortunately, dynamic programming (DP) solutions may be sensitive to errors in the estimation of these probabilities. Robust dynamic programming mitigates the effects of ambiguity in transition probabilities on resulting decisions. In this paper we consider the computation of robust DP solutions in the context of discrete-stage,

infinite-horizon, discounted Markov decision problems with finite state and action spaces. We present a new algorithm robust modified policy iteration (RMPI). Rather than simply using point estimates for state transition probabilities, robust DP makes use of sets of possible transition measures. These so-called uncertainty sets, denoted P, can be constructed to correspond to any given level of confidence. The more ambiguity, the larger the sets P. Robust DP can be thought of as a game between the decision-maker and an adversary called nature. While the decision-maker seeks to maximize (minimize) total reward (cost), nature seeks to minimize (maximize) the decision-maker s reward (cost) by selecting worst-case transition measures in P. Hence, robust DP solutions are optimal under a max-min objective function. Standard DP models (without ambiguity) in the infinite-horizon, finite states and actions setting are well understood (c.f., Puterman 1994). Solution methodologies include value iteration, linear programming, policy iteration, and modified policy iteration. We are interested in their robust counterparts. The study of Markov decision processes (MDPs) with uncertain transition probabilities goes back at least to the work of Silver (1963), which was expanded upon by Martin (1967). These works use a Bayesian framework for updating information about unknown transition probabilities. At the start of the decision process, some prior distribution is assumed. A chosen conjugate prior may actually assume very little initial information about the transition probabilities. The decision-maker s problem is to maximize reward while at the same time update relevant information. In the decision process then, there is a tradeoff between exploitation of taking an action that presumably has the higher payoff and exploration to get more information about the transition probabilities. The framework we are interested in is different. We are not concerned with exploration in the decision process, and the decision process does not incorporate Bayesian updating. The max-min setting that we consider only uses statistical information available a priori. MDPs with ambiguous transition probabilities in a max-min framework were previously studied by Satia and Lave (1973), White and Eldeib (1994), and Bagnell et al. (2001). Bagnell et al. (2001) present a robust value iteration algorithm for solving robust DP problems when uncertainty sets are convex and satisfy some conditions. Robust value iteration is similar to standard value iteration, but it includes both a maximization and a minimization in each step. The minimization, which is due to nature s objective, is referred to as the inner problem. 2

More recently, the theory of robust DP was advanced in two concurrent papers by Iyengar (2005) and Nilim and El Ghaoui (2005). In addition to establishing theoretical grounds for robust value iteration, these papers propose methods for efficient computation when uncertainty sets are based on relative entropy bounds or on likelihood regions. These advances open the door for practical applications of robust DP, which include path planning for robots moving around dynamic objects (Bagnell et al. 2001) and for aircraft moving around storms (Nilim and El Ghaoui 2005). Our work is an extension of the work of Iyengar (2005) and Nilim and El Ghaoui (2005) since robust value iteration is a special case of our RMPI algorithm. Other related works include the papers of Harmanec (2002), Mannor et al. (2007), Li and Si (2008), and Delage and Mannor (2010). Both Iyengar (2005) and Nilim and El Ghaoui (2005) provide conditions for and proof of convergence of robust value iteration. As for linear programming, there is no efficient robust counterpart for the robust DP framework; the analogous problem is a non-convex optimization problem (Iyengar 2005, p. 269). A robust policy iteration algorithm is presented by Satia and Lave (1973). As Iyengar (2005, p. 269) points out, however, it is not clear, and [Satia and Lave (1973)] do not show, that this iterative procedure converges. The issue, it seems, is not whether their procedure converges, but rather it is whether it converges to the right value function. Proposition 4 of Satia and Lave (1973) claims that their policy evaluation procedure results in an ɛ-optimal policy for nature, but it has not been proven. Iyengar proposes an alternative robust policy iteration algorithm. For policy evaluation, he presents a robust optimization problem (Iyengar 2005, Lemma 3.2). According to Iyengar (2005, p. 268) though, for most practical applications, the policy evaluation step is computationally expensive and is usually replaced by an m-step look-ahead value iteration. He does not provide such an algorithm nor prove convergence. Prior to RMPI, the most viable option for solving robust DPs has been robust value iteration. Therefore, our main interest is in comparing RMPI to robust value iteration. White and Eldeib (1994) present the only other known modified policy iteration algorithm for a problem with ambiguity. They combine modified policy iteration with a reward revision algorithm, and present an algorithm in Proposition 7 of White and Eldeib (1994). However, they are not able to provide conditions that guarantee convergence of this algorithm. (Note that Assumption (i) of White and Eldeib (1994, Proposition 7) needs to hold at every step, but there is no guarantee that it will.) According to White and Eldeib (1994), Unfortunately, the proof of Lemma 5a in [White et al. (1985)] does not generalize when the 3

[P] are sets under the max-min strategy. We provide conditions that guarantee convergence of RMPI. Our results do not depend on a specific form for P. Depending on how the uncertainty sets are constructed, it might not be efficient to solve the inner problem precisely. Similar to analysis in Nilim and El Ghaoui (2005), we distinguish between two cases: exact RMPI, in which the inner problem is solved precisely, and inexact RMPI, in which the inner problem is solved only to within a specified tolerance.

Our main contributions are as follows:

- We are the first to present a robust modified policy iteration algorithm together with conditions for and a proof of convergence.

- For inexact RMPI, we provide conditions on the tolerance that guarantee convergence to a decision rule that is ε-optimal: its total reward is within a pre-specified value ε of the optimal reward. Moreover, convergence is guaranteed within a finite number of iterations.

- We provide new stopping criteria based on the span seminorm that guarantee that, after a finite number of iterations, the decision rule is ε-optimal. In addition to applying to RMPI, these conditions apply to robust value iteration and are in many cases an improvement over previously known stopping criteria based on the supremum norm.

- We consider two numerical studies that demonstrate that, as compared to robust value iteration, the RMPI algorithm can substantially reduce the time it takes to compute solutions.

The rest of this paper is outlined as follows: In Section 2 we discuss preliminaries and introduce some notation. In Section 3 we present the RMPI algorithm along with proof of its convergence. We analyze the exact case first, in Section 3.1. We then analyze the inexact case, in Section 3.2. We present new stopping criteria based on the span seminorm in Section 4. Numerical results are presented in Section 5. We conclude in Section 6.

2. Preliminaries

Denote the state variable by s and the finite state space by S. Decision epochs occur in discrete periods. For each period, for state $s \in S$, the decision-maker chooses an action a from the finite set of feasible actions, $A_s$, resulting in an immediate reward r(s, a) plus an

expected reward-to-go. The function r(s, a) is assumed to be independent of time. Since S and $A_s$ are finite, r(s, a) is bounded.

The probability that the state transitions from s in one period to $s'$ in the next period is determined by a probability measure chosen by nature. Given s and the decision-maker's choice a for that state, i.e., a state-action pair (s, a), nature is allowed to choose any transition probability in the set $\mathcal{P}(s,a) \subseteq \mathcal{M}(S)$, where $\mathcal{M}(S)$ denotes the set of all probability measures on S. We will restrict attention to finite S and $A_s$. Hence, an element $p \in \mathcal{P}(s,a)$ is a probability mass function, for which $p(s')$ is the probability of transitioning from s to $s'$ under action a.

Denote by d a deterministic function from S to $A := \cup_{s \in S} A_s$ (called a decision rule) that prescribes that the decision-maker choose action d(s) in state s, and let D be the set of feasible decision rules. A deterministic Markovian policy is a sequence of decision rules $\{d_0, d_1, \ldots\}$, where $d_t$ is the decision rule for period t. Under some rectangularity assumptions (described below) the decision-maker's optimal deterministic Markovian policy is also optimal among the class of all (allowably) history-dependent and randomized control policies (Iyengar 2005, Theorem 3.1). A control policy that employs the same decision rule d in every period, $\{d, d, \ldots\}$, denoted (d), is a stationary deterministic policy. We assume that the decision-maker follows such a policy.

Nature also has a policy for choosing $p \in \mathcal{P}(s,a)$. It is assumed that nature's choices for a given state-action pair are independent of actions chosen in other states and independent of the history of previously visited states and actions. These independence assumptions are the so-called rectangularity assumptions (Iyengar (2005, Assumption 2.1), Nilim and El Ghaoui (2005, Section 2.2)). While we follow these assumptions, they are not the only modeling option available. Li and Si (2008) present a framework in which nature's choices must adhere to a correlated structure: nature's choice for one transition matrix row can depend on the choice for another row. Again, we assume independence.

A stationary policy of nature is a policy that chooses the same probability measure every time the same state-action pair is visited. Denote such a policy by π, and denote the set of all possible stationary policies of nature by Π. The robust objective function is the expected total discounted reward over an infinite horizon conditioned on the initial state. Infinite-horizon robust DP is defined by the following optimization problems:

$$v^{(d)}(s) = \inf_{\pi \in \Pi} E^{\pi}_{s}\left[ \sum_{t=0}^{\infty} \lambda^{t} r(s_t, d(s_t)) \right], \qquad (1)$$

$$v^{*}(s) = \max_{d \in D} v^{(d)}(s), \qquad (2)$$

where $s_t$ is the state realized in period t, $E^{\pi}_{s}$ denotes expectation under π and initial state $s_0 = s$, and λ is a discount factor, 0 < λ < 1. This is a sequential game in which the decision-maker first selects d, in (2), and then nature responds by selecting π, in (1). Note that the decision-maker's problem (2) is solved for each initial state, and nature's problem (1) is solved for each initial state and for each feasible decision rule.

One can, without loss of generality, relax the stationarity assumptions by allowing both the decision-maker and nature to vary their decision rules over time. Such a relaxation will nonetheless result in optimal policies that are stationary (Nilim and El Ghaoui 2005, Theorem 4). Moreover, the optimal actions of the robust DP problem are characterized by a set of Bellman-type optimality equations (Iyengar 2005, Theorem 3.2):

$$v^{*}(s) = \max_{a \in A_s}\left\{ r(s, a) + \lambda \inf_{p \in \mathcal{P}(s,a)} \sum_{s' \in S} p(s') v^{*}(s') \right\}, \quad s \in S. \qquad (3)$$

We are interested in computing the vector $v^{*}$, the value function, so that we can identify optimal actions as those that maximize the right-hand side of (3). To do so, we need to solve the inner problem:

$$\sigma_{\mathcal{P}(s,a)}(v) := \inf_{p \in \mathcal{P}(s,a)} \sum_{s' \in S} p(s') v(s'). \qquad (4)$$

If the $\mathcal{P}(s,a)$ are singletons for all (s, a), then the inner problem reduces to a simple expectation operator and (3) reduces to the standard DP optimality equations. Solving the robust inner problem might require more computational effort. For example, if one constructs the sets $\mathcal{P}(s,a)$ using relative entropy bounds, then, as shown by Iyengar (2005, Lemma 4.1) and Nilim and El Ghaoui (2005, Section 6), (4) can be reduced to a one-dimensional convex optimization problem.

We distinguish between solving the inner problem exactly or inexactly. For example, assuming relative entropy bounds, one can solve a one-dimensional convex optimization problem with an algorithm such as bisection. In the inexact case, one can stop with a value $\tilde{\sigma}$ such that

$$\tilde{\sigma} \le \sigma_{\mathcal{P}(s,a)}(v) \le \tilde{\sigma} + \delta, \qquad (5)$$

for some specified δ ≥ 0. Note that while the inner problem is a minimization, it is typically solved via its dual maximization problem, and this results in a lower bound $\tilde{\sigma}$ for $\sigma_{\mathcal{P}(s,a)}(v)$.
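To make these operators concrete, the following minimal sketch (in Python, not from the paper) implements the inner problem (4) and one application of the robust Bellman update for the simplest case in which each uncertainty set P(s, a) is given explicitly as a finite list of candidate transition distributions, so the infimum is an exact minimum. The finite-set assumption and all function and variable names are illustrative; for richer sets such as the relative-entropy balls discussed above, `inner_problem` would instead call a one-dimensional convex solver such as bisection.

```python
import numpy as np

def inner_problem(P_sa, v):
    # sigma_{P(s,a)}(v): worst-case expected value of v when the uncertainty
    # set is a finite collection of pmfs, stored as the rows of the array P_sa.
    return float(np.min(P_sa @ v))

def robust_bellman(v, r, P, A, lam):
    # One application of the robust operator Upsilon:
    #   (Upsilon v)(s) = max_{a in A[s]} { r[s][a] + lam * sigma_{P(s,a)}(v) }.
    # Returns the updated value function and a greedy (v-improving) decision rule.
    n_states = len(A)
    new_v = np.empty(n_states)
    rule = np.empty(n_states, dtype=int)
    for s in range(n_states):
        q = [r[s][a] + lam * inner_problem(P[s][a], v) for a in A[s]]
        best = int(np.argmax(q))
        new_v[s], rule[s] = q[best], A[s][best]
    return new_v, rule
```

Robust value iteration, described next, simply applies this update repeatedly; RMPI will reuse the inner-problem oracle only for the actions prescribed by the current decision rule.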

Robust value iteration successively approximates the value function and updates the decision rule at iteration n + 1, n = 0, 1, 2, ..., according to:

$$v^{n+1}(s) = \max_{a \in A_s}\left\{ r(s, a) + \lambda \sigma_{\mathcal{P}(s,a)}(v^{n}) \right\}, \quad s \in S.$$

It is important to note that, for each iteration, a separate inner problem is solved for each feasible state-action pair (s, a). If the inner problems are computationally challenging, then the total computational effort required for robust DP might be significantly more than that for standard DP. To ameliorate this challenge, some alternative uncertainty sets have been suggested based on inner and outer approximations of relative entropy bounds that require less computational effort (Iyengar 2005, Section 4).

At the heart of the matter is the fact that RMPI reduces computational effort by avoiding the need to solve inner problems for all feasible actions in every step. Rather, after the actions are updated in some iteration n, the decision rule is fixed (with one action per state) and the value function is successively approximated for some number of steps, denoted $m_n \ge 1$, prior to updating the decision rule again. When $m_n = 1$ for all n, RMPI is equivalent to robust value iteration. At the other extreme, taking the limits $m_n \to \infty$ results in robust policy iteration. That is, evaluating each fixed policy by successive approximations of the value function until convergence is reached, prior to choosing the next policy, is a special case of the robust policy iteration algorithm of Iyengar (2005, Figure 2). It might take many iterations for a fixed policy's value function to converge, and RMPI can avoid unnecessary computations by updating the decision rule ahead of time. While dualization of the constraints in the policy evaluation problem presented in Lemma 3.2 of Iyengar (2005) might lead to solution methods other than successive approximation (see the comment on p. 268 of Iyengar (2005)), no other methods have been proposed.

3. Robust Modified Policy Iteration

In this section we present RMPI and prove its convergence. We distinguish between two cases: exact RMPI and inexact RMPI. Inexact RMPI solves the inner problem to within some specified tolerance δ > 0. Exact RMPI assumes δ = 0. We begin by proving convergence of exact RMPI, for which the main result is Theorem 5. We analyze inexact RMPI in Section 3.2.
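Before the formal statement and convergence analysis, the following minimal sketch illustrates the overall structure of RMPI in the exact case (δ = 0), with a constant schedule m_n = m and the sup-norm stopping test that appears in criterion (6) below. It reuses `inner_problem` and `robust_bellman` from the sketch in Section 2; as before, the names and data layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rmpi(r, P, A, lam, v0, eps, m=50, max_updates=10_000):
    # Robust modified policy iteration with exact inner problems (delta = 0).
    # v0 should satisfy the initial condition Upsilon v0 >= v0 (for example,
    # v0 = 0 when all rewards are nonnegative).
    v = np.array(v0, dtype=float)
    for _ in range(max_updates):
        # Policy improvement: one robust Bellman sweep over all (s, a).
        u, d = robust_bellman(v, r, P, A, lam)
        if np.max(np.abs(u - v)) < (1.0 - lam) * eps / (2.0 * lam):
            return d, u  # eps-optimal decision rule and value estimate
        # Partial policy evaluation: m - 1 further sweeps with the rule d
        # fixed, so only one inner problem per state is solved in each sweep.
        for _ in range(m - 1):
            u = np.array([r[s][d[s]] + lam * inner_problem(P[s][d[s]], u)
                          for s in range(len(A))])
        v = u
    return d, v
```

Setting m = 1 recovers robust value iteration, while very large m approaches robust policy iteration; the numerical studies in Section 5 examine this trade-off.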

Let V denote the set of all bounded real-valued functions on S, and let $\|v\|$ denote the supremum norm: $\|v\| = \max_{s \in S} |v(s)|$. The space $(V, \|\cdot\|)$ is a Banach space: a complete vector space with norm $\|\cdot\|$. Let $\Upsilon: V \to V$ denote the robust value iteration operator: $\Upsilon v(s) = \max_{a \in A_s}\{ r(s, a) + \lambda \sigma_{\mathcal{P}(s,a)}(v) \}$. As proved in Theorem 3.2 of Iyengar (2005), Υ is a contraction mapping and $v^{*}$ is its unique fixed point. We call a decision rule d ε-optimal if, for all $s \in S$, $v^{(d)}(s) \ge v^{*}(s) - \epsilon$.

Define $V^0 = \{ v \in V : \Upsilon v \ge v \}$, where the inequality is componentwise. Let $v^0$ denote an initial estimate of the value function. In order to guarantee convergence we require $v^0 \in V^0$, which is analogous to the initial condition of Puterman and Shin (1978). This initial condition is not very restrictive; since the state and action spaces are finite, $|r(s,a)| < M$ for some finite value M, and we may, without loss of generality, assume that $r(s,a) \ge 0$ and choose $v^0 = 0$ to satisfy $v^0 \in V^0$.

Next, we present the RMPI algorithm, which is used to compute an ε-optimal decision rule $d_\epsilon$.

Robust Modified Policy Iteration

Input: $v^0 \in V^0$, $\epsilon > 0$, δ with $0 \le \delta < (1-\lambda)^2 \epsilon/(2\lambda)$, and a sequence of positive integers $\{m_n\}$.

Output: ε-optimal decision rule $d_\epsilon$ and $v_\epsilon \in V$ such that $\|v_\epsilon - v^{*}\| \le \epsilon$.

1. Set n = 0.

2. (Policy improvement)

(a) For all states s and actions $a \in A_s$, compute a value $\tilde{\sigma}_{s,a}$ such that
$$\tilde{\sigma}_{s,a} \le \sigma_{\mathcal{P}(s,a)}(v^{n}) \le \tilde{\sigma}_{s,a} + \delta.$$

(b) Choose $d_{n+1}$ to satisfy
$$d_{n+1}(s) \in \operatorname{argmax}_{a \in A_s}\{ r(s, a) + \lambda \tilde{\sigma}_{s,a} \}.$$

3. (Partial policy evaluation)

(a) Set k = 1 and
$$u^{n}_{1}(s) = \max_{a \in A_s}\{ r(s, a) + \lambda \tilde{\sigma}_{s,a} \}.$$

(b) If
$$\| u^{n}_{1} - v^{n} \| < \frac{(1-\lambda)\epsilon}{2\lambda} - \delta, \qquad (6)$$
go to step 4. Otherwise, go to step 3(c).

(c) If $k = m_n$, go to step 3(e). Otherwise, for all states, compute a value $\tilde{\sigma}_{s}$ such that
$$\tilde{\sigma}_{s} \le \sigma_{\mathcal{P}(s, d_{n+1}(s))}(u^{n}_{k}) \le \tilde{\sigma}_{s} + \delta, \qquad (7)$$
and set $u^{n}_{k+1}(s) = r(s, d_{n+1}(s)) + \lambda \tilde{\sigma}_{s}$.

(d) Increment k by 1 and return to step 3(c).

(e) Set $v^{n+1} = u^{n}_{m_n}$, increment n by 1, and go to step 2.

4. Set $v_\epsilon = u^{n}_{1}$ and $d_\epsilon = d_{n+1}$, and stop.

3.1 Exact RMPI

The main convergence result for exact RMPI is Theorem 5. While the proof of this theorem is different, the general approach is similar to the proof of Theorem 6.5.5 of Puterman (1994). We divide the proof into several lemmas. Lemma 3 is analogous to Lemma 6.5.2 of Puterman (1994) and Lemma 4 is analogous to Lemma 6.5.4 of Puterman (1994). The most notable (but not all) complications arising from nature's inner problem are resolved in Lemma 2. These lemmas require some additional definitions. For clarity of exposition, we restrict our analysis to the case $m_n = m$ for all n. Extending to the more general case is straightforward.

For $d \in D$, define the operator $\psi^{m}_{d}: V \to V$, $m \in \{1, 2, \ldots\}$, by the recursive relationship
$$\psi^{k}_{d} v(s) = r(s, d(s)) + \lambda \sigma_{\mathcal{P}(s,d(s))}(\psi^{k-1}_{d} v), \quad k \in \{1, 2, \ldots, m\},$$
with $\psi^{0}_{d} v(s) = v(s)$. A decision rule d is said to be v-improving if it satisfies
$$d(s) \in \operatorname{argmax}_{a \in A_s}\{ r(s, a) + \lambda \sigma_{\mathcal{P}(s,a)}(v) \},$$

for all s. Let Ψ m v be ψ m d v v, where d v is any v-improving decision rule obtained using any fixed method to choose among the maximizers. The sequence v n } generated by exact RMPI satisfies v n+1 = Ψ m v n. The special case m = 1 is robust value iteration. Define the operator Φ m : V V by Φ m v(s) = max d D ψ m d v(s)}. Note that Φ1 v = Ψ 1 v = ψ 1 d v v = Υv. Denote by e a vector of all ones. Lemma 1 For v V and c R, (a) Υ(v + ce) = λc + Υv, (b) Φ m (v + ce) = λ m c + Φ m v. Proof: Since s S p(s ) = 1 for every p P(s, a), it is straightforward to show that σ P(s,a) (v + ce) = c + σ P(s,a) (v). Hence, Υ(v + ce) = max a A s r(s, a) + λσp(s,a) (v + ce) } = max a A s r(s, a) + λc + λσp(s,a) (v) } = λc + max a A s r(s, a) + λσp(s,a) (v) } = λc + Υv. This establishes part (a). The proof of part (b) is similar, and is omitted for brevity. Lemma 2 For u, v V, if u v, then (a) ψ m d u ψm d v, for any d D, (b) Φ m u Φ m v, (c) Υu Υv. Proof: Fix a decision rule d D. For k 1, 2,..., m}, consider any feasible responses of nature p k s P(s, d(s)). Then, [ ψd m u(s) r(s, d(s)) + m ( m ) λ m+1 j p k s k (s k 1 ) r(s j 1, d(s j 1 )) s m=s;s m 1,s m 2,...,s 0 S ( m ) ] + λ m p k s k (s k 1 ) u(s 0 ), k=1 j=2 k=j where the right-hand side of this expression is the total m-period reward corresponding to d and the responses p k s. Since u v, substituting v for u yields [ ψd m u(s) r(s, d(s)) + m ( m ) λ m+1 j p k s k (s k 1 ) r(s j 1, d(s j 1 )) s m=s;s m 1,s m 2,...,s 0 S ( m ) ] + λ m p k s k (s k 1 ) v(s 0 ). k=1 10 j=2 k=j

The infimum of the right-hand side of this expression over nature s feasible choices p k s equals ψd mv(s). It follows that ψm d u(s) ψm d v(s), establishing part (a). Let d argmax d D ψd mu(s)}. Then, Φm u(s) = ψd mu(s) ψm d v(s) Φm v(s), where the first inequality follows from part (a) and the second inequality follows immediately from the definition of Φ m. This establishes part (b). Part (c) follows from part (b) with m = 1. Lemma 3 For w 0 V and m > 0, the following hold: (a) Φ m is a contraction mapping with constant λ m. (b) The sequence w n+1 = Φ m w n, n = 0, 1,..., converges in norm to the unique fixed point of Φ m, denoted w. (c) w n+1 w λ m w n w. (d) w = v where v is the unique fixed point of Υ: Υv = v. Proof: Let u, v V, and let c = max s S v(s) u(s). Then u ce v u + ce, which by Lemma 1 (b) and Lemma 2 (b) implies that Φ m u λ m c Φ m v Φ m u + λ m c. Hence, Φ m v Φ m u λ m c = λ m v u. This establishes part (a). Parts (b) and (c) then follow immediately from the Banach Fixed Point Theorem (c.f., Puterman 1994, Theorem 6.2.3). Let d v be a v -improving decision rule. Since v is the unique fixed point of Υ, by the convergence of robust value iteration we have v = (Υ) m v = ψd m v v Φ m v. Similarly for all n, v (Φ m ) n v, and we conclude that v w. Since Φ m assumes a fixed decision rule (while nature s responses may vary) but the decision maker s actions in (Υ) m are allowed to be dynamic, it can be shown that Φ m w (Υ) m w. Hence, w = (Φ m ) n w ((Υ) m ) n w, and letting n gives w v. Therefore, w = v ; part (d) holds. Lemma 4 Suppose v V 0. Then, for all m 1, 2,...}, (a) Ψ m+1 v Ψ m v, (b) Ψ m v V 0. 11

Proof: Let d v be a v-improving decision rule. Then, Ψ m+1 v = ψ m+1 d v v = ψ m d v ( ψ 1 dv v ) = ψ m d v (Υv). Since v V 0, Υv v, and ψ m d v (Υv) ψ m d v v by Lemma 2 (a). Therefore, Ψ m+1 v ψ m d v v = Ψ m v; part (a) holds. Part (b) is equivalent to Υ (Ψ m v) Ψ m v. We have, Υ (Ψ m v) ψ 1 d v ( ψ m dv v ) = ψ m+1 d v v = Ψ m+1 v Ψ m v, where the last inequality follows from part (a). Theorem 5 Suppose that the inner problems can be solved finitely. Then, for v 0 V 0 : (a) The iterates of exact RMPI converge monotonically and in norm to v. (b) Exact RMPI terminates in a finite number of iterations with an ɛ-optimal policy d ɛ and a corresponding value function v ɛ V 0 that satisfies v ɛ v ɛ/2. Proof: Define the sequences u n }, v n }, and w n }, n = 0, 1,..., by u 0 = v 0 = w 0, u n+1 = Υu n, v n+1 = Ψ m v n, and w n+1 = Φ m w n. The sequence u n } is the robust value iteration iterates, and v n } is the RMPI iterates. Since v 0 V 0, Lemma 4 (b) implies that v n V 0 : for all n, Υv n v n. Lemma 4 (a) implies that Ψ m v n Ψ 1 v n = Υv n. Hence, v n+1 = Ψ m v n Υv n v n, so that v n } is monotone. To prove that v n } converges in norm to v, we first show that u n v n w n, (8) for all n by induction. For n = 0, (8) holds by definition. Assume that (8) holds for n. Then, by Lemma 2 (b), w n+1 = Φ m w n Φ m v n ψd m vn v n = Ψ m v n = v n+1. Also, v n+1 = Ψ m v n Υv n Υu n = u n+1, where the first inequality is by Lemma 4 (a), and the second inequality is by Lemma 2 (c). Therefore, (8) holds for n + 1 as well. By induction, (8) holds for all n. Because u n v n w n, v n w n u n w n so that v n v v n w n + w n v u n w n + w n v. Since u n w n u n v + v w n, we have v n v u n v + 2 w n v. Then, since by Lemma 3 u n } and w n } both converge in norm to v, it follows that v n } must also converge in norm to v. Part (a) holds. Let us now establish part (b). Since v Υv n v n implies Υv n v n v v n, it follows from the convergence of v n to v that (6) will be satisfied in a finite number of 12

iterations. Suppose that the algorithm terminates with n = N. Let v N+1 := Υv N and define (s, a) = σ P(s,a) (v N ) σ P(s,a) (v N+1 ), so that v N+1 (s) = max a A s r(s, a) + λ (s, a) + λσp(s,a) (v N+1 ) }. (9) For any u, v V and P M(S), ( σ P (u) σ P (v) = inf sup p(s)u(s) ) ( ) q(s)v(s) sup q(s)(u(s) v(s)) p P q P q P s S s S s S u v, (10) since 0 q(s) 1. Therefore, (s, a) v N+1 v N ; so, by (6), (s, a) (1 λ)ɛ/(2λ). Since v N+1 is a fixed point of (9), which has the same form as the optimality equations (3), a policy d ɛ satisfying d ɛ (s) argmax a As r(s, a) + λσp(s,a) (v N ) } is optimal for the same problem but with a different cost function r(s, a) = r(s, a) + λ (s, a). (Note that (s, a) and r(s, a) are stationary.) The bound on (s, a) implies that v (dɛ ) v N+1 i=1 λi (1 λ)ɛ/(2λ) = ɛ/2, since v N+1 = Υv N where Υ is a contraction mapping with constant λ. Following the proof of (Puterman 1994, Theorem 6.3.1), it can be shown that v v N+1 λ 1 λ v N+1 v N. (11) Therefore, v v N+1 ɛ/2, and v v (dɛ ) v v N+1 + v N+1 v (dɛ ) ɛ. 3.2 Inexact RMPI Nilim and El Ghaoui (2005, Section 4.2) present an inexact version of robust value iteration in which the inner problem is solved to within a specified tolerance δ 0. We have a similar result, Theorem 7. In the inexact setting, the solution to the inner problem can be interpreted as a function where δ P(s,a) (v) satifies 0 δ P(s,a) (v) δ. σ P(s,a) (v) = σ P(s,a) (v) δ P(s,a) (v), Some of our analysis in the proof of Theorem 7 is similar to analysis in Section 4.2 of Nilim and El Ghaoui (2005), and the form of our stopping criterion (6) is the same the sup-norm of the difference in value functions is less than or equal to (1 λ)ɛ δ. However, 2λ the precision of δ in Theorem 5 of Nilim and El Ghaoui (2005) is only (1 λ)ɛ. We require it 2λ to be within (1 λ)2 ɛ, which means that the inner problems need to be solved more precisely. 2λ 13

Theorem 5 of Nilim and El Ghaoui (2005) only claims that inexact robust value iteration will yield an ɛ-optimal policy if the algorithm terminates. In order to guarantee that the algorithm actually does terminate in a finite number of iterations, our result requires more precision. Define the operator ψ m d : V V, m 0, 1,...}, by the recursive relationship ψ k d v(s) = r(s, d(s)) + λ σ P(s,d(s))( ψ d 0 v(s) = v(s). k 1 ψ d v), k 1, 2,..., m}, Let d v (s) argmax a As r(s, a) + λ σp(s,a) (v) } and define Ψ m v = ψ m dv v and Ῠv = ψ 1 dv v. Denote the iterates of inexact RMPI by v n }. They satisfy v n+1 = Ψ m v n, with v 0 = v 0. Lemma 6 If v 0 = v 0 V, then the iterates of exact RMPI, v n+1 = Ψ m v n, n = 0, 1,..., and those of inexact RMPI, v n+1 = Ψ m v n, n = 0, 1,..., satisfy for all n 0. v n λδ 1 λ e v n v n, (12) ψd k vn v n λδ 1 λ e ψ k d vn v n ψd k vn v n, k 1, 2,..., m}, (13) Proof: The proof is by induction. For n = 0, (12) holds by assumption. Assume that (12) holds for n and consider n + 1. It is easy to show that for u v, σ P(s,a) (u) σ P(s,a) (v), for every (s, a), so that σ P(s,a) (v n ) σ P(s,a) ( v n ) σ P(s,a) (v n λ(1 λ) 1 δe) = σ P(s,a) (v n ) λ(1 λ) 1 δ, where the last equality holds since P(s, a) M(S). It follows that, ψ 1 d vn v n (s) ψ 1 d vn v n (s) ψ 1 d vn v n (s) λ(δ + λ(1 λ) 1 δ) = ψ 1 d vn (s) λ(1 λ) 1 δ. Similarly, it is easy to show inductively that (13) holds for k = 2, 3,..., m. So, (12) is satisfied for n + 1. By induction, (12) and (13) hold for all n 0. Theorem 7 For δ 0, suppose that the inner problems can be solved finitely to within δ. Then, for v 0 V 0 : (a) If δ < (1 λ)2 ɛ, then the inexact RMPI algorithm terminates in a finite number of 2λ iterations. 14

(b) The inexact RMPI algorithm terminates with an ɛ-optimal policy d ɛ and a corresponding value function v ɛ V satisfying v ɛ v ɛ. Proof: Lemma 6 with m = 1 implies that Ῠ v n v n Υv n v n + λ(1 λ) 1 δ, (14) since y 1 c x 1 y 1 and y 2 c x 2 y 2 imply x 2 x 1 y 2 y 1 + c. Because the exact RMPI algorithm terminates in a finite number of iterations by Theorem 5 (b), for any ɛ 2 > 0 there exists N such that Υv N v N (1 λ)ɛ 2 /(2λ). Then by (14), Ῠv N v N (1 λ)ɛ 2 /(2λ) + λ(1 λ) 1 δ, which will satisfy the stopping condition (6) if (1 λ)ɛ 2 /(2λ) + λ(1 λ) 1 δ (1 λ)ɛ/(2λ) δ, ɛ 2 ɛ 2λ(1 λ) 2 δ. This will hold for some ɛ 2 > 0 if and only if δ < (1 λ) 2 ɛ/(2λ). This establishes part (a). Suppose the inexact RMPI algorithm terminates with n = N. Let v N+1 = Ῠ v N and v N+1 = Υ v N. Then, and Hence, v N+1 (s) = λδ + max r(s, a) λδ + λσp(s,a) ( v N ) } a A s λδ + max r(s, a) λδp(s,a) ( v N ) + λσ P(s,a) ( v N ) } a A s = λδ + v N+1 (s), v N+1 (s) = max r(s, a) λδp(s,a) ( v N ) + λσ P(s,a) ( v N ) } a A s max r(s, a) + λσp(s,a) ( v N ) } a A s = v N+1 (s). v N+1 v N+1 v N+1 + λδe. (15) As in (11), it can be shown that v v N+1 λ(1 λ) 1 v N+1 v N, so v v N+1 λ(1 λ) 1 ( v N+1 v N+1 + v N+1 v N ) λ(1 λ) 1 (λδ +(1 λ)ɛ/(2λ) δ). Therefore, v v N+1 v v N+1 + v N+1 v N+1 λ(1 λ) 1 (λδ +(1 λ)ɛ/(2λ) δ)+λδ = ɛ/2. 15

Define (s, a) = δ P(s,a) ( v N ) + σ P(s,a) ( v N ) σ P(s,a) ( v N+1 ) so that } v N+1 (s) = max r(s, a) + λ (s, a) + λσ P(s,a) ( v N+1 ). (16) a A s The vector v N+1 is a fixed point of (16), which has the same form as the optimality equations (3), so that a policy d ɛ satisfying d ɛ (s) argmax a As r(s, a) λδp(s,a) ( v N ) + λσ P(s,a) ( v N ) } is optimal for the same problem but with a different cost function r(s, a) = r(s, a)+λ (s, a). By (10), we have (s, a) δ + v N+1 v N (1 λ)ɛ/(2λ). This implies that v (de ) v N+1 i=1 λi (1 λ)ɛ/(2λ) = ɛ/2. Therefore, v v (de ) v v N+1 + v N+1 v (de ) ɛ/2 + ɛ/2 = ɛ. 4. Span Stopping Criteria The stopping criteria of the previous section are based on conservative sup-norm bounds. These criteria may result in some unnecessary iterations. For standard DP, improved stopping criteria have been established based on the span seminorm (Puterman 1994, Section 6.6.3). In this section we provide similar results for RMPI. For v V, define MIN(v) = min s S v(s) and MAX(v) = max s S v(s) and define the span of v, denoted sp(v), by sp(v) = MAX(v) MIN(v). Lemma 8 For any v V, v + (1 λ) 1 MIN(Υv v)e v (dv) v v + (1 λ) 1 MAX(Υv v)e, (17) and v + (1 λ) 1 MIN(Ῠv v)e v( d v) v v + (1 λ) 1 [λδ + MAX(Ῠv v)]e. (18) Proof: First consider the exact case (17). Let u 0 = v and u n = Υu n 1, n > 0; u n } are iterates of robust value iteration. Then, u 1 = Υv and v + MIN(Υv v)e u 1 v + MAX(Υv v)e. We will show by induction that, for all n, v + n λ i MIN(Υv v)e u n v + i=0 n λ i MAX(Υv v)e. (19) i=0 16

Assume that (19) holds for iteration n. By Lemma 2 (c), ( ) n u n+1 = Υu n Υ v + λ i MAX(Υv v)e. Then, by Lemma 1 (a), n u n+1 Υv(s) + λ i+1 MAX(Υv v)e, i=0 so that n n+1 u n+1 v + MAX(Υv v)e + λ i+1 MAX(Υv v)e = v + λ i MAX(Υv v); i=0 i=0 i=0 the upper bound for iteration n+1 holds. It can be shown similarly that the lower bound for iteration n + 1 holds as well. Hence, by induction, (19) holds for all n. Since value iteration converges to v, taking the limit as n gives v + (1 λ) 1 MIN(Υv v)e v v + (1 λ) 1 MAX(Υv v)e. (20) To establish (17), we will also show that v (dv) is bounded similarly. Note that v (dv) is the fixed point of the contraction mapping ψd 1 v. (To see that ψd 1 v is a contraction mapping, consider Υ with A s = d v (s)}.) Since (Υv v) = (ψd 1 v v v), arguments similar to those above, but with Υ replaced by ψd 1 v, lead to v + (1 λ) 1 MIN(Υv v)e v (dv) v + (1 λ) 1 MAX(Υv v)e. (21) Clearly, v (dv) v. This together with (20) and (21) establish (17). For the inexact case (18), note that, similar to (15), Ῠv Υv Ῠv + λδe. The bounds on v then follow from (17) since MIN(Υv v) MIN(Ῠv v) and MAX(Υv v) λδ +MAX(Ῠv v). For the bounds on v( d v), note that v ( d v) is the fixed point of ψ 1 dv, and MIN(ψ 1 dv v v) MIN(Ῠv v) and v v) λδ + MAX(Ῠv v). The proof of MAX(ψ1 dv (18) is then similar to that of (17). Theorem 9 Suppose v V and ɛ > 0. (a) If sp(υv v) < (1 λ)ɛ, (22) then v v + (1 λ) 1 MIN(Υv v)e < ɛ and v (dv) v < ɛ. 17

(b) If sp(ῠv v) < (1 λ)ɛ λδ, (23) then v v + (1 λ) 1 MIN(Ῠv v)e < ɛ and v( d v) v < ɛ. Proof: We will establish part (b); part (a) is similar. By Lemma 8, 0 v 1 v (1 λ) MIN(Ῠv v)e (1 λ) 1 (sp(ῠv v)e + λδ) < ɛ. Hence, v + (1 λ) 1 MIN(Ῠv v)e v < ɛ. The fact that v ( d v) v < ɛ also follows from Lemma 8, since a b c d implies 0 c b d a. Theorem 9 suggests stopping exact RMPI under (22) and setting d ɛ = d v and v ɛ = v (1 λ) 1 MIN(Υv v)e. Inexact RMPI should be stopped under (23), setting d ɛ = d v and v ɛ 1 = v (1 λ) MIN(Ῠv v)e. In many cases the span stopping criterion is less restrictive than the sup-norm stopping criterion, allowing the algorithm to terminate sooner under the span stopping criterion. Comparing the sup-norm stopping criterion (6) to the span stopping criterion (23), we find that when (1 λ)ɛ 2λ δ < (1 λ)ɛ λδ λ 1/2 λ > δ ɛ. (24) If Ῠv v, then sp(ῠv v) Ῠv v and the span stopping criterion is less restrictive than the sup-norm stopping criterion when (24) holds. Note that (24) holds for sure when λ >.5. For exact RMPI, Υv v is guaranteed by Lemma 4 (b) so that sp(υv v) Υv v, and hence, the span stopping criterion is less restrictive than the sup-norm stopping criterion when λ >.5. For most applications, λ >.5 is satisfied. 5. Numerical Studies In this section we present numerical studies for two classes of inventory control problems. The purpose of our numerical studies is to show that (inexact) RMPI can significantly outperform robust value iteration, rather than to perform an exhaustive analysis on the best 18

RMPI parameters on a variety of instances. Because it has not been shown to converge, we do not consider the policy iteration algorithm of Satia and Lave (1973). Recall that robust value iteration is equivalent to RMPI when $m_n = 1$ for all n. RMPI is also a special case of robust policy iteration as the $m_n$ become large. The robust policy iteration of Iyengar (2005) assumes that a fixed policy can be evaluated exactly, and this is guaranteed with successive approximation in the limits $m_n \to \infty$. Of course, the successive approximation does not need to be carried out to such extremes; with RMPI the policy evaluation step is performed in a finite number of iterations following a schedule $\{m_n\}$. For the numerical study, our analysis is restricted to constant schedules: $m_n = m$ for all n.

The first class of problems considers a single manufacturer facing uncertain customer demand. The second considers two retailers who are linked through transshipments. We use the robust DP framework. Other robust optimization frameworks for inventory control problems have been proposed. Scarf (1958) provides the earliest min-max framework for inventory optimization. For a background in robust inventory control, see Bienstock and Özbay (2008), See and Sim (2010), or Bertsimas et al. (2010).

Uncertainty sets are modeled using bounds on the relative entropy distance between the true, unknown demand distribution, q, and a point estimate, $\hat{q}$. The Kullback-Leibler distance (or relative entropy distance) between two probability mass functions $q_1$ and $q_2$ with sample space Z is defined as:

$$D(q_1 \| q_2) = \sum_{z \in Z} q_1(z) \log\left( \frac{q_1(z)}{q_2(z)} \right).$$

We are interested in sets of the form $\{ q : D(q \| \hat{q}) \le \beta \}$, where β is a scalar that depends on both the amount of data available for estimation and a confidence parameter $\omega \in (0, 1)$.

The following statistical justification appears in Section 4 of Iyengar (2005). Let $\hat{q}$ be the maximum likelihood estimate of q, and let K be the number of historical samples used for the estimate. That is, if k(z) is the number of times z is realized in the sample, then $K = \sum_{z \in Z} k(z)$ and $\hat{q}(z) = k(z)/K$. Let $\chi^2_{|Z|-1}$ denote a chi-squared random variable with $|Z| - 1$ degrees of freedom, and let $F_{|Z|-1}(\cdot)$ denote its distribution function with inverse $F^{-1}_{|Z|-1}(\cdot)$. As K becomes large, $K D(q \| \hat{q})$ converges in distribution to $\frac{1}{2}\chi^2_{|Z|-1}$.
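As a small illustration of how these quantities are computed from data, the sketch below evaluates the Kullback-Leibler distance defined above and the radius β of equation (25), which is derived in the next paragraph, from the sample size K and confidence parameter ω. The use of scipy's chi-squared quantile and the function names are illustrative assumptions, not part of the paper.

```python
import numpy as np
from scipy.stats import chi2

def kl_distance(q1, q2):
    # D(q1 || q2) = sum_z q1(z) * log(q1(z) / q2(z)); assumes q2(z) > 0
    # wherever q1(z) > 0, with the usual convention 0 * log 0 = 0.
    q1, q2 = np.asarray(q1, dtype=float), np.asarray(q2, dtype=float)
    support = q1 > 0
    return float(np.sum(q1[support] * np.log(q1[support] / q2[support])))

def entropy_radius(num_outcomes, K, omega):
    # beta = F^{-1}_{|Z|-1}(omega) / (2K), cf. equation (25) below.
    return chi2.ppf(omega, df=num_outcomes - 1) / (2.0 * K)
```

For example, with |Z| = 301 demand outcomes, K = 10,000 samples, and ω = .95, entropy_radius(301, 10_000, .95) would give the β used for the base parameter set of the numerical study.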

Therefore, for β > 0,

$$P\left\{ q \in \{ q : D(q \| \hat{q}) \le \beta \} \right\} \approx P\left\{ \chi^2_{|Z|-1} \le 2K\beta \right\} = F_{|Z|-1}(2K\beta).$$

This implies that an (approximate) ω-confidence set for the true distribution q is $\{ q : D(q \| \hat{q}) \le F^{-1}_{|Z|-1}(\omega)/(2K) \}$. Accordingly, let

$$\beta = F^{-1}_{|Z|-1}(\omega)/(2K). \qquad (25)$$

Not only are the sets $\{ q : D(q \| \hat{q}) \le \beta \}$ statistically meaningful, but they also give rise to inner problems that can be solved efficiently. Using standard duality arguments, Nilim and El Ghaoui (2005) show that the related inner problem can be recast as the minimization of a convex function in one dimension over a bounded region. For the numerical study, the inner problems were solved using the bisection algorithm presented in Section 6.3 of Nilim and El Ghaoui (2005).

Computations were performed on an Intel Core i7 2.66 GHz processor. The code was written in C++ and compiled with the GNU Compiler Collection. The operating system was 64-bit Fedora (Red Hat). The computations were timed using the clock() function from the C Time library. The CPU times reported here match closely to the wall clock times. Unless stated otherwise, the reported computation times correspond to the span stopping criterion.

5.1 Inventory Control for a Single Manufacturer

Consider a single manufacturer whose inventory decisions are affected by uncertainty in customer demand. The stationary demand distribution is not known, but historical data are available. The manufacturer implements a periodic review control policy. At the beginning of the period, demand from the previous period is realized. If the amount of inventory on hand during the previous period is y, then the inventory position after realizing demand z is x = y - z. Demand is assumed to be fully backlogged. If x < 0, then the manufacturer must produce at least -x in order to satisfy the backlogged demand. Additionally, the manufacturer pays a penalty cost ϱ per unit backordered. On the other hand, if x > 0, then

the manufacturer pays a holding cost h > 0 per unit of excess inventory, where h < ϱ. Each new unit costs c to produce. For each unit sold, revenue φ is earned. Any backorders must be satisfied right away, so total revenue for demand realization z, φz, is assumed to come at the start of the period. In each period, the manufacturer must decide on the quantity to produce to bring the total inventory level up to y. Our robust model assumes that nature is allowed to choose different demand distributions for different up-to levels y; the rectangularity assumptions hold.

Although it seems natural to work in terms of x, it is convenient to define the state of the decision process as (y, z). In response to y, nature is allowed to choose from $\mathcal{P}(y)$, where $p \in \mathcal{P}(y)$ is a probability measure for transitions to (y, z); demand z is then realized and inventory drops to x = y - z. The sets $\mathcal{P}(y)$ are constructed using the same demand distribution estimate, $\hat{q}$, independent of y. We assume that there is an upper bound on demand, $\bar{z}$. The set of possible demand realizations is $Z = \{0, 1, \ldots, \bar{z}\}$. Since h > 0, it does not make sense to produce up to $y > \bar{z}$. So given x, the manufacturer chooses some action $y \in \{x^{+}, x^{+}+1, \ldots, \bar{z}\}$. The manufacturer's action space, the set of all possible actions for all states, is Z. Under these assumptions, the optimality equations are:

$$f(x) = \kappa - h x^{+} - \varrho x^{-} + \max_{y:\, x^{+} \le y \le \bar{z}} \left( -c(y - x) + \lambda \sigma_{\mathcal{P}(y)}(v) \right), \quad x \in \{-\bar{z}, -\bar{z}+1, \ldots, \bar{z}-1, \bar{z}\};$$

$$v(y, z) = \phi z + f(y - z),$$

where $\lambda \in (0, 1)$ is the discount factor, $x^{+}$ denotes the positive part ($x^{+} = \max\{x, 0\}$), $x^{-}$ denotes the negative part ($x^{-} = \max\{-x, 0\}$), and κ is a constant. The constant κ guarantees that v remains nonnegative. It does not affect the optimal decisions of the manufacturer or the optimal responses of nature. We set $\kappa = (\varrho + c)\bar{z}$. This together with $v^{0}(y, z) = 0$ guarantees that the RMPI initial condition $v^{0} \in V^{0}$ is satisfied.

For standard DP, $\mathcal{P}(y)$ is a singleton. Demand can be represented by a random variable, Z, which has the same distribution for all y. The inner problem is an expectation with respect to Z, and the standard DP optimality equations reduce to:

$$f(x) = \kappa - h x^{+} - \varrho x^{-} + \max_{y:\, x^{+} \le y \le \bar{z}} \left( -c(y - x) + \lambda\, E[\phi Z + f(y - Z)] \right), \quad x \in \{-\bar{z}, -\bar{z}+1, \ldots, \bar{z}-1, \bar{z}\}.$$

For the robust DP, $\hat{q}(z)$ is an estimate for the probability of realizing demand z. Given a confidence parameter ω and a number of samples K, we can calculate β in (25) with

$|Z| - 1 = \bar{z}$.

Parameter set | ε   | λ   | φ   | c   | ϱ   | h     | η  | ω   | K      | z̄
I_0(ζ)        | 0.1 | .99 | 100 | .3φ | .1φ | .01φ  | .5 | .95 | 10,000 | ζ
I_1(ζ)        |     |     |     |     |     |       |    |     | 1,000  | ζ
I_B(ζ)        |     |     |     | .2φ |     | .005φ | 1  |     | 10,000 | ζ

Table 1: Parameter settings. I_0(z̄) is the base setting. I_1(z̄) reduces K. I_B(z̄) increases the incentive to produce more. (Blank entries coincide with the base settings.)

For the relative entropy bounds, following arguments of Nilim and El Ghaoui (2005) we have $\sigma_{\mathcal{P}(y)}(v) = -\min_{\gamma \ge 0} \nu_y(\gamma)$, where

$$\nu_y(\gamma) = \gamma\beta + \gamma \log\left( \sum_{z \in Z} \hat{q}(z) \exp\left( -\frac{v(y, z)}{\gamma} \right) \right).$$

The function $\nu_y(\gamma)$ is one-dimensional and convex, and it has a first derivative with a known form. The optimal value for γ is known to lie in the interval $[0, (\sum_{z \in Z} \hat{q}(z) v(y, z) - v_{\min})/\beta]$, where $v_{\min} = \min_{z \in Z} v(y, z)$. Let $Q(v) = \sum_{z \in Z: v(y,z) = v_{\min}} \hat{q}(z)$. If $\beta > -\log Q(v)$, then γ = 0 is optimal with $\sigma_{\mathcal{P}(y)}(v) = v_{\min}$. Otherwise, the optimal γ can be found using the bisection algorithm. For an ε-optimal policy, in accordance with (5) and Theorem 7, each inner problem was solved to within $\delta = (1-\lambda)^2 \epsilon/2 < (1-\lambda)^2 \epsilon/(2\lambda)$. If the optimal γ is known to lie in $[\gamma^{-}, \gamma^{+}]$, then $\left| \frac{d\nu_y}{d\gamma}\left( (\gamma^{-} + \gamma^{+})/2 \right) \right| \cdot (\gamma^{+} - \gamma^{-})/2 \le \delta$ is sufficient to stop bisecting.

A key observation is that $\sigma_{\mathcal{P}(y)}(v)$ depends on y but is independent of x. This means that for each iteration of robust value iteration, an inner problem only has to be solved once for each feasible y. Its value is stored and used again for other values of x for which y is feasible. In total, there are $\bar{z} + 1$ distinct inner problems, one for each element of the manufacturer's action space, that need to be solved at each iteration. For RMPI, there is savings potential because the value function is successively approximated while the policy is fixed. A fixed policy, which varies with x, might only map to a subset of the action space. So, it may be the case that fewer than $\bar{z}$ inner problems need to be solved for each successive approximation, in all steps $k < m_n$ in (7).

For RMPI, we say that n counts the total number of policy updates. The total number of iterations through n accounts for both policy updates and successive approximations under fixed policies and equals n + (n - 1)(m - 1). The total number of iterations for value iteration is n.

For all examples, we fixed φ = 100, ε = 0.1, λ = .99, and ω = .95. For a base, we set c = .3φ, ϱ = .1φ, and h = .01φ. The parameter settings are displayed in Table 1. Parameter set I_0(z̄) with z̄ = 300 is the base set. The estimate $\hat{q}$ was assumed to have a

triangular distribution. Let $G_\eta(\cdot)$ denote the cumulative distribution function for a triangular distribution over the range $[-1, \bar{z}]$ with mode equal to $-1 + \eta(\bar{z} + 1)$. The base parameters have η = .5, for a pinnacle halfway between -1 and $\bar{z}$. We set $\hat{q}(z) = G_\eta(z) - G_\eta(z-1)$. While $\hat{q}$ follows a specified triangular distribution, there is ambiguity in this estimate. Relative entropy bounds are invoked through relationship (25). The base settings have K = 10,000. Note that raising K does not affect the computation time directly through (25). Changing K does, however, impact computation times when the decisions change.

It is well known that the optimal standard DP solution is a base-stock policy, characterized by a level y*: if x < y*, produce y* - x; otherwise, produce nothing. For all of the instances we considered, the optimal robust production policy turns out to be a base-stock policy. It is not known whether this structure will persist for all parameters. Our algorithms did not take advantage of this structure explicitly. RMPI does, however, benefit from it. When a policy is fixed during successive approximations, and it happens to be a base-stock policy with some base-stock level $y^{*}_{n}$, only $\bar{z} - y^{*}_{n} + 1$ inner problems have to be solved for each step k. The point is that inner problems only have to be solved for the up-to levels y that are touched; $y < y^{*}_{n}$ would not be touched. The higher $y^{*}_{n}$, the larger the savings.

K                      | 10,000 | 1,000
y*                     | 188    | 216
Robust value iteration | 2:03.0 | 2:08.6
RMPI, m = 50           | 0:47.0 | 0:37.7
Savings                | 61.8%  | 70.7%

Table 2: Total computation times (m:ss.0) and savings for RMPI (m = 50) compared to robust value iteration for the single manufacturer problem under I_0(300), I_1(300).

Table 2 reports results for I_0(300). Robust value iteration took 2:03.0 (2 min. 3.0 sec.) to compute an optimal policy. The optimal base-stock level is 188. For RMPI, m = 50 was chosen initially. RMPI took 0:47.0, a reduction in time of 61.8% over robust value iteration. Again, a major factor in this reduction is the fact that RMPI does not have to solve all $\bar{z} + 1 = 301$ distinct inner problems in every step, while robust value iteration does. Towards the end of RMPI, when the optimal base-stock policy has been identified, RMPI only solves $\bar{z} - y^{*} + 1 = 300 - 188 + 1 = 113$ distinct inner problems during successive approximations under the fixed policy; 113/301 ≈ .38. This reduction in effort is directly reflected in the ratio of the total times: 0:47.0/2:03.0 ≈ 38%. When using the sup-norm stopping criterion instead of the span, robust value iteration took an additional 5.8 sec., only

4.1% more time. For all instances, the sup-norm criterion required additional time, but not more than an additional 5.0%. The sup-norm results are not reported in the tables.

[Figure 1: RMPI total computation time (seconds) vs. m for the single manufacturer problem under I_0(300). Robust value iteration corresponds to m = 1.]

Parameter set I_1($\bar{z}$) explores the effects of lowering K, from 10,000 to 1,000. By lowering K, nature's options are expanded. We find in Table 2 that the optimal base-stock level increases to y* = 216. Robust value iteration took 2:08.6 to compute a solution, which is close to the time for K = 10,000. RMPI with m = 50 now only took 0:37.7, a reduction in time of 71%.

5.1.1 Varying m

Figure 1 is a plot of the computation time, under parameters I_0(300), for various m. There is a significant reduction in computation time for a range of m. It turns out that the choice of m = 50 was a good one. Choosing m = 100 would have been slightly better (a savings of 71.0% instead of 70.7%). The savings for the range m = 25 to m = 200 are within 2.1%. For small m, the drop in computation time is dramatic. As m increases, the computation time eventually increases. As m becomes large, the computation time for RMPI would eventually increase beyond that of robust value iteration. It is interesting to note that the computation time is not necessarily convex in m. The total time increases just before m = 1,000 and then decreases again.

Table 3 reports the total number of iterations and the total number of policy updates for this example. Again, m = 1 corresponds to robust value iteration, for which there is a policy update at each iteration. The number of policy updates is decreasing in m, but the total number of iterations is not monotone in m. For the same total number of policy updates, the total number of iterations increases in m. The drop in total time after m = 1,000 corresponds to a decrease in the

number of policy updates required, from 5 down to 4 for m = 1,200. There is a further drop between m = 1,200 and m = 1,400, as the number of policy updates again drops from 4 to 3. For m = 1,600, the number of policy updates is still 3 and the total number of iterations increases, as does the total computation time. In summary, the reduced effort from avoiding inner problems is more significant than the increased effort due to additional total iterations.

m    | Total iterations | Policy updates | Time   | Savings
1    | 1681             | 1681           | 2:08.6 | 0%
2    | 1681             | 841            | 1:21.8 | 36.4%
3    | 1681             | 561            | 1:06.2 | 48.6%
4    | 1681             | 421            | 0:58.4 | 54.6%
5    | 1681             | 337            | 0:53.7 | 58.3%
10   | 1681             | 169            | 0:44.4 | 65.5%
25   | 1701             | 69             | 0:39.4 | 69.4%
50   | 1701             | 35             | 0:37.7 | 70.7%
75   | 1726             | 24             | 0:37.9 | 70.5%
100  | 1701             | 18             | 0:37.3 | 71.0%
200  | 1801             | 10             | 0:40.0 | 68.9%
400  | 2001             | 6              | 0:46.0 | 64.2%
600  | 2401             | 5              | 0:56.4 | 56.2%
800  | 3201             | 5              | 1:15.1 | 41.6%
1000 | 4001             | 5              | 1:33.7 | 27.1%
1200 | 3601             | 4              | 1:27.4 | 32.0%
1400 | 2801             | 3              | 1:12.9 | 43.4%
1600 | 3201             | 3              | 1:23.2 | 35.3%

Table 3: Varying m: total number of iterations, total number of policy updates, total computation times (m:ss.0), and savings for RMPI compared to robust value iteration for the single manufacturer problem under I_0(300).

5.1.2 Increasing the Base-Stock Level

We have seen that the computation time for RMPI can decrease as y* increases. In I_B($\bar{z}$), h is reduced from .01φ to .005φ, c is reduced from .3φ to .2φ, and η is increased from .5 to 1. Each of these changes increases the incentive for the manufacturer to hold more inventory. Table 4 reports the result for I_B(300). As compared to I_0(300), the optimal base-stock level is raised all the way to y* = 287. For robust value iteration, the computation time increased slightly from 2:08.6 to 2:12.5 seconds. For RMPI with m = 50, the change was greater. The total time dropped from 0:37.7 to 0:10.1 seconds. This is a savings of 92.4% compared to robust value iteration. Figure 2 is a plot of computation time vs. m. In summary, while the computation time for robust value iteration might not vary much with the optimal decisions,