Decomposition Algorithms for Stochastic Programming on a Computational Grid


Optimization Technical Report 02-07, September 2002
Computer Sciences Department, University of Wisconsin-Madison

Jeff Linderoth, Stephen Wright

Decomposition Algorithms for Stochastic Programming on a Computational Grid

September 10, 2002

Abstract. We describe algorithms for two-stage stochastic linear programming with recourse and their implementation on a grid computing platform. In particular, we examine serial and asynchronous versions of the L-shaped method and a trust-region method. The parallel platform of choice is the dynamic, heterogeneous, opportunistic platform provided by the Condor system. The algorithms are of master-worker type (with the workers being used to solve second-stage problems), and the MW runtime support library (which supports master-worker computations) is key to the implementation. Computational results are presented on large sample-average approximations of problems from the literature.

1. Introduction

Consider the two-stage stochastic linear programming problem with fixed recourse, defined as follows:

  $\min \; c^T x + \sum_{i=1}^N p_i q_i^T y_i$, subject to    (1a)
  $Ax = b, \; x \ge 0$,    (1b)
  $W y_i = h_i - T_i x, \; y_i \ge 0, \; i = 1, 2, \ldots, N$.    (1c)

The $N$ scenarios are represented by $\omega_1, \omega_2, \ldots, \omega_N$, with probabilities $p_i$ and data objects $(q_i, T_i, h_i)$ for each $i = 1, 2, \ldots, N$. The unknowns are first-stage variables $x \in \mathbb{R}^{n_1}$ and second-stage variables $y_i \in \mathbb{R}^{n_2}$, $i = 1, 2, \ldots, N$. This formulation is sometimes known as the deterministic equivalent because it lists the unknowns for all scenarios explicitly and poses the problem as a (structured) linear program.
An alternative formulation is obtained by defining the $i$th second-stage problem as a linear program (LP) parametrized by the first-stage variables $x$, that is,

  $Q_i(x) \stackrel{def}{=} \min_{y_i} \; q_i^T y_i$ subject to $W y_i = h_i - T_i x, \; y_i \ge 0$,    (2)

Jeff Linderoth: Industrial and Systems Engineering Department, Lehigh University, 200 West Packer Avenue, Bethlehem, PA 18015; jtl3@lehigh.edu
Stephen Wright: Computer Sciences Department, University of Wisconsin, 1210 W. Dayton Street, Madison, WI 53706; swright@cs.wisc.edu
Mathematics Subject Classification (1991): 90C15, 65K05, 68W10
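For intuition, evaluating $Q_i(x)$ in (2) amounts to one LP solve, whose equality-constraint duals $\pi_i$ will supply subgradient information in the next section. The following is a minimal sketch, not the paper's implementation: it assumes SciPy's HiGHS interface, where `res.eqlin.marginals` carries the equality duals, and the helper name is ours. Under the standard convention, the subgradient of $Q_i$ with respect to $x$ is $-T_i^T \pi_i$.

```python
import numpy as np
from scipy.optimize import linprog

def solve_second_stage(q, W, h, T, x):
    """Solve Q_i(x) = min q^T y s.t. W y = h - T x, y >= 0 (problem (2)),
    returning (Q_i(x), a subgradient of Q_i at x)."""
    rhs = h - T @ x
    res = linprog(q, A_eq=W, b_eq=rhs, bounds=[(0, None)] * len(q),
                  method="highs")
    if res.status != 0:
        raise RuntimeError("second-stage LP not solved: " + res.message)
    pi = res.eqlin.marginals        # duals of the equality constraints
    # chain rule: d Q_i / dx = (d Q_i / d rhs) (d rhs / dx) = -T^T pi
    return res.fun, -T.T @ pi
```

On a one-variable instance with $Q_i(x) = h - x$ (for $h - x \ge 0$), the routine returns the value and slope $-1$, matching the derivative of $h - x$.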

so that $Q_i(\cdot)$ is a piecewise linear convex function. The objective in (1a) is then

  $Q(x) \stackrel{def}{=} c^T x + \sum_{i=1}^N p_i Q_i(x)$,    (3)

and we can restate (1) as

  $\min_x \; Q(x)$, subject to $Ax = b, \; x \ge 0$.    (4)

We can derive subgradient information for $Q_i(x)$ by considering dual solutions of (2). If $\pi_i$ is the Lagrange multiplier vector for the equality constraint in (2), it is easy to show that

  $-T_i^T \pi_i \in \partial Q_i(x)$,    (5)

where $\partial Q_i$ denotes the subdifferential of $Q_i$. By Rockafellar [19, Theorem 23.8], using polyhedrality of each $Q_i$, we have from (3) that

  $\partial Q(x) = c + \sum_{i=1}^N p_i \, \partial Q_i(x)$,    (6)

for each $x$ that lies in the domain of every $Q_i$, $i = 1, 2, \ldots, N$. Let $S$ denote the solution set for (4). Since (4) is a convex program, $S$ is closed and convex. If $S$ is nonempty, the projection operator $P(\cdot)$ onto $S$ is well defined.

Subgradient information can be used by algorithms in different ways. Cutting-plane methods use this information to construct convex estimates of $Q(x)$, and obtain each iterate by minimizing this estimate, as in the L-shaped methods described in Section 2. This approach can be stabilized by the use of a quadratic regularization term (Ruszczyński [20,21], Kiwiel [15]) or by the explicit use of a trust region, as in the $l_\infty$ trust-region approach described in Section 3. Alternatively, when an upper bound on the optimal value $Q^*$ is available, one can derive each new iterate from an approximate analytic center of an approximate epigraph of $Q$. The latter approach has been explored by Bahn et al. [1] and applied to a large stochastic programming problem by Frangière, Gondzio, and Vial [9].

Parallel implementation of these approaches is obvious in principle. Because evaluation of $Q_i(x)$ and elements of its subdifferential can be carried out independently for each $i = 1, 2, \ldots, N$, we can partition the scenarios $i = 1, 2, \ldots, N$ into clusters of scenarios and define a computational task to be the solution of all the second-stage LPs (2) in a number of clusters.
Each such task could be assigned to an available worker processor. Bunching techniques (see Birge and Louveaux [5, Section 5.4]) can be used to exploit the similarity between different scenarios within a chunk. To avoid inefficiency, each task should contain enough scenarios that its computational requirements significantly exceed the time required to send the data to the worker processor and to return the results.
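The partitioning just described can be sketched in a few lines; the split below is one simple (hypothetical) choice, since the paper does not prescribe a particular assignment of scenarios to clusters or of clusters to tasks:

```python
def make_clusters(num_scenarios, num_clusters):
    """Deal scenario indices 0..N-1 into roughly equal clusters N_j."""
    clusters = [[] for _ in range(num_clusters)]
    for i in range(num_scenarios):
        clusters[i % num_clusters].append(i)
    return clusters

def make_tasks(num_clusters, clusters_per_task):
    """Bundle consecutive clusters into computational tasks T_r, so that
    each task is large enough to amortize master-worker communication."""
    return [list(range(r, min(r + clusters_per_task, num_clusters)))
            for r in range(0, num_clusters, clusters_per_task)]
```

Tuning `clusters_per_task` is how one keeps the ratio of compute time to communication cost high, as the text recommends.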

In this paper, we describe implementations of decomposition algorithms for stochastic programming on a dynamic, heterogeneous computational grid made up of workstations, PCs, and supercomputer nodes. Specifically, we use the environment provided by the Condor system [16] running the MW runtime library (Goux et al. [13,12]), a software layer that significantly simplifies the process of implementing parallel algorithms.

For the dimensions of problems and the size of the computational grids considered in this paper, evaluation of the functions $Q_i(x)$ and their subgradients at a single $x$ sometimes is insufficient to make effective use of the available processors. Moreover, synchronous algorithms (those that depend for efficiency on all tasks completing in a timely fashion) run the risk of poor performance in a parallel environment in which failure or suspension of worker processors during computation is not infrequent. We are led therefore to asynchronous approaches that consider different points $x$ simultaneously. Asynchronous variants of the L-shaped and $l_\infty$ trust-region methods are described in Sections 2.2 and 4, respectively.

Other parallel algorithms for stochastic programming have been described by Birge et al. [3,4], Birge and Qi [6], Ruszczyński [21], and Frangière, Gondzio, and Vial [9]. In [3], the focus is on multistage problems in which the scenario tree is decomposed into subtrees, which are processed independently and in parallel on worker processors. Dual solutions from each subtree are used to construct a model of the first-stage objective (using an L-shaped approach like that described in Section 2), which is periodically solved by a master process to obtain a new first-stage iterate. Birge and Qi [6] describe an interior-point method for two-stage problems, in which the linear algebra operations are implemented in parallel by exploiting the structure of the two-stage problem.
However, this approach involves significant data movement and does not scale particularly well. In [9], the second-stage problems (2) are solved concurrently and inexactly by using an interior-point code. The master process maintains an upper bound on the optimal objective, and this information is used along with the approximate subgradients to construct an approximate truncated epigraph of the function. The analytic center of this epigraph is computed periodically to obtain a new iterate. The numerical results in [9] report solution of a two-stage stochastic linear program with 2.6 million variables and 1.2 million constraints in three hours on a cluster of 10 Linux PCs.

The approach that is perhaps closest to the ones we describe in this paper is that of Ruszczyński [21]. When applied to two-stage problems (1), this algorithm consists of processes that solve each second-stage problem (2) at the latest available value of $x$ to generate cuts, and a master process that solves a cutting-plane problem with the latest available cuts and a quadratic regularization term to generate new iterates $x$. The master process and second-stage processes can execute in parallel and share information asynchronously. This approach is essentially an asynchronous parallel version of the serial bundle-trust-region approaches described by Ruszczyński [20], Kiwiel [15], and Hiriart-Urruty and Lemaréchal [14, Chapter XV]. Algorithm ATR described in Section 4 likewise is an asynchronous parallel version of the bundle-trust-region method TR

of Section 3, although the asynchronicity in Algorithm ATR is more structured than that considered in [21]. In addition, $l_\infty$ trust regions take the place of quadratic regularization terms in both Algorithms TR and ATR. We discuss the relationships between all these methods further in later sections.

The remainder of this paper is structured as follows. Section 2 describes an L-shaped method and an asynchronous variant. Algorithm TR, a bundle-trust-region method with $l_\infty$ trust regions, is described and analyzed in Section 3, while its asynchronous cousin Algorithm ATR is described and analyzed in Section 4. Section 5 discusses computational grids and implementations of the algorithms on these platforms. Finally, computational results are given in Section 6.

2. L-shaped methods

We describe briefly a well-known variant of the L-shaped method for solving (4), together with an asynchronous variant.

2.1. The multicut L-shaped method

The L-shaped method of Van Slyke and Wets [25] for solving (4) proceeds by finding subgradients of partial sums of the terms that make up $Q$ in (3), together with linear inequalities that define the domain of $Q$. We sketch the approach here, and refer to Birge and Louveaux [5, Chapter 5] for a more complete description.

Suppose that the second-stage scenarios $1, 2, \ldots, N$ are partitioned into $C$ clusters denoted by $N_1, N_2, \ldots, N_C$. Let $Q_{[j]}$ represent the partial sum from (3) corresponding to the cluster $N_j$; that is,

  $Q_{[j]}(x) = \sum_{i \in N_j} p_i Q_i(x)$.    (7)

The algorithm maintains a model function $m^k_{[j]}$, which is a piecewise linear lower bound on $Q_{[j]}$ for each $j$. We define this function at iteration $k$ by

  $m^k_{[j]}(x) = \inf \{ \theta_j \; : \; \theta_j e \ge F^k_{[j]} x + f^k_{[j]} \}$,    (8)

where $e = (1, 1, \ldots, 1)^T$ and $F^k_{[j]}$ is a matrix whose rows are subgradients of $Q_{[j]}$ at previous iterates of the algorithm. The constraints in (8) are called optimality cuts.
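To make the cutting-plane machinery concrete, here is a toy serial sketch in the spirit of the method, not the paper's code: a single-cut simplification (one cluster), hypothetical two-scenario data with $Q(x) = \sum_i p_i |h_i - x|$ (which has complete recourse), SciPy's HiGHS solver, and artificial bounds on $x$ and $\theta$ to keep the first master solve bounded.

```python
import numpy as np
from scipy.optimize import linprog

# Toy data (ours): Q_i(x) = min{y1 + y2 : y1 - y2 = h_i - x, y >= 0} = |h_i - x|
P = [0.7, 0.3]
H = [1.0, 3.0]

def Q_and_subgrad(x):
    """Evaluate Q(x) = sum_i p_i |h_i - x| and one subgradient via LP duals."""
    val, g = 0.0, 0.0
    for p, h in zip(P, H):
        res = linprog([1.0, 1.0], A_eq=[[1.0, -1.0]], b_eq=[h - x],
                      bounds=[(0, None)] * 2, method="highs")
        val += p * res.fun
        g += p * (-res.eqlin.marginals[0])   # chain rule through h - x
    return val, g

def l_shaped(x0=0.0, tol=1e-6, max_iter=50):
    """Serial cutting-plane loop (single-cut simplification, c = 0)."""
    cuts = []                                # each cut: theta >= g*x + f
    Qx, g = Q_and_subgrad(x0)
    Qmin = Qx
    cuts.append((g, Qx - g * x0))
    for _ in range(max_iter):
        # master: min theta subject to all cuts, artificial box on (x, theta)
        A_ub = [[gc, -1.0] for gc, _ in cuts]
        b_ub = [-fc for _, fc in cuts]
        res = linprog([0.0, 1.0], A_ub=A_ub, b_ub=b_ub,
                      bounds=[(0.0, 10.0), (-1e6, None)], method="highs")
        x, model_val = res.x[0], res.fun
        if Qmin - model_val <= tol * (1.0 + abs(Qmin)):
            return x, Qmin                   # relative-gap stopping test
        Qx, g = Q_and_subgrad(x)
        Qmin = min(Qmin, Qx)
        cuts.append((g, Qx - g * x))
    return x, Qmin
```

On this instance the loop closes the gap in a handful of iterations, illustrating both the early large steps and the monotone improvement of the model.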
A subgradient $g_j$ of $Q_{[j]}$ is obtained from the dual solutions $\pi_i$ of (2) for each $i \in N_j$ as follows:

  $g_j = -\sum_{i \in N_j} p_i T_i^T \pi_i$;    (9)

see (5) and (6). An optimality cut is not added to the model $m^k_{[j]}$ if the model function takes on the same value as $Q_{[j]}$ at iteration $k$. Cuts may also be deleted

in the manner described below. The algorithm also maintains a collection of feasibility cuts of the form

  $D^k x \ge d^k$,    (10)

which have the effect of excluding values of $x$ for which some of the second-stage linear programs (2) are infeasible. By Farkas's theorem (see Mangasarian [17, p. 31]), if the constraints in (2) are infeasible, there exists $\pi_i$ with the following properties:

  $W^T \pi_i \le 0, \quad [h_i - T_i x]^T \pi_i > 0$.

(In fact, such a $\pi_i$ can be obtained from the dual simplex method for the feasibility problem for (2).) To exclude this $x$ from further consideration, we simply add the inequality

  $[h_i - T_i x]^T \pi_i \le 0$

to the constraint set (10). The $k$th iterate $x^k$ of the multicut L-shaped method is obtained by solving the following approximation to (4):

  $\min_x \; m^k(x)$, subject to $D^k x \ge d^k, \; Ax = b, \; x \ge 0$,    (11)

where

  $m^k(x) \stackrel{def}{=} c^T x + \sum_{j=1}^C m^k_{[j]}(x)$.    (12)

In practice, we substitute from (8) to obtain the following linear program:

  $\min_{x, \theta_1, \ldots, \theta_C} \; c^T x + \sum_{j=1}^C \theta_j$, subject to    (13a)
  $\theta_j e \ge F^k_{[j]} x + f^k_{[j]}, \; j = 1, 2, \ldots, C$,    (13b)
  $D^k x \ge d^k$,    (13c)
  $Ax = b, \; x \ge 0$.    (13d)

We make the following assumption for the remainder of the paper.

Assumption 1.
(i) The problem has complete recourse; that is, the feasible set of (2) is nonempty for all $i = 1, 2, \ldots, N$ and all $x$, so that the domain of $Q(x)$ in (3) is $\mathbb{R}^{n_1}$.
(ii) The solution set $S$ is nonempty.

Under this assumption, the feasibility cuts (10), (13c) are not present. Our algorithms and their analysis can be generalized to handle situations in which Assumption 1 does not hold, but for the sake of simplifying our analysis, we avoid discussing this more general case here. Under Assumption 1, we can specify the L-shaped algorithm formally as follows:

Algorithm LS
  choose tolerance $\epsilon_{tol}$;
  choose starting point $x^0$;
  define initial model $m^0$ to be a piecewise linear underestimate of $Q(x)$
    such that $m^0(x^0) = Q(x^0)$ and $m^0$ is bounded below;
  $Q_{min} \leftarrow Q(x^0)$;
  for $k = 0, 1, 2, \ldots$
    obtain $x^{k+1}$ by solving (11);
    if $Q_{min} - m^k(x^{k+1}) \le \epsilon_{tol} (1 + |Q_{min}|)$
      STOP;
    evaluate function and subgradient information at $x^{k+1}$;
    $Q_{min} \leftarrow \min(Q_{min}, Q(x^{k+1}))$;
    obtain $m^{k+1}$ by adding optimality cuts to $m^k$;
  end (for)

2.2. Asynchronous parallel variant of the L-shaped method

The L-shaped approach lends itself naturally to implementation in a master-worker framework. The problem (13) is solved by the master process, while solution of each cluster $N_j$ of second-stage problems, and generation of the associated cuts, can be carried out by the worker processes running in parallel. This approach can be adapted for an asynchronous, unreliable environment in which the results from some second-stage clusters are not returned in a timely fashion. Rather than having all the worker processors sit idle while waiting for the tardy results, we can proceed without them, re-solving the master by using the additional cuts that were generated by the other second-stage clusters.

We denote the model function simply by $m$ for the asynchronous algorithm, rather than appending a subscript. Whenever the time comes to generate a new iterate, the current model is used. In practice, we would expect the algorithm to give different results each time it is executed, because of the unpredictable speed and order in which the functions are evaluated and subgradients generated. Because of Assumption 1, we can write the subproblem as

  $\min_x \; m(x)$, subject to $Ax = b, \; x \ge 0$.    (14)

Algorithm ALS, the asynchronous variant of the L-shaped method that we describe here, is made up of four key operations, three of which execute on the master processor and one of which runs on the workers. These operations are as follows:

partial evaluate.
Worker routine for evaluating $Q_{[j]}(x)$ defined by (7) for a given $x$ and one or more of the clusters $j = 1, 2, \ldots, C$, in the process generating a subgradient $g_j$ of each $Q_{[j]}(x)$. This task runs on a worker processor and returns its results to the master by activating the routine act on completed task on the master processor.

evaluate. Master routine that places tasks of the type partial evaluate for a given $x$ into the task pool for distribution to the worker processors as

they become available. The completion of all these tasks leads to evaluation of $Q(x)$.

initialize. Master routine that performs initial bookkeeping, culminating in a call to evaluate for the initial point $x^0$.

act on completed task. Master routine, activated whenever the results become available from a partial evaluate task. It updates the model and increments a counter to keep track of the number of tasks that have been evaluated at each candidate point. When appropriate, it solves the master problem with the latest model to obtain a new candidate iterate and then calls evaluate.

In our implementation of both this algorithm and its more sophisticated cousin Algorithm ATR of Section 4, a single task consists of the evaluation of one or more clusters $N_j$. We may bundle, say, 2 or 4 clusters into a single computational task, to make the task large enough to justify the master's effort in packing its data and unpacking its results, and to maintain the ratio of compute time to communication cost at a high level. We use $T$ to denote the number of computational tasks, and $T_r$, $r = 1, 2, \ldots, T$, to denote a partitioning of $\{1, 2, \ldots, C\}$, so that task $r$ consists of evaluation of the clusters $j \in T_r$.

The implementation depends on a synchronicity parameter $\sigma$, which is the proportion of tasks that must be evaluated at a point to trigger the generation of a new candidate iterate. Typical values of $\sigma$ are in the range 0.25 to 0.9. A logical variable $speceval_k$ keeps track of whether $x^k$ has yet triggered a new candidate. Initially, $speceval_k$ is set to false; it is set to true when the proportion of evaluated clusters passes the threshold $\sigma$. We now specify all the methods making up Algorithm ALS.

ALS: partial evaluate($x^q$, $q$, $r$)
  Given $x^q$, index $q$, and task number $r$, evaluate $Q_{[j]}(x^q)$ from (7) for all $j \in T_r$,
    together with partial subgradients $g_j$ from (9);
  Activate act on completed task($x^q$, $q$, $r$) on the master processor.
ALS: evaluate($x^q$, $q$)
  for $r = 1, 2, \ldots, T$ (possibly concurrently)
    partial evaluate($x^q$, $q$, $r$);
  end (for)

ALS: initialize
  determine the number of clusters $C$ and number of tasks $T$, and the partitions
    $N_1, N_2, \ldots, N_C$ and $T_1, T_2, \ldots, T_T$;
  choose tolerance $\epsilon_{tol}$;
  choose starting point $x^0$;
  choose threshold $\sigma \in (0, 1]$;
  $Q_{min} \leftarrow \infty$; $k \leftarrow 0$; $speceval_0 \leftarrow$ false; $t_0 \leftarrow 0$;
  evaluate($x^0$, 0).
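The counter-and-threshold bookkeeping of these routines can be mimicked in a few lines. The following is a hypothetical skeleton (ours, not the MW implementation): threads stand in for Condor workers, no model update or LP solve is performed, and the point is only that each candidate $x^q$ triggers at most one new iterate once the fraction $\sigma$ of its $T$ tasks has returned.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ALSBookkeeping:
    """Sketch of the master-side counters t_q and speceval_q; a real master
    would also add cuts to the model m and re-solve (14) on each trigger."""
    def __init__(self, num_tasks, sigma):
        self.T, self.sigma = num_tasks, sigma
        self.lock = threading.Lock()
        self.t = {}           # q -> completed-task count for candidate x^q
        self.speceval = {}    # q -> whether x^q already triggered a candidate
        self.triggers = []    # order in which new candidates were triggered

    def act_on_completed_task(self, q, r):
        with self.lock:       # master actions are serialized
            self.t[q] = self.t.get(q, 0) + 1
            if (self.t[q] >= self.sigma * self.T
                    and not self.speceval.get(q, False)):
                self.speceval[q] = True
                self.triggers.append(q)   # here: solve (14), call evaluate

def evaluate(master, q):
    """Launch all T partial evaluations of x^q concurrently (no real work)."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        for r in range(master.T):
            pool.submit(master.act_on_completed_task, q, r)
```

The `speceval` guard is what keeps late-arriving tasks for an already-processed point from spawning duplicate candidates.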

ALS: act on completed task($x^q$, $q$, $r$)
  $t_q \leftarrow t_q + 1$;
  for each $j \in T_r$
    add $Q_{[j]}(x^q)$ and cut $g_j$ to the model $m$;
  if $t_q = T$
    $Q_{min} \leftarrow \min(Q_{min}, Q(x^q))$;
  else if $t_q \ge \sigma T$ and not $speceval_q$
    $speceval_q \leftarrow$ true;
    $k \leftarrow k + 1$;
    solve current model problem (14) to obtain $x^{k+1}$;
    if $Q_{min} - m(x^{k+1}) \le \epsilon_{tol} (1 + |Q_{min}|)$
      STOP;
    evaluate($x^{k+1}$, $k+1$);
  end (if)

We present results for Algorithm ALS in Section 6. While the algorithm is able to use a large number of worker processors on our opportunistic platform, it suffers from the usual drawbacks of the L-shaped method, namely, that cuts, once generated, must be retained for the remainder of the computation to ensure convergence, and that large steps are typically taken on early iterations before a sufficiently good model approximation to $Q(x)$ is created, making it impossible to exploit prior knowledge about the location of the solution.

3. A bundle-trust-region method

Trust-region approaches can be implemented by making only minor modifications to implementations of the L-shaped method, and they possess several practical advantages along with stronger convergence properties. The trust-region methods we describe here are related to the regularized decomposition method of Ruszczyński [20] and the bundle-trust-region approaches of Kiwiel [15] and Hiriart-Urruty and Lemaréchal [14, Chapter XV]. The main differences are that we use box-shaped trust regions yielding linear programming subproblems (rather than quadratic programs) and that our methods manipulate the size of the trust region directly rather than indirectly via a regularization parameter. We discuss these differences further in Section 3.3.

When requesting a subgradient of $Q$ at some point $x$, our algorithms do not require particular (e.g., extreme) elements of the subdifferential to be supplied. Nor do they require the subdifferential $\partial Q(x)$ to be representable as a convex combination of a finite number of vectors.
In this respect, our algorithms contrast with that of Ruszczyński [20], for instance, which exploits the piecewise-linear nature of the objectives $Q_i$ in (2). Because of our weaker conditions on the subgradient information, we cannot prove a finite termination result of the type presented in [20, Section 3]. However, these conditions potentially allow our algorithms to be extended to a more general class of convex nondifferentiable functions.

3.1. A method based on $l_\infty$ trust regions

A key difference between the trust-region approach of this section and the L-shaped method of the preceding section is that we impose an $l_\infty$-norm bound on the size of the step. It is implemented by simply adding bound constraints to the linear programming subproblem (13) as follows:

  $-\Delta e \le x - x^k \le \Delta e$,    (15)

where $e = (1, 1, \ldots, 1)^T$, $\Delta$ is the trust-region radius, and $x^k$ is the current iterate. During the $k$th iteration, it may be necessary to solve several problems with trust regions of the form (15), with different model functions $m$ and possibly different values of $\Delta$, before a satisfactory new iterate $x^{k+1}$ is identified. We refer to $x^k$ and $x^{k+1}$ as major iterates, and the points $x^{k,l}$, $l = 0, 1, 2, \ldots$, obtained by minimizing the current model function subject to the constraints and trust-region bounds of the form (15), as minor iterates.

Another key difference between the trust-region approach and the L-shaped approach is that a minor iterate $x^{k,l}$ is accepted as the new major iterate $x^{k+1}$ only if it yields a substantial reduction in the objective function $Q$ over the previous iterate $x^k$, in a sense to be defined below. A further important difference is that one can delete optimality cuts from the model functions, between minor and major iterations, without compromising the convergence properties of the algorithm.

To specify the method, we need to augment the notation established in the previous section. We define $m_{k,l}(x)$ to be the model function after $l$ minor iterations have been performed at iteration $k$, and $\Delta_{k,l} > 0$ to be the trust-region radius at the same stage. Under Assumption 1, there are no feasibility cuts, so the problem to be solved to obtain the minor iterate $x^{k,l}$ is as follows:

  $\min_x \; m_{k,l}(x)$ subject to $Ax = b, \; x \ge 0, \; \|x - x^k\|_\infty \le \Delta_{k,l}$    (16)

(cf. (11)).
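Because (15) is just a box, an LP solver absorbs the trust region as ordinary variable bounds at essentially no extra cost. A small helper (ours, assuming the nonnegativity constraint $x \ge 0$ of (16)) makes the point:

```python
def tr_bounds(x_k, delta):
    """Combine x >= 0 with the l_inf trust region (15), giving per-variable
    bounds max(0, x_k[i] - delta) <= x_i <= x_k[i] + delta."""
    return [(max(0.0, xi - delta), xi + delta) for xi in x_k]
```

The resulting list can be passed directly as the `bounds` argument of an LP interface such as SciPy's `linprog`, so the trust-region subproblem (16) has the same structure and cost as the unrestricted master (13).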
By expanding this problem in a similar fashion to (13), we obtain

  $\min_{x, \theta_1, \ldots, \theta_C} \; c^T x + \sum_{j=1}^C \theta_j$, subject to    (17a)
  $\theta_j e \ge F^{k,l}_{[j]} x + f^{k,l}_{[j]}, \; j = 1, 2, \ldots, C$,    (17b)
  $Ax = b, \; x \ge 0$,    (17c)
  $-\Delta_{k,l} e \le x - x^k \le \Delta_{k,l} e$.    (17d)

We assume the initial model $m_{k,0}$ at major iteration $k$ to satisfy the following two properties:

  $m_{k,0}(x^k) = Q(x^k)$,    (18a)
  $m_{k,0}$ is a convex, piecewise linear underestimate of $Q$.    (18b)

Denoting the solution of the subproblem (17) by $x^{k,l}$, we accept this point as the new iterate $x^{k+1}$ if the decrease in the actual objective $Q$ (see (4)) is at

least some fraction of the decrease predicted by the model function $m_{k,l}$. That is, for some constant $\xi \in (0, 1/2)$, the acceptance test is

  $Q(x^{k,l}) \le Q(x^k) - \xi \left( Q(x^k) - m_{k,l}(x^{k,l}) \right)$.    (19)

(A typical value for $\xi$ is $10^{-4}$.) If the test (19) fails to hold, we obtain a new model function $m_{k,l+1}$ by adding and possibly deleting cuts from $m_{k,l}(x)$. This process aims to refine the model function, so that it eventually generates a new major iteration, while economizing on storage by allowing deletion of subgradients that no longer seem helpful. Addition and deletion of cuts are implemented by adding and deleting rows from $F^{k,l}_{[j]}$ and $f^{k,l}_{[j]}$, to obtain $F^{k,l+1}_{[j]}$ and $f^{k,l+1}_{[j]}$, for $j = 1, 2, \ldots, C$. Given some parameter $\eta \in (\xi, 1)$, we obtain $m_{k,l+1}$ from $m_{k,l}$ by means of the following procedure:

Procedure Model-Update ($k$, $l$)
  for each optimality cut
    possible delete $\leftarrow$ true;
    if the cut was generated at $x^k$
      possible delete $\leftarrow$ false;
    else if the cut is active at the solution of (17) with positive Lagrange multiplier
      possible delete $\leftarrow$ false;
    else if the cut was generated at an earlier minor iteration $\bar{l} = 0, 1, \ldots, l-1$ such that

      $Q(x^k) - m_{k,l}(x^{k,l}) > \eta \left[ Q(x^k) - m_{k,\bar{l}}(x^{k,\bar{l}}) \right]$    (20)

      possible delete $\leftarrow$ false;
    end (if)
    if possible delete
      possibly delete the cut;
  end (for each)
  add optimality cuts obtained from each of the component functions $Q_{[j]}$ at $x^{k,l}$.

In our implementation, we delete the cut if possible delete is true at the final conditional statement and, in addition, the cut has not been active during the last 100 solutions of (17). More details are given in Section 6.2.

Because we retain all cuts generated at $x^k$ during the course of major iteration $k$, the following extension of (18a) holds:

  $m_{k,l}(x^k) = Q(x^k), \quad l = 0, 1, 2, \ldots$.    (21)

Since we add only subgradient information, the following generalization of (18b) also holds uniformly:

  $m_{k,l}$ is a convex, piecewise linear underestimate of $Q$, for $l = 0, 1, 2, \ldots$.    (22)
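The acceptance logic translates compactly into code. The following hedged sketch (function and parameter names are ours) implements test (19) together with the radius-increase rule that the paper states later as (25)-(26): on an accepted step that achieves at least half the predicted decrease and lands on the trust-region boundary, the radius is doubled up to the cap $\Delta_{hi}$.

```python
def accept_and_update_radius(Q_k, Q_kl, model_kl, step_norm, delta, delta_hi,
                             xi=1e-4):
    """Acceptance test (19); on a strong boundary-hitting step, double the
    radius up to delta_hi in the manner of rules (25)-(26)."""
    pred = Q_k - model_kl                   # predicted decrease (>= 0)
    accepted = Q_kl <= Q_k - xi * pred      # test (19)
    if accepted and Q_kl <= Q_k - 0.5 * pred and step_norm == delta:
        delta = min(delta_hi, 2.0 * delta)  # rule (26)
    return accepted, delta
```

Here `step_norm` is the $l_\infty$ distance $\|x^k - x^{k,l}\|_\infty$, so the boundary condition of (25) is the equality `step_norm == delta`.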

We may also decrease the trust-region radius $\Delta_{k,l}$ between minor iterations (that is, choose $\Delta_{k,l+1} < \Delta_{k,l}$) when the test (19) fails to hold. We do so if the match between model and objective appears to be particularly poor, adapting the procedure of Kiwiel [15, p. 109] for increasing the coefficient of the quadratic penalty term in his regularized bundle method. If $Q(x^{k,l})$ exceeds $Q(x^k)$ by more than an estimate of the quantity

  $\max_{\|x - x^k\| \le 1} \left[ Q(x^k) - Q(x) \right]$,    (23)

we conclude that the upside variation of the function $Q$ deviates too much from its downside variation, and we reduce the trust-region radius $\Delta_{k,l+1}$ so as to bring these quantities more nearly into line. Our estimate of (23) is simply

  $\frac{1}{\min(1, \Delta_{k,l})} \left[ Q(x^k) - m_{k,l}(x^{k,l}) \right]$,

that is, an extrapolation of the model reduction on the current trust region to a trust region of radius 1. Our complete strategy for reducing $\Delta$ is therefore as follows. (The counter is initialized to zero at the start of each major iteration.)

Procedure Reduce-$\Delta$
  evaluate

    $\rho = \min(1, \Delta_{k,l}) \, \frac{Q(x^{k,l}) - Q(x^k)}{Q(x^k) - m_{k,l}(x^{k,l})}$;    (24)

  if $\rho > 0$
    counter $\leftarrow$ counter + 1;
  if $\rho > 3$ or (counter $\ge 3$ and $\rho \in (1, 3]$)
    set $\Delta_{k,l+1} = \frac{1}{\min(\rho, 4)} \Delta_{k,l}$;
    reset counter $\leftarrow 0$;

If the test (19) is passed, so that we have $x^{k+1} = x^{k,l}$, we have a great deal of flexibility in defining the new model function $m_{k+1,0}$. We require only that the properties (18) are satisfied, with $k+1$ replacing $k$. Hence, we are free to delete much of the optimality cut information accumulated at iteration $k$ (and previous iterates). In practice, of course, it is wise to delete only those cuts that have been inactive for a substantial number of iterations; otherwise we run the risk that many new function and subgradient evaluations will be required to restore useful model information that was deleted prematurely.
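Procedure Reduce-$\Delta$ translates almost line for line; the sketch below (names ours) takes the current function values and state and returns the updated radius and counter:

```python
def reduce_delta(Q_k, Q_kl, model_kl, delta, counter):
    """One application of Procedure Reduce-Delta: compute rho as in (24),
    bump the counter on any increase in Q, and shrink the radius by a
    factor min(rho, 4) when the upside variation is badly out of line."""
    rho = min(1.0, delta) * (Q_kl - Q_k) / (Q_k - model_kl)
    if rho > 0:
        counter += 1
    if rho > 3 or (counter >= 3 and 1 < rho <= 3):
        delta = delta / min(rho, 4.0)
        counter = 0
    return delta, counter
```

Capping the shrink factor at 4 is what guarantees, in Lemma 3 below, that the radius never falls more than a factor of 4 past the last radius at which $\rho > 1$ occurred.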
If the step to the new major iterate $x^{k+1}$ shows a particularly close match between the true function $Q$ and the model function $m_{k,l}$ at the last minor iteration of iteration $k$, we consider increasing the trust-region radius. Specifically, if

  $Q(x^{k,l}) \le Q(x^k) - 0.5 \left( Q(x^k) - m_{k,l}(x^{k,l}) \right), \quad \|x^k - x^{k,l}\|_\infty = \Delta_{k,l}$,    (25)

then we set

  $\Delta_{k+1,0} = \min( \Delta_{hi}, 2 \Delta_{k,l} )$,    (26)

where $\Delta_{hi}$ is a prespecified upper bound on the radius.

Before specifying the algorithm formally, we define the convergence test. Given a parameter $\epsilon_{tol} > 0$, we terminate if

  $Q(x^k) - m_{k,l}(x^{k,l}) \le \epsilon_{tol} \left( 1 + |Q(x^k)| \right)$.    (27)

Algorithm TR
  choose $\xi \in (0, 1/2)$, cut deletion parameter $\eta \in (\xi, 1)$, maximum trust region $\Delta_{hi}$,
    tolerance $\epsilon_{tol}$;
  choose starting point $x^0$;
  define initial model $m_{0,0}$ with the properties (18) (for $k = 0$);
  choose $\Delta_{0,0} \in [1, \Delta_{hi}]$;
  for $k = 0, 1, 2, \ldots$
    finishedminoriteration $\leftarrow$ false;
    $l \leftarrow 0$; counter $\leftarrow 0$;
    repeat
      solve (16) to obtain $x^{k,l}$;
      if (27) is satisfied
        STOP with approximate solution $x^k$;
      evaluate function and subgradient at $x^{k,l}$;
      if (19) is satisfied
        set $x^{k+1} = x^{k,l}$;
        obtain $m_{k+1,0}$ by possibly deleting cuts from $m_{k,l}$, but retaining the
          properties (18) (with $k+1$ replacing $k$);
        choose $\Delta_{k+1,0} \in [\Delta_{k,l}, \Delta_{hi}]$ according to (25), (26);
        finishedminoriteration $\leftarrow$ true;
      else
        obtain $m_{k,l+1}$ from $m_{k,l}$ via Procedure Model-Update ($k$, $l$);
        obtain $\Delta_{k,l+1}$ via Procedure Reduce-$\Delta$;
        $l \leftarrow l + 1$;
    until finishedminoriteration
  end (for)

3.2. Analysis of the trust-region method

We now describe the convergence properties of Algorithm TR. We show that for $\epsilon_{tol} = 0$, the algorithm either terminates at a solution or generates a sequence of major iterates $x^k$ that approaches the solution set $S$ (Theorem 2). Given some starting point $x^0$ satisfying the constraints $Ax^0 = b$, $x^0 \ge 0$, and setting $Q_0 = Q(x^0)$, we define the following quantities that are useful in describing and analyzing the algorithm:

  $L(Q_0) = \{ x \; : \; Ax = b, \; x \ge 0, \; Q(x) \le Q_0 \}$,    (28)
  $L(Q_0; \Delta) = \{ x \; : \; Ax = b, \; x \ge 0, \; \|x - y\|_\infty \le \Delta$ for some $y \in L(Q_0) \}$,    (29)
  $\beta = \sup \{ \|g\|_1 \; : \; g \in \partial Q(x)$, for some $x \in L(Q_0; \Delta_{hi}) \}$.    (30)

Using Assumption 1, we can easily show that $\beta < \infty$. Note that

  $Q(x) - Q^* \le g^T (x - P(x))$, for all $x \in L(Q_0; \Delta_{hi})$, all $g \in \partial Q(x)$,

so that

  $Q(x) - Q^* \le \|g\|_1 \, \|x - P(x)\|_\infty \le \beta \, \|x - P(x)\|_\infty$.    (31)

We start by showing that the optimal objective value for (16) cannot decrease from one minor iteration to the next.

Lemma 1. Suppose that $x^{k,l}$ does not satisfy the acceptance test (19). Then we have $m_{k,l}(x^{k,l}) \le m_{k,l+1}(x^{k,l+1})$.

Proof. In obtaining $m_{k,l+1}$ from $m_{k,l}$ in Model-Update, we do not allow deletion of cuts that were active at the solution $x^{k,l}$ of (17). Using $\bar{F}^{k,l}_{[j]}$ and $\bar{f}^{k,l}_{[j]}$ to denote the active rows in $F^{k,l}_{[j]}$ and $f^{k,l}_{[j]}$, we have that $x^{k,l}$ is also the solution of the following linear program (in which the inactive cuts are not present):

  $\min_{x, \theta_1, \ldots, \theta_C} \; c^T x + \sum_{j=1}^C \theta_j$, subject to    (32a)
  $\theta_j e \ge \bar{F}^{k,l}_{[j]} x + \bar{f}^{k,l}_{[j]}, \; j = 1, 2, \ldots, C$,    (32b)
  $Ax = b, \; x \ge 0$,    (32c)
  $-\Delta_{k,l} e \le x - x^k \le \Delta_{k,l} e$.    (32d)

The subproblem to be solved for $x^{k,l+1}$ differs from (32) in two ways. First, additional rows may be added to $\bar{F}^{k,l}_{[j]}$ and $\bar{f}^{k,l}_{[j]}$, consisting of function values and subgradients obtained at $x^{k,l}$ and also inactive cuts carried over from the previous (17). Second, the trust-region radius $\Delta_{k,l+1}$ may be smaller than $\Delta_{k,l}$. Hence, the feasible region of the problem to be solved for $x^{k,l+1}$ is a subset of the feasible region for (32), so the optimal objective value cannot be smaller.

Next we have a result about the amount of reduction in the model function $m_{k,l}$.

Lemma 2. For all $k = 0, 1, 2, \ldots$ and $l = 0, 1, 2, \ldots$, we have that

  $m_{k,l}(x^k) - m_{k,l}(x^{k,l}) = Q(x^k) - m_{k,l}(x^{k,l})$
    $\ge \min\left( \frac{\Delta_{k,l}}{\|x^k - P(x^k)\|_\infty}, 1 \right) \left[ Q(x^k) - Q^* \right]$.    (33)

Proof. The first equality follows immediately from (21). To prove (33), consider the following subproblem in the scalar $\tau$:

  $\min_{\tau \in [0,1]} m_{k,l}\left( x^k + \tau [P(x^k) - x^k] \right)$ subject to $\tau \|P(x^k) - x^k\|_\infty \le \Delta_{k,l}$.    (34)

Denoting the solution of this problem by $\tau_{k,l}$, we have by comparison with (16) that

  $m_{k,l}(x^{k,l}) \le m_{k,l}\left( x^k + \tau_{k,l} [P(x^k) - x^k] \right)$.    (35)

If $\tau = 1$ is feasible in (34), we have from (35) and (22) that

  $m_{k,l}(x^{k,l}) \le m_{k,l}\left( x^k + \tau_{k,l} [P(x^k) - x^k] \right) \le m_{k,l}\left( x^k + [P(x^k) - x^k] \right)$
    $= m_{k,l}(P(x^k)) \le Q(P(x^k)) = Q^*$.

Hence, we have from (21) that $m_{k,l}(x^k) - m_{k,l}(x^{k,l}) \ge Q(x^k) - Q^*$, so that (33) holds in this case.

When $\tau = 1$ is infeasible for (34), consider setting $\tau = \Delta_{k,l} / \|x^k - P(x^k)\|_\infty$ (which is certainly feasible for (34)). We have from (35), the definition of $\tau_{k,l}$, (22), and convexity of $Q$ that

  $m_{k,l}(x^{k,l}) \le m_{k,l}\left( x^k + \frac{\Delta_{k,l}}{\|P(x^k) - x^k\|_\infty} [P(x^k) - x^k] \right)$
    $\le Q\left( x^k + \frac{\Delta_{k,l}}{\|P(x^k) - x^k\|_\infty} [P(x^k) - x^k] \right)$
    $\le Q(x^k) + \frac{\Delta_{k,l}}{\|P(x^k) - x^k\|_\infty} \left( Q^* - Q(x^k) \right)$.

Therefore, using (21), we have

  $m_{k,l}(x^k) - m_{k,l}(x^{k,l}) \ge \frac{\Delta_{k,l}}{\|P(x^k) - x^k\|_\infty} \left[ Q(x^k) - Q^* \right]$,

verifying (33) in this case as well.

Our next result finds a lower bound on the trust-region radii $\Delta_{k,l}$. For purposes of this result we define a quantity $E_k$ to measure the closest approach to the solution set for all iterates up to and including $x^k$, that is,

  $E_k \stackrel{def}{=} \min_{\bar{k} = 0, 1, \ldots, k} \|x^{\bar{k}} - P(x^{\bar{k}})\|_\infty$.    (36)

Note that $E_k$ decreases monotonically with $k$. We also define $F_k$ as follows:

  $F_k \stackrel{def}{=} \min_{\bar{k} = 0, 1, \ldots, k, \; x^{\bar{k}} \notin S} \frac{Q(x^{\bar{k}}) - Q^*}{\|x^{\bar{k}} - P(x^{\bar{k}})\|_\infty}$,    (37)

with the convention that $F_k = 0$ if $x^{\bar{k}} \in S$ for any $\bar{k} \le k$. By monotonicity of $\{Q(x^k)\}$, we have $F_k > 0$ whenever $x^k \notin S$. Note also from (31) and the fact that $x^{\bar{k}} \in L(Q_0; \Delta_{hi})$ that

  $F_k \le \beta, \quad k = 0, 1, 2, \ldots$.    (38)

Lemma 3. For all trust regions $\Delta_{k,l}$ used in the course of Algorithm TR, we have

  $\Delta_{k,l} \ge (1/4) \min( E_k, F_k / \beta )$,    (39)

for $\beta$ defined in (30).

Proof. Suppose for contradiction that there are indices $k$ and $l$ such that $\Delta_{k,l} < (1/4) \min( E_k, F_k / \beta )$. Since the trust region can be reduced by at most a factor of 4 by Procedure Reduce-$\Delta$, there must be an earlier trust-region radius $\Delta_{\bar{k},\bar{l}}$ (with $\bar{k} \le k$) such that

  $\Delta_{\bar{k},\bar{l}} < \min( E_k, F_k / \beta )$,    (40)

and $\rho > 1$ in (24), that is,

  $Q(x^{\bar{k},\bar{l}}) - Q(x^{\bar{k}}) > \frac{1}{\min(1, \Delta_{\bar{k},\bar{l}})} \left( Q(x^{\bar{k}}) - m_{\bar{k},\bar{l}}(x^{\bar{k},\bar{l}}) \right) = \frac{1}{\Delta_{\bar{k},\bar{l}}} \left( Q(x^{\bar{k}}) - m_{\bar{k},\bar{l}}(x^{\bar{k},\bar{l}}) \right)$,    (41)

where we used (38) in (40) to deduce that $\Delta_{\bar{k},\bar{l}} < 1$. By applying Lemma 2, and using (40) again, we have

  $Q(x^{\bar{k}}) - m_{\bar{k},\bar{l}}(x^{\bar{k},\bar{l}}) \ge \min\left( \frac{\Delta_{\bar{k},\bar{l}}}{\|x^{\bar{k}} - P(x^{\bar{k}})\|_\infty}, 1 \right) \left[ Q(x^{\bar{k}}) - Q^* \right] = \frac{\Delta_{\bar{k},\bar{l}}}{\|x^{\bar{k}} - P(x^{\bar{k}})\|_\infty} \left[ Q(x^{\bar{k}}) - Q^* \right]$,    (42)

where the last equality follows from $\|x^{\bar{k}} - P(x^{\bar{k}})\|_\infty \ge E_{\bar{k}} \ge E_k > \Delta_{\bar{k},\bar{l}}$. By combining (42) with (41), we have that

  $Q(x^{\bar{k},\bar{l}}) - Q(x^{\bar{k}}) > \frac{Q(x^{\bar{k}}) - Q^*}{\|x^{\bar{k}} - P(x^{\bar{k}})\|_\infty} \ge F_{\bar{k}} \ge F_k$.    (43)

By using standard properties of subgradients, we have

  $Q(x^{\bar{k},\bar{l}}) - Q(x^{\bar{k}}) \le g_{\bar{l}}^T (x^{\bar{k},\bar{l}} - x^{\bar{k}}) \le \|g_{\bar{l}}\|_1 \|x^{\bar{k},\bar{l}} - x^{\bar{k}}\|_\infty \le \|g_{\bar{l}}\|_1 \Delta_{\bar{k},\bar{l}}$, for all $g_{\bar{l}} \in \partial Q(x^{\bar{k},\bar{l}})$.    (44)

By combining this expression with (43), and using (40) again, we obtain that

  $\|g_{\bar{l}}\|_1 \ge \frac{1}{\Delta_{\bar{k},\bar{l}}} \left[ Q(x^{\bar{k},\bar{l}}) - Q(x^{\bar{k}}) \right] \ge \frac{1}{\Delta_{\bar{k},\bar{l}}} F_k > \beta$.

However, since $x^{\bar{k},\bar{l}} \in L(Q_0; \Delta_{hi})$, we have from (30) that $\|g_{\bar{l}}\|_1 \le \beta$, giving a contradiction.

Finite termination of the inner iterations is proved in the following two results. Recall that the parameters $\xi$ and $\eta$ are defined in (19) and (20), respectively.

Lemma 4. Let $\epsilon_{\mathrm{tol}} = 0$ in Algorithm TR, and let $\xi$ and $\eta$ be the constants from (19) and (20), respectively. Let $l_1$ be any index such that $x_{k,l_1}$ fails to satisfy the test (19). Then either the sequence of inner iterations eventually yields a point $x_{k,l_2}$ satisfying the acceptance test (19), or there is an index $l_2 > l_1$ such that

$Q(x_k) - m_{k,l_2}(x_{k,l_2}) \le \eta\,[Q(x_k) - m_{k,l_1}(x_{k,l_1})]$.  (45)

Proof. Suppose for contradiction that none of the minor iterations following $l_1$ satisfies either (19) or the criterion (45); that is,

$Q(x_k) - m_{k,q}(x_{k,q}) > \eta\,[Q(x_k) - m_{k,l_1}(x_{k,l_1})]$, for all $q > l_1$.  (46)

It follows from this bound, together with Lemma 1 and Procedure Model-Update, that none of the cuts generated at minor iterations $q \ge l_1$ is deleted. We assume in the remainder of the proof that $q$ and $l$ are generic minor iteration indices that satisfy $q > l \ge l_1$.

Because the function and subgradient information from minor iterations $x_{k,l}$, $l = l_1, l_1+1, \dots$, is retained throughout the major iteration $k$, we have

$m_{k,q}(x_{k,l}) = Q(x_{k,l})$.  (47)

By definition of the subgradient, we have

$m_{k,q}(x) - m_{k,q}(x_{k,l}) \ge g^T(x - x_{k,l})$, for all $g \in \partial m_{k,q}(x_{k,l})$.  (48)

Therefore, from (22) and (47), it follows that

$Q(x) - Q(x_{k,l}) \ge g^T(x - x_{k,l})$, for all $g \in \partial m_{k,q}(x_{k,l})$,

so that

$\partial m_{k,q}(x_{k,l}) \subset \partial Q(x_{k,l})$.  (49)

Since $Q(x_k) < Q(x_0) = Q_0$, we have from (28) that $x_k \in \mathcal{L}(Q_0)$. Therefore, from the definition (29) and the fact that $\|x_{k,l} - x_k\|_\infty \le \Delta_{k,l} \le \Delta_{\mathrm{hi}}$, we have that $x_{k,l} \in \mathcal{L}(Q_0; \Delta_{\mathrm{hi}})$. It follows from (30) and (49) that

$\|g\|_1 \le \beta$, for all $g \in \partial m_{k,q}(x_{k,l})$.  (50)

Since $x_{k,l}$ is rejected by the test (19), we have from (47) and Lemma 1 that the following inequalities hold:

$m_{k,q}(x_{k,l}) = Q(x_{k,l}) \ge Q(x_k) - \xi\,[Q(x_k) - m_{k,l}(x_{k,l})] \ge Q(x_k) - \xi\,[Q(x_k) - m_{k,l_1}(x_{k,l_1})]$.

By rearranging this expression, we obtain

$Q(x_k) - m_{k,q}(x_{k,l}) \le \xi\,[Q(x_k) - m_{k,l_1}(x_{k,l_1})]$.  (51)

Recalling that $\eta \in (\xi, 1)$, we consider the following neighborhood of $x_{k,l}$:

$\|x - x_{k,l}\|_\infty \le \frac{\eta - \xi}{\beta}\,[Q(x_k) - m_{k,l_1}(x_{k,l_1})] \stackrel{\mathrm{def}}{=} \zeta > 0$.  (52)

Using this bound together with (48) and (50), we obtain

$m_{k,q}(x_{k,l}) - m_{k,q}(x) \le g^T(x_{k,l} - x) \le \beta\,\|x_{k,l} - x\|_\infty \le (\eta - \xi)\,[Q(x_k) - m_{k,l_1}(x_{k,l_1})]$.

By combining this bound with (51), we find that the following bound is satisfied for all $x$ in the neighborhood (52):

$Q(x_k) - m_{k,q}(x) = [Q(x_k) - m_{k,q}(x_{k,l})] + [m_{k,q}(x_{k,l}) - m_{k,q}(x)] \le \eta\,[Q(x_k) - m_{k,l_1}(x_{k,l_1})]$.

It follows from this bound, in conjunction with (46), that $x_{k,q}$ (the solution of the trust-region problem with model function $m_{k,q}$) cannot lie in the neighborhood (52). Therefore, we have

$\|x_{k,q} - x_{k,l}\|_\infty > \zeta$.  (53)

But since $\|x_{k,l} - x_k\|_\infty \le \Delta_{k,l} \le \Delta_{\mathrm{hi}}$ for all $l \ge l_1$, it is impossible for an infinite sequence $\{x_{k,l}\}_{l \ge l_1}$ to satisfy (53). We conclude that (45) must hold for some $l_2 > l_1$, as claimed. ∎

We now show that the minor iteration sequence terminates at a point $x_{k,l}$ satisfying the acceptance test, provided that $x_k$ is not a solution.

Theorem 1. Suppose that $\epsilon_{\mathrm{tol}} = 0$.
(i) If $x_k \notin S$, there is an $l \ge 0$ such that $x_{k,l}$ satisfies (19).
(ii) If $x_k \in S$, then either Algorithm TR terminates (and verifies that $x_k \in S$), or $Q(x_k) - m_{k,l}(x_{k,l}) \to 0$.

Proof. Suppose for the moment that the inner iteration sequence is infinite, that is, the test (19) always fails. By applying Lemma 4 recursively, we can identify a sequence of indices $0 < l_1 < l_2 < \dots$ such that

$Q(x_k) - m_{k,l_j}(x_{k,l_j}) \le \eta\,[Q(x_k) - m_{k,l_{j-1}}(x_{k,l_{j-1}})] \le \eta^2\,[Q(x_k) - m_{k,l_{j-2}}(x_{k,l_{j-2}})] \le \dots \le \eta^j\,[Q(x_k) - m_{k,0}(x_{k,0})]$.  (54)

When $x_k \notin S$, we have from Lemma 3 that

$\Delta_{k,l} \ge (1/4)\min(E_k,\, F_k/\beta) \stackrel{\mathrm{def}}{=} \Delta_{\mathrm{lo}} > 0$, for all $l = 0, 1, 2, \dots$,

so the right-hand side of (33) is uniformly positive (independently of $l$). However, (54) indicates that we can make $Q(x_k) - m_{k,l_j}(x_{k,l_j})$ arbitrarily small by choosing $j$ sufficiently large, contradicting (33).

For the case of $x_k \in S$, there are two possibilities. If the inner iteration sequence terminates finitely at some $x_{k,l}$, we must have $Q(x_k) - m_{k,l}(x_{k,l}) = 0$. Hence, from (22), we have

$Q(x) \ge m_{k,l}(x) \ge Q(x_k) = Q^*$, for all feasible $x$ with $\|x - x_k\|_\infty \le \Delta_{k,l}$.

Therefore, termination under these circumstances yields a guarantee that $x_k \in S$. When the algorithm does not terminate, it follows from (54) that $Q(x_k) - m_{k,l}(x_{k,l}) \to 0$. By applying Lemma 1, we verify that the convergence is monotonic. ∎

We now prove convergence of Algorithm TR to $S$.

Theorem 2. Suppose that $\epsilon_{\mathrm{tol}} = 0$. The sequence of major iterates $\{x_k\}$ is either finite, terminating at some $x_k \in S$, or infinite with the property that $\|x_k - P(x_k)\|_\infty \to 0$.

Proof. If the claim does not hold, there are two possibilities. The first is that the sequence of major iterations terminates finitely at some $x_k \notin S$. However, Theorem 1 ensures that the minor iteration sequence will terminate at some new major iterate $x_{k+1}$ under these circumstances, so we can rule out this possibility. The second possibility is that the sequence $\{x_k\}$ is infinite but there is some $\epsilon > 0$ and an infinite subsequence of indices $\{k_j\}_{j=1,2,\dots}$ such that

$\|x_{k_j} - P(x_{k_j})\|_\infty \ge \epsilon, \quad j = 1, 2, \dots$.

Since the sequence $\{Q(x_{k_j})\}_{j=1,2,\dots}$ is infinite, decreasing, and bounded below, it converges to some value $\bar Q > Q^*$. Moreover, since the entire sequence $\{Q(x_k)\}$ is monotone decreasing, it follows that

$Q(x_k) - Q^* \ge \bar Q - Q^* > 0, \quad k = 0, 1, 2, \dots$.

Hence, by boundedness of the subgradients (see (30)), and using the definitions (36) and (37), we can identify a constant $\bar\epsilon > 0$ such that $E_k \ge \bar\epsilon$ and $F_k \ge \bar\epsilon$ for all $k$. Therefore, by Lemma 2, we have

$Q(x_k) - m_{k,l}(x_{k,l}) \ge \min(\Delta_{k,l}/\bar\epsilon,\, 1)\,[\bar Q - Q^*], \quad k = 0, 1, 2, \dots$.  (55)

For each major iteration index $k$, let $l(k)$ be the minor iteration index that passes the acceptance test (19). By combining (19) with (55), we have that

$Q(x_k) - Q(x_{k+1}) \ge \xi \min(\Delta_{k,l(k)}/\bar\epsilon,\, 1)\,[\bar Q - Q^*]$.

Since $Q(x_k) - Q(x_{k+1}) \to 0$, we deduce that $\lim_k \Delta_{k,l(k)} = 0$. However, since $E_k$ and $F_k$ are bounded away from 0, we have from Lemma 3 that $\Delta_{k,l}$ is bounded away from 0, giving a contradiction. We conclude that the second possibility (an infinite sequence $\{x_k\}$ not converging to $S$) cannot occur either, so the proof is complete. ∎

It is easy to show that the algorithm terminates finitely when $\epsilon_{\mathrm{tol}} > 0$. The argument in the proof of Theorem 1 shows that either the test (27) is satisfied at some minor iteration, or the algorithm identifies a new major iterate. Since the amount of reduction at each major iteration is at least $\xi \epsilon_{\mathrm{tol}}$ (from (19)), and since we assume that a solution set exists, the number of major iterations must be finite.

Discussion

If a 2-norm trust region is used in place of the $\infty$-norm trust region of (16), it is well known that the solution of the subproblem

$\min_x m_{k,l}(x)$ subject to $Ax = b$, $x \ge 0$, $\|x - x_k\|_2 \le \Delta_k$

is identical to the solution of

$\min_x m_{k,l}(x) + \lambda \|x - x_k\|_2^2$ subject to $Ax = b$, $x \ge 0$,  (56)

for some $\lambda \ge 0$. We can transform (56) to a quadratic program in the same fashion as the transformation of (16) to (17). The regularized or proximal bundle approaches described in Kiwiel [15], Hiriart-Urruty and Lemaréchal [14, Chapter XV], and Ruszczyński [20,21] work with the formulation (56). They manipulate the parameter $\lambda$ directly rather than adjusting the trust-region radius, more in the spirit of the Levenberg-Marquardt method for least-squares problems than of a true trust-region method.

We chose to devise and analyze an algorithm based on the $\infty$-norm trust region for two reasons. First, the linear-programming trust-region subproblems (17) can be solved by high-quality linear programming software, making the algorithm much easier to implement than the specialized quadratic programming solver required for (56). Although it is well known that the 2-norm trust region often yields a better search direction than the $\infty$-norm trust region when the objective is smooth, it is not clear whether the same property holds for the function $Q$, which is piecewise linear with a great many pieces.
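The practical appeal of the $\infty$-norm choice is that the trust region adds only bound constraints to the cutting-plane LP. As a minimal illustration (our own sketch, not the paper's implementation; one dimension, and omitting the constraints $Ax = b$, $x \ge 0$), the subproblem $\min m_{k,l}(x)$ subject to $|x - x_k| \le \Delta$ can be solved by enumerating the trust-region endpoints and the kinks of the piecewise-linear model:

```python
# Sketch: 1-D l_inf trust-region subproblem over a cutting-plane model.
# The model m(x) = max_l (g_l * x + f_l) is piecewise linear, so on the
# interval [xk - delta, xk + delta] its minimizer lies at an endpoint or
# at an intersection of two cuts; enumeration replaces an LP solver here.

def model(cuts, x):
    """Cutting-plane model m(x) = max over cuts (g*x + f)."""
    return max(g * x + f for g, f in cuts)

def trust_region_subproblem(cuts, xk, delta):
    """Minimize the model over the box [xk - delta, xk + delta]."""
    lo, hi = xk - delta, xk + delta
    candidates = [lo, hi]
    # Intersections of pairs of cuts are the only interior kinks of m.
    for i, (g1, f1) in enumerate(cuts):
        for g2, f2 in cuts[i + 1:]:
            if g1 != g2:
                x = (f2 - f1) / (g1 - g2)
                if lo < x < hi:
                    candidates.append(x)
    return min(candidates, key=lambda x: model(cuts, x))

# Two cuts describing Q(x) = |x| (subgradients taken at x = -1 and x = +1).
cuts = [(-1.0, 0.0), (1.0, 0.0)]
x_new = trust_region_subproblem(cuts, xk=2.0, delta=1.0)
print(x_new)  # the box [1, 3] keeps the step at the boundary, x = 1.0
```

In higher dimensions the same box constraints simply become variable bounds in the LP (17), which is the ease-of-implementation point made above.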
Our second reason was that the convergence analysis of the $\infty$-norm algorithm differs markedly from that of the regularized methods presented in [15,14,20,21], making this project interesting from the theoretical point of view as well as computationally.

Finally, we note that aggregation of cuts, which is a feature of the regularized methods mentioned above and which is useful in limiting storage requirements, can also be performed to some extent in Algorithm TR. In Procedure Model-Update, we still need to retain the cuts generated at $x_k$ and at the earlier minor iterations $l$ satisfying (20). However, the cuts active at the solution of the subproblem (17) can be aggregated into $C$ cuts, one for each index $j = 1, 2, \dots, C$. To describe the aggregation, we use the alternative form (32) of the subproblem (17), from which the inactive cuts have been removed. Denoting the Lagrange multiplier vectors for the constraints (32b) by $\lambda_j$, $j = 1, 2, \dots, C$, we have from the optimality conditions for (32b) that $\lambda_j \ge 0$ and $e^T \lambda_j = 1$,

$j = 1, 2, \dots, C$. Moreover, if we replace the constraints (32b) by the $C$ aggregated constraints

$\theta_j \ge \lambda_j^T F_{[j]}\, x + \lambda_j^T f_{[j]}, \quad j = 1, 2, \dots, C$,  (57)

then the solution of (32) and its optimal objective value are unchanged. Hence, in Procedure Model-Update, we can delete the "else if" clause concerning the constraints active in (17), and insert the addition of the cuts (57) at the end of the procedure.

4. An Asynchronous Bundle-Trust-Region Method

In this section we present an asynchronous, parallel version of the trust-region algorithm of the preceding section and analyze its convergence properties.

4.1 Algorithm ATR

We now define a variant of the method of Section 3 that allows the partial sums $Q_{[j]}$, $j = 1, 2, \dots, C$ (see (7)) and their associated cuts to be evaluated simultaneously for different values of $x$. We generate candidate iterates by solving trust-region subproblems centered on an incumbent iterate, which (after a startup phase) is, roughly speaking, the best point among those visited by the algorithm whose function value $Q(x)$ is fully known. By performing evaluations of $Q$ at different points concurrently, we relax the strict synchronicity requirement of Algorithm TR, which must evaluate $Q(x_k)$ fully before generating the next candidate $x_{k+1}$. The resulting approach, which we call Algorithm ATR (for "asynchronous TR"), is more suitable for implementation on computational grids of the type we consider here. Besides the obvious increase in parallelism that comes with evaluating several points at once, there is no longer a risk of the entire computation being held up by the slow evaluation of one of the partial sums $Q_{[j]}$ on a recalcitrant worker.
Algorithm ATR has theoretical properties similar to those of Algorithm TR, since the mechanisms for accepting a point as the new incumbent, adjusting the size of the trust region, and adding and deleting cuts are all similar to the corresponding mechanisms in Algorithm TR.

Algorithm ATR maintains a basket $\mathcal{B}$ of at most $K$ points for which the value of $Q$ and the associated subgradient information is partially known. When the evaluation of $Q(x_q)$ is completed for a particular point $x_q$ in the basket, $x_q$ is installed as the new incumbent if (i) its objective value is smaller than that of the current incumbent $x_I$; and (ii) it passes a trust-region acceptance test like (19), with the incumbent at the time $x_q$ was generated playing the role of the previous major iterate in Algorithm TR. Whether $x_q$ becomes the incumbent or not, it is removed from the basket.

When a vacancy arises in the basket, we may generate a new point by solving a trust-region subproblem similar to (16), centering the trust region at the

current incumbent $x_I$. During the startup phase, while the basket is being populated, we wait until the evaluation of some other point in the basket has reached a certain level of completion (that is, until a proportion $\sigma \in (0, 1]$ of the partial sums (7) and their subgradients have been evaluated) before generating a new point. We use a logical variable speceval$_q$ to indicate when the evaluation of $x_q$ passes the specified threshold and to ensure that $x_q$ does not trigger the evaluation of more than one new iterate. (Both $\sigma$ and speceval$_q$ play a similar role in Algorithm ALS.) After the startup phase is complete (that is, after the basket has been filled), vacancies arise only when the evaluation of an iterate $x_q$ is completed.

We use $m(\cdot)$ (without subscripts) to denote the model function for $Q(\cdot)$. When generating a new iterate, we use whatever cuts are stored at the time to define $m$. When solved around the incumbent $x_I$ with trust-region radius $\Delta$, the subproblem is as follows:

trsub($x_I$, $\Delta$):  $\min_x m(x)$ subject to $Ax = b$, $x \ge 0$, $\|x - x_I\|_\infty \le \Delta$.  (58)

We refer to $x_I$ as the parent incumbent of the solution of (58).

In the following description, we use $k$ to index the successive points $x_k$ explored by the algorithm, $I$ to denote the index of the incumbent, and $\mathcal{B}$ to denote the basket. As in the description of ALS, we use $T_1, T_2, \dots, T_T$ to denote a partition of $\{1, 2, \dots, C\}$ such that the $r$th computational task consists of the clusters $j \in T_r$ (that is, evaluation of the partial sums $Q_{[j]}$, $j \in T_r$, and their subgradients). We use $t_k$ to count the number of tasks for the evaluation of $Q(x_k)$ that have been completed so far.

Given a starting guess $x_0$, we initialize the algorithm by setting the dummy point $x_{-1}$ to $x_0$, setting the incumbent index $I$ to $-1$, and setting the initial incumbent value $Q_I = Q_{-1}$ to $+\infty$. The iterate at which the first evaluation is completed becomes the first serious incumbent.
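To make the two-part incumbent-update rule described above concrete, here is a small sketch (function and parameter names are ours, not taken from the paper's code): a completed iterate becomes the incumbent only if it beats the current incumbent value and also achieves at least a fraction $\xi$ of the decrease its model predicted relative to its parent incumbent.

```python
# Hedged sketch of ATR's incumbent acceptance test.

def accept_as_incumbent(Q_q, Q_I, Q_parent, m_q, xi=0.25, parent_is_dummy=False):
    """Return True if a fully evaluated iterate x_q should become incumbent.

    Q_q      -- true objective value Q(x_q), now fully evaluated
    Q_I      -- objective value of the current incumbent
    Q_parent -- objective value of x_q's parent incumbent Q_{I_q}
    m_q      -- optimal model value when x_q was generated
    """
    if Q_q >= Q_I:
        return False            # condition (i) fails: no improvement
    if parent_is_dummy:
        return True             # startup: any fully evaluated point wins
    predicted = Q_parent - m_q  # decrease the model promised
    achieved = Q_parent - Q_q   # decrease actually obtained
    return achieved >= xi * predicted  # acceptance test like (19)

print(accept_as_incumbent(Q_q=4.0, Q_I=5.0, Q_parent=5.0, m_q=1.0))  # True
print(accept_as_incumbent(Q_q=4.9, Q_I=5.0, Q_parent=5.0, m_q=1.0))  # False
```

In the second call the point improves on the incumbent (4.9 < 5.0) but realizes only 0.1 of the 4.0 decrease the model predicted, so it is discarded from the basket without becoming incumbent.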
We now outline some other notation used in specifying Algorithm ATR:

$Q_I$: The objective value of the incumbent $x_I$, except in the case $I = -1$, in which case $Q_{-1} = +\infty$.

$I_q$: The index of the parent incumbent of $x_q$, that is, the incumbent index $I$ at the time $x_q$ was generated from (58). Hence $Q_{I_q} = Q(x_{I_q})$ (except when $I_q = -1$; see the previous item).

$\Delta_q$: The value of the trust-region radius used when solving for $x_q$.

$\Delta_{\mathrm{curr}}$: The current value of the trust-region radius. When it comes time to solve (58) to obtain a new iterate $x_q$, we set $\Delta_q \leftarrow \Delta_{\mathrm{curr}}$.

$m_q$: The optimal value of the objective function $m$ in the subproblem trsub($x_{I_q}$, $\Delta_q$) (58).

Our strategy for maintaining the model closely follows that of Algorithm TR. Whenever the incumbent changes, we have a fairly free hand in deleting the cuts that define $m$, just as we do after accepting a new major iterate in Algorithm TR. If the incumbent does not change for a long sequence of iterations (corresponding to a long sequence of minor iterations in Algorithm TR), we can still delete

stale cuts that represent information in $m$ that has likely been superseded (as quantified by a parameter $\eta \in [0, 1)$). The following version of Procedure Model-Update, which applies to Algorithm ATR, takes as an argument the index $k$ of the latest iterate generated by the algorithm. It is called after the evaluation of $Q$ at an earlier iterate $x_q$ has just been completed, but $x_q$ does not meet the conditions needed to become the new incumbent.

Procedure Model-Update ($k$)
  for each optimality cut defining $m$
    possible_delete $\leftarrow$ true;
    if the cut was generated at the parent incumbent $I_k$ of $k$
      possible_delete $\leftarrow$ false;
    else if the cut was active at the solution $x_k$ of trsub($x_{I_k}$, $\Delta_k$)
      possible_delete $\leftarrow$ false;
    else if the cut was generated at an earlier iteration $l$ such that $I_l = I_k$ and
        $Q_{I_k} - m_k > \eta\,[Q_{I_k} - m_l]$  (59)
      possible_delete $\leftarrow$ false;
    end (if)
    if possible_delete
      possibly delete the cut;
  end (for each)

Our strategy for adjusting the trust region $\Delta_{\mathrm{curr}}$ also follows that of Algorithm TR. The differences arise from the fact that, between the time an iterate $x_q$ is generated and the time its function value $Q(x_q)$ becomes known, other adjustments of $\Delta_{\mathrm{curr}}$ may have occurred, as the evaluation of intervening iterates is completed. The version of Procedure Reduce-$\Delta$ for Algorithm ATR is as follows.

Procedure Reduce-$\Delta$ ($q$)
  if $I_q = -1$
    return;
  evaluate
    $\rho = \min(1, \Delta_q)\, \dfrac{Q(x_q) - Q_{I_q}}{Q_{I_q} - m_q}$;  (60)
  if $\rho > 0$
    counter $\leftarrow$ counter + 1;
  if $\rho > 3$ or (counter $\ge 3$ and $\rho \in (1, 3]$)
    set $\Delta_q^+ \leftarrow \Delta_q / \min(\rho, 4)$;
    set $\Delta_{\mathrm{curr}} \leftarrow \min(\Delta_{\mathrm{curr}}, \Delta_q^+)$;
    reset counter $\leftarrow 0$;
  return.

The protocol for increasing the trust region after a successful step is based on (25) and (26). If, on completion of the evaluation of $Q(x_q)$, the iterate $x_q$ becomes the new incumbent, then we test the following condition:

$Q(x_q) - Q_{I_q} \le -0.5\,(Q_{I_q} - m_q)$ and $\|x_q - x_{I_q}\|_\infty = \Delta_q$.  (61)

If this condition is satisfied, we set

$\Delta_{\mathrm{curr}} \leftarrow \max\bigl(\Delta_{\mathrm{curr}}, \min(\Delta_{\mathrm{hi}}, 2\Delta_q)\bigr)$.  (62)

The convergence test is also similar to the test (27) used for Algorithm TR. We terminate if, on generation of a new iterate $x_k$, we find that

$Q_I - m_k \le \epsilon_{\mathrm{tol}}(1 + |Q_I|)$.  (63)

We now specify the four key routines of Algorithm ATR, which serve a similar function to the four main routines of Algorithm ALS. The routine partial_evaluate defines a single task that executes on worker processors, while the other three routines execute on the master processor.

ATR: partial_evaluate($x_q$, $q$, $r$)
  Given $x_q$, index $q$, and task index $r$, evaluate $Q_{[j]}(x_q)$ from (7) for each $j \in T_r$, together with the partial subgradients $g_j$ from (9);
  Activate act_on_completed_task($x_q$, $q$, $r$) on the master processor.

ATR: evaluate($x_q$, $q$)
  for $r = 1, 2, \dots, T$ (possibly concurrently)
    partial_evaluate($x_q$, $q$, $r$);
  end (for)

ATR: initialization($x_0$)
  determine the number of clusters $C$ and the number of tasks $T$, and the partitions $N_1, N_2, \dots, N_C$ and $T_1, T_2, \dots, T_T$;
  choose $\xi \in (0, 1/2)$ and the trust-region upper bound $\Delta_{\mathrm{hi}} > 0$;
  choose the synchronicity parameter $\sigma \in (0, 1]$;
  choose the maximum basket size $K > 0$;
  choose $\Delta_{\mathrm{curr}} \in (0, \Delta_{\mathrm{hi}}]$; counter $\leftarrow 0$;
  $\mathcal{B} \leftarrow \emptyset$; $I \leftarrow -1$; $x_{-1} \leftarrow x_0$; $Q_{-1} \leftarrow +\infty$; $I_0 \leftarrow -1$; $k \leftarrow 0$;
  speceval$_0$ $\leftarrow$ false; $t_0 \leftarrow 0$;
  evaluate($x_0$, $0$).

ATR: act_on_completed_task($x_q$, $q$, $r$)
  $t_q \leftarrow t_q + 1$;
  for each $j \in T_r$, add $Q_{[j]}(x_q)$ and the cut $g_j$ to the model $m$;
  basketfill $\leftarrow$ false; basketupdate $\leftarrow$ false;
  if $t_q = T$
    (* evaluation of $Q(x_q)$ is complete *)
    if $Q(x_q) < Q_I$ and ($I_q = -1$ or $Q(x_q) \le Q_{I_q} - \xi(Q_{I_q} - m_q)$)
      (* make $x_q$ the new incumbent *)
      $I \leftarrow q$; $Q_I \leftarrow Q(x_I)$;
      possibly increase $\Delta_{\mathrm{curr}}$ according to (61) and (62);
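The paper's implementation uses the MW library on a Condor grid, not Python; purely as an illustration of the control flow, the following sketch mimics the partial_evaluate / act_on_completed_task pattern with a thread pool. A toy quadratic stands in for the second-stage partial sums, and all names here are ours.

```python
# Sketch: one task per cluster of scenarios; the master acts on each task
# as it completes (as_completed), so a slow task never blocks the others.
from concurrent.futures import ThreadPoolExecutor, as_completed

def partial_evaluate(x, cluster):
    # Stand-in for solving the second-stage LPs in one cluster; here each
    # "scenario" s contributes (x - s)^2 so the example is self-contained.
    return sum((x - s) ** 2 for s in cluster)

def evaluate_async(x, clusters):
    """Evaluate Q(x) as the sum of partial sums, one task per cluster."""
    total, done = 0.0, 0
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(partial_evaluate, x, c) for c in clusters]
        for fut in as_completed(futures):  # act_on_completed_task
            total += fut.result()          # add this cluster's contribution
            done += 1                      # t_q <- t_q + 1
    return total, done                     # done == T: Q(x) is complete

clusters = [[0.0, 1.0], [2.0], [3.0, 4.0]]  # a partition of 5 "scenarios"
Q, T_done = evaluate_async(1.0, clusters)
print(Q, T_done)  # 1.0 + 1.0 + 13.0 = 15.0, across 3 completed tasks
```

In ATR the master would additionally add the returned cuts to the model and, once `done == T`, run the incumbent test and trust-region update; here we only total the partial sums to show the asynchronous completion handling.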


More information

A DECOMPOSITION PROCEDURE BASED ON APPROXIMATE NEWTON DIRECTIONS

A DECOMPOSITION PROCEDURE BASED ON APPROXIMATE NEWTON DIRECTIONS Working Paper 01 09 Departamento de Estadística y Econometría Statistics and Econometrics Series 06 Universidad Carlos III de Madrid January 2001 Calle Madrid, 126 28903 Getafe (Spain) Fax (34) 91 624

More information

Outline. Relaxation. Outline DMP204 SCHEDULING, TIMETABLING AND ROUTING. 1. Lagrangian Relaxation. Lecture 12 Single Machine Models, Column Generation

Outline. Relaxation. Outline DMP204 SCHEDULING, TIMETABLING AND ROUTING. 1. Lagrangian Relaxation. Lecture 12 Single Machine Models, Column Generation Outline DMP204 SCHEDULING, TIMETABLING AND ROUTING 1. Lagrangian Relaxation Lecture 12 Single Machine Models, Column Generation 2. Dantzig-Wolfe Decomposition Dantzig-Wolfe Decomposition Delayed Column

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department Solving Large Steiner Triple Covering Problems Jim Ostrowski Jeff Linderoth Fabrizio Rossi Stefano Smriglio Technical Report #1663 September 2009 Solving Large Steiner Triple

More information

Implications of the Constant Rank Constraint Qualification

Implications of the Constant Rank Constraint Qualification Mathematical Programming manuscript No. (will be inserted by the editor) Implications of the Constant Rank Constraint Qualification Shu Lu Received: date / Accepted: date Abstract This paper investigates

More information

DELFT UNIVERSITY OF TECHNOLOGY

DELFT UNIVERSITY OF TECHNOLOGY DELFT UNIVERSITY OF TECHNOLOGY REPORT -09 Computational and Sensitivity Aspects of Eigenvalue-Based Methods for the Large-Scale Trust-Region Subproblem Marielba Rojas, Bjørn H. Fotland, and Trond Steihaug

More information

Preprint ANL/MCS-P , Dec 2002 (Revised Nov 2003, Mar 2004) Mathematics and Computer Science Division Argonne National Laboratory

Preprint ANL/MCS-P , Dec 2002 (Revised Nov 2003, Mar 2004) Mathematics and Computer Science Division Argonne National Laboratory Preprint ANL/MCS-P1015-1202, Dec 2002 (Revised Nov 2003, Mar 2004) Mathematics and Computer Science Division Argonne National Laboratory A GLOBALLY CONVERGENT LINEARLY CONSTRAINED LAGRANGIAN METHOD FOR

More information

Scenario decomposition of risk-averse two stage stochastic programming problems

Scenario decomposition of risk-averse two stage stochastic programming problems R u t c o r Research R e p o r t Scenario decomposition of risk-averse two stage stochastic programming problems Ricardo A Collado a Dávid Papp b Andrzej Ruszczyński c RRR 2-2012, January 2012 RUTCOR Rutgers

More information

A PRIMAL-DUAL TRUST REGION ALGORITHM FOR NONLINEAR OPTIMIZATION

A PRIMAL-DUAL TRUST REGION ALGORITHM FOR NONLINEAR OPTIMIZATION Optimization Technical Report 02-09, October 2002, UW-Madison Computer Sciences Department. E. Michael Gertz 1 Philip E. Gill 2 A PRIMAL-DUAL TRUST REGION ALGORITHM FOR NONLINEAR OPTIMIZATION 7 October

More information

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study

More information

CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding

CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding Tim Roughgarden October 29, 2014 1 Preamble This lecture covers our final subtopic within the exact and approximate recovery part of the course.

More information

Solving Dual Problems

Solving Dual Problems Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem

More information

An Introduction to Algebraic Multigrid (AMG) Algorithms Derrick Cerwinsky and Craig C. Douglas 1/84

An Introduction to Algebraic Multigrid (AMG) Algorithms Derrick Cerwinsky and Craig C. Douglas 1/84 An Introduction to Algebraic Multigrid (AMG) Algorithms Derrick Cerwinsky and Craig C. Douglas 1/84 Introduction Almost all numerical methods for solving PDEs will at some point be reduced to solving A

More information

Generation and Representation of Piecewise Polyhedral Value Functions

Generation and Representation of Piecewise Polyhedral Value Functions Generation and Representation of Piecewise Polyhedral Value Functions Ted Ralphs 1 Joint work with Menal Güzelsoy 2 and Anahita Hassanzadeh 1 1 COR@L Lab, Department of Industrial and Systems Engineering,

More information

An Enhanced Spatial Branch-and-Bound Method in Global Optimization with Nonconvex Constraints

An Enhanced Spatial Branch-and-Bound Method in Global Optimization with Nonconvex Constraints An Enhanced Spatial Branch-and-Bound Method in Global Optimization with Nonconvex Constraints Oliver Stein Peter Kirst # Paul Steuermann March 22, 2013 Abstract We discuss some difficulties in determining

More information

Constrained Optimization Theory

Constrained Optimization Theory Constrained Optimization Theory Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. IMA, August 2016 Stephen Wright (UW-Madison) Constrained Optimization Theory IMA, August

More information

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725 Dual methods and ADMM Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given f : R n R, the function is called its conjugate Recall conjugate functions f (y) = max x R n yt x f(x)

More information

Regularized optimization techniques for multistage stochastic programming

Regularized optimization techniques for multistage stochastic programming Regularized optimization techniques for multistage stochastic programming Felipe Beltrán 1, Welington de Oliveira 2, Guilherme Fredo 1, Erlon Finardi 1 1 UFSC/LabPlan Universidade Federal de Santa Catarina

More information

Scenario Grouping and Decomposition Algorithms for Chance-constrained Programs

Scenario Grouping and Decomposition Algorithms for Chance-constrained Programs Scenario Grouping and Decomposition Algorithms for Chance-constrained Programs Siqian Shen Dept. of Industrial and Operations Engineering University of Michigan Joint work with Yan Deng (UMich, Google)

More information

Local strong convexity and local Lipschitz continuity of the gradient of convex functions

Local strong convexity and local Lipschitz continuity of the gradient of convex functions Local strong convexity and local Lipschitz continuity of the gradient of convex functions R. Goebel and R.T. Rockafellar May 23, 2007 Abstract. Given a pair of convex conjugate functions f and f, we investigate

More information

Lecture 3. 1 Terminology. 2 Non-Deterministic Space Complexity. Notes on Complexity Theory: Fall 2005 Last updated: September, 2005.

Lecture 3. 1 Terminology. 2 Non-Deterministic Space Complexity. Notes on Complexity Theory: Fall 2005 Last updated: September, 2005. Notes on Complexity Theory: Fall 2005 Last updated: September, 2005 Jonathan Katz Lecture 3 1 Terminology For any complexity class C, we define the class coc as follows: coc def = { L L C }. One class

More information

A Parametric Simplex Algorithm for Linear Vector Optimization Problems

A Parametric Simplex Algorithm for Linear Vector Optimization Problems A Parametric Simplex Algorithm for Linear Vector Optimization Problems Birgit Rudloff Firdevs Ulus Robert Vanderbei July 9, 2015 Abstract In this paper, a parametric simplex algorithm for solving linear

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Disconnecting Networks via Node Deletions

Disconnecting Networks via Node Deletions 1 / 27 Disconnecting Networks via Node Deletions Exact Interdiction Models and Algorithms Siqian Shen 1 J. Cole Smith 2 R. Goli 2 1 IOE, University of Michigan 2 ISE, University of Florida 2012 INFORMS

More information

You should be able to...

You should be able to... Lecture Outline Gradient Projection Algorithm Constant Step Length, Varying Step Length, Diminishing Step Length Complexity Issues Gradient Projection With Exploration Projection Solving QPs: active set

More information

Some Properties of the Augmented Lagrangian in Cone Constrained Optimization

Some Properties of the Augmented Lagrangian in Cone Constrained Optimization MATHEMATICS OF OPERATIONS RESEARCH Vol. 29, No. 3, August 2004, pp. 479 491 issn 0364-765X eissn 1526-5471 04 2903 0479 informs doi 10.1287/moor.1040.0103 2004 INFORMS Some Properties of the Augmented

More information

Analytic Center Cutting-Plane Method

Analytic Center Cutting-Plane Method Analytic Center Cutting-Plane Method S. Boyd, L. Vandenberghe, and J. Skaf April 14, 2011 Contents 1 Analytic center cutting-plane method 2 2 Computing the analytic center 3 3 Pruning constraints 5 4 Lower

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization

More information

The Simplex Method: An Example

The Simplex Method: An Example The Simplex Method: An Example Our first step is to introduce one more new variable, which we denote by z. The variable z is define to be equal to 4x 1 +3x 2. Doing this will allow us to have a unified

More information

Optimization Methods in Logic

Optimization Methods in Logic Optimization Methods in Logic John Hooker Carnegie Mellon University February 2003, Revised December 2008 1 Numerical Semantics for Logic Optimization can make at least two contributions to boolean logic.

More information

On a class of minimax stochastic programs

On a class of minimax stochastic programs On a class of minimax stochastic programs Alexander Shapiro and Shabbir Ahmed School of Industrial & Systems Engineering Georgia Institute of Technology 765 Ferst Drive, Atlanta, GA 30332. August 29, 2003

More information

1. Introduction. We analyze a trust region version of Newton s method for the optimization problem

1. Introduction. We analyze a trust region version of Newton s method for the optimization problem SIAM J. OPTIM. Vol. 9, No. 4, pp. 1100 1127 c 1999 Society for Industrial and Applied Mathematics NEWTON S METHOD FOR LARGE BOUND-CONSTRAINED OPTIMIZATION PROBLEMS CHIH-JEN LIN AND JORGE J. MORÉ To John

More information

Problem List MATH 5143 Fall, 2013

Problem List MATH 5143 Fall, 2013 Problem List MATH 5143 Fall, 2013 On any problem you may use the result of any previous problem (even if you were not able to do it) and any information given in class up to the moment the problem was

More information

Can Li a, Ignacio E. Grossmann a,

Can Li a, Ignacio E. Grossmann a, A generalized Benders decomposition-based branch and cut algorithm for two-stage stochastic programs with nonconvex constraints and mixed-binary first and second stage variables Can Li a, Ignacio E. Grossmann

More information

A PARALLEL INTERIOR POINT DECOMPOSITION ALGORITHM FOR BLOCK-ANGULAR SEMIDEFINITE PROGRAMS IN POLYNOMIAL OPTIMIZATION

A PARALLEL INTERIOR POINT DECOMPOSITION ALGORITHM FOR BLOCK-ANGULAR SEMIDEFINITE PROGRAMS IN POLYNOMIAL OPTIMIZATION A PARALLEL INTERIOR POINT DECOMPOSITION ALGORITHM FOR BLOCK-ANGULAR SEMIDEFINITE PROGRAMS IN POLYNOMIAL OPTIMIZATION Kartik K. Sivaramakrishnan Department of Mathematics North Carolina State University

More information

Solving large Semidefinite Programs - Part 1 and 2

Solving large Semidefinite Programs - Part 1 and 2 Solving large Semidefinite Programs - Part 1 and 2 Franz Rendl http://www.math.uni-klu.ac.at Alpen-Adria-Universität Klagenfurt Austria F. Rendl, Singapore workshop 2006 p.1/34 Overview Limits of Interior

More information

On the Relative Strength of Split, Triangle and Quadrilateral Cuts

On the Relative Strength of Split, Triangle and Quadrilateral Cuts On the Relative Strength of Split, Triangle and Quadrilateral Cuts Amitabh Basu Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA 53 abasu@andrew.cmu.edu Pierre Bonami LIF, Faculté

More information

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes Optimization Charles J. Geyer School of Statistics University of Minnesota Stat 8054 Lecture Notes 1 One-Dimensional Optimization Look at a graph. Grid search. 2 One-Dimensional Zero Finding Zero finding

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

An inexact subgradient algorithm for Equilibrium Problems

An inexact subgradient algorithm for Equilibrium Problems Volume 30, N. 1, pp. 91 107, 2011 Copyright 2011 SBMAC ISSN 0101-8205 www.scielo.br/cam An inexact subgradient algorithm for Equilibrium Problems PAULO SANTOS 1 and SUSANA SCHEIMBERG 2 1 DM, UFPI, Teresina,

More information

An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization

An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization Frank E. Curtis, Lehigh University involving joint work with Travis Johnson, Northwestern University Daniel P. Robinson, Johns

More information

Information-Theoretic Lower Bounds on the Storage Cost of Shared Memory Emulation

Information-Theoretic Lower Bounds on the Storage Cost of Shared Memory Emulation Information-Theoretic Lower Bounds on the Storage Cost of Shared Memory Emulation Viveck R. Cadambe EE Department, Pennsylvania State University, University Park, PA, USA viveck@engr.psu.edu Nancy Lynch

More information

Identifying Active Constraints via Partial Smoothness and Prox-Regularity

Identifying Active Constraints via Partial Smoothness and Prox-Regularity Journal of Convex Analysis Volume 11 (2004), No. 2, 251 266 Identifying Active Constraints via Partial Smoothness and Prox-Regularity W. L. Hare Department of Mathematics, Simon Fraser University, Burnaby,

More information

Linear Programming Redux

Linear Programming Redux Linear Programming Redux Jim Bremer May 12, 2008 The purpose of these notes is to review the basics of linear programming and the simplex method in a clear, concise, and comprehensive way. The book contains

More information

A Quasi-Newton Algorithm for Nonconvex, Nonsmooth Optimization with Global Convergence Guarantees

A Quasi-Newton Algorithm for Nonconvex, Nonsmooth Optimization with Global Convergence Guarantees Noname manuscript No. (will be inserted by the editor) A Quasi-Newton Algorithm for Nonconvex, Nonsmooth Optimization with Global Convergence Guarantees Frank E. Curtis Xiaocun Que May 26, 2014 Abstract

More information

Date: July 5, Contents

Date: July 5, Contents 2 Lagrange Multipliers Date: July 5, 2001 Contents 2.1. Introduction to Lagrange Multipliers......... p. 2 2.2. Enhanced Fritz John Optimality Conditions...... p. 14 2.3. Informative Lagrange Multipliers...........

More information

Benders Decomposition Methods for Structured Optimization, including Stochastic Optimization

Benders Decomposition Methods for Structured Optimization, including Stochastic Optimization Benders Decomposition Methods for Structured Optimization, including Stochastic Optimization Robert M. Freund April 29, 2004 c 2004 Massachusetts Institute of echnology. 1 1 Block Ladder Structure We consider

More information

Constraint Identification and Algorithm Stabilization for Degenerate Nonlinear Programs

Constraint Identification and Algorithm Stabilization for Degenerate Nonlinear Programs Preprint ANL/MCS-P865-1200, Dec. 2000 (Revised Nov. 2001) Mathematics and Computer Science Division Argonne National Laboratory Stephen J. Wright Constraint Identification and Algorithm Stabilization for

More information

INEXACT CUTS IN BENDERS' DECOMPOSITION GOLBON ZAKERI, ANDREW B. PHILPOTT AND DAVID M. RYAN y Abstract. Benders' decomposition is a well-known techniqu

INEXACT CUTS IN BENDERS' DECOMPOSITION GOLBON ZAKERI, ANDREW B. PHILPOTT AND DAVID M. RYAN y Abstract. Benders' decomposition is a well-known techniqu INEXACT CUTS IN BENDERS' DECOMPOSITION GOLBON ZAKERI, ANDREW B. PHILPOTT AND DAVID M. RYAN y Abstract. Benders' decomposition is a well-known technique for solving large linear programs with a special

More information

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. xx, No. x, Xxxxxxx 00x, pp. xxx xxx ISSN 0364-765X EISSN 156-5471 0x xx0x 0xxx informs DOI 10.187/moor.xxxx.xxxx c 00x INFORMS On the Power of Robust Solutions in

More information

Stochastic Decomposition

Stochastic Decomposition IE 495 Lecture 18 Stochastic Decomposition Prof. Jeff Linderoth March 26, 2003 March 19, 2003 Stochastic Programming Lecture 17 Slide 1 Outline Review Monte Carlo Methods Interior Sampling Methods Stochastic

More information

CMSC 451: Lecture 7 Greedy Algorithms for Scheduling Tuesday, Sep 19, 2017

CMSC 451: Lecture 7 Greedy Algorithms for Scheduling Tuesday, Sep 19, 2017 CMSC CMSC : Lecture Greedy Algorithms for Scheduling Tuesday, Sep 9, 0 Reading: Sects.. and. of KT. (Not covered in DPV.) Interval Scheduling: We continue our discussion of greedy algorithms with a number

More information

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The

More information

Decomposition Algorithms for Two-Stage Distributionally Robust Mixed Binary Programs

Decomposition Algorithms for Two-Stage Distributionally Robust Mixed Binary Programs Decomposition Algorithms for Two-Stage Distributionally Robust Mixed Binary Programs Manish Bansal Grado Department of Industrial and Systems Engineering, Virginia Tech Email: bansal@vt.edu Kuo-Ling Huang

More information

arxiv: v1 [cs.cc] 5 Dec 2018

arxiv: v1 [cs.cc] 5 Dec 2018 Consistency for 0 1 Programming Danial Davarnia 1 and J. N. Hooker 2 1 Iowa state University davarnia@iastate.edu 2 Carnegie Mellon University jh38@andrew.cmu.edu arxiv:1812.02215v1 [cs.cc] 5 Dec 2018

More information

Recoverable Robustness in Scheduling Problems

Recoverable Robustness in Scheduling Problems Master Thesis Computing Science Recoverable Robustness in Scheduling Problems Author: J.M.J. Stoef (3470997) J.M.J.Stoef@uu.nl Supervisors: dr. J.A. Hoogeveen J.A.Hoogeveen@uu.nl dr. ir. J.M. van den Akker

More information