arxiv: v2 [math.oc] 6 Oct 2014


EFFICIENT FIRST-ORDER METHODS FOR LINEAR PROGRAMMING AND SEMIDEFINITE PROGRAMMING

JAMES RENEGAR

1. Introduction

The study of first-order methods has largely dominated research in continuous optimization for the last decade, yet the range of problems for which optimal first-order methods have been developed is surprisingly limited, even though much has been achieved in some high-profile areas, such as compressed sensing. Even if one restricts attention to, say, linear programming, the problems proven to be solvable by first-order methods in $O(1/\epsilon)$ iterations all possess noticeably strong structure.

We present a simple transformation of any linear program or semidefinite program into an equivalent convex optimization problem whose only constraints are linear equations. The objective function is defined on the whole space, making virtually all subgradient methods immediately applicable. We observe, moreover, that the objective function is naturally smoothed, thereby allowing most first-order methods to be applied.

We develop complexity bounds in the unsmoothed case for a particular subgradient method, and in the smoothed case for Nesterov's original optimal first-order method for smooth functions. We achieve the desired bounds on the number of iterations, $O(1/\epsilon^2)$ and $O(1/\epsilon)$, respectively. However, contrary to most of the literature on first-order methods, we measure error relatively, not absolutely. On the other hand, also unlike most of the literature, we require only the level sets to be bounded, not the entire feasible region.

Perhaps most surprising is that the transformation from a linear program or a semidefinite program is simple, and so is the basic theory, and yet the approach has been overlooked until now, a blind spot. Once the transformation is realized, the remaining effort in establishing complexity bounds is mainly straightforward, making use of various works of Nesterov.
The following section presents the transformation and basic theory. At the end of the section we observe that the transformation and theory extend far beyond semidefinite programming, with proofs virtually identical to the ones given. Thereafter we turn to algorithms, first for the unsmoothed case. This is where we actually rely on structure possessed by linear programs and semidefinite programs but not by conic optimization problems in general. A forthcoming paper [5] generalizes the results to all of hyperbolic programming. That paper depends on this one.

Date: September 9, 2014. Thanks to Yurii Nesterov for interesting conversations, and for encouragement, during a recent visit to Cornell, and special thanks for his creative research in the papers without which this one would not exist.

2. Basic Theory

As a linear programming problem
\[ \min\; c^T x \quad \text{s.t.} \quad Ax = b, \;\; x \ge 0 \]
is a special case (duality aside) of a semidefinite program in which all off-diagonal entries are constrained to equal 0, in developing the theory we focus on semidefinite programming, as there is no point in doing proofs twice, once for linear programming and again for semidefinite programming. After proving the first theorem, we digress to make certain the reader is clear on how to determine the implications of the paper for the special case of linear programming. (We sometimes digress to consider the special case of linear programming in later sections as well.)

For $C, A_1, \ldots, A_m \in \mathbb{S}^{n \times n}$ ($n \times n$ symmetric matrices), and $b \in \mathbb{R}^m$, consider the semidefinite program
\[ \inf\; \langle C, X \rangle \quad \text{s.t.} \quad \mathcal{A}(X) = b, \;\; X \succeq 0 \qquad \text{(SDP)} \]
where $\langle \cdot, \cdot \rangle$ is the trace inner product, where $\mathcal{A}(X) := (\langle A_1, X \rangle, \ldots, \langle A_m, X \rangle)$, and where $X \succeq 0$ is shorthand for $X \in \mathbb{S}^{n \times n}_+$ (cone of positive semidefinite matrices).

Let $\mathrm{opt\_val}$ be the optimal value of SDP. Assume $C$ is not orthogonal to the nullspace of $\mathcal{A}$, as otherwise all feasible points are optimal.

Assume a strictly feasible matrix $E$ is known. Until section 5, assume $E = I$, the identity matrix. Assuming the identity is feasible makes the ideas and analysis particularly transparent. In section 5, it is shown that the results for $E = I$ are readily converted to results when the known feasible matrix $E$ is a positive-definite matrix other than the identity. Until section 5, however, the assumption $E = I$ stands, but is not made explicit in the formal statement of results.

For symmetric matrices $X$, let $\lambda_{\min}(X)$ denote the minimum eigenvalue of $X$. It is well known that $X \mapsto \lambda_{\min}(X)$ is a concave function.

Lemma 2.1. Assume SDP has bounded optimal value. If $X \in \mathbb{S}^{n \times n}$ satisfies $\mathcal{A}(X) = b$ and $\langle C, X \rangle < \langle C, I \rangle$, then $\lambda_{\min}(X) < 1$.

Proof: If $\lambda_{\min}(X) \ge 1$, then $I + t(X - I)$ is feasible for all $t \ge 0$. As the function $t \mapsto \langle C, I + t(X - I) \rangle$ is strictly decreasing (because $\langle C, X \rangle < \langle C, I \rangle$), this implies SDP has unbounded optimal value, contrary to assumption.
For all $X \in \mathbb{S}^{n \times n}$ for which $\lambda_{\min}(X) < 1$, let $Z(X)$ denote the matrix where the line from $I$ in direction $X - I$ intersects the boundary of $\mathbb{S}^{n \times n}_+$, that is,
\[ Z(X) := I + \frac{1}{1 - \lambda_{\min}(X)}\,(X - I). \]
We refer to $Z(X)$ as the projection (from $I$) of $X$ to the boundary of the semidefinite cone.

The following result shows that SDP is equivalent to a particular eigenvalue optimization problem for which the only constraints are linear equations. Although the proof is straightforward, the centrality of the result to the development warrants stating it as a theorem.

Theorem 2.2. Let $\mathrm{val}$ be any value satisfying $\mathrm{val} < \langle C, I \rangle$. If $X^*$ solves
\[ \max\; \lambda_{\min}(X) \quad \text{s.t.} \quad \mathcal{A}(X) = b, \;\; \langle C, X \rangle = \mathrm{val}, \qquad (1) \]
then $Z(X^*)$ is optimal for SDP. Conversely, if $Z^*$ is optimal for SDP, then
\[ X^* := I + \frac{\langle C, I \rangle - \mathrm{val}}{\langle C, I \rangle - \mathrm{opt\_val}}\,(Z^* - I) \]
is optimal for (1), and $Z^* = Z(X^*)$.

Proof: Fix a value satisfying $\mathrm{val} < \langle C, I \rangle$, and consider the affine space that forms the feasible region for (1):
\[ \{ X \in \mathbb{S}^{n \times n} : \mathcal{A}(X) = b \text{ and } \langle C, X \rangle = \mathrm{val} \}. \qquad (2) \]
Since $\mathrm{val} < \langle C, I \rangle$, it is easily proven from the convexity of $\mathbb{S}^{n \times n}_+$ that $X \mapsto Z(X)$ gives a one-to-one map from the set (2) onto
\[ \{ Z \in \partial \mathbb{S}^{n \times n}_+ : \mathcal{A}(Z) = b \text{ and } \langle C, Z \rangle < \langle C, I \rangle \}, \qquad (3) \]
where $\partial \mathbb{S}^{n \times n}_+$ denotes the boundary of $\mathbb{S}^{n \times n}_+$. For $X$ in the set (2), the objective value of $Z(X)$ is
\[ \langle C, Z(X) \rangle = \Big\langle C,\; I + \frac{1}{1 - \lambda_{\min}(X)}(X - I) \Big\rangle = \langle C, I \rangle + \frac{1}{1 - \lambda_{\min}(X)}\,(\mathrm{val} - \langle C, I \rangle), \qquad (4) \]
a strictly-decreasing function of $\lambda_{\min}(X)$. Since the map $X \mapsto Z(X)$ is a bijection between the sets (2) and (3), solving SDP is thus equivalent to solving (1).

SDP has been transformed into an equivalent linearly-constrained maximization problem with concave, albeit nonsmooth, objective function. Virtually any subgradient method can be applied to this problem, the main cost per iteration being in computing a subgradient and projecting it onto the subspace $\{ V : \mathcal{A}(V) = 0 \text{ and } \langle C, V \rangle = 0 \}$. In section 6, it is observed that the objective function has a natural smoothing, allowing almost all first-order methods to be applied, not just subgradient methods.

We digress to interpret the implications of the development thus far for the linear programming problem
\[ \min\; c^T x \quad \text{s.t.} \quad Ax = b, \;\; x \ge 0. \qquad \text{(LP)} \quad (5) \]
LP can easily be expressed as a semidefinite program in variable $X \in \mathbb{S}^{n \times n}$ constrained to have all off-diagonal entries equal to zero, where the diagonal entries correspond to the original variables $x_1, \ldots, x_n$. In particular, the standing assumption that $I$ is feasible for SDP becomes, in the special case of LP, a standing assumption that $\mathbf{1}$ (the vector of all ones) is feasible. The eigenvalues of $X$ become the coordinates $x_1, \ldots, x_n$.
The map $X \mapsto \lambda_{\min}(X)$ becomes $x \mapsto \min_j x_j$. Lemma 2.1 becomes the statement that if $x$ satisfies $Ax = b$ and $c^T x < c^T \mathbf{1}$, then $\min_j x_j < 1$. Finally, Theorem 2.2 becomes the result that for any value satisfying $\mathrm{val} < c^T \mathbf{1}$, LP is equivalent to
\[ \max_x\; \min_j x_j \quad \text{s.t.} \quad Ax = b, \;\; c^T x = \mathrm{val}, \qquad (6) \]

in that, for example, $x^*$ is optimal for (6) if and only if the projection
\[ z(x^*) = \mathbf{1} + \frac{1}{1 - \min_j x^*_j}\,(x^* - \mathbf{1}) \]
is optimal for LP. In this straightforward manner, the reader can realize the implications for LP of all results in the paper.

Before leaving the simple setting of linear programming, we make observations pertinent to applying subgradient methods to solving (6), the problem equivalent to LP.

The subgradients of $x \mapsto \min_j x_j$ at $x$ are the convex combinations of the standard basis vectors $e(k)$ for which $x_k = \min_j x_j$. Consequently, the projected subgradients at $x$ are the convex combinations of the vectors $P_k$ for which $x_k = \min_j x_j$, where $P_k$ is the $k$th column of the matrix projecting $\mathbb{R}^n$ onto the nullspace of $\bar{A} = \begin{bmatrix} A \\ c^T \end{bmatrix}$, that is,
\[ P := I - \bar{A}^T (\bar{A} \bar{A}^T)^{-1} \bar{A}. \qquad (7) \]
In particular, if for a subgradient method the current iterate is $x$, then the chosen projected subgradient can simply be any of the columns $P_k$ for which $x_k = \min_j x_j$. Choosing the projected subgradient in this way gives the subgradient method a combinatorial feel. If, additionally, the subgradient method does exact line searches, then the algorithm possesses distinct combinatorial structure. (In this regard it should be noted that the work required for an exact line search is only $O(n \log n)$, dominated by the cost of sorting.)

If $m \ll n$, then $P$ is not computed in its entirety, but instead the matrix $M = (\bar{A} \bar{A}^T)^{-1}$ is formed as a preprocessing step, at cost $O(m^2 n)$. Then, for any iterate $x$ and an index $k$ satisfying $x_k = \min_j x_j$, the projected subgradient $P_k$ is computed according to
\[ u = M \bar{A}_k, \qquad v = \bar{A}^T u, \qquad P_k = e(k) - v, \]
where $\bar{A}_k$ is the $k$th column of $\bar{A}$, for a cost of $O(m^2 + \#\text{nonzero entries in } A + n \log n)$ per iteration, where $O(n \log n)$ is the cost of finding a smallest coordinate of $x$.

Now we return to the theory, expressed for SDP, but interpretable for LP in the straightforward manner explained above.

Assume, henceforth, that SDP has at least one optimal solution. Thus, the equivalent problem (1) has at least one optimal solution.
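The LP computations described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper: the function names `projected_subgradient` and `project_to_boundary` are hypothetical, the vector of all ones is assumed feasible, and $\bar{A}$ is assumed to have full row rank so that $\bar{A}\bar{A}^T$ is invertible.

```python
import numpy as np

def projected_subgradient(A, c, x):
    """Return (k, P_k), where x_k = min_j x_j and P_k is column k of
    P = I - Abar^T (Abar Abar^T)^{-1} Abar, with Abar = [A; c^T].
    Assumption: Abar has full row rank."""
    Abar = np.vstack([A, c])                          # append c^T as the last row
    k = int(np.argmin(x))                             # any index attaining the minimum
    u = np.linalg.solve(Abar @ Abar.T, Abar[:, k])    # u = (Abar Abar^T)^{-1} Abar_k
    v = Abar.T @ u
    p = -v
    p[k] += 1.0                                       # P_k = e(k) - v
    return k, p

def project_to_boundary(x):
    """z(x) = 1 + (x - 1) / (1 - min_j x_j), defined when min_j x_j < 1."""
    lam = x.min()
    if lam >= 1.0:
        raise ValueError("projection requires min_j x_j < 1")
    ones = np.ones_like(x, dtype=float)
    return ones + (x - ones) / (1.0 - lam)
```

For instance, with the single constraint $x_1 + x_2 + x_3 = 3$ and $x = (2, 1.5, -0.5)$, `project_to_boundary(x)` returns a point whose smallest coordinate is zero and which still satisfies the constraint, while the vector returned by `projected_subgradient` is annihilated by $\bar{A}$.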
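Let $X^*_{\mathrm{val}}$ denote any of the optimal solutions for the equivalent problem (1).

Lemma 2.3.
\[ \lambda_{\min}(X^*_{\mathrm{val}}) = \frac{\mathrm{val} - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{opt\_val}} \]

Proof: By Theorem 2.2, $Z(X^*_{\mathrm{val}})$ is optimal for SDP; in particular, $\langle C, Z(X^*_{\mathrm{val}}) \rangle = \mathrm{opt\_val}$. Thus, according to (4),
\[ \mathrm{opt\_val} = \langle C, I \rangle + \frac{1}{1 - \lambda_{\min}(X^*_{\mathrm{val}})}\,(\mathrm{val} - \langle C, I \rangle). \]
Rearrangement completes the proof.

We focus on the goal of computing a matrix $Z$ that is feasible for SDP and has objective value which is significantly better than the objective value for $I$, in the sense that
\[ \frac{\langle C, Z \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{opt\_val}} \le \epsilon, \qquad (8) \]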

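where $\epsilon > 0$ is user-chosen. Thus, for the problem of main interest, SDP (or the special case, LP), the focus is on relative improvement in the objective value.

The following proposition provides a useful characterization of the accuracy needed in approximately solving the SDP equivalent problem (1) so as to ensure that for the computed matrix $X$, the projection $Z = Z(X)$ satisfies (8).

Proposition 2.4. Let $0 \le \epsilon < 1$, and let $\mathrm{val}$ be a value satisfying $\mathrm{val} < \langle C, I \rangle$. If $X$ is feasible for the SDP equivalent problem (1), then
\[ \frac{\langle C, Z(X) \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{opt\_val}} \le \epsilon \qquad (9) \]
if and only if
\[ \lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X) \le \frac{\epsilon}{1 - \epsilon} \cdot \frac{\langle C, I \rangle - \mathrm{val}}{\langle C, I \rangle - \mathrm{opt\_val}}. \qquad (10) \]

Proof: Assume $X$ is feasible for the equivalent problem (1). For $Y = X, X^*_{\mathrm{val}}$, we have the equality (4), that is,
\[ \langle C, Z(Y) \rangle = \langle C, I \rangle + \frac{1}{1 - \lambda_{\min}(Y)}\,(\mathrm{val} - \langle C, I \rangle). \]
Thus,
\[ \frac{\langle C, Z(X) \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{opt\_val}} = \frac{\langle C, Z(X) \rangle - \langle C, Z(X^*_{\mathrm{val}}) \rangle}{\langle C, I \rangle - \langle C, Z(X^*_{\mathrm{val}}) \rangle} = \frac{\frac{1}{1 - \lambda_{\min}(X)} - \frac{1}{1 - \lambda_{\min}(X^*_{\mathrm{val}})}}{-\frac{1}{1 - \lambda_{\min}(X^*_{\mathrm{val}})}} = \frac{\lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X)}{1 - \lambda_{\min}(X)}. \]
Hence,
\[ \begin{aligned} \frac{\langle C, Z(X) \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{opt\_val}} \le \epsilon \;&\Leftrightarrow\; \lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X) \le \epsilon\,(1 - \lambda_{\min}(X)) \\ &\Leftrightarrow\; (1 - \epsilon)\,(\lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X)) \le \epsilon\,(1 - \lambda_{\min}(X^*_{\mathrm{val}})) \\ &\Leftrightarrow\; \lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X) \le \frac{\epsilon}{1 - \epsilon}\,(1 - \lambda_{\min}(X^*_{\mathrm{val}})). \end{aligned} \]
Using Lemma 2.3 to substitute for the rightmost occurrence of $\lambda_{\min}(X^*_{\mathrm{val}})$ completes the proof.

It might seem that to make use in complexity analysis of the equivalence of (10) with (9), it would be necessary to assume as input to algorithms a lower bound on $\mathrm{opt\_val}$. Such is not the case, as is shown in the following sections.

In concluding the section, we observe that the basic theory holds far more generally. In particular, let $K$ be a closed, pointed, convex cone in $\mathbb{R}^n$, and assume $e$ lies in the interior of $K$. For $x \in \mathbb{R}^n$, define
\[ \lambda_{\min,e}(x) := \inf\{ \lambda \in \mathbb{R} : x - \lambda e \notin K \}. \qquad (11) \]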

It is easy to show $x \mapsto \lambda_{\min,e}(x)$ is a closed, concave function with finite value for all $x \in \mathbb{R}^n$.

Assume additionally that $e$ is feasible for the conic optimization problem
\[ \min\; \langle c, x \rangle \quad \text{s.t.} \quad Ax = b, \;\; x \in K \qquad (12) \]
(where $\langle \cdot, \cdot \rangle$ is any fixed inner product). The same proof as for Theorem 2.2 then shows that for any value satisfying $\mathrm{val} < \langle c, e \rangle$, (12) is equivalent to the linearly-constrained optimization problem
\[ \max\; \lambda_{\min,e}(x) \quad \text{s.t.} \quad Ax = b, \;\; \langle c, x \rangle = \mathrm{val}, \qquad (13) \]
in the sense that if $x^*$ is optimal for (13), then $z(x^*) := e + \frac{1}{1 - \lambda_{\min,e}(x^*)}(x^* - e)$ is optimal for (12), and conversely, if $z^*$ is optimal for (12), then $x^* = e + \frac{\langle c, e \rangle - \mathrm{val}}{\langle c, e \rangle - \mathrm{opt\_val}}(z^* - e)$ is optimal for (13); moreover, $z^* = z(x^*)$. Likewise, Proposition 2.4 carries over with the same proof.

Furthermore, analogous results are readily developed for a variety of different forms of conic optimization problems. Consider, for example, a problem
\[ \min\; \langle c, x \rangle \quad \text{s.t.} \quad Ax - b \in K. \qquad (14) \]
Now assume known a feasible point $e$ in the interior of the feasible region. Then, for any value satisfying $\mathrm{val} < \langle c, e \rangle$, the conic optimization problem (14) is equivalent to a problem for which there is only one linear constraint:
\[ \max\; \lambda_{\min,e'}(Ax - b) \quad \text{s.t.} \quad \langle c, x \rangle = \mathrm{val}, \qquad (15) \]
where $e' := Ae - b$, and where $\lambda_{\min,e'}$ is as defined in (11). The problems are equivalent in that if $x^*$ is optimal for (15), then the projection $z(x^*)$ of $x^*$ from $e$ to the boundary of the feasible region is optimal for (14), that is,
\[ z(x^*) := e + \frac{1}{1 - \lambda_{\min,e'}(Ax^* - b)}\,(x^* - e) \]
is optimal for (14), and, conversely, if $z^*$ is optimal for (14), then $x^* = e + \frac{\langle c, e \rangle - \mathrm{val}}{\langle c, e \rangle - \mathrm{opt\_val}}(z^* - e)$ is optimal for (15); moreover, $z^* = z(x^*)$.

We focus on the concrete setting of semidefinite programming (and linear programming) because the algebraic structure thereby provided is sufficient for designing provably-efficient first-order methods, in both smoothed and unsmoothed settings. We now begin validating the claim.

3. Corollaries for a Subgradient Method

Continue to assume SDP has an optimal solution and $I$ is feasible.

Given $\epsilon > 0$ and a value satisfying $\mathrm{val} < \langle C, I \rangle$, we wish to approximately solve the SDP equivalent problem
\[ \max\; \lambda_{\min}(X) \quad \text{s.t.} \quad \mathcal{A}(X) = b, \;\; \langle C, X \rangle = \mathrm{val}, \qquad (16) \]
where by "approximately solve" we mean that feasible $X$ is computed for which
\[ \lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X) \le \epsilon' \]
with $\epsilon'$ satisfying
\[ \epsilon' \le \frac{\epsilon}{1 - \epsilon} \cdot \frac{\langle C, I \rangle - \mathrm{val}}{\langle C, I \rangle - \mathrm{opt\_val}}. \]
Indeed, according to Proposition 2.4, the projection $Z = Z(X)$ will then satisfy
\[ \frac{\langle C, Z \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{opt\_val}} \le \epsilon. \qquad (17) \]

We begin by recalling a well-known complexity result for a subgradient method, interpreted for when the method is applied to solving the SDP equivalent problem (16). From this is deduced a bound on the number of iterations sufficient to obtain $X$ whose projection $Z = Z(X)$ satisfies (17). We observe, however, that in a certain respect, the result is disappointing. In the next section, the framework is embellished by applying the subgradient method not to (16) for only one value $\mathrm{val}$, but to (16) for a small and carefully chosen sequence of values, $\mathrm{val} = \mathrm{val}_l$. The embellishment results in a computational scheme which possesses the desired improvement.

For specifying a subgradient method and stating a bound on its complexity, we follow Nesterov's book [2]:

Subgradient Method

(0) Inputs:
    Number of iterations: $N$
    Initial iterate: $X_0$ satisfying $\mathcal{A}(X_0) = b$ and $\langle C, X_0 \rangle < \langle C, I \rangle$. Let $\mathrm{val} := \langle C, X_0 \rangle$.
    Distance upper bound: $R$, a value for which there exists $X^*_{\mathrm{val}}$ satisfying $\| X_0 - X^*_{\mathrm{val}} \| \le R$.
    Initial best iterate: $X = X_0$
    Initial counter value: $k = -1$
(1) Update counter: $k + 1 \to k$
(2) Iteration: Compute a subgradient of $\lambda_{\min}$ at $X_k$ and orthogonally project it onto the subspace
\[ \{ V : \mathcal{A}(V) = 0 \text{ and } \langle C, V \rangle = 0 \}. \qquad (18) \]
    Denoting the projection by $G_k$, compute $X_{k+1} := X_k + \frac{R}{\sqrt{N}} \cdot \frac{G_k}{\| G_k \|}$.
(3) If $\lambda_{\min}(X_{k+1}) > \lambda_{\min}(X)$, then make the replacement $X_{k+1} \to X$.
(4) Check for termination: If $k = N$, then output $X$ and terminate. Else, go to Step 1.
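The steps above can be sketched for the linear programming special case, where the subgradient at $x$ is a standard basis vector $e(k)$ with $x_k = \min_j x_j$ and the projection is onto the nullspace of $\bar{A} = [A; c^T]$. The following is a minimal sketch under stated assumptions, not the paper's implementation: the name `subgradient_method_lp` is hypothetical, the loop performs exactly `N` normalized steps of length $R/\sqrt{N}$, and $\bar{A}$ is assumed to have full row rank.

```python
import numpy as np

def subgradient_method_lp(A, c, x0, R, N):
    """Sketch of Subgradient Method for  max min_j x_j  s.t.  Ax = A x0, c^T x = c^T x0.
    Takes N normalized projected-subgradient steps of length R/sqrt(N) and
    returns the iterate with the largest value of min_j x_j seen so far."""
    Abar = np.vstack([A, c])                 # Abar = [A; c^T], assumed full row rank
    M = Abar @ Abar.T
    x = np.array(x0, dtype=float)
    best = x.copy()
    step = R / np.sqrt(N)
    for _ in range(N):
        k = int(np.argmin(x))                # subgradient e(k) with x_k = min_j x_j
        g = -Abar.T @ np.linalg.solve(M, Abar[:, k])
        g[k] += 1.0                          # g = P_k, projection of e(k) onto null(Abar)
        norm = np.linalg.norm(g)
        if norm < 1e-12:                     # zero projected subgradient: x is optimal
            return x
        x = x + step * g / norm              # Step (2)
        if x.min() > best.min():             # Step (3): keep the best iterate
            best = x.copy()
    return best
```

Because every step direction lies in the nullspace of $\bar{A}$, all iterates keep the equations $Ax = b$ and $c^T x = \mathrm{val}$ satisfied up to rounding error, so no re-projection of the iterates is needed.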

Theorem 3.1. For Subgradient Method, the output $X$ satisfies
\[ \lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X) \le R / \sqrt{N}, \]
where $\mathrm{val} := \langle C, X_0 \rangle$ and $X_0$ is the input matrix.

Proof: The function $X \mapsto \lambda_{\min}(X)$ is Lipschitz continuous with constant 1:
\[ | \lambda_{\min}(X) - \lambda_{\min}(Y) | \le \| X - Y \| = \Big( \sum_j \lambda_j(X - Y)^2 \Big)^{1/2}. \]
The result is thus a simple corollary of the relevant theorem in Nesterov's book [2], by choosing the parameter values there to be $h_k = R / \sqrt{N}$ for $k = 0, \ldots, N$.

We briefly digress to the special case of linear programming. Recall that for LP, the linear program (5), the projected subgradients at $x$ are the convex combinations of the columns $P_k$ for which $x_k = \min_j x_j$, where $P$ is the matrix projecting $\mathbb{R}^n$ orthogonally onto the nullspace of $\bar{A} = \begin{bmatrix} A \\ c^T \end{bmatrix}$, that is, $P = I - \bar{A}^T (\bar{A} \bar{A}^T)^{-1} \bar{A}$. In particular, if $x$ is the current iterate for Subgradient Method, a projected subgradient can be selected simply by computing any column $P_k$ for which $x_k = \min_j x_j$. Subgradient Method then moves from $x$ to $x + \frac{R}{\sqrt{N}} \cdot \frac{P_k}{\| P_k \|}$.

The geometry is interesting in that each step is being chosen from among only the vectors $\frac{R}{\sqrt{N}} \cdot \frac{P_j}{\| P_j \|}$ for $j = 1, \ldots, n$. The geometry is made even more interesting by Theorem 3.1 asserting that even for the choice of steps coming from this limited set of vectors, still it holds that the final output $x$ satisfies
\[ \min_j x^*_{\mathrm{val},j} - \min_j x_j \le R / \sqrt{N}, \]
where $x^*_{\mathrm{val},j}$ denotes the $j$th coordinate of the optimal solution $x^*_{\mathrm{val}}$ for the LP equivalent problem
\[ \max_x\; \min_j x_j \quad \text{s.t.} \quad Ax = b, \;\; c^T x = \mathrm{val}. \]

Now we return to the more general setting of semidefinite programming. Below, the input matrix $X_0$ to Subgradient Method is required to be feasible for SDP, mainly so that the input $R$ can be chosen as a value with clear relevance to SDP, a value we now describe.

The level sets for SDP are the sets
\[ \mathrm{Level}_{\mathrm{val}} = \{ X \succeq 0 : \mathcal{A}(X) = b \text{ and } \langle C, X \rangle = \mathrm{val} \}, \]
where $\mathrm{val}$ is any fixed value. Of course $\mathrm{Level}_{\mathrm{val}} = \emptyset$ if $\mathrm{val} < \mathrm{opt\_val}$.

If some nonempty level set is bounded, then all level sets are bounded.
On the other hand, if a level set is unbounded, then either SDP has unbounded optimal value or can be made to have unbounded value with an arbitrarily small perturbation of $C$. Thus, in developing numerical methods for approximating optimal solutions, it is natural to focus on the case that level sets for SDP are bounded

(equivalently, the dual problem is strictly feasible). Hence, we assume the level sets are bounded.

Let $\mathrm{diam}$ be a known value satisfying
\[ \mathrm{diam} \ge \max\{ \| X - Y \| : X, Y \in \mathrm{Level}_{\mathrm{val}} \} \quad \text{for all } \mathrm{val} < \langle C, I \rangle, \]
that is, an upper bound on the diameters of all level sets for better objective values than the value for the level set containing $I$.

Although the assumption of knowing the upper bound $\mathrm{diam}$ is strong, it is consistent with assumptions found throughout the literature on first-order methods, such as the requirement for Subgradient Method that the input $R$ be an upper bound on $\| X_0 - X^*_{\mathrm{val}} \|$, where $X_0$ is the input matrix. Moreover, even though the assumption of knowing $\mathrm{diam}$ is strong, still there are many interesting situations in which the assumption is valid, particularly when a problem is specifically modeled in such a way as to make the diameter of the level sets (for $\mathrm{val} < \langle C, I \rangle$) be of reasonable magnitude. For example, when $I$ is on (or near) the central path, the choice of upper bound $\mathrm{diam} = n$ is valid, although for various carefully modeled semidefinite programs in which $I$ is explicitly made to be near the central path, stronger upper bounds hold (e.g., $\mathrm{diam} = O(\sqrt{n})$ in numerous interesting cases, some of which are displayed in the forthcoming paper [5]).

In most of the literature on optimal first-order methods, the feasible region is required to be bounded, not just the level sets. By focusing on relative error (17) rather than absolute error, we are able to require only that the level sets be bounded, not the feasible region.

In the following corollary regarding Subgradient Method, the choice of input $N$ depends on the optimal value for SDP. Naturally the reader will infer that in addition to knowing the upper bound $\mathrm{diam}$, our algorithmic scheme will require knowing a lower bound on $\mathrm{opt\_val}$, but this is not the case for the scheme. The corollary is used for motivating the next step in specifying the scheme.

Corollary 3.2.
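Assume $X_0$ is feasible for SDP and satisfies $\langle C, X_0 \rangle < \langle C, I \rangle$. Define $\mathrm{val} := \langle C, X_0 \rangle$. Let $0 < \epsilon < 1$. If $X_0$ and $R = \mathrm{diam}$ are inputs to Subgradient Method, along with an integer $N$ satisfying
\[ N \ge \left( \frac{\mathrm{diam}}{\epsilon} \cdot \frac{\langle C, I \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{val}} \right)^2, \qquad (19) \]
then for the output $X$, the projection $Z(X)$ satisfies
\[ \frac{\langle C, Z(X) \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{opt\_val}} \le \epsilon. \]

Proof: Since $X_0$ is feasible, so is $X^*_{\mathrm{val}}$ (because $0 \le \lambda_{\min}(X_0) \le \lambda_{\min}(X^*_{\mathrm{val}})$). Thus, $\| X_0 - X^*_{\mathrm{val}} \| \le \mathrm{diam}$, making $R = \mathrm{diam}$ a valid input to Subgradient Method. For inputs as specified, Theorem 3.1 immediately implies the output $X$ for Subgradient Method satisfies
\[ \lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X) \le \epsilon \cdot \frac{\langle C, I \rangle - \mathrm{val}}{\langle C, I \rangle - \mathrm{opt\_val}}. \]
Since $\epsilon \le \epsilon / (1 - \epsilon)$, invoking Proposition 2.4 completes the proof.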

The dependence of the iteration lower bound (19) on $1/\epsilon^2$ is unfortunate but probably unavoidable without smoothing the objective function $X \mapsto \lambda_{\min}(X)$, as is done in sections 6 and 7. Likewise, a significant dependence on $\mathrm{diam}$ or some other meaningful quantity capturing the distance $\| X_0 - X^*_{\mathrm{val}} \|$ probably is unavoidable. However, the dependence on
\[ \frac{\langle C, I \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{val}} \]
is disconcerting.

To understand why the dependence is disconcerting, consider that the most natural choice for the input matrix is
\[ X_0 = I - \frac{1}{\lambda_{\max}(\pi(C))}\, \pi(C), \]
where $\pi(C)$ is the orthogonal projection of $C$ onto the subspace $\{ V : \mathcal{A}(V) = 0 \}$. This is the choice for $X_0$ obtained by moving from $I$ in direction $-\pi(C)$ until the boundary of the semidefinite cone is reached. However, even when $I$ is on the central path (in which case the direction $-\pi(C)$ is tangent to the central path), it can happen that the value $\frac{\langle C, I \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{val}}$ is of magnitude $n$ for $\mathrm{val} = \langle C, X_0 \rangle$ and $X_0 = I - \frac{1}{\lambda_{\max}(\pi(C))} \pi(C)$. Thus, even for problems modeled carefully so that $I$ is on the central path and $\mathrm{diam}$ is of limited size, the iteration lower bound (19) can grow significantly with $n$ regardless of the value for $\epsilon$. This is disconcerting.

Moreover, we want an algorithm for which $\mathrm{opt\_val}$ does not explicitly figure into choosing the inputs. We already assume the upper bound $\mathrm{diam}$ is known. We want to avoid also assuming a lower bound on $\mathrm{opt\_val}$ is known.

These matters are handled in the following section.

4. The NonSmoothed Scheme

The observations concluding the preceding section raise a question: Is it possible to efficiently move from an initial feasible matrix $U_0$ satisfying $\langle C, U_0 \rangle < \langle C, I \rangle$, to a feasible matrix $Y$ for which $\mathrm{val} = \langle C, Y \rangle$ satisfies, say, $\frac{\langle C, I \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{val}} \le 3$?

We begin this section by providing an affirmative answer, but first let us again display the pertinent optimization problem:
\[ \max\; \lambda_{\min}(X) \quad \text{s.t.} \quad \mathcal{A}(X) = b, \;\; \langle C, X \rangle = \mathrm{val}. \qquad (20) \]
Recall that $X^*_{\mathrm{val}}$ denotes any optimal solution of (20), an optimization problem which is equivalent to SDP (assuming $\mathrm{val} < \langle C, I \rangle$).

Consider the following computational procedure:

NonSmoothed SubScheme

(0) Initiation:
    Input: A matrix $U_0$ that is feasible for SDP and satisfies $\langle C, U_0 \rangle < \langle C, I \rangle$.
    Let $\mathrm{val}_0 = \langle C, U_0 \rangle$.
    Let $l = -1$.
(1) Outer Iteration Counter Step: $l + 1 \to l$
(2) Inner Iterations: Apply Subgradient Method with inputs $X_0 = U_l$, $R = \mathrm{diam}$ and $N = 9\,\mathrm{diam}^2$. Rename the output $X$ as $V_l$.

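(3) Check for Termination: If $\lambda_{\min}(V_l) \le 1/3$, then output $Y = U_l$ and terminate. Else, compute the projection $U_{l+1} := Z(V_l)$, let $\mathrm{val}_{l+1} := \langle C, U_{l+1} \rangle$, and go to Step 1.

Proposition 4.1. NonSmoothed SubScheme outputs $Y$ that is feasible for SDP and satisfies
\[ \frac{\langle C, I \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{val}} \le 3, \quad \text{where } \mathrm{val} := \langle C, Y \rangle. \]
The total number of outer iterations does not exceed
\[ \log_{3/2}\left( \frac{\langle C, I \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{val}_0} \right), \]
where $\mathrm{val}_0 = \langle C, U_0 \rangle$ and $U_0$ is the input matrix.

Proof: It is easily verified that all of the matrices $U_l$ and $V_l$ computed by NonSmoothed SubScheme satisfy the SDP equations $\mathcal{A}(X) = b$. Moreover, $U_l$ is clearly feasible for SDP, lying in the boundary of the feasible region.

Fix $l$ to be any value attained by the counter. We now examine the effects of Steps 2 and 3.

Corollary 3.2 with $N = 9\,\mathrm{diam}^2$ shows that in Step 2, the output $X$ from Subgradient Method satisfies
\[ \lambda_{\min}(X^*_{\mathrm{val}_l}) - \lambda_{\min}(X) \le 1/3, \]
that is,
\[ \lambda_{\min}(X^*_{\mathrm{val}_l}) - \lambda_{\min}(V_l) \le 1/3. \qquad (21) \]
Observe
\[ \frac{\langle C, I \rangle - \mathrm{val}_l}{\langle C, I \rangle - \mathrm{opt\_val}} = 1 - \lambda_{\min}(X^*_{\mathrm{val}_l}) \;\; \text{(by Lemma 2.3)} \;\; \ge \tfrac{2}{3} - \lambda_{\min}(V_l) \;\; \text{(by (21))}. \]
Hence, if the method terminates in Step 3, that is, if $\lambda_{\min}(V_l) \le 1/3$, then the output matrix $Y = U_l$ satisfies
\[ \frac{\langle C, I \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{val}} \le 3, \qquad (22) \]
where $\mathrm{val} := \langle C, Y \rangle = \mathrm{val}_l$. We have now verified that if NonSmoothed SubScheme terminates, then the output $Y$ is indeed feasible for SDP and satisfies the desired inequality (22).

On the other hand, if the method does not terminate in Step 3, it computes the matrix $U_{l+1}$ and its objective value, $\mathrm{val}_{l+1}$. Here, observe
\[ \mathrm{val}_{l+1} = \langle C, I \rangle + \frac{1}{1 - \lambda_{\min}(V_l)}\,(\mathrm{val}_l - \langle C, I \rangle) \le \langle C, I \rangle - \tfrac{3}{2}\,(\langle C, I \rangle - \mathrm{val}_l), \]
because $\mathrm{val}_l < \langle C, I \rangle$ and $\lambda_{\min}(V_l) > 1/3$ (due to no termination). Hence,
\[ \langle C, I \rangle - \mathrm{val}_{l+1} \ge \tfrac{3}{2}\,(\langle C, I \rangle - \mathrm{val}_l). \]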

Since all values $\mathrm{val}_l$ computed by the algorithm satisfy $\mathrm{val}_l \ge \mathrm{opt\_val}$ (as $U_l$ is feasible for SDP), it immediately follows that
\[ \log_{3/2}\left( \frac{\langle C, I \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{val}_0} \right) \]
is an upper bound on the number of outer iterations.

Specifying our overall computational scheme relying on the subgradient method, and analyzing the scheme's complexity, are both now easily accomplished:

NonSmoothed Scheme

(0) Inputs: A value $0 < \epsilon < 1$, and a matrix $U_0$ which both is feasible for SDP and satisfies $\langle C, U_0 \rangle < \langle C, I \rangle$. For example, the matrix $U_0 = I - \frac{1}{\lambda_{\max}(\pi(C))}\, \pi(C)$.
(1) Apply NonSmoothed SubScheme with input $U_0$. Let $Y$ denote the output.
(2) Apply Subgradient Method with inputs $X_0 = Y$, $R = \mathrm{diam}$ and $N = (3\,\mathrm{diam}/\epsilon)^2$. Let $X$ denote the output.
(3) Compute and output the projection $Z = Z(X)$, then terminate.

In stating the following theorem, we make explicit that $I$ is being assumed as feasible. The generalization to assuming known a strictly feasible matrix, but not necessarily the identity, is presented in section 5.

Theorem 4.2. Assume $I$ is feasible for SDP. NonSmoothed Scheme outputs $Z$ which is feasible for SDP and satisfies
\[ \frac{\langle C, Z \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{opt\_val}} \le \epsilon. \]
The total number of iterations of Subgradient Method is bounded above by
\[ \left( 9\,\mathrm{diam}^2 + 1 \right) \left( \frac{1}{\epsilon^2} + \log_{3/2}\left( \frac{\langle C, I \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{val}_0} \right) \right), \]
where $\mathrm{val}_0 := \langle C, U_0 \rangle$ and $U_0$ is the input matrix.

Proof: Proposition 4.1 shows the output matrix $Y$ from Step 1 is feasible for SDP and satisfies
\[ \frac{\langle C, I \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{val}} \le 3, \]
where $\mathrm{val} = \langle C, Y \rangle$. Thus, by Corollary 3.2, when $X_0 = Y$ is input into Subgradient Method, along with $R = \mathrm{diam}$ and $N = (3\,\mathrm{diam}/\epsilon)^2$, the projection $Z(X)$ of the output $X$ satisfies
\[ \frac{\langle C, Z(X) \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \mathrm{opt\_val}} \le \epsilon, \]
establishing correctness of NonSmoothed Scheme. The bound for total iterations of Subgradient Method is immediate from the outer iteration bound of Proposition 4.1, and the choices for the number of iterations in Step 2 of NonSmoothed SubScheme and in Step 2 of NonSmoothed Scheme.
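For the linear programming special case, the two-stage structure of NonSmoothed Scheme can be sketched as follows. This is a hedged illustration rather than the paper's code: the names `subgradient_method_lp`, `project_to_boundary` and `nonsmoothed_scheme_lp` are hypothetical, the iteration counts are rounded up to integers, the all-ones vector is assumed feasible, `diam` is assumed to be a valid upper bound on the level-set diameters, and $\bar{A} = [A; c^T]$ is assumed to have full row rank.

```python
import math
import numpy as np

def subgradient_method_lp(A, c, x0, R, N):
    """N normalized projected-subgradient steps of length R/sqrt(N) for
    max min_j x_j  s.t.  Ax = A x0, c^T x = c^T x0; returns the best iterate."""
    Abar = np.vstack([A, c])
    M = np.linalg.inv(Abar @ Abar.T)         # preprocessing; assumes full row rank
    x = np.array(x0, dtype=float)
    best, step = x.copy(), R / math.sqrt(N)
    for _ in range(N):
        k = int(np.argmin(x))
        g = -Abar.T @ (M @ Abar[:, k])
        g[k] += 1.0                          # g = P_k = e(k) - Abar^T M Abar_k
        norm = np.linalg.norm(g)
        if norm < 1e-12:                     # zero projected subgradient: x is optimal
            return x
        x = x + step * g / norm
        if x.min() > best.min():
            best = x.copy()
    return best

def project_to_boundary(x):
    """z(x) = 1 + (x - 1) / (1 - min_j x_j)."""
    return 1.0 + (x - 1.0) / (1.0 - x.min())

def nonsmoothed_scheme_lp(A, c, u0, diam, eps):
    """Sketch of NonSmoothed Scheme for LP, with u0 feasible and c^T u0 < c^T 1."""
    u = np.array(u0, dtype=float)
    n_inner = math.ceil(9 * diam ** 2)
    while True:                              # NonSmoothed SubScheme
        v = subgradient_method_lp(A, c, u, diam, n_inner)
        if v.min() <= 1.0 / 3.0:             # Step (3): terminate with Y = u
            break
        u = project_to_boundary(v)           # U_{l+1} := z(V_l)
    n_final = math.ceil((3 * diam / eps) ** 2)
    x = subgradient_method_lp(A, c, u, diam, n_final)
    return project_to_boundary(x)
```

The outer loop has no explicit iteration cap because Proposition 4.1 bounds the number of passes when `diam` is valid; a production implementation would likely add a safeguard for invalid inputs.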

The following corollary is useful when an optimization problem is modeled so as to make $I$ be on (or near) the central path. The proof follows standard lines in interior-point method theory, but nonetheless we include the proof for completeness.

Corollary 4.3. If $I$ is on the central path and the input matrix is chosen as $U_0 = I - \frac{1}{\lambda_{\max}(\pi(C))}\, \pi(C)$, then the same conclusions as in Theorem 4.2 apply but now with the number of Subgradient Method iterations bounded above by
\[ \left( 9\,\mathrm{diam}^2 + 1 \right) \left( \frac{1}{\epsilon^2} + \log_{3/2}(n) \right). \]

Proof: Assume $I$ is on the central path, that is, assume for some $\mu > 0$ that $I$ is the optimal solution for
\[ \min\; \langle C, X \rangle - \mu \ln(\det(X)) \quad \text{s.t.} \quad \mathcal{A}(X) = b. \]
Since the gradient of the objective function at $X \succ 0$ is $C - \mu X^{-1}$, a first-order optimality condition satisfied by $I$ is that there exists a vector $y$ for which $C - \mu I = \mathcal{A}^* y$, where $\mathcal{A}^* y = \sum_i y_i A_i$ is the adjoint of $\mathcal{A}$. This implies that the projections of $C$ and $\mu I$ onto the nullspace of $\mathcal{A}$ are identical. Consequently,
\[ \langle C, X \rangle \le \langle C, I \rangle \text{ and } \mathcal{A}(X) = b \quad \Leftrightarrow \quad \mathrm{tr}(X) \le n \text{ and } \mathcal{A}(X) = b. \]
Hence, all $X$ which are both feasible for SDP and have better objective value than $I$ lie within the set $\{ X \succeq 0 : \mathrm{tr}(X) \le n \}$, a set which is contained within the ball of radius $n$ centered at $I$. Thus, all such $X$ satisfy
\[ \langle C, I \rangle - \langle C, X \rangle = \langle \pi(C), I - X \rangle \le \| \pi(C) \|\, n, \]
that is,
\[ \langle C, I \rangle - \mathrm{opt\_val} \le n\, \| \pi(C) \|. \qquad (23) \]
On the other hand, the feasible matrix $U_0 = I - \frac{1}{\lambda_{\max}(\pi(C))}\, \pi(C)$ lies distance at least 1 from $I$, because the unit ball centered at $I$ is contained in $\mathbb{S}^{n \times n}_+$ and because $U_0$ lies in the boundary of $\mathbb{S}^{n \times n}_+$. Hence,
\[ I - U_0 = \alpha\, \frac{\pi(C)}{\| \pi(C) \|} \quad \text{for some } \alpha \ge 1. \]
Consequently,
\[ \langle C, I \rangle - \langle C, U_0 \rangle = \langle \pi(C), I - U_0 \rangle = \alpha\, \| \pi(C) \| \ge \| \pi(C) \|. \qquad (24) \]
Combining (23) and (24) gives
\[ \frac{\langle C, I \rangle - \mathrm{opt\_val}}{\langle C, I \rangle - \langle C, U_0 \rangle} \le n. \]
Substitution into Theorem 4.2 completes the proof.
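The starting matrix $U_0 = I - \pi(C)/\lambda_{\max}(\pi(C))$ used in the corollary can be computed as in the following sketch. The function name `starting_point` is hypothetical, the constraint matrices $A_1, \ldots, A_m$ are assumed to be linearly independent symmetric matrices (so their Gram matrix is invertible), and $I$ is assumed feasible.

```python
import numpy as np

def starting_point(C, A_list):
    """Return U0 = I - pi(C) / lambda_max(pi(C)), where pi(C) is the orthogonal
    projection (trace inner product) of C onto {V : <A_i, V> = 0 for all i}.
    Assumption: the symmetric matrices in A_list are linearly independent."""
    n = C.shape[0]
    # Gram matrix G_ij = <A_i, A_j> and right-hand side <A_i, C>.
    G = np.array([[np.sum(Ai * Aj) for Aj in A_list] for Ai in A_list])
    rhs = np.array([np.sum(Ai * C) for Ai in A_list])
    y = np.linalg.solve(G, rhs)
    pi_C = C - sum(yi * Ai for yi, Ai in zip(y, A_list))   # remove the span{A_i} part
    lam_max = np.linalg.eigvalsh(pi_C)[-1]                 # eigenvalues are ascending
    if lam_max <= 0:
        # The ray I - t*pi(C) never leaves the cone, so SDP would be unbounded
        # (or pi(C) = 0, in which case all feasible points are optimal).
        raise ValueError("lambda_max(pi(C)) must be positive")
    return np.eye(n) - pi_C / lam_max
```

As a small check, with $C = \mathrm{diag}(1, 2, 3)$ and the single constraint $\mathrm{tr}(X) = 3$, the projection is $\pi(C) = \mathrm{diag}(-1, 0, 1)$ and the sketch returns $U_0 = \mathrm{diag}(2, 1, 0)$, a boundary point with objective value 4, below the value 6 attained at $I$.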

It is interesting to observe that for any fixed value of $\epsilon$, if one is able to model a family of optimization problems as semidefinite programs SDP($n$) parameterized by $n$ (the number of variables) in such a way that for every $n$, both $I_n$ is on (or near) the central path and, for some $p < 1/4$, $\mathrm{diam}(n) = O(n^p)$, then the iteration bound provided by the corollary is better than the best iteration bound established for interior-point methods, i.e., $O(\sqrt{n})$ iterations when $\epsilon$ is fixed. As each iteration of Subgradient Method is cheap relative to the cost of an interior-point method iteration, in this case NonSmoothed Scheme wins hands down. On the other hand, of course, if $n$ is held fixed and $\epsilon$ goes to zero, the bound $O(\log(1/\epsilon))$ on the number of iterations for interior-point methods is massively better than the bound $O(1/\epsilon^2)$ for NonSmoothed Scheme.

5. Starting Points $E \ne I$

In this section it is observed that the theory and algorithms from previous sections are readily converted to the case that the starting point is a strictly-feasible matrix $E \ne I$.

As in the remarks closing section 2, the relevant concave function is
\[ \lambda_{\min,E}(X) := \inf\{ \lambda \in \mathbb{R} : X - \lambda E \notin \mathbb{S}^{n \times n}_+ \} \qquad (25) \]
(the smallest eigenvalue of the matrix $E^{-1/2} X E^{-1/2}$, where $E^{1/2}$ is the positive definite matrix satisfying $E = E^{1/2} E^{1/2}$). As those remarks noted, the theory of that section is easily generalized, which for the present situation means replacing all occurrences of $\lambda_{\min}$ appearing in section 2 by $\lambda_{\min,E}$, while simultaneously replacing $I$ by $E$, assuming $E$ is strictly feasible.

The algorithms and theory from sections 3 and 4, however, are not so obviously extended. At issue is that unlike $X \mapsto \lambda_{\min}(X)$, the function $X \mapsto \lambda_{\min,E}(X)$ need not be Lipschitz continuous with constant 1, a fact that was critical in the proof of Theorem 3.1.
Thankfully, the issue is easily handled by changing from the trace inner product to the inner product on $\mathbb{S}^{n \times n}$ used in interior-point method theory:
\[ \langle U, V \rangle_E := \mathrm{tr}(E^{-1} U E^{-1} V) = \mathrm{tr}\big( (E^{-1/2} U E^{-1/2})(E^{-1/2} V E^{-1/2}) \big). \]
Thus, for example, when Subgradient Method computes a subgradient for iterate $X_k$, the subgradient should be with respect to $\langle \cdot, \cdot \rangle_E$ rather than with respect to $\langle \cdot, \cdot \rangle$. Likewise, when Subgradient Method projects a subgradient onto the subspace (18), the projection should be orthogonal with respect to $\langle \cdot, \cdot \rangle_E$. Finally, the value $\mathrm{diam}$ should be replaced by a value $\mathrm{diam}_E$ satisfying
\[ \mathrm{diam}_E \ge \max\{ \| X - Y \|_E : X, Y \in \mathrm{Level}_{\mathrm{val}} \} \quad \text{for all } \mathrm{val} < \langle C, E \rangle. \]
With these changes, all of the results of previous sections are valid with $E$ in place of $I$.

For linear programming, the resulting changes to Subgradient Method are quickly described. Letting $e$ denote the known strictly-feasible point (not necessarily the vector of all ones), and letting $\mathrm{val}$ be any value satisfying $\mathrm{val} < c^T e$, the problem

equivalent to LP is
\[ \max_x\; \min_j\, x_j / e_j \quad \text{s.t.} \quad Ax = b, \;\; c^T x = \mathrm{val}. \]
In applying Subgradient Method, the relevant inner product is $\langle u, v \rangle_e = \sum_j \frac{u_j v_j}{e_j^2}$. With respect to this inner product, the subgradients at $x \in \mathbb{R}^n$ are the convex combinations of the vectors $e^{(k)}$ for which $x_k / e_k = \min_j x_j / e_j$, where $e^{(k)}$ has all coordinates equal to zero except for the $k$th coordinate, which is equal to $e_k$. The $\langle \cdot, \cdot \rangle_e$-orthogonal projections of the vectors $e^{(j)}$ ($j = 1, \ldots, n$) onto the nullspace of $\bar{A} = \begin{bmatrix} A \\ c^T \end{bmatrix}$ are the columns of the matrix $P_e$ that $\langle \cdot, \cdot \rangle_e$-orthogonally projects $\mathbb{R}^n$ onto the nullspace, that is,
\[ P_e = I - \Delta(e)^2 \bar{A}^T \big( \bar{A}\, \Delta(e)^2 \bar{A}^T \big)^{-1} \bar{A}, \]
where $\Delta(e)$ is the diagonal matrix with $j$th diagonal entry equal to $e_j$. Thus, the projected subgradients relied upon by Subgradient Method are now the convex combinations of the columns of $P_e$.

Perhaps the easiest way to understand why all of the results of previous sections remain valid when $I$ is replaced by $E$ and $\langle \cdot, \cdot \rangle$ is replaced by $\langle \cdot, \cdot \rangle_E$ is to use the standard trick in the interior-point method literature of scaling SDP to an equivalent semidefinite program for which $I$ is feasible. The equivalent semidefinite program (in variable $Y$) is
\[ \min_Y\; \langle E^{1/2} C E^{1/2}, Y \rangle \quad \text{s.t.} \quad \mathcal{A}(E^{1/2} Y E^{1/2}) = b, \;\; Y \succeq 0. \qquad \text{(scaled SDP)} \]
The equivalence is seen by noting $X$ is feasible for SDP if and only if $Y = E^{-1/2} X E^{-1/2}$ is feasible for scaled SDP, and the objective value of $X$ for SDP is the same as the objective value of $Y$ for scaled SDP. Moreover, the inner product $\langle \cdot, \cdot \rangle_E$ is transformed into the trace inner product $\langle \cdot, \cdot \rangle$, in that $\langle X_1, X_2 \rangle_E = \langle Y_1, Y_2 \rangle$ for $Y_j = E^{-1/2} X_j E^{-1/2}$ ($j = 1, 2$). Thus, for example, the $\langle \cdot, \cdot \rangle_E$-diameter of level set $\mathrm{Level}_{\mathrm{val}}$(SDP) is the same as the $\langle \cdot, \cdot \rangle$-diameter of level set $\mathrm{Level}_{\mathrm{val}}$(scaled SDP).

Lastly, as is straightforward but tedious to verify, each of the algorithms transforms as well. For example, consider Subgradient Method, and fix two of the inputs, $R$ and $N$.
Assume the algorithm is applied with, E to solving the linearly-constrained problem equivalent to SDP: max λ min,e (X) s.t. A(X) = b C,X = val.

Then $X_0, X_1, \ldots, X_N$ is a possible resulting sequence if and only if $Y_0, Y_1, \ldots, Y_N$, where $Y_j := E^{-1/2} X_j E^{-1/2}$, is a possible resulting sequence when Subgradient Method is applied using the trace inner product to the problem equivalent to scaled SDP:

\[ \max_Y\ \lambda_{\min}(Y) \quad \text{s.t.} \quad \mathcal{A}(E^{1/2} Y E^{1/2}) = b,\ \ \langle E^{1/2} C E^{1/2}, Y\rangle = \mathrm{val}. \]

Applying any of the algorithms with strictly-feasible $E$ and with inner product $\langle\,,\rangle_E$ is, in other words, equivalent to scaling SDP, applying the algorithm with $I$ and the trace inner product, and then unscaling the answer. We choose to assume $I$ is feasible only to reduce notational clutter and make evident the simplicity of the main ideas. (Unfortunately, the trick of scaling does not generalize to hyperbolic programming, making the proofs in the forthcoming paper [5] necessarily more abstract than the ones here.)

In closing the section, we remark that the results throughout the paper can be developed just as readily for semidefinite programs of, say, the form

\[ \min_x\ c^T x \quad \text{s.t.} \quad \sum_{i=1}^m x_i A_i \succeq B. \]

One assumes a strictly feasible point $e$ is known, and relies upon the concave function $X \mapsto \lambda_{\min,E}(X)$ specified in (25), letting $E := \sum_{i=1}^m e_i A_i - B$. The equivalent problem solvable by a subgradient method is

\[ \max_x\ \lambda_{\min,E}\Big(\sum_{i=1}^m x_i A_i - B\Big) \quad \text{s.t.} \quad c^T x = \mathrm{val}, \]

for any value satisfying $\mathrm{val} < c^T e$. The relevant inner product at $e$ is

\[ \langle u, v\rangle_e := \Big\langle \sum_{i=1}^m u_i A_i,\ \sum_{i=1}^m v_i A_i \Big\rangle_E. \]

For the special case of a linear program $\min c^T x$ s.t. $Ax \ge b$, letting $\alpha_i^T$ denote the $i$th row of $A$ and letting $\bar{e} := Ae - b$ (the slack vector at $e$), the equivalent problem for $\mathrm{val} < c^T e$ is

\[ \max_x\ \min_i\ \frac{1}{\bar{e}_i}(\alpha_i^T x - b_i) \quad \text{s.t.} \quad c^T x = \mathrm{val}. \]

The relevant inner product is

\[ \langle u, v\rangle_e = u^T A^T\, \mathrm{Diag}(\bar{e})^{-2} A v, \]

where $\mathrm{Diag}(\bar{e})$ is the diagonal matrix with $i$th diagonal entry equal to $\bar{e}_i$.
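To make the linear programming case concrete, the following is a minimal sketch of the objective $\min_j x_j/e_j$, the inner product $\langle u,v\rangle_e = \sum_j u_j v_j/e_j^2$, and one subgradient (in the concave sense) with respect to that inner product, for the equality-constrained form $Ax = b$, $x \ge 0$. The function names are hypothetical and chosen only for illustration; projection onto the constraint nullspace is deliberately left out.

```python
def objective(x, e):
    """Concave objective of the equivalent LP: min_j x_j / e_j."""
    return min(xj / ej for xj, ej in zip(x, e))


def inner_e(u, v, e):
    """Inner product <u, v>_e = sum_j u_j * v_j / e_j**2."""
    return sum(uj * vj / ej ** 2 for uj, vj, ej in zip(u, v, e))


def subgradient(x, e):
    """One subgradient (concave sense) with respect to <,>_e.

    It is the vector e^(k): zero everywhere except for the value e_k
    at an index k attaining min_j x_j / e_j.
    """
    ratios = [xj / ej for xj, ej in zip(x, e)]
    k = ratios.index(min(ratios))
    g = [0.0] * len(x)
    g[k] = e[k]
    return g


# Illustrative data (not from the paper).
e = [1.0, 2.0, 4.0]
x = [0.5, 0.4, 3.0]
y = [0.3, 1.0, 1.0]

g = subgradient(x, e)
diff = [yj - xj for yj, xj in zip(y, x)]
# Concavity: f(y) <= f(x) + <g, y - x>_e for every y.
upper = objective(x, e) + inner_e(g, diff, e)
```

For the data above, the minimum ratio is attained at the second coordinate, so the subgradient is $e^{(2)}$ and `upper` bounds `objective(y, e)` from above, as concavity requires.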

6. Corollaries for Nesterov's First First-Order Method

Assume SDP has an optimal solution and $I$ is feasible. (In exactly the same manner as previous results generalize to an arbitrary strictly-feasible initial matrix $E$, so do all of the remaining results.)

Recall that given $\epsilon > 0$ and a value val satisfying $\mathrm{val} < \langle C, I\rangle$, we wish to approximately solve the SDP equivalent problem

\[ \max_X\ \lambda_{\min}(X) \quad \text{s.t.} \quad \mathcal{A}(X) = b,\ \ \langle C, X\rangle = \mathrm{val}, \tag{26} \]

where by "approximately solve" we mean that feasible $X$ is computed for which

\[ \lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X) \le \epsilon' \tag{27} \]

with $\epsilon'$ satisfying

\[ \epsilon' \le \frac{\epsilon}{1 - \epsilon} \cdot \frac{\langle C, I\rangle - \mathrm{val}}{\langle C, I\rangle - \mathrm{opt\_val}}. \]

Indeed, according to Proposition 2.4, the projection $Z = Z(X)$ will then satisfy

\[ \frac{\langle C, Z\rangle - \mathrm{opt\_val}}{\langle C, I\rangle - \mathrm{opt\_val}} \le \epsilon. \]

With [3], Nesterov initiated a huge wave of research by showing that some significant non-smooth optimization problems can be efficiently solved by smoothing the problem and then applying optimal first-order methods for smooth functions. In [4], he extended the approach to include some problems within the domain of semidefinite programming. There he gave emphasis to the nonsmooth convex objective function $X \mapsto \lambda_{\max}(X)$, but the results trivially adapt to the concave function of interest to us, $X \mapsto \lambda_{\min}(X)$.

For the nonsmooth concave function $X \mapsto \lambda_{\min}(X)$, the useful smoothing is

\[ f_\mu(X) := -\mu \ln \sum_j e^{-\lambda_j(X)/\mu}, \]

where $\mu > 0$ is user-chosen, and where $\lambda_1(X), \ldots, \lambda_n(X)$ are the eigenvalues of $X$. For motivation, observe that for all $X \in \mathbb{S}^{n\times n}$,

\[ \lambda_{\min}(X) - \mu \ln n \le f_\mu(X) \le \lambda_{\min}(X). \tag{28} \]

Not obvious, but proved by Nesterov, is that

\[ \|\nabla f_\mu(X) - \nabla f_\mu(Y)\| \le \frac{1}{\mu} \|X - Y\|, \]

where $\|\cdot\|$ is the Frobenius norm. That is, the gradient of $f_\mu$ is Lipschitz continuous, with constant $1/\mu$.

The smoothed version of (26) is

\[ \max_X\ f_\mu(X) \quad \text{s.t.} \quad \mathcal{A}(X) = b,\ \ \langle C, X\rangle = \mathrm{val}. \tag{29} \]

Let $X^*_{\mathrm{val}}(\mu)$ denote an optimal solution (which the reader should be careful to distinguish from $X^*_{\mathrm{val}}$, an optimal solution for (26)).
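The two-sided bound (28) follows from a short calculation, sketched here as a worked derivation using only the definition of $f_\mu$ and the fact that $\lambda_{\min}(X)$ is the smallest of the $n$ eigenvalues:

```latex
% Each of the n terms is at most e^{-\lambda_{\min}(X)/\mu},
% and the term for the smallest eigenvalue equals it, so
\[
  e^{-\lambda_{\min}(X)/\mu}
  \;\le\; \sum_{j=1}^{n} e^{-\lambda_j(X)/\mu}
  \;\le\; n\, e^{-\lambda_{\min}(X)/\mu}.
\]
% Taking logarithms and multiplying by -\mu < 0 reverses the inequalities:
\[
  \lambda_{\min}(X)
  \;\ge\; -\mu \ln \sum_{j=1}^{n} e^{-\lambda_j(X)/\mu}
  \;\ge\; \lambda_{\min}(X) - \mu \ln n ,
\]
% which is exactly
% \lambda_{\min}(X) - \mu \ln n \le f_\mu(X) \le \lambda_{\min}(X).
```

The left inequality in (28) is tight when all eigenvalues coincide, and the right one is approached as the gap between the smallest eigenvalue and the rest grows relative to $\mu$.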

In passing, we note that the gradient of $f_\mu$ at $X$ is the matrix

\[ \nabla f_\mu(X) = \frac{1}{\sum_j e^{-\lambda_j(X)/\mu}}\, Q\, \mathrm{Diag}\big(e^{-\lambda_1(X)/\mu}, \ldots, e^{-\lambda_n(X)/\mu}\big)\, Q^T, \]

where $X = Q\, \mathrm{Diag}(\lambda_1(X), \ldots, \lambda_n(X))\, Q^T$ is an eigendecomposition of $X$. For the special case of linear programming, the function $f_\mu$ becomes

\[ f_\mu(x) = -\mu \ln \sum_j e^{-x_j/\mu}, \]

for which the gradient at $x$ is the vector with $j$th coordinate equal to

\[ \nabla f_\mu(x)_j = \frac{e^{-x_j/\mu}}{\sum_k e^{-x_k/\mu}}. \]

It is readily seen from (28) that for any value $\epsilon' > 0$ and for all $X \in \mathbb{S}^{n\times n}$, if $\mu = \tfrac{1}{2}\epsilon'/\ln(n)$, then

\[ f_\mu(X^*_{\mathrm{val}}(\mu)) - f_\mu(X) \le \tfrac{1}{2}\epsilon' \ \Rightarrow\ \lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X) \le \epsilon'. \]

Consequently, in order to compute $X$ which is feasible for (26) and satisfies (27), it suffices to fix $\mu = \tfrac{1}{2}\epsilon'/\ln(n)$ and compute $X$ which is feasible for (29) and has objective value within $\epsilon'/2$ of $f_\mu(X^*_{\mathrm{val}}(\mu))$.

Since the objective function in (29) is smooth and the only constraints are linear equations, we can apply many first-order methods. It is only fitting that we rely on the original optimal first-order method for smooth functions, due to Nesterov [1], which we refer to as Nesterov's first first-order method, or for brevity, Nesterov's first method.

Letting $X_0$ denote an initial feasible point, then according to a theorem in [2], Nesterov's first method produces a sequence of iterates $X_0, X_1, \ldots$ satisfying

\[ f_\mu(X^*_{\mathrm{val}}(\mu)) - f_\mu(X_k) \le \frac{4}{\mu (k+2)^2}\, \|X_0 - X^*_{\mathrm{val}}(\mu)\|^2 \]

(where we have used the fact that the Lipschitz constant for the gradient is $1/\mu$). Thus, $f_\mu(X_k)$ is within $\epsilon'/2$ of the optimal value if

\[ k \ge \sqrt{\frac{8}{\mu \epsilon'}}\, \|X_0 - X^*_{\mathrm{val}}(\mu)\| - 2, \]

that is, if

\[ k \ge 4 \sqrt{\ln n}\, \|X_0 - X^*_{\mathrm{val}}(\mu)\| / \epsilon' - 2 \]

(using $\mu = \tfrac{1}{2}\epsilon'/\ln(n)$). We summarize these results in a theorem.

Theorem 6.1. (Nesterov) Let $\epsilon'$ be any positive value, and $\mu = \tfrac{1}{2}\epsilon'/\ln(n)$. Assume $X$ satisfies $\mathcal{A}(X) = b$ and $\langle C, X\rangle = \mathrm{val} < \langle C, I\rangle$. If Nesterov's first first-order method is applied to solving the smoothed problem (29), then the resulting iterates $X_0 = X, X_1, X_2, \ldots$ satisfy

\[ k \ge 4 \sqrt{\ln n}\, \|X - X^*_{\mathrm{val}}(\mu)\| / \epsilon' - 2 \ \Rightarrow\ \lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X_k) \le \epsilon'. \]
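For the linear programming case, the smoothing $f_\mu(x) = -\mu \ln \sum_j e^{-x_j/\mu}$ and its gradient can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the shift by $\min_j x_j$ is a standard numerical-stability device that is an addition here, not something the text prescribes.

```python
import math


def f_mu(x, mu):
    """Smoothed minimum: -mu * ln(sum_j exp(-x_j / mu)).

    Factoring exp(-min(x)/mu) out of the sum avoids overflow when mu is
    small; the value is mathematically unchanged.
    """
    m = min(x)
    return m - mu * math.log(sum(math.exp(-(xj - m) / mu) for xj in x))


def grad_f_mu(x, mu):
    """Gradient of f_mu: coordinate j is exp(-x_j/mu) / sum_k exp(-x_k/mu)."""
    m = min(x)
    w = [math.exp(-(xj - m) / mu) for xj in x]
    s = sum(w)
    return [wj / s for wj in w]


# Illustrative data (not from the paper).
x = [0.3, 0.5, 2.0]
eps_prime = 0.1
n = len(x)
mu = 0.5 * eps_prime / math.log(n)   # the choice mu = (1/2) eps' / ln(n)

value = f_mu(x, mu)
grad = grad_f_mu(x, mu)
```

With this choice of $\mu$, the value of `f_mu` lies within $\epsilon'/2$ below $\min_j x_j$, which is the content of (28), and the gradient is a probability vector concentrated on the coordinates closest to the minimum.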

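Recall that diam is assumed to be a known quantity satisfying

\[ \mathrm{diam} \ge \max\{\|X - Y\| : X, Y \in \mathrm{Level}_{\mathrm{val}}\} \quad \text{for all } \mathrm{val} < \langle C, I\rangle. \]

Corollary 6.2. Assume $X$ is strictly feasible for SDP and $\langle C, X\rangle = \mathrm{val} < \langle C, I\rangle$. For any value $0 < \epsilon' \le 2\lambda_{\min}(X)$, by letting $\mu := \tfrac{1}{2}\epsilon'/\ln(n)$, if Nesterov's first first-order method is applied to solving the smoothed problem (29), then the resulting iterates $X_0 = X, X_1, X_2, \ldots$ satisfy

\[ k \ge 4 \sqrt{\ln n}\, \mathrm{diam}/\epsilon' - 2 \ \Rightarrow\ \lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X_k) \le \epsilon'. \]

Proof: Assume $X$, $\epsilon'$ and $\mu$ are as in the statement. Then $f_\mu(X) \ge 0$, by (28) and $\mu \le \lambda_{\min}(X)/\ln(n)$. Hence, $f_\mu(X^*_{\mathrm{val}}(\mu)) \ge 0$ (because $f_\mu(X^*_{\mathrm{val}}(\mu)) \ge f_\mu(X)$). Also by (28), $Y \notin \mathbb{S}^{n\times n}_+ \Rightarrow f_\mu(Y) < 0$. Thus, $X^*_{\mathrm{val}}(\mu)$ is feasible for SDP. Hence, $\|X - X^*_{\mathrm{val}}(\mu)\| \le \mathrm{diam}$. Substitution into Theorem 6.1 completes the proof.

Corollary 6.3. Assume $X$ is feasible for SDP and satisfies

\[ \lambda_{\min}(X) \ge \tfrac{1}{6} \quad \text{and} \quad \frac{\langle C, I\rangle - \mathrm{val}}{\langle C, I\rangle - \mathrm{opt\_val}} \ge \tfrac{1}{3}, \tag{30} \]

where $\mathrm{val} := \langle C, X\rangle$. Let $0 < \epsilon < 1$. For $\mu = \tfrac{1}{6}\epsilon/\ln(n)$, if Nesterov's first first-order method is applied to solving the smoothed problem (29), then the resulting iterates $X_0 = X, X_1, X_2, \ldots$ satisfy

\[ k \ge 12 \sqrt{\ln n}\, \mathrm{diam}/\epsilon - 2 \ \Rightarrow\ \frac{\langle C, Z(X_k)\rangle - \mathrm{opt\_val}}{\langle C, I\rangle - \mathrm{opt\_val}} \le \epsilon. \]

Proof: Let $\epsilon' := \epsilon/3 < 1/3$. Then, by assumption, $\epsilon' \le 2\lambda_{\min}(X)$, and hence Corollary 6.2 can be applied with

\[ \mu = \tfrac{1}{2}\epsilon'/\ln(n) = \tfrac{1}{6}\epsilon/\ln(n), \]

giving

\[ k \ge 12 \sqrt{\ln n}\, \mathrm{diam}/\epsilon - 2 \ \Rightarrow\ \lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X_k) \le \frac{\epsilon}{3} \ \Rightarrow\ \lambda_{\min}(X^*_{\mathrm{val}}) - \lambda_{\min}(X_k) \le \frac{\epsilon}{1 - \epsilon} \cdot \frac{\langle C, I\rangle - \mathrm{val}}{\langle C, I\rangle - \mathrm{opt\_val}}, \]

where the last implication is due to the rightmost inequality assumed in (30). Invoking Proposition 2.4 completes the proof.

7. The Smoothed Scheme

The presentation of the smoothed scheme is done in the same manner as the presentation of NonSmoothed Scheme in section 4, but now beginning with the following question: Is it possible to efficiently move from an initial matrix $U_0$ satisfying $\mathcal{A}(U_0) = b$ and $\langle C, U_0\rangle < \langle C, I\rangle$, to a matrix $Y$ satisfying the conditions of Corollary 6.3?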
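The parameter choices in Corollary 6.3 are simple enough to compute directly. The sketch below evaluates $\mu = \tfrac{1}{6}\epsilon/\ln(n)$ and the smallest integer $k$ with $k \ge 12\sqrt{\ln n}\,\mathrm{diam}/\epsilon - 2$; the function names are hypothetical, and rounding up to an integer (and clamping at zero) is an assumption made so the result is a usable iteration count.

```python
import math


def smoothing_parameter(eps, n):
    """mu = (1/6) * eps / ln(n), as in Corollary 6.3."""
    return eps / (6.0 * math.log(n))


def iterations_needed(eps, n, diam):
    """Smallest nonnegative integer k with k >= 12*sqrt(ln n)*diam/eps - 2."""
    threshold = 12.0 * math.sqrt(math.log(n)) * diam / eps - 2.0
    return max(0, math.ceil(threshold))


# Illustrative values (not from the paper).
n, diam = 100, 10.0
k_coarse = iterations_needed(0.1, n, diam)
k_fine = iterations_needed(0.01, n, diam)
```

Dividing $\epsilon$ by ten multiplies the count by roughly ten, reflecting the $O(1/\epsilon)$ dependence, whereas the dimension enters only through $\sqrt{\ln n}$ and diam.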

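Before providing an affirmative answer, for ease of reference we again display the pertinent optimization problems:

\[ \max_X\ \lambda_{\min}(X) \quad \text{s.t.} \quad \mathcal{A}(X) = b,\ \ \langle C, X\rangle = \mathrm{val}, \tag{31} \]

\[ \max_X\ f_\mu(X) \quad \text{s.t.} \quad \mathcal{A}(X) = b,\ \ \langle C, X\rangle = \mathrm{val}. \tag{32} \]

Recall that $X^*_{\mathrm{val}}$ denotes any optimal solution for (31), a problem equivalent to SDP (assuming $\mathrm{val} < \langle C, I\rangle$), whereas $X^*_{\mathrm{val}}(\mu)$ denotes an optimal solution for (32), the smoothed version of (31). Consider the following computational procedure:

Smoothed SubScheme

(0) Initiation: Input: A matrix $U_0$ that is feasible for SDP and satisfies both $\langle C, U_0\rangle < \langle C, I\rangle$ and $\lambda_{\min}(U_0) = 1/6$. If a matrix $U$ is available satisfying only $\mathcal{A}(U) = b$ and $\langle C, U\rangle < \langle C, I\rangle$, then $U_0 = I + \tfrac{5}{6} \cdot \tfrac{1}{1 - \lambda_{\min}(U)} (U - I)$ is acceptable input. Let $\mathrm{val}_0 = \langle C, U_0\rangle$. Let $\mu := 1/(6 \ln n)$ and $N := \lceil 12 \sqrt{\ln n}\, \mathrm{diam} - 2 \rceil$. Let $l = -1$.

(1) Outer Iteration Counter Step: Let $l + 1 \to l$.

(2) Inner Iterations: Beginning at $U_l$, apply $N$ iterations of Nesterov's first first-order method to the smoothed problem (32), with $\mathrm{val} := \mathrm{val}_l$. Let $V_l$ denote the final iterate.

(3) Check for Termination: If $\lambda_{\min}(V_l) \le 1/3$, then output $Y = U_l$ and terminate. Else, compute
\[ U_{l+1} := I + \tfrac{5}{6} \cdot \frac{1}{1 - \lambda_{\min}(V_l)} (V_l - I), \qquad \mathrm{val}_{l+1} := \langle C, U_{l+1}\rangle, \]
and go to Step 1.

Proposition 7.1. Smoothed SubScheme terminates with a matrix $Y$ which is feasible for SDP and satisfies

\[ \lambda_{\min}(Y) = \tfrac{1}{6} \quad \text{and} \quad \frac{\langle C, I\rangle - \mathrm{val}}{\langle C, I\rangle - \mathrm{opt\_val}} \ge \tfrac{1}{3}, \tag{33} \]

where $\mathrm{val} := \langle C, Y\rangle$. Moreover, the number of outer iterations does not exceed

\[ \log_{5/4}\left( \frac{\langle C, I\rangle - \mathrm{opt\_val}}{\langle C, I\rangle - \mathrm{val}_0} \right), \]

where $\mathrm{val}_0 = \langle C, U_0\rangle$ and $U_0$ is the input matrix.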
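The rescaling used in Steps 0 and 3 can be illustrated in the diagonal (linear programming) setting, under the assumption that $I$ is replaced by the all-ones vector and $\lambda_{\min}$ by the smallest coordinate. The sketch below is only that special case, with hypothetical names; it shows that the rescaled point has smallest coordinate $1/6$ and that the gap $\langle C, I\rangle - \mathrm{val}$ grows by the factor $\tfrac{5}{6}/(1 - \lambda_{\min}(V))$.

```python
def rescale(v, target=1.0 / 6.0):
    """Move v along the line through the all-ones vector so that the
    smallest coordinate of the result equals `target`.

    This is the vector analogue of U = I + (5/6) / (1 - lambda_min(V)) * (V - I)
    when target = 1/6.
    """
    lam = min(v)
    if lam >= 1.0:
        raise ValueError("requires min(v) < 1")
    t = (1.0 - target) / (1.0 - lam)
    return [1.0 + t * (vj - 1.0) for vj in v], t


def dot(c, x):
    return sum(cj * xj for cj, xj in zip(c, x))


# Illustrative data (not from the paper): min(v) = 0.5 > 1/3,
# so Step 3 would not terminate and would rescale instead.
c = [1.0, 2.0, 3.0]
ones = [1.0, 1.0, 1.0]
v = [0.5, 1.2, 0.9]

u, t = rescale(v)
gap_before = dot(c, ones) - dot(c, v)
gap_after = dot(c, ones) - dot(c, u)
```

Because the objective is linear, `gap_after` equals `t * gap_before`, and whenever the smallest coordinate of `v` exceeds $1/3$ the factor `t` exceeds $5/4$, which is the geometric growth that drives the outer iteration bound.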

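Proof: The proof, especially the last half, parallels that of Proposition 4.1. Nonetheless, we include the proof in its entirety.

It is easily verified that all of the matrices $U_l$ and $V_l$ computed by Smoothed SubScheme satisfy the SDP equations $\mathcal{A}(X) = b$. Moreover, $U_l$ is feasible for SDP, because $\lambda_{\min}(U_l) = 1/6$ (by construction).

Fix $l \ge 0$ to be a value attained by the counter. We now examine the effects of Steps 2 and 3. Applying Corollary 6.2 with $\epsilon' = 1/3$ shows that in Step 2, the final iterate $X_N$ computed by Nesterov's first method satisfies $\lambda_{\min}(X^*_{\mathrm{val}_l}) - \lambda_{\min}(X_N) \le 1/3$, that is,

\[ \lambda_{\min}(X^*_{\mathrm{val}_l}) - \lambda_{\min}(V_l) \le 1/3. \tag{34} \]

Observe that

\[ \frac{\langle C, I\rangle - \mathrm{val}_l}{\langle C, I\rangle - \mathrm{opt\_val}} = 1 - \lambda_{\min}(X^*_{\mathrm{val}_l}) \ \text{(by Lemma 2.3)} \ \ge \tfrac{2}{3} - \lambda_{\min}(V_l) \ \text{(by (34))}. \]

Hence, if the method terminates in Step 3, that is, if $\lambda_{\min}(V_l) \le 1/3$, then the output matrix $Y = U_l$ satisfies

\[ \frac{\langle C, I\rangle - \mathrm{val}}{\langle C, I\rangle - \mathrm{opt\_val}} \ge \tfrac{1}{3}, \]

where $\mathrm{val} := \langle C, Y\rangle = \mathrm{val}_l$. We have now verified that if Smoothed SubScheme terminates, then the output $Y$ is indeed feasible for SDP and satisfies the inequalities (33).

On the other hand, if the method does not terminate in Step 3, it computes the matrix $U_{l+1}$ and its objective value, $\mathrm{val}_{l+1}$. Here, observe

\[ \mathrm{val}_{l+1} = \langle C, I\rangle + \tfrac{5}{6} \cdot \frac{1}{1 - \lambda_{\min}(V_l)} (\mathrm{val}_l - \langle C, I\rangle) \le \langle C, I\rangle - \tfrac{5}{4} (\langle C, I\rangle - \mathrm{val}_l), \]

because $\mathrm{val}_l < \langle C, I\rangle$ and $\lambda_{\min}(V_l) > 1/3$ (due to no termination). Hence,

\[ \langle C, I\rangle - \mathrm{val}_{l+1} \ge \tfrac{5}{4} (\langle C, I\rangle - \mathrm{val}_l). \]

Since all values $\mathrm{val}_l$ computed by the algorithm satisfy $\mathrm{val}_l \ge \mathrm{opt\_val}$ (because $U_l$ is feasible for SDP), it immediately follows that

\[ \log_{5/4}\left( \frac{\langle C, I\rangle - \mathrm{opt\_val}}{\langle C, I\rangle - \mathrm{val}_0} \right) \]

is an upper bound on the number of outer iterations.

Specifying our scheme based on Nesterov's first method, and analyzing the scheme's complexity, are both now easily accomplished: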

Smoothed Scheme

(0) Input: A value $0 < \epsilon < 1$, and $U_0 \in \mathbb{S}^{n\times n}$ satisfying $\mathcal{A}(U_0) = b$ and $\lambda_{\min}(U_0) = 1/6$. For example, the matrix $U_0 = I - \tfrac{5}{6} \cdot \tfrac{1}{\lambda_{\max}(\pi(C))}\, \pi(C)$.

(1) Apply Smoothed SubScheme with input $U_0$. Let $Y$ denote the output.

(2) Beginning at $Y$, apply $\lceil 12 \sqrt{\ln n}\, \mathrm{diam}/\epsilon - 2 \rceil$ iterations of Nesterov's first first-order method to the smoothed problem (32), with $\mathrm{val} := \langle C, Y\rangle$. Let $X$ denote the output.

(3) Compute and output the projection $Z = Z(X)$, then terminate.

Theorem 7.2. Assume $I$ is feasible. Smoothed Scheme outputs $Z$ which is feasible for SDP and satisfies

\[ \frac{\langle C, Z\rangle - \mathrm{opt\_val}}{\langle C, I\rangle - \mathrm{opt\_val}} \le \epsilon. \tag{35} \]

The total number of iterations of Nesterov's first first-order method is bounded above by

\[ 12 \sqrt{\ln n}\, \mathrm{diam} \left( \frac{1}{\epsilon} + \log_{5/4}\left( \frac{\langle C, I\rangle - \mathrm{opt\_val}}{\langle C, I\rangle - \mathrm{val}_0} \right) \right), \]

where $\mathrm{val}_0 := \langle C, U_0\rangle$ and $U_0$ is the input matrix.

Proof: Proposition 7.1 shows the matrix $Y$ in Step 1 satisfies the conditions of Corollary 6.3, which in turn shows the final output $Z = Z(X)$ from Step 3 is feasible for SDP and satisfies (35). The bound on the total number of iterations of Nesterov's first method is a straightforward consequence of the outer iteration bound from Proposition 7.1, the choice for $N$ in Smoothed SubScheme, and the number of iterations in Step 2 of Smoothed Scheme.

The proof of the following corollary proceeds exactly as does the proof of Corollary 4.3. The added value 1 is due to $\lambda_{\min}(U_0) = 1/6$ in Smoothed Scheme, unlike $\lambda_{\min}(U_0) = 0$ in NonSmoothed Scheme, and $\log_{5/4}\big(\tfrac{1}{1 - \lambda_{\min}(U_0)}\big) = \log_{5/4}(6/5) < 1$.

Corollary 7.3. If $I$ is on the central path and $U_0 = I - \tfrac{5}{6} \cdot \tfrac{1}{\lambda_{\max}(\pi(C))}\, \pi(C)$, then the same conclusions as in Theorem 7.2 apply but now with the number of iterations of Nesterov's first first-order method bounded above by

\[ 12 \sqrt{\ln n}\, \mathrm{diam} \left( \frac{1}{\epsilon} + \log_{5/4}(n) + 1 \right). \]

8. Closing Remarks

Similar to the observation immediately following Corollary 4.3, we see from Corollary 7.3 that for fixed $\epsilon$, if one models a family of problems as semidefinite programs SDP($n$) where $I_n$ is on (or near) the central path and for which there exists $p < 1/2$ satisfying $\mathrm{diam}(n) = O(n^p)$, then Smoothed Scheme wins hands down over interior-point methods even on iteration count, let alone on total cost. Interior-point methods, of course, win hands down if $n$ is fixed and $\epsilon$ goes to zero.
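The iteration bound of Theorem 7.2 combines the outer-iteration bound of Proposition 7.1 with the inner iteration counts. A minimal sketch of evaluating it follows; the function and argument names are hypothetical, and in practice opt_val is unknown, so the numbers below serve only to show how the two terms contribute.

```python
import math


def outer_iteration_bound(c_dot_i, val0, opt_val):
    """log_{5/4}((<C,I> - opt_val) / (<C,I> - val0)), from Proposition 7.1."""
    return math.log((c_dot_i - opt_val) / (c_dot_i - val0), 1.25)


def total_iteration_bound(eps, n, diam, c_dot_i, val0, opt_val):
    """12*sqrt(ln n)*diam*(1/eps + outer bound), from Theorem 7.2."""
    outer = outer_iteration_bound(c_dot_i, val0, opt_val)
    return 12.0 * math.sqrt(math.log(n)) * diam * (1.0 / eps + outer)


# Illustrative values (not from the paper): the initial gap <C,I> - val0
# is one tenth of the full gap <C,I> - opt_val.
outer = outer_iteration_bound(c_dot_i=10.0, val0=9.0, opt_val=0.0)
total = total_iteration_bound(0.01, 100, 10.0, 10.0, 9.0, 0.0)
```

For small $\epsilon$ the $1/\epsilon$ term dominates: closing an initial gap ratio of ten costs only about ten outer iterations' worth of work, compared with the hundred units contributed by $1/\epsilon$ at $\epsilon = 0.01$.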

For fixed $n$, our iteration bound of $O(1/\epsilon)$ is much worse than the bound $O(1/\sqrt{\epsilon})$ found in literature related to compressed sensing, where problems can be reduced to ones involving only the feasible regions $\{x \ge 0 : \sum_j x_j = 1\}$ or $\{X \succeq 0 : \mathrm{tr}(X) = 1\}$, for which tractable prox functions are known. However, the approaches used there fail upon including additional constraints, such as requiring $X$ to satisfy a specific sparsity pattern, as is relevant in statistics for fitting a concentration matrix (the inverse of a covariance matrix) to empirical data, and as is relevant in some applications of semidefinite programming to combinatorial problems pertaining to graphs. Among the obstacles is that tractable proximal operators remain unknown except for an extremely small universe of sets.

In this vein, we note that for the algorithms herein, imposing a specific sparsity pattern actually reduces work, assuming the known feasible matrix is $E = I$. Indeed, assuming the sparsity pattern is $X_{ij} = 0$ for $(i,j) \in \mathrm{Zeros}$, and using $\langle A_k, X\rangle = b_k$ ($k = 1, \ldots, l$) to denote the remaining constraints, projecting a subgradient $G$ is accomplished by overwriting by 0 the entries $G_{i,j}$ for $(i,j) \in \mathrm{Zeros}$, and orthogonally projecting the resulting matrix onto the subspace

\[ \{V : \langle \bar{C}, V\rangle = 0 \text{ and } \langle \bar{A}_k, V\rangle = 0 \text{ for } k = 1, \ldots, l\}, \]

where $\bar{C}$ (resp., $\bar{A}_k$) is the matrix obtained by overwriting by 0 the $(i,j)$th coordinate of $C$ (resp., $A_k$), for $(i,j) \in \mathrm{Zeros}$. If $l$, the number of complicating constraints, is small and the set Zeros is large, this approach results in significant computational savings per iteration. Moreover, the resulting iteration cost is very cheap relative to the cost of an iteration of an interior-point method, where sparsity constraints must be handled like any other constraints, due to the inner product $\langle\,,\rangle_{X_k}$ being dependent on the iterate, unlike first-order methods where the inner product $\langle\,,\rangle = \langle\,,\rangle_I$ is held fixed throughout, an inner product for which sparsity constraints are ideally structured.
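The projection just described is easy to sketch in the simplest case $l = 0$, where the sparsity pattern and $\langle C, X\rangle = \mathrm{val}$ are the only constraints. The code below is an illustration under that assumption, with hypothetical names and matrices stored as nested lists; with $l > 0$ one would additionally solve a small $(l+1) \times (l+1)$ linear system.

```python
def project_with_sparsity(G, C, zeros):
    """Project G onto {V : V_ij = 0 on `zeros`, <C_bar, V> = 0} (case l = 0).

    `zeros` is a set of (i, j) index pairs, assumed symmetric for
    symmetric matrices. The inner product is the trace inner product,
    i.e. the sum of entrywise products.
    """
    n = len(G)
    # Overwrite the entries in the sparsity pattern by 0.
    G_bar = [[0.0 if (i, j) in zeros else G[i][j] for j in range(n)]
             for i in range(n)]
    C_bar = [[0.0 if (i, j) in zeros else C[i][j] for j in range(n)]
             for i in range(n)]
    cc = sum(C_bar[i][j] ** 2 for i in range(n) for j in range(n))
    if cc == 0.0:
        return G_bar  # the constraint <C_bar, V> = 0 is vacuous
    cg = sum(C_bar[i][j] * G_bar[i][j] for i in range(n) for j in range(n))
    t = cg / cc
    # Remove the component along C_bar; the zero pattern is preserved
    # because C_bar vanishes on `zeros`.
    return [[G_bar[i][j] - t * C_bar[i][j] for j in range(n)]
            for i in range(n)]


# Illustrative data (not from the paper).
G = [[1.0, 2.0], [2.0, 3.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
zeros = {(0, 1), (1, 0)}
P = project_with_sparsity(G, C, zeros)
```

The work is proportional to the number of entries outside Zeros, which is the source of the per-iteration savings noted above.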
Additional examples of the relevance of the algorithms and results, and their extensions to hyperbolic programming, are given in the forthcoming paper [5].

References

[1] Yurii Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2), 1983.
[2] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer.
[3] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152.
[4] Yurii Nesterov. Smoothing technique and its applications in semidefinite optimization. Mathematical Programming, 110(2).
[5] James Renegar. Efficient first-order methods for hyperbolic programming. In preparation.

School of Operations Research and Information Engineering, Cornell University, Ithaca, NY, U.S.


ACCELERATED FIRST-ORDER METHODS FOR HYPERBOLIC PROGRAMMING JAMES RENEGAR Abstract. A framework is developed for applying accelerated methods to general hyperbolic programming, including linear, second-order