Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles

Arkadi Nemirovski
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

Joint research with Anatoli Juditsky, University J. Fourier, Grenoble

London Optimization Workshop, King's College, London, June 9-10, 2014
Overview

• Problems of interest and motivation
• Linear Minimization Oracles and the classical Conditional Gradient Algorithm
• Fenchel-type representations and LMO-based Convex Optimization:
  - nonsmooth convex minimization
  - variational inequalities with monotone operators and convex-concave saddle points
Motivation

Problem of Interest: Variational Inequality with monotone operator

  Find $x_* \in X$: $\langle \Phi(x), x - x_* \rangle \ge 0\;\;\forall x \in X$    VI$(\Phi, X)$

• $X$: convex compact subset of Euclidean space $E$
• $\Phi: X \to E$ is monotone: $\langle \Phi(x) - \Phi(x'), x - x' \rangle \ge 0\;\;\forall x, x' \in X$

Examples:
• Convex Minimization: $\Phi(x) \in \partial f(x)$, $x \in X$, for a convex Lipschitz continuous function $f: X \to \mathbb{R}$ ⟹ solutions to VI$(\Phi, X)$ are exactly the minimizers of $f$ on $X$.
• Convex-Concave Saddle Points: $X = U \times V$, $\Phi(u, v) = [\Phi_u(u, v) \in \partial_u f(u, v);\; \Phi_v(u, v) \in \partial_v[-f(u, v)]]$ for a convex-concave Lipschitz continuous $f(u, v): X \to \mathbb{R}$ ⟹ solutions to VI$(\Phi, X)$ are exactly the saddle points of $f$ on $U \times V$.

When problem sizes make Interior Point algorithms prohibitively time consuming, First Order Methods (FOMs) become the methods of choice. Reasons: under favorable circumstances, FOMs (a) have cheap steps and (b) exhibit nearly dimension-independent sublinear convergence rates.
Note: Perhaps one could survive without (b), but (a) is a must!
Proximal FOMs

  Find $x_* \in X$: $\langle \Phi(x), x - x_* \rangle \ge 0\;\;\forall x \in X$    VI$(\Phi, X)$

• $X$: convex compact subset of Euclidean space $E$
• $\Phi: X \to E$ is monotone: $\langle \Phi(x) - \Phi(x'), x - x' \rangle \ge 0\;\;\forall x, x' \in X$

Fact: Most FOMs for large-scale convex optimization (Subgradient Descent, Mirror Descent, Nesterov's Fast Gradient Methods, ...) are proximal algorithms. To allow for proximal methods with cheap iterations, $X$ should admit a cheap proximal setup: a $C^1$ strongly convex distance-generating function (d.g.f.) $\omega(\cdot): X \to \mathbb{R}$ leading to easy-to-compute Bregman projections

  $e \mapsto \arg\min_{x \in X}\,[\omega(x) + \langle e, x \rangle]$

Note: If $X$ admits a cheap proximal setup, then $X$ admits a cheap Linear Minimization Oracle capable of minimizing linear forms over $X$.
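To make the last note concrete, here is a minimal Python sketch (ours, not from the talk) of both primitives for the standard entropy setup on the probability simplex. The function names are placeholders; the point is that the Bregman projection already contains all the work the LMO needs, and more.

```python
import numpy as np

def bregman_projection_simplex(e):
    """Bregman projection on the probability simplex for the entropy
    d.g.f. omega(x) = sum_i x_i ln x_i: the minimizer of
    omega(x) + <e, x> over the simplex is x_i proportional to exp(-e_i)."""
    z = np.exp(-(e - e.min()))        # shift by min(e) for numerical stability
    return z / z.sum()

def lmo_simplex(e):
    """LMO on the simplex: a linear form <e, x> is minimized at a vertex,
    namely at the coordinate where e is smallest."""
    x = np.zeros_like(e, dtype=float)
    x[np.argmin(e)] = 1.0
    return x
```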
Proximal FOMs: bottlenecks

  Find $x_* \in X$: $\langle \Phi(x), x - x_* \rangle \ge 0\;\;\forall x \in X$    VI$(\Phi, X)$

• $X$: convex compact subset of Euclidean space $E$
• $\Phi: X \to E$ is monotone: $\langle \Phi(x) - \Phi(x'), x - x' \rangle \ge 0\;\;\forall x, x' \in X$

In several important cases, $X$ does not admit a cheap proximal setup, but does allow for a cheap LMO:

Example 1: $X \subset \mathbb{R}^{m \times n}$ is the nuclear norm ball, or the spectahedron (the set of symmetric psd $m \times m$ matrices with unit trace). Here a Bregman projection requires the full singular value decomposition of an $m \times n$ matrix, resp., the full eigenvalue decomposition of a symmetric $m \times m$ matrix. The LMO is much cheaper: it reduces to computing (e.g., by the Power method) the leading pair of singular vectors (resp., the leading eigenvector) of a matrix.

Example 2: $X$ is the Total Variation ball in the space of $m \times n$ zero mean images. Here already the simplest Bregman projection reduces to the highly computationally demanding metric projection onto the TV ball. The LMO is much cheaper: it reduces to solving a max flow problem on a simple $mn$-node network with $2mn$ arcs.
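A sketch (ours) of the Example 1 LMO for the nuclear norm ball, with scipy's Lanczos-based svds standing in for the Power method mentioned above: the minimum of $\langle g, x \rangle$ over $\{x: \|x\|_{\mathrm{nuc}} \le r\}$ equals $-r\,\sigma_{\max}(g)$ and is attained at a rank-one matrix, so only the leading singular pair is needed.

```python
import numpy as np
from scipy.sparse.linalg import svds

def lmo_nuclear_ball(g, radius=1.0):
    """LMO for the nuclear-norm ball {x : ||x||_nuc <= radius}: the
    minimizer of <g, x> is -radius * u1 v1^T, where (u1, v1) is the
    leading singular-vector pair of g; no full SVD is required."""
    u, s, vt = svds(g, k=1)           # leading singular triple of g
    return -radius * np.outer(u[:, 0], vt[0, :])
```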
Illustration: LMO vs. Bregman projection

• Computing the leading pair of singular vectors of an 8192 × 8192 matrix takes 64.4 sec, a factor of 7.5 cheaper than computing the full singular value decomposition.
• Computing the leading eigenvector of an 8192 × 8192 symmetric matrix takes 10.9 sec, a factor of 13.0 cheaper than computing the full eigenvalue decomposition.
• Minimizing a linear form over the TV ball in the space of 1024 × 1024 images takes 55.6 sec, a factor of 20.6 cheaper than computing the metric projection onto the ball.

Platform: 4 × 3.40 GHz CPU, 16.0 GB RAM, 64-bit Windows 7.

Our goal: Solving large-scale problems with convex structure (convex minimization, convex-concave saddle points, variational inequalities with monotone operators) on LMO-represented domains.
Beyond Proximal FOMs: Conditional Gradient

Seemingly the only standard technique for handling LMO-represented domains is the Conditional Gradient Algorithm (CGA) [Frank & Wolfe '56], solving smooth convex minimization problems

  Opt $= \min_{x \in X} f(x)$    (P)

CGA is the recurrence

  $x_t \mapsto x_t^+ \in \mathrm{Argmin}_{x \in X} \langle f'(x_t), x \rangle \mapsto x_{t+1}$: $f(x_{t+1}) \le f\big(x_t + \tfrac{2}{t+1}[x_t^+ - x_t]\big)$ & $x_{t+1} \in X$, $t = 1, 2, \ldots$

Theorem [well known]: Let $f(\cdot)$ be convex and $(\kappa, L)$-smooth for some $\kappa \in (1, 2]$:

  $\forall x, y \in X:\;\; f(y) \le f(x) + \langle f'(x), y - x \rangle + \tfrac{L}{\kappa}\|x - y\|_X^\kappa$

  [$\|\cdot\|_X$: norm on $\mathrm{Lin}(X)$ with the unit ball $\tfrac{1}{2}[X - X]$]

Then

  $f(x_t) - \mathrm{Opt} \le \dfrac{2^{2\kappa}}{\kappa(3 - \kappa)} \cdot \dfrac{L}{(t+1)^{\kappa-1}}$,  $t = 2, 3, \ldots$

CGA was extended recently [Harchaoui, Juditsky, Nem. '13] to norm-regularized problems like

  $\min_{x \in K}\,[f(x) + \|x\|]$

$K$: cone with LMO-represented $K \cap \{x: \|x\| \le 1\}$; $f$: convex and smooth.
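A minimal Python sketch of the recurrence (ours), taking the simplest admissible choice $x_{t+1} = x_t + \frac{2}{t+1}(x_t^+ - x_t)$, which satisfies the requirement on $f(x_{t+1})$ with equality; `grad_f` and `lmo` are caller-supplied oracles, e.g. the `lmo_simplex`/`lmo_nuclear_ball` sketches above.

```python
import numpy as np

def conditional_gradient(grad_f, lmo, x0, n_iters=500):
    """CGA recurrence: x_{t+1} = x_t + (2/(t+1)) (x_t^+ - x_t), where
    x_t^+ is the LMO's answer at f'(x_t).  Only linear minimization
    over X is required -- no projections of any kind."""
    x = np.array(x0, dtype=float)
    for t in range(1, n_iters + 1):
        x_plus = lmo(grad_f(x))       # x_t^+ in Argmin_{x in X} <f'(x_t), x>
        x += (2.0 / (t + 1)) * (x_plus - x)
    return x
```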
Fenchel-Type Representations of Functions

Question: How to carry out nonsmooth convex minimization and solve other smooth/nonsmooth problems with convex structure on LMO-represented domains?
Proposed answer: Use Fenchel-type representations.

• A Fenchel representation (F.r.) of a function $f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is

  $f(x) = \sup_y\,[\langle x, y \rangle - f_*(y)]$,  $f_*$: proper convex lower semicontinuous.

• A Fenchel-type representation (F-t.r.) of $f$ is

  $f(x) = \sup_y\,[\langle x, Ay + a \rangle - \phi(y)]$,  $\phi$: proper convex lower semicontinuous.

  Good F-t.r.: $Y := \mathrm{Dom}\,\phi$ is compact & $\phi \in \mathrm{Lip}(Y)$.

The F.r. of a proper convex lower semicontinuous $f$ exists in nature and is unique, but usually is not available numerically. In contrast, F-t.r.'s admit a fully algorithmic calculus: all basic convexity-preserving operations, as applied to operands given by F-t.r.'s, yield an explicit F-t.r. of the result. Typical well-structured convex functions admit explicit good F-t.r.'s (even with affine $\phi$'s).
Example: The F.r. of $f_1 + f_2$ is given by the computationally demanding inf-convolution:

  $(f_1 + f_2)_*(y) = \inf_{y_1 + y_2 = y}\,[f_{1*}(y_1) + f_{2*}(y_2)]$

In contrast, an F-t.r. of $f_1 + f_2$ is readily given by F-t.r.'s of $f_1$, $f_2$:

  $f_i(x) = \sup_{y_i \in Y_i}\,[\langle x, A_i y_i + a_i \rangle - \phi_i(y_i)]$, $i = 1, 2$

  $\Rightarrow\;\; f_1(x) + f_2(x) = \sup_{y = [y_1; y_2] \in Y := Y_1 \times Y_2} \Big[\big\langle x, \underbrace{A_1 y_1 + A_2 y_2}_{Ay} + \underbrace{a_1 + a_2}_{a} \big\rangle - \underbrace{[\phi_1(y_1) + \phi_2(y_2)]}_{\phi(y)}\Big]$
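As a toy illustration of the "fully algorithmic calculus" claim, the summation rule above can be coded literally, for representations stored as the data $(A, a, \phi)$; the `FTR` container is ours, and the product domain $Y_1 \times Y_2$ is kept implicit.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class FTR:
    """Data of an F-t.r.  f(x) = sup_{y in Y} [<x, A y + a> - phi(y)]."""
    A: np.ndarray                 # linear part of the affine map y -> A y + a
    a: np.ndarray                 # constant part
    phi: Callable                 # y -> phi(y) on Y

def ftr_sum(r1: FTR, r2: FTR) -> FTR:
    """Summation rule: an F-t.r. of f1 + f2 lives on Y1 x Y2, with
    A[y1; y2] = A1 y1 + A2 y2,  a = a1 + a2,  phi(y) = phi1(y1) + phi2(y2)."""
    n1 = r1.A.shape[1]
    return FTR(A=np.hstack([r1.A, r2.A]),
               a=r1.a + r2.a,
               phi=lambda y: r1.phi(y[:n1]) + r2.phi(y[n1:]))
```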
Nonsmooth Convex Minimization via Fenchel-Type Representation

When solving a convex minimization problem Opt$(P) = \min_{x \in X} f(x)$, a good F-t.r. of the objective

  $f(x) = \max_{y \in Y = \mathrm{Dom}\,\phi}\,[\langle x, Ay + a \rangle - \phi(y)]$    (P)

gives rise to the dual problem

  $[-\mathrm{Opt}(P) =]\;$ Opt$(D) = \min_{y \in Y}\big[F(y) := \phi(y) - \min_{x \in X} \langle x, Ay + a \rangle\big]$    (D)

Observation: The LMO for $X$ combines with a First Order oracle for $\phi$ to induce a First Order oracle for $F$ ⟹ when a First Order oracle for $\phi$ and an LMO for $X$ are available, (D) is well suited for solving by FOMs (e.g., proximal methods, provided $Y$ admits a cheap proximal setup).

Strategy: Solve (D) and then recover a solution to (P).
Question: How to recover a good solution to the problem of interest (P) from the information acquired when solving (D)?
Proposed answer: Use accuracy certificates.
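The induced oracle in the Observation is mechanical; a sketch (ours, with placeholder callbacks `phi`, `phi_grad`, `lmo_X`) of how one evaluation of $F$ costs one First Order call for $\phi$ plus one LMO call:

```python
import numpy as np

def dual_fo_oracle(y, phi, phi_grad, A, a, lmo_X):
    """First Order oracle for F(y) = phi(y) - min_{x in X} <x, A y + a>:
        F(y)  = phi(y) - <x(y), A y + a>,
        F'(y) = phi'(y) - A^T x(y),
    where x(y) is the LMO's answer at A y + a.  The answer x(y) is
    returned as well -- it is the raw material for primal recovery."""
    e = A @ y + a
    x_y = lmo_X(e)                    # single call to the LMO for X
    return phi(y) - x_y @ e, phi_grad(y) - A.T @ x_y, x_y
```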
Accuracy Certificates

Assume we are applying an N-step FOM to a convex problem

  Opt $= \min_{y \in Y} F(y)$,    (P)

and have generated search points $y_t \in Y$ augmented with first order information $(F(y_t), F'(y_t))$, $1 \le t \le N$. An accuracy certificate for the execution protocol $I^N = \{y_t, F(y_t), F'(y_t)\}_{t=1}^N$ is a collection $\lambda^N = \{\lambda_t^N\}_{t=1}^N$ of N nonnegative weights summing up to 1. Accuracy certificate $\lambda^N$ and execution protocol $I^N$ give rise to

• Resolution: $\mathrm{Res}(I^N, \lambda^N) = \max_{y \in Y} \sum_{t=1}^N \lambda_t^N \langle F'(y_t), y_t - y \rangle$
• Gap: $\mathrm{Gap}(I^N, \lambda^N) = \min_{t \le N} F(y_t) - \big[\sum_{t=1}^N \lambda_t^N F(y_t) - \mathrm{Res}(I^N, \lambda^N)\big] \le \mathrm{Res}(I^N, \lambda^N)$

Simple Theorem I [Nem., Onn, Rothblum '10]: Let $y^N$ be the best (with the smallest value of $F$) of the search points $y_1, \ldots, y_N$, and let $\widehat{y}^N = \sum_{t=1}^N \lambda_t^N y_t$. Then $y^N$, $\widehat{y}^N$ are feasible solutions to (P) satisfying

  $F(\widehat{y}^N) - \mathrm{Opt} \le \mathrm{Res}(I^N, \lambda^N)$,  $F(y^N) - \mathrm{Opt} \le \mathrm{Gap}(I^N, \lambda^N) \le \mathrm{Res}(I^N, \lambda^N)$
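Both quantities are cheap to evaluate once a linear optimizer over $Y$ is available, since the max over $Y$ inside the Resolution is itself a linear minimization. A sketch (ours; `lmo_Y` is a placeholder for any such optimizer):

```python
import numpy as np

def resolution_and_gap(ys, F_vals, F_grads, lam, lmo_Y):
    """Res and Gap of a certificate lam for protocol {y_t, F(y_t), F'(y_t)}:
      Res = sum_t lam_t <F'(y_t), y_t> - min_{y in Y} <sum_t lam_t F'(y_t), y>,
    and sum_t lam_t F(y_t) - Res is a certified lower bound on Opt."""
    g = sum(l * fg for l, fg in zip(lam, F_grads))      # aggregated gradient
    res = sum(l * (fg @ y) for l, y, fg in zip(lam, ys, F_grads)) - g @ lmo_Y(g)
    gap = min(F_vals) - (sum(l * fv for l, fv in zip(lam, F_vals)) - res)
    return res, gap
```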
Accuracy Certificates (continued)

  Opt$(P) = \min_{x \in X}\big[f(x) := \max_{y \in Y}\,[\langle x, Ay + a \rangle - \phi(y)]\big]$    (P)

Let $I^N = \{y_t \in Y, F(y_t), F'(y_t)\}_{t=1}^N$ be an N-step execution protocol built by an FOM as applied to

  Opt$(D) = \min_{y \in Y}\big\{F(y) := \phi(y) - \min_{x \in X} \langle x, Ay + a \rangle\big\}$    (D)

and let $x_t \in \mathrm{Argmin}_{x \in X} \langle x, Ay_t + a \rangle$ be the LMO's answers obtained when mimicking the First Order oracle for $F$:

  $F(y_t) = \phi(y_t) - \langle x_t, Ay_t + a \rangle$  &  $F'(y_t) = \phi'(y_t) - A^T x_t$

Simple Theorem II [Cox, Juditsky, Nem. '13]: Let $\lambda^N$ be an accuracy certificate for $I^N$ and $\widehat{x}^N = \sum_{t=1}^N \lambda_t^N x_t$. Then $\widehat{x}^N$ is feasible for (P) and

  $f(\widehat{x}^N) - \mathrm{Opt}(P) \le \mathrm{Res}(I^N, \lambda^N)$

(in fact, the right hand side can be replaced with $\mathrm{Gap}(I^N, \lambda^N)$).
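The primal recovery in Simple Theorem II is one line; here `xs` are the stored LMO answers $x_t$ from the sketch above, and `lam` the certificate weights:

```python
def recover_primal(xs, lam):
    """Approximate solution to (P): the lam-weighted average of the LMO
    answers x_t collected while solving (D)."""
    return sum(l * x for l, x in zip(lam, xs))
```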
LMO-Based Nonsmooth Convex Minimization (continued)

  Opt$(P) = \min_{x \in X}\big\{f(x) = \max_{y \in Y}\,[\langle x, Ay + a \rangle - \phi(y)]\big\}$    (P)
  $[-\mathrm{Opt}(P) =]\;$ Opt$(D) = \min_{y \in Y}\big\{F(y) = \phi(y) - \min_{x \in X} \langle x, Ay + a \rangle\big\}$    (D)

Conclusion: Mimicking the First Order oracle for (D) via the LMO for $X$ and solving (D) by an FOM producing accuracy certificates, after $N = 1, 2, \ldots$ iterations we have at our disposal feasible solutions $\widehat{x}^N$ to the problem of interest (P) such that $f(\widehat{x}^N) - \mathrm{Opt}(P) \le \mathrm{Gap}(I^N, \lambda^N)$.

Fact: A wide spectrum of FOMs allow for augmenting execution protocols by good accuracy certificates, meaning that $\mathrm{Res}(I^N, \lambda^N)$ (and thus $\mathrm{Gap}(I^N, \lambda^N)$) obeys the standard efficiency estimates of the algorithms in question.
• For some FOMs (Subgradient/Mirror Descent, Nesterov's Fast Gradient Method for smooth convex minimization, and full memory Mirror Descent Bundle Level algorithms), good certificates are readily available.
• Several FOMs (polynomial time Cutting Plane algorithms, like the Ellipsoid and Inscribed Ellipsoid methods, and truncated memory Mirror Descent Bundle Level algorithms) can be modified in order to produce good certificates. The required modifications are costless: the complexity of an iteration remains basically intact.
LMO-Based Nonsmooth Convex Minimization (continued)

  Opt$(P) = \min_{x \in X}\big\{f(x) = \max_{y \in Y}\,[\langle x, Ay + a \rangle - \phi(y)]\big\}$    (P)
  $[-\mathrm{Opt}(P) =]\;$ Opt$(D) = \min_{y \in Y}\big\{F(y) = \phi(y) - \min_{x \in X} \langle x, Ay + a \rangle\big\}$    (D)

Let $Y$ be equipped with a cheap proximal setup ⟹ (P) can be solved by applying to (D) a proximal algorithm with good accuracy certificates (e.g., various versions of Mirror Descent) and recovering from the certificates approximate solutions to (P). With this approach, an iteration requires a single call to the LMO for $X$ and a single computation of a Bregman projection $\xi \mapsto \arg\min_{y \in Y}\,[\langle \xi, y \rangle + \omega(y)]$.

An alternative is to use the F-t.r. of $f$ and the proximal setup for $Y$ to approximate $f$ by

  $f_\delta(x) = \max_{y \in Y}\,\{\langle x, Ay + a \rangle - \phi(y) - \delta\omega(y)\}$

and to minimize the $C^{1,1}$ function $f_\delta(\cdot)$ over $X$ by Conditional Gradient.
Note: The alternative is just Nesterov's smoothing, with smooth minimization by the LMO-based Conditional Gradient rather than by proximal Fast Gradients.
Fact: When $\phi$ is affine (quite typical!), both approaches result in methods with the same iteration complexity and the same $O(1/\sqrt{t})$ efficiency estimate.
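To see what $f_\delta$ looks like in the affine-$\phi$ case, here is a sketch (ours) for one specific, assumed setup: $Y$ the probability simplex, $\phi(y) = \langle b, y \rangle$, entropy d.g.f.; then $f_\delta$ and its gradient are a stabilized log-sum-exp and a softmax, ready to feed to the Conditional Gradient sketch above.

```python
import numpy as np

def smoothed_oracle(x, A, a, b, delta):
    """f_delta for Y = simplex, phi(y) = <b, y>, omega = entropy:
      f_delta(x)  = <a, x> + delta * logsumexp((A^T x - b) / delta),
      f_delta'(x) = A y_*(x) + a,  y_*(x) the softmax weights."""
    z = (A.T @ x - b) / delta
    m = z.max()                       # shift for numerical stability
    w = np.exp(z - m)
    y_star = w / w.sum()              # maximizer of the smoothed max-problem
    return x @ a + delta * (m + np.log(w.sum())), A @ y_star + a
```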
LMO-Based Nonsmooth Convex Minimization: How It Works

Test problems: Matrix Completion with uniform fit

  Opt $= \min_{x \in \mathbb{R}^{p \times p}: \|x\|_{\mathrm{nuc}} \le 1}\big\{f(x) := \max_{(i,j) \in \Omega} |x_{ij} - a_{ij}|\big\}$
      $= \min_{x \in \mathbb{R}^{p \times p}: \|x\|_{\mathrm{nuc}} \le 1}\big\{\max_{y \in Y}\big[\sum_{(i,j) \in \Omega} y_{ij}(x_{ij} - a_{ij})\big]\big\}$,
  $Y = \{y = \{y_{ij}: (i,j) \in \Omega\}: \sum_{(i,j) \in \Omega} |y_{ij}| \le 1\}$

$\Omega$: N-element collection of cells in a $p \times p$ matrix.

Results, I: Restricted Memory Bundle-Level algorithm on a low size ($p = 512$, $N = 512$) Matrix Completion instance:

  Memory depth    |   1 |  33 |  65 |  129
  Gap_1/Gap_1024  | 114 | 164 | 350 | 3253

Results, II: Subgradient Descent on Matrix Completion:

  p    | N     | Gap_1   | Gap_1/Gap_32 | Gap_1/Gap_128 | Gap_1/Gap_1024 | CPU, sec
  2048 | 8192  | 1.81e-1 | 171.2        | 213.8         | 451.4          | 521.3
  4096 | 16384 | 3.74e-1 | 335.4        | 1060.8        | 1287.3         | 1524.8
  8192 | 16384 | 2.54e-1 | 37.8         | 875.8         | 1183.6         | 3644.0

Platform: desktop PC with 4 × 3.40 GHz Intel Core2 CPU and 16 GB RAM, Windows 7-64 OS.
From Nonsmooth LMO-Based Convex Minimization to Variational Inequalities and Saddle Points

Motivating Example: Consider the Matrix Completion problem

  Opt $= \min_{u: \|u\|_{\mathrm{nuc}} \le 1}\big[f(u) := \|Au - b\|_{2,2}\big]$

• $u \mapsto Au: \mathbb{R}^{n \times n} \to \mathbb{R}^{m \times m}$, e.g., $Au = \sum_{i=1}^k l_i u r_i^T$
• $\|\cdot\|_{2,2}$: spectral norm (largest singular value) of a matrix

A Fenchel-type representation of $f$ is immediate: $f(u) = \max_{\|v\|_{\mathrm{nuc}} \le 1} \langle v, Au - b \rangle$ ⟹ the problem of interest reduces to the bilinear saddle point problem

  $\min_{u \in U} \max_{v \in V} \langle v, Au - b \rangle$,  $U = \{u \in \mathbb{R}^{n \times n}: \|u\|_{\mathrm{nuc}} \le 1\}$,  $V = \{v \in \mathbb{R}^{m \times m}: \|v\|_{\mathrm{nuc}} \le 1\}$

where both $U$ and $V$ admit computationally cheap LMOs, but do not admit computationally cheap proximal setups ⟹ our previous approach (same as any other known approach) is inapplicable: we needed $Y \equiv V$ to be proximal-friendly...

(?) How to solve convex-concave saddle point problems on products of LMO-represented domains?
Fenchel-Type Representation of Monotone Operator: Definition

Definition (Fenchel-type representation): Let $X \subset E$ be a convex compact set in Euclidean space $E$, and $\Phi: X \to E$ be a vector field on $X$. A Fenchel-type representation of $\Phi$ on $X$ is

  $\Phi(x) = Ay(x) + a$    (*)

where
• $y \mapsto Ay + a: F \to E$ is an affine mapping from a Euclidean space $F$ into $E$,
• $y(x)$ is a strong solution to VI$(G(\cdot) - A^*x, Y)$, with $Y \subset F$ convex & compact and $G(\cdot): Y \to F$ monotone.

$F$, $Y$, $A$, $a$, $y(\cdot)$, $G(\cdot)$ is the data of the representation.

Definition (Dual operator): The dual operator induced by F-t.r. (*) is

  $\Theta(y) = G(y) - A^*x(y): Y \to F$,  $x(y) \in \mathrm{Argmin}_{x \in X} \langle Ay + a, x \rangle$

The v.i. VI$(\Theta, Y)$ is called the (induced by (*)) dual to the primal v.i. VI$(\Phi, X)$.
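Evaluating the dual operator is as cheap as the definition suggests; a sketch (ours, with `G` a caller-supplied monotone map and `lmo_X` a placeholder LMO, $A$ a matrix so that $A^* = A^T$):

```python
def dual_operator(y, G, A, a, lmo_X):
    """Theta(y) = G(y) - A^* x(y), with x(y) in Argmin_{x in X} <A y + a, x>
    delivered by the LMO for X; x(y) is returned too, for primal recovery."""
    x_y = lmo_X(A @ y + a)            # single LMO call per evaluation
    return G(y) - A.T @ x_y, x_y
```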
Fenchel-Type Representation of Monotone Operator (continued)

Facts:
• If an operator $\Phi: X \to E$ admits a representation on a convex compact set $X \subset E$, then $\Phi$ is monotone on $X$.
• The dual operator $\Theta$ induced by a Fenchel-type representation of a monotone operator is monotone. $\Theta$ is bounded, provided $G(\cdot)$ is so.

Calculus of Fenchel-type Representations: F-t.r.'s of monotone operators admit a fully algorithmic calculus: F-t.r.'s of the operands of the basic monotonicity-preserving operations (summation with nonnegative coefficients, direct summation, affine substitution of variables) can be straightforwardly converted to an F-t.r. of the result.
• An affine monotone operator admits an explicit F-t.r. on every compact domain.
• A good F-t.r. $f(x) = \max_{y \in Y}\,[\langle x, Ay + a \rangle - \phi(y)]$ of a convex function $f: X \to \mathbb{R}$ induces an F-t.r. of a subgradient field of $f$, provided $\phi \in C^1(Y)$.
A Digression: Variational Inequalities with Monotone Operators: Accuracy Measures

  Find $x_* \in X$: $\langle \Phi(x), x - x_* \rangle \ge 0\;\;\forall x \in X$    VI$(\Phi, X)$

A natural measure of (in)accuracy of a candidate solution $\bar{x} \in X$ to VI$(\Phi, X)$ is the dual gap function

  $\varepsilon_{\mathrm{vi}}(\bar{x} \mid \Phi, X) = \sup_{x \in X} \langle \Phi(x), \bar{x} - x \rangle$

When VI$(\Phi, X)$ comes from a convex-concave saddle point problem: $X = U \times V$ for convex compact sets $U$, $V$, and $\Phi(u, v) = [\Phi_u(u, v) \in \partial_u f(u, v);\; \Phi_v(u, v) \in \partial_v[-f(u, v)]]$ for a Lipschitz continuous convex-concave function $f(u, v): X = U \times V \to \mathbb{R}$, another natural accuracy measure is the saddle point inaccuracy

  $\varepsilon_{\mathrm{sad}}(\bar{x} = [\bar{u}; \bar{v}] \mid f, U, V) := \max_{v \in V} f(\bar{u}, v) - \min_{u \in U} f(u, \bar{v})$

Explanation: A convex-concave saddle point problem gives rise to two dual to each other convex programs

  Opt$(P) = \min_{u \in U}\big[\overline{f}(u) := \max_{v \in V} f(u, v)\big]$    (P)
  Opt$(D) = \max_{v \in V}\big[\underline{f}(v) := \min_{u \in U} f(u, v)\big]$    (D)

with equal optimal values: Opt$(P) = $ Opt$(D)$. $\varepsilon_{\mathrm{sad}}(\bar{u}, \bar{v} \mid f, U, V)$ is the sum of the non-optimalities, in terms of the respective objectives, of $\bar{u} \in U$ as a solution to (P) and of $\bar{v} \in V$ as a solution to (D).
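For a bilinear $f$, both the inner max and the inner min in $\varepsilon_{\mathrm{sad}}$ are linear optimizations, so the inaccuracy is exactly computable from the two LMOs alone; a sketch (ours) for $f(u,v) = \langle a, u \rangle + \langle b, v \rangle + \langle v, Au \rangle$:

```python
import numpy as np

def eps_sad_bilinear(u, v, A, a, b, lmo_U, lmo_V):
    """Saddle point inaccuracy for bilinear f(u,v) = <a,u> + <b,v> + <v,Au>:
      max_{v'} f(u,v') = <a,u> + max_{v'} <b + A u, v'>   (LMO_V at -(b + A u))
      min_{u'} f(u',v) = <b,v> + min_{u'} <a + A^T v, u'> (LMO_U at a + A^T v)."""
    v_best = lmo_V(-(b + A @ u))      # maximizer over V via the LMO
    u_best = lmo_U(a + A.T @ v)       # minimizer over U via the LMO
    f_upper = a @ u + (b + A @ u) @ v_best
    f_lower = b @ v + (a + A.T @ v) @ u_best
    return f_upper - f_lower
```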
Why Accuracy Certificates Certify Accuracy

Fact: Let the v.i. VI$(\Psi, Z)$ with monotone operator $\Psi$ and convex compact $Z$ be solved by an N-step FOM, let $I^N = \{z_i \in Z, \Psi(z_i)\}_{i=1}^N$ be the execution protocol, and $\lambda^N = \{\lambda_i \ge 0\}_{i=1}^N$, $\sum_i \lambda_i = 1$, be an accuracy certificate. Then $z^N = \sum_{i=1}^N \lambda_i z_i$ is a feasible solution to VI$(\Psi, Z)$, and

  $\varepsilon_{\mathrm{vi}}(z^N \mid \Psi, Z) \le \mathrm{Res}(I^N, \lambda^N) := \max_{z \in Z} \sum_{i=1}^N \lambda_i \langle \Psi(z_i), z_i - z \rangle$

When $\Psi$ is associated with a convex-concave saddle point problem $\min_{u \in U} \max_{v \in V} f(u, v)$, we also have $\varepsilon_{\mathrm{sad}}(z^N \mid f, U, V) \le \mathrm{Res}(I^N, \lambda^N)$.

Fact: Let $\Psi$ be a bounded vector field on a convex compact domain $Z$. For every $N = 1, 2, \ldots$, a properly designed N-step proximal FOM (Mirror Descent) as applied to VI$(\Psi, Z)$ generates an execution protocol $I^N$ and an accuracy certificate $\lambda^N$ such that

  $\mathrm{Res}(I^N, \lambda^N) \le O(1/\sqrt{N})$

If $\Psi$ is Lipschitz continuous on $Z$, then for a properly selected N-step FOM (Mirror Prox) the efficiency estimate improves to $\mathrm{Res}(I^N, \lambda^N) \le O(1/N)$. In both cases, the factors hidden in $O(\cdot)$ are explicitly given by the parameters of the proximal setup and the magnitude of $\Psi$ (first case), or the Lipschitz constant of $\Psi$ (second case).
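A sketch (ours) of how Mirror Descent for a v.i. emits its protocol and the standard certificate $\lambda_i = \gamma_i / \sum_j \gamma_j$, the choice under which the Resolution obeys the $O(1/\sqrt{N})$ bound above; `prox_step` is a placeholder for the Bregman step $\arg\min_{w \in Z}[\omega(w) + \langle \xi - \omega'(z), w \rangle]$.

```python
import numpy as np

def mirror_descent_vi(Psi, prox_step, z0, gammas):
    """Mirror Descent for VI(Psi, Z): z_{i+1} = prox_step(z_i, gamma_i * Psi(z_i)),
    returning the execution protocol {z_i, Psi(z_i)} and the certificate
    weights lam_i = gamma_i / sum_j gamma_j."""
    z, protocol = np.array(z0, dtype=float), []
    for g in gammas:
        psi = Psi(z)
        protocol.append((z.copy(), psi))   # record the protocol entry
        z = prox_step(z, g * psi)          # Bregman prox step on Z
    lam = np.asarray(gammas, dtype=float)
    return protocol, lam / lam.sum()
```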
Solving Monotone Variational Inequalities on LMO-Represented Domains

In order to solve a primal v.i. VI$(\Phi, X)$ given an F-t.r.

  $\Phi(x) = Ay(x) + a$, where $y(x) \in Y$: $\langle G(y(x)) - A^*x, y - y(x) \rangle \ge 0\;\;\forall y \in Y$,

we solve the dual v.i. VI$(\Theta, Y)$,

  $\Theta(y) = G(y) - A^*x(y)$, where $x(y) \in \mathrm{Argmin}_{x \in X} \langle x, Ay + a \rangle$.

Note: Computing $\Theta(y)$ reduces to computing $G(y)$, multiplying by $A$ and $A^*$, and a single call to the LMO representing $X$.

Theorem [Juditsky, Nem. '13]: Let $I^N = \{y_i \in Y, \Theta(y_i)\}_{i=1}^N$ be the execution protocol of an FOM applied to the dual v.i. VI$(\Theta, Y)$, and $\lambda^N = \{\lambda_i \ge 0\}_{i=1}^N$, $\sum_i \lambda_i = 1$, be an associated accuracy certificate. Then $x^N = \sum_{i=1}^N \lambda_i x(y_i)$ is a feasible solution to the primal v.i. VI$(\Phi, X)$ and

  $\varepsilon_{\mathrm{vi}}(x^N \mid \Phi, X) \le \mathrm{Res}(I^N, \lambda^N) := \max_{y \in Y} \sum_{i=1}^N \lambda_i \langle \Theta(y_i), y_i - y \rangle$

If $\Phi$ is associated with a bilinear convex-concave saddle point problem $\min_{u \in U} \max_{v \in V}\,[f(u, v) = \langle a, u \rangle + \langle b, v \rangle + \langle v, Au \rangle]$, then also $\varepsilon_{\mathrm{sad}}(x^N \mid f, U, V) \le \mathrm{Res}(I^N, \lambda^N)$.
How it Works

As applied to the Motivating Example

  Opt $= \min_{u \in \mathbb{R}^{n \times n},\, \|u\|_{\mathrm{nuc}} \le 1}\big[f(u) := \|Au - b\|_{2,2}\big] = \min_{u \in \mathbb{R}^{n \times n},\, \|u\|_{\mathrm{nuc}} \le 1}\; \max_{v \in \mathbb{R}^{m \times m},\, \|v\|_{\mathrm{nuc}} \le 1} \langle v, Au - b \rangle$,  $Au = \sum_{i=1}^k l_i u r_i^T$,

our approach results in a method yielding in $N = 1, 2, \ldots$ steps feasible approximate solutions $u^N$ to the problem of interest and lower bounds Opt$_N$ on Opt such that

  $\mathrm{Gap}_N \equiv f(u^N) - \mathrm{Opt}_N \le O(1)\,\|A\|_{2,2}/\sqrt{N}$

  Iteration count N          | 1       | 65      | 129     | 193     | 257     | 321    | 385    | 449    | 512
  m = 512    Gap_N           | 0.1269  | 0.0239  | 0.0145  | 0.0103  | 0.0075  | 0.0063 | 0.0042 | 0.0040 | 0.0040
  n = 1024   Gap_1/Gap_N     | 1.00    | 5.31    | 8.78    | 12.38   | 17.03   | 20.20  | 29.98  | 31.41  | 31.66
  k = 2      cpu, sec        | 0.2     | 9.5     | 27.6    | 69.1    | 112.6   | 218.1  | 326.2  | 432.6  | 536.4
  m = 1024   Gap_N           | 0.1329  | 0.0196  | 0.0119  | 0.0075  | 0.0053  | 0.0041 | 0.0036 | 0.0034 | 0.0027
  n = 2048   Gap_1/Gap_N     | 1.00    | 6.79    | 11.21   | 17.81   | 25.09   | 32.29  | 37.23  | 38.70  | 50.06
  k = 2      cpu, sec        | 0.7     | 38.0    | 101.1   | 206.3   | 314.1   | 508.9  | 699.0  | 884.9  | 1070.0
  m = 2048   Gap_N           | 0.1239  | 0.0222  | 0.0139  | 0.0108  | 0.0086  | 0.0041 | 0.0037 | 0.0035 | 0.0035
  n = 4096   Gap_1/Gap_N     | 1.00    | 5.57    | 8.93    | 11.48   | 14.40   | 30.48  | 33.14  | 35.76  | 35.77
  k = 2      cpu, sec        | 2.2     | 103.5   | 257.6   | 496.9   | 742.5   | 1147.8 | 1564.4 | 1981.4 | 2401.0
  m = 4096   Gap_N           | 0.1193  | 0.0232  | 0.0134  | 0.0108  | 0.0054  | 0.0040 | 0.0035 | 0.0034 | 0.0034
  n = 8192   Gap_1/Gap_N     | 1.00    | 5.14    | 8.90    | 11.08   | 22.00   | 29.83  | 33.93  | 34.85  | 35.14
  k = 2      cpu, sec        | 6.5     | 289.9   | 683.8   | 1238.1  | 1816.0  | 2724.5 | 3648.3 | 4572.2 | 5490.8
  m = 8192   Gap_N           | 0.11959 | 0.02136 | 0.01460 | 0.01011 | 0.00853 |        |        |        |
  n = 16384  Gap_1/Gap_N     | 1.00    | 5.60    | 8.19    | 11.82   | 14.01   |        |        |        |
  k = 2      cpu, sec        | 21.7    | 920.4   | 2050.2  | 3492.4  | 4902.2  |        |        |        |

Platform: 4 × 3.40 GHz desktop with 16 GB RAM, 64-bit Windows 7 OS.
Note: The design dimension of the largest instance is $2^{28}$ = 268,435,456.
References

• Bruce Cox, Anatoli Juditsky, Arkadi Nemirovski, "Dual subgradient algorithms for large-scale nonsmooth learning problems", to appear in Mathematical Programming Series B; arXiv:1302.2349, Aug. 2013.
• Anatoli Juditsky, Arkadi Nemirovski, "Solving variational inequalities with monotone operators on domains given by Linear Minimization Oracles", submitted to Mathematical Programming; arXiv:1312.1073, Dec. 2013.