Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles
Arkadi Nemirovski
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology
Joint research with Anatoli Juditsky, University J. Fourier, Grenoble
London Optimization Workshop, King's College London, June 9-10, 2014

Overview
- Problems of interest and motivation
- Linear Minimization Oracles and the classical Conditional Gradient Algorithm
- Fenchel-type representations and LMO-based Convex Optimization:
  - nonsmooth convex minimization
  - variational inequalities with monotone operators and convex-concave saddle points

Motivation

Problem of interest: Variational Inequality with a monotone operator

    Find x_* ∈ X : ⟨Φ(x), x − x_*⟩ ≥ 0 ∀x ∈ X    VI(Φ, X)

- X: convex compact subset of a Euclidean space E
- Φ : X → E is monotone: ⟨Φ(x) − Φ(x'), x − x'⟩ ≥ 0 ∀x, x' ∈ X

Examples:
- Convex Minimization: Φ(x) ∈ ∂f(x), x ∈ X, for a convex Lipschitz continuous function f : X → R; solutions to VI(Φ, X) are exactly the minimizers of f on X.
- Convex-Concave Saddle Points: X = U × V, Φ(u, v) = [Φ_u(u, v) ∈ ∂_u f(u, v); Φ_v(u, v) ∈ ∂_v[−f(u, v)]] for a convex-concave Lipschitz continuous f(u, v) : X → R; solutions to VI(Φ, X) are exactly the saddle points of f on U × V.

When problem sizes make Interior Point algorithms prohibitively time consuming, First Order Methods (FOMs) become the methods of choice. Reasons: under favorable circumstances, FOMs (a) have cheap steps and (b) exhibit nearly dimension-independent sublinear convergence rates.
Note: Perhaps one could survive without (b), but (a) is a must!
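[As an aside, here is the standard one-line reason the saddle-point operator in the second example is monotone; it is written for differentiable f to keep the notation light, and is only a sketch of the general subgradient case.]

```latex
% Monotonicity of Phi(u,v) = [grad_u f(u,v); -grad_v f(u,v)] for convex-concave f.
% Convexity in u and concavity in v give, for z = (u,v) and z' = (u',v'):
%   <grad_u f(u,v),  u - u'> - <grad_v f(u,v),  v - v'>  >=  f(u,v') - f(u',v),
%   <grad_u f(u',v'), u'- u> - <grad_v f(u',v'), v'- v>  >=  f(u',v) - f(u,v').
% Summing the two inequalities yields
\[
\big\langle \Phi(z) - \Phi(z'),\; z - z' \big\rangle \;\ge\; 0
\qquad \text{for } \Phi(u,v) = \big[\nabla_u f(u,v);\, -\nabla_v f(u,v)\big].
\]
```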

Proximal FOMs

    Find x_* ∈ X : ⟨Φ(x), x − x_*⟩ ≥ 0 ∀x ∈ X    VI(Φ, X)

- X: convex compact subset of a Euclidean space E
- Φ : X → E is monotone: ⟨Φ(x) − Φ(x'), x − x'⟩ ≥ 0 ∀x, x' ∈ X

Fact: Most FOMs for large-scale convex optimization (Subgradient Descent, Mirror Descent, Nesterov's Fast Gradient Methods, ...) are proximal algorithms. To allow for proximal methods with cheap iterations, X should admit a cheap proximal setup: a C^1 strongly convex distance-generating function (d.g.f.) ω(·) : X → R leading to easy-to-compute Bregman projections

    e ↦ argmin_{x∈X} [ω(x) + ⟨e, x⟩]

Note: If X admits a cheap proximal setup, then X admits a cheap Linear Minimization Oracle (LMO) capable of minimizing linear forms over X.
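[A minimal sketch of what a "cheap proximal setup" means in code, assuming the textbook entropy setup on the probability simplex; the function and variable names are illustrative, not from the talk.]

```python
import numpy as np

def bregman_projection_simplex(e):
    """Bregman projection e -> argmin_{x in simplex} [omega(x) + <e, x>]
    for the entropy d.g.f. omega(x) = sum_i x_i log x_i.

    The minimizer has the closed form x_i proportional to exp(-e_i)
    (a softmax), which is why this setup counts as "cheap".
    """
    z = np.exp(-(e - e.min()))   # shift by min(e) for numerical stability
    return z / z.sum()

# One mirror-descent step on the simplex then reads (up to an additive
# constant in the argument, which the normalization absorbs):
#   x_next = bregman_projection_simplex(step * g - np.log(x))
# with g a (sub)gradient at the current iterate x.
```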

Proximal FOMs: bottlenecks

    Find x_* ∈ X : ⟨Φ(x), x − x_*⟩ ≥ 0 ∀x ∈ X    VI(Φ, X)

- X: convex compact subset of a Euclidean space E
- Φ : X → E is monotone: ⟨Φ(x) − Φ(x'), x − x'⟩ ≥ 0 ∀x, x' ∈ X

In several important cases, X does not admit a cheap proximal setup, but does allow for a cheap LMO:

Example 1: X ⊂ R^{m×n} is the nuclear norm ball, or the spectahedron, i.e., the set of symmetric psd m×m matrices with unit trace. Here a Bregman projection requires the full singular value decomposition of an m×n matrix, resp., the full eigenvalue decomposition of a symmetric m×m matrix. The LMO is much cheaper: it reduces to computing (e.g., by the Power method) the leading pair of singular vectors (resp., the leading eigenvector) of a matrix.

Example 2: X is the Total Variation ball in the space of m×n zero-mean images. Here already the simplest Bregman projection reduces to the highly computationally demanding metric projection onto the TV ball. The LMO is much cheaper: it reduces to solving a max flow problem on a simple mn-node network with 2mn arcs.
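[For concreteness, a rough sketch of the nuclear-norm-ball LMO from Example 1, with plain power iteration standing in for whatever Lanczos-type routine one would actually use; names are illustrative.]

```python
import numpy as np

def lmo_nuclear_ball(G, radius=1.0, iters=50, seed=0):
    """LMO for the nuclear-norm ball {X : ||X||_nuc <= radius}.

    min_{||X||_nuc <= radius} <G, X> is attained at -radius * u v^T with
    (u, v) a leading singular pair of G; here that pair is approximated
    by plain power iteration, so no full SVD is needed.
    """
    n = G.shape[1]
    v = np.random.default_rng(seed).standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = G @ v
        u /= np.linalg.norm(u)
        v = G.T @ u
        v /= np.linalg.norm(v)
    u = G @ v
    u /= np.linalg.norm(u)
    return -radius * np.outer(u, v)   # rank-one extreme point of the ball
```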

Illustration: LMO vs. Bregman projection
- Computing the leading pair of singular vectors of an 8192×8192 matrix takes 64.4 sec, a factor of 7.5 cheaper than computing the full singular value decomposition.
- Computing the leading eigenvector of an 8192×8192 symmetric matrix takes 10.9 sec, a factor of 13.0 cheaper than computing the full eigenvalue decomposition.
- Minimizing a linear form over the TV ball in the space of 1024×1024 images takes 55.6 sec, a factor of 20.6 cheaper than computing the metric projection onto the ball.

Platform: 4 × 3.40 GHz CPU, 16.0 GB RAM, 64-bit Windows 7.

Our goal: solving large-scale problems with convex structure (convex minimization, convex-concave saddle points, variational inequalities with monotone operators) on LMO-represented domains.

Beyond Proximal FOMs: Conditional Gradient

Seemingly the only standard technique for handling LMO-represented domains is the Conditional Gradient Algorithm (CGA) [Frank & Wolfe '58], solving smooth convex minimization problems

    Opt = min_{x∈X} f(x)    (P)

CGA is the recurrence

    X ∋ x_t ↦ x_t^+ ∈ Argmin_{x∈X} ⟨f′(x_t), x⟩,
    x_{t+1}: f(x_{t+1}) ≤ f(x_t + (2/(t+1))[x_t^+ − x_t]) and x_{t+1} ∈ X,  t = 1, 2, ...

Theorem [well known]: Let f(·) be convex and (κ, L)-smooth with κ ∈ (1, 2]:

    ∀x, y ∈ X: f(y) ≤ f(x) + ⟨f′(x), y − x⟩ + (L/κ)‖x − y‖_X^κ
    [‖·‖_X: the norm on Lin(X) with unit ball (1/2)[X − X]]

Then

    f(x_t) − Opt ≤ (2^{2κ} / (κ(3 − κ))) · L / (t + 1)^{κ−1},  t = 2, 3, ...

CGA was extended recently [Harchaoui, Juditsky, Nem. '13] to norm-regularized problems like min_{x∈K} [f(x) + ‖x‖], where K is a cone with LMO-represented K ∩ {x : ‖x‖ ≤ 1} and f is convex and smooth.
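[A minimal sketch of the CGA recurrence, using the simplest admissible update x_{t+1} = x_t + (2/(t+1))(x_t^+ − x_t), which trivially satisfies the requirement above and stays in X by convexity; grad and lmo are placeholders for the problem's gradient oracle and LMO.]

```python
def conditional_gradient(grad, lmo, x0, n_iters=100):
    """Conditional Gradient (Frank-Wolfe) sketch.

    grad(x) returns f'(x); lmo(g) returns some point in
    Argmin_{x in X} <g, x>. One LMO call per iteration, no projections.
    """
    x = x0
    for t in range(1, n_iters + 1):
        x_plus = lmo(grad(x))                    # linear minimization step
        x = x + (2.0 / (t + 1)) * (x_plus - x)   # convex combination stays in X
    return x

# E.g., on the nuclear-norm ball with the power-method LMO sketched above:
#   u_hat = conditional_gradient(my_gradient, lmo_nuclear_ball, x0)
# where my_gradient and x0 are whatever the concrete problem supplies.
```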

Fenchel-Type Representations of Functions

Question: How to carry out nonsmooth convex minimization, and solve other smooth/nonsmooth problems with convex structure, on LMO-represented domains?
Proposed answer: Use Fenchel-type representations.

The Fenchel representation (F.r.) of a function f : R^n → R ∪ {+∞} is

    f(x) = sup_y [⟨x, y⟩ − f_*(y)],   f_*: proper convex lower semicontinuous.

A Fenchel-type representation (F-t.r.) of f is

    f(x) = sup_y [⟨x, Ay + a⟩ − φ(y)],   φ: proper convex lower semicontinuous.

A good F-t.r. is one with Y := Dom φ compact and φ ∈ Lip(Y).

The F.r. of a proper convex lower semicontinuous f exists in nature and is unique, but usually it is not available numerically. In contrast, F-t.r.'s admit a fully algorithmic calculus: all basic convexity-preserving operations, as applied to operands given by F-t.r.'s, yield an explicit F-t.r. of the result. Typical well-structured convex functions admit explicit good F-t.r.'s (even with affine φ's).
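[A standard illustration, added here and not taken from the slides, of a good F-t.r. with affine φ.]

```latex
% The piecewise-linear function f(x) = max_{1<=i<=k} (<a_i, x> + b_i)
% admits the good Fenchel-type representation
\[
  f(x) \;=\; \max_{1\le i\le k}\big(\langle a_i, x\rangle + b_i\big)
       \;=\; \max_{y\in\Delta_k}\Big[\langle x,\, A y\rangle - \varphi(y)\Big],
  \qquad A=[a_1,\dots,a_k],\quad \varphi(y)=-\langle b, y\rangle,
\]
% with Y = Delta_k the probability simplex: Y is compact and the affine
% phi is Lipschitz on it, so the representation is "good".
```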

Example: The F.r. of f_1 + f_2 is given by the computationally demanding inf-convolution of the conjugates:

    (f_1 + f_2)_*(y) = inf_{y_1+y_2=y} [f_{1*}(y_1) + f_{2*}(y_2)]

In contrast, an F-t.r. of f_1 + f_2 is readily given by F-t.r.'s of f_1, f_2:

    f_i(x) = max_{y_i∈Y_i} [⟨x, A_i y_i + a_i⟩ − φ_i(y_i)], i = 1, 2
    ⇒ f_1(x) + f_2(x) = max_{y=[y_1;y_2] ∈ Y:=Y_1×Y_2} [⟨x, A_1 y_1 + a_1 + A_2 y_2 + a_2⟩ − (φ_1(y_1) + φ_2(y_2))],

that is, an F-t.r. with Ay + a := A_1 y_1 + A_2 y_2 + (a_1 + a_2) and φ(y) := φ_1(y_1) + φ_2(y_2) on Y := Y_1 × Y_2.

Nonsmooth Convex Minimization via Fenchel-Type Representation

When solving the convex minimization problem

    Opt(P) = min_{x∈X} f(x),    (P)

a good F-t.r. of the objective, f(x) = max_{y∈Y=Dom φ} [⟨x, Ay + a⟩ − φ(y)], gives rise to the dual problem

    [−Opt(P) =] Opt(D) = min_{y∈Y} [F(y) := φ(y) − min_{x∈X} ⟨x, Ay + a⟩]    (D)

Observation: The LMO for X combines with a First Order oracle for φ to induce a First Order oracle for F. When a First Order oracle for φ and an LMO for X are available, (D) is well suited for solving by FOMs (e.g., proximal methods, provided Y admits a cheap proximal setup).

Strategy: Solve (D) and then recover a solution to (P).
Question: How to recover a good solution to the problem of interest (P) from the information acquired when solving (D)?
Proposed answer: Use accuracy certificates.
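[A sketch of the induced First Order oracle for F, following the formulas F(y_t) = φ(y_t) − ⟨x_t, Ay_t + a⟩ and F′(y_t) = φ′(y_t) − A^T x_t quoted two slides below; everything is assumed flattened to vectors, and phi, phi_grad, lmo, A, a are placeholders.]

```python
import numpy as np

def dual_first_order_oracle(y, phi, phi_grad, lmo, A, a):
    """First Order oracle for F(y) = phi(y) - min_{x in X} <x, A y + a>.

    lmo(c) is assumed to return some x(y) in Argmin_{x in X} <c, x>.
    Returns F(y), F'(y) and the LMO answer x(y), which is worth keeping:
    the accuracy-certificate machinery recovers primal solutions from it.
    """
    c = A @ y + a
    x_y = lmo(c)                               # single LMO call
    F_val = phi(y) - float(x_y @ c)            # F(y)  = phi(y) - <x(y), A y + a>
    F_grad = phi_grad(y) - A.T @ x_y           # F'(y) = phi'(y) - A^T x(y)
    return F_val, F_grad, x_y
```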

Accuracy Certificates

Assume we are applying an N-step FOM to a convex problem

    Opt = min_{y∈Y} F(y),    (P)

and have generated search points y_t ∈ Y augmented with the first order information (F(y_t), F′(y_t)), 1 ≤ t ≤ N. An accuracy certificate for the execution protocol I_N = {y_t, F(y_t), F′(y_t)}_{t=1}^N is a collection λ^N = {λ_t^N}_{t=1}^N of N nonnegative weights summing up to 1. The accuracy certificate λ^N and the execution protocol I_N give rise to
- Resolution: Res(I_N, λ^N) = max_{y∈Y} Σ_{t=1}^N λ_t^N ⟨F′(y_t), y_t − y⟩
- Gap: Gap(I_N, λ^N) = min_{t≤N} F(y_t) − Σ_{t=1}^N λ_t^N F(y_t) + Res(I_N, λ^N) ≤ Res(I_N, λ^N)

Simple Theorem I [Nem., Onn, Rothblum '10]: Let y^N be the best (with the smallest value of F) of the search points y_1, ..., y_N, and let ŷ^N = Σ_{t=1}^N λ_t^N y_t. Then y^N, ŷ^N are feasible solutions to (P) satisfying

    F(ŷ^N) − Opt ≤ Res(I_N, λ^N),   F(y^N) − Opt ≤ Gap(I_N, λ^N) ≤ Res(I_N, λ^N)
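[A small sketch of these two quantities in code, assuming the max over Y in the Resolution is itself delegated to a linear minimization oracle for Y; all names are placeholders.]

```python
import numpy as np

def resolution_and_gap(ys, F_vals, F_grads, lam, Y_lmo):
    """Res and Gap of an execution protocol {y_t, F(y_t), F'(y_t)} with
    certificate lam (nonnegative weights summing to 1).

    Y_lmo(c) is assumed to return argmin_{y in Y} <c, y>, so the max over
    Y inside Res reduces to a single linear minimization over Y.
    """
    g = sum(l * fg for l, fg in zip(lam, F_grads))            # sum_t lam_t F'(y_t)
    inner = sum(l * float(fg @ y) for l, fg, y in zip(lam, F_grads, ys))
    res = inner - float(g @ Y_lmo(g))     # = max_y sum_t lam_t <F'(y_t), y_t - y>
    gap = min(F_vals) - float(np.dot(lam, F_vals)) + res      # Gap <= Res always
    return res, gap
```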

Accuracy Certificates (continued)

    Opt(P) = min_{x∈X} [f(x) := max_{y∈Y} [⟨x, Ay + a⟩ − φ(y)]]    (P)

Let I_N = {y_t ∈ Y, F(y_t), F′(y_t)}_{t=1}^N be the N-step execution protocol built by an FOM as applied to

    Opt(D) = min_{y∈Y} {F(y) := φ(y) − min_{x∈X} ⟨x, Ay + a⟩}    (D)

and let x_t ∈ Argmin_{x∈X} ⟨x, Ay_t + a⟩ be the LMO's answers obtained when mimicking the First Order oracle for F:

    F(y_t) = φ(y_t) − ⟨x_t, Ay_t + a⟩  and  F′(y_t) = φ′(y_t) − A^T x_t

Simple Theorem II [Cox, Juditsky, Nem. '13]: Let λ^N be an accuracy certificate for I_N and x̄^N = Σ_{t=1}^N λ_t^N x_t. Then x̄^N is feasible for (P) and

    f(x̄^N) − Opt(P) ≤ Res(I_N, λ^N)

(in fact, the right hand side can be replaced with Gap(I_N, λ^N)).

LMO-Based Nonsmooth Convex Minimization (continued)

    Opt(P) = min_{x∈X} {f(x) = max_{y∈Y} [⟨x, Ay + a⟩ − φ(y)]}    (P)
    [−Opt(P) =] Opt(D) = min_{y∈Y} {F(y) = φ(y) − min_{x∈X} ⟨x, Ay + a⟩}    (D)

Conclusion: Mimicking the First Order oracle for (D) via the LMO for X and solving (D) by an FOM producing accuracy certificates, after N = 1, 2, ... iterations we have at our disposal feasible solutions x̄^N to the problem of interest (P) such that f(x̄^N) − Opt(P) ≤ Gap(I_N, λ^N).

Fact: A wide spectrum of FOMs allows for augmenting execution protocols by good accuracy certificates, meaning that Res(I_N, λ^N) (and thus Gap(I_N, λ^N)) obeys the standard efficiency estimates of the algorithms in question.
- For some FOMs (Subgradient/Mirror Descent, Nesterov's Fast Gradient Method for smooth convex minimization, and full memory Mirror Descent Bundle Level algorithms), good certificates are readily available.
- Several FOMs (polynomial time Cutting Plane algorithms, like the Ellipsoid and Inscribed Ellipsoid methods, and truncated memory Mirror Descent Bundle Level algorithms) can be modified in order to produce good certificates. The required modifications are costless: the complexity of an iteration remains basically intact.
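[A bare-bones sketch of the whole pipeline (solve (D), keep the LMO answers, recover x̄^N), simplified relative to the talk: Euclidean projection onto Y and a constant stepsize are assumed, so the stepsize-proportional Mirror Descent certificate degenerates to uniform weights; F_oracle is the dual oracle sketched earlier and proj_Y is a placeholder.]

```python
import numpy as np

def solve_dual_and_recover(F_oracle, proj_Y, y0, n_iters, step):
    """Subgradient descent on (D) plus certificate-based primal recovery.

    F_oracle(y) returns (F(y), F'(y), x(y)); proj_Y is the (assumed cheap)
    Euclidean projection onto Y. With a constant step, uniform weights play
    the role of the accuracy certificate, and the recovered primal point is
    x_bar^N = sum_t lam_t x(y_t).
    """
    y, xs = y0, []
    for _ in range(n_iters):
        _, F_grad, x_y = F_oracle(y)
        xs.append(x_y)
        y = proj_Y(y - step * F_grad)          # one prox (here: projection) step
    lam = np.full(n_iters, 1.0 / n_iters)      # certificate: uniform weights
    x_bar = sum(l * x for l, x in zip(lam, xs))
    return x_bar, lam
```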

LMO-Based Nonsmooth Convex Minimization (continued)

    Opt(P) = min_{x∈X} {f(x) = max_{y∈Y} [⟨x, Ay + a⟩ − φ(y)]}    (P)
    [−Opt(P) =] Opt(D) = min_{y∈Y} {F(y) = φ(y) − min_{x∈X} ⟨x, Ay + a⟩}    (D)

Let Y be equipped with a cheap proximal setup. Then (P) can be solved by applying to (D) a proximal algorithm with good accuracy certificates (e.g., various versions of Mirror Descent) and recovering approximate solutions to (P) from the certificates. With this approach, an iteration requires a single call to the LMO for X and a single computation of a Bregman projection ξ ↦ argmin_{y∈Y} [⟨ξ, y⟩ + ω(y)].

An alternative is to use the F-t.r. of f and the proximal setup for Y to approximate f by

    f_δ(x) = max_{y∈Y} {⟨x, Ay + a⟩ − φ(y) − δω(y)}

and to minimize the C^{1,1} function f_δ(·) over X by Conditional Gradient.
Note: The alternative is just Nesterov's smoothing, with the smooth minimization carried out by the LMO-based Conditional Gradient rather than by proximal Fast Gradient methods.
Fact: When φ is affine (quite typical!), both approaches result in methods with the same iteration complexity and the same O(1/√t) efficiency estimate.

LMO-Based Nonsmooth Convex Minimization: How It Works

Test problems: Matrix Completion with uniform fit

    Opt = min_{x∈R^{p×p}: ‖x‖_nuc ≤ 1} {f(x) := max_{(i,j)∈Ω} |x_ij − a_ij|}
        = min_{x∈R^{p×p}: ‖x‖_nuc ≤ 1} max_{y∈Y} Σ_{(i,j)∈Ω} y_ij (x_ij − a_ij),
    Y = {y = {y_ij : (i,j) ∈ Ω} : Σ_{(i,j)∈Ω} |y_ij| ≤ 1}

Ω: N-element collection of cells in a p×p matrix.

Results, I: Restricted Memory Bundle-Level algorithm on a low-size (p = 512, N = 512) Matrix Completion instance:

    Memory depth     1     33    65    129
    Gap_1/Gap_1024   114   164   350   3253

Results, II: Subgradient Descent on Matrix Completion:

    p      N       Gap_1     Gap_1/Gap_32   Gap_1/Gap_128   Gap_1/Gap_1024   CPU, sec
    2048   8192    1.81e-1   171.2          213.8           451.4            521.3
    4096   16384   3.74e-1   335.4          1060.8          1287.3           1524.8
    8192   16384   2.54e-1   37.8           875.8           1183.6           3644.0

Platform: desktop PC with 4 × 3.40 GHz Intel Core2 CPU and 16 GB RAM, Windows 7-64 OS.

From Nonsmooth LMO-Based Convex Minimization to Variational Inequalities and Saddle Points

Motivating Example: Consider the Matrix Completion problem

    Opt = min_{u: ‖u‖_nuc ≤ 1} [f(u) := ‖Au − b‖_{2,2}]

- u ↦ Au : R^{n×n} → R^{m×m}, e.g., Au = Σ_{i=1}^k l_i u r_i^T
- ‖·‖_{2,2}: spectral norm (largest singular value) of a matrix

A Fenchel-type representation of f is immediate: f(u) = max_{‖v‖_nuc ≤ 1} ⟨v, Au − b⟩, so the problem of interest reduces to the bilinear saddle point problem

    min_{u∈U} max_{v∈V} ⟨v, Au − b⟩,   U = {u ∈ R^{n×n} : ‖u‖_nuc ≤ 1},  V = {v ∈ R^{m×m} : ‖v‖_nuc ≤ 1},

where both U and V admit computationally cheap LMOs, but do not admit computationally cheap proximal setups.
⇒ Our previous approach (same as any other known approach) is inapplicable: we needed Y = V to be proximal-friendly...
(?) How to solve convex-concave saddle point problems on products of LMO-represented domains?

Fenchel-Type Representation of a Monotone Operator: Definition

Definition (Fenchel-type representation): Let X ⊂ E be a convex compact set in a Euclidean space, and let Φ : X → E be a vector field on X. A Fenchel-type representation of Φ on X is

    Φ(x) = Ay(x) + a    (*)

- y ↦ Ay + a : F → E: an affine mapping from a Euclidean space F into E
- y(x): a strong solution to VI(G(·) − A*x, Y), where Y ⊂ F is convex and compact and G(·) : Y → F is monotone.

F, Y, A, a, y(·), G(·) is the data of the representation.

Definition (Dual operator): The dual operator induced by the F-t.r. (*) is

    Θ(y) = G(y) − A*x(y) : Y → F,   x(y) ∈ Argmin_{x∈X} ⟨Ay + a, x⟩

The v.i. VI(Θ, Y) is called the dual (induced by (*)) to the primal v.i. VI(Φ, X).

Fenchel-Type Representation of a Monotone Operator (continued)

Facts:
- If an operator Φ : X → E admits a representation on a convex compact set X ⊂ E, then Φ is monotone on X.
- The dual operator Θ induced by a Fenchel-type representation of a monotone operator is monotone. Θ is bounded, provided G(·) is so.

Calculus of Fenchel-type Representations: F-t.r.'s of monotone operators admit a fully algorithmic calculus:
- F-t.r.'s of the operands of the basic monotonicity-preserving operations (summation with nonnegative coefficients, direct summation, affine substitution of variables) can be straightforwardly converted into an F-t.r. of the result.
- An affine monotone operator admits an explicit F-t.r. on every compact domain.
- A good F-t.r. f(x) = max_{y∈Y} [⟨x, Ay + a⟩ − φ(y)] of a convex function f : X → R induces an F-t.r. of a subgradient field of f, provided φ ∈ C^1(Y).

A Digression: Variational Inequalities with Monotone Operators: Accuracy Measures

    Find x_* ∈ X : ⟨Φ(x), x − x_*⟩ ≥ 0 ∀x ∈ X    VI(Φ, X)

A natural measure of (in)accuracy of a candidate solution x̄ ∈ X to VI(Φ, X) is the dual gap function

    ε_vi(x̄ | Φ, X) = sup_{x∈X} ⟨Φ(x), x̄ − x⟩

When VI(Φ, X) comes from a convex-concave saddle point problem (X = U × V for convex compact sets U, V, and Φ(u, v) = [Φ_u(u, v) ∈ ∂_u f(u, v); Φ_v(u, v) ∈ ∂_v[−f(u, v)]] for a Lipschitz continuous convex-concave function f(u, v) : X = U × V → R), another natural accuracy measure is the saddle point inaccuracy

    ε_sad(x̄ = [ū; v̄] | f, U, V) := max_{v∈V} f(ū, v) − min_{u∈U} f(u, v̄)

Explanation: The convex-concave saddle point problem gives rise to two convex programs dual to each other,

    Opt(P) = min_{u∈U} [f̄(u) := max_{v∈V} f(u, v)]    (P)
    Opt(D) = max_{v∈V} [f̲(v) := min_{u∈U} f(u, v)]    (D)

with equal optimal values: Opt(P) = Opt(D). ε_sad(ū, v̄ | f, U, V) is the sum of the non-optimalities, in terms of the respective objectives, of ū ∈ U as a solution to (P) and of v̄ ∈ V as a solution to (D).
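[For completeness, the last sentence written out as a one-line identity, using Opt(P) = Opt(D).]

```latex
\[
\varepsilon_{\mathrm{sad}}(\bar u, \bar v \mid f, U, V)
 = \max_{v\in V} f(\bar u, v) - \min_{u\in U} f(u,\bar v)
 = \underbrace{\big[\bar f(\bar u) - \mathrm{Opt}(P)\big]}_{\text{non-optimality of }\bar u\text{ in }(P)}
 + \underbrace{\big[\mathrm{Opt}(D) - \underline f(\bar v)\big]}_{\text{non-optimality of }\bar v\text{ in }(D)} .
\]
```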

Why Do Accuracy Certificates Certify Accuracy?

Fact: Let the v.i. VI(Ψ, Z) with monotone operator Ψ and convex compact Z be solved by an N-step FOM, let I_N = {z_i ∈ Z, Ψ(z_i)}_{i=1}^N be the execution protocol, and let λ^N = {λ_i ≥ 0}_{i=1}^N, Σ_i λ_i = 1, be an accuracy certificate. Then z^N = Σ_{i=1}^N λ_i z_i is a feasible solution to VI(Ψ, Z), and

    ε_vi(z^N | Ψ, Z) ≤ Res(I_N, λ^N) := max_{z∈Z} Σ_{i=1}^N λ_i ⟨Ψ(z_i), z_i − z⟩

When Ψ is associated with a convex-concave saddle point problem min_{u∈U} max_{v∈V} f(u, v), we also have ε_sad(z^N | f, U, V) ≤ Res(I_N, λ^N).

Fact: Let Ψ be a bounded vector field on a convex compact domain Z. For every N = 1, 2, ..., a properly designed N-step proximal FOM (Mirror Descent) as applied to VI(Ψ, Z) generates an execution protocol I_N and an accuracy certificate λ^N such that

    Res(I_N, λ^N) ≤ O(1/√N)

If Ψ is Lipschitz continuous on Z, then for a properly selected N-step FOM (Mirror Prox) the efficiency estimate improves to Res(I_N, λ^N) ≤ O(1/N). In both cases, the factors hidden in O(·) are explicitly given by the parameters of the proximal setup and by the magnitude of Ψ (first case), or the Lipschitz constant of Ψ (second case).

Solving Monotone Variational Inequalities on LMO-Represented Domains

In order to solve a primal v.i. VI(Φ, X) given an F-t.r.

    Φ(x) = Ay(x) + a,  where y(x) ∈ Y and ⟨G(y(x)) − A*x, y − y(x)⟩ ≥ 0 ∀y ∈ Y,

we solve the dual v.i. VI(Θ, Y),

    Θ(y) = G(y) − A*x(y),  where x(y) ∈ Argmin_{x∈X} ⟨x, Ay + a⟩

Note: Computing Θ(y) reduces to computing G(y), multiplying by A and A*, and a single call to the LMO representing X.

Theorem [Juditsky, Nem. '13]: Let I_N = {y_i ∈ Y, Θ(y_i)}_{i=1}^N be the execution protocol of an FOM applied to the dual v.i. VI(Θ, Y), and let λ^N = {λ_i ≥ 0}_{i=1}^N, Σ_i λ_i = 1, be an associated accuracy certificate. Then x^N = Σ_{i=1}^N λ_i x(y_i) is a feasible solution to the primal v.i. VI(Φ, X) and

    ε_vi(x^N | Φ, X) ≤ Res(I_N, λ^N) := max_{y∈Y} Σ_{i=1}^N λ_i ⟨Θ(y_i), y_i − y⟩

If Φ is associated with the bilinear convex-concave saddle point problem min_{u∈U} max_{v∈V} [f(u, v) = ⟨a, u⟩ + ⟨b, v⟩ + ⟨v, Au⟩], then also ε_sad(x^N | f, U, V) ≤ Res(I_N, λ^N).
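[Schematically, and with the FOM on Y and its certificate left abstract (theta_oracle, fom_step and the uniform weights below are placeholders, not the talk's concrete algorithm), the recipe of this slide looks as follows.]

```python
import numpy as np

def primal_from_dual_vi(theta_oracle, fom_step, y0, n_iters):
    """Run some FOM on the dual v.i. VI(Theta, Y) and recover a primal
    solution by certificate-weighted averaging of the LMO answers x(y_i).

    theta_oracle(y) is assumed to return (Theta(y), x(y)) with
    Theta(y) = G(y) - A* x(y), x(y) in Argmin_{x in X} <x, A y + a>;
    fom_step(y, theta) performs one step of the chosen FOM on Y.
    """
    y, xs = y0, []
    for _ in range(n_iters):
        theta, x_y = theta_oracle(y)
        xs.append(x_y)
        y = fom_step(y, theta)
    lam = np.full(n_iters, 1.0 / n_iters)        # stand-in for the certificate
    return sum(l * x for l, x in zip(lam, xs))   # x^N = sum_i lam_i x(y_i)
```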

How It Works

As applied to the Motivating Example

    Opt = min_{u∈R^{n×n}, ‖u‖_nuc ≤ 1} [f(u) := ‖Au − b‖_{2,2}]
        = min_{u∈R^{n×n}, ‖u‖_nuc ≤ 1} max_{v∈R^{m×m}, ‖v‖_nuc ≤ 1} ⟨v, Au − b⟩,   Au = Σ_{i=1}^k l_i u r_i^T,

our approach results in a method yielding, in N = 1, 2, ... steps, feasible approximate solutions u^N to the problem of interest and lower bounds Opt_N on Opt such that

    Gap_N := f(u^N) − Opt_N ≤ O(1) ‖A‖_{2,2} / √N

    Iteration count N            1        65       129      193      257      321      385      449      512
    m = 512,  n = 1024,  k = 2
      Gap_N                      0.1269   0.0239   0.0145   0.0103   0.0075   0.0063   0.0042   0.0040   0.0040
      Gap_1/Gap_N                1.00     5.31     8.78     12.38    17.03    20.20    29.98    31.41    31.66
      cpu, sec                   0.2      9.5      27.6     69.1     112.6    218.1    326.2    432.6    536.4
    m = 1024, n = 2048,  k = 2
      Gap_N                      0.1329   0.0196   0.0119   0.0075   0.0053   0.0041   0.0036   0.0034   0.0027
      Gap_1/Gap_N                1.00     6.79     11.21    17.81    25.09    32.29    37.23    38.70    50.06
      cpu, sec                   0.7      38.0     101.1    206.3    314.1    508.9    699.0    884.9    1070.0
    m = 2048, n = 4096,  k = 2
      Gap_N                      0.1239   0.0222   0.0139   0.0108   0.0086   0.0041   0.0037   0.0035   0.0035
      Gap_1/Gap_N                1.00     5.57     8.93     11.48    14.40    30.48    33.14    35.76    35.77
      cpu, sec                   2.2      103.5    257.6    496.9    742.5    1147.8   1564.4   1981.4   2401.0
    m = 4096, n = 8192,  k = 2
      Gap_N                      0.1193   0.0232   0.0134   0.0108   0.0054   0.0040   0.0035   0.0034   0.0034
      Gap_1/Gap_N                1.00     5.14     8.90     11.08    22.00    29.83    33.93    34.85    35.14
      cpu, sec                   6.5      289.9    683.8    1238.1   1816.0   2724.5   3648.3   4572.2   5490.8
    m = 8192, n = 16384, k = 2
      Gap_N                      0.11959  0.02136  0.01460  0.01011  0.00853
      Gap_1/Gap_N                1.00     5.60     8.19     11.82    14.01
      cpu, sec                   21.7     920.4    2050.2   3492.4   4902.2

Platform: 4 × 3.40 GHz desktop with 16 GB RAM, 64-bit Windows 7 OS.
Note: The design dimension of the largest instance is 2^28 = 268 435 456.

References
- Bruce Cox, Anatoli Juditsky, Arkadi Nemirovski, "Dual subgradient algorithms for large-scale nonsmooth learning problems," to appear in Mathematical Programming Series B; arXiv:1302.2349, Aug. 2013.
- Anatoli Juditsky, Arkadi Nemirovski, "Solving variational inequalities with monotone operators on domains given by Linear Minimization Oracles," submitted to Mathematical Programming; arXiv:1312.1073, Dec. 2013.