Stochastic Semi-Proximal Mirror-Prox


Stochastic Semi-Proximal Mirror-Prox

Niao He (Georgia Institute of Technology), Zaid Harchaoui (NYU, Inria)

Abstract

We present a direct extension of the Semi-Proximal Mirror-Prox algorithm [3] for minimizing convex composite problems whose objectives consist of three parts: a convex loss represented by a stochastic oracle, a proximal-friendly regularizer, and an LMO-friendly regularizer. The algorithm leverages stochastic oracles, proximal operators, and linear minimization over the problem's domain, and attains optimality in two respects: i) it is optimal in the number of calls to the stochastic oracle representing the objective, and ii) it is optimal in the number of calls to the linear minimization oracle representing the problem's domain, in the smooth saddle point case.

1 Introduction

The last decade has seen significant interest in minimizing composite functions of the form

    min_{x ∈ X} (1/n) Σ_{i=1}^n f_i(x) + h(x),

where the f_i are convex, continuously differentiable functions and h is convex but possibly nondifferentiable. Such problems arise ubiquitously in machine learning, where the first term usually refers to an empirical risk and h is a regularizer. Mainstream algorithms are devoted to two challenges: i) the large number of summands, and ii) the nonsmoothness of the regularization term. To deal with the finite sum, a variety of stochastic and incremental gradient methods have been developed that use only one component, or a mini-batch of components, at a time. To handle the nonsmooth regularization while exploiting its structure, two kinds of algorithms have been widely studied: proximal-type and conditional-gradient-type methods. Proximal-type methods ([10, 1, 12]) require the computation of a composite proximal operator at each iteration, i.e., solving problems of the form

    min_{x ∈ X} { (1/2)||x||² + <ξ, x> + α h(x) },

given an input vector ξ and a positive scalar α. Functions h that admit easy-to-compute proximal operators are called proximal-friendly.
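To make the proximal-friendly notion concrete: for the ℓ1-norm h(x) = ||x||_1 on R^n (the textbook case, not an example worked out in this paper), the minimizer of (1/2)||x − ξ||² + α||x||_1 — the standard, equivalent form of the prox problem — is entrywise soft-thresholding. A minimal sketch:

```python
import numpy as np

def prox_l1(xi, alpha):
    """Proximal operator of alpha * ||x||_1: the minimizer of
    (1/2)||x - xi||^2 + alpha * ||x||_1, computed entrywise by
    soft-thresholding (shrink each entry toward 0 by alpha)."""
    return np.sign(xi) * np.maximum(np.abs(xi) - alpha, 0.0)

# Entries with magnitude <= alpha are zeroed out; the rest shrink by alpha.
x = prox_l1(np.array([3.0, -0.5, 1.2]), 1.0)
```

The closed form is what makes the per-iteration cost of proximal-type methods negligible for such regularizers.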
In contrast, conditional-gradient-type methods [2, 11] operate with a linear minimization oracle (LMO) at each iteration, i.e., they solve problems of the form

    min_{x ∈ X} { <ξ, x> + α h(x) },

which is much cheaper than computing proximal operators. For instance, in the case of the nuclear norm, the LMO only requires computing the leading pair of singular vectors, which is orders of magnitude faster than the full singular value decomposition required to compute the proximal operator. Such h are called LMO-friendly.

The scope of this paper. Despite much success in this regime, little work has been devoted to the situation where the regularization is a mixture of proximal-friendly and LMO-friendly components. Such mixtures are often introduced to promote several desired properties of the solution simultaneously, such as sparsity and low rank. While recent work [3, 7] considers only the deterministic case, in this paper we focus on the stochastic setting and aim to solve problems such as

(i) stochastic composite optimization:

    min_{x=[x_1,x_2] ∈ X} E_ξ[f(x, ξ)] + h_1(x_1) + h_2(x_2)    (1)

(ii) stochastic composite saddle point problems:

    min_{x_1 ∈ X_1} max_{x_2 ∈ X_2} E_ξ[Φ(x_1, x_2, ξ)] + h_1(x_1) − h_2(x_2)    (2)
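The nuclear-norm LMO mentioned above can be sketched as follows. This is an illustrative implementation over the nuclear-norm ball {X : ||X||_* ≤ τ}; a full SVD stands in, for brevity, for the leading-pair computation that a power or Lanczos method would perform in practice:

```python
import numpy as np

def lmo_nuclear_ball(Xi, tau):
    """Linear minimization oracle for the nuclear-norm ball of radius tau:
    argmin_{||X||_* <= tau} <Xi, X> is attained at -tau * u1 v1^T, where
    (u1, v1) is the leading singular-vector pair of Xi. (A full SVD is used
    here only for brevity; in practice only the leading pair is computed.)"""
    U, s, Vt = np.linalg.svd(Xi, full_matrices=False)
    return -tau * np.outer(U[:, 0], Vt[0, :])

rng = np.random.default_rng(0)
Xi = rng.standard_normal((5, 4))
X = lmo_nuclear_ball(Xi, tau=2.0)
# X is rank one with nuclear norm tau, and <Xi, X> = -tau * s_max(Xi).
```

The returned rank-one matrix is exactly why LMO-based methods scale: one leading singular pair per iteration instead of a full SVD.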

where E[f(x, ξ)] is convex in x ∈ X; E[Φ(x_1, x_2, ξ)] is convex in x_1 ∈ X_1 and concave in x_2 ∈ X_2; h_1(x_1) is proximal-friendly and h_2(x_2) is LMO-friendly. Such problems indeed arise in many sparse and low-rank regularized models used in recommendation systems and social network prediction. To address them in a unified manner, we will focus on a broad class of variational inequalities with mixed structure, which further generalizes problems (1) and (2).

Our contribution. We develop a stochastic variant of the Semi-Proximal Mirror-Prox algorithm proposed in [3] that supports and leverages stochastic oracles, proximal operators, and linear minimization oracles, achieving the best of the three worlds. The algorithm extends the usual conditional gradient method to the stochastic setting and enjoys, in terms of LMO calls, an O(1/ε²) complexity in the smooth case (e.g., f and Φ in (1) and (2) smooth) as well as an O(1/ε⁴) complexity in the general nonsmooth case. In both situations, the algorithm attains the optimal O(1/ε²) complexity in the number of stochastic oracle calls. To the best of our knowledge, both the algorithm and the theoretical results presented here are novel.

2 Stochastic Semi-Proximal Mirror-Prox

2.1 The situation

Structured Variational Inequalities. We consider the variational inequality VI(X, F):

    find x_* ∈ X : <F(x), x − x_*> ≥ 0 for all x ∈ X,

with domain X and operator F satisfying assumptions (A.1)-(A.4) below.

(A.1) The set X ⊂ E_u × E_v is closed convex and its projection P X = {u : x = [u; v] ∈ X} ⊂ U, where U is convex and closed, and E_u, E_v are Euclidean spaces;

(A.2) The function ω(·) : U → R is continuously differentiable and 1-strongly convex w.r.t. some norm ||·||.
This defines the Bregman distance V_u(u') = ω(u') − ω(u) − <ω'(u), u' − u> ≥ (1/2)||u' − u||²;

(A.3) The operator F(x = [u; v]) : X → E_u × E_v is monotone and of the form F(u, v) = [F_u(u); F_v], with F_v ∈ E_v a constant and F_u(u) ∈ E_u satisfying, for some L < ∞ and M < ∞,

    ||F_u(u) − F_u(u')||_* ≤ L||u − u'|| + M for all u, u' ∈ U;

(A.4) The linear form <F_v, v> of [u; v] ∈ E_u × E_v is bounded from below on X and is coercive on X w.r.t. v: whenever [u_t; v_t] ∈ X, t = 1, 2, ..., is a sequence such that {u_t} is bounded and ||v_t||_2 → ∞ as t → ∞, we have <F_v, v_t> → ∞ as t → ∞.

Semi-structured Variational Inequalities. The class of semi-structured variational inequalities goes beyond Assumptions (A.1)-(A.4) by assuming more sub-structure. This structure is consistent with what we call a semi-proximal setup, which encompasses both the regular proximal setup and the regular linear minimization setup as special cases. Indeed, we further assume:

(S.1) Proximal setup for X: we assume that E_u = E_{u1} × E_{u2}, E_v = E_{v1} × E_{v2}, and U ⊂ U_1 × U_2, X = X_1 × X_2 with X_i ⊂ E_{ui} × E_{vi} and P X_i = {u_i : [u_i; v_i] ∈ X_i} ⊂ U_i for i = 1, 2, where U_1 is convex and closed and U_2 is convex and compact. We also assume that ω(u) = ω_1(u_1) + ω_2(u_2) and ||u|| = ||u_1||_{E_{u1}} + ||u_2||_{E_{u2}}, with ω_2(·) : U_2 → R continuously differentiable and such that

    ω_2(u_2') ≤ ω_2(u_2) + <ω_2'(u_2), u_2' − u_2> + (L_0/κ)||u_2' − u_2||^κ_{E_{u2}}, for all u_2, u_2' ∈ U_2,

for a particular 1 < κ ≤ 2 and L_0 < ∞. Furthermore, we assume that the ||·||_{E_{u2}}-diameter of U_2 is bounded by some D > 0.

(S.2) Partition of F: the operator F induced by the above partition of X_1 and X_2 can be written as F(x) = [F_u(u); F_v] with F_u(u) = [F_{u1}(u_1, u_2); F_{u2}(u_1, u_2)] and F_v = [F_{v1}; F_{v2}].

(S.3) Proximal mapping on X_1: we assume that for any η_1 ∈ E_{u1} and α > 0, we have at our disposal easy-to-compute prox-mappings of the form

    Prox_{ω1}(η_1, α) := argmin_{x_1=[u_1;v_1] ∈ X_1} { ω_1(u_1) + <η_1, u_1> + α<F_{v1}, v_1> }.

(S.4) Linear minimization oracle for X_2: we assume that we have at our disposal a composite linear minimization oracle (LMO) which, given any input η_2 ∈ E_{u2} and α > 0, returns an optimal solution to the minimization problem with linear form, that is,

    LMO(η_2, α) := argmin_{x_2=[u_2;v_2] ∈ X_2} { <η_2, u_2> + α<F_{v2}, v_2> }.

Stochastic Oracles. We are interested in the situation where we only have access to noisy information on F_u(u). More specifically, we assume that the operator is represented by a stochastic oracle which, for any u ∈ U, returns a vector g(u, ξ) satisfying

(C.1) Unbiasedness and bounded variance: E[g(u, ξ)] = F_u(u), E[||g(u, ξ) − F_u(u)||_*²] ≤ σ²;
(C.2) Light tail: E[exp{||g(u, ξ) − F_u(u)||_*²/σ²}] ≤ exp{1} for some σ > 0,

where ||·||_* is the same dual norm as in (A.3). Note that, by Jensen's inequality, (C.2) implies (C.1).

2.2 Stochastic Semi-Proximal Mirror-Prox

We present our Stochastic Semi-Proximal Mirror-Prox in Algorithm 1. The algorithm blends composite Mirror Prox and composite Conditional Gradient to extract the best efficiency from the mixed structure. At each iteration t, we estimate the operator F by averaging the vectors returned by the stochastic oracle. For the sub-domain X_2 given by an LMO, instead of computing the prox-mapping exactly, we mimic the prox-mapping via a conditional gradient routine (denoted CCG) taken directly from [3]. For the sub-domain X_1, we compute the prox-mapping exactly.

Algorithm 1: Stochastic Semi-Proximal Mirror-Prox for Semi-VI(X, F)
Input: stepsizes γ_t > 0, accuracies ε_t ≥ 0, t = 1, 2, ...
[1] Initialize x^1 = [x_1^1; x_2^1] ∈ X, where x_1^1 = [u_1^1; v_1^1] and x_2^1 = [u_2^1; v_2^1].
for t = 1, 2, ..., T do
[2] Set u^t = [u_1^t; u_2^t], compute [g_1^t; g_2^t] = (1/m_t) Σ_{j=1}^{m_t} g(u^t, ξ_j^t) and y^t = [y_1^t; y_2^t] such that
    y_1^t := [û_1^t; v̂_1^t] = Prox_{ω1}(γ_t g_1^t − ω_1'(u_1^t), γ_t)
    y_2^t := [û_2^t; v̂_2^t] = CCG(X_2, ω_2(·) + <γ_t g_2^t − ω_2'(u_2^t), ·>, γ_t F_{v2}; ε_t)
[3] Set û^t = [û_1^t; û_2^t], compute [ĝ_1^t; ĝ_2^t] = (1/m_t) Σ_{j=m_t+1}^{2m_t} g(û^t, ξ_j^t) and x^{t+1} = [x_1^{t+1}; x_2^{t+1}] such that
    x_1^{t+1} := [u_1^{t+1}; v_1^{t+1}] = Prox_{ω1}(γ_t ĝ_1^t − ω_1'(u_1^t), γ_t)
    x_2^{t+1} := [u_2^{t+1}; v_2^{t+1}] = CCG(X_2, ω_2(·) + <γ_t ĝ_2^t − ω_2'(u_2^t), ·>, γ_t F_{v2}; ε_t)
end for
Output: x̄^T := [ū^T; v̄^T] = (Σ_{t=1}^T γ_t)^{-1} Σ_{t=1}^T γ_t y^t

The CCG routine [3] is designed to solve smooth semi-linear problems

    min_{x=[u;v] ∈ X} { φ^+(u, v) = φ(u) + <c, v> }.

For an input (X, φ, c; ε), the routine CCG(X, φ, c; ε) works as follows:
(a) x^1 := [u^1; v^1] ∈ X;
(b) x̂^t := [û^t; v̂^t] = argmin_{x=[u;v] ∈ X} { <φ'(u^t), u> + <c, v> }; compute δ_t = <(φ^+)'(x^t), x^t − x̂^t> and return x^t if δ_t ≤ ε;
(c) x^{t+1} := [u^{t+1}; v^{t+1}] s.t. φ^+(x^{t+1}) ≤ φ^+(x^t + (2/(t+1))(x̂^t − x^t)).

We provide the convergence analysis in the next section.

3 Convergence Results

3.1 Main Results

Let us denote the resolution of the execution protocol I_T = {y^t, F(y^t)}_{t=1}^T with weights λ^T = {λ_t}_{t=1}^T, λ_t = γ_t / Σ_{τ=1}^T γ_τ, on any set X' ⊂ X by

    Res(X' | I_T, λ^T) = sup_{x ∈ X'} Σ_{t=1}^T λ_t <F(y^t), y^t − x>.

The resolution can be used to certify the accuracy of the approximate solution x̄^T = Σ_{t=1}^T λ_t y^t for generic convex problems, not limited to variational inequalities (see [8] for details). We arrive at the following theorem.

Theorem 3.1. Let the stepsizes satisfy 0 < γ_t ≤ (√3 L)^{-1}, and let Θ[X'] = sup_{[u;v] ∈ X'} V_{u^1}(u) for any X' ⊂ X. For a sequence of inexact prox-mappings with inexactness ε_t ≥ 0 and batch sizes m_t > 0, we have under assumption (C.1) that

    E[Res(X' | I_T, λ^T)] ≤ M_0(T) := (Σ_{t=1}^T γ_t)^{-1} ( 2Θ[X'] + Σ_{t=1}^T (7γ_t²/2)(M² + 2σ²/m_t) + Σ_{t=1}^T 2ε_t ).

Moreover, if assumption (C.2) holds, then for any Λ > 0,

    Prob{ Res(X' | I_T, λ^T) ≥ M_0(T) + Λ M_1(T) } ≤ exp{−Λ²/3} + exp{−Λ},

where M_1(T) = (Σ_{t=1}^T γ_t)^{-1} ( 7 Σ_{t=1}^T γ_t²σ²/m_t + 3 sqrt( σ²Θ[X'] Σ_{t=1}^T γ_t²/m_t ) ).

As a corollary, we can derive (based on intermediate results in [8]):

(i) when F is a monotone vector field, the efficiency estimate applies to the dual gap of the variational inequality, i.e., E[ε_VI(x̄^T | X, F)] ≤ M_0(T);

(ii) when F stems from the (sub)gradient of a convex minimization problem min_{x ∈ X} f(x) (for instance, problem (1)) with optimal solution x_*, the efficiency estimate applies to the suboptimality, i.e., E[f(x̄^T) − f(x_*)] ≤ M_0(T);

(iii) when F stems from a convex-concave saddle point problem min_{x_1 ∈ X_1} max_{x_2 ∈ X_2} Φ(x_1, x_2) (for instance, problem (2)), with the two induced convex optimization problems

    Opt(P) = min_{x_1 ∈ X_1} [ Φ̄(x_1) = sup_{x_2 ∈ X_2} Φ(x_1, x_2) ]    (P)
    Opt(D) = max_{x_2 ∈ X_2} [ Φ̲(x_2) = inf_{x_1 ∈ X_1} Φ(x_1, x_2) ]    (D)

the efficiency estimate is inherited by both the primal and the dual suboptimality gaps, i.e., E[Φ̄(x̄_1^T) − Opt(P)] ≤ M_0(T) and E[Opt(D) − Φ̲(x̄_2^T)] ≤ M_0(T).

3.2 When F_u is Lipschitz continuous (M = 0)

In the case when F_u is a Lipschitz continuous monotone operator with L > 0 and M = 0, we have

Proposition 3.1. Let assumptions (A.1)-(A.4) and (S.1)-(S.4) hold with M = 0, and let the proximal setup on U_2 be the Euclidean setup.
Setting the stepsizes γ_t = (√2 L)^{-1} and batch sizes m_t = O(γ_t² T σ²/Θ[X]), t = 1, ..., T, for the Stochastic Semi-Proximal Mirror-Prox algorithm to return a stochastic ε-solution to the variational inequality VI(X, F) represented by the stochastic oracle in (C.1)-(C.2), the total number of stochastic oracle calls required does not exceed

    N_SO = O(1) σ²Θ[X]/ε²

and the total number of calls to the linear minimization oracle does not exceed

    N_LMO = O(1) L²D²Θ[X]/ε²,

where σ², D, Θ[X] are defined as above.

3.3 When F_u is bounded (L = 0)

In the case when F_u is a uniformly bounded monotone operator with M > 0 and L = 0, we have

Proposition 3.2. Let the same assumptions hold with L = 0. Setting the stepsizes γ_t = O(1) sqrt(Θ[X])/(M sqrt(T)) and batch sizes m_t = O(1)σ²/M², t = 1, ..., T, for the Stochastic Semi-Proximal Mirror-Prox algorithm to return a stochastic ε-solution to the variational inequality VI(X, F), the total number of stochastic oracle calls required does not exceed

    N_SO = O(1) σ²Θ[X]/ε²

and the total number of calls to the linear minimization oracle does not exceed

    N_LMO = O(1) M⁴D²Θ[X]/ε⁴.

Discussion. The stochastic variant of the Semi-Proximal Mirror-Prox algorithm is designed to solve a broad class of variational inequalities covering the stochastic composite minimization and saddle point problems (1) and (2). The algorithm enjoys the optimal complexity bounds, i.e., O(1/ε²), both in the number of calls to the stochastic oracle (see [9]) and in the number of calls to the linear minimization oracle (see [6]) when the underlying problem is in saddle point form and the associated monotone operator is Lipschitz continuous. In the general nonsmooth situation, the algorithm still enjoys the optimal O(1/ε²) complexity in the number of stochastic oracle calls, but the number of linear minimization oracle calls becomes O(1/ε⁴).
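To illustrate the LMO-based inner machinery, here is a minimal Frank-Wolfe loop in the spirit of the CCG routine of Section 2.2, using the duality-gap quantity δ_t as the stopping rule. The quadratic-over-simplex instance and the tolerance are purely illustrative, not taken from the paper:

```python
import numpy as np

def conditional_gradient(grad_phi, lmo, x0, eps, max_iter=10000):
    """Frank-Wolfe with the CCG-style gap stopping rule: stop once
    delta_t = <grad phi(x_t), x_t - hat{x}_t> <= eps.
    `lmo(g)` returns argmin_x <g, x> over the feasible set."""
    x = x0
    for t in range(1, max_iter + 1):
        g = grad_phi(x)
        x_hat = lmo(g)
        gap = g @ (x - x_hat)            # duality-gap certificate delta_t
        if gap <= eps:
            return x, gap
        x = x + (2.0 / (t + 1)) * (x_hat - x)   # standard FW step size
    return x, gap

# Toy instance: minimize (1/2)||x - c||^2 over the probability simplex.
c = np.array([0.1, 0.5, 0.4])
grad = lambda x: x - c
lmo = lambda g: np.eye(len(g))[np.argmin(g)]    # vertex (basis vector) oracle
x, gap = conditional_gradient(grad, lmo, np.array([1.0, 0.0, 0.0]), 1e-2)
```

The returned gap certifies primal suboptimality: once δ_t ≤ ε, the iterate is an ε-solution of the inner subproblem, which is exactly the inexactness ε_t fed into the outer Mirror-Prox analysis.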

References

[1] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1).
[2] Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, pages 1-38.
[3] Niao He and Zaid Harchaoui. Semi-proximal mirror-prox for nonsmooth composite minimization. Neural Information Processing Systems (NIPS).
[4] Niao He, Anatoli Juditsky, and Arkadi Nemirovski. Mirror prox algorithm for multi-term composite minimization and semi-separable problems. Computational Optimization and Applications, 61(2).
[5] Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17-58.
[6] Guanghui Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv.
[7] Cun Mu, Yuqian Zhang, John Wright, and Donald Goldfarb. Scalable robust matrix recovery: Frank-Wolfe meets proximal methods. arXiv preprint.
[8] Arkadi Nemirovski, Shmuel Onn, and Uriel G. Rothblum. Accuracy certificates for computational problems with convex structure. Mathematics of Operations Research, 35(1):52-78.
[9] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics.
[10] Yu. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1).
[11] Yurii Nesterov. Complexity bounds for primal-dual methods minimizing the model of objective function. Technical report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE).
[12] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3).

4 Technical Details

4.1 Problems (1) and (2) in the Form of Semi-Structured Variational Inequalities

The stochastic composite optimization problem (1) can be written as

    min_{u=[u_1,u_2] ∈ U=U_1×U_2} E_ξ[f(u, ξ)] + h_1(u_1) + h_2(u_2)
    = min_{u ∈ U, v_1 ≥ h_1(u_1), v_2 ≥ h_2(u_2)} E_ξ[f(u, ξ)] + v_1 + v_2.

The variational inequality VI(X, F) associated with the above problem is given by

    X = { x = [u = [u_1, u_2]; v = [v_1, v_2]] : u_1 ∈ U_1, u_2 ∈ U_2, v_1 ≥ h_1(u_1), v_2 ≥ h_2(u_2) },
    F(x = [u = [u_1, u_2]; v = [v_1, v_2]]) = [ [∇_{u_1} E_ξ[f(u, ξ)]; ∇_{u_2} E_ξ[f(u, ξ)]]; [1; 1] ].

Clearly, this VI(X, F) satisfies assumptions (A.1)-(A.4) and (S.1)-(S.4). Similarly, for the stochastic composite saddle point problem (2) and its reformulation

    min_{u_1 ∈ U_1} max_{u_2 ∈ U_2} E_ξ[Φ(u_1, u_2, ξ)] + h_1(u_1) − h_2(u_2)
    = min_{u_1 ∈ U_1, v_1 ≥ h_1(u_1)} max_{u_2 ∈ U_2, v_2 ≥ h_2(u_2)} E_ξ[Φ(u_1, u_2, ξ)] + v_1 − v_2,

the variational inequality VI(X, F) associated with the above problem is given by

    X = { x = [u = [u_1, u_2]; v = [v_1, v_2]] : u_1 ∈ U_1, u_2 ∈ U_2, v_1 ≥ h_1(u_1), v_2 ≥ h_2(u_2) },
    F(x = [u = [u_1, u_2]; v = [v_1, v_2]]) = [ [∇_{u_1} E_ξ[Φ(u_1, u_2, ξ)]; −∇_{u_2} E_ξ[Φ(u_1, u_2, ξ)]]; [1; 1] ].

Clearly, this VI(X, F) also satisfies assumptions (A.1)-(A.4) and (S.1)-(S.4).

4.2 Proof of Theorem 3.1

Proof. The proof builds upon [5] and [4]. First of all, we show

Lemma 4.1. For any ε ≥ 0, x = [u; v] ∈ X and ξ = [η; ζ] ∈ E, let [u'; v'] = P_x^ε(ξ), where

    P_x^ε(ξ) = { x̂ = [û; v̂] ∈ X : <η + ω'(û) − ω'(u), û − s> + <ζ, v̂ − w> ≤ ε for all [s; w] ∈ X }.

Then for all [s; w] ∈ X,

    <η, u' − s> + <ζ, v' − w> ≤ V_u(s) − V_{u'}(s) − V_u(u') + ε.    (3)
Applying Lemma 4.1 with [u; v] = x^t, ξ = [γ_t g^t; γ_t F_v], [u'; v'] = y^t, and [s; w] = [u^{t+1}; v^{t+1}] = x^{t+1}, we obtain

    γ_t [ <g^t, û^t − u^{t+1}> + <F_v, v̂^t − v^{t+1}> ] ≤ V_{u^t}(u^{t+1}) − V_{û^t}(u^{t+1}) − V_{u^t}(û^t) + ε_t    (4)

and applying Lemma 4.1 with [u; v] = x^t, ξ = γ_t [ĝ^t; F_v], [u'; v'] = x^{t+1}, and [s; w] = z ∈ X, we get

    γ_t [ <ĝ^t, u^{t+1} − s> + <F_v, v^{t+1} − w> ] ≤ V_{u^t}(s) − V_{u^{t+1}}(s) − V_{u^t}(u^{t+1}) + ε_t.    (5)

Adding (4) and (5), we obtain for every z = [s; w] ∈ X

    γ_t [ <ĝ^t, û^t − s> + <F_v, v̂^t − w> ] ≤ V_{u^t}(s) − V_{u^{t+1}}(s) + σ_t + 2ε_t,    (6)

where

    σ_t := γ_t <ĝ^t − g^t, û^t − u^{t+1}> − V_{û^t}(u^{t+1}) − V_{u^t}(û^t).

Let Δ_t = F_u(û^t) − ĝ^t. Then for any z = [s; w] ∈ X, we have

    Σ_{t=1}^T γ_t <F(y^t), y^t − z> ≤ Θ[X] + Σ_{t=1}^T (σ_t + 2ε_t) + Σ_{t=1}^T γ_t <Δ_t, û^t − s>.    (7)

Let e_t = ||g^t − F_u(u^t)||_* and ê_t = ||ĝ^t − F_u(û^t)||_* = ||Δ_t||_*. Then we have

    ||ĝ^t − g^t||_* = ||(ĝ^t − F_u(û^t)) + (F_u(û^t) − F_u(u^t)) + (F_u(u^t) − g^t)||_* ≤ ê_t + L||û^t − u^t|| + M + e_t,

hence ||ĝ^t − g^t||_*² ≤ 3L²||û^t − u^t||² + 3M² + 3(e_t + ê_t)². Hence,

    σ_t ≤ (γ_t²/2)||ĝ^t − g^t||_*² + (1/2)||û^t − u^{t+1}||² − V_{û^t}(u^{t+1}) − V_{u^t}(û^t) ≤ (γ_t²/2)||ĝ^t − g^t||_*² − (1/2)||û^t − u^t||².

Since the stepsizes satisfy 3γ_t²L² ≤ 1, we further have

    σ_t ≤ (3γ_t²/2)[M² + (e_t + ê_t)²].    (8)

Define a special sequence {ũ^t} by ũ^1 = u^1 and

    ũ^{t+1} = argmin_{u ∈ P_u X} { γ_t <Δ_t, u> + V_{ũ^t}(u) }.

The sequence so defined satisfies the following relation (see Corollary 2 in [5] for details): for any z = [s; w] ∈ X,

    Σ_{t=1}^T γ_t <Δ_t, ũ^t − s> ≤ Θ[X] + Σ_{t=1}^T (γ_t²/2)||Δ_t||_*² = Θ[X] + Σ_{t=1}^T (γ_t²/2) ê_t².    (9)

Combining (7), (8), (9), we end up with

    Res(X | I_T, λ^T) ≤ (Σ_{t=1}^T γ_t)^{-1} ( 2Θ[X] + Σ_{t=1}^T (7γ_t²/2)[M² + e_t² + ê_t²] + Σ_{t=1}^T 2ε_t + Σ_{t=1}^T γ_t <Δ_t, û^t − ũ^t> ).    (10)

Under Assumption (C.1), we have

    E[Δ_t | F_t] = 0,  E[e_t² | G_{t−1}] ≤ σ²/m_t,  and  E[ê_t² | F_t] ≤ σ²/m_t,

where F_t = σ(ξ_1^1, ..., ξ_{2m_1}^1, ..., ξ_1^t, ..., ξ_{m_t}^t) and G_t = σ(ξ_1^1, ..., ξ_{2m_1}^1, ..., ξ_1^t, ..., ξ_{2m_t}^t). One can further show that E[<Δ_t, û^t − ũ^t>] = 0. It follows from (10) that

    E[Res(X | I_T, λ^T)] ≤ (Σ_{t=1}^T γ_t)^{-1} ( 2Θ[X] + Σ_{t=1}^T (7γ_t²/2)[M² + 2σ²/m_t] + Σ_{t=1}^T 2ε_t ),    (11)

which proves the first part of the theorem.

Under Assumption (C.2), we have

    E[exp{e_t²/(σ/√m_t)²}] ≤ exp{1}  and  E[exp{ê_t²/(σ/√m_t)²}] ≤ exp{1}.

Let C_1 = Σ_{t=1}^T γ_t²σ²/m_t. It follows from convexity of the exponential and the above that

    E[ exp{ (1/C_1) Σ_{t=1}^T γ_t²(e_t² + ê_t²) } ] ≤ E[ (1/C_1) Σ_{t=1}^T (γ_t²σ²/m_t) exp{ (e_t² + ê_t²)/(σ/√m_t)² } ] ≤ exp{2}.

Applying Markov's inequality, we obtain for all Λ > 0:

    Prob( Σ_{t=1}^T γ_t²(e_t² + ê_t²) ≥ (2 + Λ)C_1 ) ≤ exp{−Λ}.    (12)

Let ζ_t = <Δ_t, û^t − ũ^t>. We showed earlier that E[ζ_t] = 0. Since ||û^t − ũ^t|| ≤ 2√(2Θ[X]), we also have

    E[exp{ ζ_t² / (2√(2Θ[X]) σ/√m_t)² }] ≤ exp{1}.

Applying the relation exp{x} ≤ x + exp{9x²/16}, one has for any s ≥ 0

    E[ exp{ s Σ_{t=1}^T γ_t ζ_t } ] ≤ E[ exp{ (9s²/16) Σ_{t=1}^T γ_t² ζ_t² } ] ≤ exp{ (9s²/2) σ²Θ[X] Σ_{t=1}^T γ_t²/m_t }.

By Markov's inequality, one has for all Λ > 0:

    Prob( Σ_{t=1}^T γ_t ζ_t ≥ 3Λ sqrt( σ²Θ[X] Σ_{t=1}^T γ_t²/m_t ) ) ≤ exp{−Λ²/2}.    (13)

Combining (10), (12), and (13), we arrive at: for all Λ > 0,

    Prob( Res(X | I_T, λ^T) ≥ M_0(T) + Λ M_1(T) ) ≤ exp{−Λ} + exp{−Λ²/2},

where

    M_0(T) = (Σ_{t=1}^T γ_t)^{-1} ( 2Θ[X] + Σ_{t=1}^T (7γ_t²/2)[M² + 2σ²/m_t] + Σ_{t=1}^T 2ε_t ),
    M_1(T) = (Σ_{t=1}^T γ_t)^{-1} ( 7 Σ_{t=1}^T γ_t²σ²/m_t + 3 sqrt( σ²Θ[X] Σ_{t=1}^T γ_t²/m_t ) ).

4.3 Proof of Proposition 3.1

Proof. Let us fix T, the number of Mirror-Prox steps. Since M = 0, Theorem 3.1 implies the efficiency estimate

    E[ε_VI(x̄^T | X, F)] ≤ (Σ_{t=1}^T γ_t)^{-1} ( 2Θ[X] + 7 Σ_{t=1}^T γ_t²σ²/m_t + 2 Σ_{t=1}^T ε_t ).

Let us fix ε_t = Θ[X]/T for each t = 1, ..., T; then, from Proposition 3.1 in [3], it takes at most s = O(1)( L_0 D^κ T / Θ[X] )^{1/(κ−1)} calls to the LMO to generate a point with inner accuracy at most ε_t. Moreover, with m_t = O(γ_t² T σ²/Θ[X]) we have

    E[ε_VI(x̄^T | X, F)] ≤ O(1) L Θ[X] / T.

Therefore, to ensure E[ε_VI(x̄^T | X, F)] ≤ ε for a given accuracy ε > 0, the number of Mirror-Prox steps is at most T = O( L Θ[X] / ε ). Hence, the number of stochastic oracle calls used is at most

    N_SO = Σ_{t=1}^T 2m_t = Σ_{t=1}^T O(γ_t² T σ²/Θ[X]) = O(1) σ²Θ[X]/ε².

Moreover, the number of linear minimization oracle calls on X_2 needed is at most

    N_LMO = T s = O(1) ( L_0 L^κ D^κ / ε^κ )^{1/(κ−1)} Θ[X].

In particular, if κ = 2 and L_0 = 1, this becomes N_LMO = O(1) L²D²Θ[X]/ε².

4.4 Proof of Proposition 3.2

Proof. Let us fix T, the number of Mirror-Prox steps, and set ε_t = Θ[X]/T for each t = 1, ..., T. Since m_t = O(1)σ²/M², Theorem 3.1 implies the efficiency estimate

    E[ε_VI(x̄^T | X, F)] ≤ O(1) sqrt(Θ[X]) M / sqrt(T).

Therefore, to ensure E[ε_VI(x̄^T | X, F)] ≤ ε for a given accuracy ε > 0, the number of Mirror-Prox steps is at most T = O( M²Θ[X]/ε² ). Hence, the number of stochastic oracle calls used is at most

    N_SO = Σ_{t=1}^T 2m_t = T · O(1)σ²/M² = O(1) σ²Θ[X]/ε².

From Proposition 3.1 in [3], it takes at most s = O(1) D²T/Θ[X] calls to the LMO to generate a point with inner accuracy at most ε_t. Hence, the number of linear minimization oracle calls on X_2 needed is at most

    N_LMO = T s = O(1) D²T²/Θ[X] = O(1) D²M⁴Θ[X]/ε⁴.
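As a numerical sanity check on the role of mini-batching in the bounds above (the σ²/m_t variance terms), the following standalone sketch with a toy Gaussian oracle (not the paper's operator; all constants are made up) verifies that averaging m oracle samples divides the mean squared error by m:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, m, trials = 2.0, 64, 20000
F = np.array([1.0, -1.0])                 # stand-in for the true value F_u(u)

def oracle(size):
    """Toy stochastic oracle: g = F + noise, with E[g] = F and
    E||g - F||^2 = sigma^2 (variance split over the two coordinates)."""
    return F + (sigma / np.sqrt(2)) * rng.standard_normal((size, 2))

# Averaging an m-sample mini-batch should divide the variance by m.
batch_means = oracle(m * trials).reshape(trials, m, 2).mean(axis=1)
emp_var = np.mean(np.sum((batch_means - F) ** 2, axis=1))
# emp_var should be close to sigma^2 / m.
```

This is exactly the mechanism that lets the batch sizes m_t in Propositions 3.1 and 3.2 trade extra oracle calls for a smaller variance term in the efficiency estimate.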


More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

Block stochastic gradient update method

Block stochastic gradient update method Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Julien Mairal Inria, LEAR Team, Grenoble Journées MAS, Toulouse Julien Mairal Incremental and Stochastic

More information

Tutorial: Mirror Descent Algorithms for Large-Scale Deterministic and Stochastic Convex Optimization

Tutorial: Mirror Descent Algorithms for Large-Scale Deterministic and Stochastic Convex Optimization Tutorial: Mirror Descent Algorithms for Large-Scale Deterministic and Stochastic Convex Optimization Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of

More information

consistent learning by composite proximal thresholding

consistent learning by composite proximal thresholding consistent learning by composite proximal thresholding Saverio Salzo Università degli Studi di Genova Optimization in Machine learning, vision and image processing Université Paul Sabatier, Toulouse 6-7

More information

Saddle Point Algorithms for Large-Scale Well-Structured Convex Optimization

Saddle Point Algorithms for Large-Scale Well-Structured Convex Optimization Saddle Point Algorithms for Large-Scale Well-Structured Convex Optimization Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Joint research

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

Stochastic Composition Optimization

Stochastic Composition Optimization Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators

More information

WE consider an undirected, connected network of n

WE consider an undirected, connected network of n On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

ALGORITHMS FOR STOCHASTIC OPTIMIZATION WITH EXPECTATION CONSTRAINTS

ALGORITHMS FOR STOCHASTIC OPTIMIZATION WITH EXPECTATION CONSTRAINTS ALGORITHMS FOR STOCHASTIC OPTIMIZATIO WITH EXPECTATIO COSTRAITS GUAGHUI LA AD ZHIIAG ZHOU Abstract This paper considers the problem of minimizing an expectation function over a closed convex set, coupled

More information

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il

More information

Convex Optimization Under Inexact First-order Information. Guanghui Lan

Convex Optimization Under Inexact First-order Information. Guanghui Lan Convex Optimization Under Inexact First-order Information A Thesis Presented to The Academic Faculty by Guanghui Lan In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy H. Milton

More information

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4 Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.

More information

Multi-class classification via proximal mirror descent

Multi-class classification via proximal mirror descent Multi-class classification via proximal mirror descent Daria Reshetova Stanford EE department resh@stanford.edu Abstract We consider the problem of multi-class classification and a stochastic optimization

More information

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University SIAM Conference on Imaging Science, Bologna, Italy, 2018 Adaptive FISTA Peter Ochs Saarland University 07.06.2018 joint work with Thomas Pock, TU Graz, Austria c 2018 Peter Ochs Adaptive FISTA 1 / 16 Some

More information

How hard is this function to optimize?

How hard is this function to optimize? How hard is this function to optimize? John Duchi Based on joint work with Sabyasachi Chatterjee, John Lafferty, Yuancheng Zhu Stanford University West Coast Optimization Rumble October 2016 Problem minimize

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

arxiv: v1 [math.oc] 10 Oct 2018

arxiv: v1 [math.oc] 10 Oct 2018 8 Frank-Wolfe Method is Automatically Adaptive to Error Bound ondition arxiv:80.04765v [math.o] 0 Oct 08 Yi Xu yi-xu@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu Department of omputer Science, The University

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Primal-dual first-order methods with O(1/ɛ) iteration-complexity for cone programming

Primal-dual first-order methods with O(1/ɛ) iteration-complexity for cone programming Math. Program., Ser. A (2011) 126:1 29 DOI 10.1007/s10107-008-0261-6 FULL LENGTH PAPER Primal-dual first-order methods with O(1/ɛ) iteration-complexity for cone programming Guanghui Lan Zhaosong Lu Renato

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

Provable Non-Convex Min-Max Optimization

Provable Non-Convex Min-Max Optimization Provable Non-Convex Min-Max Optimization Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang Department of Computer Science, The University of Iowa, Iowa City, IA, 52242 Department of Mathematics, The

More information

Perturbed Proximal Gradient Algorithm

Perturbed Proximal Gradient Algorithm Perturbed Proximal Gradient Algorithm Gersende FORT LTCI, CNRS, Telecom ParisTech Université Paris-Saclay, 75013, Paris, France Large-scale inverse problems and optimization Applications to image processing

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Stochastic Subgradient Method

Stochastic Subgradient Method Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)

More information

Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Fast Stochastic Maximization with O1/n)-Convergence Rate Mingrui Liu 1 Xiaoxuan Zhang 1 Zaiyi Chen Xiaoyu Wang 3 ianbao Yang 1 Abstract In this paper we consider statistical learning with area under ROC

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems O. Kolossoski R. D. C. Monteiro September 18, 2015 (Revised: September 28, 2016) Abstract

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Stochastic model-based minimization under high-order growth

Stochastic model-based minimization under high-order growth Stochastic model-based minimization under high-order growth Damek Davis Dmitriy Drusvyatskiy Kellie J. MacPhee Abstract Given a nonsmooth, nonconvex minimization problem, we consider algorithms that iteratively

More information

About Split Proximal Algorithms for the Q-Lasso

About Split Proximal Algorithms for the Q-Lasso Thai Journal of Mathematics Volume 5 (207) Number : 7 http://thaijmath.in.cmu.ac.th ISSN 686-0209 About Split Proximal Algorithms for the Q-Lasso Abdellatif Moudafi Aix Marseille Université, CNRS-L.S.I.S

More information

6 First-Order Methods for Nonsmooth Convex Large-Scale Optimization, II: Utilizing Problem s Structure

6 First-Order Methods for Nonsmooth Convex Large-Scale Optimization, II: Utilizing Problem s Structure 6 First-Order Methods for Nonsmooth Convex Large-Scale Optimization, II: Utilizing Problem s Structure Anatoli Juditsky Anatoli.Juditsky@imag.fr Laboratoire Jean Kuntzmann, Université J. Fourier B. P.

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems Optimization Methods and Software ISSN: 1055-6788 (Print) 1029-4937 (Online) Journal homepage: http://www.tandfonline.com/loi/goms20 An accelerated non-euclidean hybrid proximal extragradient-type algorithm

More information

Stability of optimization problems with stochastic dominance constraints

Stability of optimization problems with stochastic dominance constraints Stability of optimization problems with stochastic dominance constraints D. Dentcheva and W. Römisch Stevens Institute of Technology, Hoboken Humboldt-University Berlin www.math.hu-berlin.de/~romisch SIAM

More information

Dual Proximal Gradient Method

Dual Proximal Gradient Method Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

Stochastic Gradient Descent with Only One Projection

Stochastic Gradient Descent with Only One Projection Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine

More information

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation Zhouchen Lin Peking University April 22, 2018 Too Many Opt. Problems! Too Many Opt. Algorithms! Zero-th order algorithms:

More information

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22

More information

Universal Gradient Methods for Convex Optimization Problems

Universal Gradient Methods for Convex Optimization Problems CORE DISCUSSION PAPER 203/26 Universal Gradient Methods for Convex Optimization Problems Yu. Nesterov April 8, 203; revised June 2, 203 Abstract In this paper, we present new methods for black-box convex

More information

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

More information

Lecture 6: Conic Optimization September 8

Lecture 6: Conic Optimization September 8 IE 598: Big Data Optimization Fall 2016 Lecture 6: Conic Optimization September 8 Lecturer: Niao He Scriber: Juan Xu Overview In this lecture, we finish up our previous discussion on optimality conditions

More information

The basic problem of interest in this paper is the convex programming (CP) problem given by

The basic problem of interest in this paper is the convex programming (CP) problem given by Noname manuscript No. (will be inserted by the editor) An optimal randomized incremental gradient method Guanghui Lan Yi Zhou the date of receipt and acceptance should be inserted later arxiv:1507.02000v3

More information

arxiv: v1 [math.oc] 18 Mar 2016

arxiv: v1 [math.oc] 18 Mar 2016 Katyusha: Accelerated Variance Reduction for Faster SGD Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University arxiv:1603.05953v1 [math.oc] 18 Mar 016 March 18, 016 Abstract We consider minimizing

More information

Primal-dual subgradient methods for convex problems

Primal-dual subgradient methods for convex problems Primal-dual subgradient methods for convex problems Yu. Nesterov March 2002, September 2005 (after revision) Abstract In this paper we present a new approach for constructing subgradient schemes for different

More information

Online Optimization : Competing with Dynamic Comparators

Online Optimization : Competing with Dynamic Comparators Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online

More information

Accelerate Subgradient Methods

Accelerate Subgradient Methods Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods

More information

Finite-sum Composition Optimization via Variance Reduced Gradient Descent

Finite-sum Composition Optimization via Variance Reduced Gradient Descent Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu

More information

Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization

Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization Zaid Harchaoui Anatoli Juditsky Arkadi Nemirovski April 12, 2013 Abstract Motivated by some applications in signal processing

More information

Lecture 8 Plus properties, merit functions and gap functions. September 28, 2008

Lecture 8 Plus properties, merit functions and gap functions. September 28, 2008 Lecture 8 Plus properties, merit functions and gap functions September 28, 2008 Outline Plus-properties and F-uniqueness Equation reformulations of VI/CPs Merit functions Gap merit functions FP-I book:

More information

Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué

Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué Clément Royer (Université du Wisconsin-Madison, États-Unis) Toulouse, 8 janvier 2019 Nonconvex optimization via Newton-CG

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

Convergence rate of inexact proximal point methods with relative error criteria for convex optimization

Convergence rate of inexact proximal point methods with relative error criteria for convex optimization Convergence rate of inexact proximal point methods with relative error criteria for convex optimization Renato D. C. Monteiro B. F. Svaiter August, 010 Revised: December 1, 011) Abstract In this paper,

More information