Stochastic model-based minimization under high-order growth

Damek Davis, Dmitriy Drusvyatskiy, Kellie J. MacPhee

Abstract

Given a nonsmooth, nonconvex minimization problem, we consider algorithms that iteratively sample and minimize stochastic convex models of the objective function. Assuming that the one-sided approximation quality and the variation of the models are controlled by a Bregman divergence, we show that the scheme drives a natural stationarity measure to zero at the rate O(k^{-1/4}). Under additional convexity and relative strong convexity assumptions, the function values converge to the minimum at the rates O(k^{-1/2}) and Õ(k^{-1}), respectively. We discuss consequences for stochastic proximal point, mirror descent, regularized Gauss-Newton, and saddle point algorithms.

1 Introduction

Common stochastic optimization algorithms proceed as follows. Given an iterate x_t, the method samples a model of the objective function formed at x_t and declares the next iterate to be a minimizer of the model regularized by a proximal term. Stochastic proximal point, proximal subgradient, and Gauss-Newton type methods are common examples. Let us formalize this viewpoint, following [5]. Namely, consider the optimization problem

    min_{x ∈ R^d} F(x) := f(x) + r(x),    (1.1)

where the function r : R^d → R ∪ {∞} is closed and convex, and the only access to f : R^d → R is by sampling a stochastic one-sided model. That is, for every point x, there exists a family of models f_x(·, ξ) of f, indexed by a random variable ξ ∼ P. This setup immediately motivates the following algorithm, analyzed in [5]:

    Sample ξ_t ∼ P,
    Set x_{t+1} = argmin_x { f_{x_t}(x, ξ_t) + r(x) + (1/(2η_t)) ‖x − x_t‖₂² },    (1.2)

School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14850; people.orie.cornell.edu/dsd95/. Department of Mathematics, University of Washington, Seattle, WA 98195; sites.math.washington.edu/~ddrusv. Research of Drusvyatskiy was partially supported by the AFOSR YIP award FA and by the NSF DMS 6585 and CCF awards. Department of Mathematics, University of Washington, Seattle, WA 98195; sites.math.washington.edu/~kmacphee.

where η_t > 0 is an appropriate control sequence that governs the step-size of the algorithm. Some thought shows that convergence guarantees of the method (1.2) should rely on at least two factors: (i) control over the approximation quality, f_x(·, ξ) − f(·), and (ii) growth/stability properties of the individual models f_x(·, ξ). With this in mind, the paper [5] isolates the following assumptions:

    E_ξ[f_x(x, ξ)] = f(x)  and  E_ξ[f_x(y, ξ) − f(y)] ≤ (τ/2)‖y − x‖₂²  ∀x, y,    (1.3)

and there exists a square integrable function L(·) satisfying

    f_x(x, ξ) − f_x(y, ξ) ≤ L(ξ)‖x − y‖₂  ∀x, y.    (1.4)

Condition (1.3) simply says that, in expectation, the model f_x(·, ξ) must globally lower bound f(·) up to a quadratic error, while agreeing with f at the base point x; when (1.3) holds, the paper [5] calls the assignment (x, y, ξ) → f_x(y, ξ) a stochastic one-sided model of f. Property (1.4), in contrast, asserts a Lipschitz-type property of the individual models f_x(·, ξ). The main result of [5] shows that under these assumptions, the scheme (1.2) drives a natural stationarity measure of the problem to zero at the rate O(k^{-1/4}). Indeed, the stationarity measure is simply the gradient of the Moreau envelope

    F_λ(x) := inf_y { F(y) + (1/(2λ))‖y − x‖₂² },    (1.5)

where λ > 0 is a smoothing parameter on the order of 1/τ. The assumptions (1.3) and (1.4) are perfectly aligned with existing literature. Indeed, common first-order algorithms rely on global Lipschitz continuity of the objective function or of its gradient; see for example the monographs [5,3,33]. Recent work [2,8,26,29,30], in contrast, has emphasized that global Lipschitz assumptions can easily fail for well-structured problems. Nonetheless, these papers show that it is indeed possible to develop efficient algorithms even without the global Lipschitz assumption. The key idea, originating in [2,29,30], is to model errors in approximation by a Bregman divergence, instead of a norm.
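To make the one-sided accuracy condition (1.3) concrete, here is a small numerical check, not from the paper and with an illustrative choice of f: for a convex function, the subgradient model f_x(y) = f(x) + ⟨g, y − x⟩ with g ∈ ∂f(x) lower-bounds f everywhere and agrees with f at the base point, so (1.3) holds with τ = 0.

```python
# Numerical check of the one-sided model property (1.3) for the
# illustrative convex function f(x) = |x| with subgradient models
# f_x(y) = f(x) + g*(y - x), g in the subdifferential of f at x.
# For convex f this model lower-bounds f everywhere, i.e. tau = 0.

def f(x):
    return abs(x)

def subgrad(x):
    # A measurable selection of the subdifferential of |x|.
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def model(base, y):
    return f(base) + subgrad(base) * (y - base)

grid = [i / 100.0 for i in range(-300, 301)]
worst_gap = min(f(y) - model(base, y) for base in grid for y in grid)

# One-sided accuracy: the model never overshoots f (up to float noise),
# and it agrees with f exactly at the base point.
assert worst_gap >= -1e-12
assert all(abs(model(b, b) - f(b)) < 1e-12 for b in grid)
```

For a τ-weakly convex f, the same check would succeed only after relaxing the lower bound by the quadratic slack (τ/2)‖y − x‖₂².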
The ability to deal with problems that are not globally Lipschitz is especially important in stochastic nonconvex settings, where line-search strategies that exploit local Lipschitz continuity are not well-developed. Motivated by the recent work on relative continuity/smoothness [2,29,30], we extend the results of [5] to non-globally Lipschitzian settings. Formally, we simply replace the squared norm ‖·‖₂² in the displayed equations (1.2)-(1.5) by a Bregman divergence

    D_Φ(y, x) = Φ(y) − Φ(x) − ⟨∇Φ(x), y − x⟩,

generated by a Legendre function Φ. With this modification and under mild technical conditions, we will show that algorithm (1.2) drives the gradient of the Bregman envelope (1.5) to zero at the rate O(k^{-1/4}), where the size of the gradient is measured in the local norm induced by Φ. As a consequence, we obtain new convergence guarantees for stochastic proximal

¹The stated assumption (A4) in [5] is stronger than (1.4); however, a quick look at the arguments shows that property (1.4) suffices to obtain essentially the same convergence guarantees.

point, mirror descent², and regularized Gauss-Newton methods, as well as for an elementary algorithm for stochastic saddle point problems. Perhaps the most important application arena is when the functional components of the problem grow at a polynomial rate. In this setting, we present a simple Legendre function Φ that satisfies the necessary assumptions for the convergence guarantees to take hold. We also note that the stochastic mirror descent algorithm that we present here does not require mini-batching the gradients, in contrast to the previous seminal work [24]. When the stochastic models f_x(·, ξ) are themselves convex and globally under-estimate f in expectation, we prove that the scheme drives the expected functional error to zero at the rate O(k^{-1/2}). The rate improves to Õ(k^{-1}) when the regularizer r is µ-strongly convex relative to Φ in the sense of [30]. In the special case of mirror descent, these guarantees extend the results for convex unconstrained problems in [29] to the proximal setting. Even specializing to the proximal subgradient method, the convergence guarantees appear to be different from those available in the literature. Namely, previous complexity estimates [7,20] depend on the largest norms of the subgradients of r along the iterate sequence, whereas Theorems 7.2 and 7.4 replace this dependence by the initial error r(x_0) − inf r.

The outline of the manuscript is as follows. Section 2 reviews the relevant concepts of convex analysis, focusing on Legendre functions and the Bregman divergence. Section 3 introduces the problem class and the algorithmic framework. This section also interprets the assumptions made for the stochastic proximal point, mirror descent, and regularized Gauss-Newton methods, as well as for a stochastic approximation algorithm for saddle point problems. Section 4 discusses the stationarity measure we use to quantify the rate of convergence.
Section 5 contains the complete convergence analysis of the stochastic model-based algorithm. Section 6 presents a specialized analysis for the mirror descent algorithm when f is smooth and the stochastic gradient oracle has finite variance. Finally, in Section 7 we prove convergence rates in terms of function values for stochastic model-based algorithms under (relative) strong convexity assumptions.

2 Legendre functions and the Bregman divergence

Throughout, we follow standard notation from convex analysis, as set out for example by Rockafellar [37]. The symbol R^d will denote a Euclidean space with inner product ⟨·,·⟩ and the induced norm ‖x‖₂ = √⟨x, x⟩. For any set Q ⊂ R^d, we let int Q and cl Q denote the interior and closure of Q, respectively. Whenever Q is convex, the set ri Q is the interior of Q relative to its affine hull. The effective domain of any function f : R^d → R ∪ {∞}, denoted by dom f, consists of all points where f is finite. Abusing notation slightly, we will use the symbol dom(∇f) to denote the set of all points where f is differentiable.

This work analyzes stochastic model-based minimization algorithms, where the errors are controlled by a Bregman divergence. For wider uses of the Bregman divergence in first-order methods, we refer the interested reader to the expository articles of Bubeck [0], Juditsky-Nemirovski [27], and Teboulle [40].

²This work appeared on arXiv a month after a preprint of Zhang and He [42], who provide similar convergence guarantees specifically for the stochastic mirror descent algorithm. The results of the two papers were obtained independently and are complementary to each other.

Henceforth, we fix a Legendre function Φ : R^d → R ∪ {∞}, meaning:

1. (Convexity) Φ is proper, closed, and strictly convex.
2. (Essential smoothness) The domain of Φ has nonempty interior, Φ is differentiable on int(dom Φ), and for any sequence {x_k} ⊂ int(dom Φ) converging to a boundary point of dom Φ, it must be the case that ‖∇Φ(x_k)‖ → ∞.

Typical examples of Legendre functions are the squared Euclidean norm Φ(x) = ½‖x‖₂², the Shannon entropy Φ(x) = Σ_{i=1}^d x_i log(x_i) with dom Φ = R^d_+, and the Burg function Φ(x) = −Σ_{i=1}^d log(x_i) with dom Φ = R^d_{++}. For more examples, we refer the reader to the articles [1,3,22,39] and the recent survey [40]. We will often use the observation that the subdifferential of a Legendre function Φ is empty on the boundary of its domain [37, Theorem 26.1]:

    ∂Φ(x) = ∅  for all x ∉ int(dom Φ).

The Legendre function Φ induces the Bregman divergence

    D_Φ(y, x) := Φ(y) − Φ(x) − ⟨∇Φ(x), y − x⟩,  for all x ∈ int(dom Φ), y ∈ dom Φ.

Notice that since Φ is strictly convex, the equality D_Φ(y, x) = 0 holds for some x, y ∈ int(dom Φ) if and only if y = x. Analysis of algorithms based on the Bregman divergence typically relies on the following three point inequality; see e.g. [4, Property 1].

Lemma 2.1 (Three point inequality). Consider a closed convex function g : R^d → R ∪ {+∞} satisfying ri(dom g) ⊂ int(dom Φ). Then for any point z ∈ int(dom Φ), any minimizer z⁺ of the problem

    min_x  g(x) + D_Φ(x, z)

lies in int(dom Φ), is unique, and satisfies the inequality:

    g(x) + D_Φ(x, z) ≥ g(z⁺) + D_Φ(z⁺, z) + D_Φ(x, z⁺)  ∀x ∈ dom Φ.

Recall that a function f : R^d → R ∪ {∞} is called ρ-weakly convex if the perturbed function f + (ρ/2)‖·‖₂² is convex [34]. By analogy, we will say that f is ρ-weakly convex relative to Φ if the perturbed function f + ρΦ is convex. This notion is closely related to the relative smoothness condition introduced in [2,30]. Relative weak convexity, like its classical counterpart, can be characterized through generalized derivatives.
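The divergences generated by the classical Legendre functions above have familiar closed forms: the squared Euclidean norm yields half the squared distance, while the Shannon entropy yields the (generalized) Kullback-Leibler divergence. A quick numerical illustration, with arbitrarily chosen coordinates:

```python
import math

def bregman(phi, grad_phi, y, x):
    # D_Phi(y, x) = Phi(y) - Phi(x) - <grad Phi(x), y - x>
    return phi(y) - phi(x) - sum(g * (yi - xi)
                                 for g, yi, xi in zip(grad_phi(x), y, x))

# Phi(x) = (1/2)||x||_2^2  ->  D_Phi(y, x) = (1/2)||y - x||_2^2
sq = lambda x: 0.5 * sum(xi * xi for xi in x)
sq_grad = lambda x: list(x)

# Shannon entropy Phi(x) = sum_i x_i log x_i  ->  generalized KL divergence
ent = lambda x: sum(xi * math.log(xi) for xi in x)
ent_grad = lambda x: [math.log(xi) + 1.0 for xi in x]

x, y = [0.2, 0.5], [0.4, 0.1]

d_sq = bregman(sq, sq_grad, y, x)
assert abs(d_sq - 0.5 * (0.2 ** 2 + 0.4 ** 2)) < 1e-12   # = 0.1

d_ent = bregman(ent, ent_grad, y, x)
kl = sum(yi * math.log(yi / xi) - yi + xi for yi, xi in zip(y, x))
assert abs(d_ent - kl) < 1e-12

# Strict convexity: D_Phi(y, x) > 0 whenever y != x.
assert d_sq > 0 and d_ent > 0
```

Note the asymmetry of the entropy divergence, which is why the order of arguments D_Φ(y, x) matters throughout the analysis.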
Recall that the Fréchet subdifferential of a function f at a point x ∈ dom f, denoted ∂̂f(x), consists of all vectors v ∈ R^d satisfying

    f(y) ≥ f(x) + ⟨v, y − x⟩ + o(‖y − x‖)  as y → x.

The limiting subdifferential of f at x, denoted ∂f(x), consists of all vectors v ∈ R^d such that there exist sequences x_k ∈ R^d and v_k ∈ ∂̂f(x_k) satisfying (x_k, f(x_k), v_k) → (x, f(x), v).

Lemma 2.2 (Subdifferential characterization). The following are equivalent for any locally Lipschitz function f : R^d → R.

1. The function f is ρ-weakly convex relative to Φ.
2. For any x ∈ int(dom Φ), y ∈ dom Φ, and any v ∈ ∂̂f(x), the inequality holds:

    f(y) ≥ f(x) + ⟨v, y − x⟩ − ρ D_Φ(y, x).    (2.1)

3. For any x ∈ int(dom Φ) ∩ dom(∇f) and any y ∈ dom Φ, the inequality holds:

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ − ρ D_Φ(y, x).    (2.2)

If f and Φ are C²-smooth on int(dom Φ), then the three properties above are all equivalent to

    ∇²f(x) ⪰ −ρ ∇²Φ(x)  ∀x ∈ int(dom Φ).    (2.3)

Proof. Define the perturbed function g := f + ρΦ. We prove the implications 1 ⇒ 2 ⇒ 3 ⇒ 1 in order. To this end, suppose 1 holds. Since g is convex, the subgradient inequality holds:

    g(y) ≥ g(x) + ⟨w, y − x⟩  for all x, y ∈ R^d, w ∈ ∂g(x).    (2.4)

Taking into account that Φ is differentiable on int(dom Φ), we deduce ∂̂g(x) = ∂̂f(x) + ρ∇Φ(x) for all x ∈ int(dom Φ); see e.g. [38, Exercise 8.8]. Rewriting (2.4) with this in mind immediately yields 2. The implication 2 ⇒ 3 is immediate since ∂̂f(x) = {∇f(x)} whenever f is differentiable at x. Suppose now that 3 holds. Fix an arbitrary point x ∈ int(dom Φ) ∩ dom(∇f). Algebraic manipulation of inequality (2.2) yields the equivalent description

    g(y) ≥ g(x) + ⟨∇f(x) + ρ∇Φ(x), y − x⟩  for all y ∈ dom Φ.    (2.5)

It follows that the vector ∇f(x) + ρ∇Φ(x) lies in the convex subdifferential of g at x. Since f is locally Lipschitz continuous, Rademacher's theorem shows that dom(∇f) has full measure in R^d. In particular, we deduce from (2.5) that the convex subdifferential of g is nonempty on a dense subset of int(dom g). Taking limits, it quickly follows that the convex subdifferential of g is nonempty at every point x ∈ int(dom g). Using [9, Exercise 3.1.2(a)], we conclude that g is convex on int(dom g). Moreover, appealing to the sum rule [38, Exercise 10.10], we deduce that ∂g(x) = ∅ for all x ∉ int(dom Φ), since ∂Φ(x) = ∅ for all x ∉ int(dom Φ). Therefore ∂g is a globally monotone map. Appealing to [38, Theorem 12.17], we conclude that g is a convex function. Thus item 1 holds. This completes the proof of the equivalences 1 ⇔ 2 ⇔ 3.

Finally, suppose that f and Φ are C²-smooth on int(dom Φ). Clearly, if f is ρ-weakly convex relative to Φ, then the second-order characterization of convexity of the function g = f + ρΦ directly implies (2.3). Conversely, (2.3) immediately implies that g is convex on the interior of its domain. The same argument using [38, Theorem 12.17], as in the implication 3 ⇒ 1, shows that g is convex on all of R^d.

Notice that the setup so far has not relied on any predefined norm. Let us for the moment make the common assumption that Φ is 1-strongly convex relative to some norm ‖·‖ on R^d, which implies

    D_Φ(y, x) ≥ ½‖y − x‖².    (2.6)

Then using Lemma 2.2, we deduce that to check that f is ρ-weakly convex relative to Φ, it suffices to verify the inequality

    f(y) ≥ f(x) + ⟨v, y − x⟩ − (ρ/2)‖y − x‖²  for all x, y ∈ dom Φ, v ∈ ∂f(x).

Recall that a function f : R^d → R is called ρ-smooth if it satisfies

    ‖∇f(y) − ∇f(x)‖* ≤ ρ‖y − x‖  for all x, y ∈ R^d,

where ‖·‖* is the dual norm. Thus any ρ-smooth function f is automatically ρ-weakly convex relative to Φ: the standard descent inequality f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ − (ρ/2)‖y − x‖², combined with (2.6), verifies the criterion above. Our main result will not require Φ to be 1-strongly convex; however, we will impose this assumption in Section 6, where we augment our guarantees for the stochastic mirror descent algorithm under a differentiability assumption.

3 The problem class and the algorithm

We are now ready to introduce the problem class considered in this paper. We will be interested in the optimization problem

    min_x F(x) := f(x) + r(x),    (3.1)

where f : R^d → R is a locally Lipschitz function, r : R^d → R ∪ {∞} is a closed function having a convex domain, and Φ : R^d → R ∪ {∞} is some Legendre function satisfying the compatibility conditions:

    ri(dom r) ⊂ int(dom Φ)  and  ∂(r + Φ)(x) = ∅ for all x ∉ int(dom Φ).    (3.2)

The first two items are standard and mild. The third stipulates that r must be compatible with Φ. In particular, the inclusion ri(dom r) ⊂ int(dom Φ) automatically implies (3.2) whenever r is convex [37, Theorem 23.8], or more generally whenever a standard qualification condition holds.³ To simplify notation, henceforth set U := int(dom Φ).

³Qualification condition: ∂^∞r(x) ∩ (−N_{dom Φ}(x)) = {0} for all x ∈ dom r ∩ dom Φ; see [38, Proposition 8.2, Corollary 10.9].

3.1 Assumptions and the Algorithm

We now specify the model-based algorithms we will analyze. Fix a probability space (Ω, F, P) and equip R^d with the Borel σ-algebra. To each point x ∈ dom f and each random element ξ ∈ Ω, we associate a stochastic one-sided model f_x(·, ξ) of the function f. Namely, we assume that there exist τ, ρ, L > 0 satisfying the following properties.

(A1) (Sampling) It is possible to generate i.i.d. realizations ξ_1, ..., ξ_T ∼ P.

(A2) (One-sided accuracy) There is a measurable function (x, y, ξ) → f_x(y, ξ), defined on U × U × Ω, satisfying both

    E_ξ[f_x(x, ξ)] = f(x)  ∀x ∈ U ∩ dom r,

and

    E_ξ[f_x(y, ξ) − f(y)] ≤ τ D_Φ(y, x)  ∀x, y ∈ U ∩ dom r.    (3.3)

(A3) (Weak convexity of the models) The functions f_x(·, ξ) + r(·) are ρ-weakly convex relative to Φ for all x ∈ U ∩ dom r and a.e. ξ ∈ Ω.

(A4) (Lipschitzian property) There exists a square integrable function L : Ω → R_+ such that for all x, y ∈ U ∩ dom r, the following inequalities hold:

    f_x(x, ξ) − f_x(y, ξ) ≤ L(ξ)√(D_Φ(y, x)),    (3.4)
    √(E_ξ[L(ξ)²]) ≤ L.

Some comments are in order. Assumption (A1) is standard and is necessary for all sampling-based algorithms. Assumption (A2) specifies the accuracy of the models. That is, we require the model in expectation to agree with f at the basepoint and to globally lower-bound f up to an error controlled by the Bregman divergence. Assumption (A3) is very mild, since in most practical circumstances the function f_x(·, ξ) + r(·) is convex, i.e. ρ = 0. The final assumption (A4) controls the order of growth of the individual models f_x(y, ξ) as the argument y moves away from x.

Notice that the assumptions (A1)-(A4) do not involve any norm on R^d. However, when Φ is 1-strongly convex relative to some norm ‖·‖, the properties (3.3) and (3.4) are implied by standard assumptions. Namely, (3.3) holds if the error in the model approximation satisfies

    E_ξ[f_x(y, ξ) − f(y)] ≤ (τ/2)‖y − x‖²  ∀x, y ∈ U.

Similarly, (3.4) will hold as long as for every x ∈ U ∩ dom r and a.e. ξ ∈ Ω the models f_x(·, ξ) are L(ξ)-Lipschitz continuous on U in the norm ‖·‖.

The use of the Bregman divergence allows for much greater flexibility as it can, for example, model higher order growth of the functions in question. To illustrate, let us look at the following example, where the Lipschitz constant L(ξ) of the models f_x(·, ξ) is bounded by a polynomial.
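The polynomial construction of Example 3.1 and Proposition 3.2 below admits a quick numerical sanity check. The sketch is not from the paper: the coefficients and test points are arbitrary, and the formula for Φ is as reconstructed from the statement of Proposition 3.2 (coefficients a_i(3i+7)/(i+2) on the powers ‖x‖₂^{i+2}).

```python
import math

a = [1.0, 0.0, 1.0]          # illustrative coefficients: p(u) = 1 + u^2

def p(u):
    return sum(ai * u ** i for i, ai in enumerate(a))

def phi(x):
    # Reconstructed Legendre function of Proposition 3.2:
    #   Phi(x) = sum_i a_i (3i+7)/(i+2) ||x||_2^{i+2}
    nx = math.hypot(*x)
    return sum(ai * (3 * i + 7) / (i + 2) * nx ** (i + 2)
               for i, ai in enumerate(a))

def grad_phi(x):
    # grad ||x||^{i+2} = (i+2) ||x||^i x, so the (i+2) factors cancel.
    nx = math.hypot(*x)
    scale = sum(ai * (3 * i + 7) * nx ** i for i, ai in enumerate(a))
    return [scale * xi for xi in x]

def D(y, x):
    return phi(y) - phi(x) - sum(g * (yi - xi)
                                 for g, yi, xi in zip(grad_phi(x), y, x))

# Spot-check the claimed lower bound
#   D_Phi(y, x) >= (p(||x||) + p(||y||))/2 * ||x - y||^2
pts = [(-1.5, 0.3), (0.2, 0.8), (2.0, -1.0), (0.0, 0.0), (-0.7, -2.2)]
for x in pts:
    for y in pts:
        nx, ny = math.hypot(*x), math.hypot(*y)
        dist2 = sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        assert D(y, x) >= 0.5 * (p(nx) + p(ny)) * dist2 - 1e-9
```

A finite spot-check of course proves nothing; it merely illustrates how generously the divergence dominates the polynomial-weighted squared distance at moderate scales.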

Example 3.1 (Bregman divergence under polynomial growth). Consider a degree n univariate polynomial

    p(u) = Σ_{i=0}^n a_i u^i,

with coefficients a_i ≥ 0. Suppose now that the one-sided Lipschitz constants of the models satisfy the growth property:

    (f_x(x, ξ) − f_x(y, ξ))/‖x − y‖₂ ≤ L(ξ)√((p(‖x‖₂) + p(‖y‖₂))/2)  for all distinct x, y ∈ R^d.

Motivated by [29, Proposition 5.1], the following proposition constructs a Bregman divergence that is well-adapted to the polynomial p(·). We defer its proof to Appendix A.1. In particular, with the choice of the Legendre function Φ in (3.5), the required estimate (3.4) holds.

Proposition 3.2. Define the convex function

    Φ(x) = Σ_{i=0}^n (a_i(3i + 7)/(i + 2)) ‖x‖₂^{i+2}.    (3.5)

Then for all x, y ∈ R^d, we have

    D_Φ(y, x) ≥ ((p(‖x‖₂) + p(‖y‖₂))/2) ‖x − y‖₂²,

and therefore the estimate (3.4) holds.

The final ingredient we need before stating the algorithm is an estimate on the weak convexity constant of F. The following simple lemma shows that Assumptions (A2) and (A3) imply that F itself is (τ + ρ)-weakly convex relative to Φ.

Lemma 3.3. The function F is (τ + ρ)-weakly convex relative to Φ.

Proof. We first show that the function g := F + (ρ + τ)Φ is convex on ri(dom F). To this end, fix arbitrary points x, y ∈ ri(dom g), and note the equality ri(dom g) = U ∩ ri(dom r) [37, Theorem 6.5]. Choose μ ∈ (0, 1) and set x̄ = μx + (1 − μ)y. Taking into account (A3), we deduce

    g(x̄) = f(x̄) + r(x̄) + (ρ + τ)Φ(x̄)
         = E_ξ[f_x̄(x̄, ξ) + r(x̄) + ρΦ(x̄)] + τΦ(x̄)
         ≤ E_ξ[μ(f_x̄(x, ξ) + r(x) + ρΦ(x)) + (1 − μ)(f_x̄(y, ξ) + r(y) + ρΦ(y))] + τΦ(x̄)
         = μE_ξ[f_x̄(x, ξ) + r(x)] + (1 − μ)E_ξ[f_x̄(y, ξ) + r(y)] + τΦ(x̄) + μρΦ(x) + (1 − μ)ρΦ(y)
         = μE_ξ[f_x̄(x, ξ) + r(x) − τD_Φ(x, x̄)] + (1 − μ)E_ξ[f_x̄(y, ξ) + r(y) − τD_Φ(y, x̄)]
           + μτ(Φ(x̄) + D_Φ(x, x̄)) + (1 − μ)τ(Φ(x̄) + D_Φ(y, x̄)) + μρΦ(x) + (1 − μ)ρΦ(y).    (3.6)

Now observe

    Φ(x̄) + D_Φ(x, x̄) = Φ(x) − (1 − μ)⟨∇Φ(x̄), x − y⟩,

and similarly

    Φ(x̄) + D_Φ(y, x̄) = Φ(y) − μ⟨∇Φ(x̄), y − x⟩.

Hence algebraic manipulation of the two equalities above yields the expression

    μτ(Φ(x̄) + D_Φ(x, x̄)) + (1 − μ)τ(Φ(x̄) + D_Φ(y, x̄)) = μτΦ(x) + (1 − μ)τΦ(y).

Continuing with (3.6), and using (A2), we obtain

    g(x̄) ≤ μ(f(x) + r(x)) + (1 − μ)(f(y) + r(y)) + μτΦ(x) + (1 − μ)τΦ(y) + μρΦ(x) + (1 − μ)ρΦ(y)
         = μ[f(x) + r(x) + (τ + ρ)Φ(x)] + (1 − μ)[f(y) + r(y) + (τ + ρ)Φ(y)]
         = μg(x) + (1 − μ)g(y).

We have thus verified that g is convex on ri(dom g). Appealing to (3.2) and the sum rule [38, Exercise 10.10], we deduce that the subdifferential ∂g(x) is empty at every point x ∉ ri(dom g), and therefore ∂g is a globally monotone map. Using [38, Theorem 12.17], we conclude that g is a convex function, as needed.

In light of Lemma 3.3, we also make the following additional assumption on the solvability of the Bregman proximal subproblems.

(A5) (Solvability) The convex problems

    min_y { F(y) + (1/λ)D_Φ(y, x) }  and  min_y { f_x(y, ξ) + r(y) + (1/λ)D_Φ(y, x) }

admit a minimizer for any λ < 1/(τ + ρ), any x ∈ U, and a.e. ξ ∈ Ω.⁴ The minimizers vary measurably in (x, ξ) ∈ U × Ω.

Assumption (A5) is very mild. In particular, it holds automatically if (i) Φ is strongly convex with respect to some norm, or if (ii) the functions f_x(·, ξ) + r(·) + ρD_Φ(·, x) and F + (τ + ρ)Φ are bounded from below and Φ has bounded sublevel sets [40, Lemma 2.3].

We are now ready to state the stochastic model-based algorithm we analyze (Algorithm 1).

Algorithm 1: Stochastic Model-Based Minimization
Data: x_0 ∈ U ∩ dom r, a real λ < 1/(τ + ρ), a nonincreasing sequence {η_t}_{t≥0} ⊂ (0, λ), and iteration count T.
Step t = 0, ..., T:
    Sample ξ_t ∼ P,
    Set x_{t+1} = argmin_x { f_{x_t}(x, ξ_t) + r(x) + (1/η_t) D_Φ(x, x_t) }.
Sample t̄ ∈ {0, ..., T} according to the discrete probability distribution P(t̄ = t) ∝ η_t/(1 − η_tρ).
Return x_t̄.

⁴Note the minimizers are automatically unique by Lemma 2.1.
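A minimal runnable sketch of Algorithm 1 in its simplest instance, assuming the Euclidean choice Φ = ½‖·‖₂² and linear (subgradient) models, i.e. a proximal stochastic subgradient method; the objective below is a toy instance chosen for illustration, not from the paper:

```python
import random

random.seed(0)

# Toy problem: f(x) = E_xi[(x - xi)^2] with xi ~ N(1, 0.1),  r(x) = 0.1|x|.
# Models f_{x_t}(x, xi) = (x_t - xi)^2 + 2(x_t - xi)(x - x_t) are convex
# linearizations, so rho = 0, and the prox step has a closed form
# (soft-thresholding) since Phi = 0.5 x^2 makes D_Phi(x, x_t) = 0.5(x - x_t)^2.

def soft_threshold(z, t):
    return max(abs(z) - t, 0.0) * (1.0 if z >= 0 else -1.0)

x = 5.0                                   # x_0
T = 2000
for t in range(T):
    eta = 0.5 / (t + 1) ** 0.5            # nonincreasing stepsize eta_t
    xi = random.gauss(1.0, 0.1)           # sample xi_t ~ P
    g = 2.0 * (x - xi)                    # gradient of the sampled model
    # x_{t+1} = argmin_x { f_{x_t}(x, xi_t) + r(x) + (1/(2 eta))(x - x_t)^2 }
    x = soft_threshold(x - eta * g, 0.1 * eta)

# The minimizer of E[(x - xi)^2] + 0.1|x| is x* = 1 - 0.05 = 0.95.
assert abs(x - 0.95) < 0.1
```

With a Bregman divergence in place of the squared norm, only the last line of the loop changes: the prox subproblem is then solved in the geometry of Φ (e.g. a multiplicative update for the entropy).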

3.2 Examples

Before delving into the convergence analysis of Algorithm 1, in this section we illustrate the algorithmic framework on four examples. In all cases, assumptions (A1) and (A5) are self-explanatory. Therefore, we only focus on verifying (A2)-(A4). For simplicity, we also assume that r(·) is convex in all examples.

Stochastic Bregman-proximal point. Suppose that the models (x, y, ξ) → f_x(y, ξ) satisfy

    E_ξ[f_x(y, ξ)] = f(y)  ∀x, y ∈ U ∩ dom r.

With this choice of the models, Algorithm 1 becomes the stochastic Bregman-proximal point method. Analysis of the deterministic version of the method for convex problems goes back to [3,4,22]. Observe that Assumption (A2) holds trivially. Assumptions (A3) and (A4) should be verified in particular circumstances, depending on how the models are generated. In particular, one can verify Assumption (A4) under polynomial growth of the Lipschitz constant by appealing to Example 3.1.

Stochastic mirror descent. Suppose that the models (x, y, ξ) → f_x(y, ξ) are given by

    f_x(y, ξ) = f(x) + ⟨G(x, ξ), y − x⟩,

for some measurable mapping G : U × Ω → R^d satisfying E_ξ[G(x, ξ)] ∈ ∂f(x) for all x ∈ U ∩ dom r. Algorithm 1 then becomes the stochastic mirror descent algorithm, classically studied in [6,3] in the convex setting and more recently analyzed in [2,29,30] under convexity and relative continuity assumptions. Assumption (A2) simply says that f is τ-weakly convex relative to Φ, while Assumption (A3) holds trivially with ρ = 0. Assumption (A4) is directly implied by the relative continuity condition of Lu [29]. Namely, it suffices to assume that there is a square integrable function L : Ω → R_{++} satisfying

    ‖G(x, ξ)‖* ≤ L(ξ) √(D_Φ(y, x))/‖y − x‖  ∀x, y ∈ U,

where ‖·‖ is an arbitrary norm on R^d and ‖·‖* is the dual norm. We refer to [29] for more details on this condition and examples.

Gauss-Newton method with Bregman regularization. In the next example, suppose that f has the composite form

    f(x) = E_ξ[h(c(x, ξ), ξ)],

for some measurable function h(y, ξ) that is convex in y for a.e.
ξ ∈ Ω, and a measurable map c(x, ξ) that is C¹-smooth in x for a.e. ξ ∈ Ω. We may then use the convex models

    f_x(y, ξ) := h(c(x, ξ) + ∇c(x, ξ)(y − x), ξ),

which automatically satisfy (A3) with ρ = 0. Algorithm 1 then becomes a stochastic Gauss-Newton method with Bregman regularization.

In the Euclidean case Φ = ½‖·‖₂², the method reduces to the stochastic prox-linear algorithm, introduced in [2] and further analyzed in [5]. The deterministic prox-linear method has classical roots, going back at least to [1,23,36], while a more modern complexity-theoretic perspective appears in [2,8,9,28,32]. Even in the deterministic setting, to make progress, one typically assumes that h and c are globally Lipschitz. More generally, and in line with our current work, one may introduce a different Legendre function Φ. For example, in the case of polynomial growth, the following propositions construct Legendre functions that are compatible with Assumptions (A2) and (A4). We defer their proofs to Appendix A.3. In the two propositions, we assume that the outer functions h(·, ξ) are globally Lipschitz, while the inner maps c(·, ξ) may have a high order of growth. It is possible to also analyze the setting when h(·, ξ) has polynomial growth, but the resulting statements and assumptions become much more cumbersome; we therefore omit that discussion.

Proposition 3.4 (Satisfying (A2)). Suppose there are square integrable functions L₁, L₂ : Ω → R_+ and a univariate polynomial p(u) = Σ_{i=0}^n a_i u^i with nonnegative coefficients satisfying

    |h(v, ξ) − h(w, ξ)| ≤ L₁(ξ)‖v − w‖₂  ∀v, w,
    ‖∇c(x, ξ) − ∇c(y, ξ)‖_op ≤ L₂(ξ)(p(‖x‖₂) + p(‖y‖₂))‖x − y‖₂  ∀x ≠ y.

Define the Legendre function Φ(x) := Σ_{i=0}^n (a_i(3i+7)/(i+2)) ‖x‖₂^{i+2}. Then assumption (A2) holds with τ := 4E[L₁(ξ)L₂(ξ)].

Proposition 3.5 (Satisfying (A4)). Suppose there are square integrable functions L₁, L₂ : Ω → R_+ and a univariate polynomial q(u) = Σ_{i=0}^n b_i u^i with nonnegative coefficients satisfying

    |h(v, ξ) − h(w, ξ)| ≤ L₁(ξ)‖v − w‖₂  ∀v, w,
    ‖∇c(x, ξ)‖_op ≤ L₂(ξ) q(‖x‖₂)  ∀x, ξ.

Then with the Legendre function Φ(x) = Σ_{i=0}^n (b_i/(i+2))‖x‖₂^{i+2}, assumption (A4) holds with L(ξ) = √2 L₁(ξ)L₂(ξ).

To construct a Bregman function compatible with both (A2) and (A4) simultaneously, one may simply add the two Legendre functions constructed in Propositions 3.4 and 3.5.

Stochastic saddle point problems.
As the final example, suppose that f is given in the stochastic conjugate form

    f(x) = E_ξ[ sup_{w ∈ W} g(x, w, ξ) ],

where W is some auxiliary set and g : R^d × W × Ω → R is some function. Thus we are interested in solving the stochastic saddle-point problem

    inf_x E_ξ[ sup_{w ∈ W} g(x, w, ξ) ] + r(x).    (3.7)

Such problems appear often in data science, where the variation of w in the uncertainty set W makes the loss function robust. One popular example is adversarial training [25]. In this setting, we have g(x, w, ξ) = L(x + w, y, ξ), where L(·, ·) is a loss function, y encodes the observed data, and w varies over some uncertainty set W, such as an l_p-ball. In order to apply our algorithmic framework, we must have access to stochastic one-sided models f_x(·, ξ) of f. It is quite natural to construct such models by using one-sided stochastic models g_x(·, w, ξ) of g. Indeed, it is appealing to simply set

    f_x(y, ξ) = g_x(y, ŵ(x, ξ), ξ)  for any ŵ(x, ξ) ∈ argmax_w g_x(x, w, ξ).    (3.8)

All of the model types in the previous examples could now serve as the models g_x(·, w, ξ), provided they meet the conditions outlined below. Formally, to ensure that (A1)-(A5) hold for the models f_x(y, ξ), we must make the following assumptions:

1. The mapping (x, ξ) → sup_{w ∈ W} g(x, w, ξ) is measurable and has finite first moment for every fixed x ∈ U ∩ dom r.

2. The function g_x(·, w, ξ) is ρ-weakly convex relative to Φ, for every fixed x ∈ U ∩ dom r, w ∈ W, and a.e. ξ ∈ Ω.

3. There exists a mapping ŵ : U × Ω → R^m satisfying ŵ(x, ξ) ∈ argmax_w g_x(x, w, ξ) for all x ∈ U ∩ dom r and a.e. ξ ∈ Ω, with the property that the functions (x, y, ξ) → g_x(y, ŵ(x, ξ), ξ) and (x, y, ξ) → g(y, ŵ(x, ξ), ξ) are measurable.

4. For all x, y ∈ U ∩ dom r, we have

    E_ξ[g_x(x, ŵ(x, ξ), ξ)] = E_ξ[g(x, ŵ(x, ξ), ξ)]

and

    E_ξ[g_x(y, ŵ(x, ξ), ξ) − g(y, ŵ(x, ξ), ξ)] ≤ τ D_Φ(y, x).

5. There exists a square integrable function L : Ω → R_+ such that

    g_x(x, ŵ(x, ξ), ξ) − g_x(y, ŵ(x, ξ), ξ) ≤ L(ξ)√(D_Φ(y, x))  for all x, y ∈ U ∩ dom r.

Given these assumptions, let us define f_x(y, ξ) as in (3.8). We now verify properties (A2)-(A4). Property (A2) follows from Property 4, which implies that E_ξ[f_x(x, ξ)] = f(x) and

    E_ξ[f_x(y, ξ) − f(y)] = E_ξ[ g_x(y, ŵ(x, ξ), ξ) − sup_{w ∈ W} g(y, w, ξ) ]
                          ≤ E_ξ[g_x(y, ŵ(x, ξ), ξ) − g(y, ŵ(x, ξ), ξ)] ≤ τ D_Φ(y, x).

Property (A3) follows directly from Property 2. Finally, (A4) follows from Property 5.
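The construction (3.8) can be visualized on a toy instance, not from the paper: a robust 1D least-squares loss f(x) = max_{w ∈ W} g(x, w) with g(x, w) = (x + w − b)², a finite uncertainty set W, and models that linearize g(·, ŵ) at the base point. The check below confirms the two properties used in the verification above: exactness at the base point and the one-sided lower bound (here with τ = 0, since each g(·, w) is convex).

```python
# Illustrative saddle-point model (3.8): f(x) = max_{w in W} g(x, w),
# g(x, w) = (x + w - b)^2. The stochastic index xi is suppressed.

W = [-0.1, 0.0, 0.1]     # finite uncertainty set (toy choice)
b = 1.0                  # observed data (toy choice)

def g(x, w):
    return (x + w - b) ** 2

def f(x):
    return max(g(x, w) for w in W)

def model(base, y):
    w_hat = max(W, key=lambda w: g(base, w))      # w_hat(x) in the argmax
    slope = 2.0 * (base + w_hat - b)              # d/dx g(x, w_hat)
    return g(base, w_hat) + slope * (y - base)    # linearization of g(., w_hat)

grid = [i / 50.0 for i in range(-100, 101)]
for base in grid:
    # Exact at the base point: f_x(x) = max_w g(x, w) = f(x).
    assert abs(model(base, base) - f(base)) < 1e-12
    for y in grid:
        # One-sided: the model lower-bounds f everywhere (tau = 0).
        assert model(base, y) <= f(y) + 1e-12
```

The one-sided bound holds because the linearization lower-bounds the convex function g(·, ŵ), which in turn is dominated by the supremum defining f.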

4 Stationarity measure

In this section, we introduce a natural stationarity measure that we will use to describe the convergence rate of Algorithm 1. The stationarity measure is simply the size of the gradient of an appropriate smooth approximation of the problem (3.1). This idea is completely analogous to the Euclidean setting [5,6]. Setting the stage, for any λ > 0, define the Φ-envelope

    F_λΦ(x) := inf_y { F(y) + (1/λ)D_Φ(y, x) },

and the associated Φ-proximal map

    prox_λΦ F(x) := argmin_y { F(y) + (1/λ)D_Φ(y, x) }.

Note that in the Euclidean setting Φ = ½‖·‖₂², these two constructions reduce to the standard Moreau envelope and the proximity map; see for example the monographs [35,38] or the note [7] for recent perspectives. We will measure the convergence guarantees of Algorithm 1 based on the rate at which the quantity

    E[ (1/λ) D_Φ(prox_λΦ F(x_t), x_t) ]    (4.1)

tends to zero for some fixed λ > 0. The significance of this quantity becomes apparent after making slightly stronger assumptions on the Legendre function Φ. In this section only, suppose that Φ : R^d → R ∪ {+∞} is 1-strongly convex with respect to some norm ‖·‖ and that Φ is twice differentiable at every point in int(dom Φ). With these assumptions, the following result shows that the Φ-envelope is differentiable, with a meaningful gradient. Indeed, this result follows quickly from [4]. For the sake of completeness, we present a self-contained argument in Appendix A.4.

Theorem 4.1 (Smoothness of the Φ-envelope). For any positive λ < 1/(τ + ρ), the envelope F_λΦ is differentiable at any point x ∈ int(dom Φ), with gradient given by

    ∇F_λΦ(x) = (1/λ) ∇²Φ(x) (x − prox_λΦ F(x)).

In light of Theorem 4.1, for any point x ∈ int(dom Φ), we may define the local norm ‖y‖_x := ‖∇²Φ(x)^{1/2} y‖₂. Then a quick computation shows that the dual norm is given by ‖v‖*_x = ‖∇²Φ(x)^{−1/2} v‖₂. Therefore, appealing to Theorem 4.1, for any positive λ < 1/(τ + ρ) and x ∈ int(dom Φ), we obtain the estimate

    √(D_Φ(prox_λΦ F(x), x)) ≥ (λ/√2) ‖∇F_λΦ(x)‖*_x.
Thus the square root of the Bregman divergence, which we will show tends to zero along the iterate sequence at a controlled rate, bounds the local norm of the gradient ∇F_λΦ.
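The envelope and proximal map admit a direct numerical illustration in one dimension, shown here for the Euclidean Legendre function Φ = ½x² (so D_Φ(y, x) = ½(y − x)²) and a toy weakly convex function not from the paper. The prox is computed by brute-force grid search; the assertions verify the elementary sandwich F(prox(x)) ≤ F_λΦ(x) ≤ F(x), which holds since D_Φ ≥ 0 and since y = x is a feasible candidate.

```python
# Phi-envelope and Phi-proximal map for Phi = 0.5 x^2 and the
# weakly convex toy function F(x) = |x^2 - 1| (rho = 2 here).

lam = 0.25                                   # lambda < 1/(tau + rho)

def F(x):
    return abs(x * x - 1.0)

def D(y, x):                                 # Bregman divergence of Phi
    return 0.5 * (y - x) ** 2

grid = [i / 1000.0 for i in range(-3000, 3001)]

def prox(x):
    # Brute-force minimizer of F(y) + (1/lam) D(y, x) over the grid.
    return min(grid, key=lambda y: F(y) + D(y, x) / lam)

def envelope(x):
    p = prox(x)
    return F(p) + D(p, x) / lam

for x in [-2.0, -0.3, 0.0, 0.4, 1.1, 2.5]:
    p = prox(x)
    assert F(p) <= envelope(x) + 1e-12       # since D_Phi >= 0
    assert envelope(x) <= F(x) + 1e-12       # candidate y = x

# At a minimizer of F (such as x = 1), the prox is a fixed point,
# so the stationarity measure D_Phi(prox(x), x) vanishes.
assert abs(prox(1.0) - 1.0) < 1e-9
```

The last assertion illustrates why (4.1) is a sensible stationarity measure: it is zero exactly at the fixed points of the Φ-proximal map.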

5 Convergence analysis

We now present the convergence analysis of Algorithm 1 under Assumptions (A1)-(A5). Henceforth, let {x_t}_{t≥0} be the iterates generated by Algorithm 1 and let {ξ_t}_{t≥0} be the corresponding samples used. For each index t ≥ 0, define the Bregman-proximal point

    x̂_t := prox_λΦ F(x_t).

To simplify notation, we will use the symbol E_t[·] to denote the expectation conditioned on all the realizations ξ_0, ξ_1, ..., ξ_{t−1}. The entire argument of Theorem 5.2, our main result, relies on the following lemma.

Lemma 5.1. For each iteration t ≥ 0, the iterates of Algorithm 1 satisfy

    E_t[D_Φ(x̂_t, x_{t+1})] ≤ ((1 + η_tτ − η_t/λ)/(1 − η_tρ)) D_Φ(x̂_t, x_t) + (Lη_t)²/(4(1 − η_tρ)) + (η_t/(1 − η_tρ)) E_t[r(x_t) − r(x_{t+1})].

Proof. Taking into account assumption (A3), we may apply the three point inequality in Lemma 2.1 with the convex function g = f_{x_t}(·, ξ_t) + r(·) + ρD_Φ(·, x_t) and with (1/η_t − ρ)D_Φ(·, x_t) replacing the Bregman divergence. Thus for any point x ∈ int(dom Φ), we obtain the estimate

    f_{x_t}(x, ξ_t) + r(x) + (1/η_t)D_Φ(x, x_t) ≥ f_{x_t}(x_{t+1}, ξ_t) + r(x_{t+1}) + (1/η_t)D_Φ(x_{t+1}, x_t) + (1/η_t − ρ)D_Φ(x, x_{t+1}).    (5.1)

Setting x = x̂_t, rearranging terms, and taking expectations, we deduce

    E_t[f_{x_t}(x̂_t, ξ_t) + r(x̂_t) − f_{x_t}(x_{t+1}, ξ_t) − r(x_{t+1})]
        ≥ E_t[(1/η_t − ρ)D_Φ(x̂_t, x_{t+1}) − (1/η_t)D_Φ(x̂_t, x_t) + (1/η_t)D_Φ(x_{t+1}, x_t)].    (5.2)

We seek to upper bound the left-hand side of (5.2). Using assumptions (A2) and (A4), we obtain:

    E_t[f_{x_t}(x̂_t, ξ_t) − f_{x_t}(x_{t+1}, ξ_t)] ≤ E_t[f_{x_t}(x̂_t, ξ_t) − f_{x_t}(x_t, ξ_t) + L(ξ)√(D_Φ(x_{t+1}, x_t))]
        = E_t[f_{x_t}(x̂_t, ξ_t)] − f(x_t) + E_t[L(ξ)√(D_Φ(x_{t+1}, x_t))]
        ≤ f(x̂_t) + τD_Φ(x̂_t, x_t) − f(x_t) + E_t[L(ξ)√(D_Φ(x_{t+1}, x_t))].    (5.3)

By the definition of x̂_t as the Bregman-proximal point, we have

    f(x̂_t) + r(x̂_t) + (1/λ)D_Φ(x̂_t, x_t) ≤ f(x_t) + r(x_t).    (5.4)

The left-hand side of (5.2) is thus upper bounded by

    τD_Φ(x̂_t, x_t) + E_t[L(ξ)√(D_Φ(x_{t+1}, x_t))] + E_t[r(x_t) − r(x_{t+1})] + (f(x̂_t) + r(x̂_t) − f(x_t) − r(x_t))
        ≤ (τ − 1/λ) D_Φ(x̂_t, x_t) + E_t[L(ξ)√(D_Φ(x_{t+1}, x_t))] + E_t[r(x_t) − r(x_{t+1})],

where the last inequality follows from (5.4). Combining this estimate with (5.2), we obtain

    E_t[(1/η_t − ρ)D_Φ(x̂_t, x_{t+1}) − (1/η_t)D_Φ(x̂_t, x_t) + (1/η_t)D_Φ(x_{t+1}, x_t)]
        ≤ (τ − 1/λ)D_Φ(x̂_t, x_t) + E_t[L(ξ)√(D_Φ(x_{t+1}, x_t))] + E_t[r(x_t) − r(x_{t+1})].

Multiplying through by η_t and rearranging yields

    (1 − η_tρ)E_t[D_Φ(x̂_t, x_{t+1})] ≤ (1 + η_tτ − η_t/λ)D_Φ(x̂_t, x_t) + η_t E_t[L(ξ)√(D_Φ(x_{t+1}, x_t))] − E_t[D_Φ(x_{t+1}, x_t)] + η_t E_t[r(x_t) − r(x_{t+1})].    (5.5)

Now define γ := √(E_t[D_Φ(x_{t+1}, x_t)]). Note that Cauchy-Schwarz implies

    E_t[L(ξ)√(D_Φ(x_{t+1}, x_t))] ≤ Lγ.

Using this estimate in (5.5), we obtain

    (1 − η_tρ)E_t[D_Φ(x̂_t, x_{t+1})] ≤ (1 + η_tτ − η_t/λ)D_Φ(x̂_t, x_t) + η_tLγ − γ² + η_t E_t[r(x_t) − r(x_{t+1})].

Maximizing the right-hand side in γ (i.e. taking γ = Lη_t/2) yields the guarantee

    (1 − η_tρ)E_t[D_Φ(x̂_t, x_{t+1})] ≤ (1 + η_tτ − η_t/λ)D_Φ(x̂_t, x_t) + (Lη_t)²/4 + η_t E_t[r(x_t) − r(x_{t+1})].

Dividing through by 1 − η_tρ completes the proof.

We can now prove our main theorem.

Theorem 5.2 (Convergence rate). The point x_t̄ returned by Algorithm 1 satisfies:

    E[(1/λ)D_Φ(prox_λΦ F(x_t̄), x_t̄)] ≤ (λ(F_λΦ(x_0) − min F) + (L²/4) Σ_{t=0}^T η_t²/(1 − η_tρ) + (η_0/(1 − η_0ρ))(r(x_0) − inf r)) / ((1 − λ(τ + ρ)) Σ_{t=0}^T η_t/(1 − η_tρ)).

Proof. Using the definitions of x_{t+1} and x̂_t along with Lemma 5.1, we obtain

    E_t[F_λΦ(x_{t+1})] ≤ E_t[F(x̂_t) + (1/λ)D_Φ(x̂_t, x_{t+1})]
        ≤ F(x̂_t) + (1/λ)( ((1 + η_tτ − η_t/λ)/(1 − η_tρ)) D_Φ(x̂_t, x_t) + (Lη_t)²/(4(1 − η_tρ)) + (η_t/(1 − η_tρ)) E_t[r(x_t) − r(x_{t+1})] )
        = F_λΦ(x_t) + (η_t(τ + ρ − 1/λ)/(λ(1 − η_tρ))) D_Φ(x̂_t, x_t) + (Lη_t)²/(4λ(1 − η_tρ)) + (η_t/(λ(1 − η_tρ))) E_t[r(x_t) − r(x_{t+1})].

Recursing and applying the tower rule for expectations, we obtain
\[
\mathbb{E}\big[F_\Phi(x_{T+1})\big] \le F_\Phi(x_0) - \frac{\frac1\lambda-\tau-\rho}{\lambda}\sum_{t=0}^T \frac{\eta_t}{1-\eta_t\rho}\,\mathbb{E}[D_\Phi(\hat x_t,x_t)] + \frac{L^2}{4\lambda}\sum_{t=0}^T\frac{\eta_t^2}{1-\eta_t\rho} + \frac1\lambda\sum_{t=0}^T \frac{\eta_t}{1-\eta_t\rho}\,\mathbb{E}[r(x_t)-r(x_{t+1})]. \tag{5.6}
\]
Taking into account that $t\mapsto \frac{\eta_t}{1-\eta_t\rho}$ is nonincreasing yields the inequality
\[
\sum_{t=0}^T \frac{\eta_t}{1-\eta_t\rho}\,\mathbb{E}[r(x_t)-r(x_{t+1})] \le \frac{\eta_0}{1-\eta_0\rho}\big(r(x_0)-\inf r\big).
\]
See the auxiliary Lemma A.1 for a verification. Combining this bound with (5.6), using the inequality $\mathbb{E}[F_\Phi(x_{T+1})] \ge \min F$, and rearranging, we conclude
\[
\frac{\frac1\lambda-\tau-\rho}{\lambda}\sum_{t=0}^T \frac{\eta_t}{1-\eta_t\rho}\,\mathbb{E}[D_\Phi(\hat x_t,x_t)] \le F_\Phi(x_0)-\min F + \frac{L^2}{4\lambda}\sum_{t=0}^T\frac{\eta_t^2}{1-\eta_t\rho} + \frac{\eta_0}{\lambda(1-\eta_0\rho)}\big(r(x_0)-\inf r\big),
\]
or equivalently
\[
\sum_{t=0}^T \frac{\eta_t}{1-\eta_t\rho}\,\mathbb{E}[D_\Phi(\hat x_t,x_t)] \le \frac{\lambda\Big(\lambda\big(F_\Phi(x_0)-\min F\big) + \frac{L^2}{4}\sum_{t=0}^T\frac{\eta_t^2}{1-\eta_t\rho} + \frac{\eta_0}{1-\eta_0\rho}\big(r(x_0)-\inf r\big)\Big)}{1-\lambda(\tau+\rho)}.
\]
Dividing through by $\sum_{t=0}^T\frac{\eta_t}{1-\eta_t\rho}$ and recognizing the left-hand side as $\mathbb{E}[D_\Phi(\hat x_{t^*},x_{t^*})]$, the result follows.

As an immediate corollary of Theorem 5.2, we have the following rate of convergence when the stepsize is constant.

Corollary 5.3 (Convergence rate for constant stepsize). For some $\alpha>0$, set $\eta_t = \frac{1}{\rho+\alpha\sqrt{T+1}}$ for all indices $t=0,\dots,T$. Then the point $x_{t^*}$ returned by Algorithm 1 satisfies:
\[
\mathbb{E}\Big[D_\Phi\big(\mathrm{prox}^\Phi_{\lambda F}(x_{t^*}),x_{t^*}\big)\Big] \le \frac{\lambda}{1-\lambda(\tau+\rho)}\left(\frac{\lambda\alpha\big(F_\Phi(x_0)-\min F\big)+\frac{L^2}{4\alpha}}{\sqrt{T+1}}+\frac{r(x_0)-\inf r}{T+1}\right).
\]

6 Mirror descent: smoothness and finite variance

Assumptions (A1)-(A5) are reasonable for the examples described in Section 3.2, being in line with standard conditions in the literature. However, in the special case that $f$ is smooth and we apply stochastic mirror descent, Assumption (A4) is nonstandard. Ideally, one would

like to replace this assumption with a bound on the variance of the stochastic estimator of the gradient. In this section, we show that this is indeed possible by slightly modifying the argument in Section 5.

Henceforth, let $\Phi$ be a Legendre function and set $U := \mathrm{int}(\mathrm{dom}\,\Phi)$. In this section, we make the following assumptions:

(B1) (Sampling) It is possible to generate i.i.d. realizations $\xi_1,\dots,\xi_T \sim P$.

(B2) (Stochastic gradient) There is a measurable mapping $G\colon U\times\Omega\to\mathbb{R}^d$ satisfying
\[
\mathbb{E}_\xi[G(x,\xi)] = \nabla f(x) \qquad \forall x\in U\cap \mathrm{dom}\, r.
\]

(B3) (Relative smoothness) There exist real $\tau, M \ge 0$ such that
\[
-\tau D_\Phi(y,x) \le f(y)-f(x)-\langle \nabla f(x), y-x\rangle \le M D_\Phi(y,x) \qquad \forall x,y \in U\cap\mathrm{dom}\, r.
\]

(B4) (Relative weak convexity) The function $r$ is $\rho$-weakly convex relative to $\Phi$.

(B5) (Strong convexity of $\Phi$) The Legendre function $\Phi$ is $1$-strongly convex with respect to some norm $\|\cdot\|$.

(B6) (Finite variance) The following variance is finite:
\[
\mathbb{E}_\xi\big[\|G(x,\xi)-\nabla f(x)\|_*^2\big] \le \sigma^2 < \infty.
\]

Henceforth, we denote by $f_x(\cdot,\xi)$ the linear models
\[
f_x(y,\xi) := f(x) + \langle G(x,\xi),\, y-x\rangle,
\]
which are built from the stochastic gradient estimator $G$. With this notation in hand, let us compare Assumptions (B1)-(B6) with Assumptions (A1)-(A4). Evidently, Assumptions (B1) and (A1) are identical. Upon taking expectations, Assumptions (B2) and (B3) imply the stochastic one-sided accuracy property (A2) for the linear models $f_x(\cdot,\xi)$, while (B4) directly implies (A3). Assumptions (B5) and (B6) replace the Lipschitzian property (A4). Finally, we reiterate that the relative smoothness property in (B3) was recently introduced in [2, 30] for smooth convex minimization, and extended to smooth nonconvex problems in [8] and to nonsmooth stochastic problems in [26, 29]. This property allows for higher-order growth than the standard Lipschitz gradient assumption commonly analyzed in the literature. We refer the reader to [2, 30] for various examples of Bregman functions that arise in applications.
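As a concrete illustration of the flavor of assumption (B5), consider the Legendre function $\Phi(x)=\tfrac12\|x\|_2^2+\tfrac14\|x\|_2^4$, which grows faster than quadratically yet is $1$-strongly convex with respect to $\|\cdot\|_2$, since $\Phi-\tfrac12\|\cdot\|_2^2$ is convex. The sketch below is our own illustrative choice (not an example taken from [2, 30]); it computes the Bregman divergence of this $\Phi$ and numerically verifies the strong-convexity bound $D_\Phi(y,x)\ge\tfrac12\|y-x\|_2^2$ on random points.

```python
import numpy as np

def phi(x):
    # Legendre function with quartic growth: (1/2)||x||^2 + (1/4)||x||^4
    n2 = np.dot(x, x)
    return 0.5 * n2 + 0.25 * n2 ** 2

def grad_phi(x):
    # gradient of phi: (1 + ||x||^2) x
    return (1.0 + np.dot(x, x)) * x

def bregman(y, x):
    # Bregman divergence D_Phi(y, x) = Phi(y) - Phi(x) - <grad Phi(x), y - x>
    return phi(y) - phi(x) - np.dot(grad_phi(x), y - x)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # 1-strong convexity of Phi w.r.t. the Euclidean norm
    assert bregman(y, x) >= 0.5 * np.dot(y - x, y - x) - 1e-12
```

The same check can be repeated for any polynomial-growth Legendre function built by adding a convex higher-order term to $\tfrac12\|\cdot\|_2^2$.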

For the sake of clarity, Algorithm 2 instantiates Algorithm 1 in our setting.

Algorithm 2: Mirror descent for smooth minimization
  Data: $x_0 \in U\cap\mathrm{dom}\, r$, positive $\lambda < (\tau+\rho)^{-1}$, a sequence $\{\eta_t\}_{t\ge 0} \subset \big(0,\, (M+1/\lambda)^{-1}\big)$, and iteration count $T$.
  Step $t = 0,\dots,T$:
    Sample $\xi_t \sim P$;
    Set $x_{t+1} = \operatorname*{argmin}_x \Big\{ \langle G(x_t,\xi_t),\, x\rangle + r(x) + \tfrac1{\eta_t} D_\Phi(x,x_t) \Big\}$.
  Sample $t^* \in \{0,\dots,T\}$ according to the discrete probability distribution $P(t^* = t) \propto \frac{\eta_t}{1-\eta_t\rho}$.
  Return $x_{t^*}$.

As in Section 5, the convergence analysis relies on the following key lemma. We let $\{x_t\}_{t\ge0}$ be the iterates generated by Algorithm 2 and let $\{\xi_t\}_{t\ge0}$ be the corresponding samples used. For each index $t\ge0$, we continue to use the notation $\hat x_t = \mathrm{prox}^\Phi_{\lambda F}(x_t)$ and let $\mathbb{E}_t[\cdot]$ denote the expectation conditioned on all the realizations $\xi_0,\xi_1,\dots,\xi_{t-1}$.

Lemma 6.1. For each iteration $t\ge0$, the iterates of Algorithm 2 satisfy
\[
\mathbb{E}_t[D_\Phi(\hat x_t,x_{t+1})] \le \frac{1+\eta_t(\tau-1/\lambda)}{1-\eta_t\rho}\,D_\Phi(\hat x_t,x_t) + \frac{\sigma^2\eta_t^2}{2\big(1-\eta_t(M+1/\lambda)\big)(1-\eta_t\rho)}.
\]

Proof. Following the initial steps of the proof of Lemma 5.1, we arrive at the estimate (5.2), namely
\[
\mathbb{E}_t\Big[\big(\tfrac1{\eta_t}-\rho\big)D_\Phi(\hat x_t,x_{t+1}) - \tfrac1{\eta_t}D_\Phi(\hat x_t,x_t) + \tfrac1{\eta_t}D_\Phi(x_{t+1},x_t)\Big] \le \mathbb{E}_t[f_{x_t}(\hat x_t,\xi_t)+r(\hat x_t)-f_{x_t}(x_{t+1},\xi_t)-r(x_{t+1})]. \tag{6.1}
\]
We now seek to bound the right-hand side of (6.1) using (B3)-(B6). To that end, the following bound will be useful:
\[
\begin{aligned}
f_{x_t}(x_{t+1},\xi_t) &= f(x_t) + \langle G(x_t,\xi_t),\, x_{t+1}-x_t\rangle\\
&\ge f(x_t) + \langle \nabla f(x_t),\, x_{t+1}-x_t\rangle - \|G(x_t,\xi_t)-\nabla f(x_t)\|_*\,\|x_{t+1}-x_t\|.
\end{aligned}
\]
Taking expectations of both sides and applying Cauchy-Schwarz and (B3)-(B6), we obtain
\[
\begin{aligned}
\mathbb{E}_t[f_{x_t}(x_{t+1},\xi_t)] &\ge \mathbb{E}_t[f(x_t)+\langle\nabla f(x_t),x_{t+1}-x_t\rangle] - \mathbb{E}_t\big[\|G(x_t,\xi_t)-\nabla f(x_t)\|_*\,\|x_{t+1}-x_t\|\big]\\
&\ge \mathbb{E}_t[f(x_{t+1}) - M D_\Phi(x_{t+1},x_t)] - \sqrt{\mathbb{E}_t\|G(x_t,\xi_t)-\nabla f(x_t)\|_*^2}\cdot\sqrt{\mathbb{E}_t\|x_{t+1}-x_t\|^2}\\
&\ge \mathbb{E}_t[f(x_{t+1}) - M D_\Phi(x_{t+1},x_t)] - \sigma\sqrt{\mathbb{E}_t\|x_{t+1}-x_t\|^2}\\
&\ge \mathbb{E}_t[f(x_{t+1}) - M D_\Phi(x_{t+1},x_t)] - \sigma\sqrt{2\,\mathbb{E}_t[D_\Phi(x_{t+1},x_t)]}. \tag{6.2}
\end{aligned}
\]

Continuing, add $f_{x_t}(\hat x_t,\xi_t)$ to both sides of (6.2), rearrange, and apply (B3) to obtain
\[
\begin{aligned}
\mathbb{E}_t[f_{x_t}(\hat x_t,\xi_t)-f_{x_t}(x_{t+1},\xi_t)] &\le \mathbb{E}_t[f_{x_t}(\hat x_t,\xi_t) - f(x_{t+1}) + MD_\Phi(x_{t+1},x_t)] + \sigma\sqrt{2\,\mathbb{E}_t[D_\Phi(x_{t+1},x_t)]}\\
&\le \mathbb{E}_t[f(\hat x_t) - f(x_{t+1}) + \tau D_\Phi(\hat x_t,x_t) + MD_\Phi(x_{t+1},x_t)] + \sigma\sqrt{2\,\mathbb{E}_t[D_\Phi(x_{t+1},x_t)]}. \tag{6.3}
\end{aligned}
\]
On the other hand, by the definition of $\hat x_t$ we have
\[
f(\hat x_t)+r(\hat x_t)+\tfrac1\lambda D_\Phi(\hat x_t,x_t) \le f(x_{t+1})+r(x_{t+1})+\tfrac1\lambda D_\Phi(x_{t+1},x_t).
\]
Inserting this inequality into (6.3), we obtain
\[
\mathbb{E}_t[f(\hat x_t)+r(\hat x_t)-f(x_{t+1})-r(x_{t+1})] \le \mathbb{E}_t\Big[\big(M+\tfrac1\lambda\big)D_\Phi(x_{t+1},x_t) + \big(\tau-\tfrac1\lambda\big)D_\Phi(\hat x_t,x_t)\Big] + \sigma\sqrt{2\,\mathbb{E}_t[D_\Phi(x_{t+1},x_t)]}. \tag{6.4}
\]
Combining (6.4) with (6.1) gives the estimate
\[
\mathbb{E}_t\Big[\big(\tfrac1{\eta_t}-\rho\big)D_\Phi(\hat x_t,x_{t+1}) - \tfrac1{\eta_t}D_\Phi(\hat x_t,x_t) + \tfrac1{\eta_t}D_\Phi(x_{t+1},x_t)\Big] \le \big(M+\tfrac1\lambda\big)\mathbb{E}_t[D_\Phi(x_{t+1},x_t)] + \big(\tau-\tfrac1\lambda\big)D_\Phi(\hat x_t,x_t) + \sigma\sqrt{2\,\mathbb{E}_t[D_\Phi(x_{t+1},x_t)]}.
\]
Multiplying through by $\eta_t$ and rearranging, we obtain
\[
\mathbb{E}_t\Big[(1-\eta_t\rho)D_\Phi(\hat x_t,x_{t+1}) + \big(1-\eta_t(M+\tfrac1\lambda)\big)D_\Phi(x_{t+1},x_t)\Big] \le \Big(1+\eta_t\big(\tau-\tfrac1\lambda\big)\Big)D_\Phi(\hat x_t,x_t) + \eta_t\sigma\sqrt{2\,\mathbb{E}_t[D_\Phi(x_{t+1},x_t)]}.
\]
Now define $\gamma := \sqrt{\mathbb{E}_t[D_\Phi(x_{t+1},x_t)]}$, and rewrite the above as
\[
(1-\eta_t\rho)\,\mathbb{E}_t[D_\Phi(\hat x_t,x_{t+1})] \le \Big(1+\eta_t\big(\tau-\tfrac1\lambda\big)\Big)D_\Phi(\hat x_t,x_t) + \sqrt2\,\sigma\eta_t\gamma - \big(1-\eta_t(M+\tfrac1\lambda)\big)\gamma^2.
\]
Maximizing the right-hand side in $\gamma$, i.e. taking $\gamma = \frac{\sqrt2\,\sigma\eta_t}{2(1-\eta_t(M+1/\lambda))}$, we conclude
\[
(1-\eta_t\rho)\,\mathbb{E}_t[D_\Phi(\hat x_t,x_{t+1})] \le \Big(1+\eta_t\big(\tau-\tfrac1\lambda\big)\Big)D_\Phi(\hat x_t,x_t) + \frac{\sigma^2\eta_t^2}{2\big(1-\eta_t(M+1/\lambda)\big)},
\]
as desired.

With Lemma 6.1 at hand, we can now establish a convergence rate of Algorithm 2.

Theorem 6.2. The point $x_{t^*}$ returned by Algorithm 2 satisfies:
\[
\mathbb{E}\Big[D_\Phi\big(\mathrm{prox}^\Phi_{\lambda F}(x_{t^*}),x_{t^*}\big)\Big] \le \frac{F_\Phi(x_0)-\min F+\frac{\sigma^2}{2\lambda}\sum_{t=0}^T\frac{\eta_t^2}{(1-\eta_t(M+1/\lambda))(1-\eta_t\rho)}}{\frac{1-\lambda(\tau+\rho)}{\lambda^2}\sum_{t=0}^T\frac{\eta_t}{1-\eta_t\rho}}.
\]

Proof. Using Lemma 6.1, we obtain
\[
\begin{aligned}
\mathbb{E}_t\big[F_\Phi(x_{t+1})\big] &\le \mathbb{E}_t\Big[F(\hat x_t)+\tfrac1\lambda D_\Phi(\hat x_t,x_{t+1})\Big]\\
&\le F(\hat x_t) + \frac1\lambda\cdot\frac{1+\eta_t(\tau-1/\lambda)}{1-\eta_t\rho}\, D_\Phi(\hat x_t,x_t) + \frac{\sigma^2\eta_t^2}{2\lambda\big(1-\eta_t(M+1/\lambda)\big)(1-\eta_t\rho)}\\
&= F_\Phi(x_t) + \frac{\eta_t\big(\tau+\rho-1/\lambda\big)}{\lambda(1-\eta_t\rho)}\, D_\Phi(\hat x_t,x_t) + \frac{\sigma^2\eta_t^2}{2\lambda\big(1-\eta_t(M+1/\lambda)\big)(1-\eta_t\rho)}.
\end{aligned}
\]
Recursing and applying the tower rule for expectations, we obtain
\[
\mathbb{E}\big[F_\Phi(x_{T+1})\big] \le F_\Phi(x_0) - \frac{\frac1\lambda-\tau-\rho}{\lambda}\sum_{t=0}^T\frac{\eta_t}{1-\eta_t\rho}\,\mathbb{E}[D_\Phi(\hat x_t,x_t)] + \frac{\sigma^2}{2\lambda}\sum_{t=0}^T\frac{\eta_t^2}{\big(1-\eta_t(M+1/\lambda)\big)(1-\eta_t\rho)}.
\]
Rearranging and using the fact that $\mathbb{E}[F_\Phi(x_{T+1})]\ge\min F$, we obtain
\[
\frac{1-\lambda(\tau+\rho)}{\lambda^2}\sum_{t=0}^T\frac{\eta_t}{1-\eta_t\rho}\,\mathbb{E}[D_\Phi(\hat x_t,x_t)] \le F_\Phi(x_0)-\min F + \frac{\sigma^2}{2\lambda}\sum_{t=0}^T\frac{\eta_t^2}{\big(1-\eta_t(M+1/\lambda)\big)(1-\eta_t\rho)}.
\]
Dividing through by $\frac{1-\lambda(\tau+\rho)}{\lambda^2}\sum_{t=0}^T\frac{\eta_t}{1-\eta_t\rho}$ and recognizing the left-hand side as $\mathbb{E}[D_\Phi(\hat x_{t^*},x_{t^*})]$, the result follows.

As an immediate corollary, we obtain a convergence rate for Algorithm 2 with a constant stepsize.

Corollary 6.3. For some $\alpha>0$, set $\eta_t = \frac{1}{M+1/\lambda+\alpha\sqrt{T+1}}$ for all indices $t=0,\dots,T$. Then the point $x_{t^*}$ returned by Algorithm 2 satisfies:
\[
\mathbb{E}\Big[D_\Phi\big(\mathrm{prox}^\Phi_{\lambda F}(x_{t^*}),x_{t^*}\big)\Big] \le \frac{\lambda}{1-\lambda(\tau+\rho)}\left(\frac{\lambda\big(M+\tfrac1\lambda+\alpha\sqrt{T+1}\big)\big(F_\Phi(x_0)-\min F\big)}{T+1} + \frac{\sigma^2}{2\alpha\sqrt{T+1}}\right).
\]

7 Rates in function value for convex problems

In this final section, we examine convergence rates for stochastic model-based minimization under convexity assumptions and prove rates of convergence on function values. To this end, we will use the following definition from [30]. A function $g\colon\mathbb{R}^d\to\mathbb{R}\cup\{\infty\}$ is $\mu$-strongly convex relative to $\Phi$ if the function $g-\mu\Phi$ is convex. Notice that $\mu=0$ corresponds to plain convexity of $g$. In this section, we make the following assumptions:

(C1) (Sampling) It is possible to generate i.i.d. realizations $\xi_1,\dots,\xi_T\sim P$.

(C2) (One-sided accuracy) There is a measurable function $(x,y,\xi)\mapsto f_x(y,\xi)$ defined on $U\times U\times\Omega$ satisfying both
\[
\mathbb{E}_\xi[f_x(x,\xi)] = f(x) \quad \forall x\in U\cap\mathrm{dom}\, r, \qquad\text{and}\qquad \mathbb{E}_\xi[f_x(y,\xi)] \le f(y) \quad \forall x,y\in U\cap\mathrm{dom}\, r. \tag{7.1}
\]

(C3) (Convexity of the models) There exists some $\mu\ge0$ such that the functions $f_x(\cdot,\xi)+r(\cdot)$ are $\mu$-strongly convex relative to $\Phi$ for all $x\in U\cap\mathrm{dom}\, r$ and a.e. $\xi\in\Omega$.

(C4) (Lipschitz property) There exists a square integrable function $L\colon\Omega\to\mathbb{R}_+$ such that for all $x,y\in U\cap\mathrm{dom}\, r$, the following inequalities hold:
\[
|f_x(x,\xi)-f_x(y,\xi)| \le L(\xi)\sqrt{D_\Phi(y,x)}, \qquad \sqrt{\mathbb{E}_\xi[L(\xi)^2]} \le L. \tag{7.2}
\]

(C5) (Solvability) The convex problems
\[
\min_y\Big\{F(y)+\tfrac1\nu D_\Phi(y,x)\Big\} \qquad\text{and}\qquad \min_y\Big\{f_x(y,\xi)+r(y)+\tfrac1\nu D_\Phi(y,x)\Big\}
\]
admit a minimizer for any $\nu>0$, any $x\in U$, and a.e. $\xi\in\Omega$. The minimizers vary measurably in $(x,\xi)\in U\times\Omega$.

Thus the only difference between assumptions (C1)-(C5) and (A1)-(A5) is that in expectation the stochastic models $f_x(\cdot,\xi)$ are global under-estimators (C2), and the functions $f_x(\cdot,\xi)+r(\cdot)$ are relatively strongly convex, instead of weakly convex (C3). Note that under assumptions (C1)-(C5), the objective function $F$ is $\mu$-strongly convex relative to $\Phi$; the argument is completely analogous to that of Lemma 3.3.

Henceforth, we let $\{x_t\}_{t\ge0}$ be the iterates generated by Algorithm 1 (with $\tau=\rho=0$) and let $\{\xi_t\}_{t\ge0}$ be the corresponding samples used. We let $\mathbb{E}_t[\cdot]$ denote the expectation conditioned on all the realizations $\xi_0,\xi_1,\dots,\xi_{t-1}$. We need the following key lemma, which identifies the Bregman divergence $D_\Phi(x^*,x_t)$, between the iterates and an optimal solution, as a useful potential function. Notice that this is in contrast to the nonconvex setting, where it was the envelope $F_\Phi(x_t)$ that served as an appropriate potential function.

Lemma 7.1. For each iteration $t\ge0$, the iterates of Algorithm 1 satisfy
\[
\mathbb{E}_t[(1+\mu\eta_t)D_\Phi(x^*,x_{t+1})] \le D_\Phi(x^*,x_t) + \frac{(L\eta_t)^2}{4} + \eta_t\,\mathbb{E}_t[r(x_t)-r(x_{t+1})] - \eta_t\big(F(x_t)-F(x^*)\big),
\]
where $x^*$ is any minimizer of $F$.

Proof. Appealing to the three-point inequality in Lemma 2.1 and (C3), we deduce that all points $x\in\mathrm{dom}\, r$ satisfy
\[
f_{x_t}(x,\xi_t)+r(x)+\tfrac1{\eta_t}D_\Phi(x,x_t) \ge f_{x_t}(x_{t+1},\xi_t)+r(x_{t+1})+\tfrac1{\eta_t}D_\Phi(x_{t+1},x_t)+\big(\tfrac1{\eta_t}+\mu\big)D_\Phi(x,x_{t+1}). \tag{7.3}
\]
Setting $x=x^*$, rearranging terms, and taking expectations, we deduce
\[
\mathbb{E}_t\Big[\big(\tfrac1{\eta_t}+\mu\big)D_\Phi(x^*,x_{t+1}) - \tfrac1{\eta_t}D_\Phi(x^*,x_t) + \tfrac1{\eta_t}D_\Phi(x_{t+1},x_t)\Big] \le \mathbb{E}_t[f_{x_t}(x^*,\xi_t)+r(x^*)-f_{x_t}(x_{t+1},\xi_t)-r(x_{t+1})]. \tag{7.4}
\]
We seek to upper bound the right-hand side of (7.4). Assumptions (C2) and (C4) imply:
\[
\begin{aligned}
\mathbb{E}_t[f_{x_t}(x^*,\xi_t)-f_{x_t}(x_{t+1},\xi_t)] &\le \mathbb{E}_t\Big[f_{x_t}(x^*,\xi_t)-f_{x_t}(x_t,\xi_t)+L(\xi_t)\sqrt{D_\Phi(x_{t+1},x_t)}\Big]\\
&= \mathbb{E}_t[f_{x_t}(x^*,\xi_t)] - f(x_t) + \mathbb{E}_t\Big[L(\xi_t)\sqrt{D_\Phi(x_{t+1},x_t)}\Big]\\
&\le f(x^*) - f(x_t) + \mathbb{E}_t\Big[L(\xi_t)\sqrt{D_\Phi(x_{t+1},x_t)}\Big].
\end{aligned}
\]
The right-hand side of (7.4) is therefore upper bounded by
\[
\mathbb{E}_t\Big[L(\xi_t)\sqrt{D_\Phi(x_{t+1},x_t)}\Big] + f(x^*)+r(x^*) - f(x_t) - \mathbb{E}_t[r(x_{t+1})] = \mathbb{E}_t\Big[L(\xi_t)\sqrt{D_\Phi(x_{t+1},x_t)}\Big] + \mathbb{E}_t[r(x_t)-r(x_{t+1})] - \big(F(x_t)-F(x^*)\big).
\]
Putting everything together, we arrive at
\[
\mathbb{E}_t\Big[\big(\tfrac1{\eta_t}+\mu\big)D_\Phi(x^*,x_{t+1}) - \tfrac1{\eta_t}D_\Phi(x^*,x_t) + \tfrac1{\eta_t}D_\Phi(x_{t+1},x_t)\Big] \le \mathbb{E}_t\Big[L(\xi_t)\sqrt{D_\Phi(x_{t+1},x_t)}\Big] + \mathbb{E}_t[r(x_t)-r(x_{t+1})] - \big(F(x_t)-F(x^*)\big).
\]
Multiplying through by $\eta_t$ and rearranging yields
\[
\mathbb{E}_t[(1+\mu\eta_t)D_\Phi(x^*,x_{t+1})] \le D_\Phi(x^*,x_t) + \mathbb{E}_t\Big[\eta_t L(\xi_t)\sqrt{D_\Phi(x_{t+1},x_t)} - D_\Phi(x_{t+1},x_t)\Big] + \eta_t\,\mathbb{E}_t[r(x_t)-r(x_{t+1})] - \eta_t\big(F(x_t)-F(x^*)\big).
\]
Now define $\gamma := \sqrt{\mathbb{E}_t[D_\Phi(x_{t+1},x_t)]}$. By Cauchy-Schwarz, we have that $\mathbb{E}_t[L(\xi_t)\sqrt{D_\Phi(x_{t+1},x_t)}] \le L\gamma$. Thus we obtain
\[
\begin{aligned}
\mathbb{E}_t[(1+\mu\eta_t)D_\Phi(x^*,x_{t+1})] &\le D_\Phi(x^*,x_t) + \eta_t L\gamma - \gamma^2 + \eta_t\,\mathbb{E}_t[r(x_t)-r(x_{t+1})] - \eta_t\big(F(x_t)-F(x^*)\big)\\
&\le D_\Phi(x^*,x_t) + \frac{(L\eta_t)^2}{4} + \eta_t\,\mathbb{E}_t[r(x_t)-r(x_{t+1})] - \eta_t\big(F(x_t)-F(x^*)\big),
\end{aligned}
\]
where the last inequality follows by maximizing the right-hand side in $\gamma$.

We are now ready to prove convergence guarantees in the case that $\mu = 0$.
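Before stating the guarantees, it may help to see the scheme in its simplest Euclidean instantiation: with $\Phi=\tfrac12\|\cdot\|_2^2$, $r\equiv 0$, and linear models built from stochastic gradients, the model-based update reduces to a plain stochastic gradient step, and the analysis of this section bounds the value gap at an averaged iterate. The least-squares instance below is entirely hypothetical (our own test data), with a constant stepsize of order $1/\sqrt{T}$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))
b = A @ rng.normal(size=5)                     # consistent system: min is zero
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)  # the convex objective F

T = 2000
eta = 1.0 / np.sqrt(T)                         # constant stepsize, O(1/sqrt(T))
x = np.zeros(5)
avg = np.zeros(5)
for t in range(T):
    i = rng.integers(200)                      # sample xi_t ~ P (a random row)
    g = (A[i] @ x - b[i]) * A[i]               # stochastic gradient estimator
    x = x - eta * g                            # Euclidean model-based update
    avg += x / T                               # running average of the iterates

print(f(np.zeros(5)), f(avg))                  # value gap shrinks at the average
```

The averaged iterate is the one the convex-case rate controls; the last iterate may oscillate at the stepsize scale.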

Theorem 7.2 (Convergence rate under convexity). For all $T>0$, we have
\[
\mathbb{E}\left[F\left(\frac{\sum_{t=0}^T \eta_t x_t}{\sum_{t=0}^T\eta_t}\right) - F(x^*)\right] \le \frac{D_\Phi(x^*,x_0) + \frac{L^2}{4}\sum_{t=0}^T\eta_t^2 + \eta_0\big(r(x_0)-\inf r\big)}{\sum_{t=0}^T \eta_t},
\]
where $x^*$ is any minimizer of $F$.

Proof. Lower-bounding the term $\mu\eta_t D_\Phi(x^*,x_{t+1})$ on the left-hand side of Lemma 7.1 by zero, we deduce
\[
\eta_t\big(F(x_t)-F(x^*)\big) \le \frac{(L\eta_t)^2}{4} + \eta_t\,\mathbb{E}_t[r(x_t)-r(x_{t+1})] + D_\Phi(x^*,x_t) - \mathbb{E}_t[D_\Phi(x^*,x_{t+1})].
\]
Applying the tower rule for expectations yields
\[
\eta_t\,\mathbb{E}[F(x_t)-F(x^*)] \le \frac{(L\eta_t)^2}{4} + \eta_t\,\mathbb{E}[r(x_t)-r(x_{t+1})] + \mathbb{E}\big[D_\Phi(x^*,x_t) - D_\Phi(x^*,x_{t+1})\big].
\]
Summing over $t=0,\dots,T$, telescoping, applying the auxiliary Lemma A.1 to the terms $\eta_t(r(x_t)-r(x_{t+1}))$, and using Jensen's inequality, we conclude
\[
\mathbb{E}\left[F\left(\frac{\sum_{t=0}^T \eta_t x_t}{\sum_{t=0}^T\eta_t}\right) - F(x^*)\right] \le \frac{D_\Phi(x^*,x_0) + \frac{L^2}{4}\sum_{t=0}^T\eta_t^2 + \eta_0\big(r(x_0)-\inf r\big)}{\sum_{t=0}^T \eta_t},
\]
as claimed.

As an immediate corollary of Theorem 7.2, we have the following rate of convergence when the stepsize is constant.

Corollary 7.3 (Convergence rate under convexity for constant stepsize). For any $\alpha>0$ and corresponding constant stepsize $\eta_t = \frac{\alpha}{\sqrt{T+1}}$, we have
\[
\mathbb{E}\left[F\left(\frac{1}{T+1}\sum_{t=0}^T x_t\right) - F(x^*)\right] \le \frac{D_\Phi(x^*,x_0) + \frac{\alpha^2L^2}{4} + \alpha\big(r(x_0)-\inf r\big)}{\alpha\sqrt{T+1}},
\]
where $x^*$ is any minimizer of $F$.

The final result of this section proves that Algorithm 1, with an appropriate choice of stepsize, drives the expected error in function values to zero at the rate $\widetilde O(k^{-1})$, whenever $\mu>0$.

Theorem 7.4 (Convergence rate: strongly convex case). Suppose that $\eta_t = \frac{1}{\mu(t+1)}$ for all $t\ge0$. Then for all $T>0$, we have
\[
\mathbb{E}\left[F\left(\frac{1}{T+1}\sum_{t=0}^T x_t\right) - F(x^*)\right] + \mu\,\mathbb{E}[D_\Phi(x^*,x_{T+1})] \le \frac{\frac{L^2(1+\log(T+1))}{4\mu} + r(x_0)-\inf r + \mu D_\Phi(x^*,x_0)}{T+1},
\]
where $x^*$ is any minimizer of $F$.

Proof. Using Lemma 7.1 and the law of total expectation, we have
\[
\mathbb{E}[F(x_t)-F(x^*)] \le \frac{\eta_t L^2}{4} + \mathbb{E}[r(x_t)-r(x_{t+1})] + \mathbb{E}\Big[\tfrac1{\eta_t}D_\Phi(x^*,x_t) - \tfrac{1+\mu\eta_t}{\eta_t}D_\Phi(x^*,x_{t+1})\Big].
\]
Setting $\eta_t = \frac{1}{\mu(t+1)}$, averaging, and applying Jensen's inequality yields
\[
\begin{aligned}
\mathbb{E}\left[F\left(\frac{1}{T+1}\sum_{t=0}^T x_t\right)-F(x^*)\right] &\le \frac{1}{T+1}\sum_{t=0}^T\frac{L^2}{4\mu(t+1)} + \frac{\mathbb{E}[r(x_0)-r(x_{T+1})]}{T+1}\\
&\qquad + \frac{1}{T+1}\sum_{t=0}^T \mathbb{E}\big[\mu(t+1)D_\Phi(x^*,x_t) - \mu(t+2)D_\Phi(x^*,x_{t+1})\big]\\
&\le \frac{\frac{L^2(1+\log(T+1))}{4\mu} + r(x_0)-\inf r + \mu D_\Phi(x^*,x_0)}{T+1} - \frac{T+2}{T+1}\,\mu\,\mathbb{E}[D_\Phi(x^*,x_{T+1})],
\end{aligned}
\]
where the last inequality follows from telescoping the terms in the sum and using the lower bound $r(x_{T+1})\ge\inf r$. Since $\frac{T+2}{T+1}\ge 1$, this completes the proof.

A Proofs of auxiliary results

A.1 Proof of Proposition 3.2

Let us write, for the two functions
\[
\check\Phi(x) := \sum_{i=0}^n \frac{a_i}{i+2}\|x\|_2^{i+2} \qquad\text{and}\qquad \widetilde\Phi(x) := \sum_{i=0}^n 3a_i\|x\|_2^{i+2},
\]
the decomposition $\Phi = \check\Phi + \widetilde\Phi$. The result [29, Equation (25)] yields the estimate
\[
D_{\check\Phi}(y,x) \ge \frac12\sum_{i=0}^n a_i\|x\|_2^i\,\|x-y\|_2^2 \qquad \forall x,y.
\]
Thus the proof will be complete once we establish the inequality
\[
D_{\widetilde\Phi}(y,x) \ge \frac12\sum_{i=0}^n a_i\|y\|_2^i\,\|x-y\|_2^2 \qquad\forall x,y. \tag{A.1}
\]
To this end, fix an index $i$, and set $\eta := [3(i+2)]^{-1}$ and $\widetilde\Phi_i(x) := 3a_i\|x\|_2^{i+2}$. We will show
\[
D_{\widetilde\Phi_i}(y,x) \ge \frac{a_i}{2}\|y\|_2^i\,\|x-y\|_2^2,
\]

which together with the identity $D_{\widetilde\Phi}(y,x)=\sum_{i=0}^n D_{\widetilde\Phi_i}(y,x)$ completes the proof of (A.1). A quick computation shows that
\[
D_{\widetilde\Phi_i}(y,x) = 3a_i\Big(\|y\|_2^{i+2} + (i+1)\|x\|_2^{i+2} - (i+2)\|x\|_2^i\langle x,y\rangle\Big).
\]
Let us consider two cases. First suppose that $\eta^{1/i}\|x\|_2 \ge \|y\|_2$. In this case, [29, Proposition 5.1] implies
\[
D_{\widetilde\Phi_i}(y,x) \ge \frac{a_i}{2\eta}\|x\|_2^i\,\|x-y\|_2^2 \ge \frac{a_i}{2}\|y\|_2^i\,\|x-y\|_2^2,
\]
as desired. Now suppose that $\|y\|_2 \ge \eta^{1/i}\|x\|_2$. We will show that $D_{\widetilde\Phi_i}(y,x) \ge \eta\, D_{\widetilde\Phi_i}(x,y)$, which will complete the proof, since
\[
\eta\, D_{\widetilde\Phi_i}(x,y) \ge \frac{a_i}{2}\|y\|_2^i\,\|x-y\|_2^2,
\]
by [29, Proposition 5.1]. To that end, we compute
\[
\begin{aligned}
D_{\widetilde\Phi_i}(y,x) - \eta\, D_{\widetilde\Phi_i}(x,y) &= 3a_i\Big(\|y\|_2^{i+2}+(i+1)\|x\|_2^{i+2}-(i+2)\|x\|_2^i\langle x,y\rangle\Big)\\
&\quad - 3a_i\eta\Big(\|x\|_2^{i+2}+(i+1)\|y\|_2^{i+2}-(i+2)\|y\|_2^i\langle x,y\rangle\Big)\\
&= 3a_i\Big(\big(1-\eta(i+1)\big)\|y\|_2^{i+2} + \big(i+1-\eta\big)\|x\|_2^{i+2} + (i+2)\big(\eta\|y\|_2^i-\|x\|_2^i\big)\langle x,y\rangle\Big).
\end{aligned}
\]
Let us show that the right-hand side is nonnegative. By Cauchy-Schwarz and the case assumption $\|x\|_2 \le \eta^{-1/i}\|y\|_2$, we may upper bound
\[
(i+2)\|x\|_2^i\langle x,y\rangle \le (i+2)\|x\|_2^{i+1}\|y\|_2 \le (i+2)\,\eta^{-(1+i)/i}\,\|y\|_2^{i+2}.
\]
Lower bounding the remaining terms in the same way and comparing the resulting coefficients of $\|y\|_2^{i+2}$ and $\|x\|_2^{i+2}$, a direct computation reduces the claim to a scalar inequality in $\eta$, which holds by the definition $\eta = [3(i+2)]^{-1}$. Thus the result is proved.
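The lower bound just established lends itself to a quick numerical sanity check. The sketch below tests the inequality $D_{\widetilde\Phi}(y,x) \ge \tfrac12\sum_i a_i\|y\|_2^i\|x-y\|_2^2$ for the single illustrative choice $n=1$, $a_0=a_1=1$ (test values of ours), so that $\widetilde\Phi(x)=3\|x\|_2^2+3\|x\|_2^3$:

```python
import numpy as np

def phi(x):
    # Phi~(x) = 3*a_0*||x||^2 + 3*a_1*||x||^3 with a_0 = a_1 = 1
    n = np.linalg.norm(x)
    return 3 * n ** 2 + 3 * n ** 3

def grad_phi(x):
    # gradient: 6x + 9*||x||*x
    n = np.linalg.norm(x)
    return 6 * x + 9 * n * x

def breg(y, x):
    # Bregman divergence D_{Phi~}(y, x)
    return phi(y) - phi(x) - grad_phi(x) @ (y - x)

rng = np.random.default_rng(0)
for _ in range(2000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    ny = np.linalg.norm(y)
    # claimed lower bound: (1/2)*(a_0*||y||^0 + a_1*||y||^1)*||x - y||^2
    lower = 0.5 * (1 + ny) * np.dot(x - y, x - y)
    assert breg(y, x) >= lower - 1e-9
```

A failure of the assertion on any sample would falsify the bound for these coefficients; none occurs.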

A.2 An auxiliary lemma on sequences

Lemma A.1. Consider any nonincreasing sequence $\{a_t\}_{t\ge0}\subset\mathbb{R}_{++}$ and any sequence $\{b_t\}_{t\ge0}\subset\mathbb{R}$ that is bounded below. Then for any index $T\in\mathbb{N}$, we have
\[
\sum_{t=0}^T a_t(b_t-b_{t+1}) \le a_0(b_0-b^*),
\]
where we set $b^* = \inf_{t\ge0} b_t$.

Proof. We successively deduce
\[
\begin{aligned}
\sum_{t=0}^T a_t(b_t-b_{t+1}) &= \sum_{t=0}^T a_t\big[(b_t-b^*)-(b_{t+1}-b^*)\big]\\
&= a_0(b_0-b^*) - a_T(b_{T+1}-b^*) + \sum_{t=0}^{T-1}(a_{t+1}-a_t)(b_{t+1}-b^*)\\
&\le a_0(b_0-b^*),
\end{aligned}
\]
as claimed.

A.3 Proofs of Propositions 3.4 and 3.5

Proof of Proposition 3.4. Using the fundamental theorem of calculus and convexity of the function $x\mapsto p(\|x\|_2)$, we compute
\[
\begin{aligned}
\big\|c(x,\xi)+\nabla c(x,\xi)(y-x)-c(y,\xi)\big\|_2 &= \left\|\int_0^1\big(\nabla c(x+t(y-x),\xi)-\nabla c(x,\xi)\big)(y-x)\,dt\right\|_2\\
&\le \int_0^1 \big\|\nabla c(x+t(y-x),\xi)-\nabla c(x,\xi)\big\|_{\rm op}\,\|y-x\|_2\,dt\\
&\le L_2(\xi)\|y-x\|_2^2\int_0^1 \big(p(\|x+t(y-x)\|_2)+p(\|x\|_2)\big)\,t\,dt\\
&\le L_2(\xi)\|y-x\|_2^2\int_0^1\big((1-t)p(\|x\|_2)+tp(\|y\|_2)+p(\|x\|_2)\big)\,t\,dt\\
&\le \frac{2L_2(\xi)\|y-x\|_2^2}{3}\big(p(\|x\|_2)+p(\|y\|_2)\big).
\end{aligned}
\]
Hence, we deduce
\[
\begin{aligned}
h\big(c(x,\xi)+\nabla c(x,\xi)(y-x),\xi\big) - h\big(c(y,\xi),\xi\big) &\le L_1(\xi)\,\big\|c(x,\xi)+\nabla c(x,\xi)(y-x)-c(y,\xi)\big\|_2\\
&\le \frac{2}{3}L_1(\xi)L_2(\xi)\,\|y-x\|_2^2\,\big(p(\|x\|_2)+p(\|y\|_2)\big)\\
&\le \frac{4}{3}L_1(\xi)L_2(\xi)\,D_\Phi(y,x),
\end{aligned}
\]
where the last inequality follows from Proposition 3.2. Taking expectations yields the claimed guarantee.
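The Abel-summation argument behind Lemma A.1 is easy to check numerically; the sequences below are arbitrary illustrative choices of ours, with a nonincreasing positive $\{a_t\}$ and with $b^*$ taken as the minimum over the generated $b$-values:

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.sort(rng.uniform(0.1, 1.0, 50))[::-1]   # nonincreasing, strictly positive
b = rng.normal(size=51)                        # arbitrary real sequence b_0..b_50

lhs = sum(a[t] * (b[t] - b[t + 1]) for t in range(50))
rhs = a[0] * (b[0] - b.min())                  # a_0 * (b_0 - b*)
assert lhs <= rhs + 1e-12                      # the claim of Lemma A.1
```

The inequality holds with room to spare whenever the $b$-sequence is far from monotone decreasing, since the discarded terms in the summation-by-parts identity are then strictly negative.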

Proof of Proposition 3.5. We successively compute
\[
\begin{aligned}
h\big(c(x,\xi),\xi\big) - h\big(c(x,\xi)+\nabla c(x,\xi)(y-x),\xi\big) &\le L_1(\xi)\,\big\|\nabla c(x,\xi)(y-x)\big\|_2\\
&\le L_1(\xi)L_3(\xi)\,q(\|x\|_2)\,\|y-x\|_2\\
&\le 2L_1(\xi)L_3(\xi)\,\sqrt{D_\Phi(y,x)},
\end{aligned}
\]
where the last line follows from [29, Equation (25)]. The result follows.

A.4 Proof of Theorem 4.1

First we rewrite $F_\Phi$, using the definition of the Bregman divergence, as
\[
\begin{aligned}
F_\Phi(x) &= \inf_y\Big\{F(y)+\tfrac1\lambda\big(\Phi(y)-\Phi(x)-\langle\nabla\Phi(x),\,y-x\rangle\big)\Big\}\\
&= -\sup_y\Big\{\big\langle\tfrac1\lambda\nabla\Phi(x),\,y\big\rangle - \big(F+\tfrac1\lambda\Phi\big)(y)\Big\} - \tfrac1\lambda\Phi(x) + \tfrac1\lambda\langle\nabla\Phi(x),\,x\rangle\\
&= -\big(F+\tfrac1\lambda\Phi\big)^\star\big(\tfrac1\lambda\nabla\Phi(x)\big) - \tfrac1\lambda\Phi(x) + \tfrac1\lambda\langle\nabla\Phi(x),\,x\rangle.
\end{aligned}
\]
Note that $F+\tfrac1\lambda\Phi$ is closed and $\big(\tfrac1\lambda-(\rho+\tau)\big)$-strongly convex. Thus the conjugate $\big(F+\tfrac1\lambda\Phi\big)^\star$ is differentiable. By the chain and sum rules for differentiation, we have
\[
\nabla F_\Phi(x) = -\tfrac1\lambda\nabla^2\Phi(x)\,\nabla\big(F+\tfrac1\lambda\Phi\big)^\star\big(\tfrac1\lambda\nabla\Phi(x)\big) + \tfrac1\lambda\nabla^2\Phi(x)\,x = \tfrac1\lambda\nabla^2\Phi(x)\Big(x - \nabla\big(F+\tfrac1\lambda\Phi\big)^\star\big(\tfrac1\lambda\nabla\Phi(x)\big)\Big).
\]
The (sub)gradient of a convex conjugate function is simply the set of maximizers in the supremum defining the conjugate, so that
\[
\nabla\big(F+\tfrac1\lambda\Phi\big)^\star\big(\tfrac1\lambda\nabla\Phi(x)\big) = \operatorname*{argmax}_y\Big\{\big\langle\tfrac1\lambda\nabla\Phi(x),\,y\big\rangle-\big(F+\tfrac1\lambda\Phi\big)(y)\Big\} = \operatorname*{argmin}_y\Big\{F(y)+\tfrac1\lambda D_\Phi(y,x)\Big\} = \mathrm{prox}^\Phi_{\lambda F}(x).
\]
Putting everything together, we obtain $\nabla F_\Phi(x) = \tfrac1\lambda\nabla^2\Phi(x)(x-\hat x)$, as desired.

References

[1] A. Auslender and M. Teboulle. Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim., 16(3):697-725, 2006.

[2] H.H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res., 42(2):330-348, 2017.


More information

Downloaded 09/27/13 to Redistribution subject to SIAM license or copyright; see

Downloaded 09/27/13 to Redistribution subject to SIAM license or copyright; see SIAM J. OPTIM. Vol. 23, No., pp. 256 267 c 203 Society for Industrial and Applied Mathematics TILT STABILITY, UNIFORM QUADRATIC GROWTH, AND STRONG METRIC REGULARITY OF THE SUBDIFFERENTIAL D. DRUSVYATSKIY

More information

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Convex Analysis and Economic Theory AY Elementary properties of convex functions

Convex Analysis and Economic Theory AY Elementary properties of convex functions Division of the Humanities and Social Sciences Ec 181 KC Border Convex Analysis and Economic Theory AY 2018 2019 Topic 6: Convex functions I 6.1 Elementary properties of convex functions We may occasionally

More information

First-order optimality conditions for mathematical programs with second-order cone complementarity constraints

First-order optimality conditions for mathematical programs with second-order cone complementarity constraints First-order optimality conditions for mathematical programs with second-order cone complementarity constraints Jane J. Ye Jinchuan Zhou Abstract In this paper we consider a mathematical program with second-order

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

FIXED POINT ITERATIONS

FIXED POINT ITERATIONS FIXED POINT ITERATIONS MARKUS GRASMAIR 1. Fixed Point Iteration for Non-linear Equations Our goal is the solution of an equation (1) F (x) = 0, where F : R n R n is a continuous vector valued mapping in

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Sequential Unconstrained Minimization: A Survey

Sequential Unconstrained Minimization: A Survey Sequential Unconstrained Minimization: A Survey Charles L. Byrne February 21, 2013 Abstract The problem is to minimize a function f : X (, ], over a non-empty subset C of X, where X is an arbitrary set.

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Second order forward-backward dynamical systems for monotone inclusion problems

Second order forward-backward dynamical systems for monotone inclusion problems Second order forward-backward dynamical systems for monotone inclusion problems Radu Ioan Boţ Ernö Robert Csetnek March 6, 25 Abstract. We begin by considering second order dynamical systems of the from

More information

Convex Functions and Optimization

Convex Functions and Optimization Chapter 5 Convex Functions and Optimization 5.1 Convex Functions Our next topic is that of convex functions. Again, we will concentrate on the context of a map f : R n R although the situation can be generalized

More information

Convex Analysis Notes. Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE

Convex Analysis Notes. Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE Convex Analysis Notes Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE These are notes from ORIE 6328, Convex Analysis, as taught by Prof. Adrian Lewis at Cornell University in the

More information

Stochastic subgradient method converges on tame functions

Stochastic subgradient method converges on tame functions Stochastic subgradient method converges on tame functions arxiv:1804.07795v3 [math.oc] 26 May 2018 Damek Davis Dmitriy Drusvyatskiy Sham Kakade Jason D. Lee Abstract This work considers the question: what

More information

Tame variational analysis

Tame variational analysis Tame variational analysis Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with Daniilidis (Chile), Ioffe (Technion), and Lewis (Cornell) May 19, 2015 Theme: Semi-algebraic geometry

More information

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow

More information

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Joint research

More information

A characterization of essentially strictly convex functions on reflexive Banach spaces

A characterization of essentially strictly convex functions on reflexive Banach spaces A characterization of essentially strictly convex functions on reflexive Banach spaces Michel Volle Département de Mathématiques Université d Avignon et des Pays de Vaucluse 74, rue Louis Pasteur 84029

More information

NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM

NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM JIAYI GUO AND A.S. LEWIS Abstract. The popular BFGS quasi-newton minimization algorithm under reasonable conditions converges globally on smooth

More information

Some Properties of the Augmented Lagrangian in Cone Constrained Optimization

Some Properties of the Augmented Lagrangian in Cone Constrained Optimization MATHEMATICS OF OPERATIONS RESEARCH Vol. 29, No. 3, August 2004, pp. 479 491 issn 0364-765X eissn 1526-5471 04 2903 0479 informs doi 10.1287/moor.1040.0103 2004 INFORMS Some Properties of the Augmented

More information

MOSCO STABILITY OF PROXIMAL MAPPINGS IN REFLEXIVE BANACH SPACES

MOSCO STABILITY OF PROXIMAL MAPPINGS IN REFLEXIVE BANACH SPACES MOSCO STABILITY OF PROXIMAL MAPPINGS IN REFLEXIVE BANACH SPACES Dan Butnariu and Elena Resmerita Abstract. In this paper we establish criteria for the stability of the proximal mapping Prox f ϕ =( ϕ+ f)

More information

Convex Analysis Background

Convex Analysis Background Convex Analysis Background John C. Duchi Stanford University Park City Mathematics Institute 206 Abstract In this set of notes, we will outline several standard facts from convex analysis, the study of

More information

Math 273a: Optimization Convex Conjugacy

Math 273a: Optimization Convex Conjugacy Math 273a: Optimization Convex Conjugacy Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Convex conjugate (the Legendre transform) Let f be a closed proper

More information

The Proximal Gradient Method

The Proximal Gradient Method Chapter 10 The Proximal Gradient Method Underlying Space: In this chapter, with the exception of Section 10.9, E is a Euclidean space, meaning a finite dimensional space endowed with an inner product,

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

Maximal Monotone Inclusions and Fitzpatrick Functions

Maximal Monotone Inclusions and Fitzpatrick Functions JOTA manuscript No. (will be inserted by the editor) Maximal Monotone Inclusions and Fitzpatrick Functions J. M. Borwein J. Dutta Communicated by Michel Thera. Abstract In this paper, we study maximal

More information

(convex combination!). Use convexity of f and multiply by the common denominator to get. Interchanging the role of x and y, we obtain that f is ( 2M ε

(convex combination!). Use convexity of f and multiply by the common denominator to get. Interchanging the role of x and y, we obtain that f is ( 2M ε 1. Continuity of convex functions in normed spaces In this chapter, we consider continuity properties of real-valued convex functions defined on open convex sets in normed spaces. Recall that every infinitedimensional

More information

A convergence result for an Outer Approximation Scheme

A convergence result for an Outer Approximation Scheme A convergence result for an Outer Approximation Scheme R. S. Burachik Engenharia de Sistemas e Computação, COPPE-UFRJ, CP 68511, Rio de Janeiro, RJ, CEP 21941-972, Brazil regi@cos.ufrj.br J. O. Lopes Departamento

More information

Convexity in R n. The following lemma will be needed in a while. Lemma 1 Let x E, u R n. If τ I(x, u), τ 0, define. f(x + τu) f(x). τ.

Convexity in R n. The following lemma will be needed in a while. Lemma 1 Let x E, u R n. If τ I(x, u), τ 0, define. f(x + τu) f(x). τ. Convexity in R n Let E be a convex subset of R n. A function f : E (, ] is convex iff f(tx + (1 t)y) (1 t)f(x) + tf(y) x, y E, t [0, 1]. A similar definition holds in any vector space. A topology is needed

More information

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014 Convex Optimization Dani Yogatama School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA February 12, 2014 Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12,

More information

Convex Functions. Pontus Giselsson

Convex Functions. Pontus Giselsson Convex Functions Pontus Giselsson 1 Today s lecture lower semicontinuity, closure, convex hull convexity preserving operations precomposition with affine mapping infimal convolution image function supremum

More information

Primal-dual subgradient methods for convex problems

Primal-dual subgradient methods for convex problems Primal-dual subgradient methods for convex problems Yu. Nesterov March 2002, September 2005 (after revision) Abstract In this paper we present a new approach for constructing subgradient schemes for different

More information

On the convexity of piecewise-defined functions

On the convexity of piecewise-defined functions On the convexity of piecewise-defined functions arxiv:1408.3771v1 [math.ca] 16 Aug 2014 Heinz H. Bauschke, Yves Lucet, and Hung M. Phan August 16, 2014 Abstract Functions that are piecewise defined are

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions

A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions Angelia Nedić and Asuman Ozdaglar April 15, 2006 Abstract We provide a unifying geometric framework for the

More information

Lecture Notes on Iterative Optimization Algorithms

Lecture Notes on Iterative Optimization Algorithms Charles L. Byrne Department of Mathematical Sciences University of Massachusetts Lowell December 8, 2014 Lecture Notes on Iterative Optimization Algorithms Contents Preface vii 1 Overview and Examples

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

3.1 Convexity notions for functions and basic properties

3.1 Convexity notions for functions and basic properties 3.1 Convexity notions for functions and basic properties We start the chapter with the basic definition of a convex function. Definition 3.1.1 (Convex function) A function f : E R is said to be convex

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo January 29, 2012 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

Subgradients. subgradients and quasigradients. subgradient calculus. optimality conditions via subgradients. directional derivatives

Subgradients. subgradients and quasigradients. subgradient calculus. optimality conditions via subgradients. directional derivatives Subgradients subgradients and quasigradients subgradient calculus optimality conditions via subgradients directional derivatives Prof. S. Boyd, EE392o, Stanford University Basic inequality recall basic

More information

The local equicontinuity of a maximal monotone operator

The local equicontinuity of a maximal monotone operator arxiv:1410.3328v2 [math.fa] 3 Nov 2014 The local equicontinuity of a maximal monotone operator M.D. Voisei Abstract The local equicontinuity of an operator T : X X with proper Fitzpatrick function ϕ T

More information

Selected Examples of CONIC DUALITY AT WORK Robust Linear Optimization Synthesis of Linear Controllers Matrix Cube Theorem A.

Selected Examples of CONIC DUALITY AT WORK Robust Linear Optimization Synthesis of Linear Controllers Matrix Cube Theorem A. . Selected Examples of CONIC DUALITY AT WORK Robust Linear Optimization Synthesis of Linear Controllers Matrix Cube Theorem A. Nemirovski Arkadi.Nemirovski@isye.gatech.edu Linear Optimization Problem,

More information

BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption

BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption Daniel Reem (joint work with Simeon Reich and Alvaro De Pierro) Department of Mathematics, The Technion,

More information

The proximal mapping

The proximal mapping The proximal mapping http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/37 1 closed function 2 Conjugate function

More information

Optimality Conditions for Constrained Optimization

Optimality Conditions for Constrained Optimization 72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

A user s guide to Lojasiewicz/KL inequalities

A user s guide to Lojasiewicz/KL inequalities Other A user s guide to Lojasiewicz/KL inequalities Toulouse School of Economics, Université Toulouse I SLRA, Grenoble, 2015 Motivations behind KL f : R n R smooth ẋ(t) = f (x(t)) or x k+1 = x k λ k f

More information

Strongly convex functions, Moreau envelopes and the generic nature of convex functions with strong minimizers

Strongly convex functions, Moreau envelopes and the generic nature of convex functions with strong minimizers University of Wollongong Research Online Faculty of Engineering and Information Sciences - Papers: Part B Faculty of Engineering and Information Sciences 206 Strongly convex functions, Moreau envelopes

More information