Pavel Dvurechensky, Alexander Gasnikov, Alexander Tiurin. July 26, 2017


Randomized Similar Triangles Method: A Unifying Framework for Accelerated Randomized Optimization Methods (Coordinate Descent, Directional Search, Derivative-Free Method)

Pavel Dvurechensky, Alexander Gasnikov, Alexander Tiurin

July 26, 2017

Abstract

In this paper, we consider smooth convex optimization problems with simple constraints and inexactness in the oracle information such as value, partial or directional derivatives of the objective function. We introduce a unifying framework, which allows to construct different types of accelerated randomized methods for such problems and to prove convergence rate theorems for them. We focus on accelerated random block-coordinate descent, accelerated random directional search, accelerated random derivative-free method and, using our framework, provide their versions for problems with inexact oracle information. Our contribution also includes accelerated random block-coordinate descent with inexact oracle and entropy proximal setup as well as a derivative-free version of this method.

Keywords: convex optimization, accelerated random block-coordinate descent, accelerated random directional search, accelerated random derivative-free method, inexact oracle, complexity, accelerated gradient descent methods, first-order methods, zero-order methods.

AMS Classification: 90C25, 90C30, 90C06, 90C56, 68Q25, 65K05, 49M27, 68W20, 65Y20, 68W40.

Affiliations: Weierstrass Institute for Applied Analysis and Stochastics, Berlin; Institute for Information Transmission Problems RAS, Moscow, paveldvurechensky@wias-berlin.de. Moscow Institute of Physics and Technology, Moscow; Institute for Information Transmission Problems RAS, Moscow, gasnikov@yandex.ru. National Research University Higher School of Economics, Moscow, alexandertiurin@gmail.com.

The results obtained in this paper were presented in December (www.mathnet.ru/php/seminars.phtml?option_lang=rus&presentid=16180) and in June (www.lccc.lth.se/index.php?mact=reglerseminars,cntnt01,abstractbio,0&cntnt01abstractid=889&cntnt01returnid=116).

Introduction

In this paper, we consider smooth convex optimization problems with simple constraints and inexactness in the oracle information such as value, partial or directional derivatives of the objective function. Different types of randomized optimization algorithms, such as random coordinate descent or stochastic gradient descent for the empirical risk minimization problem, have been extensively studied in the past decade with the main application being convex optimization problems. Our main focus in this paper is on accelerated randomized methods: random block-coordinate descent, random directional search, random derivative-free method. As opposed to non-accelerated methods, these methods require $O\left(\frac{1}{\sqrt{\varepsilon}}\right)$ iterations to achieve an objective function residual $\varepsilon$.

Accelerated random block-coordinate descent was first proposed in Nesterov [2012], which was the starting point for active research in this direction. The idea of the method is, on each iteration, to randomly choose a block of coordinates in the decision variable and make a step using the derivative of the objective function with respect to the chosen coordinates. Accelerated random directional search and accelerated random derivative-free method were first proposed in 2011 and published recently in Nesterov and Spokoiny [2017], but there was no extensive research in this direction. The idea of random directional search is to use a projection of the objective's gradient onto a randomly chosen direction to make a step on each iteration. The random derivative-free method uses the same idea, but the random projection of the gradient is approximated by a finite-difference, i.e. the difference of values of the objective function at two close points. This also means that it is a zero-order method which uses only function values to make a step.

Existing accelerated randomized methods have different convergence analyses. This motivated us to pose the main question we address in this paper as follows. Is it possible to find a crucial part of the convergence rate analysis and use it to systematically construct new accelerated randomized methods? To some extent, our answer is "yes". We determine three main assumptions and use them to prove a convergence rate theorem for our generic accelerated randomized method. Our framework allows both to reproduce known and to construct new accelerated randomized methods. The latter include a new accelerated random block-coordinate descent with inexact block derivatives and entropy proximal setup.

Related Work

In the seminal paper Nesterov [2012], Nesterov proposed random block-coordinate descent for convex optimization problems with simple convex separable constraints and accelerated random block-coordinate descent for unconstrained convex optimization problems. In Lee and Sidford [2013], Lee and Sidford proposed accelerated random block-coordinate descent with non-uniform probability of choosing a particular block of coordinates. They also developed an efficient implementation without full-dimensional operations on each iteration. Fercoq and Richtárik in Fercoq and Richtárik [2015] introduced accelerated block-coordinate descent for composite optimization problems, which include problems with separable constraints. Later, Lin, Lu and Xiao in Lin et al. [2014] extended this method to strongly convex problems.

In May 2015, Nesterov and Stich presented an accelerated block-coordinate descent with a complexity which does not explicitly depend on the problem dimension. This result was recently published in Nesterov and Stich [2017]. A similar complexity was obtained also by Allen-Zhu, Qu, Richtárik and Yuan in Allen-Zhu et al. [2016] and by Gasnikov, Dvurechensky and Usmanova in Gasnikov et al. [2016c]. We also mention a special type of accelerated block-coordinate descent of Shalev-Shwartz and Zhang developed in Shalev-Shwartz and Zhang [2014] for empirical risk minimization problems. All these accelerated block-coordinate descent methods work in the Euclidean setup, when the norm in each block is Euclidean and defined using some positive semidefinite matrix. Non-accelerated block-coordinate methods, but with non-Euclidean setup, were considered by Dang and Lan in Dang and Lan [2015]. All the mentioned methods rely on exact block derivatives and exact projection on each step. Inexact projection in the context of non-accelerated random coordinate descent was considered by Tappenden, Richtárik and Gondzio in Tappenden et al. [2016].

Research on accelerated random directional search and accelerated random derivative-free methods started in Nesterov and Spokoiny [2017]. Mostly non-accelerated derivative-free methods were further developed in the context of inexact function values in Gasnikov et al. [2016a,b], Bogolubsky et al. [2016], Gasnikov et al. [2017]. We should also mention that there are other accelerated randomized methods in Frostig et al. [2015], Lin et al. [2015], Zhang and Lin [2015], Allen-Zhu [2017], Lan and Zhou [2017]. Most of these methods were developed deliberately for empirical risk minimization problems and do not fall in the scope of this paper.

Our Approach and Contributions

Our framework has two main components, namely, the Randomized Inexact Oracle and the Randomized Similar Triangles Method. The starting point for the definition of our oracle is a unified view on random directional search and random block-coordinate descent. In both these methods, on each iteration, a randomized approximation for the objective function's gradient is calculated and used, instead of the true gradient, to make a step. This approximation for the gradient is constructed by a projection on a randomly chosen subspace. For random directional search, this subspace is the line along a randomly generated direction. As a result, a directional derivative in this direction is calculated. For random block-coordinate descent, this subspace is given by a randomly chosen block of coordinates and a block derivative is calculated. One of the key features of these approximations is that they are unbiased, i.e. their expectation is equal to the true gradient. We generalize the two mentioned approaches by allowing other types of random transformations of the gradient for constructing its randomized approximation.

The inexactness of our oracle is inspired by the relation between the derivative-free method and directional search. In the framework of derivative-free methods, only the value of the objective function is available for use in an algorithm. At the same time, if the objective function is smooth, the directional derivative can be well approximated by the difference of function values at two points which are close to each other.

Thus, in the context of zero-order optimization, one can calculate only an inexact directional derivative. Hence, one can construct only a biased randomized approximation for the gradient when a random direction is used. We combine the previously mentioned random transformations of the gradient with possible inexactness of these transformations to construct our Randomized Inexact Oracle, which we use in our generic algorithm to make a step on each iteration.

The basis of our generic algorithm is the Similar Triangles Method of Tyurin [2017] (see also Dvurechensky et al. [2017]), which is an accelerated gradient method with only one proximal mapping on each iteration, this proximal mapping being essentially the Mirror Descent step. The notable point is that we only need to substitute the true gradient with our Randomized Inexact Oracle and slightly change one step in the Similar Triangles Method to obtain our generic accelerated randomized algorithm, which we call Randomized Similar Triangles Method (RSTM), see Algorithm 1. We prove a convergence rate theorem for RSTM in two cases: when the inexactness of the Randomized Inexact Oracle can be controlled and adjusted on each iteration of the algorithm, and when the inexactness can not be controlled.

We apply our framework to several particular settings: random directional search, random coordinate descent, random block-coordinate descent and their combinations with the derivative-free approach. As a corollary of our main theorem, we obtain both known and new results on the convergence of different accelerated randomized methods with inexact oracle.

To sum up, our contributions in this paper are as follows.

- We introduce a general framework for constructing and analyzing different types of accelerated randomized methods, such as accelerated random directional search, accelerated block-coordinate descent, accelerated derivative-free methods. Our framework allows to obtain both known and new methods and their convergence rate guarantees as a corollary of our main Theorem 1.

- Using our framework, we introduce new accelerated methods with inexact oracle, namely, accelerated random directional search, accelerated random block-coordinate descent, accelerated derivative-free method. To the best of our knowledge, such methods with inexact oracle were not known before. See Section 3.

- Based on our framework, we introduce a new accelerated random block-coordinate descent with inexact oracle and non-Euclidean setup, which was not done before in the literature. The main application of this method is minimization of functions on a direct product of a large number of low-dimensional simplexes. See Subsection 3.3.

- We introduce a new accelerated random derivative-free block-coordinate descent with inexact oracle and non-Euclidean setup. Such a method was not known before in the literature. Our method is similar to the method in the previous item, but uses only finite-difference approximations for block derivatives. See Subsection 3.6.

The rest of the paper is organized as follows. In Section 1, we provide the problem statement, motivate and make our three main assumptions, and illustrate them by random directional search and random block-coordinate descent. In Section 2, we introduce our main algorithm, called Randomized Similar Triangles Method, and, based on the stated general assumptions, prove convergence rate Theorem 1.

Section 3 is devoted to applications of our general framework in different particular settings, namely Accelerated Random Directional Search (Subsection 3.1), Accelerated Random Coordinate Descent (Subsection 3.2), Accelerated Random Block-Coordinate Descent (Subsection 3.3), Accelerated Random Derivative-Free Directional Search (Subsection 3.4), Accelerated Random Derivative-Free Coordinate Descent (Subsection 3.5), Accelerated Random Derivative-Free Block-Coordinate Descent (Subsection 3.6), Accelerated Random Derivative-Free Block-Coordinate Descent with Random Approximations for Block Derivatives (Subsection 3.7).

1 Preliminaries

1.1 Notation

Let the finite-dimensional real vector space $E$ be a direct product of $n$ finite-dimensional real vector spaces $E_i$, $i = 1, \dots, n$, i.e. $E = \bigotimes_{i=1}^n E_i$, and $\dim E_i = p_i$, $i = 1, \dots, n$. Denote also $p = \sum_{i=1}^n p_i$. Let, for $i = 1, \dots, n$, $E_i^*$ denote the dual space for $E_i$. Then, the space dual to $E$ is $E^* = \bigotimes_{i=1}^n E_i^*$. Given a vector $x^{(i)} \in E_i$ for some $i \in \{1, \dots, n\}$, we denote by $[x^{(i)}]_j$ its $j$-th coordinate, where $j \in \{1, \dots, p_i\}$. To formalize the relationship between vectors in $E_i$, $i = 1, \dots, n$, and vectors in $E$, we define primal partition operators $U_i : E_i \to E$, $i = 1, \dots, n$, by the identity

$x = (x^{(1)}, \dots, x^{(n)}) = \sum_{i=1}^n U_i x^{(i)}, \quad x^{(i)} \in E_i, \ i = 1, \dots, n, \ x \in E. \quad (1)$

For any fixed $i \in \{1, \dots, n\}$, $U_i$ maps a vector $x^{(i)} \in E_i$ to the vector $(0, \dots, x^{(i)}, \dots, 0) \in E$. The adjoint operator $U_i^T : E^* \to E_i^*$, then, is the operator which maps a vector $g = (g^{(1)}, \dots, g^{(i)}, \dots, g^{(n)}) \in E^*$ to the vector $g^{(i)} \in E_i^*$. Similarly, we define dual partition operators $\tilde U_i : E_i^* \to E^*$, $i = 1, \dots, n$, by the identity

$g = (g^{(1)}, \dots, g^{(n)}) = \sum_{i=1}^n \tilde U_i g^{(i)}, \quad g^{(i)} \in E_i^*, \ i = 1, \dots, n, \ g \in E^*. \quad (2)$

For all $i = 1, \dots, n$, we denote the value of a linear function $g^{(i)} \in E_i^*$ at a point $x^{(i)} \in E_i$ by $\langle g^{(i)}, x^{(i)} \rangle_i$. We define

$\langle g, x \rangle = \sum_{i=1}^n \langle g^{(i)}, x^{(i)} \rangle_i, \quad x \in E, \ g \in E^*.$
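To make the partition operators concrete, here is a small sketch of ours (not part of the paper): it represents $E = \mathbb{R}^{p_1} \times \dots \times \mathbb{R}^{p_n}$ as one stacked numpy vector and implements $U_i$ and $U_i^T$; in this Euclidean representation the dual partition operator $\tilde U_i$ acts on a dual vector in exactly the same embedding way. All names are ours.

```python
import numpy as np

# Block sizes p_1, ..., p_n; E is represented as one stacked vector of length p = sum(p_i).
p = [2, 3, 4]
offsets = np.cumsum([0] + p)          # start index of each block inside the stacked vector

def U(i, x_i):
    """Primal partition operator U_i: E_i -> E, embeds block i as (0, ..., x_i, ..., 0)."""
    x = np.zeros(offsets[-1])
    x[offsets[i]:offsets[i + 1]] = x_i
    return x

def U_T(i, g):
    """Adjoint U_i^T: E^* -> E_i^*, extracts the i-th block of a dual vector g."""
    return g[offsets[i]:offsets[i + 1]]

# Identity (1): x = sum_i U_i x^(i), and the pairing <g, x> = sum_i <g^(i), x^(i)>_i.
blocks = [np.random.randn(pi) for pi in p]
x = sum(U(i, b) for i, b in enumerate(blocks))
g = np.random.randn(offsets[-1])
assert np.allclose(x, np.concatenate(blocks))
assert np.isclose(g @ x, sum(U_T(i, g) @ blocks[i] for i in range(len(p))))
```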

For all $i = 1, \dots, n$, let $\|\cdot\|_i$ be some norm on $E_i$ and $\|\cdot\|_{i,*}$ be the norm on $E_i^*$ which is dual to $\|\cdot\|_i$:

$\|g^{(i)}\|_{i,*} = \max_{\|x^{(i)}\|_i \le 1} \langle g^{(i)}, x^{(i)} \rangle_i.$

Given parameters $\beta_i \in \mathbb{R}_{++}$, $i = 1, \dots, n$, we define the norm of a vector $x = (x^{(1)}, \dots, x^{(n)}) \in E$ as

$\|x\|_E^2 = \sum_{i=1}^n \beta_i \|x^{(i)}\|_i^2.$

Then, clearly, the dual norm of a vector $g = (g^{(1)}, \dots, g^{(n)}) \in E^*$ is

$\|g\|_{E,*}^2 = \sum_{i=1}^n \beta_i^{-1} \|g^{(i)}\|_{i,*}^2.$

Throughout the paper, we consider an optimization problem with feasible set $Q$, which is assumed to be given as $Q = \bigotimes_{i=1}^n Q_i \subseteq E$, where $Q_i \subseteq E_i$, $i = 1, \dots, n$, are closed convex sets. To have more flexibility and be able to adapt the algorithm to the structure of the sets $Q_i$, $i = 1, \dots, n$, we introduce a proximal setup, see e.g. Ben-Tal and Nemirovski [2015]. For all $i = 1, \dots, n$, we choose a prox-function $d_i(x^{(i)})$ which is continuous, convex on $Q_i$ and

1. admits a continuous in $x^{(i)} \in Q_i^0$ selection of subgradients $\nabla d_i(x^{(i)})$, where $x^{(i)} \in Q_i^0 \subseteq Q_i$, and $Q_i^0$ is the set of all $x^{(i)}$ where $\nabla d_i(x^{(i)})$ exists;

2. is 1-strongly convex on $Q_i$ with respect to $\|\cdot\|_i$, i.e., for any $x^{(i)} \in Q_i^0$, $y^{(i)} \in Q_i$, it holds that $d_i(y^{(i)}) - d_i(x^{(i)}) - \langle \nabla d_i(x^{(i)}), y^{(i)} - x^{(i)} \rangle_i \ge \frac{1}{2}\|y^{(i)} - x^{(i)}\|_i^2$.

We also define the corresponding Bregman divergence $V_i[z^{(i)}](x^{(i)}) := d_i(x^{(i)}) - d_i(z^{(i)}) - \langle \nabla d_i(z^{(i)}), x^{(i)} - z^{(i)} \rangle_i$, $x^{(i)} \in Q_i$, $z^{(i)} \in Q_i^0$, $i = 1, \dots, n$. It is easy to see that

$V_i[z^{(i)}](x^{(i)}) \ge \frac{1}{2}\|x^{(i)} - z^{(i)}\|_i^2, \quad x^{(i)} \in Q_i, \ z^{(i)} \in Q_i^0, \ i = 1, \dots, n.$

Standard proximal setups, e.g. Euclidean, entropy, $\ell_1/\ell_2$, simplex, can be found in Ben-Tal and Nemirovski [2015]. It is easy to check that, for given parameters $\beta_i \in \mathbb{R}_{++}$, $i = 1, \dots, n$, the functions $d(x) = \sum_{i=1}^n \beta_i d_i(x^{(i)})$ and $V[z](x) = \sum_{i=1}^n \beta_i V_i[z^{(i)}](x^{(i)})$ are respectively a prox-function and a Bregman divergence corresponding to $Q$. Also, clearly,

$V[z](x) \ge \frac{1}{2}\|x - z\|_E^2, \quad x \in Q, \ z \in Q^0 := \bigotimes_{i=1}^n Q_i^0. \quad (3)$

For a differentiable function $f(x)$, we denote by $\nabla f(x) \in E^*$ its gradient.

1.2 Problem Statement and Assumptions

The main problem we consider is

$\min_{x \in Q \subseteq E} f(x), \quad (4)$

where $f(x)$ is a smooth convex function and $Q = \bigotimes_{i=1}^n Q_i \subseteq E$, with $Q_i \subseteq E_i$, $i = 1, \dots, n$, being closed convex sets.
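Before moving to the assumptions, the following small sketch (our own illustration, not from the paper) spells out the two proximal setups used later in Section 3, the Euclidean one on $E_i = \mathbb{R}^{p_i}$ and the entropy one on a standard simplex, together with their Bregman divergences and the block-weighted combination of (3).

```python
import numpy as np

# Euclidean setup on a block: d_i(x) = 0.5*||x||_2^2, 1-strongly convex w.r.t. ||.||_2.
def d_euclid(x):
    return 0.5 * np.dot(x, x)

def V_euclid(z, x):
    # Bregman divergence of d_euclid: V_i[z](x) = 0.5*||x - z||_2^2.
    return 0.5 * np.dot(x - z, x - z)

# Entropy setup on the standard simplex: d_i(x) = sum_j x_j ln x_j, 1-strongly convex w.r.t. ||.||_1.
def d_entropy(x):
    return np.sum(x * np.log(x))

def V_entropy(z, x):
    # Bregman divergence of d_entropy is the KL divergence: sum_j x_j ln(x_j / z_j).
    return np.sum(x * np.log(x / z))

# Block-weighted prox-function and Bregman divergence on E = E_1 x ... x E_n, cf. (3):
# d(x) = sum_i beta_i d_i(x^(i)),  V[z](x) = sum_i beta_i V_i[z^(i)](x^(i)) >= 0.5*||x - z||_E^2.
def V_weighted(blocks_z, blocks_x, betas, V_blocks):
    return sum(b * V(z, x) for b, V, z, x in zip(betas, V_blocks, blocks_z, blocks_x))

# Pinsker's inequality illustrates the 1-strong convexity of the entropy setup w.r.t. ||.||_1:
x = np.array([0.2, 0.3, 0.5]); z = np.array([0.4, 0.4, 0.2])
assert V_entropy(z, x) >= 0.5 * np.sum(np.abs(x - z)) ** 2
```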

We now list our main assumptions and illustrate them by two simple examples. More detailed examples are given in Section 3. As the first example here, we consider random directional search, in which the gradient of the function $f$ is approximated by a vector $\langle \nabla f(x), e \rangle e$, where $\langle \nabla f(x), e \rangle$ is the directional derivative in the direction $e$ and the random vector $e$ is uniformly distributed over the Euclidean sphere of radius 1. Our second example is random block-coordinate descent, in which the gradient of the function $f$ is approximated by a vector $\tilde U_i U_i^T \nabla f(x)$, where $U_i^T \nabla f(x)$ is the $i$-th block derivative and the block number $i$ is uniformly randomly sampled from $1, \dots, n$.

The common part in both these randomized gradient approximations is that, first, one randomly chooses a subspace, which is either the line parallel to $e$ or the $i$-th block of coordinates. Then, one projects the gradient onto this subspace by calculating either $\langle \nabla f(x), e \rangle$ or $U_i^T \nabla f(x)$. Finally, one lifts the obtained random projection back to the whole space $E$, either by multiplying the directional derivative by the vector $e$, or by applying the dual partition operator $\tilde U_i$. At the same time, in both cases, if one scales the obtained randomized approximation for the gradient by multiplying it by $n$, one obtains an unbiased randomized approximation of the gradient:

$\mathbb{E}_e \, n \langle \nabla f(x), e \rangle e = \nabla f(x), \quad \mathbb{E}_i \, n \tilde U_i U_i^T \nabla f(x) = \nabla f(x), \quad x \in Q.$

We also want our approach to allow construction of derivative-free methods. For a function $f$ with $L$-Lipschitz-continuous gradient, the directional derivative can be well approximated by the difference of function values at two close points. Namely, it holds that

$\langle \nabla f(x), e \rangle = \frac{f(x + \tau e) - f(x)}{\tau} + o(\tau),$

where $\tau > 0$ is a small parameter. Thus, if only the value of the function is available, one can calculate only an inexact directional derivative, which leads to a biased randomized approximation for the gradient when the direction is chosen randomly. These three features, namely, random projection and lifting up, an unbiased part of the randomized approximation for the gradient, and a bias in the randomized approximation for the gradient, lead us to the following assumption about the structure of our general Randomized Inexact Oracle.

Assumption 1 (Randomized Inexact Oracle). We access the function $f$ only through the Randomized Inexact Oracle $\tilde \nabla f(x)$, $x \in Q$, which is given by

$\tilde \nabla f(x) = \rho R_r (R_p^T \nabla f(x) + \xi(x)) \in E^*, \quad (5)$

where $\rho > 0$ is a known constant; $R_p$ is a random "projection" operator from some auxiliary space $H$ to $E$, and, hence, $R_p^T$, acting from $E^*$ to $H^*$, is the adjoint to $R_p$; $R_r : H^* \to E^*$ is also some random "reconstruction" operator; $\xi(x) \in H^*$ is a, possibly random, vector characterizing the error of the oracle. The oracle is also assumed to satisfy the following properties:

$\mathbb{E} \, \rho R_r R_p^T \nabla f(x) = \nabla f(x), \quad x \in Q, \quad (6)$

$\|R_r \xi(x)\|_{E,*} \le \delta, \quad x \in Q, \quad (7)$

where $\delta \ge 0$ is the oracle error level.
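The two running examples of Assumption 1 can be written down directly. The sketch below (our illustration; all names are ours) builds the oracle (5) for random directional search ($\rho = n$, $R_p^T g = \langle g, e \rangle$, $R_r t = t e$) and for random coordinate descent ($\rho = n$, $R_p^T g = U_i^T g$, $R_r = \tilde U_i$), and checks the unbiasedness property (6) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

def oracle_directional(grad, delta_noise=0.0):
    """Randomized Inexact Oracle (5) for directional search: n*(<grad, e> + xi)*e."""
    e = rng.normal(size=n)
    e /= np.linalg.norm(e)                 # e uniform on the Euclidean sphere of radius 1
    xi = delta_noise * rng.uniform(-1, 1)  # bounded error in the directional derivative
    return n * (grad @ e + xi) * e

def oracle_coordinate(grad, delta_noise=0.0):
    """Randomized Inexact Oracle (5) for coordinate descent: n*(grad_i + xi)*e_i."""
    i = rng.integers(n)
    xi = delta_noise * rng.uniform(-1, 1)
    out = np.zeros(n)
    out[i] = n * (grad[i] + xi)
    return out

# Property (6): with xi = 0 both oracles are unbiased estimates of the gradient.
grad = rng.normal(size=n)
for oracle in (oracle_directional, oracle_coordinate):
    approx = np.mean([oracle(grad) for _ in range(100000)], axis=0)
    assert np.allclose(approx, grad, atol=0.1)
```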

Let us make some comments on this assumption. The operator $R_p^T$ is a generalization of a random projection. For the case of random directional search, $H = \mathbb{R}$ and $R_p^T : E^* \to \mathbb{R}$ is given by $R_p^T g = \langle g, e \rangle$, $g \in E^*$. For the case of random block-coordinate descent, $H = E_i$ and $R_p^T : E^* \to E_i^*$ is given by $R_p^T g = U_i^T g$, $g \in E^*$. We assume that there is some additive error $\xi(x)$ in the generalized random projection $R_p^T \nabla f(x)$. This error can be introduced, for example, when a finite-difference approximation of the directional derivative is used. Finally, we lift the inexact random projection $R_p^T \nabla f(x) + \xi(x)$ back to $E^*$ by applying the operator $R_r$. For the case of random directional search, $R_r : \mathbb{R} \to E^*$ is given by $R_r t = t e$, $t \in \mathbb{R}$. For the case of random block-coordinate descent, $R_r : E_i^* \to E^*$ is given by $R_r g^{(i)} = \tilde U_i g^{(i)}$, $g^{(i)} \in E_i^*$. The number $\rho$ is the normalizing coefficient, which allows the part $\rho R_r R_p^T \nabla f(x)$ to be an unbiased randomized approximation of the gradient. This is expressed by equality (6). Finally, we assume that the error in our oracle is bounded, which is expressed by property (7). In our analysis, we consider two cases: the error $\xi$ can be controlled and $\delta$ can be appropriately chosen on each iteration of the algorithm; or the error $\xi$ can not be controlled and we only know the oracle error level $\delta$.

Let us move to the next assumption. As said, our generic algorithm is based on the Similar Triangles Method of Tyurin [2017] (see also Dvurechensky et al. [2017]), which is an accelerated gradient method with only one proximal mapping on each iteration. This proximal mapping is essentially the Mirror Descent step. For simplicity, let us consider here an unconstrained minimization problem in the Euclidean setting. This means that $Q_i = E_i = \mathbb{R}^{p_i}$, $\|x^{(i)}\|_i = \|x^{(i)}\|_2$, $i = 1, \dots, n$. Then, given a point $u \in E$, a number $\alpha$, and the gradient $\nabla f(y)$ at some point $y \in E$, the Mirror Descent step is

$u_+ = \arg\min_{x \in E} \left\{ \frac{1}{2}\|x - u\|_2^2 + \alpha \langle \nabla f(y), x \rangle \right\} = u - \alpha \nabla f(y).$

Now we want to substitute the gradient $\nabla f(y)$ with our Randomized Inexact Oracle $\tilde \nabla f(y)$. Then, we see that the step $u_+ = u - \alpha \tilde \nabla f(y)$ makes progress only in the subspace onto which the gradient is projected while constructing the Randomized Inexact Oracle. In other words, $u - u_+$ lies in the same subspace as $\tilde \nabla f(y)$. In our analysis, this is a desirable property and we formalize it as follows.

Assumption 2 (Regularity of Prox-Mapping). The set $Q$, norm $\|\cdot\|_E$, prox-function $d(x)$, and Randomized Inexact Oracle $\tilde \nabla f(x)$ are chosen in such a way that, for any $u, y \in Q$, $\alpha > 0$, the point

$u_+ = \arg\min_{x \in Q} \left\{ V[u](x) + \alpha \langle \tilde \nabla f(y), x \rangle \right\} \quad (8)$

satisfies

$\langle R_r R_p^T \nabla f(y), u - u_+ \rangle = \langle \nabla f(y), u - u_+ \rangle. \quad (9)$

The interpretation is that, in terms of the linear pairing with $u - u_+$, the unbiased part $R_r R_p^T \nabla f(y)$ of the Randomized Inexact Oracle makes the same progress as the true gradient $\nabla f(y)$.
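In the unconstrained Euclidean setting just discussed, the prox-mapping (8) has the explicit form $u_+ = u - \alpha \tilde\nabla f(y)$, and the regularity property (9) can be checked numerically: $u - u_+$ is proportional to the sampled direction, so pairing the unbiased part with $u - u_+$ gives the same value as pairing the true gradient with it. A small sketch of ours, with hypothetical names:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
grad_y = rng.normal(size=n)              # the true gradient at y
alpha = 0.3

# Randomized Inexact Oracle for directional search: rho = n, R_p^T g = <g, e>, R_r t = t e.
e = rng.normal(size=n)
e /= np.linalg.norm(e)
xi = 0.01 * rng.uniform(-1, 1)           # bounded oracle error
tilde_grad = n * (grad_y @ e + xi) * e   # oracle (5)

# Mirror Descent step (8) for Q = E = R^n with V[u](x) = 0.5*||x - u||_2^2:
u = rng.normal(size=n)
u_plus = u - alpha * tilde_grad

# Regularity (9): u - u_plus lies on the line spanned by e, so the unbiased part
# R_r R_p^T grad = <grad, e> e and the true gradient pair identically with it.
lhs = ((grad_y @ e) * e) @ (u - u_plus)
rhs = grad_y @ (u - u_plus)
assert np.isclose(lhs, rhs)
```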

Finally, we want to formalize the smoothness assumption for the function $f$. In our analysis, we use only the smoothness of $f$ in the direction of $u_+ - u$, where $u \in Q$ and $u_+$ is defined in (8). Thus, we consider two points $x, y \in Q$ which satisfy the equality $x = y + a(u_+ - u)$, where $a \in \mathbb{R}$. For random directional search, it is natural to assume that $f$ has $L$-Lipschitz-continuous gradient with respect to the Euclidean norm, i.e.

$f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \frac{L}{2}\|x - y\|_2^2, \quad x, y \in Q. \quad (10)$

Then, if we define $\|x\|_E^2 = L\|x\|_2^2$, we obtain that, for our choice $x = y + a(u_+ - u)$,

$f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \frac{1}{2}\|x - y\|_E^2.$

The usual assumption for random block-coordinate descent is that the gradient of $f$ is block-wise Lipschitz continuous. This means that, for all $i = 1, \dots, n$, the block derivative $f_i'(x) = U_i^T \nabla f(x)$ is $L_i$-Lipschitz continuous with respect to the chosen norm $\|\cdot\|_i$, i.e.

$\|f_i'(x + U_i h^{(i)}) - f_i'(x)\|_{i,*} \le L_i \|h^{(i)}\|_i, \quad h^{(i)} \in E_i, \ i = 1, \dots, n, \ x \in Q. \quad (11)$

By the standard reasoning, using (11), one can prove that, for all $i = 1, \dots, n$,

$f(x + U_i h^{(i)}) \le f(x) + \langle U_i^T \nabla f(x), h^{(i)} \rangle_i + \frac{L_i}{2}\|h^{(i)}\|_i^2, \quad h^{(i)} \in E_i, \ x \in Q. \quad (12)$

In the block-coordinate setting, $\tilde \nabla f(x)$ has non-zero elements only in one, say the $i$-th, block and it follows from (8) that $u_+ - u$ also has non-zero components only in the $i$-th block. Hence, there exists $h^{(i)} \in E_i$ such that $u_+ - u = U_i h^{(i)}$ and $x = y + a U_i h^{(i)}$. Then, if we define $\|x\|_E^2 = \sum_{i=1}^n L_i \|x^{(i)}\|_i^2$, we obtain

$f(x) = f(y + a U_i h^{(i)}) \overset{(12)}{\le} f(y) + \langle U_i^T \nabla f(y), a h^{(i)} \rangle_i + \frac{L_i}{2}\|a h^{(i)}\|_i^2 = f(y) + \langle \nabla f(y), a U_i h^{(i)} \rangle + \frac{1}{2}\|a U_i h^{(i)}\|_E^2 = f(y) + \langle \nabla f(y), x - y \rangle + \frac{1}{2}\|x - y\|_E^2.$

We generalize these two examples and assume smoothness of $f$ in the following sense.

Assumption 3 (Smoothness). The norm $\|\cdot\|_E$ is chosen in such a way that, for any $u, y \in Q$, $a \in \mathbb{R}$, if $x = y + a(u_+ - u) \in Q$, then

$f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \frac{1}{2}\|x - y\|_E^2. \quad (13)$

10 Algorithm 1 Randomized Similar Triangles Method RSTM) Input: starting point u 0 Q 0 = n i=1 Q0 i, prox-setup: dx), V [u]x), see Subsection 11 1: Set k = 0, A 0 = α 0 = 1 1 ρ, x 0 = y 0 = u 0 : repeat 3: Find α k+1 as the largest root of the equation := A k + α k+1 = ρ α k+1 14) 4: Calculate 5: Calculate 6: Calculate y k+1 = α k+1u k + A k x k 15) u k+1 = arg min x Q {V [u k]x) + α k+1 fy k+1 ), x } 16) x k+1 = y k+1 + ρ α k+1 u k+1 u k ) 17) 7: Set k = k + 1 8: until Output: The point x k+1 Randomized Similar Triangles Method In this section, we introduce our generic Randomized Similar Triangles Method, which is listed as Algorithm 1 below, and prove Theorem 1, which gives its convergence rate The method is constructed by a modication of Similar Triangles Method see Dvurechensky et al [017]) and, thus, inherits part of its name Lemma 1 Algorithm 1 is correctly dened in the sense that, for all k 0, x k, y k Q Proof The proof is a direct generalization of Lemma in Fercoq and Richt arik [015] By denition 16), for all k 0, u k Q If we prove that, for all k 0, x k Q, then, from 15), it follows that, for all k 0, y k Q Let us prove that, for all k 0, x k is a convex combination of u 0 u k, namely x k = k l=0 γl k u l, where γ0 0 = 1, γ1 0 = 0, γ1 1 = 1, and for k 1, γk+1 l = ) 1 α k+1 γk l, ) ) l = 0,, k 1 1 ρ α k α A k + ρ k A k α k+1, l = k α k+1 ρ α k+1, l = k + 1 Since, x 0 = u 0, we have that γ0 0 = 1 Next, by 17), we have x 1 = y 1 + ρ α 1 A 1 u 1 u 0 ) = u 0 + ρ α 1 A 1 u 1 u 0 ) = 1 ρ α 1 A 1 )u 0 + ρ α 1 A 1 u 1 Solving the equation 14) for k = 0, and using the 10 18)

11 choice α 0 = 1 1 ρ, we obtain that α 1 = 1 ρ and α 1 A 1 14) = α 1 ρ α 1 = 1 ρ 19) Hence, x 1 = u 1 and γ 0 1 = 0, γ 1 1 = 1 Let us now assume that x k = k l=0 γl k u l and prove that x k+1 is also a convex combination with coecients, given by 18) From 15), 17), we have x k+1 = y k+1 + ρ α k+1 u k+1 u k ) = α k+1u k + A k x k + ρ α k+1 u k+1 u k ) A k+1 αk+1 ρ α ) k+1 u k + ρ α k+1 u k+1 = A k x k + = 1 α ) k k+1 γ A ku l l + k+1 l=0 αk+1 ρ α k+1 Note that all the coecients sum to 1 Next, we have x k+1 = = = 1 α k+1 1 α k+1 ) k 1 l=0 ) k 1 l=0 ) k 1 1 α k+1 γku l l + γku l l + γku l l + l=0 ρ α k γk k 1 α k+1 A k αk+1 1 α k+1 ) + ) + 1 ρ α ) k A k ) u k + ρ α k+1 u k+1 αk+1 ρ α k+1 αk+1 ρ α k+1 αk + ρ α k+1 A k )) u k + ρ α k+1 )) u k + ρ α k+1 u k+1 u k+1 )) u k + ρ α k+1 u k+1 So, we see that 18) holds for k + 1 It remains to show that γk+1 l 0, l = 0,, k + 1 For γk+1 l, l = 0,, k 1 è γk+1 k+1 it is obvious From 14), we have α k+1 = ρ A k ρ Thus, since {A k }, k 0 is non-decreasing sequence, {α k+1 }, k 0 is also non-decreasing From 14), we obtain α k+1, which means that this sequence is non-increasing Thus, α k A k α k+1 = α k+1 ρ α k+1 and α k A k α 1 A 1 1 ρ for k 1 These inequalities prove that γk k+1 0 Lemma Let the sequences {x k, y k, u k, α k, A k }, k 0 be generated by Algorithm 1 Then, for all u Q, it holds that α k+1 fy k+1 ), u k u fy k+1 ) fx k+1 )) + V [u k ]u) V [u k+1 ]u) + α k+1 ρ R r ξy k+1 ), u k u k+1 0) 11

12 Proof Using Assumptions 1 and with α = α k+1, y = y k+1, u = u k, u + = u k+1, we obtain α k+1 fy k+1 ), u k u k+1 5) = α k+1 ρ R r R T p fy k+1 ) + ξy k+1 )), u k u k+1 9) = α k+1 ρ fy k+1 ), u k u k+1 + α k+1 ρ R r ξy k+1 ), u k u k+1 17) = fy k+1 ), y k+1 x k+1 + α k+1 ρ R r ξy k+1 ), u k u k+1 1) Note that, from the optimality condition in 16), for any u Q, we have V [u k ]u k+1 ) + α k+1 fyk+1 ), u u k+1 0 ) By the denition of V [u]x), we obtain, for any u Q, V [u k ]u) V [u k+1 ]u) V [u k ]u k+1 ) =du) du k ) du k ), u u k Further, for any u Q, by Assumption 3, du) du k+1 ) du k+1 ), u u k+1 ) du k+1 ) du k ) du k ), u k+1 u k ) = du k ) du k+1 ), u k+1 u = V [u k ]u k+1 ), u k+1 u 3) α k+1 fy k+1 ), u k u = α k+1 fy k+1 ), u k u k+1 + α k+1 fy k+1 ), u k+1 u ) α k+1 fy k+1 ), u k u k+1 + V [u k ]u k+1 ), u k+1 u 3) = α k+1 fy k+1 ), u k u k+1 + V [u k ]u) V [u k+1 ]u) V [u k ]u k+1 ) 3) α k+1 fy k+1 ), u k u k+1 + V [u k ]u) V [u k+1 ]u) 1 u k u k+1 E 1),17) = fy k+1 ), y k+1 x k+1 + α k+1 ρ R r ξy k+1 ), u k u k V [u k ]u) V [u k+1 ]u) y ρ αk+1 k+1 x k+1 E 14) = fy k+1 ), y k+1 x k+1 1 y k+1 x k+1 E 17),13) ) + + α k+1 ρ R r ξy k+1 ), u k u k+1 + V [u k ]u) V [u k+1 ]u) fy k+1 ) fx k+1 )) + V [u k ]u) V [u k+1 ]u)+ + α k+1 ρ R r ξy k+1 ), u k u k+1 In the last inequality, we used Assumption 3 with a = ρ α k+1, x = x k+1, y = y k+1, u = u k, u + = u k+1 1
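Before moving to the remaining lemmas, it may help to see Algorithm 1 in code. The following minimal sketch (ours, not a definitive implementation) specializes RSTM to the unconstrained Euclidean setting with $\|x\|_E^2 = L\|x\|_2^2$, so that the prox-step (16) reduces to an explicit update $u_{k+1} = u_k - (\alpha_{k+1}/L)\tilde\nabla f(y_{k+1})$; `oracle` stands for any implementation of the Randomized Inexact Oracle (5), and all names are ours.

```python
import numpy as np

def rstm(oracle, L, rho, u0, num_iters):
    """Randomized Similar Triangles Method (Algorithm 1), Euclidean unconstrained sketch.

    oracle(y) should return the Randomized Inexact Oracle (5) at y; rho is its scaling
    constant (e.g. rho = n for directional search or coordinate descent)."""
    A = 1.0 - 1.0 / rho                    # A_0 = alpha_0 = 1 - 1/rho
    x = y = u = np.array(u0, dtype=float)
    for _ in range(num_iters):
        # Step (14): alpha_{k+1} is the largest root of rho^2 a^2 - a - A_k = 0.
        alpha = (1.0 + np.sqrt(1.0 + 4.0 * rho ** 2 * A)) / (2.0 * rho ** 2)
        A_next = A + alpha                 # A_{k+1} = A_k + alpha_{k+1} = rho^2 alpha^2
        y = (alpha * u + A * x) / A_next   # step (15)
        g = oracle(y)
        u_next = u - (alpha / L) * g       # step (16) with V[u](x) = (L/2)||x - u||_2^2
        x = y + (rho * alpha / A_next) * (u_next - u)   # step (17)
        u, A = u_next, A_next
    return x
```

With the exact directional-search oracle of Subsection 3.1 plugged in as `oracle`, this reproduces accelerated random directional search; the constrained and non-Euclidean cases replace the explicit `u_next` line by the prox-mapping (16).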

13 Lemma 3 Let the sequences {x k, y k, u k, α k, A k }, k 0 be generated by Algorithm 1 Then, for all u Q, it holds that α k+1 fy k+1 ), u k u fy k+1 ) E k+1 fx k+1 )) + V [u k ]u) E k+1 V [u k+1 ]u) + E k+1 α k+1 ρ R r ξy k+1 ), u u k+1, 4) where E k+1 denotes the expectation conditioned on all the randomness up to step k Proof First, for any u Q, by Assumption 1, E k+1 α k+1 fy k+1 ), u k u 5) = E k+1 α k+1 ρ R r R T p fy k+1 ) + ξy k+1 )), u k u 6) = α k+1 fy k+1 ), u k u + E k+1 α k+1 ρ R r ξy k+1 ), u k u 5) Taking conditional expectation E k+1 statement of the Lemma in 0) of Lemma and using 5), we obtain the Lemma 4 Let the sequences {x k, y k, u k, α k, A k }, k 0 be generated by Algorithm 1 Then, for all u Q, it holds that E k+1 fx k+1 ) A k fx k ) α k+1 fy k+1 ) + fy k+1 ), u y k+1 ) + V [u k ]u) E k+1 V [u k+1 ]u) + E k+1 α k+1 ρ R r ξy k+1 ), u u k+1 6) Proof For any u Q, α k+1 fy k+1 ), y k+1 u = α k+1 fy k+1 ), y k+1 u k + α k+1 fy k+1 ), u k u 14),15) = A k fy k+1 ), x k y k+1 + α k+1 fy k+1 ), u k u conv-ty A k fx k ) fy k+1 )) + α k+1 fy k+1 ), u k u 4) A k fx k ) fy k+1 )) + fy k+1 ) E k+1 fx k+1 ))+ + V [u k ]u) E k+1 V [u k+1 ]u) + E k+1 α k+1 ρ R r ξy k+1 ), u u k+1 14) = α k+1 fy k+1 ) + A k fx k ) E k+1 fx k+1 ) + V [u k ]u) E k+1 V [u k+1 ]u) + E k+1 α k+1 ρ R r ξy k+1 ), u u k+1 7) Rearranging terms, we obtain the statement of the Lemma Theorem 1 Let the assumptions 1,, 3 hold Let the sequences {x k, y k, u k, α k, A k }, k 0 be generated by Algorithm 1 Let f be the optimal objective value and x be an optimal point in Problem 4) Denote P 0 = A 0 fx 0 ) f ) + V [u 0 ]x ) 8) 13

14 1 If the oracle error ξx) in 5) can be controlled and, on each iteration, the error level δ in 7) satises then, for all k 1, δ P 0 4ρA k, 9) Efx k ) f 3P 0 A k, where E denotes the expectation with respect to all the randomness up to step k If the oracle error ξx) in 5) can not be controlled, then, for all k 1, Efx k ) f P 0 A k + 4A k ρ δ Proof Let us change the counter in Lemma 4 from k to i, x u = x, take the full expectation in each inequality for i = 0,, k 1 and sum all the inequalities for i = 0,, k 1 Then, k 1 A k Efx k ) A 0 fx 0 ) α i+1 E fy i+1 ) + fy i+1 ), x y i+1 ) + V [u 0 ]x ) EV [u k ]x ) i=0 k 1 + Eα i+1 ρ R r ξy i+1 ), x u i+1 i=0 conv-ty,14),7) k 1 A k A 0 )fx ) + V [u 0 ]x ) EV [u k ]x ) + α i+1 ρδe x u i+1 E Rearranging terms and using 8), we obtain, for all k 1, k 1 0 A k Efx k ) f ) P0 EV [u k ]x ) + ρδ α i+1 ER i+1, 30) where we denoted R i = u i x E, i 0 1 We rst prove the rst statement of the Theorem We have 1 R 0 = 1 x u 0 3) E V [u 0 ]x ) 8) P0 31) Hence, ER 0 = R 0 P 0 P0 Let ER i P 0, for all i = 0,, k 1 Let us prove that ER k P 0 By convexity of square function, we obtain 1 ER k) 1 3) ER k EV [u k ]x ) 30) P0 + ρδ i=0 i=0 k α i+1 P 0 + α k ρδer k i=0 14) = P0 + ρδp 0 A k 1 A 0 ) + α k ρδer k P0 + ρδp 0 A k + α k ρδer k 3) 14

15 Since α k A k, k 0, by the choice of δ 9), we have ρδp 0 A k P 0 and α k ρδ A k ρδ P 0 So, we obtain an inequality for ER k 1 ER k) 3P 0 + P 0 4 ER k Solving this quadratic inequality in ER k, we obtain ER k P P P 0 = P 0 Thus, by induction, we have that, for all k 0, ER k P 0 Using the bounds ER i P 0, for all i = 0,, k, we obtain k 1 A k Efx k ) f ) 30) P0 14),9) + ρδ α i+1 ER i P0 + ρ P 0 A k A 0 ) P 0 3P 0 4ρA k i=0 This nishes the proof of the rst statement of the Theorem Now we prove the second statement of the Theorem First, from 30) for k = 1, we have 1 ER 1) 1 3) ER 1 EV [u 1 ]x ) 30) P0 + ρδα 1 ER 1 Solving this inequality in ER 1, we obtain ER 1 ρδα 1 + ρδα 1 ) + P0 ρδα 1 + P 0, 33) where we used that, for any a, b 0, a + b a + b Then, P 0 + ρδα 1 ER 1 P 0 + ρδα 1 ) + ρδα 1 P 0 P 0 + ρδ A 1 A 0 )) Thus, we have proved that the inequality k P0 + ρδ α i+1 ER i+1 i=0 P 0 + ρδ A k 1 A 0 )) 34) holds for k = Let us assume that it holds for some k and prove that it holds for k + 1 We have 1 ER k) 1 3) ER k EV [u k ]x ) 30) P0 + ρδ 34) k α i+1 ER i+1 + α k ρδer k i=0 P 0 + ρδ A k 1 A 0 )) + αk ρδer k 4 15

16 Solving this quadratic inequality in ER k, we obtain ER k α k ρδ + α k ρδ + P 0 + ρδ ) A k 1 A 0 ) α k ρδ) + P 0 + ρδ A k 1 A 0 )), 35) where we used that, for any a, b 0, a + b a + b Further, k 1 P0 34) + ρδ α i+1 ER i+1 i=0 P 0 + ρδ A k 1 A 0 )) + ρδαk ER k 35) P 0 + ρδ ) A k 1 A 0 ) + ρδαk ) + ρδα k P 0 + ρδ A k 1 A 0 )) P 0 + ρδ ) A k 1 A 0 ) + ρδα k = P 0 + ρδ A k A 0 )), which is 34) for k + 1 Using this inequality, we obtain k 1 A k Efx k ) f ) 30) P0 + ρδ α i+1 ER i+1 i=0 which nishes the proof of the Theorem P 0 + ρδ ) A k A 0 ) P 0 + 4ρ δ A k, Let us now estimate the growth rate of the sequence A k, k 0, which will give the rate of convergence for Algorithm 1 Lemma 5 Let the sequence {A k }, k 0 be generated by Algorithm 1 Then, for all k 1 it holds that k 1 + ρ) k 1 + ρ) A 4ρ k 36) ρ Proof As we showed in Lemma 1, α 1 = 1 ρ and, hence, A 1 = α 0 + α 1 = 1 Thus, 36) holds for k = 1 Let us assume that 36) holds for some k 1 and prove that it holds also for k + 1 From 14), we have a quadratic equation for α k+1 ρ α k+1 α k+1 A k = 0 Since we need to take the largest root, we obtain, α k+1 = ρ A k = 1 ρ ρ + 1 ρ + k 1 + ρ ρ = k + ρ ρ, ρ 4 + A k ρ 1 ρ + A k ρ

17 where we used the induction assumption that 36) holds for k On the other hand, α k+1 = 1 ρ + 1 4ρ + A k 4 ρ 1 ρ + A k ρ 1 ρ + k 1 + ρ ρ = k + ρ ρ, where we used inequality a + b a + b, a, b 0 Using the obtained inequalities for α k+1, from 14) and 36) for k, we get and = A k + α k+1 = A k + α k+1 k 1 + ρ) 4ρ k 1 + ρ) ρ In the last inequality we used that k 1, ρ 0 + k + ρ ρ + k + ρ ρ k + ρ) 4ρ k + ρ) ρ Remark 1 According to Theorem 1, if the desired accuracy of the solution is ε, ie the goal is to nd such ˆx Q that Efˆx) f ε, then the Algorithm 1 should be stopped when 3P0 A k ε Then 1 A k and the oracle error level δ should satisfy δ P 0 4ρA k ε 6ρP 0 ε 3P 0 From Lemma 5, we obtain that 3P 0 A k ε holds when k is the smallest integer satisfying k 1 + ρ) 4ρ 3P 0 ε This means that, to obtain an ε-solution, it is enough to choose { } 6P k = max ρ ρ, 0 ε Note that this dependence on ε means that the proposed method is accelerated 3 Examples of Applications In this section, we apply our general framework, which consists of assumptions 1,, 3, RSTM as listed in Algorithm 1 and convergence rate Theorem 1, to obtain several particular algorithms and their convergence rate We consider Problem 4) and, for each particular case, introduce a particular setup, which includes properties of the objective function f, available information about this function, properties of the feasible set Q Based on each setup, we show how the Randomized Inexact Oracle is constructed and check that the assumptions 1,, 3 hold Then, we obtain convergence rate guarantee for each particular algorithm as a corollary of Theorem 1 Our examples include accelerated random directional search 17

18 with inexact directional derivative, accelerated random block-coordinate descent with inexact block derivatives, accelerated random derivative-free directional search with inexact function values, accelerated random derivative-free block-coordinate descent with inexact function values Accelerated random directional search and accelerated random derivativefree directional search were developed in Nesterov and Spokoiny [017], but for the case of exact directional derivatives and exact function values Also, in the existing methods, a Gaussian random vector is used for randomization Accelerated random block-coordinate descent was introduced in Nesterov [01] and further developed in by several authors see Introduction for the extended review) Existing methods of this type use exact information on the block derivatives and also only Euclidean proximal setup In the contrast, our algorithm works with inexact derivatives and is able to work with entropy proximal setup To the best of our knowledge, our accelerated random derivative-free block-coordinate descent with inexact function values is new This method also can work with entropy proximal setup 31 Accelerated Random Directional Search In this subsection, we introduce accelerated random directional search with inexact directional derivative for unconstrained problems with Euclidean proximal setup We assume that, for all i = 1,, n, Q i = E i = R, x i) i = x i) ), x i) E i, d i x i) ) = 1 xi) ), x i) E i and, hence, V i [z i) ]x i) ) = 1 xi) z i) ), x i), z i) E i Thus, Q = E = R n Further, we assume that f in 4) has L-Lipschitz-continuous gradient with respect to Euclidean norm, ie fx) fy) + fy), x y + L x y, x, y E 37) We set β i = L, i = 1,, n Then, by denitions in Subsection 11, we have x E = L x, x E, dx) = L x = 1 x E, x E, V [z]x) = L x z = 1 x z E, x, z E Also, we have g E, = L 1 g, g E We assume that, at any point x E, one can calculate an inexact derivative of f in a direction e E f x, e) = fx), e + ξx), where e is a random vector uniformly distributed on the Euclidean sphere of radius 1, ie S 1) := {s R n : s = 1}, and the directional derivative error ξx) R is uniformly bounded in absolute value by error level, ie ξx), x E Since we are in the Euclidean setting, we consider e also as an element of E We use n fx), e + ξx))e as Randomized Inexact Oracle Let us check the assumptions stated in Subsection 1 Randomized Inexact Oracle In this setting, we have ρ = n, H = R, R T p : E R is given by R T p g = g, e, g E, R r : R E is given by R r t = te, t R Thus, fx) = n fx), e + ξx))e One can prove that E e n fx), e e = ne e ee T fx) = fx), x E, and, thus, 6) holds Also, for all x E, we have R r ξx) E, = 1 L ξx)e L, which proves 7) if we take δ = L 18

19 Regularity of Prox-Mapping Substituting particular choice of Q, V [u]x), fx) in 8), we obtain { } L u + = arg min x R n x u + α n fy), e + ξy))e, x = u αn fy), e + ξy))e L Hence, since e, e = 1, we have R r R T p fy), u u + = fy), e e, αn L fy), e + ξy))e = fy), e e, e αn fy), e + ξy)) L = fy), αn L fy), e + ξy))e = fy), u u +, which proves 9) Smoothness By denition of E and 37), we have fx) fy) + fy), x y + L x y = fy) + fy), x y + 1 x y E, x, y E and 13) holds We have checked that all the assumptions listed in Subsection 1 hold Thus, we can obtain the following convergence rate result for random directional search as a corollary of Theorem 1 and Lemma 5 Corollary 1 Let Algorithm 1 with fx) = n fx), e + ξx))e, where e is random and uniformly distributed over the Euclidean sphere of radius 1, be applied to Problem 4) in the setting of this subsection Let f be the optimal objective value and x be an optimal point in Problem 4) Assume that directional derivative error ξx) satises ξx), x E Denote P0 = 1 1 ) fx 0 ) f ) + L n u 0 x 1 If the directional derivative error ξx) can be controlled and, on each iteration, the error level satises P 0 L, 4nA k then, for all k 1, 6n P0 Efx k ) f k 1 + n), where E denotes the expectation with respect to all the randomness up to step k If the directional derivative error ξx) can not be controlled, then, for all k 1, Efx k ) f 8n P 0 k 1 + n) + 4 L k 1 + n) 19
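As a usage illustration of this subsection (our own sketch; the quadratic test function and all names are hypothetical), the following self-contained script runs the updates (14)-(17) with the exact directional-search oracle $n\langle\nabla f(x), e\rangle e$ and the Euclidean prox-step, and prints the decay of $f(x_k) - f^*$, which in expectation follows the accelerated rate of Corollary 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
diag = np.linspace(1.0, 10.0, n)          # f(x) = 0.5 * sum_j diag_j x_j^2, so L = max(diag)
L = diag.max()
grad_f = lambda x: diag * x
f = lambda x: 0.5 * np.dot(diag, x * x)   # f* = 0

rho = float(n)                            # scaling constant of the oracle
A = 1.0 - 1.0 / rho                       # A_0 = alpha_0 = 1 - 1/rho
x = y = u = rng.normal(size=n)

for k in range(1, 3001):
    alpha = (1.0 + np.sqrt(1.0 + 4.0 * rho ** 2 * A)) / (2.0 * rho ** 2)   # root of (14)
    A_next = A + alpha
    y = (alpha * u + A * x) / A_next                                       # (15)
    e = rng.normal(size=n); e /= np.linalg.norm(e)                         # uniform direction
    g = rho * (grad_f(y) @ e) * e                                          # oracle with xi = 0
    u_next = u - (alpha / L) * g                                           # (16), Euclidean prox
    x = y + (rho * alpha / A_next) * (u_next - u)                          # (17)
    u, A = u_next, A_next
    if k % 500 == 0:
        print(k, f(x))                    # the gap f(x_k) - f* shrinks at the accelerated rate
```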

20 Remark According to Remark 1 and due to the relation δ = L, we obtain that the error level in the directional derivative should satisfy ε L 6nP 0 At the same time, to obtain an ε-solution for Problem 4), it is enough to choose { } 6P k = max n n, 0 ε 3 Accelerated Random Coordinate Descent In this subsection, we introduce accelerated random coordinate descent with inexact coordinate derivatives for problems with separable constraints and Euclidean proximal setup We assume that, for all i = 1,, n, E i = R, Q i E i are closed and convex, x i) i = x i) ), x i) E i, d i x i) ) = 1 xi) ), x i) Q i, and, hence, V i [z i) ]x i) ) = 1 xi) z i) ), x i), z i) Q i Thus, Q = n i=1q i has separable structure Let us denote e i E the i-th coordinate vector Then, for i = 1,, n, the i-th coordinate derivative of f is f ix) = fx), e i We assume that the gradient of f in 4) is coordinatewise Lipschitz continuous with constants L i, i = 1,, n, ie f ix + he i ) f ix) L i h, h R, i = 1,, n, x Q 38) We set β i = L i, i = 1,, n Then, by denitions in Subsection 11, we have x E = n i=1 L ix i) ), x E, dx) = 1 n i=1 L ix i) ), x Q, V [z]x) = 1 n i=1 L ix i) z i) ), x, z Q Also, we have g E, = n i=1 L 1 i g i) ), g E We assume that, at any point x Q, one can calculate an inexact coordinate derivative of f f ix) = fx), e i + ξx), where the coordinate i is chosen from i = 1,, n at random with uniform probability 1 n, the coordinate derivative error ξx) R is uniformly bounded in absolute value by, ie ξx), x Q Since we are in the Euclidean setting, we consider e i also as an element of E We use n fx), e i + ξx))e i as Randomized Inexact Oracle Let us check the assumptions stated in Subsection 1 Randomized Inexact Oracle In this setting, we have ρ = n, H = E i = R, R T p : E R is given by R T p g = g, e i, g E, R r : R E is given by R r t = te i, t R Thus, fx) = n fx), e i + ξx))e i, x Q One can prove that E i n fx), e i e i = ne i e i e T i fx) = fx), x Q, and, thus, 6) holds Also, for all x Q, we have R r ξx) E, = 1 Li ξx) L0, where L 0 = min i=1,,n L i This proves 7) with δ = L0 0
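A brief sketch of this coordinate-descent oracle and of the resulting prox-step (our own illustration, with hypothetical names; in practice one would compute only the sampled partial derivative rather than the full gradient):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
L = rng.uniform(1.0, 10.0, size=n)        # coordinate-wise Lipschitz constants L_i of (38)

def coord_oracle(grad_fn, x, delta=0.0):
    """Oracle n*(f'_i(x) + xi)*e_i for a uniformly sampled coordinate i (returns i as well)."""
    i = rng.integers(n)
    xi = delta * rng.uniform(-1, 1)       # |xi| <= delta, the inexactness of Assumption 1
    return i, n * (grad_fn(x)[i] + xi)    # here the full gradient is formed only for brevity

def prox_step(u, i, g_i, alpha):
    """Step (16) for the weighted Euclidean setup V[u](x) = 0.5 * sum_i L_i (x_i - u_i)^2:
    only the sampled coordinate moves, with step size alpha / L_i."""
    u_plus = u.copy()
    u_plus[i] -= alpha * g_i / L[i]
    return u_plus
```

Plugging these two pieces into the loop of Algorithm 1 gives the accelerated random coordinate descent analyzed in this subsection.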

21 Regularity of Prox-Mapping Separable structure of Q and V [u]x) means that the problem 8) boils down to n independent problems of the form { } u j) Lj + = arg min x j) Q j uj) x j) ) + α fy), e j x j), j = 1,, n Since fy) has only one, i-th, non-zero component, fy), e j is zero for all j i Thus, u u + has one, i-th, non-zero component and e i, u u + e i = u u + Hence, R r R T p fy), u u + = fy), e i e i, u u + = fy), e i e i, u u + = fy), e i, u u + e i = fy), u u +, which proves 9) Smoothness By the standard reasoning, using 38), one can prove that, for all i = 1,, n, fx + he i ) fx) + h fx), e i + L ih, h R, x Q 39) Let u, y Q, a R, and x = y + au + u) Q As we have shown above, u + u has only one, i-th, non-zero component Hence, there exists h R, such that u + u = he i and x = y + ahe i Thus, by denition of E and 39), we have fx) = fy + ahe i ) fy) + ah fy), e i + L i ah) = fy) + fy), ahe i + 1 ahe i E = fy) + fy), x y + 1 x y E This proves 13) We have checked that all the assumptions listed in Subsection 1 hold Thus, we can obtain the following convergence rate result for random coordinate descent as a corollary of Theorem 1 and Lemma 5 Corollary Let Algorithm 1 with fx) = n fx), e i +ξx))e i, where i is uniformly at random chosen from 1,, n, be applied to Problem 4) in the setting of this subsection Let f be the optimal objective value and x be an optimal point in Problem 4) Assume that coordinate derivative error ξx) satises ξx), x Q Denote P 0 = 1 1 ) fx 0 ) f ) + n n i=1 L i ui) 0 x i) ) 1

22 1 If the coordinate derivative error ξx) can be controlled and, on each iteration, the error level satises P 0 L0, 4nA k then, for all k 1, Efx k ) f 6n P 0 k 1 + n), where E denotes the expectation with respect to all the randomness up to step k If the coordinate derivative error ξx) can not be controlled, then, for all k 1, Efx k ) f 8n P 0 k 1 + n) + 4 L 0 k 1 + n) Remark 3 According to Remark 1 and due to the relation δ = level in the coordinate derivative should satisfy ε L 0 6nP 0 L0, we obtain that the error At the same time, to obtain an ε-solution for Problem 4), it is enough to choose { } 6P k = max n n, 0 ε 33 Accelerated Random Block-Coordinate Descent In this subsection, we consider two block-coordinate settings The rst one is the Euclidean, which is usually used in the literature for accelerated block-coordinate descent The second one is the entropy, which, to the best of our knowledge, is analyzed in this context for the rst time We develop accelerated random block-coordinate descent with inexact block derivatives for problems with simple constraints in these two settings and their combination Euclidean setup We assume that, for all i = 1,, n, E i = R p i ; Q i is a simple closed convex set; x i) i = B i x i), x i), x i) E i, where B i is symmetric positive semidenite matrix; d i x i) ) = 1 xi) i, x i) Q i, and, hence, V i [z i) ]x i) ) = 1 xi) z i) i, x i), z i) Q i Entropy setup We assume that, for all i = 1,, n, E i = R p i ; Q i is standard simplex in R p i, ie, Q i = {x i) R p i + : p i j=1 [xi) ] j = 1}; x i) i = x i) 1 = p i j=1 [xi) ] j, x i) E i ; d i x i) ) = p i j=1 [xi) ] j ln[x i) ] j, x i) Q i, and, hence, V i [z i) ]x i) ) = p i j=1 [xi) ] j ln [xi) ] j [z i) ] j, x i), z i) Q i Note that, in each block, one also can choose other proximal setups from Ben-Tal and Nemirovski [015] Combination of dierent setups in dierent blocks is also possible, ie, in one block it is possible to choose the Euclidean setup and in an another block one can choose the entropy setup
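For the entropy setup on a block that is a standard simplex, the per-block prox-step of Algorithm 1 has a well-known closed form (the multiplicative, or exponentiated-gradient, update). This is not spelled out in the excerpt, so the sketch below is our own, assuming the per-block subproblem $\min_{x \in \Delta} \{ L_i V_i[u^{(i)}](x) + \alpha \langle g^{(i)}, x \rangle \}$ with the KL Bregman divergence $V_i$.

```python
import numpy as np

def entropy_prox_step(u_block, g_block, alpha, L_i):
    """argmin over the simplex of  L_i * KL(x || u_block) + alpha * <g_block, x>.

    Writing the optimality conditions with a multiplier for sum(x) = 1 gives the
    multiplicative update x_j proportional to u_j * exp(-alpha * g_j / L_i)."""
    w = u_block * np.exp(-alpha * g_block / L_i)
    return w / w.sum()

# Sanity check: the closed form minimizes the subproblem over random simplex points.
u = np.array([0.5, 0.3, 0.2])
g = np.array([1.0, -2.0, 0.5])
x_star = entropy_prox_step(u, g, alpha=0.7, L_i=2.0)

def objective(x):                          # same alpha = 0.7 and L_i = 2.0 as above
    return 2.0 * np.sum(x * np.log(x / u)) + 0.7 * (g @ x)

rng = np.random.default_rng(0)
for _ in range(100):
    z = rng.dirichlet(np.ones(3))
    assert objective(x_star) <= objective(z) + 1e-9
```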

23 Using operators U i, i = 1,, n dened in 1), for each i = 1,, n, the i-th block derivative of f can be written as f ix) = U T i fx) We assume that the gradient of f in 4) is blockwise Lipschitz continuous with constants L i, i = 1,, n with respect to chosen norms i, ie f ix + U i h i) ) f ix) i, L i h i) i, h i) E i, i = 1,, n, x Q 40) We set β i = L i, i = 1,, n Then, by denitions in Subsection 11, we have x E = n i=1 L i x i) i, x E, dx) = n i=1 L id i x i) ), x Q, V [z]x) = n i=1 L iv i [z i) ]x i) ), x, z Q Also, we have g E, = n i=1 L 1 i g i) i,, g E We assume that, at any point x Q, one can calculate an inexact block derivative of f f ix) = U T i fx) + ξx), where a block number i is chosen from 1,, n randomly uniformly, the block derivative error ξx) Ei is uniformly bounded in norm by, ie ξx) i,, x Q, i = 1,, n As Randomized Inexact Oracle, we use nũiui T fx) + ξx)), where Ũi is dened in ) Let us check the assumptions stated in Subsection 1 Randomized Inexact Oracle In this setting, we have ρ = n, H = E i, R T p : E Ei is given by R T p g = Ui T g, g E, R r : Ei E is given by R r g i) = Ũig i), g i) Ei Thus, fx) = nũiu T i fx) + ξx)), x Q Since i R[1, n], one can prove that E i nũiui T fx) = fx), x Q, and, thus, 6) holds Also, for all x Q, we have R r ξx) E, = Ũiξx) E, = 1 Li ξx) i, L0, where L 0 = min i=1,,n L i This proves 7) with δ = L0 Regularity of Prox-Mapping Separable structure of Q and V [u]x) means that the problem 8) boils down to n independent problems of the form { u j) + = arg min L j V [u j) ]x j) ) + α U T fy), } x j) j x j), j = 1,, n Q j Since fy) has non-zero components only in the block i, U T j fy) is zero for all j i Thus, u u + has non-zero components only in the block i and U i Ũ T i u u + ) = u u + Hence, R r R T p fy), u u + = ŨiU T i fy), u u + = fy), U i Ũ T i u u + ) = fy), u u +, which proves 9) Smoothness By the standard reasoning, using 40), one can prove that, for all i = 1,, n, fx + U i h i) ) fx) + U T i fx), h i) + L i hi) i, h i) E i, x Q 41) 3

24 Let u, y Q, a R, and x = y + au + u) Q As we have shown above, u + u has nonzero components only in the block i Hence, there exists h i) E i, such that u + u = U i h i) and x = y + au i h i) Thus, by denition of E and 41), we have fx) = fy + au i h i) ) fy) + U T i fy), ah i) + L i ahi) i = fy) + fy), au i h i) + 1 au ih i) E = fy) + fy), x y + 1 x y E This proves 13) We have checked that all the assumptions listed in Subsection 1 hold Thus, we can obtain the following convergence rate result for random block-coordinate descent as a corollary of Theorem 1 and Lemma 5 Corollary 3 Let Algorithm 1 with fx) = nũiui T fx) + ξx)), where i is uniformly at random chosen from 1,, n, be applied to Problem 4) in the setting of this subsection Let f be the optimal objective value and x be an optimal point in Problem 4) Assume that block derivative error ξx) satises ξx), x Q Denote P 0 = 1 1 n ) fx 0 ) f ) + V [u 0 ]x ) 1 If the block derivative error ξx) can be controlled and, on each iteration, the error level satises P 0 L0, 4nA k then, for all k 1, 6n P0 Efx k ) f k 1 + n), where E denotes the expectation with respect to all the randomness up to step k If the block derivative error ξx) can not be controlled, then, for all k 1, Efx k ) f 8n P 0 k 1 + n) + 4 L 0 k 1 + n) Remark 4 According to Remark 1 and due to the relation δ = derivative error should satisfy L0, we obtain that the block ε L 0 6nP 0 At the same time, to obtain an ε-solution for Problem 4), it is enough to choose { } 6P k = max n n, 0 ε 4

25 34 Accelerated Random Derivative-Free Directional Search In this subsection, we consider the same setting as in Subsection 31, except for Randomized Inexact Oracle Instead of directional derivative, we use here its nite-dierence approximation We assume that, for all i = 1,, n, Q i = E i = R, x i) i = x i) ), x i) E i, d i x i) ) = 1 xi) ), x i) E i, and, hence, V i [z i) ]x i) ) = 1 xi) z i) ), x i), z i) E i Thus, Q = E = R n Further, we assume that f in 4) has L-Lipschitz-continuous gradient with respect to Euclidean norm, ie fx) fy) + fy), x y + L x y, x, y E 4) We set β i = L, i = 1,, n Then, by denitions in Subsection 11, we have x E = L x, x E, dx) = L x = 1 x E, x E, V [z]x) = L x z = 1 x z E, x, z E Also, we have g E, = L 1 g, g E We assume that, at any point x E, one can calculate an inexact value fx) of the function f, st fx) fx), x E To approximate the gradient of f, we use fx) = n fx + τe) fx) e, τ where τ > 0 is small parameter, which will be chosen later, e E is a random vector uniformly distributed on the Euclidean sphere of radius 1, ie on S 1) := {s R n : s = 1} Since, we are in the Euclidean setting, we consider e also as an element of E Let us check the assumptions stated in Subsection 1 Randomized Inexact Oracle First, let us show that the nite-dierence approximation for the gradient of f can be expressed in the form of 5) We have fx) = n fx + τe) fx) e = n fx), e + 1 ) τ τ fx + τe) fx) τ fx), e ) e Taking ρ = n, H = R, R T p : E R be given by R T p g = g, e, g E, R r : R E be given by R r t = te, t R, we obtain fx) = n fx), e + ξx))e, where ξx) = 1 fx + τe) fx) τ fx), e ) One can prove that E τ e n fx), e e = ne e ee T fx) = fx), x E, and, thus, 6) holds It remains to prove 7), ie, nd δ st for all x E, we have R r ξx) E, δ R r ξx) E, = 1 ξx)e = 1 1 L L τ fx + τe) fx) τ fx), e )e = 1 1 L τ fx + τe) fx + τe) fx) fx)) + 1 fx + τe) fx) τ fx), e ))e L τ L + τ L 5

26 Here we used that fx) fx), x E and 4) So, we have that 7) holds with δ = τ + τ L To balance both terms, we choose τ =, which leads to equality L L δ = Regularity of Prox-Mapping This assumption can be checked in the same way as in Subsection 31 Smoothness This assumption can be checked in the same way as in Subsection 31 We have checked that all the assumptions listed in Subsection 1 hold Thus, we can obtain the following convergence rate result for random derivative-free directional search as a corollary of Theorem 1 and Lemma 5 fx+τe) fx) Corollary 4 Let Algorithm 1 with fx) = n e, where e is random and uniformly τ distributed over the Euclidean sphere of radius 1, be applied to Problem 4) in the setting of this subsection Let f be the optimal objective value and x be an optimal point in Problem 4) Assume that function value error fx) fx) satises fx) fx), x E Denote P0 = 1 1 ) fx 0 ) f ) + L n u 0 x 1 If the error in the value of the objective f can be controlled and, on each iteration, the error level satises and τ = L then, for all k 1, Efx k ) f P 0, 64n A k 6n P 0 k 1 + n), where E denotes the expectation with respect to all the randomness up to step k If the error in the value of the objective f can not be controlled and τ =, then, L for all k 1, Efx k ) f 8n P 0 k 1 + n) + 16k 1 + n) L Remark 5 According to Remark 1 and due to the relation δ =, we obtain that the error level in the function value should satisfy The parameter τ should satisfy τ ε 144n P0 ε 6nP 0 L 6
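A sketch of the derivative-free oracle of this subsection (our own illustration). Only noisy function values with $|\tilde f(x) - f(x)| \le \Delta$ are used; the directional derivative is replaced by the forward difference $(\tilde f(x+\tau e) - \tilde f(x))/\tau$, and $\tau$ is chosen to balance the two error terms $2\Delta/(\tau\sqrt{L})$ and $\tau\sqrt{L}/2$ in the bound on $\|R_r\xi(x)\|_{E,*}$, which gives $\tau = 2\sqrt{\Delta/L}$ (our reading of the elided value, derived from that balance rather than quoted from the text).

```python
import numpy as np

rng = np.random.default_rng(4)

def df_oracle(f_noisy, x, n, L, Delta):
    """Zero-order Randomized Inexact Oracle: n * (f~(x + tau e) - f~(x)) / tau * e,
    with tau = 2*sqrt(Delta/L) balancing noise and discretization errors."""
    e = rng.normal(size=n)
    e /= np.linalg.norm(e)                       # e uniform on the unit Euclidean sphere
    tau = 2.0 * np.sqrt(Delta / L) if Delta > 0 else 1e-8
    return n * (f_noisy(x + tau * e) - f_noisy(x)) / tau * e

# Example: a quadratic with function values corrupted by bounded noise of level Delta.
n, L, Delta = 10, 4.0, 1e-8
f = lambda x: 0.5 * L * np.dot(x, x)             # true gradient is L * x
f_noisy = lambda x: f(x) + Delta * rng.uniform(-1, 1)
x = rng.normal(size=n)
g_hat = np.mean([df_oracle(f_noisy, x, n, L, Delta) for _ in range(50000)], axis=0)
print(np.linalg.norm(g_hat - L * x))             # small compared with ||grad f(x)|| = L*||x||
```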


More information

Primal-dual Subgradient Method for Convex Problems with Functional Constraints

Primal-dual Subgradient Method for Convex Problems with Functional Constraints Primal-dual Subgradient Method for Convex Problems with Functional Constraints Yurii Nesterov, CORE/INMA (UCL) Workshop on embedded optimization EMBOPT2014 September 9, 2014 (Lucca) Yu. Nesterov Primal-dual

More information

Techinical Proofs for Nonlinear Learning using Local Coordinate Coding

Techinical Proofs for Nonlinear Learning using Local Coordinate Coding Techinical Proofs for Nonlinear Learning using Local Coordinate Coding 1 Notations and Main Results Denition 1.1 (Lipschitz Smoothness) A function f(x) on R d is (α, β, p)-lipschitz smooth with respect

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Randomized Coordinate Descent Methods on Optimization Problems with Linearly Coupled Constraints

Randomized Coordinate Descent Methods on Optimization Problems with Linearly Coupled Constraints Randomized Coordinate Descent Methods on Optimization Problems with Linearly Coupled Constraints By I. Necoara, Y. Nesterov, and F. Glineur Lijun Xu Optimization Group Meeting November 27, 2012 Outline

More information

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy Computation Of Asymptotic Distribution For Semiparametric GMM Estimators Hidehiko Ichimura Graduate School of Public Policy and Graduate School of Economics University of Tokyo A Conference in honor of

More information

Robust linear optimization under general norms

Robust linear optimization under general norms Operations Research Letters 3 (004) 50 56 Operations Research Letters www.elsevier.com/locate/dsw Robust linear optimization under general norms Dimitris Bertsimas a; ;, Dessislava Pachamanova b, Melvyn

More information

Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

More information

Efficient Methods for Stochastic Composite Optimization

Efficient Methods for Stochastic Composite Optimization Efficient Methods for Stochastic Composite Optimization Guanghui Lan School of Industrial and Systems Engineering Georgia Institute of Technology, Atlanta, GA 3033-005 Email: glan@isye.gatech.edu June

More information

On Nesterov s Random Coordinate Descent Algorithms - Continued

On Nesterov s Random Coordinate Descent Algorithms - Continued On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1, Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.

More information

Inexact Alternating Direction Method of Multipliers for Separable Convex Optimization

Inexact Alternating Direction Method of Multipliers for Separable Convex Optimization Inexact Alternating Direction Method of Multipliers for Separable Convex Optimization Hongchao Zhang hozhang@math.lsu.edu Department of Mathematics Center for Computation and Technology Louisiana State

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Gradient Sliding for Composite Optimization

Gradient Sliding for Composite Optimization Noname manuscript No. (will be inserted by the editor) Gradient Sliding for Composite Optimization Guanghui Lan the date of receipt and acceptance should be inserted later Abstract We consider in this

More information

Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization

Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization Noname manuscript No. (will be inserted by the editor) Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization Saeed Ghadimi Guanghui Lan Hongchao Zhang the date of

More information

On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities

On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities Caihua Chen Xiaoling Fu Bingsheng He Xiaoming Yuan January 13, 2015 Abstract. Projection type methods

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization Convex Optimization Ofer Meshi Lecture 6: Lower Bounds Constrained Optimization Lower Bounds Some upper bounds: #iter μ 2 M #iter 2 M #iter L L μ 2 Oracle/ops GD κ log 1/ε M x # ε L # x # L # ε # με f

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions

More information

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization A Sparsity Preserving Stochastic Gradient Method for Composite Optimization Qihang Lin Xi Chen Javier Peña April 3, 11 Abstract We propose new stochastic gradient algorithms for solving convex composite

More information

LEARNING IN CONCAVE GAMES

LEARNING IN CONCAVE GAMES LEARNING IN CONCAVE GAMES P. Mertikopoulos French National Center for Scientific Research (CNRS) Laboratoire d Informatique de Grenoble GSBE ETBC seminar Maastricht, October 22, 2015 Motivation and Preliminaries

More information

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Joint research

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Smoothing Proximal Gradient Method. General Structured Sparse Regression

Smoothing Proximal Gradient Method. General Structured Sparse Regression for General Structured Sparse Regression Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012) Gatsby Unit, Tea Talk October 25, 2013 Outline Motivation:

More information

Block stochastic gradient update method

Block stochastic gradient update method Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic

More information

Lecture 24 November 27

Lecture 24 November 27 EE 381V: Large Scale Optimization Fall 01 Lecture 4 November 7 Lecturer: Caramanis & Sanghavi Scribe: Jahshan Bhatti and Ken Pesyna 4.1 Mirror Descent Earlier, we motivated mirror descent as a way to improve

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

Richard DiSalvo. Dr. Elmer. Mathematical Foundations of Economics. Fall/Spring,

Richard DiSalvo. Dr. Elmer. Mathematical Foundations of Economics. Fall/Spring, The Finite Dimensional Normed Linear Space Theorem Richard DiSalvo Dr. Elmer Mathematical Foundations of Economics Fall/Spring, 20-202 The claim that follows, which I have called the nite-dimensional normed

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

Estimate sequence methods: extensions and approximations

Estimate sequence methods: extensions and approximations Estimate sequence methods: extensions and approximations Michel Baes August 11, 009 Abstract The approach of estimate sequence offers an interesting rereading of a number of accelerating schemes proposed

More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org

More information

5. Subgradient method

5. Subgradient method L. Vandenberghe EE236C (Spring 2016) 5. Subgradient method subgradient method convergence analysis optimal step size when f is known alternating projections optimality 5-1 Subgradient method to minimize

More information

Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions

Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Olivier Fercoq and Pascal Bianchi Problem Minimize the convex function

More information

Generalization of Hensel lemma: nding of roots of p-adic Lipschitz functions

Generalization of Hensel lemma: nding of roots of p-adic Lipschitz functions Generalization of Hensel lemma: nding of roots of p-adic Lipschitz functions (joint talk with Andrei Khrennikov) Dr. Ekaterina Yurova Axelsson Linnaeus University, Sweden September 8, 2015 Outline Denitions

More information

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson 204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,

More information

Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms

Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms Peter Ochs, Jalal Fadili, and Thomas Brox Saarland University, Saarbrücken, Germany Normandie Univ, ENSICAEN, CNRS, GREYC, France

More information

arxiv: v1 [math.oc] 5 Dec 2014

arxiv: v1 [math.oc] 5 Dec 2014 FAST BUNDLE-LEVEL TYPE METHODS FOR UNCONSTRAINED AND BALL-CONSTRAINED CONVEX OPTIMIZATION YUNMEI CHEN, GUANGHUI LAN, YUYUAN OUYANG, AND WEI ZHANG arxiv:141.18v1 [math.oc] 5 Dec 014 Abstract. It has been

More information

Lecture 23: November 21

Lecture 23: November 21 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

On the convergence properties of the projected gradient method for convex optimization

On the convergence properties of the projected gradient method for convex optimization Computational and Applied Mathematics Vol. 22, N. 1, pp. 37 52, 2003 Copyright 2003 SBMAC On the convergence properties of the projected gradient method for convex optimization A. N. IUSEM* Instituto de

More information

Optimization over Sparse Symmetric Sets via a Nonmonotone Projected Gradient Method

Optimization over Sparse Symmetric Sets via a Nonmonotone Projected Gradient Method Optimization over Sparse Symmetric Sets via a Nonmonotone Projected Gradient Method Zhaosong Lu November 21, 2015 Abstract We consider the problem of minimizing a Lipschitz dierentiable function over a

More information

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Mathematical Programming manuscript No. (will be inserted by the editor) Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Guanghui Lan Zhaosong Lu Renato D. C. Monteiro

More information

IE 5531: Engineering Optimization I

IE 5531: Engineering Optimization I IE 5531: Engineering Optimization I Lecture 15: Nonlinear optimization Prof. John Gunnar Carlsson November 1, 2010 Prof. John Gunnar Carlsson IE 5531: Engineering Optimization I November 1, 2010 1 / 24

More information

Nonmonotonic back-tracking trust region interior point algorithm for linear constrained optimization

Nonmonotonic back-tracking trust region interior point algorithm for linear constrained optimization Journal of Computational and Applied Mathematics 155 (2003) 285 305 www.elsevier.com/locate/cam Nonmonotonic bac-tracing trust region interior point algorithm for linear constrained optimization Detong

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

An Optimal Affine Invariant Smooth Minimization Algorithm.

An Optimal Affine Invariant Smooth Minimization Algorithm. An Optimal Affine Invariant Smooth Minimization Algorithm. Alexandre d Aspremont, CNRS & École Polytechnique. Joint work with Martin Jaggi. Support from ERC SIPA. A. d Aspremont IWSL, Moscow, June 2013,

More information

Lecture 25: Subgradient Method and Bundle Methods April 24

Lecture 25: Subgradient Method and Bundle Methods April 24 IE 51: Convex Optimization Spring 017, UIUC Lecture 5: Subgradient Method and Bundle Methods April 4 Instructor: Niao He Scribe: Shuanglong Wang Courtesy warning: hese notes do not necessarily cover everything

More information

Nesterov s Optimal Gradient Methods

Nesterov s Optimal Gradient Methods Yurii Nesterov http://www.core.ucl.ac.be/~nesterov Nesterov s Optimal Gradient Methods Xinhua Zhang Australian National University NICTA 1 Outline The problem from machine learning perspective Preliminaries

More information

ACCELERATED BUNDLE LEVEL TYPE METHODS FOR LARGE SCALE CONVEX OPTIMIZATION

ACCELERATED BUNDLE LEVEL TYPE METHODS FOR LARGE SCALE CONVEX OPTIMIZATION ACCELERATED BUNDLE LEVEL TYPE METHODS FOR LARGE SCALE CONVEX OPTIMIZATION By WEI ZHANG A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

More information

satisfying ( i ; j ) = ij Here ij = if i = j and 0 otherwise The idea to use lattices is the following Suppose we are given a lattice L and a point ~x

satisfying ( i ; j ) = ij Here ij = if i = j and 0 otherwise The idea to use lattices is the following Suppose we are given a lattice L and a point ~x Dual Vectors and Lower Bounds for the Nearest Lattice Point Problem Johan Hastad* MIT Abstract: We prove that given a point ~z outside a given lattice L then there is a dual vector which gives a fairly

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition)

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition) Vector Space Basics (Remark: these notes are highly formal and may be a useful reference to some students however I am also posting Ray Heitmann's notes to Canvas for students interested in a direct computational

More information

PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS

PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS Fixed Point Theory, 15(2014), No. 2, 399-426 http://www.math.ubbcluj.ro/ nodeacj/sfptcj.html PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS ANDRZEJ CEGIELSKI AND RAFA

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

You should be able to...

You should be able to... Lecture Outline Gradient Projection Algorithm Constant Step Length, Varying Step Length, Diminishing Step Length Complexity Issues Gradient Projection With Exploration Projection Solving QPs: active set

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Solving DC Programs that Promote Group 1-Sparsity

Solving DC Programs that Promote Group 1-Sparsity Solving DC Programs that Promote Group 1-Sparsity Ernie Esser Contains joint work with Xiaoqun Zhang, Yifei Lou and Jack Xin SIAM Conference on Imaging Science Hong Kong Baptist University May 14 2014

More information

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725 Dual methods and ADMM Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given f : R n R, the function is called its conjugate Recall conjugate functions f (y) = max x R n yt x f(x)

More information

Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables

Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables Department of Systems Engineering and Engineering Management The Chinese University of Hong Kong 2014 Workshop

More information

Generalized Uniformly Optimal Methods for Nonlinear Programming

Generalized Uniformly Optimal Methods for Nonlinear Programming Generalized Uniformly Optimal Methods for Nonlinear Programming Saeed Ghadimi Guanghui Lan Hongchao Zhang Janumary 14, 2017 Abstract In this paper, we present a generic framewor to extend existing uniformly

More information

Research Article Modified Halfspace-Relaxation Projection Methods for Solving the Split Feasibility Problem

Research Article Modified Halfspace-Relaxation Projection Methods for Solving the Split Feasibility Problem Advances in Operations Research Volume 01, Article ID 483479, 17 pages doi:10.1155/01/483479 Research Article Modified Halfspace-Relaxation Projection Methods for Solving the Split Feasibility Problem

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

arxiv: v7 [math.oc] 22 Feb 2018

arxiv: v7 [math.oc] 22 Feb 2018 A SMOOTH PRIMAL-DUAL OPTIMIZATION FRAMEWORK FOR NONSMOOTH COMPOSITE CONVEX MINIMIZATION QUOC TRAN-DINH, OLIVIER FERCOQ, AND VOLKAN CEVHER arxiv:1507.06243v7 [math.oc] 22 Feb 2018 Abstract. We propose a

More information