
Randomized Similar Triangles Method: A Unifying Framework for Accelerated Randomized Optimization Methods (Coordinate Descent, Directional Search, Derivative-Free Method)

Pavel Dvurechensky*   Alexander Gasnikov**   Alexander Tiurin***

July 26, 2017

Abstract

In this paper, we consider smooth convex optimization problems with simple constraints and inexactness in the oracle information, such as the value, partial or directional derivatives of the objective function. We introduce a unifying framework, which allows us to construct different types of accelerated randomized methods for such problems and to prove convergence rate theorems for them. We focus on accelerated random block-coordinate descent, accelerated random directional search, accelerated random derivative-free method and, using our framework, provide their versions for problems with inexact oracle information. Our contribution also includes accelerated random block-coordinate descent with inexact oracle and entropy proximal setup, as well as a derivative-free version of this method.

Keywords: convex optimization, accelerated random block-coordinate descent, accelerated random directional search, accelerated random derivative-free method, inexact oracle, complexity, accelerated gradient descent methods, first-order methods, zero-order methods.

AMS Classification: 90C25, 90C30, 90C06, 90C56, 68Q25, 65K05, 49M27, 68W20, 65Y20, 68W40.

* Weierstrass Institute for Applied Analysis and Stochastics, Berlin; Institute for Information Transmission Problems RAS, Moscow, pavel.dvurechensky@wias-berlin.de
** Moscow Institute of Physics and Technology, Moscow; Institute for Information Transmission Problems RAS, Moscow, gasnikov@yandex.ru
*** National Research University Higher School of Economics, Moscow, alexandertiurin@gmail.com

The results obtained in this paper were presented in December 2016 (http://www.mathnet.ru/php/seminars.phtml?option_lang=rus&presentid=16180) and in June 2017 (http://www.lccc.lth.se/index.php?mact=reglerseminars,cntnt01,abstractbio,0&cntnt01abstractid=889&cntnt01returnid=116).

Introduction

In this paper, we consider smooth convex optimization problems with simple constraints and inexactness in the oracle information, such as the value, partial or directional derivatives of the objective function. Different types of randomized optimization algorithms, such as random coordinate descent or stochastic gradient descent for the empirical risk minimization problem, have been extensively studied in the past decade, with the main application being convex optimization problems. Our main focus in this paper is on accelerated randomized methods: random block-coordinate descent, random directional search, random derivative-free method. As opposed to non-accelerated methods, these methods require $O\left(\frac{1}{\sqrt{\varepsilon}}\right)$ iterations to achieve an objective function residual of $\varepsilon$.

Accelerated random block-coordinate descent was first proposed in Nesterov [2012], which was the starting point for active research in this direction. The idea of the method is, on each iteration, to randomly choose a block of coordinates in the decision variable and make a step using the derivative of the objective function with respect to the chosen coordinates. Accelerated random directional search and accelerated random derivative-free method were first proposed in 2011 and published recently in Nesterov and Spokoiny [2017], but there was no extensive research in this direction. The idea of random directional search is to use a projection of the objective's gradient onto a randomly chosen direction to make a step on each iteration. The random derivative-free method uses the same idea, but the random projection of the gradient is approximated by a finite difference, i.e. the difference of values of the objective function at two close points. This also means that it is a zero-order method which uses only function values to make a step.

Existing accelerated randomized methods have different convergence analyses. This motivated us to pose the main question we address in this paper as follows. Is it possible to find a crucial part of the convergence rate analysis and use it to systematically construct new accelerated randomized methods? To some extent, our answer is "yes". We determine three main assumptions and use them to prove a convergence rate theorem for our generic accelerated randomized method. Our framework allows us both to reproduce known and to construct new accelerated randomized methods. The latter include a new accelerated random block-coordinate descent with inexact block derivatives and entropy proximal setup.

Related Work

In the seminal paper Nesterov [2012], Nesterov proposed random block-coordinate descent for convex optimization problems with simple convex separable constraints and accelerated random block-coordinate descent for unconstrained convex optimization problems. In Lee and Sidford [2013], Lee and Sidford proposed accelerated random block-coordinate descent with non-uniform probability of choosing a particular block of coordinates. They also developed an efficient implementation without full-dimensional operations on each iteration. Fercoq and Richtárik in Fercoq and Richtárik [2015] introduced accelerated block-coordinate descent for composite optimization problems, which include problems with separable constraints. Later, Lin, Lu and Xiao in Lin et al. [2014] extended this method to strongly convex

problems. In May 2015, Nesterov and Stich presented an accelerated block-coordinate descent whose complexity does not explicitly depend on the problem dimension. This result was recently published in Nesterov and Stich [2017]. A similar complexity was obtained also by Allen-Zhu, Qu, Richtárik and Yuan in Allen-Zhu et al. [2016] and by Gasnikov, Dvurechensky and Usmanova in Gasnikov et al. [2016c]. We also mention the special type of accelerated block-coordinate descent of Shalev-Shwartz and Zhang developed in Shalev-Shwartz and Zhang [2014] for empirical risk minimization problems. All these accelerated block-coordinate descent methods work in the Euclidean setup, when the norm in each block is Euclidean and defined using some positive semidefinite matrix. Non-accelerated block-coordinate methods, but with non-Euclidean setup, were considered by Dang and Lan in Dang and Lan [2015]. All the mentioned methods rely on exact block derivatives and exact projection on each step. Inexact projection in the context of non-accelerated random coordinate descent was considered by Tappenden, Richtárik and Gondzio in Tappenden et al. [2016].

Research on accelerated random directional search and accelerated random derivative-free methods started in Nesterov and Spokoiny [2017]. Mostly non-accelerated derivative-free methods were further developed in the context of inexact function values in Gasnikov et al. [2016a,b], Bogolubsky et al. [2016], Gasnikov et al. [2017]. We should also mention that there are other accelerated randomized methods in Frostig et al. [2015], Lin et al. [2015], Zhang and Lin [2015], Allen-Zhu [2017], Lan and Zhou [2017]. Most of these methods were developed deliberately for empirical risk minimization problems and do not fall within the scope of this paper.

Our Approach and Contributions

Our framework has two main components, namely, the Randomized Inexact Oracle and the Randomized Similar Triangles Method. The starting point for the definition of our oracle is a unified view of random directional search and random block-coordinate descent. In both these methods, on each iteration, a randomized approximation of the objective function's gradient is calculated and used, instead of the true gradient, to make a step. This approximation of the gradient is constructed by a projection onto a randomly chosen subspace. For random directional search, this subspace is the line along a randomly generated direction. As a result, a directional derivative in this direction is calculated. For random block-coordinate descent, this subspace is given by a randomly chosen block of coordinates, and a block derivative is calculated. One of the key features of these approximations is that they are unbiased, i.e. their expectation is equal to the true gradient. We generalize the two mentioned approaches by allowing other types of random transformations of the gradient for constructing its randomized approximation.

The inexactness of our oracle is inspired by the relation between the derivative-free method and directional search. In the framework of derivative-free methods, only the value of the objective function is available for use in an algorithm. At the same time, if the objective function is smooth, the directional derivative can be well approximated by the difference of function values at two points which are close to each other. Thus, in the context of zero-

order optimization, one can calculate only an inexact directional derivative. Hence, one can construct only a biased randomized approximation of the gradient when a random direction is used. We combine the previously mentioned random transformations of the gradient with possible inexactness of these transformations to construct our Randomized Inexact Oracle, which we use in our generic algorithm to make a step on each iteration.

The basis of our generic algorithm is the Similar Triangles Method of Tyurin [2017] (see also Dvurechensky et al. [2017]), which is an accelerated gradient method with only one proximal mapping on each iteration, this proximal mapping being essentially the Mirror Descent step. The notable point is that we only need to substitute the true gradient with our Randomized Inexact Oracle and slightly change one step in the Similar Triangles Method to obtain our generic accelerated randomized algorithm, which we call the Randomized Similar Triangles Method (RSTM), see Algorithm 1. We prove a convergence rate theorem for RSTM in two cases: the inexactness of the Randomized Inexact Oracle can be controlled and adjusted on each iteration of the algorithm; the inexactness can not be controlled.

We apply our framework to several particular settings: random directional search, random coordinate descent, random block-coordinate descent and their combinations with the derivative-free approach. As a corollary of our main theorem, we obtain both known and new results on the convergence of different accelerated randomized methods with inexact oracle.

To sum up, our contributions in this paper are as follows.

- We introduce a general framework for constructing and analyzing different types of accelerated randomized methods, such as accelerated random directional search, accelerated block-coordinate descent, accelerated derivative-free methods. Our framework allows us to obtain both known and new methods and their convergence rate guarantees as a corollary of our main Theorem 1.

- Using our framework, we introduce new accelerated methods with inexact oracle, namely, accelerated random directional search, accelerated random block-coordinate descent, accelerated derivative-free method. To the best of our knowledge, such methods with inexact oracle were not known before. See Section 3.

- Based on our framework, we introduce a new accelerated random block-coordinate descent with inexact oracle and non-Euclidean setup, which was not done before in the literature. The main application of this method is the minimization of functions on a direct product of a large number of low-dimensional simplexes. See Subsection 3.3.

- We introduce a new accelerated random derivative-free block-coordinate descent with inexact oracle and non-Euclidean setup. Such a method was not known before in the literature. Our method is similar to the method in the previous item, but uses only finite-difference approximations of block derivatives. See Subsection 3.6.

The rest of the paper is organized as follows. In Section 1, we provide the problem statement, motivate and make our three main assumptions, and illustrate them by random directional search and random block-coordinate descent. In Section 2, we introduce our main

algorithm, called the Randomized Similar Triangles Method, and, based on the stated general assumptions, prove convergence rate Theorem 1. Section 3 is devoted to applications of our general framework to different particular settings, namely Accelerated Random Directional Search (Subsection 3.1), Accelerated Random Coordinate Descent (Subsection 3.2), Accelerated Random Block-Coordinate Descent (Subsection 3.3), Accelerated Random Derivative-Free Directional Search (Subsection 3.4), Accelerated Random Derivative-Free Coordinate Descent (Subsection 3.5), Accelerated Random Derivative-Free Block-Coordinate Descent (Subsection 3.6), and Accelerated Random Derivative-Free Block-Coordinate Descent with Random Approximations for Block Derivatives (Subsection 3.7).

1 Preliminaries

1.1 Notation

Let the finite-dimensional real vector space $E$ be a direct product of $n$ finite-dimensional real vector spaces $E_i$, $i = 1, \dots, n$, i.e. $E = \bigotimes_{i=1}^n E_i$, and let $\dim E_i = p_i$, $i = 1, \dots, n$. Denote also $p = \sum_{i=1}^n p_i$. For $i = 1, \dots, n$, let $E_i^*$ denote the dual space of $E_i$. Then the space dual to $E$ is $E^* = \bigotimes_{i=1}^n E_i^*$. Given a vector $x^{(i)} \in E_i$ for some $i \in \{1, \dots, n\}$, we denote by $[x^{(i)}]_j$ its $j$-th coordinate, where $j \in \{1, \dots, p_i\}$. To formalize the relationship between vectors in $E_i$, $i = 1, \dots, n$, and vectors in $E$, we define the primal partition operators $U_i : E_i \to E$, $i = 1, \dots, n$, by the identity

$x = (x^{(1)}, \dots, x^{(n)}) = \sum_{i=1}^n U_i x^{(i)}, \quad x^{(i)} \in E_i, \ i = 1, \dots, n, \ x \in E.$   (1)

For any fixed $i \in \{1, \dots, n\}$, $U_i$ maps a vector $x^{(i)} \in E_i$ to the vector $(0, \dots, x^{(i)}, \dots, 0) \in E$. The adjoint operator $U_i^T : E^* \to E_i^*$ is then the operator which maps a vector $g = (g^{(1)}, \dots, g^{(i)}, \dots, g^{(n)}) \in E^*$ to the vector $g^{(i)} \in E_i^*$. Similarly, we define the dual partition operators $\tilde{U}_i : E_i^* \to E^*$, $i = 1, \dots, n$, by the identity

$g = (g^{(1)}, \dots, g^{(n)}) = \sum_{i=1}^n \tilde{U}_i g^{(i)}, \quad g^{(i)} \in E_i^*, \ i = 1, \dots, n, \ g \in E^*.$   (2)

For all $i = 1, \dots, n$, we denote the value of a linear function $g^{(i)} \in E_i^*$ at a point $x^{(i)} \in E_i$ by $\langle g^{(i)}, x^{(i)} \rangle_i$. We define

$\langle g, x \rangle = \sum_{i=1}^n \langle g^{(i)}, x^{(i)} \rangle_i, \quad x \in E, \ g \in E^*.$

For all $i = 1, \dots, n$, let $\|\cdot\|_i$ be some norm on $E_i$ and let $\|\cdot\|_{i,*}$ be the norm on $E_i^*$ which is dual to $\|\cdot\|_i$:

$\|g^{(i)}\|_{i,*} = \max_{\|x^{(i)}\|_i \le 1} \langle g^{(i)}, x^{(i)} \rangle_i.$

Given parameters $\beta_i \in \mathbb{R}_{++}$, $i = 1, \dots, n$, we define the norm of a vector $x = (x^{(1)}, \dots, x^{(n)}) \in E$ as

$\|x\|_E^2 = \sum_{i=1}^n \beta_i \|x^{(i)}\|_i^2.$

Then, clearly, the dual norm of a vector $g = (g^{(1)}, \dots, g^{(n)}) \in E^*$ is

$\|g\|_{E,*}^2 = \sum_{i=1}^n \beta_i^{-1} \|g^{(i)}\|_{i,*}^2.$

Throughout the paper, we consider an optimization problem with feasible set $Q$, which is assumed to be given as $Q = \bigotimes_{i=1}^n Q_i \subseteq E$, where $Q_i \subseteq E_i$, $i = 1, \dots, n$, are closed convex sets. To have more flexibility and be able to adapt the algorithm to the structure of the sets $Q_i$, $i = 1, \dots, n$, we introduce a proximal setup, see e.g. Ben-Tal and Nemirovski [2015]. For all $i = 1, \dots, n$, we choose a prox-function $d_i(x^{(i)})$ which

1. is continuous, convex on $Q_i$ and admits a continuous in $x^{(i)} \in Q_i^0$ selection of subgradients $\nabla d_i(x^{(i)})$, where $x^{(i)} \in Q_i^0 \subseteq Q_i$, and $Q_i^0$ is the set of all $x^{(i)}$ where $\nabla d_i(x^{(i)})$ exists;

2. is 1-strongly convex on $Q_i$ with respect to $\|\cdot\|_i$, i.e., for any $x^{(i)} \in Q_i^0$, $y^{(i)} \in Q_i$, it holds that $d_i(y^{(i)}) - d_i(x^{(i)}) - \langle \nabla d_i(x^{(i)}), y^{(i)} - x^{(i)} \rangle_i \ge \frac{1}{2}\|y^{(i)} - x^{(i)}\|_i^2$.

We also define the corresponding Bregman divergence $V_i[z^{(i)}](x^{(i)}) := d_i(x^{(i)}) - d_i(z^{(i)}) - \langle \nabla d_i(z^{(i)}), x^{(i)} - z^{(i)} \rangle_i$, $x^{(i)} \in Q_i$, $z^{(i)} \in Q_i^0$, $i = 1, \dots, n$. It is easy to see that

$V_i[z^{(i)}](x^{(i)}) \ge \frac{1}{2}\|x^{(i)} - z^{(i)}\|_i^2, \quad x^{(i)} \in Q_i, \ z^{(i)} \in Q_i^0, \ i = 1, \dots, n.$

Standard proximal setups, e.g. Euclidean, entropy, $\ell_1/\ell_2$, simplex, can be found in Ben-Tal and Nemirovski [2015]. It is easy to check that, for given parameters $\beta_i \in \mathbb{R}_{++}$, $i = 1, \dots, n$, the functions $d(x) = \sum_{i=1}^n \beta_i d_i(x^{(i)})$ and $V[z](x) = \sum_{i=1}^n \beta_i V_i[z^{(i)}](x^{(i)})$ are respectively a prox-function and a Bregman divergence corresponding to $Q$. Also, clearly,

$V[z](x) \ge \frac{1}{2}\|x - z\|_E^2, \quad x \in Q, \ z \in Q^0 := \bigotimes_{i=1}^n Q_i^0.$   (3)

For a differentiable function $f(x)$, we denote by $\nabla f(x) \in E^*$ its gradient.

1.2 Problem Statement and Assumptions

The main problem we consider is

$\min_{x \in Q \subseteq E} f(x),$   (4)
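To make the proximal setup concrete, here is a small illustrative sketch (ours, not from the paper) of two standard choices from Ben-Tal and Nemirovski [2015]: the Euclidean prox-function with its Bregman divergence, the entropy prox-function on the simplex with the Kullback-Leibler divergence, and the block-combined divergence $V[z](x) = \sum_{i=1}^n \beta_i V_i[z^{(i)}](x^{(i)})$. All function names are our own.

```python
import numpy as np

# Euclidean setup on a block: d(x) = 0.5 * ||x||_2^2,  V[z](x) = 0.5 * ||x - z||_2^2
def bregman_euclidean(x, z):
    return 0.5 * float(np.sum((x - z) ** 2))

# Entropy setup on the simplex: d(x) = sum_j x_j * ln(x_j),
# V[z](x) = sum_j x_j * ln(x_j / z_j), 1-strongly convex w.r.t. ||.||_1
def bregman_entropy(x, z, eps=1e-15):
    return float(np.sum(x * np.log((x + eps) / (z + eps))))

# Block-combined Bregman divergence V[z](x) = sum_i beta_i * V_i[z^(i)](x^(i))
def bregman_blocks(x_blocks, z_blocks, betas, divergences):
    return sum(b * V(x, z) for b, V, x, z in zip(betas, divergences, x_blocks, z_blocks))
```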

where $f(x)$ is a smooth convex function and $Q = \bigotimes_{i=1}^n Q_i \subseteq E$, with $Q_i \subseteq E_i$, $i = 1, \dots, n$, being closed convex sets.

We now list our main assumptions and illustrate them by two simple examples. More detailed examples are given in Section 3. As the first example here, we consider random directional search, in which the gradient of the function $f$ is approximated by the vector $\langle \nabla f(x), e \rangle e$, where $\langle \nabla f(x), e \rangle$ is the directional derivative in the direction $e$ and the random vector $e$ is uniformly distributed over the Euclidean sphere of radius 1. Our second example is random block-coordinate descent, in which the gradient of the function $f$ is approximated by the vector $\tilde{U}_i U_i^T \nabla f(x)$, where $U_i^T \nabla f(x)$ is the $i$-th block derivative and the block number $i$ is sampled uniformly at random from $1, \dots, n$.

The common part in both these randomized gradient approximations is that, first, one randomly chooses a subspace, which is either the line parallel to $e$ or the $i$-th block of coordinates. Then, one projects the gradient onto this subspace by calculating either $\langle \nabla f(x), e \rangle$ or $U_i^T \nabla f(x)$. Finally, one lifts the obtained random projection back to the whole space $E$, either by multiplying the directional derivative by the vector $e$ or by applying the dual partition operator $\tilde{U}_i$. At the same time, in both cases, if one scales the obtained randomized approximation of the gradient by multiplying it by $n$, one obtains an unbiased randomized approximation of the gradient:

$\mathbb{E}_e\, n \langle \nabla f(x), e \rangle e = \nabla f(x), \qquad \mathbb{E}_i\, n \tilde{U}_i U_i^T \nabla f(x) = \nabla f(x), \quad x \in Q.$

We also want our approach to allow the construction of derivative-free methods. For a function $f$ with $L$-Lipschitz-continuous gradient, the directional derivative can be well approximated by the difference of function values at two close points. Namely, it holds that

$\langle \nabla f(x), e \rangle = \frac{f(x + \tau e) - f(x)}{\tau} + o(\tau),$

where $\tau > 0$ is a small parameter. Thus, if only the value of the function is available, one can calculate only an inexact directional derivative, which leads to a biased randomized approximation of the gradient if the direction is chosen randomly.

These three features, namely, random projection and lifting up, the unbiased part of the randomized approximation of the gradient, and the bias in the randomized approximation of the gradient, lead us to the following assumption about the structure of our general Randomized Inexact Oracle.

Assumption 1 (Randomized Inexact Oracle). We access the function $f$ only through the Randomized Inexact Oracle $\tilde{\nabla} f(x)$, $x \in Q$, which is given by

$\tilde{\nabla} f(x) = \rho R_r (R_p^T \nabla f(x) + \xi(x)) \in E^*,$   (5)

where $\rho > 0$ is a known constant; $R_p$ is a random "projection" operator from some auxiliary space $H$ to $E$, and, hence, $R_p^T$, acting from $E^*$ to $H^*$, is the adjoint of $R_p$; $R_r : H^* \to E^*$ is also some random "reconstruction" operator; $\xi(x) \in H^*$ is a, possibly random, vector characterizing the error of the oracle. The oracle is also assumed to satisfy the following properties:

$\mathbb{E}\, \rho R_r R_p^T \nabla f(x) = \nabla f(x), \quad x \in Q,$   (6)

$\|R_r \xi(x)\|_{E,*} \le \delta, \quad x \in Q,$   (7)
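For concreteness, the following sketch (ours; the paper gives no code) instantiates the Randomized Inexact Oracle (5) with $\rho = n$ for the two examples above: the directional-search oracle $n(\langle\nabla f(x), e\rangle + \xi(x))e$ and the block-coordinate oracle $n\tilde{U}_i(U_i^T\nabla f(x) + \xi(x))$. The `xi` arguments stand for the abstract oracle error $\xi(x)$ and are assumptions of the sketch.

```python
import numpy as np

def oracle_directional(grad, xi=0.0):
    """~grad f(x) = n * (<grad f(x), e> + xi) * e, with e uniform on the unit sphere."""
    n = grad.shape[0]
    e = np.random.randn(n)
    e /= np.linalg.norm(e)
    return n * (grad @ e + xi) * e

def oracle_block_coordinate(grad, blocks, xi=None):
    """~grad f(x) = n * ~U_i (U_i^T grad f(x) + xi), with block i chosen uniformly at random."""
    n = len(blocks)
    i = np.random.randint(n)
    out = np.zeros_like(grad)
    idx = blocks[i]                      # index array of the i-th block
    out[idx] = n * (grad[idx] + (0.0 if xi is None else xi))
    return out
```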

where $\delta \ge 0$ is the oracle error level.

Let us make some comments on this assumption. The operator $R_p^T$ is a generalization of a random projection. For the case of random directional search, $H = \mathbb{R}$ and $R_p^T : E^* \to \mathbb{R}$ is given by $R_p^T g = \langle g, e\rangle$, $g \in E^*$. For the case of random block-coordinate descent, $H = E_i$ and $R_p^T : E^* \to E_i^*$ is given by $R_p^T g = U_i^T g$, $g \in E^*$. We assume that there is some additive error $\xi(x)$ in the generalized random projection $R_p^T \nabla f(x)$. This error can be introduced, for example, when a finite-difference approximation of the directional derivative is used. Finally, we lift the inexact random projection $R_p^T \nabla f(x) + \xi(x)$ back to $E^*$ by applying the operator $R_r$. For the case of random directional search, $R_r : \mathbb{R} \to E^*$ is given by $R_r t = te$, $t \in \mathbb{R}$. For the case of random block-coordinate descent, $R_r : E_i^* \to E^*$ is given by $R_r g^{(i)} = \tilde{U}_i g^{(i)}$, $g^{(i)} \in E_i^*$. The number $\rho$ is the normalizing coefficient which allows the part $\rho R_r R_p^T \nabla f(x)$ to be an unbiased randomized approximation of the gradient. This is expressed by equality (6). Finally, we assume that the error in our oracle is bounded, which is expressed by property (7). In our analysis, we consider two cases: the error $\xi$ can be controlled and $\delta$ can be appropriately chosen on each iteration of the algorithm; the error $\xi$ can not be controlled and we only know the oracle error level $\delta$.

Let us move to the next assumption. As said, our generic algorithm is based on the Similar Triangles Method of Tyurin [2017] (see also Dvurechensky et al. [2017]), which is an accelerated gradient method with only one proximal mapping on each iteration. This proximal mapping is essentially the Mirror Descent step. For simplicity, let us consider here an unconstrained minimization problem in the Euclidean setting. This means that $Q_i = E_i = \mathbb{R}^{p_i}$, $\|x^{(i)}\|_i = \|x^{(i)}\|_2$, $i = 1, \dots, n$. Then, given a point $u \in E$, a number $\alpha > 0$, and the gradient $\nabla f(y)$ at some point $y \in E$, the Mirror Descent step is

$u_+ = \arg\min_{x \in E}\left\{\frac{1}{2}\|x - u\|_2^2 + \alpha\langle\nabla f(y), x\rangle\right\} = u - \alpha\nabla f(y).$

Now we want to substitute the gradient $\nabla f(y)$ with our Randomized Inexact Oracle $\tilde{\nabla} f(y)$. Then, we see that the step $u_+ = u - \alpha\tilde{\nabla} f(y)$ makes progress only in the subspace onto which the gradient is projected while constructing the Randomized Inexact Oracle. In other words, $u - u_+$ lies in the same subspace as $\tilde{\nabla} f(y)$. In our analysis, this is a desirable property and we formalize it as follows.

Assumption 2 (Regularity of Prox-Mapping). The set $Q$, the norm $\|\cdot\|_E$, the prox-function $d(x)$, and the Randomized Inexact Oracle $\tilde{\nabla} f(x)$ are chosen in such a way that, for any $u, y \in Q$, $\alpha > 0$, the point

$u_+ = \arg\min_{x \in Q}\left\{V[u](x) + \alpha\langle\tilde{\nabla} f(y), x\rangle\right\}$   (8)

satisfies

$\langle R_r R_p^T \nabla f(y), u - u_+\rangle = \langle\nabla f(y), u - u_+\rangle.$   (9)

The interpretation is that, in terms of the linear pairing with $u - u_+$, the unbiased part $R_r R_p^T \nabla f(y)$ of the Randomized Inexact Oracle makes the same progress as the true gradient $\nabla f(y)$.

Finally, we want to formalize the smoothness assumption for the function $f$. In our analysis, we use only the smoothness of $f$ in the direction of $u_+ - u$, where $u \in Q$ and $u_+$ is defined in (8). Thus, we consider two points $x, y \in Q$ which satisfy the equality $x = y + a(u_+ - u)$, where $a \in \mathbb{R}$. For random directional search, it is natural to assume that $f$ has $L$-Lipschitz-continuous gradient with respect to the Euclidean norm, i.e.

$f(x) \le f(y) + \langle\nabla f(y), x - y\rangle + \frac{L}{2}\|x - y\|_2^2, \quad x, y \in Q.$   (10)

Then, if we define $\|x\|_E^2 = L\|x\|_2^2$, we obtain that, for our choice $x = y + a(u_+ - u)$,

$f(x) \le f(y) + \langle\nabla f(y), x - y\rangle + \frac{1}{2}\|x - y\|_E^2.$

The usual assumption for random block-coordinate descent is that the gradient of $f$ is block-wise Lipschitz continuous. This means that, for all $i = 1, \dots, n$, the block derivative $f_i'(x) = U_i^T\nabla f(x)$ is $L_i$-Lipschitz continuous with respect to the chosen norm $\|\cdot\|_i$, i.e.

$\|f_i'(x + U_i h^{(i)}) - f_i'(x)\|_{i,*} \le L_i\|h^{(i)}\|_i, \quad h^{(i)} \in E_i, \ i = 1, \dots, n, \ x \in Q.$   (11)

By the standard reasoning, using (11), one can prove that, for all $i = 1, \dots, n$,

$f(x + U_i h^{(i)}) \le f(x) + \langle U_i^T\nabla f(x), h^{(i)}\rangle_i + \frac{L_i}{2}\|h^{(i)}\|_i^2, \quad h^{(i)} \in E_i, \ x \in Q.$   (12)

In the block-coordinate setting, $\tilde{\nabla} f(x)$ has non-zero elements only in one, say the $i$-th, block and it follows from (8) that $u_+ - u$ also has non-zero components only in the $i$-th block. Hence, there exists $h^{(i)} \in E_i$ such that $u_+ - u = U_i h^{(i)}$ and $x = y + aU_i h^{(i)}$. Then, if we define $\|x\|_E^2 = \sum_{i=1}^n L_i\|x^{(i)}\|_i^2$, we obtain

$f(x) = f(y + aU_i h^{(i)}) \overset{(12)}{\le} f(y) + \langle U_i^T\nabla f(y), ah^{(i)}\rangle_i + \frac{L_i}{2}\|ah^{(i)}\|_i^2 = f(y) + \langle\nabla f(y), aU_i h^{(i)}\rangle + \frac{1}{2}\|aU_i h^{(i)}\|_E^2 = f(y) + \langle\nabla f(y), x - y\rangle + \frac{1}{2}\|x - y\|_E^2.$

We generalize these two examples and assume smoothness of $f$ in the following sense.

Assumption 3 (Smoothness). The norm $\|\cdot\|_E$ is chosen in such a way that, for any $u, y \in Q$, $a \in \mathbb{R}$, if $x = y + a(u_+ - u) \in Q$, then

$f(x) \le f(y) + \langle\nabla f(y), x - y\rangle + \frac{1}{2}\|x - y\|_E^2.$   (13)

Algorithm 1 Randomized Similar Triangles Method (RSTM)

Input: starting point $u_0 \in Q^0 = \bigotimes_{i=1}^n Q_i^0$, prox-setup: $d(x)$, $V[u](x)$, see Subsection 1.1.
1: Set $k = 0$, $A_0 = \alpha_0 = 1 - \frac{1}{\rho}$, $x_0 = y_0 = u_0$.
2: repeat
3: Find $\alpha_{k+1}$ as the largest root of the equation
   $A_{k+1} := A_k + \alpha_{k+1} = \rho^2\alpha_{k+1}^2.$   (14)
4: Calculate
   $y_{k+1} = \frac{\alpha_{k+1}u_k + A_kx_k}{A_{k+1}}.$   (15)
5: Calculate
   $u_{k+1} = \arg\min_{x \in Q}\left\{V[u_k](x) + \alpha_{k+1}\langle\tilde{\nabla} f(y_{k+1}), x\rangle\right\}.$   (16)
6: Calculate
   $x_{k+1} = y_{k+1} + \frac{\rho\alpha_{k+1}}{A_{k+1}}(u_{k+1} - u_k).$   (17)
7: Set $k = k + 1$.
8: until the stopping criterion is fulfilled.
Output: The point $x_{k+1}$.

2 Randomized Similar Triangles Method

In this section, we introduce our generic Randomized Similar Triangles Method, which is listed as Algorithm 1 above, and prove Theorem 1, which gives its convergence rate. The method is constructed by a modification of the Similar Triangles Method (see Dvurechensky et al. [2017]) and, thus, inherits part of its name.

Lemma 1. Algorithm 1 is correctly defined in the sense that, for all $k \ge 0$, $x_k, y_k \in Q$.

Proof. The proof is a direct generalization of Lemma 2 in Fercoq and Richtárik [2015]. By definition (16), for all $k \ge 0$, $u_k \in Q$. If we prove that, for all $k \ge 0$, $x_k \in Q$, then, from (15), it follows that, for all $k \ge 0$, $y_k \in Q$. Let us prove that, for all $k \ge 0$, $x_k$ is a convex combination of $u_0, \dots, u_k$, namely $x_k = \sum_{l=0}^k\gamma_k^lu_l$, where $\gamma_0^0 = 1$, $\gamma_1^0 = 0$, $\gamma_1^1 = 1$, and, for $k \ge 1$,

$\gamma_{k+1}^l = \begin{cases}\left(1 - \frac{\alpha_{k+1}}{A_{k+1}}\right)\gamma_k^l, & l = 0, \dots, k-1,\\ \left(1 - \frac{\alpha_{k+1}}{A_{k+1}}\right)\frac{\rho\alpha_k}{A_k} + \frac{\alpha_{k+1} - \rho\alpha_{k+1}}{A_{k+1}}, & l = k,\\ \frac{\rho\alpha_{k+1}}{A_{k+1}}, & l = k+1.\end{cases}$   (18)

Since $x_0 = u_0$, we have $\gamma_0^0 = 1$. Next, by (17), we have $x_1 = y_1 + \frac{\rho\alpha_1}{A_1}(u_1 - u_0) = u_0 + \frac{\rho\alpha_1}{A_1}(u_1 - u_0) = \left(1 - \frac{\rho\alpha_1}{A_1}\right)u_0 + \frac{\rho\alpha_1}{A_1}u_1$. Solving the equation (14) for $k = 0$ and using the
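As an illustration of Algorithm 1, here is a minimal sketch (ours, not the authors' code) for the unconstrained Euclidean setting of Subsection 3.1, where $V[u](x) = \frac{L}{2}\|x - u\|_2^2$ and step (16) has the closed form $u_{k+1} = u_k - \frac{\alpha_{k+1}}{L}\tilde{\nabla} f(y_{k+1})$. The oracle is passed in as a function assumed to satisfy Assumption 1, for instance one of the sketches given after Assumption 1.

```python
import numpy as np

def rstm(oracle, u0, L, rho, n_iters):
    """Sketch of Algorithm 1 (RSTM), unconstrained Euclidean case.
    oracle(y) should return the Randomized Inexact Oracle ~grad f(y)."""
    x = y = u = np.asarray(u0, dtype=float)
    A = 1.0 - 1.0 / rho                                  # A_0 = alpha_0 = 1 - 1/rho
    for _ in range(n_iters):
        # (14): alpha_{k+1} is the largest root of A_k + alpha = rho^2 * alpha^2
        alpha = (1.0 + np.sqrt(1.0 + 4.0 * rho ** 2 * A)) / (2.0 * rho ** 2)
        A_new = A + alpha
        y = (alpha * u + A * x) / A_new                  # (15)
        u_new = u - (alpha / L) * oracle(y)              # (16), Euclidean prox-mapping
        x = y + (rho * alpha / A_new) * (u_new - u)      # (17)
        u, A = u_new, A_new
    return x
```

For example, `rstm(lambda y: oracle_directional(grad_f(y)), u0, L, rho=len(u0), n_iters=1000)` would run the exact-oracle directional-search variant, assuming a gradient routine `grad_f` is available.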

choice $\alpha_0 = 1 - \frac{1}{\rho}$, we obtain that $\alpha_1 = \frac{1}{\rho}$ and

$\frac{\alpha_1}{A_1} \overset{(14)}{=} \frac{\alpha_1}{\rho^2\alpha_1^2} = \frac{1}{\rho}.$   (19)

Hence, $x_1 = u_1$ and $\gamma_1^0 = 0$, $\gamma_1^1 = 1$. Let us now assume that $x_k = \sum_{l=0}^k\gamma_k^lu_l$ and prove that $x_{k+1}$ is also a convex combination with coefficients given by (18). From (15), (17), we have

$x_{k+1} = y_{k+1} + \frac{\rho\alpha_{k+1}}{A_{k+1}}(u_{k+1} - u_k) = \frac{\alpha_{k+1}u_k + A_kx_k}{A_{k+1}} + \frac{\rho\alpha_{k+1}}{A_{k+1}}(u_{k+1} - u_k) = \left(1 - \frac{\alpha_{k+1}}{A_{k+1}}\right)x_k + \frac{\alpha_{k+1} - \rho\alpha_{k+1}}{A_{k+1}}u_k + \frac{\rho\alpha_{k+1}}{A_{k+1}}u_{k+1} = \left(1 - \frac{\alpha_{k+1}}{A_{k+1}}\right)\sum_{l=0}^k\gamma_k^lu_l + \frac{\alpha_{k+1} - \rho\alpha_{k+1}}{A_{k+1}}u_k + \frac{\rho\alpha_{k+1}}{A_{k+1}}u_{k+1}.$

Note that all the coefficients sum to 1. Next, using $\gamma_k^k = \frac{\rho\alpha_k}{A_k}$, we have

$x_{k+1} = \left(1 - \frac{\alpha_{k+1}}{A_{k+1}}\right)\sum_{l=0}^{k-1}\gamma_k^lu_l + \left(\left(1 - \frac{\alpha_{k+1}}{A_{k+1}}\right)\frac{\rho\alpha_k}{A_k} + \frac{\alpha_{k+1} - \rho\alpha_{k+1}}{A_{k+1}}\right)u_k + \frac{\rho\alpha_{k+1}}{A_{k+1}}u_{k+1}.$

So, we see that (18) holds for $k + 1$. It remains to show that $\gamma_{k+1}^l \ge 0$, $l = 0, \dots, k+1$. For $\gamma_{k+1}^l$, $l = 0, \dots, k-1$, and for $\gamma_{k+1}^{k+1}$ this is obvious. From (14), we have

$\alpha_{k+1} = \frac{1 + \sqrt{1 + 4\rho^2A_k}}{2\rho^2}.$

Thus, since $\{A_k\}$, $k \ge 0$, is a non-decreasing sequence, $\{\alpha_{k+1}\}$, $k \ge 0$, is also non-decreasing. From (14), we obtain $\frac{\alpha_{k+1}}{A_{k+1}} = \frac{1}{\rho^2\alpha_{k+1}}$, which means that this sequence is non-increasing. Thus, $\frac{\alpha_k}{A_k} \ge \frac{\alpha_{k+1}}{A_{k+1}} = \frac{1}{\rho^2\alpha_{k+1}}$ and $\frac{\alpha_k}{A_k} \ge \frac{\alpha_1}{A_1} = \frac{1}{\rho}$ for $k \ge 1$. These inequalities prove that $\gamma_{k+1}^k \ge 0$.

Lemma 2. Let the sequences $\{x_k, y_k, u_k, \alpha_k, A_k\}$, $k \ge 0$, be generated by Algorithm 1. Then, for all $u \in Q$, it holds that

$\alpha_{k+1}\langle\tilde{\nabla} f(y_{k+1}), u_k - u\rangle \le A_{k+1}(f(y_{k+1}) - f(x_{k+1})) + V[u_k](u) - V[u_{k+1}](u) + \alpha_{k+1}\rho\langle R_r\xi(y_{k+1}), u_k - u_{k+1}\rangle.$   (20)

Proof. Using Assumptions 1 and 2 with $\alpha = \alpha_{k+1}$, $y = y_{k+1}$, $u = u_k$, $u_+ = u_{k+1}$, we obtain

$\alpha_{k+1}\langle\tilde{\nabla} f(y_{k+1}), u_k - u_{k+1}\rangle \overset{(5)}{=} \alpha_{k+1}\rho\langle R_r(R_p^T\nabla f(y_{k+1}) + \xi(y_{k+1})), u_k - u_{k+1}\rangle \overset{(9)}{=} \alpha_{k+1}\rho\langle\nabla f(y_{k+1}), u_k - u_{k+1}\rangle + \alpha_{k+1}\rho\langle R_r\xi(y_{k+1}), u_k - u_{k+1}\rangle \overset{(17)}{=} A_{k+1}\langle\nabla f(y_{k+1}), y_{k+1} - x_{k+1}\rangle + \alpha_{k+1}\rho\langle R_r\xi(y_{k+1}), u_k - u_{k+1}\rangle.$   (21)

Note that, from the optimality condition in (16), for any $u \in Q$, we have

$\langle\nabla V[u_k](u_{k+1}) + \alpha_{k+1}\tilde{\nabla} f(y_{k+1}), u - u_{k+1}\rangle \ge 0.$   (22)

By the definition of $V[u](x)$, we obtain, for any $u \in Q$,

$V[u_k](u) - V[u_{k+1}](u) - V[u_k](u_{k+1}) = d(u) - d(u_k) - \langle\nabla d(u_k), u - u_k\rangle - \big(d(u) - d(u_{k+1}) - \langle\nabla d(u_{k+1}), u - u_{k+1}\rangle\big) - \big(d(u_{k+1}) - d(u_k) - \langle\nabla d(u_k), u_{k+1} - u_k\rangle\big) = \langle\nabla d(u_k) - \nabla d(u_{k+1}), u_{k+1} - u\rangle = -\langle\nabla V[u_k](u_{k+1}), u_{k+1} - u\rangle.$   (23)

Then, for any $u \in Q$,

$\alpha_{k+1}\langle\tilde{\nabla} f(y_{k+1}), u_k - u\rangle = \alpha_{k+1}\langle\tilde{\nabla} f(y_{k+1}), u_k - u_{k+1}\rangle + \alpha_{k+1}\langle\tilde{\nabla} f(y_{k+1}), u_{k+1} - u\rangle \overset{(22)}{\le} \alpha_{k+1}\langle\tilde{\nabla} f(y_{k+1}), u_k - u_{k+1}\rangle + \langle\nabla V[u_k](u_{k+1}), u - u_{k+1}\rangle \overset{(23)}{=} \alpha_{k+1}\langle\tilde{\nabla} f(y_{k+1}), u_k - u_{k+1}\rangle + V[u_k](u) - V[u_{k+1}](u) - V[u_k](u_{k+1}) \overset{(3)}{\le} \alpha_{k+1}\langle\tilde{\nabla} f(y_{k+1}), u_k - u_{k+1}\rangle + V[u_k](u) - V[u_{k+1}](u) - \frac{1}{2}\|u_k - u_{k+1}\|_E^2 \overset{(21),(17)}{=} A_{k+1}\langle\nabla f(y_{k+1}), y_{k+1} - x_{k+1}\rangle + \alpha_{k+1}\rho\langle R_r\xi(y_{k+1}), u_k - u_{k+1}\rangle + V[u_k](u) - V[u_{k+1}](u) - \frac{A_{k+1}^2}{2\rho^2\alpha_{k+1}^2}\|y_{k+1} - x_{k+1}\|_E^2 \overset{(14)}{=} A_{k+1}\left(\langle\nabla f(y_{k+1}), y_{k+1} - x_{k+1}\rangle - \frac{1}{2}\|y_{k+1} - x_{k+1}\|_E^2\right) + \alpha_{k+1}\rho\langle R_r\xi(y_{k+1}), u_k - u_{k+1}\rangle + V[u_k](u) - V[u_{k+1}](u) \overset{(17),(13)}{\le} A_{k+1}(f(y_{k+1}) - f(x_{k+1})) + V[u_k](u) - V[u_{k+1}](u) + \alpha_{k+1}\rho\langle R_r\xi(y_{k+1}), u_k - u_{k+1}\rangle.$

In the last inequality, we used Assumption 3 with $a = \frac{\rho\alpha_{k+1}}{A_{k+1}}$, $x = x_{k+1}$, $y = y_{k+1}$, $u = u_k$, $u_+ = u_{k+1}$.

Lemma 3. Let the sequences $\{x_k, y_k, u_k, \alpha_k, A_k\}$, $k \ge 0$, be generated by Algorithm 1. Then, for all $u \in Q$, it holds that

$\alpha_{k+1}\langle\nabla f(y_{k+1}), u_k - u\rangle \le A_{k+1}(f(y_{k+1}) - \mathbb{E}_{k+1}f(x_{k+1})) + V[u_k](u) - \mathbb{E}_{k+1}V[u_{k+1}](u) + \mathbb{E}_{k+1}\alpha_{k+1}\rho\langle R_r\xi(y_{k+1}), u - u_{k+1}\rangle,$   (24)

where $\mathbb{E}_{k+1}$ denotes the expectation conditioned on all the randomness up to step $k$.

Proof. First, for any $u \in Q$, by Assumption 1,

$\mathbb{E}_{k+1}\alpha_{k+1}\langle\tilde{\nabla} f(y_{k+1}), u_k - u\rangle \overset{(5)}{=} \mathbb{E}_{k+1}\alpha_{k+1}\rho\langle R_r(R_p^T\nabla f(y_{k+1}) + \xi(y_{k+1})), u_k - u\rangle \overset{(6)}{=} \alpha_{k+1}\langle\nabla f(y_{k+1}), u_k - u\rangle + \mathbb{E}_{k+1}\alpha_{k+1}\rho\langle R_r\xi(y_{k+1}), u_k - u\rangle.$   (25)

Taking the conditional expectation $\mathbb{E}_{k+1}$ in (20) of Lemma 2 and using (25), we obtain the statement of the Lemma.

Lemma 4. Let the sequences $\{x_k, y_k, u_k, \alpha_k, A_k\}$, $k \ge 0$, be generated by Algorithm 1. Then, for all $u \in Q$, it holds that

$\mathbb{E}_{k+1}A_{k+1}f(x_{k+1}) - A_kf(x_k) - \alpha_{k+1}\big(f(y_{k+1}) + \langle\nabla f(y_{k+1}), u - y_{k+1}\rangle\big) \le V[u_k](u) - \mathbb{E}_{k+1}V[u_{k+1}](u) + \mathbb{E}_{k+1}\alpha_{k+1}\rho\langle R_r\xi(y_{k+1}), u - u_{k+1}\rangle.$   (26)

Proof. For any $u \in Q$,

$\alpha_{k+1}\langle\nabla f(y_{k+1}), y_{k+1} - u\rangle = \alpha_{k+1}\langle\nabla f(y_{k+1}), y_{k+1} - u_k\rangle + \alpha_{k+1}\langle\nabla f(y_{k+1}), u_k - u\rangle \overset{(14),(15)}{=} A_k\langle\nabla f(y_{k+1}), x_k - y_{k+1}\rangle + \alpha_{k+1}\langle\nabla f(y_{k+1}), u_k - u\rangle \overset{\text{conv-ty}}{\le} A_k(f(x_k) - f(y_{k+1})) + \alpha_{k+1}\langle\nabla f(y_{k+1}), u_k - u\rangle \overset{(24)}{\le} A_k(f(x_k) - f(y_{k+1})) + A_{k+1}(f(y_{k+1}) - \mathbb{E}_{k+1}f(x_{k+1})) + V[u_k](u) - \mathbb{E}_{k+1}V[u_{k+1}](u) + \mathbb{E}_{k+1}\alpha_{k+1}\rho\langle R_r\xi(y_{k+1}), u - u_{k+1}\rangle \overset{(14)}{=} \alpha_{k+1}f(y_{k+1}) + A_kf(x_k) - \mathbb{E}_{k+1}A_{k+1}f(x_{k+1}) + V[u_k](u) - \mathbb{E}_{k+1}V[u_{k+1}](u) + \mathbb{E}_{k+1}\alpha_{k+1}\rho\langle R_r\xi(y_{k+1}), u - u_{k+1}\rangle.$   (27)

Rearranging terms, we obtain the statement of the Lemma.

Theorem 1. Let Assumptions 1, 2, 3 hold. Let the sequences $\{x_k, y_k, u_k, \alpha_k, A_k\}$, $k \ge 0$, be generated by Algorithm 1. Let $f^*$ be the optimal objective value and $x^*$ be an optimal point in Problem (4). Denote

$P_0^2 = A_0(f(x_0) - f^*) + V[u_0](x^*).$   (28)

1. If the oracle error $\xi(x)$ in (5) can be controlled and, on each iteration, the error level $\delta$ in (7) satisfies

$\delta \le \frac{P_0}{4\rho A_k},$   (29)

then, for all $k \ge 1$,

$\mathbb{E}f(x_k) - f^* \le \frac{3P_0^2}{2A_k},$

where $\mathbb{E}$ denotes the expectation with respect to all the randomness up to step $k$.

2. If the oracle error $\xi(x)$ in (5) can not be controlled, then, for all $k \ge 1$,

$\mathbb{E}f(x_k) - f^* \le \frac{2P_0^2}{A_k} + 4A_k\rho^2\delta^2.$

Proof. Let us change the counter in Lemma 4 from $k$ to $i$, fix $u = x^*$, take the full expectation in each inequality for $i = 0, \dots, k-1$ and sum all the inequalities for $i = 0, \dots, k-1$. Then,

$A_k\mathbb{E}f(x_k) - A_0f(x_0) \le \sum_{i=0}^{k-1}\alpha_{i+1}\mathbb{E}\big(f(y_{i+1}) + \langle\nabla f(y_{i+1}), x^* - y_{i+1}\rangle\big) + V[u_0](x^*) - \mathbb{E}V[u_k](x^*) + \sum_{i=0}^{k-1}\mathbb{E}\alpha_{i+1}\rho\langle R_r\xi(y_{i+1}), x^* - u_{i+1}\rangle \overset{\text{conv-ty},(14),(7)}{\le} (A_k - A_0)f(x^*) + V[u_0](x^*) - \mathbb{E}V[u_k](x^*) + \sum_{i=0}^{k-1}\alpha_{i+1}\rho\delta\,\mathbb{E}\|x^* - u_{i+1}\|_E.$

Rearranging terms and using (28), we obtain, for all $k \ge 1$,

$0 \le A_k(\mathbb{E}f(x_k) - f^*) \le P_0^2 - \mathbb{E}V[u_k](x^*) + \rho\delta\sum_{i=0}^{k-1}\alpha_{i+1}\mathbb{E}R_{i+1},$   (30)

where we denoted $R_i = \|u_i - x^*\|_E$, $i \ge 0$.

1. We first prove the first statement of the Theorem. We have

$\frac{1}{2}R_0^2 = \frac{1}{2}\|x^* - u_0\|_E^2 \overset{(3)}{\le} V[u_0](x^*) \overset{(28)}{\le} P_0^2.$   (31)

Hence, $\mathbb{E}R_0 = R_0 \le \sqrt{2}P_0 \le 2P_0$. Let $\mathbb{E}R_i \le 2P_0$ for all $i = 0, \dots, k-1$. Let us prove that $\mathbb{E}R_k \le 2P_0$. By the convexity of the square function, we obtain

$\frac{1}{2}(\mathbb{E}R_k)^2 \le \frac{1}{2}\mathbb{E}R_k^2 \overset{(3)}{\le} \mathbb{E}V[u_k](x^*) \overset{(30)}{\le} P_0^2 + \rho\delta\sum_{i=0}^{k-2}\alpha_{i+1}\cdot 2P_0 + \alpha_k\rho\delta\,\mathbb{E}R_k \overset{(14)}{=} P_0^2 + 2\rho\delta P_0(A_{k-1} - A_0) + \alpha_k\rho\delta\,\mathbb{E}R_k \le P_0^2 + 2\rho\delta P_0A_k + \alpha_k\rho\delta\,\mathbb{E}R_k.$   (32)

Since α k A k, k 0, by the choice of δ 9), we have ρδp 0 A k P 0 and α k ρδ A k ρδ P 0 So, we obtain an inequality for ER k 1 ER k) 3P 0 + P 0 4 ER k Solving this quadratic inequality in ER k, we obtain ER k P 0 4 + P 0 16 + 3P 0 = P 0 Thus, by induction, we have that, for all k 0, ER k P 0 Using the bounds ER i P 0, for all i = 0,, k, we obtain k 1 A k Efx k ) f ) 30) P0 14),9) + ρδ α i+1 ER i P0 + ρ P 0 A k A 0 ) P 0 3P 0 4ρA k i=0 This nishes the proof of the rst statement of the Theorem Now we prove the second statement of the Theorem First, from 30) for k = 1, we have 1 ER 1) 1 3) ER 1 EV [u 1 ]x ) 30) P0 + ρδα 1 ER 1 Solving this inequality in ER 1, we obtain ER 1 ρδα 1 + ρδα 1 ) + P0 ρδα 1 + P 0, 33) where we used that, for any a, b 0, a + b a + b Then, P 0 + ρδα 1 ER 1 P 0 + ρδα 1 ) + ρδα 1 P 0 P 0 + ρδ A 1 A 0 )) Thus, we have proved that the inequality k P0 + ρδ α i+1 ER i+1 i=0 P 0 + ρδ A k 1 A 0 )) 34) holds for k = Let us assume that it holds for some k and prove that it holds for k + 1 We have 1 ER k) 1 3) ER k EV [u k ]x ) 30) P0 + ρδ 34) k α i+1 ER i+1 + α k ρδer k i=0 P 0 + ρδ A k 1 A 0 )) + αk ρδer k 4 15

Solving this quadratic inequality in $\mathbb{E}R_k$, we obtain

$\mathbb{E}R_k \le \alpha_k\rho\delta + \sqrt{(\alpha_k\rho\delta)^2 + 2\big(P_0 + \sqrt{2}\rho\delta(A_{k-1} - A_0)\big)^2} \le 2\alpha_k\rho\delta + \sqrt{2}\big(P_0 + \sqrt{2}\rho\delta(A_{k-1} - A_0)\big),$   (35)

where we used that, for any $a, b \ge 0$, $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$. Further,

$P_0^2 + \rho\delta\sum_{i=0}^{k-1}\alpha_{i+1}\mathbb{E}R_{i+1} \overset{(34)}{\le} \big(P_0 + \sqrt{2}\rho\delta(A_{k-1} - A_0)\big)^2 + \rho\delta\alpha_k\mathbb{E}R_k \overset{(35)}{\le} \big(P_0 + \sqrt{2}\rho\delta(A_{k-1} - A_0)\big)^2 + \sqrt{2}\rho\delta\alpha_k\big(P_0 + \sqrt{2}\rho\delta(A_{k-1} - A_0)\big) + 2(\rho\delta\alpha_k)^2 \le \big(P_0 + \sqrt{2}\rho\delta(A_{k-1} - A_0) + \sqrt{2}\rho\delta\alpha_k\big)^2 = \big(P_0 + \sqrt{2}\rho\delta(A_k - A_0)\big)^2,$

which is (34) for $k + 1$. Using this inequality, we obtain

$A_k(\mathbb{E}f(x_k) - f^*) \overset{(30)}{\le} P_0^2 + \rho\delta\sum_{i=0}^{k-1}\alpha_{i+1}\mathbb{E}R_{i+1} \le \big(P_0 + \sqrt{2}\rho\delta(A_k - A_0)\big)^2 \le 2P_0^2 + 4\rho^2\delta^2A_k^2,$

which finishes the proof of the Theorem.

Let us now estimate the growth rate of the sequence $A_k$, $k \ge 0$, which will give the rate of convergence for Algorithm 1.

Lemma 5. Let the sequence $\{A_k\}$, $k \ge 0$, be generated by Algorithm 1. Then, for all $k \ge 1$, it holds that

$\frac{(k - 1 + 2\rho)^2}{4\rho^2} \le A_k \le \frac{(k - 1 + 2\rho)^2}{\rho^2}.$   (36)

Proof. As we showed in Lemma 1, $\alpha_1 = \frac{1}{\rho}$ and, hence, $A_1 = \alpha_0 + \alpha_1 = 1$. Thus, (36) holds for $k = 1$. Let us assume that (36) holds for some $k \ge 1$ and prove that it also holds for $k + 1$. From (14), we have a quadratic equation for $\alpha_{k+1}$:

$\rho^2\alpha_{k+1}^2 - \alpha_{k+1} - A_k = 0.$

Since we need to take the largest root, we obtain

$\alpha_{k+1} = \frac{1 + \sqrt{1 + 4\rho^2A_k}}{2\rho^2} = \frac{1}{2\rho^2} + \sqrt{\frac{1}{4\rho^4} + \frac{A_k}{\rho^2}} \ge \frac{1}{2\rho^2} + \frac{\sqrt{A_k}}{\rho} \ge \frac{1}{2\rho^2} + \frac{k - 1 + 2\rho}{2\rho^2} = \frac{k + 2\rho}{2\rho^2},$

where we used the induction assumption that (36) holds for $k$. On the other hand,

$\alpha_{k+1} = \frac{1}{2\rho^2} + \sqrt{\frac{1}{4\rho^4} + \frac{A_k}{\rho^2}} \le \frac{1}{\rho^2} + \frac{\sqrt{A_k}}{\rho} \le \frac{1}{\rho^2} + \frac{k - 1 + 2\rho}{\rho^2} = \frac{k + 2\rho}{\rho^2},$

where we used the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, $a, b \ge 0$. Using the obtained inequalities for $\alpha_{k+1}$, from (14) and (36) for $k$, we get

$A_{k+1} = A_k + \alpha_{k+1} \ge \frac{(k - 1 + 2\rho)^2}{4\rho^2} + \frac{k + 2\rho}{2\rho^2} \ge \frac{(k + 2\rho)^2}{4\rho^2}$

and

$A_{k+1} = A_k + \alpha_{k+1} \le \frac{(k - 1 + 2\rho)^2}{\rho^2} + \frac{k + 2\rho}{\rho^2} \le \frac{(k + 2\rho)^2}{\rho^2}.$

In the last inequality we used that $k \ge 1$, $\rho \ge 0$.
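As a quick sanity check of Lemma 5 (ours, not part of the paper), one can iterate the recursion (14) numerically with an illustrative value of $\rho$ and verify the two-sided bound (36):

```python
import numpy as np

rho = 5.0
A = 1.0 - 1.0 / rho                                       # A_0
for k in range(1, 101):
    alpha = (1.0 + np.sqrt(1.0 + 4.0 * rho ** 2 * A)) / (2.0 * rho ** 2)
    A += alpha                                            # now A = A_k
    lower = (k - 1 + 2 * rho) ** 2 / (4 * rho ** 2)
    upper = (k - 1 + 2 * rho) ** 2 / rho ** 2
    assert lower - 1e-9 <= A <= upper + 1e-9              # bound (36)
```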

where we used the induction assumption that 36) holds for k On the other hand, α k+1 = 1 ρ + 1 4ρ + A k 4 ρ 1 ρ + A k ρ 1 ρ + k 1 + ρ ρ = k + ρ ρ, where we used inequality a + b a + b, a, b 0 Using the obtained inequalities for α k+1, from 14) and 36) for k, we get and = A k + α k+1 = A k + α k+1 k 1 + ρ) 4ρ k 1 + ρ) ρ In the last inequality we used that k 1, ρ 0 + k + ρ ρ + k + ρ ρ k + ρ) 4ρ k + ρ) ρ Remark 1 According to Theorem 1, if the desired accuracy of the solution is ε, ie the goal is to nd such ˆx Q that Efˆx) f ε, then the Algorithm 1 should be stopped when 3P0 A k ε Then 1 A k and the oracle error level δ should satisfy δ P 0 4ρA k ε 6ρP 0 ε 3P 0 From Lemma 5, we obtain that 3P 0 A k ε holds when k is the smallest integer satisfying k 1 + ρ) 4ρ 3P 0 ε This means that, to obtain an ε-solution, it is enough to choose { } 6P k = max ρ 0 + 1 ρ, 0 ε Note that this dependence on ε means that the proposed method is accelerated 3 Examples of Applications In this section, we apply our general framework, which consists of assumptions 1,, 3, RSTM as listed in Algorithm 1 and convergence rate Theorem 1, to obtain several particular algorithms and their convergence rate We consider Problem 4) and, for each particular case, introduce a particular setup, which includes properties of the objective function f, available information about this function, properties of the feasible set Q Based on each setup, we show how the Randomized Inexact Oracle is constructed and check that the assumptions 1,, 3 hold Then, we obtain convergence rate guarantee for each particular algorithm as a corollary of Theorem 1 Our examples include accelerated random directional search 17

with inexact directional derivative, accelerated random block-coordinate descent with inexact block derivatives, accelerated random derivative-free directional search with inexact function values, and accelerated random derivative-free block-coordinate descent with inexact function values. Accelerated random directional search and accelerated random derivative-free directional search were developed in Nesterov and Spokoiny [2017], but for the case of exact directional derivatives and exact function values. Also, in the existing methods, a Gaussian random vector is used for randomization. Accelerated random block-coordinate descent was introduced in Nesterov [2012] and further developed by several authors (see the Introduction for an extended review). Existing methods of this type use exact information on the block derivatives and only the Euclidean proximal setup. In contrast, our algorithm works with inexact derivatives and is able to work with the entropy proximal setup. To the best of our knowledge, our accelerated random derivative-free block-coordinate descent with inexact function values is new. This method also can work with the entropy proximal setup.

3.1 Accelerated Random Directional Search

In this subsection, we introduce accelerated random directional search with inexact directional derivative for unconstrained problems with Euclidean proximal setup. We assume that, for all $i = 1, \dots, n$, $Q_i = E_i = \mathbb{R}$, $\|x^{(i)}\|_i^2 = (x^{(i)})^2$, $x^{(i)} \in E_i$, $d_i(x^{(i)}) = \frac{1}{2}(x^{(i)})^2$, $x^{(i)} \in E_i$, and, hence, $V_i[z^{(i)}](x^{(i)}) = \frac{1}{2}(x^{(i)} - z^{(i)})^2$, $x^{(i)}, z^{(i)} \in E_i$. Thus, $Q = E = \mathbb{R}^n$. Further, we assume that $f$ in (4) has $L$-Lipschitz-continuous gradient with respect to the Euclidean norm, i.e.

$f(x) \le f(y) + \langle\nabla f(y), x - y\rangle + \frac{L}{2}\|x - y\|_2^2, \quad x, y \in E.$   (37)

We set $\beta_i = L$, $i = 1, \dots, n$. Then, by the definitions in Subsection 1.1, we have $\|x\|_E^2 = L\|x\|_2^2$, $x \in E$, $d(x) = \frac{L}{2}\|x\|_2^2 = \frac{1}{2}\|x\|_E^2$, $x \in E$, and $V[z](x) = \frac{L}{2}\|x - z\|_2^2 = \frac{1}{2}\|x - z\|_E^2$, $x, z \in E$. Also, we have $\|g\|_{E,*}^2 = L^{-1}\|g\|_2^2$, $g \in E^*$. We assume that, at any point $x \in E$, one can calculate an inexact derivative of $f$ in a direction $e \in E$:

$\tilde{f}'(x, e) = \langle\nabla f(x), e\rangle + \xi(x),$

where $e$ is a random vector uniformly distributed on the Euclidean sphere of radius 1, i.e. on $S_2(1) := \{s \in \mathbb{R}^n : \|s\|_2 = 1\}$, and the directional derivative error $\xi(x) \in \mathbb{R}$ is uniformly bounded in absolute value by an error level $\Delta$, i.e. $|\xi(x)| \le \Delta$, $x \in E$. Since we are in the Euclidean setting, we consider $e$ also as an element of $E^*$. We use $n(\langle\nabla f(x), e\rangle + \xi(x))e$ as the Randomized Inexact Oracle. Let us check the assumptions stated in Subsection 1.2.

Randomized Inexact Oracle. In this setting, we have $\rho = n$, $H = \mathbb{R}$, $R_p^T : E^* \to \mathbb{R}$ is given by $R_p^Tg = \langle g, e\rangle$, $g \in E^*$, and $R_r : \mathbb{R} \to E^*$ is given by $R_rt = te$, $t \in \mathbb{R}$. Thus, $\tilde{\nabla} f(x) = n(\langle\nabla f(x), e\rangle + \xi(x))e$. One can prove that $\mathbb{E}_e\, n\langle\nabla f(x), e\rangle e = n\mathbb{E}_e(ee^T)\nabla f(x) = \nabla f(x)$, $x \in E$, and, thus, (6) holds. Also, for all $x \in E$, we have $\|R_r\xi(x)\|_{E,*} = \frac{1}{\sqrt{L}}\|\xi(x)e\|_2 \le \frac{\Delta}{\sqrt{L}}$, which proves (7) if we take $\delta = \frac{\Delta}{\sqrt{L}}$.

Regularity of Prox-Mapping. Substituting our particular choice of $Q$, $V[u](x)$, $\tilde{\nabla} f(x)$ in (8), we obtain

$u_+ = \arg\min_{x \in \mathbb{R}^n}\left\{\frac{L}{2}\|x - u\|_2^2 + \alpha n(\langle\nabla f(y), e\rangle + \xi(y))\langle e, x\rangle\right\} = u - \frac{\alpha n}{L}(\langle\nabla f(y), e\rangle + \xi(y))e.$

Hence, since $\langle e, e\rangle = 1$, we have

$\langle R_rR_p^T\nabla f(y), u - u_+\rangle = \left\langle\langle\nabla f(y), e\rangle e, \frac{\alpha n}{L}(\langle\nabla f(y), e\rangle + \xi(y))e\right\rangle = \langle\nabla f(y), e\rangle\langle e, e\rangle\frac{\alpha n}{L}(\langle\nabla f(y), e\rangle + \xi(y)) = \left\langle\nabla f(y), \frac{\alpha n}{L}(\langle\nabla f(y), e\rangle + \xi(y))e\right\rangle = \langle\nabla f(y), u - u_+\rangle,$

which proves (9).

Smoothness. By the definition of $\|\cdot\|_E$ and (37), we have

$f(x) \le f(y) + \langle\nabla f(y), x - y\rangle + \frac{L}{2}\|x - y\|_2^2 = f(y) + \langle\nabla f(y), x - y\rangle + \frac{1}{2}\|x - y\|_E^2, \quad x, y \in E,$

and (13) holds.

We have checked that all the assumptions listed in Subsection 1.2 hold. Thus, we can obtain the following convergence rate result for random directional search as a corollary of Theorem 1 and Lemma 5.

Corollary 1. Let Algorithm 1 with $\tilde{\nabla} f(x) = n(\langle\nabla f(x), e\rangle + \xi(x))e$, where $e$ is random and uniformly distributed over the Euclidean sphere of radius 1, be applied to Problem (4) in the setting of this subsection. Let $f^*$ be the optimal objective value and $x^*$ be an optimal point in Problem (4). Assume that the directional derivative error $\xi(x)$ satisfies $|\xi(x)| \le \Delta$, $x \in E$. Denote

$P_0^2 = \left(1 - \frac{1}{n}\right)(f(x_0) - f^*) + \frac{L}{2}\|u_0 - x^*\|_2^2.$

1. If the directional derivative error $\xi(x)$ can be controlled and, on each iteration, the error level $\Delta$ satisfies

$\Delta \le \frac{P_0\sqrt{L}}{4nA_k},$

then, for all $k \ge 1$,

$\mathbb{E}f(x_k) - f^* \le \frac{6n^2P_0^2}{(k - 1 + 2n)^2},$

where $\mathbb{E}$ denotes the expectation with respect to all the randomness up to step $k$.

2. If the directional derivative error $\xi(x)$ can not be controlled, then, for all $k \ge 1$,

$\mathbb{E}f(x_k) - f^* \le \frac{8n^2P_0^2}{(k - 1 + 2n)^2} + \frac{4\Delta^2(k - 1 + 2n)^2}{L}.$

Remark 2. According to Remark 1 and due to the relation $\delta = \frac{\Delta}{\sqrt{L}}$, we obtain that the error level in the directional derivative should satisfy

$\Delta \le \frac{\varepsilon\sqrt{L}}{6nP_0}.$

At the same time, to obtain an $\varepsilon$-solution for Problem (4), it is enough to choose

$k = \max\left\{n\sqrt{\frac{6P_0^2}{\varepsilon}} + 1 - 2n,\ 0\right\}.$

3.2 Accelerated Random Coordinate Descent

In this subsection, we introduce accelerated random coordinate descent with inexact coordinate derivatives for problems with separable constraints and Euclidean proximal setup. We assume that, for all $i = 1, \dots, n$, $E_i = \mathbb{R}$, $Q_i \subseteq E_i$ are closed and convex, $\|x^{(i)}\|_i^2 = (x^{(i)})^2$, $x^{(i)} \in E_i$, $d_i(x^{(i)}) = \frac{1}{2}(x^{(i)})^2$, $x^{(i)} \in Q_i$, and, hence, $V_i[z^{(i)}](x^{(i)}) = \frac{1}{2}(x^{(i)} - z^{(i)})^2$, $x^{(i)}, z^{(i)} \in Q_i$. Thus, $Q = \bigotimes_{i=1}^n Q_i$ has separable structure. Let us denote by $e_i \in E$ the $i$-th coordinate vector. Then, for $i = 1, \dots, n$, the $i$-th coordinate derivative of $f$ is $f_i'(x) = \langle\nabla f(x), e_i\rangle$. We assume that the gradient of $f$ in (4) is coordinate-wise Lipschitz continuous with constants $L_i$, $i = 1, \dots, n$, i.e.

$|f_i'(x + he_i) - f_i'(x)| \le L_i|h|, \quad h \in \mathbb{R}, \ i = 1, \dots, n, \ x \in Q.$   (38)

We set $\beta_i = L_i$, $i = 1, \dots, n$. Then, by the definitions in Subsection 1.1, we have $\|x\|_E^2 = \sum_{i=1}^n L_i(x^{(i)})^2$, $x \in E$, $d(x) = \frac{1}{2}\sum_{i=1}^n L_i(x^{(i)})^2$, $x \in Q$, and $V[z](x) = \frac{1}{2}\sum_{i=1}^n L_i(x^{(i)} - z^{(i)})^2$, $x, z \in Q$. Also, we have $\|g\|_{E,*}^2 = \sum_{i=1}^n L_i^{-1}(g^{(i)})^2$, $g \in E^*$. We assume that, at any point $x \in Q$, one can calculate an inexact coordinate derivative of $f$:

$\tilde{f}_i'(x) = \langle\nabla f(x), e_i\rangle + \xi(x),$

where the coordinate $i$ is chosen from $1, \dots, n$ at random with uniform probability $\frac{1}{n}$, and the coordinate derivative error $\xi(x) \in \mathbb{R}$ is uniformly bounded in absolute value by $\Delta$, i.e. $|\xi(x)| \le \Delta$, $x \in Q$. Since we are in the Euclidean setting, we consider $e_i$ also as an element of $E^*$. We use $n(\langle\nabla f(x), e_i\rangle + \xi(x))e_i$ as the Randomized Inexact Oracle. Let us check the assumptions stated in Subsection 1.2.

Randomized Inexact Oracle. In this setting, we have $\rho = n$, $H = E_i = \mathbb{R}$, $R_p^T : E^* \to \mathbb{R}$ is given by $R_p^Tg = \langle g, e_i\rangle$, $g \in E^*$, and $R_r : \mathbb{R} \to E^*$ is given by $R_rt = te_i$, $t \in \mathbb{R}$. Thus, $\tilde{\nabla} f(x) = n(\langle\nabla f(x), e_i\rangle + \xi(x))e_i$, $x \in Q$. One can prove that $\mathbb{E}_i\, n\langle\nabla f(x), e_i\rangle e_i = n\mathbb{E}_i(e_ie_i^T)\nabla f(x) = \nabla f(x)$, $x \in Q$, and, thus, (6) holds. Also, for all $x \in Q$, we have $\|R_r\xi(x)\|_{E,*} = \frac{1}{\sqrt{L_i}}|\xi(x)| \le \frac{\Delta}{\sqrt{L_0}}$, where $L_0 = \min_{i=1,\dots,n}L_i$. This proves (7) with $\delta = \frac{\Delta}{\sqrt{L_0}}$.

Regularity of Prox-Mapping. The separable structure of $Q$ and $V[u](x)$ means that problem (8) boils down to $n$ independent problems of the form

$u_+^{(j)} = \arg\min_{x^{(j)} \in Q_j}\left\{\frac{L_j}{2}(u^{(j)} - x^{(j)})^2 + \alpha\langle\tilde{\nabla} f(y), e_j\rangle x^{(j)}\right\}, \quad j = 1, \dots, n.$

Since $\tilde{\nabla} f(y)$ has only one, the $i$-th, non-zero component, $\langle\tilde{\nabla} f(y), e_j\rangle$ is zero for all $j \ne i$. Thus, $u - u_+$ has one, the $i$-th, non-zero component and $\langle e_i, u - u_+\rangle e_i = u - u_+$. Hence,

$\langle R_rR_p^T\nabla f(y), u - u_+\rangle = \langle\langle\nabla f(y), e_i\rangle e_i, u - u_+\rangle = \langle\nabla f(y), e_i\rangle\langle e_i, u - u_+\rangle = \langle\nabla f(y), \langle e_i, u - u_+\rangle e_i\rangle = \langle\nabla f(y), u - u_+\rangle,$

which proves (9).

Smoothness. By the standard reasoning, using (38), one can prove that, for all $i = 1, \dots, n$,

$f(x + he_i) \le f(x) + h\langle\nabla f(x), e_i\rangle + \frac{L_ih^2}{2}, \quad h \in \mathbb{R}, \ x \in Q.$   (39)

Let $u, y \in Q$, $a \in \mathbb{R}$, and $x = y + a(u_+ - u) \in Q$. As we have shown above, $u_+ - u$ has only one, the $i$-th, non-zero component. Hence, there exists $h \in \mathbb{R}$ such that $u_+ - u = he_i$ and $x = y + ahe_i$. Thus, by the definition of $\|\cdot\|_E$ and (39), we have

$f(x) = f(y + ahe_i) \le f(y) + ah\langle\nabla f(y), e_i\rangle + \frac{L_i(ah)^2}{2} = f(y) + \langle\nabla f(y), ahe_i\rangle + \frac{1}{2}\|ahe_i\|_E^2 = f(y) + \langle\nabla f(y), x - y\rangle + \frac{1}{2}\|x - y\|_E^2.$

This proves (13).

We have checked that all the assumptions listed in Subsection 1.2 hold. Thus, we can obtain the following convergence rate result for random coordinate descent as a corollary of Theorem 1 and Lemma 5.

Corollary 2. Let Algorithm 1 with $\tilde{\nabla} f(x) = n(\langle\nabla f(x), e_i\rangle + \xi(x))e_i$, where $i$ is chosen uniformly at random from $1, \dots, n$, be applied to Problem (4) in the setting of this subsection. Let $f^*$ be the optimal objective value and $x^*$ be an optimal point in Problem (4). Assume that the coordinate derivative error $\xi(x)$ satisfies $|\xi(x)| \le \Delta$, $x \in Q$. Denote

$P_0^2 = \left(1 - \frac{1}{n}\right)(f(x_0) - f^*) + \frac{1}{2}\sum_{i=1}^n L_i(u_0^{(i)} - x^{*(i)})^2.$

1. If the coordinate derivative error $\xi(x)$ can be controlled and, on each iteration, the error level $\Delta$ satisfies

$\Delta \le \frac{P_0\sqrt{L_0}}{4nA_k},$

then, for all $k \ge 1$,

$\mathbb{E}f(x_k) - f^* \le \frac{6n^2P_0^2}{(k - 1 + 2n)^2},$

where $\mathbb{E}$ denotes the expectation with respect to all the randomness up to step $k$.

2. If the coordinate derivative error $\xi(x)$ can not be controlled, then, for all $k \ge 1$,

$\mathbb{E}f(x_k) - f^* \le \frac{8n^2P_0^2}{(k - 1 + 2n)^2} + \frac{4\Delta^2(k - 1 + 2n)^2}{L_0}.$

Remark 3. According to Remark 1 and due to the relation $\delta = \frac{\Delta}{\sqrt{L_0}}$, we obtain that the error level in the coordinate derivative should satisfy

$\Delta \le \frac{\varepsilon\sqrt{L_0}}{6nP_0}.$

At the same time, to obtain an $\varepsilon$-solution for Problem (4), it is enough to choose

$k = \max\left\{n\sqrt{\frac{6P_0^2}{\varepsilon}} + 1 - 2n,\ 0\right\}.$

3.3 Accelerated Random Block-Coordinate Descent

In this subsection, we consider two block-coordinate settings. The first one is the Euclidean, which is usually used in the literature for accelerated block-coordinate descent. The second one is the entropy, which, to the best of our knowledge, is analyzed in this context for the first time. We develop accelerated random block-coordinate descent with inexact block derivatives for problems with simple constraints in these two settings and their combination.

Euclidean setup. We assume that, for all $i = 1, \dots, n$, $E_i = \mathbb{R}^{p_i}$; $Q_i$ is a simple closed convex set; $\|x^{(i)}\|_i^2 = \langle B_ix^{(i)}, x^{(i)}\rangle$, $x^{(i)} \in E_i$, where $B_i$ is a symmetric positive semidefinite matrix; $d_i(x^{(i)}) = \frac{1}{2}\|x^{(i)}\|_i^2$, $x^{(i)} \in Q_i$, and, hence, $V_i[z^{(i)}](x^{(i)}) = \frac{1}{2}\|x^{(i)} - z^{(i)}\|_i^2$, $x^{(i)}, z^{(i)} \in Q_i$.

Entropy setup. We assume that, for all $i = 1, \dots, n$, $E_i = \mathbb{R}^{p_i}$; $Q_i$ is the standard simplex in $\mathbb{R}^{p_i}$, i.e., $Q_i = \{x^{(i)} \in \mathbb{R}^{p_i}_+ : \sum_{j=1}^{p_i}[x^{(i)}]_j = 1\}$; $\|x^{(i)}\|_i = \|x^{(i)}\|_1 = \sum_{j=1}^{p_i}|[x^{(i)}]_j|$, $x^{(i)} \in E_i$; $d_i(x^{(i)}) = \sum_{j=1}^{p_i}[x^{(i)}]_j\ln[x^{(i)}]_j$, $x^{(i)} \in Q_i$, and, hence, $V_i[z^{(i)}](x^{(i)}) = \sum_{j=1}^{p_i}[x^{(i)}]_j\ln\frac{[x^{(i)}]_j}{[z^{(i)}]_j}$, $x^{(i)}, z^{(i)} \in Q_i$.

Note that, in each block, one can also choose other proximal setups from Ben-Tal and Nemirovski [2015]. A combination of different setups in different blocks is also possible, i.e., in one block it is possible to choose the Euclidean setup and in another block one can choose the entropy setup. A code sketch of the entropy prox-mapping step is given below.
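For the entropy setup, the block prox-mapping in step (16) has a well-known closed form (the exponentiated-gradient / multiplicative-weights update). The sketch below (ours, not from the paper) shows it for a single block with weight $\beta_i = L_i$:

```python
import numpy as np

def entropy_prox_step(u, g, alpha, L_i):
    """argmin over the simplex of  L_i * sum_j x_j*ln(x_j/u_j) + alpha * <g, x>,
    whose minimizer satisfies x_j proportional to u_j * exp(-(alpha / L_i) * g_j)."""
    z = -(alpha / L_i) * np.asarray(g, dtype=float)
    z -= z.max()                          # shift for numerical stability
    w = np.asarray(u, dtype=float) * np.exp(z)
    return w / w.sum()
```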

Using operators U i, i = 1,, n dened in 1), for each i = 1,, n, the i-th block derivative of f can be written as f ix) = U T i fx) We assume that the gradient of f in 4) is blockwise Lipschitz continuous with constants L i, i = 1,, n with respect to chosen norms i, ie f ix + U i h i) ) f ix) i, L i h i) i, h i) E i, i = 1,, n, x Q 40) We set β i = L i, i = 1,, n Then, by denitions in Subsection 11, we have x E = n i=1 L i x i) i, x E, dx) = n i=1 L id i x i) ), x Q, V [z]x) = n i=1 L iv i [z i) ]x i) ), x, z Q Also, we have g E, = n i=1 L 1 i g i) i,, g E We assume that, at any point x Q, one can calculate an inexact block derivative of f f ix) = U T i fx) + ξx), where a block number i is chosen from 1,, n randomly uniformly, the block derivative error ξx) Ei is uniformly bounded in norm by, ie ξx) i,, x Q, i = 1,, n As Randomized Inexact Oracle, we use nũiui T fx) + ξx)), where Ũi is dened in ) Let us check the assumptions stated in Subsection 1 Randomized Inexact Oracle In this setting, we have ρ = n, H = E i, R T p : E Ei is given by R T p g = Ui T g, g E, R r : Ei E is given by R r g i) = Ũig i), g i) Ei Thus, fx) = nũiu T i fx) + ξx)), x Q Since i R[1, n], one can prove that E i nũiui T fx) = fx), x Q, and, thus, 6) holds Also, for all x Q, we have R r ξx) E, = Ũiξx) E, = 1 Li ξx) i, L0, where L 0 = min i=1,,n L i This proves 7) with δ = L0 Regularity of Prox-Mapping Separable structure of Q and V [u]x) means that the problem 8) boils down to n independent problems of the form { u j) + = arg min L j V [u j) ]x j) ) + α U T fy), } x j) j x j), j = 1,, n Q j Since fy) has non-zero components only in the block i, U T j fy) is zero for all j i Thus, u u + has non-zero components only in the block i and U i Ũ T i u u + ) = u u + Hence, R r R T p fy), u u + = ŨiU T i fy), u u + = fy), U i Ũ T i u u + ) = fy), u u +, which proves 9) Smoothness By the standard reasoning, using 40), one can prove that, for all i = 1,, n, fx + U i h i) ) fx) + U T i fx), h i) + L i hi) i, h i) E i, x Q 41) 3

Let $u, y \in Q$, $a \in \mathbb{R}$, and $x = y + a(u_+ - u) \in Q$. As we have shown above, $u_+ - u$ has non-zero components only in the block $i$. Hence, there exists $h^{(i)} \in E_i$ such that $u_+ - u = U_ih^{(i)}$ and $x = y + aU_ih^{(i)}$. Thus, by the definition of $\|\cdot\|_E$ and (41), we have

$f(x) = f(y + aU_ih^{(i)}) \le f(y) + \langle U_i^T\nabla f(y), ah^{(i)}\rangle_i + \frac{L_i}{2}\|ah^{(i)}\|_i^2 = f(y) + \langle\nabla f(y), aU_ih^{(i)}\rangle + \frac{1}{2}\|aU_ih^{(i)}\|_E^2 = f(y) + \langle\nabla f(y), x - y\rangle + \frac{1}{2}\|x - y\|_E^2.$

This proves (13).

We have checked that all the assumptions listed in Subsection 1.2 hold. Thus, we can obtain the following convergence rate result for random block-coordinate descent as a corollary of Theorem 1 and Lemma 5.

Corollary 3. Let Algorithm 1 with $\tilde{\nabla} f(x) = n\tilde{U}_i(U_i^T\nabla f(x) + \xi(x))$, where $i$ is chosen uniformly at random from $1, \dots, n$, be applied to Problem (4) in the setting of this subsection. Let $f^*$ be the optimal objective value and $x^*$ be an optimal point in Problem (4). Assume that the block derivative error $\xi(x)$ satisfies $\|\xi(x)\|_{i,*} \le \Delta$, $x \in Q$. Denote

$P_0^2 = \left(1 - \frac{1}{n}\right)(f(x_0) - f^*) + V[u_0](x^*).$

1. If the block derivative error $\xi(x)$ can be controlled and, on each iteration, the error level $\Delta$ satisfies

$\Delta \le \frac{P_0\sqrt{L_0}}{4nA_k},$

then, for all $k \ge 1$,

$\mathbb{E}f(x_k) - f^* \le \frac{6n^2P_0^2}{(k - 1 + 2n)^2},$

where $\mathbb{E}$ denotes the expectation with respect to all the randomness up to step $k$.

2. If the block derivative error $\xi(x)$ can not be controlled, then, for all $k \ge 1$,

$\mathbb{E}f(x_k) - f^* \le \frac{8n^2P_0^2}{(k - 1 + 2n)^2} + \frac{4\Delta^2(k - 1 + 2n)^2}{L_0}.$

Remark 4. According to Remark 1 and due to the relation $\delta = \frac{\Delta}{\sqrt{L_0}}$, we obtain that the block derivative error should satisfy

$\Delta \le \frac{\varepsilon\sqrt{L_0}}{6nP_0}.$

At the same time, to obtain an $\varepsilon$-solution for Problem (4), it is enough to choose

$k = \max\left\{n\sqrt{\frac{6P_0^2}{\varepsilon}} + 1 - 2n,\ 0\right\}.$

3.4 Accelerated Random Derivative-Free Directional Search

In this subsection, we consider the same setting as in Subsection 3.1, except for the Randomized Inexact Oracle. Instead of the directional derivative, we use here its finite-difference approximation. We assume that, for all $i = 1, \dots, n$, $Q_i = E_i = \mathbb{R}$, $\|x^{(i)}\|_i^2 = (x^{(i)})^2$, $x^{(i)} \in E_i$, $d_i(x^{(i)}) = \frac{1}{2}(x^{(i)})^2$, $x^{(i)} \in E_i$, and, hence, $V_i[z^{(i)}](x^{(i)}) = \frac{1}{2}(x^{(i)} - z^{(i)})^2$, $x^{(i)}, z^{(i)} \in E_i$. Thus, $Q = E = \mathbb{R}^n$. Further, we assume that $f$ in (4) has $L$-Lipschitz-continuous gradient with respect to the Euclidean norm, i.e.

$f(x) \le f(y) + \langle\nabla f(y), x - y\rangle + \frac{L}{2}\|x - y\|_2^2, \quad x, y \in E.$   (42)

We set $\beta_i = L$, $i = 1, \dots, n$. Then, by the definitions in Subsection 1.1, we have $\|x\|_E^2 = L\|x\|_2^2$, $x \in E$, $d(x) = \frac{L}{2}\|x\|_2^2 = \frac{1}{2}\|x\|_E^2$, $x \in E$, and $V[z](x) = \frac{L}{2}\|x - z\|_2^2 = \frac{1}{2}\|x - z\|_E^2$, $x, z \in E$. Also, we have $\|g\|_{E,*}^2 = L^{-1}\|g\|_2^2$, $g \in E^*$. We assume that, at any point $x \in E$, one can calculate an inexact value $\tilde{f}(x)$ of the function $f$, such that $|\tilde{f}(x) - f(x)| \le \Delta$, $x \in E$. To approximate the gradient of $f$, we use

$\tilde{\nabla} f(x) = n\,\frac{\tilde{f}(x + \tau e) - \tilde{f}(x)}{\tau}\,e,$

where $\tau > 0$ is a small parameter, which will be chosen later, and $e \in E$ is a random vector uniformly distributed on the Euclidean sphere of radius 1, i.e. on $S_2(1) := \{s \in \mathbb{R}^n : \|s\|_2 = 1\}$. Since we are in the Euclidean setting, we consider $e$ also as an element of $E^*$. Let us check the assumptions stated in Subsection 1.2.

Randomized Inexact Oracle. First, let us show that the finite-difference approximation of the gradient of $f$ can be expressed in the form of (5). We have

$\tilde{\nabla} f(x) = n\,\frac{\tilde{f}(x + \tau e) - \tilde{f}(x)}{\tau}\,e = n\left(\langle\nabla f(x), e\rangle + \frac{1}{\tau}\big(\tilde{f}(x + \tau e) - \tilde{f}(x) - \tau\langle\nabla f(x), e\rangle\big)\right)e.$

Taking $\rho = n$, $H = \mathbb{R}$, $R_p^T : E^* \to \mathbb{R}$ given by $R_p^Tg = \langle g, e\rangle$, $g \in E^*$, and $R_r : \mathbb{R} \to E^*$ given by $R_rt = te$, $t \in \mathbb{R}$, we obtain $\tilde{\nabla} f(x) = n(\langle\nabla f(x), e\rangle + \xi(x))e$, where

$\xi(x) = \frac{1}{\tau}\big(\tilde{f}(x + \tau e) - \tilde{f}(x) - \tau\langle\nabla f(x), e\rangle\big).$

One can prove that $\mathbb{E}_e\, n\langle\nabla f(x), e\rangle e = n\mathbb{E}_e(ee^T)\nabla f(x) = \nabla f(x)$, $x \in E$, and, thus, (6) holds. It remains to prove (7), i.e., to find $\delta$ such that, for all $x \in E$, $\|R_r\xi(x)\|_{E,*} \le \delta$. We have

$\|R_r\xi(x)\|_{E,*} = \frac{1}{\sqrt{L}}|\xi(x)|\,\|e\|_2 = \frac{1}{\sqrt{L}\,\tau}\big|\tilde{f}(x + \tau e) - f(x + \tau e) - (\tilde{f}(x) - f(x)) + f(x + \tau e) - f(x) - \tau\langle\nabla f(x), e\rangle\big| \le \frac{2\Delta}{\tau\sqrt{L}} + \frac{\tau\sqrt{L}}{2}.$

Here we used that $|\tilde{f}(x) - f(x)| \le \Delta$, $x \in E$, and (42). So, we have that (7) holds with $\delta = \frac{2\Delta}{\tau\sqrt{L}} + \frac{\tau\sqrt{L}}{2}$. To balance both terms, we choose $\tau = \frac{2\sqrt{\Delta}}{\sqrt{L}}$, which leads to the equality $\delta = 2\sqrt{\Delta}$.

Regularity of Prox-Mapping. This assumption can be checked in the same way as in Subsection 3.1.

Smoothness. This assumption can be checked in the same way as in Subsection 3.1.

We have checked that all the assumptions listed in Subsection 1.2 hold. Thus, we can obtain the following convergence rate result for random derivative-free directional search as a corollary of Theorem 1 and Lemma 5.

Corollary 4. Let Algorithm 1 with $\tilde{\nabla} f(x) = n\,\frac{\tilde{f}(x + \tau e) - \tilde{f}(x)}{\tau}\,e$, where $e$ is random and uniformly distributed over the Euclidean sphere of radius 1, be applied to Problem (4) in the setting of this subsection. Let $f^*$ be the optimal objective value and $x^*$ be an optimal point in Problem (4). Assume that the function value error satisfies $|\tilde{f}(x) - f(x)| \le \Delta$, $x \in E$. Denote

$P_0^2 = \left(1 - \frac{1}{n}\right)(f(x_0) - f^*) + \frac{L}{2}\|u_0 - x^*\|_2^2.$

1. If the error in the value of the objective $f$ can be controlled and, on each iteration, the error level $\Delta$ satisfies

$\Delta \le \frac{P_0^2}{64n^2A_k^2},$

and $\tau = \frac{2\sqrt{\Delta}}{\sqrt{L}}$, then, for all $k \ge 1$,

$\mathbb{E}f(x_k) - f^* \le \frac{6n^2P_0^2}{(k - 1 + 2n)^2},$

where $\mathbb{E}$ denotes the expectation with respect to all the randomness up to step $k$.

2. If the error in the value of the objective $f$ can not be controlled and $\tau = \frac{2\sqrt{\Delta}}{\sqrt{L}}$, then, for all $k \ge 1$,

$\mathbb{E}f(x_k) - f^* \le \frac{8n^2P_0^2}{(k - 1 + 2n)^2} + 16\Delta(k - 1 + 2n)^2.$

Remark 5. According to Remark 1 and due to the relation $\delta = 2\sqrt{\Delta}$, we obtain that the error level in the function value should satisfy

$\Delta \le \frac{\varepsilon^2}{144n^2P_0^2}.$

The parameter $\tau$ should satisfy

$\tau \le \frac{\varepsilon}{6nP_0\sqrt{L}}.$
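A sketch (ours, not from the paper) of the derivative-free oracle of this subsection, using the step size $\tau = 2\sqrt{\Delta}/\sqrt{L}$ derived above; `f_noisy` stands for the inexact function-value oracle $\tilde{f}$ with error at most $\Delta$:

```python
import numpy as np

def oracle_derivative_free(f_noisy, x, L, Delta):
    """~grad f(x) = n * (f~(x + tau*e) - f~(x)) / tau * e, with tau = 2*sqrt(Delta/L)."""
    n = x.shape[0]
    e = np.random.randn(n)
    e /= np.linalg.norm(e)                # uniform direction on the unit sphere
    tau = 2.0 * np.sqrt(Delta / L)        # balances the two terms, so that delta = 2*sqrt(Delta)
    return n * (f_noisy(x + tau * e) - f_noisy(x)) / tau * e
```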