RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION


GUANGHUI LAN AND YI ZHOU

Abstract. In this paper, we consider a class of finite-sum convex optimization problems defined over a distributed multiagent network with m agents connected to a central server. In particular, the objective function consists of the average of m (≥ 1) smooth components associated with each network agent together with a strongly convex term. Our major contribution is to develop a new randomized incremental gradient algorithm, namely the random gradient extrapolation method (RGEM), which does not require any exact gradient evaluation even for the initial point, but can achieve the optimal O(log(1/ɛ)) complexity bound in terms of the total number of gradient evaluations of component functions to solve the finite-sum problems. Furthermore, we demonstrate that for stochastic finite-sum optimization problems, RGEM maintains the optimal O(1/ɛ) complexity (up to a certain logarithmic factor) in terms of the number of stochastic gradient computations, but attains an O(m log(1/ɛ)) complexity in terms of communication rounds (each round involves only one agent). It is worth noting that the former bound is independent of the number of agents m, while the latter one only linearly depends on m, or even √m for ill-conditioned problems. To the best of our knowledge, this is the first time that these complexity bounds have been obtained for distributed and stochastic optimization problems. Moreover, our algorithms were developed based on a novel dual perspective of Nesterov's accelerated gradient method.

Keywords: finite-sum optimization, gradient extrapolation, randomized method, distributed machine learning, stochastic optimization.

1. Introduction. The main problem of interest in this paper is the finite-sum convex programming (CP) problem given in the form of

ψ* := min_{x∈X} { ψ(x) := (1/m) Σ_{i=1}^m f_i(x) + μw(x) }.   (1.1)

Here, X ⊆ R^n is a closed convex set, the f_i : X → R, i = 1, ..., m, are smooth convex functions with Lipschitz continuous gradients over X, i.e., there exist L_i ≥ 0 such that

‖∇f_i(x_1) − ∇f_i(x_2)‖_* ≤ L_i ‖x_1 − x_2‖, ∀ x_1, x_2 ∈ X,   (1.2)

and w : X → R is a strongly convex function with modulus 1 w.r.t. a norm ‖·‖, i.e.,

w(x_1) − w(x_2) − ⟨w′(x_2), x_1 − x_2⟩ ≥ (1/2)‖x_1 − x_2‖², ∀ x_1, x_2 ∈ X,   (1.3)

where w′(·) denotes any subgradient (or gradient) of w(·) and μ ≥ 0 is a given constant. Hence, the objective function ψ is strongly convex whenever μ > 0. For notational convenience, we also denote f(x) ≡ (1/m) Σ_{i=1}^m f_i(x), L ≡ (1/m) Σ_{i=1}^m L_i, and L̂ = max_{i=1,...,m} L_i. It is easy to see that for some L_f ≥ 0,

‖∇f(x_1) − ∇f(x_2)‖_* ≤ L_f ‖x_1 − x_2‖ ≤ L ‖x_1 − x_2‖, ∀ x_1, x_2 ∈ X.   (1.4)

We also consider a class of stochastic finite-sum optimization problems given by

ψ* := min_{x∈X} { ψ(x) := (1/m) Σ_{i=1}^m E_{ξ_i}[F_i(x, ξ_i)] + μw(x) },   (1.5)

where the ξ_i's are random variables with support Ξ_i ⊆ R^d. It can easily be seen that (1.5) is a special case of (1.1) with f_i(x) = E_{ξ_i}[F_i(x, ξ_i)], i = 1, ..., m. However, different from deterministic finite-sum optimization problems, only noisy gradient information of each component function f_i can be accessed for the stochastic finite-sum optimization problem in (1.5). The deterministic finite-sum problem (1.1) can model empirical risk minimization in machine learning and statistical inference, and hence has become the subject of intensive studies during the past few years.

H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332 (email: george.lan@isye.gatech.edu).
H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332 (email: yizhou@gatech.edu).

Our study of the finite-sum problems (1.1) and (1.5) has also been motivated by the emerging need for distributed optimization and machine learning. Under such settings, each component function f_i is associated with an agent i, i = 1, ..., m, and the agents are connected through a distributed network. While different topologies can be considered for distributed optimization (see, e.g., Figures 1.1 and 1.2), in this paper we focus on the star network where m agents are connected to one central server, and all agents only communicate with the server (see Figure 1.1). These types of distributed optimization problems have several unique features. Firstly, they allow for data privacy, since no local data is stored on the server. Secondly, network agents behave independently and may not be responsive at the same time. Thirdly, the communication between the server and agents can be expensive and has high latency. Finally, by considering the stochastic finite-sum optimization problem, we are interested not only in the deterministic empirical risk minimization, but also in the generalization risk for distributed machine learning. Moreover, we allow the private data for each agent to be collected in an online (streaming) fashion. One typical example of the aforementioned distributed problems is Federated Learning, recently introduced by Google in [5]. As a particular example, in the ℓ_2-regularized logistic regression problem, we have

f_i(x) = l_i(x) := (1/N_i) Σ_{j=1}^{N_i} log(1 + exp(−b_j^i ⟨a_j^i, x⟩)), i = 1, ..., m,
w(x) = R(x) := (1/2)‖x‖_2²,

provided that f_i is the loss function of agent i with training data {a_j^i, b_j^i}_{j=1}^{N_i} ⊆ R^n × {−1, 1}, and μ := λ is the penalty parameter (an implementation sketch of this example is given below). For minimization of the generalized risk, the f_i's are given in the form of an expectation, i.e.,

f_i(x) = l_i(x) := E_{ξ_i}[log(1 + exp(−ξ_i^T x))], i = 1, ..., m,

where the random variable ξ_i models the underlying distribution of the training dataset of agent i. Note that another type of topology for distributed optimization is the multi-agent network without a central server, namely the decentralized setting, as shown in Figure 1.2, where the agents can only communicate with their neighbors to update information; please refer to [21, 23, 32] and references therein for decentralized algorithms.

Fig. 1.1. A distributed network with 5 agents and one server. Fig. 1.2. An example of the decentralized network.

During the past few years, randomized incremental gradient (RIG) methods have emerged as an important class of first-order methods for finite-sum optimization (e.g., [4, 16, 35, 8, 9, 22, 1, 14, 24]). For solving nonsmooth finite-sum problems, Nemirovski et al. [26, 27] showed that stochastic subgradient (mirror) descent methods can possibly save up to O(√m) subgradient evaluations. By utilizing the smoothness properties of the objective, Lan [18] showed that one can separate the impact of variance from other deterministic components for stochastic gradient descent, and presented a new class of accelerated stochastic gradient descent methods to further improve these complexity bounds. However, the overall rate of convergence of these stochastic methods is still sublinear, even for smooth and strongly convex finite-sum problems (see [11, 12]). Inspired by these works and by the success of the incremental aggregated gradient method of Blatt et al. [4], Schmidt et al. [29] presented a stochastic average gradient (SAG) method, which uses randomized sampling of the f_i to update the gradients and can achieve a linear rate of convergence, i.e., an O{m + (L/μ) log(1/ɛ)} complexity bound, to solve unconstrained finite-sum problems (1.1).
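To make the ℓ_2-regularized logistic regression example above concrete, the following minimal Python sketch implements the component losses f_i, their gradients, the regularizer w, and the overall objective ψ in (1.1). It is an illustration only: the array names (A_i, b_i, data) and the function names are ours, not the paper's.

```python
import numpy as np

# A minimal sketch of the l2-regularized logistic regression example above:
# agent i holds data (A_i, b_i) with labels in {-1, +1}; f_i is the average
# logistic loss, w is the squared-norm regularizer (modulus 1), and mu = lambda
# weights the strongly convex term. All names here are illustrative.

def f_i(x, A_i, b_i):
    """Component loss l_i(x) = (1/N_i) * sum_j log(1 + exp(-b_j <a_j, x>))."""
    z = -b_i * (A_i @ x)
    return np.mean(np.logaddexp(0.0, z))   # stable log(1 + exp(z))

def grad_f_i(x, A_i, b_i):
    """Gradient of l_i: (1/N_i) * sum_j -b_j * sigmoid(-b_j <a_j, x>) * a_j."""
    z = -b_i * (A_i @ x)
    s = 1.0 / (1.0 + np.exp(-z))           # sigmoid(z)
    return A_i.T @ (-b_i * s) / A_i.shape[0]

def w(x):
    """Strongly convex term R(x) = 0.5 * ||x||_2^2."""
    return 0.5 * np.dot(x, x)

def psi(x, data, mu):
    """Objective (1.1): average of the component losses plus mu * w(x)."""
    return sum(f_i(x, A, b) for A, b in data) / len(data) + mu * w(x)
```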

Johnson and Zhang [16] later presented a stochastic variance reduced gradient (SVRG) method, which computes an estimator of ∇f by iteratively updating the gradient of one randomly selected f_i within the current exact gradient information, and by re-evaluating the exact gradient from time to time. Xiao and Zhang [35] then extended SVRG to solve proximal finite-sum problems (1.1). All these methods exhibit an improved O{(m + L/μ) log(1/ɛ)} complexity bound, and Defazio et al. [8] also presented an improved SAG method, called SAGA, that can achieve such a complexity result. Compared to the class of stochastic dual methods (e.g., [31, 30, 36]), each iteration of the RIG methods only involves the computation of ∇f_i, rather than the solution of a more complicated subproblem

argmin_y { ⟨g, y⟩ + f_i(y) + (μ/2)‖y‖² },

which may not have an explicit solution [31].

Noting that most of these RIG methods are not optimal even for m = 1, much recent research effort has been directed to the acceleration of RIG methods. In 2015, Lan and Zhou [22] proposed a RIG method, namely the randomized primal-dual gradient (RPDG) method, and showed that its total number of gradient computations of the f_i can be bounded by

O{ (m + √(mL/μ)) log(1/ɛ) }.   (1.6)

The RPDG method utilizes a direct acceleration without even using the concept of variance reduction, evolving from the randomized primal-dual methods developed in [36, 7] for solving saddle-point problems. Lan and Zhou [22] also established a lower complexity bound for RIG methods by showing that the number of gradient evaluations of the f_i required by any RIG method to find an ɛ-solution of (1.1), i.e., a point x̄ ∈ X s.t. E‖x̄ − x*‖² ≤ ɛ, cannot be smaller than

Ω{ (m + √(mL/μ)) log(1/ɛ) },   (1.7)

whenever the dimension n ≥ (k + m/2)/log(1/q), where k is the total number of iterations and q = 1 + 2/(√(mL/((m+1)μ)) − 1). Simultaneously, Lin et al. [24] presented a catalyst scheme which utilizes a restarting technique to accelerate the SAG method of [29] (or other non-accelerated first-order methods), and thus can possibly improve the complexity bounds obtained by SVRG and SAGA to (1.6) (under the Euclidean setting). Allen-Zhu [1] later showed that one can also directly accelerate SVRG to achieve the optimal rate of convergence (1.6). All these accelerated RIG methods can save up to O(√m) gradient evaluations of the f_i compared with optimal deterministic first-order methods when m ≥ L/μ.

It should be noted that most existing RIG methods were inspired by empirical risk minimization on a single server (or cluster) in machine learning, rather than on a set of agents distributed over a network. Under the distributed setting, methods requiring full gradient computation and/or restarting from time to time may incur extra communication and synchronization costs. As a consequence, methods which require fewer full gradient computations (e.g., SAG, SAGA and RPDG) seem to be more advantageous in this regard. An interesting but unresolved question in stochastic optimization is whether there exists a method which does not require the computation of any full gradients (even at the initial point), but can still achieve the optimal rate of convergence in (1.6). Moreover, little attention in the study of RIG methods has been paid to the stochastic finite-sum problem in (1.5), which is important for generalization risk minimization in machine learning. Very recently, there has been some progress on stochastic primal-dual type methods for solving problem (1.5). For example, Lan, Lee and Zhou [21] proposed a stochastic decentralized communication sliding method that can achieve the optimal O(1/ɛ) sampling complexity and the best-known O(1/√ɛ) complexity bound on communication rounds for solving stochastic decentralized strongly convex problems.
For the distributed setting with a central server, by using a mini-batch technique to collect gradient information and any stochastic gradient based algorithm as a black box to update the iterates, Dekel et al. [9] presented a distributed mini-batch algorithm with a batch size of o(√m) that can obtain the O(1/ɛ) sampling complexity (i.e., the number of stochastic gradients) for stochastic strongly convex problems, and hence implies at least an O(1/√ɛ) bound on the communication complexity.

An asynchronous version was later proposed by Feyzmahdavian et al. [10] that maintains the above convergence rate for regularized stochastic strongly convex problems. It should be pointed out that these mini-batch based distributed algorithms require sampling from all network agents iteratively, and hence lead to at least an O(m/√ɛ) rate of convergence in terms of communication costs between the server and the agents. It is unknown whether there exists an algorithm which requires only a significantly smaller number of communication rounds (e.g., O(log(1/ɛ))), but can achieve the optimal O(1/ɛ) sampling complexity for solving the stochastic finite-sum problem in (1.5).

The main contribution of this paper is to introduce a new randomized incremental gradient type method to solve (1.1) and (1.5). Firstly, we develop a random gradient extrapolation method (RGEM) for solving (1.1) that does not require any exact gradient evaluations of f. For strongly convex problems, we demonstrate that RGEM can still achieve the optimal rate of convergence (1.6) under the assumption that the average of the gradients of the f_i at the initial point x^0 is bounded by σ_0². To the best of our knowledge, this is the first time that such an optimal RIG method without any exact gradient evaluations has been presented for solving (1.1) in the literature. In fact, without any full gradient computation, RGEM possesses iteration costs as low as those of pure stochastic gradient descent (SGD) methods, but achieves a much faster, and in fact optimal, linear rate of convergence for solving deterministic finite-sum problems. In comparison with the well-known randomized Kaczmarz method [33], which can be viewed as an enhanced version of SGD that achieves a linear rate of convergence for solving linear systems, RGEM has a better convergence rate in terms of the dependence on the condition number L/μ.

Secondly, we develop a stochastic version of RGEM and establish its optimal convergence properties for solving stochastic finite-sum problems (1.5). More specifically, we assume that only noisy first-order information about one randomly selected component function f_i can be accessed via a stochastic first-order (SFO) oracle at each iteration. In other words, at each iteration only one randomly selected network agent needs to compute an estimator of its gradient by sampling from its local data using the SFO oracle, instead of performing an exact gradient evaluation of its component function f_i. Note that for these problems it is difficult to compute exact gradients even at the initial point. Under the standard assumptions for centralized stochastic optimization, i.e., that the gradient estimators computed by the SFO oracle are unbiased and have bounded variance σ², the number of stochastic gradient evaluations performed by RGEM to solve (1.5) can be bounded by¹

Õ{ (σ_0²/m + σ²)/(μ²ɛ) + (μ‖x^0 − x*‖² + ψ(x^0) − ψ*)/(μɛ) },   (1.8)

for finding a point x̄ ∈ X s.t. E‖x̄ − x*‖² ≤ ɛ. Moreover, by utilizing the mini-batch technique, RGEM can achieve an

O{ (m + √(mL̂/μ)) log(1/ɛ) }   (1.9)

complexity bound in terms of the number of communication rounds, where each round involves only the communication between the server and one randomly selected agent. This bound seems to be optimal, since it matches the lower complexity bound (1.7) for RIG methods to solve deterministic finite-sum problems. It is worth noting that the former bound (1.8) is independent of the number of agents m, while the latter one (1.9) only linearly depends on m, or even √m for ill-conditioned problems.
To the best of our knowledge, this is the first time in the literature that such a RIG type method has been developed for solving stochastic finite-sum problems (1.5) that can achieve the optimal communication complexity and a nearly optimal (up to a logarithmic factor) sampling complexity.

RGEM is developed based on a novel algorithmic framework, namely the gradient extrapolation method (GEM), that we introduce in this paper for solving black-box convex optimization (i.e., m = 1). The development of GEM was inspired by our recent studies on the relation between accelerated gradient methods and primal-dual gradient methods. In particular, it is observed in [22] that Nesterov's accelerated gradient method is a special primal-dual gradient (PDG) method in which the extrapolation step is performed in the primal space.

¹ Õ indicates that the rate of convergence is stated up to a logarithmic factor log(1/ɛ).

Such a primal extrapolation step, however, might result in a search point outside the feasible region under the randomized setting used in the RPDG method mentioned above. In view of this deficiency of the PDG and RPDG methods, we propose to switch the primal and dual spaces in primal-dual gradient methods, and to perform the extrapolation step in the dual (gradient) space. The resulting new first-order method, i.e., GEM, can be viewed as a dual version of Nesterov's accelerated gradient method, and we show that it can also achieve the optimal rate of convergence for black-box convex optimization. RGEM is a randomized version of GEM which only computes the gradient of one randomly selected component function f_i at each iteration. It utilizes the gradient extrapolation step not only for predicting dual information, as in GEM, but also for estimating exact gradients. As a result, it has several advantages over RPDG. Firstly, RPDG requires the restrictive assumption that each f_i be differentiable with Lipschitz continuous gradients over the whole of R^n, due to its primal extrapolation step. RGEM relaxes this assumption to having Lipschitz gradients over the feasible set X (see (1.2)), and hence can be applied to a much broader class of problems. Secondly, RGEM possesses a simpler convergence analysis, carried out in the primal space, thanks to its simplified algorithmic scheme. RPDG, by contrast, has a more complicated algorithmic scheme, which contains a primal extrapolation step and a gradient (dual) prediction step in addition to solving a primal proximal subproblem, and thus leads to an intricate primal-dual convergence analysis. Last but not least, it is unknown whether RPDG can maintain the optimal convergence rate (1.6) without the exact gradient evaluation of f during initialization.

This paper is organized as follows. In Section 2 we present the proposed random gradient extrapolation methods (RGEM) and their convergence properties for solving (1.1) and (1.5). In order to provide more insight into the design of the algorithmic scheme of RGEM, we give an introduction to the gradient extrapolation method (GEM) and its relation to the primal-dual gradient method, as well as to Nesterov's method, in Section 3. Section 4 is devoted to the convergence analysis of RGEM. Some concluding remarks are made in Section 5.

1.1. Notation and terminology. We use ‖·‖ to denote a general norm in R^n without specific mention, and ‖·‖_* to denote its conjugate norm. For any p ≥ 1, ‖·‖_p denotes the standard p-norm in R^n, i.e., ‖x‖_p^p = Σ_{i=1}^n |x_i|^p for any x ∈ R^n. For any convex function h, ∂h(x) is its subdifferential at x. For a given strongly convex function w with modulus 1 (see (1.3)), we define a prox-function associated with w as

P(x^0, x) ≡ P_w(x^0, x) := w(x) − [w(x^0) + ⟨w′(x^0), x − x^0⟩],   (1.10)

where w′(x^0) ∈ ∂w(x^0) is an arbitrary subgradient of w at x^0. By the strong convexity of w, we have

P(x^0, x) ≥ (1/2)‖x − x^0‖², ∀ x, x^0 ∈ X.   (1.11)

It should be pointed out that the prox-function P(·,·) described above is a generalized Bregman distance, in the sense that w is not necessarily differentiable. This is different from the standard definition of the Bregman distance [5, 2, 3, 17, 6]. Throughout this paper, we assume that the prox-mapping associated with X and w, given by

M_X(g, x^0, η) := argmin_{x∈X} { ⟨g, x⟩ + μw(x) + ηP(x^0, x) },   (1.12)

is easily computable for any x^0 ∈ X, g ∈ R^n, μ ≥ 0, η > 0. For any real number r, ⌈r⌉ and ⌊r⌋ denote the nearest integers to r from above and below, respectively. R_+ and R_{++} denote, respectively, the sets of nonnegative and positive real numbers.
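In the Euclidean setting w(x) = (1/2)‖x‖_2², the prox-function (1.10) reduces to P(x^0, x) = (1/2)‖x − x^0‖_2², and the prox-mapping (1.12) admits a closed form up to a projection onto X. The sketch below is ours, not the paper's, and assumes a projection oracle `project` for X:

```python
import numpy as np

# A minimal sketch of the prox-mapping (1.12) in the Euclidean setting
# w(x) = 0.5*||x||^2, where P(x0, x) = 0.5*||x - x0||^2. Minimizing
# <g, x> + mu*w(x) + eta*P(x0, x) over X reduces to projecting the
# unconstrained minimizer onto X; `project` is an assumed oracle for X.

def prox_mapping(g, x0, eta, mu, project=lambda v: v):
    """M_X(g, x0, eta) = argmin_{x in X} <g,x> + mu/2||x||^2 + eta/2||x-x0||^2."""
    # First-order condition of the unconstrained problem: g + mu*x + eta*(x - x0) = 0.
    x_unconstrained = (eta * x0 - g) / (mu + eta)
    return project(x_unconstrained)

# Example projection: X = Euclidean ball of radius r.
def project_ball(v, r=1.0):
    n = np.linalg.norm(v)
    return v if n <= r else (r / n) * v
```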
2. Algorithms and main results. This section consists of three subsections. We first present, in Subsection 2.1, an optimal random gradient extrapolation method (RGEM) for solving the distributed finite-sum problem (1.1), and then discuss, in Subsection 2.2, a stochastic version of RGEM for solving the stochastic finite-sum problem (1.5). Subsection 2.3 is devoted to the implementation of RGEM in a distributed setting and to a discussion of its communication complexity.

2.1. RGEM for deterministic finite-sum optimization. The basic scheme of RGEM is formally stated in Algorithm 1. The algorithm simply initializes the gradients as y^{−1} = y^0 = 0. At each iteration, RGEM requires new gradient information from only one randomly selected component function f_{i_t}, but maintains m pairs of search points and gradients (x_i^t, y_i^t), i = 1, ..., m, which are stored by the corresponding agents in the distributed network. More specifically, it first performs the gradient extrapolation step (2.1) and the primal proximal mapping (2.2). Then a randomly selected block x_{i_t}^t is updated in (2.3), and the corresponding component gradient ∇f_{i_t} is computed in (2.4). As can be seen from Algorithm 1, RGEM does not require any exact gradient evaluations.

Algorithm 1 A random gradient extrapolation method (RGEM)
Input: Let x^0 ∈ X, and the nonnegative parameters {α_t}, {η_t}, and {τ_t} be given.
Initialization: Set x_i^0 = x^0, i = 1, ..., m, and y^{−1} = y^0 = 0. ▷ No exact gradient evaluation for initialization
for t = 1, ..., k do
  Choose i_t according to Prob{i_t = i} = 1/m, i = 1, ..., m.
  Update z^t = (x^t, y^t) according to
    ỹ^t = y^{t−1} + α_t (y^{t−1} − y^{t−2}),   (2.1)
    x^t = M_X( (1/m) Σ_{i=1}^m ỹ_i^t, x^{t−1}, η_t ),   (2.2)
    x_i^t = (1 + τ_t)^{−1}(x^t + τ_t x_i^{t−1}) if i = i_t, and x_i^t = x_i^{t−1} if i ≠ i_t,   (2.3)
    y_i^t = ∇f_i(x_i^t) if i = i_t, and y_i^t = y_i^{t−1} if i ≠ i_t.   (2.4)
end for
Output: For some θ_t > 0, t = 1, ..., k, set
    x̄^k := (Σ_{t=1}^k θ_t)^{−1} Σ_{t=1}^k θ_t x^t.   (2.5)

Note that the computation of x^t in (2.2) requires an involved computation of (1/m) Σ_{i=1}^m ỹ_i^t. In order to save computational time when implementing this algorithm, we suggest computing this quantity recursively as follows. Let us denote g^t ≡ (1/m) Σ_{i=1}^m y_i^t, t = 1, ..., k. Clearly, in view of the fact that y_i^t = y_i^{t−1} for i ≠ i_t, we have

g^t = g^{t−1} + (1/m)(y_{i_t}^t − y_{i_t}^{t−1}).   (2.6)

Also, by the definition of g^t and (2.1), we have

(1/m) Σ_{i=1}^m ỹ_i^t = (1/m) Σ_{i=1}^m y_i^{t−1} + (α_t/m)(y_{i_{t−1}}^{t−1} − y_{i_{t−1}}^{t−2}) = g^{t−1} + (α_t/m)(y_{i_{t−1}}^{t−1} − y_{i_{t−1}}^{t−2}).   (2.7)

Using these two ideas, we can compute (1/m) Σ_{i=1}^m ỹ_i^t in two steps: i) initialize g^0 = 0, and update g^t as in (2.6) after the gradient evaluation step (2.4); ii) replace (2.1) by (2.7) to compute (1/m) Σ_{i=1}^m ỹ_i^t. Also note that the difference y_{i_t}^t − y_{i_t}^{t−1} can be saved, since it is used in both (2.6) and (2.7) in the next iteration. These enhancements will be incorporated into the distributed setting in Subsection 2.3 to possibly save communication costs; a compact implementation sketch incorporating them is given below.

It is also interesting to observe the differences between RGEM and RPDG [22]. RGEM has only one extrapolation step (2.1), which combines two types of predictions. One is to predict future gradients using historical data, and the other is to obtain an estimator of the current exact gradient of f from the randomly updated gradient information of the f_i. The RPDG method, however, needs two extrapolation steps, in both the primal and the dual spaces.
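The following Python sketch puts Algorithm 1 and the recursions (2.6)-(2.7) together. It is a minimal illustration under our own assumptions: `grad_f_i(i, x)` returns ∇f_i(x), `prox(g, x_prev, eta)` evaluates the prox-mapping (1.12) with μ fixed (e.g., the Euclidean form sketched in Subsection 1.1), and `theta(t)` returns the output weight θ_t.

```python
import numpy as np

# A minimal sketch of Algorithm 1 (RGEM) using the recursions (2.6)-(2.7),
# so that the average (1/m) * sum_i y_i^t is never recomputed from scratch.
# grad_f_i, prox, and theta are assumed oracles (see the lead-in above);
# the constant step sizes alpha, tau, eta follow Theorem 2.1 below.

def rgem(grad_f_i, prox, theta, x0, m, k, alpha, tau, eta, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    x_agents = np.tile(x0, (m, 1))          # x_i^0 = x^0 for all agents
    y = np.zeros((m, x0.size))              # y^{-1} = y^0 = 0
    g = np.zeros(x0.size)                   # g^0 = (1/m) sum_i y_i^0 = 0
    dy = np.zeros(x0.size)                  # last change y_{i_t}^t - y_{i_t}^{t-1}
    sum_x, sum_theta = np.zeros(x0.size), 0.0
    for t in range(1, k + 1):
        i = rng.integers(m)                 # Prob{i_t = i} = 1/m
        y_tilde_avg = g + (alpha / m) * dy  # (2.7): dual extrapolation
        x = prox(y_tilde_avg, x, eta)       # (2.2): primal prox step
        x_agents[i] = (x + tau * x_agents[i]) / (1.0 + tau)   # (2.3)
        new_grad = grad_f_i(i, x_agents[i])                   # (2.4)
        dy = new_grad - y[i]
        g += dy / m                         # (2.6): recursive average update
        y[i] = new_grad
        sum_x += theta(t) * x               # accumulate the output (2.5)
        sum_theta += theta(t)
    return sum_x / sum_theta                # x_bar^k
```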

Due to the existence of the primal extrapolation step, RPDG cannot guarantee that the search points at which it performs gradient evaluations fall within the feasible set X. Hence, it requires the assumption that the f_i's are differentiable with Lipschitz continuous gradients over R^n. Such a strong assumption is not required by RGEM, since all the primal iterates generated by RGEM stay within the feasible region X. As a result, RGEM can deal with a much wider class of problems than RPDG. Moreover, RGEM requires no exact gradient computation for initialization, which provides a fully-distributed algorithmic framework under the assumption that there exists σ_0 ≥ 0 such that

(1/m) Σ_{i=1}^m ‖∇f_i(x^0)‖_*² ≤ σ_0²,   (2.8)

where x^0 is the given initial point.

We now provide a constant step-size policy for RGEM for solving strongly convex problems of the form (1.1), and show that the resulting algorithm exhibits an optimal linear rate of convergence in Theorem 2.1. The proof of Theorem 2.1 can be found in Subsection 4.1.

Theorem 2.1. Let x* be an optimal solution of (1.1), let x^k and x̄^k be defined in (2.2) and (2.5), respectively, and let L̂ = max_{i=1,...,m} L_i. Also let {τ_t}, {η_t} and {α_t} be set to

τ_t ≡ τ = 1/(m(1−α)) − 1, η_t ≡ η = μα/(1−α), and α_t ≡ mα.   (2.9)

If (2.8) holds and α is set as

α = 1 − 1/(m + √(m² + 16mL̂/μ)),   (2.10)

then

EP(x^k, x*) ≤ (2α^k/μ) Δ_{0,σ_0},   (2.11)
Eψ(x̄^k) − ψ(x*) ≤ (16 max{μ, L̂}/μ) α^{k/2} Δ_{0,σ_0},   (2.12)

where

Δ_{0,σ_0} := μP(x^0, x*) + ψ(x^0) − ψ* + σ_0²/μ.   (2.13)

In view of Theorem 2.1, we can provide bounds on the total number of gradient evaluations performed by RGEM to find a stochastic ɛ-solution of problem (1.1), i.e., a point x̄ ∈ X s.t. Eψ(x̄) − ψ* ≤ ɛ. Theorem 2.1 implies that the number of gradient evaluations of the f_i performed by RGEM to find a stochastic ɛ-solution of (1.1) can be bounded by

K(ɛ, C, σ_0) = 2(m + √(m² + 16C)) log( 16 max{μ, L̂} Δ_{0,σ_0}/(μɛ) ) = O{ (m + √(mL̂/μ)) log(1/ɛ) }.   (2.14)

Here C = mL̂/μ. Therefore, whenever √(mC) log(1/ɛ) is dominating, and L_f and L̂ are of the same order of magnitude, RGEM can save up to O(√m) gradient evaluations of the component functions f_i compared with optimal deterministic first-order methods. More specifically, RGEM does not require any exact gradient computation, and its communication cost is similar to that of pure stochastic gradient descent. To the best of our knowledge, this is the first time that such an optimal RIG method has been presented for solving (1.1) in the literature.

It should be pointed out that while the rates of convergence of RGEM obtained in Theorem 2.1 are stated in terms of expectation, we can develop large-deviation results for these rates of convergence using techniques similar to those in [22] for solving strongly convex problems. Furthermore, if a one-time exact gradient evaluation is available at the initial point, i.e., y^{−1} = y^0 = (∇f_1(x^0), ..., ∇f_m(x^0)), we can drop assumption (2.8) and employ the more aggressive stepsize policy with

α = 1 − 2/(m + √(m² + 8mL̂/μ)).
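For concreteness, the constant step-size policy (2.9)-(2.10) of Theorem 2.1 can be computed as follows. This is a sketch based on the formulas as reconstructed above; the function name is ours.

```python
import math

# Constant step-size policy of Theorem 2.1 (as reconstructed above).
# Returns the scalars used by the rgem() sketch in Subsection 2.1, e.g.:
#   a_t, tau, eta, theta = rgem_parameters(m, L_hat, mu)
#   x_bar = rgem(grad_f_i, prox, theta, x0, m, k, a_t, tau, eta)

def rgem_parameters(m, L_hat, mu):
    alpha = 1.0 - 1.0 / (m + math.sqrt(m * m + 16.0 * m * L_hat / mu))  # (2.10)
    tau = 1.0 / (m * (1.0 - alpha)) - 1.0                               # (2.9)
    eta = mu * alpha / (1.0 - alpha)                                    # (2.9)
    alpha_t = m * alpha                           # extrapolation weight in (2.1)
    theta = lambda t: alpha ** (-t)               # output weights for the linear rate
    return alpha_t, tau, eta, theta
```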

Similarly, we can demonstrate that the number of gradient evaluations of the f_i performed by RGEM with this initialization method to find a stochastic ɛ-solution can be bounded by

2(m + √(m² + 8C)) log( 6 max{μ, L̂} Δ_{0,0}/(μɛ) ) + m = O{ (m + √(mL̂/μ)) log(1/ɛ) }.

2.2. RGEM for stochastic finite-sum optimization. In this subsection we discuss stochastic finite-sum optimization and online learning problems, where only noisy gradient information about the f_i can be accessed via a stochastic first-order (SFO) oracle. In particular, for any given point x_i^t ∈ X, the SFO oracle outputs a vector G_i(x_i^t, ξ_i^t) s.t.

E_ξ[G_i(x_i^t, ξ_i^t)] = ∇f_i(x_i^t), i = 1, ..., m,   (2.15)
E_ξ[‖G_i(x_i^t, ξ_i^t) − ∇f_i(x_i^t)‖_*²] ≤ σ², i = 1, ..., m.   (2.16)

We also assume throughout this subsection that the norm ‖·‖ is associated with the inner product ⟨·,·⟩. As shown in Algorithm 2, the RGEM for stochastic finite-sum optimization is naturally obtained by replacing the gradient evaluation of f_i in Algorithm 1 (see (2.4)) by the stochastic gradient estimator of ∇f_i given in (2.17). In particular, at each iteration we collect B_t stochastic gradients of only one randomly selected component f_i, and take their average as the stochastic estimator of ∇f_i. Moreover, it should be mentioned that the way RGEM initializes its gradients, i.e., y^{−1} = y^0 = 0, is very important for stochastic optimization, since it is usually impossible to compute exact gradients of expectation functions even at the initial point.

Algorithm 2 RGEM for stochastic finite-sum optimization
This algorithm is the same as Algorithm 1 except that (2.4) is replaced by

y_i^t = (1/B_t) Σ_{j=1}^{B_t} G_i(x_i^t, ξ_{i,j}^t) if i = i_t, and y_i^t = y_i^{t−1} if i ≠ i_t.   (2.17)

Here, G_i(x_i^t, ξ_{i,j}^t), j = 1, ..., B_t, are the stochastic gradients of f_i computed by the SFO oracle at x_i^t.

Under the standard assumptions (2.15) and (2.16) for stochastic optimization, and with proper choices of the algorithmic parameters, Theorem 2.2 shows that RGEM can achieve the optimal O{σ²/(μ²ɛ)} rate of convergence (up to a certain logarithmic factor) for solving strongly convex problems of the form (1.5), in terms of the number of stochastic gradients of the f_i. The proof of this result can be found in Subsection 4.2.

Theorem 2.2. Let x* be an optimal solution of (1.5), let x^k and x̄^k be generated by Algorithm 2, and let L̂ = max_{i=1,...,m} L_i. Suppose that σ_0 and σ are defined in (2.8) and (2.16), respectively. Given the iteration limit k, let {τ_t}, {η_t} and {α_t} be set as in (2.9) with α set as in (2.10), and let

B_t = ⌈k(1−α)² α^{−t}⌉, t = 1, ..., k.   (2.18)

Then

EP(x^k, x*) ≤ (2α^k/μ) Δ_{0,σ_0,σ},   (2.19)
Eψ(x̄^k) − ψ(x*) ≤ (6 max{μ, L̂}/μ) α^{k/2} Δ_{0,σ_0,σ},   (2.20)

where the expectation is taken w.r.t. {i_t} and {ξ_i^t}, and

Δ_{0,σ_0,σ} := μP(x^0, x*) + ψ(x^0) − ψ(x*) + (σ_0²/m + 5σ²)/μ.   (2.21)
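The update (2.17) and the batch-size schedule (2.18) can be sketched as follows; `sfo(i, x, rng)` stands for one call to the SFO oracle returning an unbiased stochastic gradient G_i(x, ξ), and all names are ours.

```python
import math
import numpy as np

# A sketch of the stochastic update (2.17) with the batch schedule (2.18),
# as reconstructed above. sfo is an assumed oracle (see the lead-in).

def batch_size(t, k, alpha):
    """B_t = ceil(k * (1 - alpha)^2 * alpha^(-t))."""
    return max(1, math.ceil(k * (1.0 - alpha) ** 2 * alpha ** (-t)))

def minibatch_gradient(sfo, i, x_i, t, k, alpha, rng):
    """Average B_t stochastic gradients of the selected component f_i."""
    B_t = batch_size(t, k, alpha)
    return np.mean([sfo(i, x_i, rng) for _ in range(B_t)], axis=0)
```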

In view of (2.20), the number of iterations performed by RGEM to find a stochastic ɛ-solution of (1.5) can be bounded by

K̂(ɛ, C, σ_0, σ) := 2(m + √(m² + 16C)) log( 6 max{μ, L̂} Δ_{0,σ_0,σ}/(μɛ) ).   (2.22)

Furthermore, in view of (2.19), this iteration complexity bound can be improved to

K̄(ɛ, α, σ_0, σ) := log_{1/α}( 2Δ_{0,σ_0,σ}/(μɛ) ),   (2.23)

in terms of finding a point x̄ ∈ X s.t. EP(x̄, x*) ≤ ɛ. Therefore, the corresponding number of stochastic gradient evaluations performed by RGEM for solving problem (1.5) can be bounded by

Σ_{t=1}^k B_t ≤ Σ_{t=1}^k k(1−α)² α^{−t} + k = O{ Δ_{0,σ_0,σ}/(μɛ) + (m + √(mC)) log( Δ_{0,σ_0,σ}/(μɛ) ) },   (2.24)

which together with (2.21) implies that the total number of required stochastic gradients, or samples of the random variables ξ_i, i = 1, ..., m, can be bounded by

Õ{ (σ_0²/m + σ²)/(μ²ɛ) + (μP(x^0, x*) + ψ(x^0) − ψ*)/(μɛ) + m + √(mL̂/μ) }.

Observe that this bound does not depend on the number of terms m for small enough ɛ. To the best of our knowledge, this is the first time that such a convergence result has been established for RIG algorithms for solving distributed stochastic finite-sum problems. This complexity bound is in fact of the same order of magnitude (up to a logarithmic factor) as the complexity bound achieved by the optimal accelerated stochastic approximation methods [11, 12, 19], which uniformly sample all the random variables ξ_i, i = 1, ..., m. The latter approach, however, would involve much higher communication costs in the distributed setting (see Subsection 2.3 for more discussion).

2.3. RGEM for distributed optimization and machine learning. This subsection presents RGEM (see Algorithm 1 and Algorithm 2) from two different perspectives under a distributed setting: that of the server and that of the activated agent. We also discuss the communication costs incurred by RGEM in this setting. Both the server and the agents in the distributed network start from the same global initial point x^0, i.e., x_i^0 = x^0, i = 1, ..., m, and the server also sets Δy = 0 and g^0 = 0. During the course of RGEM, the server updates the iterate x^t and computes the output solution x̄^k (cf. (2.5)), which is given by sumx/sumθ. Each agent only stores its local variable x_i^t and updates it according to the information received from the server (i.e., x^t) when activated. The activated agent also needs to upload the change of its gradient, Δy_{i_t}, to the server. Observe that since Δy_{i_t} might be sparse, uploading it incurs a smaller communication cost than uploading the new gradient y_{i_t}^t. Note that line 5 of RGEM from the i_t-th agent's perspective is optional if the agent saves its historical gradient information from the last update.

RGEM: the server's perspective
1: while t ≤ k do
2:   x^t ← M_X(g^{t−1} + (α_t/m)Δy, x^{t−1}, η_t)
3:   sumx ← sumx + θ_t x^t
4:   sumθ ← sumθ + θ_t
5:   Send a signal to the i_t-th agent, where i_t is selected uniformly from {1, ..., m}
6:   if the i_t-th agent is responsive then
7:     Send the current iterate x^t to the i_t-th agent
8:     if feedback Δy is received then
9:       g^t ← g^{t−1} + (1/m)Δy
10:      t ← t + 1
11:    else goto line 5
12:    end if
13:  else goto line 5
14:  end if
15: end while

RGEM: the activated i_t-th agent's perspective
1: Download the current iterate x^t from the server
2: if t = 1 then
3:   y_{i_t}^{t−1} ← 0
4: else
5:   y_{i_t}^{t−1} ← ∇f_{i_t}(x_{i_t}^{t−1})   ▷ Optional
6: end if
7: x_{i_t}^t ← (1 + τ_t)^{−1}(x^t + τ_t x_{i_t}^{t−1})
8: y_{i_t}^t ← ∇f_{i_t}(x_{i_t}^t)
9: Upload the local change to the server, i.e., Δy_{i_t} = y_{i_t}^t − y_{i_t}^{t−1}

We now add some remarks about the potential benefits of RGEM for distributed optimization and machine learning. Firstly, since RGEM does not require any exact gradient evaluation of f, it does not need to wait for responses from all agents in order to compute an exact gradient. Each iteration of RGEM involves communication between the server and the activated i_t-th agent only. In fact, RGEM moves to the next iteration in case no response is received from the i_t-th agent. This scheme works under the assumption that the probability of any agent being responsive or available at a certain point in time is equal. By contrast, all other optimal RIG algorithms, except RPDG, need exact gradient information from all network agents once in a while, which incurs high communication costs and synchronization delays as soon as one agent is not responsive. Even RPDG requires a full round of communication and synchronization at the initial point. Secondly, since each iteration of RGEM involves only a constant number of communication rounds between the server and one selected agent, the communication complexity of RGEM under the distributed setting can be bounded by

O{ (m + √(mL̂/μ)) log(1/ɛ) }.

Therefore, it can save up to O{√m} rounds of communication compared with optimal deterministic first-order methods.

For solving distributed stochastic finite-sum optimization problems (1.5), RGEM from the i_t-th agent's perspective is slightly modified as follows.

RGEM: the activated i_t-th agent's perspective for solving (1.5)
1: Download the current iterate x^t from the server
2: if t = 1 then
3:   y_{i_t}^{t−1} ← 0   ▷ Assuming RGEM saves y_{i_t}^{t−1} for t ≥ 2 at the latest update
4: end if
5: x_{i_t}^t ← (1 + τ_t)^{−1}(x^t + τ_t x_{i_t}^{t−1})
6: y_{i_t}^t ← (1/B_t) Σ_{j=1}^{B_t} G_{i_t}(x_{i_t}^t, ξ_{i_t,j}^t)   ▷ B_t is the batch size, and the G_{i_t}'s are the stochastic gradients given by the SFO oracle
7: Upload the local change to the server, i.e., Δy_{i_t} = y_{i_t}^t − y_{i_t}^{t−1}

Similar to the deterministic finite-sum case, the total number of communication

rounds performed by the above RGEM for solving (1.5) can be bounded by

O{ (m + √(mL̂/μ)) log(1/ɛ) }.

Each round of communication involves only the server and one randomly selected agent. This communication complexity seems to be optimal, since it matches the lower complexity bound (1.7) established in [22]. Moreover, the sampling complexity, i.e., the total number of samples to be collected by all the agents, is also nearly optimal and comparable to the case when all these samples are collected in a centralized location and processed by an optimal stochastic approximation method. On the other hand, if one applies an existing optimal stochastic approximation method to solve the distributed stochastic optimization problem, the communication complexity will be as high as O(1/√ɛ), which is much worse than that of RGEM.

3. Gradient extrapolation method: dual of Nesterov's acceleration. Our goal in this section is to introduce a new algorithmic framework, referred to as the gradient extrapolation method (GEM), for solving the convex optimization problem

ψ* := min_{x∈X} { ψ(x) := f(x) + μw(x) }.   (3.1)

We show that GEM can be viewed as a dual of Nesterov's accelerated gradient method, although the two algorithms appear to be quite different. Moreover, GEM possesses some nice properties which enable us to develop and analyze the random gradient extrapolation method for distributed and stochastic optimization.

3.1. Generalized Bregman distance. In this subsection we provide a brief introduction to the generalized Bregman distance defined in (1.10) and some properties of its associated prox-mapping defined in (1.12). Note that whenever w is non-differentiable, we need to specify a particular selection of the subgradient w′ before performing the prox-mapping. We assume throughout this paper that such a selection of w′ is defined recursively as follows. Denote x^1 ≡ M_X(g, x^0, η). By the optimality condition of (1.12), we have

g + (μ + η)w′(x^1) − ηw′(x^0) ∈ −N_X(x^1),

where N_X(x^1) := {v ∈ R^n : ⟨v, x − x^1⟩ ≤ 0, ∀x ∈ X} denotes the normal cone of X at x^1. Once such a w′(x^1) satisfying the above relation has been identified, we use it as the subgradient when defining P(x^1, x) in the next iteration. Note that such a subgradient can be identified as soon as x^1 is obtained, since it satisfies the optimality condition of (1.12).

The following lemma, which generalizes Lemma 6 of [20] and Lemma 2 of [11], characterizes the solutions of (1.12). The proof of this result can be found in Lemma 5 of [22].

Lemma 3.1. Let U be a closed convex set and a point ũ ∈ U be given. Also let w : U → R be a convex function and

W(ũ, u) = w(u) − w(ũ) − ⟨w′(ũ), u − ũ⟩

for some w′(ũ) ∈ ∂w(ũ). Assume that the function q : U → R satisfies

q(u_1) − q(u_2) − ⟨q′(u_2), u_1 − u_2⟩ ≥ −μ_0 W(u_2, u_1), ∀ u_1, u_2 ∈ U,

for some μ_0 ≥ 0. Also assume that the scalars μ_1 and μ_2 are chosen such that μ_0 + μ_1 + μ_2 ≥ 0. If

u* ∈ Argmin{ q(u) + μ_1 w(u) + μ_2 W(ũ, u) : u ∈ U },

then for any u ∈ U, we have

q(u*) + μ_1 w(u*) + μ_2 W(ũ, u*) + (μ_0 + μ_1 + μ_2) W(u*, u) ≤ q(u) + μ_1 w(u) + μ_2 W(ũ, u).

3.2. The algorithm. As shown in Algorithm 3, GEM starts with a gradient extrapolation step (3.2) to compute g̃^t from the two previous gradients g^{t−1} and g^{t−2}. Based on g̃^t, it performs a proximal gradient descent step in (3.3) and updates the output solution x̄^t in (3.4). Finally, the gradient at x̄^t is computed for the gradient extrapolation in the next iteration. This algorithm is a special case of RGEM in Algorithm 1 (with m = 1).

Algorithm 3 An optimal gradient extrapolation method (GEM)
Input: Let x^0 ∈ X, and the nonnegative parameters {α_t}, {η_t}, and {τ_t} be given.
Set x̄^0 = x^0 and g^{−1} = g^0 = ∇f(x^0).
for t = 1, 2, ..., k do
  g̃^t = α_t(g^{t−1} − g^{t−2}) + g^{t−1},   (3.2)
  x^t = M_X(g̃^t, x^{t−1}, η_t),   (3.3)
  x̄^t = (x^t + τ_t x̄^{t−1})/(1 + τ_t),   (3.4)
  g^t = ∇f(x̄^t).   (3.5)
end for
Output: x̄^k.

We now show that GEM can be viewed as the dual of the well-known Nesterov's accelerated gradient (NAG) method as studied in [22]. To see this relationship, we first rewrite GEM in a primal-dual form. Let us consider the dual space G, where the gradients of f reside, and equip it with the conjugate norm ‖·‖_*. Let J_f : G → R be the conjugate function of f, so that f(x) := max_{g∈G} {⟨x, g⟩ − J_f(g)}. We can then reformulate the original problem (3.1) as the saddle point problem

ψ* := min_{x∈X} { max_{g∈G} {⟨x, g⟩ − J_f(g)} + μw(x) }.   (3.6)

It is clear that J_f is strongly convex with modulus 1/L_f w.r.t. ‖·‖_* (see Chapter E of [15] for details). Therefore, we can define its associated dual generalized Bregman distance and dual prox-mapping:

D_f(g^0, g) := J_f(g) − [J_f(g^0) + ⟨J_f′(g^0), g − g^0⟩],   (3.7)
M_G(−x̃, g^0, τ) := argmin_{g∈G} { ⟨−x̃, g⟩ + J_f(g) + τ D_f(g^0, g) },   (3.8)

for any g^0, g ∈ G. The following result, whose proof is given in Lemma 1 of [22], shows that the computation of the dual prox-mapping associated with D_f is equivalent to the computation of a gradient of f.

Lemma 3.2. Let x̃ ∈ X and g^0 ∈ G be given, and let D_f(g^0, g) be defined in (3.7). For any τ > 0, denote z = (x̃ + τ J_f′(g^0))/(1 + τ). Then we have ∇f(z) = M_G(−x̃, g^0, τ).

Using this result, we can see that the GEM iteration can be written in a primal-dual form. Given (x^0, g^{−1}, g^0) ∈ X × G × G, it updates (x^t, g^t) by

g̃^t = α_t(g^{t−1} − g^{t−2}) + g^{t−1},   (3.9)
x^t = M_X(g̃^t, x^{t−1}, η_t),   (3.10)
g^t = M_G(−x^t, g^{t−1}, τ_t),   (3.11)

with the specific selection J_f′(g^{t−1}) = x̄^{t−1} in D_f(g^{t−1}, g). Indeed, denoting x̄^0 = x^0, we can easily see from g^0 = ∇f(x̄^0) that x̄^0 ∈ ∂J_f(g^0). Now assume that g^{t−1} = ∇f(x̄^{t−1}), and hence that x̄^{t−1} ∈ ∂J_f(g^{t−1}). By the definition of g^t in (3.11) and Lemma 3.2, we conclude that g^t = ∇f(x̄^t) with x̄^t = (x^t + τ_t x̄^{t−1})/(1 + τ_t), which are exactly the definitions given in (3.4) and (3.5).
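A minimal Python sketch of Algorithm 3 follows. It assumes, with our own naming, an exact gradient oracle `grad_f` and a prox-mapping oracle `prox(g, x_prev, eta)` implementing (1.12) (e.g., the Euclidean form sketched in Subsection 1.1); the step sizes are supplied as functions of t.

```python
import numpy as np

# A minimal sketch of GEM (Algorithm 3); grad_f and prox are assumed
# oracles, and alpha, eta, tau map an iteration index t to a step size.

def gem(grad_f, prox, x0, k, alpha, eta, tau):
    x = x0.copy()
    x_bar = x0.copy()
    g_prev = g = grad_f(x0)                 # g^{-1} = g^0 = grad f(x^0)
    for t in range(1, k + 1):
        g_tilde = g + alpha(t) * (g - g_prev)            # (3.2): dual extrapolation
        x = prox(g_tilde, x, eta(t))                     # (3.3): primal prox step
        x_bar = (x + tau(t) * x_bar) / (1.0 + tau(t))    # (3.4): output averaging
        g_prev, g = g, grad_f(x_bar)                     # (3.5): new gradient
    return x_bar
```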

Recall that in a simple version of the NAG method (e.g., [28, 34, 19, 11, 12, 13]), given (x^{t−1}, x̄^{t−1}) ∈ X × X, one updates (x^t, x̄^t) by

x̲^t = (1 − λ_t) x̄^{t−1} + λ_t x^{t−1},   (3.12)
g^t = ∇f(x̲^t),   (3.13)
x^t = M_X(g^t, x^{t−1}, η_t),   (3.14)
x̄^t = (1 − λ_t) x̄^{t−1} + λ_t x^t,   (3.15)

for some λ_t ∈ [0, 1]. Moreover, it is shown in [22] that (3.12)-(3.15) can be viewed as a specific instantiation of the following primal-dual updates:

x̃^t = α_t(x^{t−1} − x^{t−2}) + x^{t−1},   (3.16)
g^t = M_G(−x̃^t, g^{t−1}, τ_t),   (3.17)
x^t = M_X(g^t, x^{t−1}, η_t).   (3.18)

Comparing (3.9)-(3.11) with (3.16)-(3.18), we clearly see that GEM is a dual version of NAG, obtained by switching the primal and dual variables in each equation of (3.16)-(3.18). The major difference is that the extrapolation step in GEM is performed in the dual space, while the one in NAG is performed in the primal space. In fact, extrapolation in the dual space will help us to greatly simplify and further enhance the randomized incremental gradient methods developed in [22] based on NAG. Another interesting fact is that in GEM, the gradients are computed at the output solutions {x̄^t}. In the NAG method, on the other hand, the output solutions are given by {x̄^t} while the gradients are computed at the extrapolation sequence {x̲^t}.

3.3. Convergence of GEM. Our goal in this subsection is to establish the convergence properties of the GEM method for solving (3.1). Observe that our analysis is carried out completely in the primal space and does not rely on the primal-dual interpretation described in the previous subsection. This type of analysis appears to be new for problem (3.1) in the literature, as it also differs significantly from that of NAG.

We first establish some general convergence properties of GEM for both the smooth convex case (μ = 0) and the strongly convex case (μ > 0).

Theorem 3.3. Suppose that {η_t}, {τ_t}, and {α_t} in GEM satisfy

θ_{t−1} = α_t θ_t, t = 2, ..., k,   (3.19)
θ_t η_t ≤ θ_{t−1}(μ + η_{t−1}), t = 2, ..., k,   (3.20)
θ_t τ_t = θ_{t−1}(1 + τ_{t−1}), t = 2, ..., k,   (3.21)
α_t L_f ≤ τ_{t−1} η_t, t = 2, ..., k,   (3.22)
2L_f ≤ τ_k(μ + η_k),   (3.23)

for some θ_t ≥ 0, t = 1, ..., k. Then, for any k ≥ 1 and any given x ∈ X, we have

θ_k(1 + τ_k)[ψ(x̄^k) − ψ(x)] + (θ_k(μ + η_k)/2) P(x^k, x) ≤ θ_1τ_1[ψ(x̄^0) − ψ(x)] + θ_1η_1 P(x^0, x).   (3.24)

Proof. Applying Lemma 3.1 to (3.3), we obtain

⟨x^t − x, α_t(g^{t−1} − g^{t−2}) + g^{t−1}⟩ + μw(x^t) − μw(x) ≤ η_t P(x^{t−1}, x) − (μ + η_t) P(x^t, x) − η_t P(x^{t−1}, x^t).   (3.25)

Moreover, using the definition of ψ, the convexity of f, and the fact that g^t = ∇f(x̄^t), we have

(1 + τ_t) f(x̄^t) + μw(x^t) − ψ(x)
≤ (1 + τ_t) f(x̄^t) + μw(x^t) − μw(x) − [f(x̄^t) + ⟨g^t, x − x̄^t⟩]
= τ_t [f(x̄^t) − ⟨g^t, x̄^t − x̄^{t−1}⟩] − ⟨g^t, x − x^t⟩ + μw(x^t) − μw(x)
≤ −(τ_t/(2L_f)) ‖g^t − g^{t−1}‖_*² + τ_t f(x̄^{t−1}) − ⟨g^t, x − x^t⟩ + μw(x^t) − μw(x)
≤ −(τ_t/(2L_f)) ‖g^t − g^{t−1}‖_*² + τ_t f(x̄^{t−1}) + ⟨x^t − x, g^t − g^{t−1} − α_t(g^{t−1} − g^{t−2})⟩ + η_t P(x^{t−1}, x) − (μ + η_t) P(x^t, x) − η_t P(x^{t−1}, x^t),

where the first equality follows from the definition of x̄^t in (3.4), the second inequality follows from the smoothness of f (see Theorem 2.1.5 of [28]), and the last inequality follows from (3.25). Multiplying both sides of the above inequality by θ_t and summing the resulting inequalities over t = 1 to k, we obtain

Σ_{t=1}^k θ_t(1 + τ_t) f(x̄^t) + Σ_{t=1}^k θ_t μw(x^t) − Σ_{t=1}^k θ_t ψ(x)
≤ −Σ_{t=1}^k (θ_tτ_t/(2L_f)) ‖g^t − g^{t−1}‖_*² + Σ_{t=1}^k θ_tτ_t f(x̄^{t−1}) + Σ_{t=1}^k θ_t ⟨x^t − x, g^t − g^{t−1} − α_t(g^{t−1} − g^{t−2})⟩ + Σ_{t=1}^k θ_t [η_t P(x^{t−1}, x) − (μ + η_t) P(x^t, x) − η_t P(x^{t−1}, x^t)].   (3.26)

Now by (3.19) and the fact that g^{−1} = g^0, we have

Σ_{t=1}^k θ_t ⟨x^t − x, g^t − g^{t−1} − α_t(g^{t−1} − g^{t−2})⟩
= Σ_{t=1}^k θ_t ⟨x^t − x, g^t − g^{t−1}⟩ − Σ_{t=1}^k θ_tα_t ⟨x^{t−1} − x, g^{t−1} − g^{t−2}⟩ − Σ_{t=2}^k θ_tα_t ⟨x^t − x^{t−1}, g^{t−1} − g^{t−2}⟩
= θ_k ⟨x^k − x, g^k − g^{k−1}⟩ − Σ_{t=2}^k θ_tα_t ⟨x^t − x^{t−1}, g^{t−1} − g^{t−2}⟩.

Moreover, in view of (3.20), (3.21), and the definition (3.4) of x̄^t, we obtain

Σ_{t=1}^k θ_t [η_t P(x^{t−1}, x) − (μ + η_t) P(x^t, x)] ≤ θ_1η_1 P(x^0, x) − θ_k(μ + η_k) P(x^k, x),   (by (3.20))
Σ_{t=1}^k θ_t [(1 + τ_t) f(x̄^t) − τ_t f(x̄^{t−1})] = θ_k(1 + τ_k) f(x̄^k) − θ_1τ_1 f(x̄^0),   (by (3.21))
Σ_{t=1}^k θ_t = Σ_{t=2}^k [θ_tτ_t − θ_{t−1}τ_{t−1}] + θ_k = θ_k(1 + τ_k) − θ_1τ_1,   (by (3.21))
θ_k(1 + τ_k) x̄^k = θ_k(x^k + τ_k x̄^{k−1}) = ⋯ = Σ_{t=1}^k θ_t x^t + θ_1τ_1 x̄^0.   (by (3.4) and (3.21))

The last two relations, in view of the convexity of w(·), also imply that

θ_k(1 + τ_k) w(x̄^k) ≤ Σ_{t=1}^k θ_t w(x^t) + θ_1τ_1 w(x̄^0).

Therefore, by (3.26), the above relations, and the definition of ψ, we conclude that

θ_k(1 + τ_k)[ψ(x̄^k) − ψ(x)]
≤ −Σ_{t=2}^k [ (θ_{t−1}τ_{t−1}/(2L_f)) ‖g^{t−1} − g^{t−2}‖_*² + θ_tα_t ⟨x^t − x^{t−1}, g^{t−1} − g^{t−2}⟩ + θ_tη_t P(x^{t−1}, x^t) ]
 − θ_k [ (τ_k/(2L_f)) ‖g^k − g^{k−1}‖_*² − ⟨x^k − x, g^k − g^{k−1}⟩ + (μ + η_k) P(x^k, x) ]
 + θ_1η_1 P(x^0, x) + θ_1τ_1[ψ(x̄^0) − ψ(x)] − θ_1η_1 P(x^0, x^1).   (3.27)

By the strong convexity (1.11) of P(·,·), the simple relation that b⟨u, v⟩ − a‖v‖²/2 ≤ b²‖u‖²/(2a), ∀a > 0,

and the conditions (3.22) and (3.23), we have

−Σ_{t=2}^k [ (θ_{t−1}τ_{t−1}/(2L_f)) ‖g^{t−1} − g^{t−2}‖_*² + θ_tα_t ⟨x^t − x^{t−1}, g^{t−1} − g^{t−2}⟩ + θ_tη_t P(x^{t−1}, x^t) ] ≤ Σ_{t=2}^k θ_t ( α_tL_f/(2τ_{t−1}) − η_t/2 ) ‖x^{t−1} − x^t‖² ≤ 0,
−θ_k [ (τ_k/(2L_f)) ‖g^k − g^{k−1}‖_*² − ⟨x^k − x, g^k − g^{k−1}⟩ + ((μ + η_k)/2) P(x^k, x) ] ≤ θ_k ( L_f/(2τ_k) − (μ + η_k)/4 ) ‖x^k − x‖² ≤ 0.

Using the above relations in (3.27), we obtain (3.24).

We are now ready to establish the optimal convergence behavior of GEM as a consequence of Theorem 3.3. We first provide a constant step-size policy which guarantees an optimal linear rate of convergence for the strongly convex case (μ > 0).

Corollary 3.4. Let x* be an optimal solution of (3.1), and let x^k and x̄^k be defined in (3.3) and (3.4), respectively. Suppose that μ > 0, and that {τ_t}, {η_t} and {α_t} are set to

τ_t ≡ τ = √(2L_f/μ), η_t ≡ η = √(2L_fμ), and α_t ≡ α = √(2L_f/μ) / (1 + √(2L_f/μ)), t = 1, ..., k.   (3.28)

Then

P(x^k, x*) ≤ 2α^k [ P(x^0, x*) + (1/μ)(ψ(x^0) − ψ*) ],   (3.29)
ψ(x̄^k) − ψ* ≤ α^k [ μP(x^0, x*) + ψ(x^0) − ψ* ].   (3.30)

Proof. Let us set θ_t = α^{−t}, t = 1, ..., k. It is easy to check that the selection of {τ_t}, {η_t} and {α_t} in (3.28) satisfies conditions (3.19)-(3.23). In view of Theorem 3.3 and (3.28), we have

ψ(x̄^k) − ψ(x*) + ((μ + η)/(2(1 + τ))) P(x^k, x*) ≤ (θ_1τ/(θ_k(1 + τ)))[ψ(x̄^0) − ψ(x*)] + (θ_1η/(θ_k(1 + τ))) P(x^0, x*) = α^k [ψ(x^0) − ψ(x*) + μP(x^0, x*)].

It also follows from the above relation, the fact that ψ(x̄^k) − ψ(x*) ≥ 0, and (3.28) that

P(x^k, x*) ≤ (2(1 + τ)α^k/(μ + η)) [μP(x^0, x*) + ψ(x^0) − ψ(x*)] = 2α^k [P(x^0, x*) + (1/μ)(ψ(x^0) − ψ(x*))].

We now provide a stepsize policy which guarantees the optimal rate of convergence for the smooth case (μ = 0). Observe that in the smooth case we can only estimate the solution quality of the sequence {x̄^k}.

Corollary 3.5. Let x* be an optimal solution of (3.1), and let x̄^k be defined in (3.4). Suppose that μ = 0, and that {τ_t}, {η_t} and {α_t} are set to

τ_t = t/2, η_t = 4L_f/t, and α_t = t/(t + 1), t = 1, ..., k.   (3.31)

Then

ψ(x̄^k) − ψ(x*) = f(x̄^k) − f(x*) ≤ (2/((k + 1)(k + 2))) [ f(x^0) − f(x*) + 8L_f P(x^0, x*) ].   (3.32)

Proof. Let us set θ_t = t + 1, t = 1, ..., k. It is easy to check that the parameters in (3.31) satisfy conditions (3.19)-(3.23). In view of (3.24) and (3.31), we conclude that

ψ(x̄^k) − ψ(x*) ≤ (2/((k + 1)(k + 2))) [ψ(x^0) − ψ(x*) + 8L_f P(x^0, x*)].
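The two policies above translate directly into code. The sketch below, under the constants of Corollaries 3.4 and 3.5 as reconstructed above, returns the (α_t, η_t, τ_t) callables expected by the `gem` sketch in Subsection 3.2:

```python
import math

# Step-size policies for GEM per Corollaries 3.4 and 3.5 (as reconstructed).
# Each helper returns (alpha, eta, tau) as functions of t, matching gem().

def gem_policy_strongly_convex(L_f, mu):
    tau = math.sqrt(2.0 * L_f / mu)          # (3.28)
    eta = math.sqrt(2.0 * L_f * mu)          # eta = mu * tau
    alpha = tau / (1.0 + tau)                # constant extrapolation weight
    return (lambda t: alpha), (lambda t: eta), (lambda t: tau)

def gem_policy_smooth(L_f):
    # Corollary 3.5 (mu = 0): tau_t = t/2, eta_t = 4*L_f/t, alpha_t = t/(t+1).
    return (lambda t: t / (t + 1.0)), (lambda t: 4.0 * L_f / t), (lambda t: t / 2.0)

# Usage: alpha, eta, tau = gem_policy_smooth(L_f)
#        x_bar = gem(grad_f, prox, x0, k, alpha, eta, tau)
```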

In Corollary 3.6 we improve the above complexity result in terms of its dependence on f(x^0) − f(x*), by using a different step-size policy and a slightly more involved analysis for the smooth case (μ = 0).

Corollary 3.6. Let x* be an optimal solution of (3.1), and let x^k and x̄^k be defined in (3.3) and (3.4), respectively. Suppose that μ = 0, and that {τ_t}, {η_t} and {α_t} are set to

τ_t = (t − 1)/2, η_t = 6L_f/t, and α_t = (t − 1)/t, t = 1, ..., k.   (3.33)

Then, for any k ≥ 1,

ψ(x̄^k) − ψ(x*) = f(x̄^k) − f(x*) ≤ (12L_f/(k(k + 1))) P(x^0, x*).   (3.34)

Proof. If we set θ_t = t, t = 1, ..., k, it is easy to check that the parameters in (3.33) satisfy conditions (3.19)-(3.21) and (3.23). However, condition (3.22) only holds for t = 3, ..., k, i.e.,

α_t L_f ≤ τ_{t−1} η_t, t = 3, ..., k.   (3.35)

In view of (3.27) and the fact that τ_1 = 0, we have

θ_k(1 + τ_k)[ψ(x̄^k) − ψ(x)]
≤ −θ_2[α_2 ⟨x^2 − x^1, g^1 − g^0⟩ + η_2 P(x^1, x^2)] − θ_1η_1 P(x^0, x^1)
 − Σ_{t=3}^k [ (θ_{t−1}τ_{t−1}/(2L_f)) ‖g^{t−1} − g^{t−2}‖_*² + θ_tα_t ⟨x^t − x^{t−1}, g^{t−1} − g^{t−2}⟩ + θ_tη_t P(x^{t−1}, x^t) ]
 − θ_k [ (τ_k/(2L_f)) ‖g^k − g^{k−1}‖_*² − ⟨x^k − x, g^k − g^{k−1}⟩ + (μ + η_k) P(x^k, x) ] + θ_1η_1 P(x^0, x)
≤ (θ_2α_2²/(2η_2)) ‖g^1 − g^0‖_*² − (θ_1η_1/2) ‖x^1 − x^0‖² + Σ_{t=3}^k θ_t ( α_tL_f/(2τ_{t−1}) − η_t/2 ) ‖x^{t−1} − x^t‖²
 + θ_k ( L_f/(2τ_k) − (μ + η_k)/4 ) ‖x^k − x‖² + θ_1η_1 P(x^0, x) − (θ_kη_k/2) P(x^k, x)
≤ (θ_2α_2²L_f²/(2η_2) − θ_1η_1/2) ‖x^1 − x^0‖² + θ_1η_1 P(x^0, x) − (θ_kη_k/2) P(x^k, x)
≤ θ_1η_1 P(x^0, x) − (θ_kη_k/2) P(x^k, x),

where the second inequality follows from the simple relation that b⟨u, v⟩ − a‖v‖²/2 ≤ b²‖u‖²/(2a), ∀a > 0, and (1.11); the third inequality follows from (3.35), (3.23), the definition of g^t in (3.5), and (1.4); and the last inequality follows from the facts that x̄^0 = x^0 and x̄^1 = x^1 (due to τ_1 = 0). Therefore, by plugging the parameter setting (3.33) into the above inequality, we conclude that

ψ(x̄^k) − ψ* = f(x̄^k) − f(x*) ≤ [θ_k(1 + τ_k)]^{−1} [ θ_1η_1 P(x^0, x*) − (θ_kη_k/2) P(x^k, x*) ] ≤ (12L_f/(k(k + 1))) P(x^0, x*).

In view of the results obtained in the above three corollaries, GEM exhibits optimal rates of convergence for both the strongly convex and the smooth case. Unlike the classical NAG method, GEM performs extrapolation on the gradients rather than on the iterates. This fact allows us to develop an enhanced randomized incremental gradient method improving upon RPDG [22], i.e., the random gradient extrapolation method, with a much simpler analysis.

4. Convergence analysis of RGEM. Our main goal in this section is to establish the convergence properties of RGEM for solving (1.1) and (1.5), i.e., the main results stated in Theorems 2.1 and 2.2. In fact, comparing RGEM in Algorithm 1 with GEM in Algorithm 3, RGEM is a direct randomization of GEM. Therefore, inheriting from GEM, its convergence analysis is carried out completely in the primal space. The analysis for RGEM is, however, more challenging, especially because we need to 1) build up the relationship

between (1/m) Σ_{i=1}^m f_i(x_i^k) and f(x^k), for which we exploit the function Q defined in (4.3) as an intermediate tool; 2) bound the error caused by the inexact gradients at the initial point; and 3) analyze the accumulated error caused by randomization and noisy stochastic gradients. Before proving Theorems 2.1 and 2.2, we first need to provide some important technical results. The following simple result demonstrates a few identities related to x_i^t (cf. (2.3)) and y^t (cf. (2.4) or (2.17)).

Lemma 4.1. Let x^t and y^t be defined in (2.2) and (2.4) (or (2.17)), respectively, and let x̂_i^t and ŷ_i^t be defined as

x̂_i^t = (1 + τ_t)^{−1}(x^t + τ_t x_i^{t−1}), i = 1, ..., m, t ≥ 1,   (4.1)
ŷ_i^t = ∇f_i(x̂_i^t) if y^t is defined in (2.4), and ŷ_i^t = (1/B_t) Σ_{j=1}^{B_t} G_i(x̂_i^t, ξ_{i,j}^t) if y^t is defined in (2.17), i = 1, ..., m, t ≥ 1.   (4.2)

Then we have, for any i = 1, ..., m and t = 1, ..., k,

E_t[y_i^t] = (1/m) ŷ_i^t + (1 − 1/m) y_i^{t−1},
E_t[x_i^t] = (1/m) x̂_i^t + (1 − 1/m) x_i^{t−1},
E_t[f_i(x_i^t)] = (1/m) f_i(x̂_i^t) + (1 − 1/m) f_i(x_i^{t−1}),
E_t[∇f_i(x_i^t) − ∇f_i(x_i^{t−1})] = (1/m) [∇f_i(x̂_i^t) − ∇f_i(x_i^{t−1})],

where E_t denotes the conditional expectation w.r.t. i_t given i_1, ..., i_{t−1} when y^t is defined in (2.4), and w.r.t. i_t given i_1, ..., i_{t−1}, ξ_1^t, ..., ξ_m^t when y^t is defined in (2.17), respectively.

Proof. The first equality follows immediately from the facts that Prob_t{y_i^t = ŷ_i^t} = Prob_t{i_t = i} = 1/m and Prob_t{y_i^t = y_i^{t−1}} = 1 − 1/m. Here Prob_t denotes the conditional probability w.r.t. i_t given i_1, ..., i_{t−1} when y^t is defined in (2.4), and w.r.t. i_t given i_1, ..., i_{t−1}, ξ_1^t, ..., ξ_m^t when y^t is defined in (2.17), respectively. The remaining equalities can be shown similarly.

We define the following function Q to help us analyze the convergence properties of RGEM. For two feasible solutions x, x̄ ∈ X of (1.1) (or (1.5)), we define the corresponding Q(x, x̄) by

Q(x, x̄) := ⟨∇f(x̄), x − x̄⟩ + μw(x) − μw(x̄).   (4.3)

It is obvious that if we fix x̄ = x*, an optimal solution of (1.1) (or (1.5)), then by the convexity of w and the optimality condition at x*, for any feasible solution x we can conclude that

Q(x, x*) ≥ ⟨∇f(x*) + μw′(x*), x − x*⟩ ≥ 0.

Moreover, observing that f is smooth, we conclude that

Q(x, x*) = f(x*) + ⟨∇f(x*), x − x*⟩ + μw(x) − ψ(x*) ≥ −(L_f/2)‖x − x*‖² + ψ(x) − ψ(x*).   (4.4)

The following lemma establishes an important relationship involving Q.

Lemma 4.2. Let x^t be defined in (2.2), and let x ∈ X be any feasible solution of (1.1) or (1.5). Suppose that the τ_t in RGEM satisfy

θ_t (m(1 + τ_t) − 1) = m θ_{t−1} (1 + τ_{t−1}), t = 2, ..., k,   (4.5)

for some θ_t ≥ 0, t = 1, ..., k. Then we have

Σ_{t=1}^k θ_t E[Q(x^t, x)] ≤ θ_k(1 + τ_k) (1/m) Σ_{i=1}^m E[f_i(x_i^k)] + Σ_{t=1}^k θ_t [ E[μw(x^t)] − ψ(x) ] − θ_1((1 + τ_1) − 1/m) [⟨x^0 − x, ∇f(x)⟩ + f(x)].   (4.6)