
RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

GUANGHUI LAN AND YI ZHOU

Abstract. In this paper, we consider a class of finite-sum convex optimization problems defined over a distributed multiagent network with m agents connected to a central server. In particular, the objective function consists of the average of m (≥ 1) smooth components associated with each network agent together with a strongly convex term. Our major contribution is to develop a new randomized incremental gradient algorithm, namely the random gradient extrapolation method (RGEM), which does not require any exact gradient evaluation even for the initial point, but can achieve the optimal O(log(1/ε)) complexity bound in terms of the total number of gradient evaluations of component functions to solve the finite-sum problems. Furthermore, we demonstrate that for stochastic finite-sum optimization problems, RGEM maintains the optimal O(1/ε) complexity (up to a certain logarithmic factor) in terms of the number of stochastic gradient computations, but attains an O(log(1/ε)) complexity in terms of communication rounds (each round involves only one agent). It is worth noting that the former bound is independent of the number of agents m, while the latter one only linearly depends on m, or even √m for ill-conditioned problems. To the best of our knowledge, this is the first time that these complexity bounds have been obtained for distributed and stochastic optimization problems. Moreover, our algorithms were developed based on a novel dual perspective of Nesterov's accelerated gradient method.

Keywords: finite-sum optimization, gradient extrapolation, randomized method, distributed machine learning, stochastic optimization.

1. Introduction. The main problem of interest in this paper is the finite-sum convex programming (CP) problem given in the form of

ψ* := min_{x ∈ X} { ψ(x) := (1/m) ∑_{i=1}^m f_i(x) + μ w(x) }.   (1.1)

Here, X ⊆ R^n is a closed convex set, and f_i : X → R, i = 1, ..., m, are smooth convex functions with Lipschitz continuous gradients over X, i.e., there exist L_i ≥ 0 such that

‖∇f_i(x_1) − ∇f_i(x_2)‖_* ≤ L_i ‖x_1 − x_2‖,  ∀ x_1, x_2 ∈ X,   (1.2)

and w : X → R is a strongly convex function with modulus 1 w.r.t. a norm ‖·‖, i.e.,

w(x_1) − w(x_2) − ⟨w'(x_2), x_1 − x_2⟩ ≥ (1/2) ‖x_1 − x_2‖²,  ∀ x_1, x_2 ∈ X,   (1.3)

where w'(·) denotes any subgradient (or gradient) of w(·) and μ ≥ 0 is a given constant. Hence, the objective function ψ is strongly convex whenever μ > 0. For notational convenience, we also denote f(x) ≡ (1/m) ∑_{i=1}^m f_i(x), L ≡ (1/m) ∑_{i=1}^m L_i, and L̂ = max_{i=1,...,m} L_i. It is easy to see that for some L_f ≥ 0,

‖∇f(x_1) − ∇f(x_2)‖_* ≤ L_f ‖x_1 − x_2‖ ≤ L ‖x_1 − x_2‖,  ∀ x_1, x_2 ∈ X.   (1.4)

We also consider a class of stochastic finite-sum optimization problems given by

ψ* := min_{x ∈ X} { ψ(x) := (1/m) ∑_{i=1}^m E_{ξ_i}[F_i(x, ξ_i)] + μ w(x) },   (1.5)

where the ξ_i's are random variables with support Ξ_i ⊆ R^d. It can be easily seen that (1.5) is a special case of (1.1) with f_i(x) = E_{ξ_i}[F_i(x, ξ_i)], i = 1, ..., m. However, different from deterministic finite-sum optimization problems, only noisy gradient information of each component function f_i can be accessed for the stochastic finite-sum optimization problem in (1.5). The deterministic finite-sum problem (1.1) can model the empirical risk minimization in machine learning and statistical inference, and hence has become the subject of intensive studies during the past few years.

H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA (email: george.lan@isye.gatech.edu). H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA (email: yizhou@gatech.edu).

Our study on the finite-sum problems (1.1) and (1.5) has also been motivated by the emerging need for distributed optimization and machine learning. Under such settings, each component function f_i is associated with an agent i, i = 1, ..., m, and the agents are connected through a distributed network. While different topologies can be considered for distributed optimization (see, e.g., Figures 1.1 and 1.2), in this paper we focus on the star network where m agents are connected to one central server, and all agents only communicate with the server (see Figure 1.1). These types of distributed optimization problems have several unique features. Firstly, they allow for data privacy, since no local data is stored in the server. Secondly, network agents behave independently and may not be responsive at the same time. Thirdly, the communication between the server and agents can be expensive and has high latency. Finally, by considering the stochastic finite-sum optimization problem, we are interested in not only the deterministic empirical risk minimization, but also the generalization risk for distributed machine learning. Moreover, we allow the private data for each agent to be collected in an online (streaming) fashion. One typical example of the aforementioned distributed problems is Federated Learning, recently introduced by Google in [25]. As a particular example, in the ℓ2-regularized logistic regression problem we have

f_i(x) = l_i(x) := (1/N_i) ∑_{j=1}^{N_i} log(1 + exp(−b_j^i ⟨a_j^i, x⟩)), i = 1, ..., m,
w(x) = R(x) := (1/2) ‖x‖_2²,

provided that f_i is the loss function of agent i with training data {a_j^i, b_j^i}_{j=1}^{N_i} ⊆ R^n × {−1, 1}, and μ := λ is the penalty parameter. For minimization of the generalized risk, the f_i's are given in the form of expectation, i.e.,

f_i(x) = l_i(x) := E_{ξ_i}[ log(1 + exp(−ξ_i^T x)) ], i = 1, ..., m,

where the random variable ξ_i models the underlying distribution for the training dataset of agent i.

Fig. 1.1. A distributed network with 5 agents and one server.
Fig. 1.2. An example of the decentralized network.

Note that another type of topology for distributed optimization is the multi-agent network without a central server, namely the decentralized setting, as shown in Figure 1.2, where the agents can only communicate with their neighbors to update information; please refer to [21, 23, 32] and references therein for decentralized algorithms.
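To make the example concrete, the following minimal NumPy sketch implements one agent's component loss f_i and its gradient for the ℓ2-regularized logistic regression problem above; the arrays A and b stand in for agent i's local data {a_j^i, b_j^i} and are illustrative placeholders, not part of the paper.

import numpy as np

def f_i(x, A, b):
    """Average logistic loss of one agent over its N_i local examples."""
    z = -b * (A @ x)                      # margins -b_j <a_j, x>
    return np.mean(np.logaddexp(0.0, z))  # log(1 + exp(z)), numerically stable

def grad_f_i(x, A, b):
    """Gradient of the agent's average logistic loss."""
    z = -b * (A @ x)
    s = 1.0 / (1.0 + np.exp(-z))          # sigmoid(z) = exp(z) / (1 + exp(z))
    return A.T @ (-b * s) / A.shape[0]

def w(x):
    """Strongly convex regularizer with modulus 1: w(x) = ||x||^2 / 2."""
    return 0.5 * np.dot(x, x)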

During the past few years, randomized incremental gradient (RIG) methods have emerged as an important class of first-order methods for finite-sum optimization (e.g., [4, 16, 35, 8, 29, 22, 1, 14, 24]). For solving nonsmooth finite-sum problems, Nemirovski et al. [26, 27] showed that stochastic subgradient (mirror) descent methods can possibly save up to O(√m) subgradient evaluations. By utilizing the smoothness properties of the objective, Lan [18] showed that one can separate the impact of variance from other deterministic components for stochastic gradient descent, and presented a new class of accelerated stochastic gradient descent methods to further improve these complexity bounds. However, the overall rate of convergence of these stochastic methods is still sublinear even for smooth and strongly convex finite-sum problems (see [11, 12]). Inspired by these works and the success of the incremental aggregated gradient method by Blatt et al. [4], Schmidt et al. [29] presented a stochastic average gradient (SAG) method, which uses randomized sampling of the f_i to update the gradients, and can achieve a linear rate of convergence, i.e., an O{m + (L/μ) log(1/ε)} complexity bound, to solve unconstrained finite-sum problems (1.1). Johnson and Zhang [16] later presented a stochastic variance reduced gradient (SVRG) method, which computes an estimator of ∇f by iteratively updating the gradient of one randomly selected f_i within the current exact gradient information and re-evaluating the exact gradient from time to time. Xiao and Zhang [35] later extended SVRG to solve proximal finite-sum problems (1.1). All these methods exhibit an improved O{(m + L/μ) log(1/ε)} complexity bound, and Defazio et al. [8] also presented an improved SAG method, called SAGA, that can achieve such a complexity result. Compared to the class of stochastic dual methods (e.g., [31, 30, 36]), each iteration of the RIG methods only involves the computation of ∇f_i, rather than solving a more complicated subproblem

argmin_y { ⟨g, y⟩ + f_i(y) + (μ/2) ‖y‖² },

which may not have explicit solutions [31]. Noting that most of these RIG methods are not optimal even for m = 1, much recent research effort has been directed to the acceleration of RIG methods. In 2015, Lan and Zhou in [22] proposed a RIG method, namely the randomized primal-dual gradient (RPDG) method, and showed that its total number of gradient computations of f_i can be bounded by

O{ (m + √(mL/μ)) log(1/ε) }.   (1.6)

The RPDG method utilizes a direct acceleration without even using the concept of variance reduction, evolving from the randomized primal-dual methods developed in [36, 7] for solving saddle-point problems. Lan and Zhou [22] also established a lower complexity bound for the RIG methods by showing that the number of gradient evaluations of f_i required by any RIG method to find an ε-solution of (1.1), i.e., a point x̄ ∈ X s.t. E‖x̄ − x*‖_2² ≤ ε, cannot be smaller than

Ω{ (m + √(mL/μ)) log(1/ε) },   (1.7)

whenever the dimension n ≥ (k + m/2)/log(1/q), where k is the total number of iterations and q = 1 + 2/(√(mL/(μ(m+1))) − 1). Simultaneously, Lin et al. [24] presented a catalyst scheme which utilizes a restarting technique to accelerate the SAG method in [29] (or other non-accelerated first-order methods), and thus can possibly improve the complexity bounds obtained by SVRG and SAGA to (1.6) (under the Euclidean setting). Allen-Zhu [1] later showed that one can also directly accelerate SVRG to achieve the optimal rate of convergence (1.6). All these accelerated RIG methods can save up to O(√m) in the number of gradient evaluations of f_i compared to optimal deterministic first-order methods when L/μ ≥ m.

It should be noted that most existing RIG methods were inspired by empirical risk minimization on a single server (or cluster) in machine learning rather than on a set of agents distributed over a network. Under the distributed setting, methods requiring full gradient computation and/or restarting from time to time may incur extra communication and synchronization costs. As a consequence, methods which require fewer full gradient computations (e.g., SAG, SAGA, and RPDG) seem to be more advantageous in this regard. An interesting but yet unresolved question in stochastic optimization is whether there exists a method which does not require the computation of any full gradients (even at the initial point), but can still achieve the optimal rate of convergence in (1.6). Moreover, little attention in the study of RIG methods has been paid to the stochastic finite-sum problem in (1.5), which is important for generalization risk minimization in machine learning. Very recently, there has been some progress on stochastic primal-dual type methods for solving problem (1.5). For example, Lan, Lee and Zhou [21] proposed a stochastic decentralized communication sliding method that can achieve the optimal O(1/ε) sampling complexity and the best-known O(1/√ε) complexity bound on communication rounds for solving stochastic decentralized strongly convex problems.
For the distributed setting with a central server, by using the mini-batch technique to collect gradient information and any stochastic gradient based algorithm as a black box to update the iterates, Dekel et al. [9] presented a distributed mini-batch algorithm with a batch size of o(m^{1/2}) that can obtain an O(1/ε) sampling complexity (i.e., number of stochastic gradients) for stochastic strongly convex problems, and hence implies at least an O(1/√ε) bound for communication complexity.

An asynchronous version was later proposed by Feyzmahdavian et al. in [10] that maintains the above convergence rate for regularized stochastic strongly convex problems. It should be pointed out that these mini-batch based distributed algorithms require sampling from all network agents iteratively, and hence lead to at least an O(m/√ε) rate of convergence in terms of communication costs among the server and agents. It is unknown whether there exists an algorithm which only requires a significantly smaller number of communication rounds (e.g., O(log 1/ε)), but can achieve the optimal O(1/ε) sampling complexity for solving the stochastic finite-sum problem in (1.5).

The main contribution of this paper is to introduce a new randomized incremental gradient type method to solve (1.1) and (1.5). Firstly, we develop a random gradient extrapolation method (RGEM) for solving (1.1) that does not require any exact gradient evaluations of f. For strongly convex problems, we demonstrate that RGEM can still achieve the optimal rate of convergence (1.6) under the assumption that the average of the gradients of the f_i at the initial point x^0 is bounded by σ_0². To the best of our knowledge, this is the first time that such an optimal RIG method without any exact gradient evaluations has been presented for solving (1.1) in the literature. In fact, without any full gradient computation, RGEM possesses iteration costs as low as pure stochastic gradient descent (SGD) methods, but achieves a much faster and optimal linear rate of convergence for solving deterministic finite-sum problems. In comparison with the well-known randomized Kaczmarz method [33], which can be viewed as an enhanced version of SGD and can achieve a linear rate of convergence for solving linear systems, RGEM has a better convergence rate in terms of the dependence on the condition number L/μ. Secondly, we develop a stochastic version of RGEM and establish its optimal convergence properties for solving stochastic finite-sum problems (1.5). More specifically, we assume that only noisy first-order information of one randomly selected component function f_i can be accessed via a stochastic first-order (SFO) oracle iteratively. In other words, at each iteration only one randomly selected network agent needs to compute an estimator of its gradient by sampling from its local data using an SFO oracle, instead of performing exact gradient evaluation of its component function f_i. Note that for these problems it is difficult to compute the exact gradients even at the initial point. Under standard assumptions for centralized stochastic optimization, i.e., the gradient estimators computed by the SFO oracle are unbiased and have bounded variance σ², the number of stochastic gradient evaluations performed by RGEM to solve (1.5) can be bounded by¹

Õ{ (σ_0²/m + σ²)/(μ²ε) + (μ‖x^0 − x*‖² + ψ(x^0) − ψ*)/(μ ε) },   (1.8)

for finding a point x̄ ∈ X s.t. E‖x̄ − x*‖² ≤ ε. Moreover, by utilizing the mini-batch technique, RGEM can achieve an

O{ (m + √(mL̂/μ)) log(1/ε) }   (1.9)

complexity bound in terms of the number of communication rounds, and each round only involves the communication between the server and a randomly selected agent. This bound seems to be optimal, since it matches the lower complexity bound for RIG methods to solve deterministic finite-sum problems. It is worth noting that the former bound (1.8) is independent of the number of agents m, while the latter one (1.9) only linearly depends on m, or even √m for ill-conditioned problems.

¹ Here Õ indicates that the rate of convergence is up to a logarithmic factor log(1/ε).
To the best of our knowledge, this is the first time that such a RIG type method has been developed for solving stochastic finite-sum problems (1.5) that can achieve the optimal communication complexity and a nearly optimal (up to a logarithmic factor) sampling complexity in the literature.

RGEM is developed based on a novel algorithmic framework, namely the gradient extrapolation method (GEM), that we introduce in this paper for solving black-box convex optimization (i.e., m = 1). The development of GEM was inspired by our recent studies on the relation between accelerated gradient methods and primal-dual gradient methods. In particular, it is observed in [22] that Nesterov's accelerated gradient method is a special primal-dual gradient (PDG) method where the extrapolation step is performed in the primal space.

Such a primal extrapolation step, however, might result in a search point outside the feasible region under the randomized setting in the RPDG method mentioned above. In view of this deficiency of the PDG and RPDG methods, we propose to switch the primal and dual spaces for primal-dual gradient methods, and to perform the extrapolation step in the dual (gradient) space. The resulting new first-order method, i.e., GEM, can be viewed as a dual version of Nesterov's accelerated gradient method, and we show that it can also achieve the optimal rate of convergence for black-box convex optimization. RGEM is a randomized version of GEM which only computes the gradient of a randomly selected component function f_i at each iteration. It utilizes the gradient extrapolation step also for estimating exact gradients, in addition to predicting dual information as in GEM. As a result, it has several advantages over RPDG. Firstly, RPDG requires the restrictive assumption that each f_i has to be differentiable and have Lipschitz continuous gradients over the whole of R^n, due to its primal extrapolation step. RGEM relaxes this assumption to having Lipschitz gradients over the feasible set X (see (1.2)), and hence can be applied to a much broader class of problems. Secondly, RGEM possesses a simpler convergence analysis carried out in the primal space, due to its simplified algorithmic scheme. By contrast, RPDG has a complicated algorithmic scheme, which contains a primal extrapolation step and a gradient (dual) prediction step in addition to solving a primal proximal subproblem, and thus leads to an intricate primal-dual convergence analysis. Last but not least, it is unknown whether RPDG could maintain the optimal convergence rate (1.6) without the exact gradient evaluation of f during initialization.

This paper is organized as follows. In Section 2 we present the proposed random gradient extrapolation methods (RGEM) and their convergence properties for solving (1.1) and (1.5). In order to provide more insights into the design of the algorithmic scheme of RGEM, we provide an introduction to the gradient extrapolation method (GEM) and its relation to the primal-dual gradient method, as well as Nesterov's method, in Section 3. Section 4 is devoted to the convergence analysis of RGEM. Some concluding remarks are made in Section 5.

1.1. Notation and terminology. We use ‖·‖ to denote a general norm in R^n without specific mention. We also use ‖·‖_* to denote the conjugate norm of ‖·‖. For any p ≥ 1, ‖·‖_p denotes the standard p-norm in R^n, i.e., ‖x‖_p^p = ∑_{i=1}^n |x_i|^p, for any x ∈ R^n. For any convex function h, ∂h(x) is the set of subdifferentials at x. For a given strongly convex function w with modulus 1 (see (1.3)), we define a prox-function associated with w as

P(x^0, x) ≡ P_w(x^0, x) := w(x) − [ w(x^0) + ⟨w'(x^0), x − x^0⟩ ],   (1.10)

where w'(x^0) ∈ ∂w(x^0) is an arbitrary subgradient of w at x^0. By the strong convexity of w, we have

P(x^0, x) ≥ (1/2) ‖x − x^0‖²,  ∀ x, x^0 ∈ X.   (1.11)

It should be pointed out that the prox-function P(·,·) described above is a generalized Bregman distance in the sense that w is not necessarily differentiable. This is different from the standard definition of a Bregman distance [5, 2, 3, 17, 6]. Throughout this paper, we assume that the prox-mapping associated with X and w, given by

M_X(g, x^0, η) := argmin_{x ∈ X} { ⟨g, x⟩ + μ w(x) + η P(x^0, x) },   (1.12)

is easily computable for any x^0 ∈ X, g ∈ R^n, μ ≥ 0, η > 0. For any real number r, ⌈r⌉ and ⌊r⌋ denote the nearest integers to r from above and below, respectively. R_+ and R_++, respectively, denote the sets of nonnegative and positive real numbers.
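For the common Euclidean setup w(x) = ‖x‖²/2, so that P(x^0, x) = ‖x − x^0‖²/2, the prox-mapping (1.12) admits a closed form, as the following short sketch shows; the project argument is an assumed user-supplied projector onto X (the identity if X = R^n).

import numpy as np

def prox_mapping(g, x0, eta, mu, project=lambda v: v):
    # M_X(g, x0, eta) = argmin_{x in X} <g, x> + (mu/2)||x||^2 + (eta/2)||x - x0||^2.
    # Completing the square, the objective equals
    # ((mu + eta)/2) * ||x - (eta*x0 - g)/(mu + eta)||^2 + const,
    # so the minimizer over X is the Euclidean projection of that point.
    return project((eta * x0 - g) / (mu + eta))

def project_ball(v, R=1.0):
    """Example projector for X = {x : ||x||_2 <= R}."""
    nrm = np.linalg.norm(v)
    return v if nrm <= R else (R / nrm) * v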
2. Algorithms and main results. This section contains three subsections. We first present in Subsection 2.1 an optimal random gradient extrapolation method (RGEM) for solving the distributed finite-sum problem in (1.1), and then discuss in Subsection 2.2 a stochastic version of RGEM for solving the stochastic finite-sum problem in (1.5). Subsection 2.3 is devoted to the implementation of RGEM in a distributed setting and a discussion of its communication complexity.

2.1. RGEM for deterministic finite-sum optimization. The basic scheme of RGEM is formally stated in Algorithm 1. This algorithm simply initializes the gradients as y^{-1} = y^0 = 0. At each iteration, RGEM requires the new gradient information of only one randomly selected component function f_{i_t}, but maintains m pairs of search points and gradients (x_i^t, y_i^t), i = 1, ..., m, which are stored by their corresponding agents in the distributed network. More specifically, it first performs a gradient extrapolation step in (2.1) and the primal proximal mapping in (2.2). Then a randomly selected block x_{i_t}^t is updated in (2.3), and the corresponding component gradient ∇f_{i_t} is computed in (2.4). As can be seen from Algorithm 1, RGEM does not require any exact gradient evaluations.

Algorithm 1 A random gradient extrapolation method (RGEM)
Input: Let x^0 ∈ X, and the nonnegative parameters {α_t}, {η_t}, and {τ_t} be given.
Initialization: Set x_i^0 = x^0, i = 1, ..., m, and y^{-1} = y^0 = 0.  ▷ No exact gradient evaluation for initialization
for t = 1, ..., k do
  Choose i_t according to Prob{i_t = i} = 1/m, i = 1, ..., m.
  Update z^t = (x^t, y^t) according to
    ỹ^t = y^{t-1} + α_t (y^{t-1} − y^{t-2}),   (2.1)
    x^t = M_X( (1/m) ∑_{i=1}^m ỹ_i^t, x^{t-1}, η_t ),   (2.2)
    x_i^t = (1 + τ_t)^{-1}(x^t + τ_t x_i^{t-1}) if i = i_t,  x_i^t = x_i^{t-1} if i ≠ i_t,   (2.3)
    y_i^t = ∇f_i(x_i^t) if i = i_t,  y_i^t = y_i^{t-1} if i ≠ i_t.   (2.4)
end for
Output: For some θ_t > 0, t = 1, ..., k, set

    x̄^k := ( ∑_{t=1}^k θ_t )^{-1} ∑_{t=1}^k θ_t x^t.   (2.5)

Note that the computation of x^t in (2.2) requires an involved computation of (1/m) ∑_{i=1}^m ỹ_i^t. In order to save computational time when implementing this algorithm, we suggest computing this quantity in a recursive manner as follows. Let us denote g^t ≡ (1/m) ∑_{i=1}^m y_i^t, t = 1, ..., k. Clearly, in view of the fact that y_i^t = y_i^{t-1} for all i ≠ i_t, we have

g^t = g^{t-1} + (1/m)(y_{i_t}^t − y_{i_t}^{t-1}).   (2.6)

Also, by the definition of g^t and (2.1), we have

(1/m) ∑_{i=1}^m ỹ_i^t = (1/m) ∑_{i=1}^m y_i^{t-1} + (α_t/m)(y_{i_{t-1}}^{t-1} − y_{i_{t-1}}^{t-2}) = g^{t-1} + (α_t/m)(y_{i_{t-1}}^{t-1} − y_{i_{t-1}}^{t-2}).   (2.7)

Using these two ideas, we can compute (1/m) ∑_{i=1}^m ỹ_i^t in two steps: i) initialize g^0 = 0, and update g^t as in (2.6) after the gradient evaluation step (2.4); ii) replace (2.1) by (2.7) to compute (1/m) ∑_{i=1}^m ỹ_i^t. Also note that the difference y_{i_t}^t − y_{i_t}^{t-1} can be saved, as it is used in both (2.6) and (2.7) for the next iteration. These enhancements will be incorporated into the distributed setting in Subsection 2.3 to possibly save communication costs.
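Putting (2.1)–(2.7) together, the following Python sketch simulates Algorithm 1 with the recursive averaging above in the Euclidean setup (w(x) = ‖x‖²/2); the constant step sizes follow the policy of Theorem 2.1 below, and the output weights θ_t = α^{−t} are our assumption, so the whole snippet should be read as an illustrative sketch rather than a verbatim prescription.

import numpy as np

def rgem(grads, x0, mu, L_hat, k, project=lambda v: v, seed=0):
    """Sketch of Algorithm 1; grads[i] is a callable returning grad f_i."""
    rng = np.random.default_rng(seed)
    m = len(grads)
    alpha = 1.0 - 1.0 / (m + np.sqrt(m**2 + 16.0 * m * L_hat / mu))  # (2.10)
    tau = 1.0 / (m * (1.0 - alpha)) - 1.0                            # (2.9)
    eta = mu * alpha / (1.0 - alpha)                                 # (2.9)
    prox = lambda g, xp: project((eta * xp - g) / (mu + eta))        # Euclidean (1.12)

    x = x0.copy()
    x_loc = np.tile(x0, (m, 1))        # x_i^t, stored by the agents
    y = np.zeros((m, x0.size))         # y_i^t; y^{-1} = y^0 = 0 (no initial gradients)
    g_avg = np.zeros_like(x0)          # g^t = (1/m) sum_i y_i^t, cf. (2.6)
    dy = np.zeros_like(x0)             # y_{i_{t-1}}^{t-1} - y_{i_{t-1}}^{t-2}
    xbar, wsum = np.zeros_like(x0), 0.0

    for t in range(1, k + 1):
        g_tilde = g_avg + alpha * dy               # (2.7), since alpha_t / m = alpha
        x = prox(g_tilde, x)                       # (2.2)
        i = int(rng.integers(m))                   # Prob{i_t = i} = 1/m
        x_loc[i] = (x + tau * x_loc[i]) / (1.0 + tau)   # (2.3)
        y_new = grads[i](x_loc[i])                 # (2.4): one component gradient only
        dy = y_new - y[i]
        g_avg += dy / m                            # (2.6)
        y[i] = y_new
        theta = alpha ** (-t)                      # assumed output weights
        xbar, wsum = xbar + theta * x, wsum + theta
    return xbar / wsum                             # (2.5)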

It is also interesting to observe the differences between RGEM and RPDG. RGEM has only one extrapolation step (2.1), which combines two types of prediction. One is to predict future gradients using historic data, and the other is to obtain an estimator of the current exact gradient of f from the randomly updated gradient information of the f_i. The RPDG method, however, needs two extrapolation steps, in both the primal and dual spaces. Due to the existence of the primal extrapolation step, RPDG cannot guarantee that the search points where it performs gradient evaluations fall within the feasible set X. Hence, it requires the assumption that the f_i's are differentiable with Lipschitz continuous gradients over R^n. Such a strong assumption is not required by RGEM, since all the primal iterates generated by RGEM stay within the feasible region X. As a result, RGEM can deal with a much wider class of problems than RPDG. Moreover, RGEM allows no exact gradient computation for initialization, which provides a fully-distributed algorithmic framework under the assumption that there exists σ_0 ≥ 0 such that

(1/m) ∑_{i=1}^m ‖∇f_i(x^0)‖_*² ≤ σ_0²,   (2.8)

where x^0 is the given initial point.

We now provide a constant step-size policy for RGEM to solve strongly convex problems given in the form of (1.1), and show that the resulting algorithm exhibits an optimal linear rate of convergence in Theorem 2.1. The proof of Theorem 2.1 can be found in Subsection 4.1.

Theorem 2.1. Let x* be an optimal solution of (1.1), let x^k and x̄^k be defined in (2.2) and (2.5), respectively, and let L̂ = max_{i=1,...,m} L_i. Also let {τ_t}, {η_t} and {α_t} be set to

τ_t ≡ τ = 1/(m(1−α)) − 1,  η_t ≡ η = μα/(1−α),  and α_t ≡ mα.   (2.9)

If (2.8) holds and α is set as

1 − α = 1/(m + √(m² + 16mL̂/μ)),   (2.10)

then

E[P(x^k, x*)] ≤ (2/μ) Δ_{0,σ0} α^k,   (2.11)
E[ψ(x̄^k)] − ψ(x*) ≤ 16 max{μ, L̂} (Δ_{0,σ0}/μ) α^{k/2},   (2.12)

where

Δ_{0,σ0} := μ P(x^0, x*) + ψ(x^0) − ψ* + σ_0²/(mμ).   (2.13)

In view of Theorem 2.1, we can provide bounds on the total number of gradient evaluations performed by RGEM to find a stochastic ε-solution of problem (1.1), i.e., a point x̄ ∈ X s.t. E[ψ(x̄)] − ψ* ≤ ε. Theorem 2.1 implies that the number of gradient evaluations of f_i performed by RGEM to find a stochastic ε-solution of (1.1) can be bounded by

K(ε, C, σ_0) = 2(m + √(m² + 16mC)) log( 16 max{m, C} Δ_{0,σ0}/(μ ε) ) = O{ (m + √(mL̂/μ)) log(1/ε) }.   (2.14)

Here C = L̂/μ. Therefore, whenever √(mC) log(1/ε) is dominating, and L_f and L̂ are of the same order of magnitude, RGEM can save up to O(√m) gradient evaluations of the component functions f_i over the optimal deterministic first-order methods. More specifically, RGEM does not require any exact gradient computation, and its communication cost is similar to pure stochastic gradient descent. To the best of our knowledge, this is the first time that such an optimal RIG method has been presented for solving (1.1) in the literature.

It should be pointed out that while the rates of convergence of RGEM obtained in Theorem 2.1 are stated in terms of expectation, we can develop large-deviation results for these rates of convergence using techniques similar to those in [22] for solving strongly convex problems. Furthermore, if a one-time exact gradient evaluation is available at the initial point, i.e., y^{-1} = y^0 = (∇f_1(x^0), ..., ∇f_m(x^0)), we can drop the assumption in (2.8) and employ a more aggressive stepsize policy with

α = 1 − 2/( m + 1 + √((m+1)² + 8mL̂/μ) ).

Similarly, we can demonstrate that the number of gradient evaluations of f_i performed by RGEM with this initialization method to find a stochastic ε-solution can be bounded by

2( m + 1 + √((m+1)² + 8mC) ) log( 6 max{m, C} Δ_{0,0}/(μ ε) ) = O{ (m + √(mL̂/μ)) log(1/ε) },

where Δ_{0,0} is defined as Δ_{0,σ0} in (2.13) with σ_0 = 0.

2.2. RGEM for stochastic finite-sum optimization. We discuss in this subsection the stochastic finite-sum optimization and online learning problems, where only noisy gradient information about f_i can be accessed via a stochastic first-order (SFO) oracle. In particular, for any given point x_i^t ∈ X, the SFO oracle outputs a vector G_i(x_i^t, ξ_i^t) s.t.

E_ξ[G_i(x_i^t, ξ_i^t)] = ∇f_i(x_i^t), i = 1, ..., m,   (2.15)
E_ξ[ ‖G_i(x_i^t, ξ_i^t) − ∇f_i(x_i^t)‖_*² ] ≤ σ², i = 1, ..., m.   (2.16)

We also assume throughout this subsection that the norm ‖·‖ is associated with the inner product ⟨·, ·⟩. As shown in Algorithm 2, the RGEM for stochastic finite-sum optimization is naturally obtained by replacing the gradient evaluation of ∇f_i in Algorithm 1 (see (2.4)) with the stochastic gradient estimator of ∇f_i given in (2.17). In particular, at each iteration we collect B_t stochastic gradients of only one randomly selected component f_i and take their average as the stochastic estimator of ∇f_i. Moreover, it needs to be mentioned that the way RGEM initializes the gradients, i.e., y^{-1} = y^0 = 0, is very important for stochastic optimization, since it is usually impossible to compute exact gradients of expectation functions even at the initial point.

Algorithm 2 RGEM for stochastic finite-sum optimization
This algorithm is the same as Algorithm 1 except that (2.4) is replaced by

y_i^t = (1/B_t) ∑_{j=1}^{B_t} G_i(x_i^t, ξ_{i,j}^t) if i = i_t,  y_i^t = y_i^{t-1} if i ≠ i_t.   (2.17)

Here, G_i(x_i^t, ξ_{i,j}^t), j = 1, ..., B_t, are the stochastic gradients of f_i computed by the SFO oracle at x_i^t.

Under the standard assumptions (2.15) and (2.16) for stochastic optimization, and with proper choices of algorithmic parameters, Theorem 2.2 shows that RGEM can achieve the optimal O{σ²/(μ²ε)} rate of convergence (up to a certain logarithmic factor) for solving strongly convex problems given in the form of (1.5), in terms of the number of stochastic gradients of f_i. The proof of this result can be found in Subsection 4.2.

Theorem 2.2. Let x* be an optimal solution of (1.5), let x^k and x̄^k be generated by Algorithm 2, and let L̂ = max_{i=1,...,m} L_i. Suppose that σ_0 and σ are defined in (2.8) and (2.16), respectively. Given the iteration limit k, let {τ_t}, {η_t} and {α_t} be set as in (2.9) with α given by (2.10), and also set

B_t = ⌈ k(1−α)² α^{−t} ⌉, t = 1, ..., k.   (2.18)

Then

E[P(x^k, x*)] ≤ (2α^k/μ) Δ_{0,σ0,σ},   (2.19)
E[ψ(x̄^k)] − ψ(x*) ≤ 26 max{μ, L̂} (Δ_{0,σ0,σ}/μ) α^{k/2},   (2.20)

where the expectation is taken w.r.t. {i_t} and {ξ_i^t}, and

Δ_{0,σ0,σ} := μ P(x^0, x*) + ψ(x^0) − ψ(x*) + (σ_0²/m + 5σ²)/μ.   (2.21)
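The following minimal sketch spells out the estimator (2.17) and the batch-size schedule (2.18); sfo(i, x) is an assumed stand-in for a single call to agent i's SFO oracle, returning one unbiased stochastic gradient of f_i at x.

import numpy as np

def minibatch_gradient(sfo, i, x_i, B_t):
    """Average of B_t stochastic gradients of f_i at x_i, cf. (2.17)."""
    return np.mean([sfo(i, x_i) for _ in range(B_t)], axis=0)

def batch_sizes(k, alpha):
    """Schedule (2.18): B_t = ceil(k (1-alpha)^2 alpha^(-t)); batches grow geometrically."""
    t = np.arange(1, k + 1, dtype=float)
    return np.ceil(k * (1.0 - alpha) ** 2 * alpha ** (-t)).astype(int)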

In view of (2.20), the number of iterations performed by RGEM to find a stochastic ε-solution of (1.5) can be bounded by

K̂(ε, C, σ_0, σ²) := 2(m + √(m² + 16mC)) log( 26 max{m, C} Δ_{0,σ0,σ}/(μ ε) ).   (2.22)

Furthermore, in view of (2.19), this iteration complexity bound can be improved to

K(ε, α, σ_0, σ²) := log_{1/α}( 2Δ_{0,σ0,σ}/(μ ε) ),   (2.23)

in terms of finding a point x̄ ∈ X s.t. E[P(x̄, x*)] ≤ ε. Therefore, the corresponding number of stochastic gradient evaluations performed by RGEM for solving problem (1.5) can be bounded by

∑_{t=1}^k B_t ≤ ∑_{t=1}^k ( k(1−α)² α^{−t} + 1 ) = O{ Δ_{0,σ0,σ}/(μ ε) + (m + √(mC)) log( Δ_{0,σ0,σ}/(μ ε) ) },   (2.24)

which together with (2.21) implies that the total number of required stochastic gradients, or samples of the random variables ξ_i, i = 1, ..., m, can be bounded by

Õ{ (σ_0²/m + σ²)/(μ²ε) + (μ P(x^0, x*) + ψ(x^0) − ψ*)/(μ ε) + m + √(mL̂/μ) }.

Observe that this bound does not depend on the number of terms m for small enough ε. To the best of our knowledge, this is the first time that such a convergence result has been established for RIG algorithms to solve distributed stochastic finite-sum problems. This complexity bound is in fact of the same order of magnitude (up to a logarithmic factor) as the complexity bound achieved by the optimal accelerated stochastic approximation methods [11, 12, 19], which uniformly sample all the random variables ξ_i, i = 1, ..., m. However, this latter approach involves much higher communication costs in the distributed setting (see Subsection 2.3 for more discussion).

2.3. RGEM for distributed optimization and machine learning. This subsection is devoted to RGEM (see Algorithm 1 and Algorithm 2) from two different perspectives, i.e., those of the server and of the activated agent, under a distributed setting. We also discuss the communication costs incurred by RGEM under this setting.

Both the server and the agents in the distributed network start with the same global initial point x^0, i.e., x_i^0 = x^0, i = 1, ..., m, and the server also sets Δy = 0 and g^0 = 0. During the process of RGEM, the server updates the iterate x^t and calculates the output solution x̄^k (cf. (2.5)), which is given by sumx/sumθ. Each agent only stores its local variable x_i^t and updates it according to the information received from the server (i.e., x^t) when activated. The activated agent also needs to upload the change in its gradient, Δy_{i_t}, to the server. Observe that since Δy_{i_t} might be sparse, uploading it will incur a smaller amount of communication cost than uploading the new gradient y_{i_t}^t. Note that line 5 of RGEM from the i_t-th agent's perspective is optional if the agent saves the historic gradient information from its last update.

RGEM: the server's perspective
1: while t ≤ k do
2:   x^t ← M_X( g^{t-1} + (α_t/m) Δy, x^{t-1}, η_t )
3:   sumx ← sumx + θ_t x^t
4:   sumθ ← sumθ + θ_t
5:   Send a signal to the i_t-th agent, where i_t is selected uniformly from {1, ..., m}
6:   if the i_t-th agent is responsive then
7:     Send the current iterate x^t to the i_t-th agent
8:     if feedback Δy is received then
9:       g^t ← g^{t-1} + (1/m) Δy
10:      t ← t + 1
11:    else goto Line 5
12:    end if
13:  else goto Line 5
14:  end if
15: end while

RGEM: the activated i_t-th agent's perspective
1: Download the current iterate x^t from the server
2: if t = 1 then
3:   y_{i_t}^{t-1} ← 0
4: else
5:   y_{i_t}^{t-1} ← ∇f_{i_t}(x_{i_t}^{t-1})   ▷ Optional
6: end if
7: x_{i_t}^t ← (1 + τ_t)^{-1}(x^t + τ_t x_{i_t}^{t-1})
8: y_{i_t}^t ← ∇f_{i_t}(x_{i_t}^t)
9: Upload the local change to the server, i.e., Δy_{i_t} = y_{i_t}^t − y_{i_t}^{t-1}

We now add some remarks about the potential benefits of RGEM for distributed optimization and machine learning. Firstly, since RGEM does not require any exact gradient evaluation of f, it does not need to wait for responses from all agents in order to compute an exact gradient. Each iteration of RGEM only involves communication between the server and the activated i_t-th agent. In fact, RGEM will move to the next iteration in case no response is received from the i_t-th agent. This scheme works under the assumption that the probability of any agent being responsive or available at a certain point in time is equal. By contrast, all other optimal RIG algorithms, except RPDG, need the exact gradient information from all network agents once in a while, which incurs high communication costs and synchronization delays as long as one agent is not responsive. Even RPDG requires a full round of communication and synchronization at the initial point. Secondly, since each iteration of RGEM involves only a constant number of communication rounds between the server and one selected agent, the communication complexity of RGEM under the distributed setting can be bounded by

O{ (m + √(mL̂/μ)) log(1/ε) }.

Therefore, it can save up to O(√m) rounds of communication over the optimal deterministic first-order methods. For solving distributed stochastic finite-sum optimization problems (1.5), RGEM from the i_t-th agent's perspective is slightly modified as follows.

RGEM: the activated i_t-th agent's perspective for solving (1.5)
1: Download the current iterate x^t from the server
2: if t = 1 then
3:   y_{i_t}^{t-1} ← 0   ▷ Assuming RGEM saves y_{i_t}^{t-1} for t ≥ 2 at the latest update
4: end if
5: x_{i_t}^t ← (1 + τ_t)^{-1}(x^t + τ_t x_{i_t}^{t-1})
6: y_{i_t}^t ← (1/B_t) ∑_{j=1}^{B_t} G_{i_t}(x_{i_t}^t, ξ_{i_t,j}^t)   ▷ B_t is the batch size, and the G_{i_t}'s are the stochastic gradients given by the SFO oracle
7: Upload the local change to the server, i.e., Δy_{i_t} = y_{i_t}^t − y_{i_t}^{t-1}
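As a toy illustration of these message exchanges, the following single-process Python sketch mirrors server lines 2–10 and agent lines 7–9 above in the Euclidean setup; the Agent class, the state dictionary, and the omission of the unresponsive-agent branch are simplifications of ours, not part of the paper.

import numpy as np

class Agent:
    """Local state of one agent: x_i^t, y_i^t, and a gradient oracle."""
    def __init__(self, grad, x0):
        self.grad, self.x_loc, self.y = grad, x0.copy(), np.zeros_like(x0)
    def update(self, x_server, tau):
        # Agent lines 7-9: blend in the downloaded iterate, compute the new
        # component gradient, and return only the change Delta y.
        self.x_loc = (x_server + tau * self.x_loc) / (1.0 + tau)
        y_new = self.grad(self.x_loc)
        dy, self.y = y_new - self.y, y_new
        return dy

def server_round(state, agents, mu, eta, tau, alpha, project=lambda v: v):
    """One pass of server lines 2-10; responsiveness checks are omitted."""
    m = len(agents)
    g_tilde = state["g"] + alpha * state["dy"]        # g^{t-1} + (alpha_t/m) Delta y
    state["x"] = project((eta * state["x"] - g_tilde) / (mu + eta))  # line 2
    i = int(state["rng"].integers(m))                 # line 5: pick an agent uniformly
    dy = agents[i].update(state["x"], tau)            # lines 6-8: one round trip
    state["g"] += dy / m                              # line 9: fold Delta y into g^t
    state["dy"] = dy
    return state["x"]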

Similar to the case of deterministic finite-sum optimization, the total number of communication rounds performed by the above RGEM can be bounded by

O{ (m + √(mL̂/μ)) log(1/ε) }

for solving (1.5). Each round of communication only involves the server and a randomly selected agent. This communication complexity seems to be optimal, since it matches the lower complexity bound (1.7) established in [22]. Moreover, the sampling complexity, i.e., the total number of samples to be collected by all the agents, is also nearly optimal and comparable to the case when all these samples are collected in a centralized location and processed by an optimal stochastic approximation method. On the other hand, if one applies an existing optimal stochastic approximation method to solve the distributed stochastic optimization problem, the communication complexity will be as high as O(1/√ε), which is much worse than that of RGEM.

3. Gradient extrapolation method: dual of Nesterov's acceleration. Our goal in this section is to introduce a new algorithmic framework, referred to as the gradient extrapolation method (GEM), for solving the convex optimization problem given by

ψ* := min_{x ∈ X} { ψ(x) := f(x) + μ w(x) }.   (3.1)

We show that GEM can be viewed as a dual of Nesterov's accelerated gradient method, although these two algorithms appear to be quite different. Moreover, GEM possesses some nice properties which enable us to develop and analyze the random gradient extrapolation method for distributed and stochastic optimization.

3.1. Generalized Bregman distance. In this subsection, we provide a brief introduction to the generalized Bregman distance defined in (1.10) and some properties of its associated prox-mapping defined in (1.12). Note that whenever w is non-differentiable, we need to specify a particular selection of the subgradient w' before performing the prox-mapping. We assume throughout this paper that such a selection of w' is defined recursively as follows. Denote x^1 ≡ M_X(g, x^0, η). By the optimality condition of (1.12), we have

g + (μ + η) w'(x^1) − η w'(x^0) ∈ −N_X(x^1),

where N_X(x^1) := { v ∈ R^n : v^T(x − x^1) ≤ 0, ∀ x ∈ X } denotes the normal cone of X at x^1. Once such a w'(x^1) satisfying the above relation is identified, we will use it as the subgradient when defining P(x^1, x) in the next iteration. Note that such a subgradient can be identified as long as x^1 is obtained, since it satisfies the optimality condition of (1.12).

The following lemma, which generalizes Lemma 6 of [20] and Lemma 2 of [11], characterizes the solutions of (1.12). The proof of this result can be found in Lemma 5 of [22].

Lemma 3.1. Let U be a closed convex set and a point ũ ∈ U be given. Also let w : U → R be a convex function and

W(ũ, u) = w(u) − w(ũ) − ⟨w'(ũ), u − ũ⟩

for some w'(ũ) ∈ ∂w(ũ). Assume that the function q : U → R satisfies

q(u_1) − q(u_2) − ⟨q'(u_2), u_1 − u_2⟩ ≥ μ_0 W(u_2, u_1),  ∀ u_1, u_2 ∈ U,

for some μ_0 ≥ 0. Also assume that the scalars μ_1 and μ_2 are chosen such that μ_0 + μ_1 + μ_2 ≥ 0. If

u* ∈ Argmin{ q(u) + μ_1 w(u) + μ_2 W(ũ, u) : u ∈ U },

then for any u ∈ U, we have

q(u*) + μ_1 w(u*) + μ_2 W(ũ, u*) + (μ_0 + μ_1 + μ_2) W(u*, u) ≤ q(u) + μ_1 w(u) + μ_2 W(ũ, u).

3.2. The algorithm. As shown in Algorithm 3, GEM starts with a gradient extrapolation step (3.2) to compute g̃^t from the two previous gradients g^{t-1} and g^{t-2}. Based on g̃^t, it performs a proximal gradient descent step in (3.3) and updates the output solution x̄^t. Finally, the gradient at x̄^t is computed for the gradient extrapolation in the next iteration. This algorithm is a special case of RGEM in Algorithm 1 (with m = 1).

Algorithm 3 An optimal gradient extrapolation method (GEM)
Input: Let x^0 ∈ X, and the nonnegative parameters {α_t}, {η_t}, and {τ_t} be given.
Set x̄^0 = x^0 and g^{-1} = g^0 = ∇f(x^0).
for t = 1, 2, ..., k do
  g̃^t = α_t (g^{t-1} − g^{t-2}) + g^{t-1},   (3.2)
  x^t = M_X( g̃^t, x^{t-1}, η_t ),   (3.3)
  x̄^t = ( x^t + τ_t x̄^{t-1} ) / (1 + τ_t),   (3.4)
  g^t = ∇f(x̄^t).   (3.5)
end for
Output: x̄^k.

We now show that GEM can be viewed as the dual of the well-known Nesterov's accelerated gradient (NAG) method as studied in [22]. To see this relationship, we first rewrite GEM in a primal-dual form. Let us consider the dual space G, where the gradients of f reside, and equip it with the conjugate norm ‖·‖_*. Let J_f : G → R be the conjugate function of f, so that f(x) := max_{g ∈ G} { ⟨x, g⟩ − J_f(g) }. We can reformulate the original problem (3.1) as the following saddle point problem:

ψ* := min_{x ∈ X} { max_{g ∈ G} { ⟨x, g⟩ − J_f(g) } + μ w(x) }.   (3.6)

It is clear that J_f is strongly convex with modulus 1/L_f w.r.t. ‖·‖_* (see Chapter E in [15] for details). Therefore, we can define its associated dual generalized Bregman distance and dual prox-mapping as

D_f(g^0, g) := J_f(g) − [ J_f(g^0) + ⟨J_f'(g^0), g − g^0⟩ ],   (3.7)
M_G(−x̃, g^0, τ) := argmin_{g ∈ G} { ⟨−x̃, g⟩ + J_f(g) + τ D_f(g^0, g) },   (3.8)

for any g^0, g ∈ G. The following result, whose proof is given in Lemma 1 of [22], shows that the computation of the dual prox-mapping associated with D_f is equivalent to the computation of ∇f.

Lemma 3.2. Let x̃ ∈ X and g^0 ∈ G be given, and let D_f(g^0, g) be defined in (3.7). For any τ > 0, denote z = [ x̃ + τ J_f'(g^0) ]/(1 + τ). Then ∇f(z) = M_G(−x̃, g^0, τ).

Using this result, we can see that the GEM iteration can be written in a primal-dual form. Given (x^0, g^{-1}, g^0) ∈ X × G × G, it updates (x^t, g^t) by

g̃^t = α_t (g^{t-1} − g^{t-2}) + g^{t-1},   (3.9)
x^t = M_X( g̃^t, x^{t-1}, η_t ),   (3.10)
g^t = M_G( −x^t, g^{t-1}, τ_t ),   (3.11)

with a specific selection of J_f'(g^{t-1}) = x̄^{t-1} in D_f(g^{t-1}, g). Indeed, by denoting x̄^0 = x^0, we can easily see from g^0 = ∇f(x̄^0) that x̄^0 ∈ ∂J_f(g^0). Now assume that g^{t-1} = ∇f(x̄^{t-1}), and hence that x̄^{t-1} ∈ ∂J_f(g^{t-1}). By the definition of g^t in (3.11) and Lemma 3.2, we conclude that g^t = ∇f(x̄^t) with x̄^t = (x^t + τ_t x̄^{t-1})/(1 + τ_t), which are exactly the definitions given in (3.4) and (3.5).
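A compact sketch of Algorithm 3 in the Euclidean setup (w(x) = ‖x‖²/2, so the prox-mapping has the closed form used below) may be helpful; the constant step sizes follow the strongly convex policy of Corollary 3.4 in the next subsection and should be treated as assumptions of this sketch.

import numpy as np

def gem(grad_f, x0, mu, L_f, k, project=lambda v: v):
    """Sketch of GEM (Algorithm 3) for the strongly convex case mu > 0."""
    tau = np.sqrt(2.0 * L_f / mu)        # Corollary 3.4 step sizes (assumed)
    eta = np.sqrt(2.0 * L_f * mu)
    alpha = tau / (1.0 + tau)
    x, x_out = x0.copy(), x0.copy()
    g_prev = g = grad_f(x0)              # g^{-1} = g^0 = grad f(x^0)
    for _ in range(k):
        g_tilde = g + alpha * (g - g_prev)              # (3.2): dual-space extrapolation
        x = project((eta * x - g_tilde) / (mu + eta))   # (3.3): Euclidean prox-mapping
        x_out = (x + tau * x_out) / (1.0 + tau)         # (3.4): output update
        g_prev, g = g, grad_f(x_out)                    # (3.5): gradient at the output
    return x_out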

Recall that in a simple version of the NAG method (e.g., [28, 34, 19, 11, 12, 13]), given (x^{t-1}, x̄^{t-1}) ∈ X × X, it updates (x^t, x̄^t) by

x̲^t = (1 − λ_t) x̄^{t-1} + λ_t x^{t-1},   (3.12)
g^t = ∇f(x̲^t),   (3.13)
x^t = M_X( g^t, x^{t-1}, η_t ),   (3.14)
x̄^t = (1 − λ_t) x̄^{t-1} + λ_t x^t,   (3.15)

for some λ_t ∈ [0, 1]. Moreover, we have shown in [22] that (3.12)–(3.15) can be viewed as a specific instantiation of the following primal-dual updates:

x̃^t = α_t (x^{t-1} − x^{t-2}) + x^{t-1},   (3.16)
g^t = M_G( −x̃^t, g^{t-1}, τ_t ),   (3.17)
x^t = M_X( g^t, x^{t-1}, η_t ).   (3.18)

Comparing (3.9)–(3.11) with (3.16)–(3.18), we can clearly see that GEM is a dual version of NAG, obtained by switching the primal and dual variables in each equation of (3.16)–(3.18). The major difference is that the extrapolation step in GEM is performed in the dual space, while the one in NAG is performed in the primal space. In fact, extrapolation in the dual space will help us to greatly simplify and further enhance the randomized incremental gradient methods developed in [22] based on NAG. Another interesting fact is that in GEM the gradients are computed at the output solutions {x̄^t}. On the other hand, the output solutions in the NAG method are given by {x̄^t}, while the gradients are computed at the extrapolation sequence {x̲^t}.

3.3. Convergence of GEM. Our goal in this subsection is to establish the convergence properties of the GEM method for solving (3.1). Observe that our analysis is carried out completely in the primal space and does not rely on the primal-dual interpretation described in the previous subsection. This type of analysis technique appears to be new for solving problem (3.1) in the literature, as it also differs significantly from that of NAG. We first establish some general convergence properties of GEM for both the smooth convex (μ = 0) and strongly convex (μ > 0) cases.

Theorem 3.3. Suppose that {η_t}, {τ_t}, and {α_t} in GEM satisfy

θ_{t-1} = α_t θ_t, t = 2, ..., k,   (3.19)
θ_t η_t ≤ θ_{t-1} (μ + η_{t-1}), t = 2, ..., k,   (3.20)
θ_t τ_t = θ_{t-1} (1 + τ_{t-1}), t = 2, ..., k,   (3.21)
α_t L_f ≤ τ_{t-1} η_t, t = 2, ..., k,   (3.22)
2 L_f ≤ τ_k (μ + η_k),   (3.23)

for some θ_t ≥ 0, t = 1, ..., k. Then, for any k ≥ 1 and any given x ∈ X, we have

θ_k (1 + τ_k) [ψ(x̄^k) − ψ(x)] + (θ_k (μ + η_k)/2) P(x^k, x) ≤ θ_1 τ_1 [ψ(x̄^0) − ψ(x)] + θ_1 η_1 P(x^0, x).   (3.24)

Proof. Applying Lemma 3.1 to (3.3), we obtain

⟨x^t − x, α_t (g^{t-1} − g^{t-2}) + g^{t-1}⟩ + μ w(x^t) − μ w(x) ≤ η_t P(x^{t-1}, x) − (μ + η_t) P(x^t, x) − η_t P(x^{t-1}, x^t).   (3.25)

Moreover, using the definition of ψ, the convexity of f, and the fact that g^t = ∇f(x̄^t), we have

(1 + τ_t) f(x̄^t) + μ w(x^t) − ψ(x)
 ≤ (1 + τ_t) f(x̄^t) + μ w(x^t) − μ w(x) − [ f(x̄^t) + ⟨g^t, x − x̄^t⟩ ]
 = τ_t [ f(x̄^t) − ⟨g^t, x̄^t − x̄^{t-1}⟩ ] − ⟨g^t, x − x^t⟩ + μ w(x^t) − μ w(x)
 ≤ −(τ_t/(2L_f)) ‖g^t − g^{t-1}‖_*² + τ_t f(x̄^{t-1}) − ⟨g^t, x − x^t⟩ + μ w(x^t) − μ w(x)
 ≤ −(τ_t/(2L_f)) ‖g^t − g^{t-1}‖_*² + τ_t f(x̄^{t-1}) + ⟨x^t − x, g^t − g^{t-1} − α_t (g^{t-1} − g^{t-2})⟩
   + η_t P(x^{t-1}, x) − (μ + η_t) P(x^t, x) − η_t P(x^{t-1}, x^t),

where the first equality follows from the definition of x̄^t in (3.4), the second inequality follows from the smoothness of f (see Theorem 2.1.5 in [28]), and the last inequality follows from (3.25). Multiplying both sides of the above inequality by θ_t, and summing up the resulting inequalities from t = 1 to k, we obtain

∑_{t=1}^k θ_t (1 + τ_t) f(x̄^t) + ∑_{t=1}^k θ_t [ μ w(x^t) − ψ(x) ]
 ≤ −∑_{t=1}^k (θ_t τ_t/(2L_f)) ‖g^t − g^{t-1}‖_*² + ∑_{t=1}^k θ_t τ_t f(x̄^{t-1}) + ∑_{t=1}^k θ_t ⟨x^t − x, g^t − g^{t-1} − α_t (g^{t-1} − g^{t-2})⟩
  + ∑_{t=1}^k θ_t [ η_t P(x^{t-1}, x) − (μ + η_t) P(x^t, x) − η_t P(x^{t-1}, x^t) ].   (3.26)

Now by (3.19) and the fact that g^{-1} = g^0, we have

∑_{t=1}^k θ_t ⟨x^t − x, g^t − g^{t-1} − α_t (g^{t-1} − g^{t-2})⟩
 = ∑_{t=1}^k [ θ_t ⟨x^t − x, g^t − g^{t-1}⟩ − θ_t α_t ⟨x^{t-1} − x, g^{t-1} − g^{t-2}⟩ ] − ∑_{t=2}^k θ_t α_t ⟨x^t − x^{t-1}, g^{t-1} − g^{t-2}⟩
 = θ_k ⟨x^k − x, g^k − g^{k-1}⟩ − ∑_{t=2}^k θ_t α_t ⟨x^t − x^{t-1}, g^{t-1} − g^{t-2}⟩.

Moreover, in view of (3.20), (3.21) and the definition of x̄^t in (3.4), we obtain

∑_{t=1}^k θ_t [ η_t P(x^{t-1}, x) − (μ + η_t) P(x^t, x) ] ≤ θ_1 η_1 P(x^0, x) − θ_k (μ + η_k) P(x^k, x)  (by (3.20)),
∑_{t=1}^k θ_t [ (1 + τ_t) f(x̄^t) − τ_t f(x̄^{t-1}) ] = θ_k (1 + τ_k) f(x̄^k) − θ_1 τ_1 f(x̄^0)  (by (3.21)),
∑_{t=1}^k θ_t = ∑_{t=2}^k [ θ_t τ_t − θ_{t-1} τ_{t-1} ] + θ_k = θ_k (1 + τ_k) − θ_1 τ_1  (by (3.21)),
θ_k (1 + τ_k) x̄^k = θ_k (x^k + τ_k x̄^{k-1}) = ⋯ = ∑_{t=1}^k θ_t x^t + θ_1 τ_1 x̄^0  (by (3.4) and (3.21)).

The last two relations, in view of the convexity of w(·), also imply that

θ_k (1 + τ_k) w(x̄^k) ≤ ∑_{t=1}^k θ_t w(x^t) + θ_1 τ_1 w(x̄^0).

Therefore, by (3.26), the above relations, and the definition of ψ, we conclude that

θ_k (1 + τ_k) [ψ(x̄^k) − ψ(x)]
 ≤ −∑_{t=2}^k [ (θ_{t-1} τ_{t-1}/(2L_f)) ‖g^{t-1} − g^{t-2}‖_*² + θ_t α_t ⟨x^t − x^{t-1}, g^{t-1} − g^{t-2}⟩ + θ_t η_t P(x^{t-1}, x^t) ]
  − θ_k [ (τ_k/(2L_f)) ‖g^k − g^{k-1}‖_*² − ⟨x^k − x, g^k − g^{k-1}⟩ + (μ + η_k) P(x^k, x) ]
  + θ_1 η_1 P(x^0, x) + θ_1 τ_1 [ψ(x̄^0) − ψ(x)] − θ_1 η_1 P(x^0, x^1).   (3.27)

By the strong convexity of P(·,·) in (1.11), the simple relation that b⟨u, v⟩ − a‖v‖²/2 ≤ b²‖u‖²/(2a) for any a > 0, and the conditions (3.22) and (3.23), we have

−∑_{t=2}^k [ (θ_{t-1} τ_{t-1}/(2L_f)) ‖g^{t-1} − g^{t-2}‖_*² + θ_t α_t ⟨x^t − x^{t-1}, g^{t-1} − g^{t-2}⟩ + θ_t η_t P(x^{t-1}, x^t) ]
 ≤ ∑_{t=2}^k (θ_t/2) ( α_t L_f/τ_{t-1} − η_t ) ‖x^{t-1} − x^t‖² ≤ 0,
−θ_k [ (τ_k/(2L_f)) ‖g^k − g^{k-1}‖_*² − ⟨x^k − x, g^k − g^{k-1}⟩ + ((μ + η_k)/2) P(x^k, x) ]
 ≤ θ_k ( L_f/(2τ_k) − (μ + η_k)/4 ) ‖x^k − x‖² ≤ 0.

Using the above relations in (3.27), we obtain (3.24).

We are now ready to establish the optimal convergence behavior of GEM as a consequence of Theorem 3.3. We first provide a constant step-size policy which guarantees an optimal linear rate of convergence for the strongly convex case (μ > 0).

Corollary 3.4. Let x* be an optimal solution of (1.1), and let x^k and x̄^k be defined in (3.3) and (3.4), respectively. Suppose that μ > 0, and that {τ_t}, {η_t} and {α_t} are set to

τ_t ≡ τ = √(2L_f/μ),  η_t ≡ η = √(2L_f μ),  and α_t ≡ α = √(2L_f/μ)/(1 + √(2L_f/μ)),  t = 1, ..., k.   (3.28)

Then,

P(x^k, x*) ≤ 2α^k [ P(x^0, x*) + (1/μ)(ψ(x^0) − ψ*) ],   (3.29)
ψ(x̄^k) − ψ* ≤ α^k [ μ P(x^0, x*) + ψ(x^0) − ψ* ].   (3.30)

15 and the conditions in (3.) and (3.3), we have t= θt 1τ t 1 L f g t 1 g t + θ t α t x t x t 1, g t 1 g t + θ t η t P (x t 1, x t ) ( ) k θ t αtl f t= τ t 1 η t x t 1 x t 0 θ τk k L f g k g k 1 x k x, g k g k 1 + (+η k) P (x k, x) ) x k x 0. θ k ( Lf τ k +η k Using the above relations in (3.7), we obtain (3.4). We are now ready to establish the optial convergence behavior of GEM as a consequence of Theore 3.3. We first provide a constant step-size policy which guarantees an optial linear rate of convergence for the strongly convex case ( > 0). Corollary 3.4. Let x be an optial solution of (1.1), x k and x k be defined in (3.3) and (3.4), respectively. Suppose that > 0, and that {τ t }, {η t } and {α t } are set to Lf τ t τ =, η t η = Lf / L f, and α t α =, t = 1,..., k. (3.8) Then, 1+ L f / P (x k, x ) α k P (x 0, x ) + 1 (ψ(x0 ) ψ ), (3.9) ψ(x k ) ψ α k P (x 0, x ) + ψ(x 0 ) ψ. (3.30) Proof. Let us set θ t = α t, t = 1,..., k. It is easy to check that the selection of {τ t }, {η t } and {α t } in (3.8) satisfies conditions (3.19)-(3.3). In view of Theore 3.3 and (3.8), we have ψ(x k ) ψ(x ) + +η (1+τ) P (xk, x ) θ1τ θ k (1+τ) ψ(x0 ) ψ(x ) + = α k ψ(x 0 ) ψ(x ) + P (x 0, x ). It also follows fro the above relation, the fact ψ(x k ) ψ(x ) 0, and (3.8) that θ1η θ k (1+τ) P (x0, x ) P (x k, x ) (1+τ)αk +η P (x 0, x ) + ψ(x 0 ) ψ(x ) = α k P (x 0, x ) + 1 (ψ(x0 ) ψ(x )). We now provide a stepsize policy which guarantees the optial rate of convergence for the sooth case ( = 0). Observe that in sooth case we can estiate the solution quality for the sequence {x k } only. Corollary 3.5. Let x be an optial solution of (1.1), and x k be defined in (3.4). Suppose that = 0, and that {τ t }, {η t } and {α t } are set to Then, τ t = t, η t = 4L f t, and α t = t t+1, t = 1,..., k. (3.31) ψ(x k ) ψ(x ) = f(x k ) f(x ) (k+1)(k+) f(x0 ) f(x ) + 8L f P (x 0, x ). (3.3) Proof. Let us set θ t = t+1, t = 1,..., k. It is easy to check that the paraeters in (3.31) satisfy conditions (3.)-(3.3). In view of (3.4) and (3.31), we conclude that ψ(x k ) ψ(x ) (k+1)(k+) ψ(x0 ) ψ(x ) + 8L f P (x 0, x ). 15

In Corollary 3.6 we improve the above complexity result in terms of the dependence on f(x^0) − f(x*) by using a different step-size policy and a slightly more involved analysis for the smooth case (μ = 0).

Corollary 3.6. Let x* be an optimal solution of (1.1), and let x^k and x̄^k be defined in (3.3) and (3.4), respectively. Suppose that μ = 0, and that {τ_t}, {η_t} and {α_t} are set to

τ_t = (t−1)/2,  η_t = 6L_f/t,  and α_t = (t−1)/t,  t = 1, ..., k.   (3.33)

Then, for any k ≥ 1,

ψ(x̄^k) − ψ(x*) = f(x̄^k) − f(x*) ≤ (12L_f/(k(k+1))) P(x^0, x*).   (3.34)

Proof. Let us set θ_t = t, t = 1, ..., k. It is easy to check that the parameters in (3.33) satisfy conditions (3.19)–(3.21) and (3.23). However, condition (3.22) only holds for t = 3, ..., k, i.e.,

α_t L_f ≤ τ_{t-1} η_t,  t = 3, ..., k.   (3.35)

In view of (3.27) and the fact that τ_1 = 0, we have

θ_k (1 + τ_k) [ψ(x̄^k) − ψ(x)]
 ≤ −[ θ_2 α_2 ⟨x^2 − x^1, g^1 − g^0⟩ + θ_2 η_2 P(x^1, x^2) ] − θ_1 η_1 P(x^0, x^1)
  − ∑_{t=3}^k [ (θ_{t-1} τ_{t-1}/(2L_f)) ‖g^{t-1} − g^{t-2}‖_*² + θ_t α_t ⟨x^t − x^{t-1}, g^{t-1} − g^{t-2}⟩ + θ_t η_t P(x^{t-1}, x^t) ]
  − θ_k [ (τ_k/(2L_f)) ‖g^k − g^{k-1}‖_*² − ⟨x^k − x, g^k − g^{k-1}⟩ + (μ + η_k) P(x^k, x) ] + θ_1 η_1 P(x^0, x)
 ≤ (θ_2 α_2²/(2η_2)) ‖g^1 − g^0‖_*² − (θ_1 η_1/2) ‖x^1 − x^0‖²
  + ∑_{t=3}^k (θ_t/2)( α_t L_f/τ_{t-1} − η_t ) ‖x^{t-1} − x^t‖² + θ_k ( L_f/(2τ_k) − η_k/4 ) ‖x^k − x‖²
  + θ_1 η_1 P(x^0, x) − (θ_k η_k/2) P(x^k, x)
 ≤ ( θ_2 α_2² L_f²/(2η_2) − θ_1 η_1/2 ) ‖x^1 − x^0‖² + θ_1 η_1 P(x^0, x) − (θ_k η_k/2) P(x^k, x)
 ≤ θ_1 η_1 P(x^0, x) − (θ_k η_k/2) P(x^k, x),

where the second inequality follows from the simple relation that b⟨u, v⟩ − a‖v‖²/2 ≤ b²‖u‖²/(2a) for any a > 0, together with (1.11), (3.35) and (3.23), and the third inequality follows from the definition of g^t in (3.5), relation (1.4), and the facts that x̄^0 = x^0 and x̄^1 = x^1 (due to τ_1 = 0). Therefore, by plugging the parameter setting (3.33) into the above inequality, we conclude that

ψ(x̄^k) − ψ* = f(x̄^k) − f(x*) ≤ [θ_k (1 + τ_k)]^{-1} [ θ_1 η_1 P(x^0, x*) − (θ_k η_k/2) P(x^k, x*) ] ≤ (12L_f/(k(k+1))) P(x^0, x*).

In view of the results obtained in the above three corollaries, GEM exhibits optimal rates of convergence for both the strongly convex and smooth cases. Different from the classical NAG method, GEM performs extrapolation on the gradients, rather than on the iterates. This fact will help us to develop a randomized incremental gradient method, i.e., the random gradient extrapolation method, that improves upon RPDG in [22] and admits a much simpler analysis.

4. Convergence analysis of RGEM. Our main goal in this section is to establish the convergence properties of RGEM for solving (1.1) and (1.5), i.e., the main results stated in Theorems 2.1 and 2.2. In fact, comparing RGEM in Algorithm 1 with GEM in Algorithm 3, RGEM is a direct randomization of GEM. Therefore, inheriting from GEM, its convergence analysis is carried out completely in the primal space.

However, the analysis for RGEM is more challenging, especially because we need to 1) build up the relationship between (1/m) ∑_{i=1}^m E[f_i(x_i^k)] and f(x^k), for which we exploit the function Q defined in (4.3) as an intermediate tool; 2) bound the error caused by inexact gradients at the initial point; and 3) analyze the accumulated error caused by randomization and noisy stochastic gradients.

Before proving Theorems 2.1 and 2.2, we first need to provide some important technical results. The following simple result demonstrates a few identities related to x_i^t (cf. (2.3)) and y_i^t (cf. (2.4) or (2.17)).

Lemma 4.1. Let x^t and y^t be defined in (2.2) and (2.4) (or (2.17)), respectively, and let x̂_i^t and ŷ_i^t be defined as

x̂_i^t := (1 + τ_t)^{-1}(x^t + τ_t x_i^{t-1}), i = 1, ..., m, t ≥ 1,   (4.1)
ŷ_i^t := ∇f_i(x̂_i^t) if y^t is defined in (2.4), and ŷ_i^t := (1/B_t) ∑_{j=1}^{B_t} G_i(x̂_i^t, ξ_{i,j}^t) if y^t is defined in (2.17), i = 1, ..., m, t ≥ 1.   (4.2)

Then we have, for any i = 1, ..., m and t = 1, ..., k,

E_t[y_i^t] = (1/m) ŷ_i^t + (1 − 1/m) y_i^{t-1},
E_t[x_i^t] = (1/m) x̂_i^t + (1 − 1/m) x_i^{t-1},
E_t[f_i(x_i^t)] = (1/m) f_i(x̂_i^t) + (1 − 1/m) f_i(x_i^{t-1}),
E_t[∇f_i(x_i^t) − ∇f_i(x_i^{t-1})] = (1/m) [ ∇f_i(x̂_i^t) − ∇f_i(x_i^{t-1}) ],

where E_t denotes the conditional expectation w.r.t. i_t given i_1, ..., i_{t-1} when y^t is defined in (2.4), and w.r.t. i_t given i_1, ..., i_{t-1}, ξ_1^t, ..., ξ_m^t when y^t is defined in (2.17).

Proof. The first equality follows immediately from the facts that Prob_t{y_i^t = ŷ_i^t} = Prob_t{i_t = i} = 1/m and Prob_t{y_i^t = y_i^{t-1}} = 1 − 1/m. Here Prob_t denotes the conditional probability w.r.t. i_t given i_1, ..., i_{t-1} when y^t is defined in (2.4), and w.r.t. i_t given i_1, ..., i_{t-1}, ξ_1^t, ..., ξ_m^t when y^t is defined in (2.17). The remaining equalities can be proved similarly.

We define the following function Q to help us analyze the convergence properties of RGEM. For any two feasible solutions x̄, x ∈ X of (1.1) (or (1.5)), we define the corresponding Q(x̄, x) by

Q(x̄, x) := ⟨∇f(x), x̄ − x⟩ + μ w(x̄) − μ w(x).   (4.3)

It is obvious that if we fix x = x*, an optimal solution of (1.1) (or (1.5)), then by the convexity of w and the optimality condition of x*, for any feasible solution x̄ we can conclude that

Q(x̄, x*) ≥ ⟨∇f(x*) + μ w'(x*), x̄ − x*⟩ ≥ 0.

Moreover, observing that f is smooth, we conclude that

Q(x̄, x*) = f(x*) + ⟨∇f(x*), x̄ − x*⟩ + μ w(x̄) − ψ(x*) ≥ −(L_f/2) ‖x̄ − x*‖² + ψ(x̄) − ψ(x*).   (4.4)

The following lemma establishes an important relationship regarding Q.

Lemma 4.2. Let x^t be defined in (2.2), and let x ∈ X be any feasible solution of (1.1) or (1.5). Suppose that the τ_t in RGEM satisfy

θ_t ( m(1 + τ_t) − 1 ) = θ_{t-1} m (1 + τ_{t-1}), t = 2, ..., k,   (4.5)

for some θ_t ≥ 0, t = 1, ..., k. Then we have

∑_{t=1}^k θ_t E[Q(x^t, x)] ≤ θ_k (1 + τ_k) (1/m) ∑_{i=1}^m E[f_i(x_i^k)] + ∑_{t=1}^k θ_t [ E[μ w(x^t)] − ψ(x) ] + θ_1 ( (1 + τ_1) m − 1 ) (1/m) [ ⟨x^0 − x, ∇f(x)⟩ + f(x) ].   (4.6)
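Since the identities in Lemma 4.1 drive the entire analysis, the following quick Monte Carlo sanity check (ours, not part of the paper) illustrates the first identity, using synthetic vectors in place of y^{t-1} and ŷ^t.

import numpy as np

rng = np.random.default_rng(1)
m, n, trials = 5, 3, 100_000
y_prev = rng.normal(size=(m, n))    # stands in for y^{t-1}
y_hat = rng.normal(size=(m, n))     # stands in for yhat^t

acc = np.zeros((m, n))
for _ in range(trials):
    i_t = rng.integers(m)           # Prob{i_t = i} = 1/m
    y = y_prev.copy()
    y[i_t] = y_hat[i_t]             # only the selected block takes the new value
    acc += y
empirical = acc / trials
theory = y_hat / m + (1.0 - 1.0 / m) * y_prev
print(np.max(np.abs(empirical - theory)))   # small, ~O(1/sqrt(trials))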


Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information

arxiv: v1 [cs.lg] 8 Jan 2019

arxiv: v1 [cs.lg] 8 Jan 2019 Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v

More information

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials Fast Montgoery-like Square Root Coputation over GF( ) for All Trinoials Yin Li a, Yu Zhang a, a Departent of Coputer Science and Technology, Xinyang Noral University, Henan, P.R.China Abstract This letter

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

1 Rademacher Complexity Bounds

1 Rademacher Complexity Bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #10 Scribe: Max Goer March 07, 2013 1 Radeacher Coplexity Bounds Recall the following theore fro last lecture: Theore 1. With probability

More information

Inexact Proximal Gradient Methods for Non-Convex and Non-Smooth Optimization

Inexact Proximal Gradient Methods for Non-Convex and Non-Smooth Optimization The Thirty-Second AAAI Conference on Artificial Intelligence AAAI-8) Inexact Proxial Gradient Methods for Non-Convex and Non-Sooth Optiization Bin Gu, De Wang, Zhouyuan Huo, Heng Huang * Departent of Electrical

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

PAC-Bayes Analysis Of Maximum Entropy Learning

PAC-Bayes Analysis Of Maximum Entropy Learning PAC-Bayes Analysis Of Maxiu Entropy Learning John Shawe-Taylor and David R. Hardoon Centre for Coputational Statistics and Machine Learning Departent of Coputer Science University College London, UK, WC1E

More information

Combining Classifiers

Combining Classifiers Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

More information

Complex Quadratic Optimization and Semidefinite Programming

Complex Quadratic Optimization and Semidefinite Programming Coplex Quadratic Optiization and Seidefinite Prograing Shuzhong Zhang Yongwei Huang August 4 Abstract In this paper we study the approxiation algoriths for a class of discrete quadratic optiization probles

More information

Hybrid System Identification: An SDP Approach

Hybrid System Identification: An SDP Approach 49th IEEE Conference on Decision and Control Deceber 15-17, 2010 Hilton Atlanta Hotel, Atlanta, GA, USA Hybrid Syste Identification: An SDP Approach C Feng, C M Lagoa, N Ozay and M Sznaier Abstract The

More information

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations Randoized Accuracy-Aware Progra Transforations For Efficient Approxiate Coputations Zeyuan Allen Zhu Sasa Misailovic Jonathan A. Kelner Martin Rinard MIT CSAIL zeyuan@csail.it.edu isailo@it.edu kelner@it.edu

More information

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes Graphical Models in Local, Asyetric Multi-Agent Markov Decision Processes Ditri Dolgov and Edund Durfee Departent of Electrical Engineering and Coputer Science University of Michigan Ann Arbor, MI 48109

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

Estimating Entropy and Entropy Norm on Data Streams

Estimating Entropy and Entropy Norm on Data Streams Estiating Entropy and Entropy Nor on Data Streas Ait Chakrabarti 1, Khanh Do Ba 1, and S. Muthukrishnan 2 1 Departent of Coputer Science, Dartouth College, Hanover, NH 03755, USA 2 Departent of Coputer

More information

Multi-Dimensional Hegselmann-Krause Dynamics

Multi-Dimensional Hegselmann-Krause Dynamics Multi-Diensional Hegselann-Krause Dynaics A. Nedić Industrial and Enterprise Systes Engineering Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu B. Touri Coordinated Science Laboratory

More information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information Cite as: Straub D. (2014). Value of inforation analysis with structural reliability ethods. Structural Safety, 49: 75-86. Value of Inforation Analysis with Structural Reliability Methods Daniel Straub

More information

C na (1) a=l. c = CO + Clm + CZ TWO-STAGE SAMPLE DESIGN WITH SMALL CLUSTERS. 1. Introduction

C na (1) a=l. c = CO + Clm + CZ TWO-STAGE SAMPLE DESIGN WITH SMALL CLUSTERS. 1. Introduction TWO-STGE SMPLE DESIGN WITH SMLL CLUSTERS Robert G. Clark and David G. Steel School of Matheatics and pplied Statistics, University of Wollongong, NSW 5 ustralia. (robert.clark@abs.gov.au) Key Words: saple

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley ENSIAG 2 / osig 1 Second Seester 2012/2013 Lesson 20 2 ay 2013 Kernel ethods and Support Vector achines Contents Kernel Functions...2 Quadratic

More information

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels Extension of CSRSM for the Paraetric Study of the Face Stability of Pressurized Tunnels Guilhe Mollon 1, Daniel Dias 2, and Abdul-Haid Soubra 3, M.ASCE 1 LGCIE, INSA Lyon, Université de Lyon, Doaine scientifique

More information

arxiv: v2 [cs.lg] 30 Mar 2017

arxiv: v2 [cs.lg] 30 Mar 2017 Batch Renoralization: Towards Reducing Minibatch Dependence in Batch-Noralized Models Sergey Ioffe Google Inc., sioffe@google.co arxiv:1702.03275v2 [cs.lg] 30 Mar 2017 Abstract Batch Noralization is quite

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

A Probabilistic and RIPless Theory of Compressed Sensing

A Probabilistic and RIPless Theory of Compressed Sensing A Probabilistic and RIPless Theory of Copressed Sensing Eanuel J Candès and Yaniv Plan 2 Departents of Matheatics and of Statistics, Stanford University, Stanford, CA 94305 2 Applied and Coputational Matheatics,

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies Approxiation in Stochastic Scheduling: The Power of -Based Priority Policies Rolf Möhring, Andreas Schulz, Marc Uetz Setting (A P p stoch, r E( w and (B P p stoch E( w We will assue that the processing

More information

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805

More information

arxiv: v1 [math.na] 10 Oct 2016

arxiv: v1 [math.na] 10 Oct 2016 GREEDY GAUSS-NEWTON ALGORITHM FOR FINDING SPARSE SOLUTIONS TO NONLINEAR UNDERDETERMINED SYSTEMS OF EQUATIONS MÅRTEN GULLIKSSON AND ANNA OLEYNIK arxiv:6.395v [ath.na] Oct 26 Abstract. We consider the proble

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Composite optimization for robust blind deconvolution

Composite optimization for robust blind deconvolution Coposite optiization for robust blind deconvolution Vasileios Charisopoulos Daek Davis Mateo Díaz Ditriy Drusvyatskiy Abstract The blind deconvolution proble seeks to recover a pair of vectors fro a set

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

Optimal Jamming Over Additive Noise: Vector Source-Channel Case

Optimal Jamming Over Additive Noise: Vector Source-Channel Case Fifty-first Annual Allerton Conference Allerton House, UIUC, Illinois, USA October 2-3, 2013 Optial Jaing Over Additive Noise: Vector Source-Channel Case Erah Akyol and Kenneth Rose Abstract This paper

More information

Lecture 9: Multi Kernel SVM

Lecture 9: Multi Kernel SVM Lecture 9: Multi Kernel SVM Stéphane Canu stephane.canu@litislab.eu Sao Paulo 204 April 6, 204 Roadap Tuning the kernel: MKL The ultiple kernel proble Sparse kernel achines for regression: SVR SipleMKL:

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

Efficient Learning with Partially Observed Attributes

Efficient Learning with Partially Observed Attributes Nicolò Cesa-Bianchi DSI, Università degli Studi di Milano, Italy Shai Shalev-Shwartz The Hebrew University, Jerusale, Israel Ohad Shair The Hebrew University, Jerusale, Israel Abstract We describe and

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010 A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING By Eanuel J Candès Yaniv Plan Technical Report No 200-0 Noveber 200 Departent of Statistics STANFORD UNIVERSITY Stanford, California 94305-4065

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Gradient Sliding for Composite Optimization

Gradient Sliding for Composite Optimization Noname manuscript No. (will be inserted by the editor) Gradient Sliding for Composite Optimization Guanghui Lan the date of receipt and acceptance should be inserted later Abstract We consider in this

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

Error Exponents in Asynchronous Communication

Error Exponents in Asynchronous Communication IEEE International Syposiu on Inforation Theory Proceedings Error Exponents in Asynchronous Counication Da Wang EECS Dept., MIT Cabridge, MA, USA Eail: dawang@it.edu Venkat Chandar Lincoln Laboratory,

More information

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing

More information

A Note on Online Scheduling for Jobs with Arbitrary Release Times

A Note on Online Scheduling for Jobs with Arbitrary Release Times A Note on Online Scheduling for Jobs with Arbitrary Release Ties Jihuan Ding, and Guochuan Zhang College of Operations Research and Manageent Science, Qufu Noral University, Rizhao 7686, China dingjihuan@hotail.co

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

Interactive Markov Models of Evolutionary Algorithms

Interactive Markov Models of Evolutionary Algorithms Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization

Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization Structured signal recovery fro quadratic easureents: Breaking saple coplexity barriers via nonconvex optiization Mahdi Soltanolkotabi Ming Hsieh Departent of Electrical Engineering University of Southern

More information

Necessity of low effective dimension

Necessity of low effective dimension Necessity of low effective diension Art B. Owen Stanford University October 2002, Orig: July 2002 Abstract Practitioners have long noticed that quasi-monte Carlo ethods work very well on functions that

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

MSEC MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL SOLUTION FOR MAINTENANCE AND PERFORMANCE

MSEC MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL SOLUTION FOR MAINTENANCE AND PERFORMANCE Proceeding of the ASME 9 International Manufacturing Science and Engineering Conference MSEC9 October 4-7, 9, West Lafayette, Indiana, USA MSEC9-8466 MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL

More information

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds.

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds. Proceedings of the 2016 Winter Siulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtan, E. Zhou, T. Huschka, and S. E. Chick, eds. THE EMPIRICAL LIKELIHOOD APPROACH TO SIMULATION INPUT UNCERTAINTY

More information

Optimal Resource Allocation in Multicast Device-to-Device Communications Underlaying LTE Networks

Optimal Resource Allocation in Multicast Device-to-Device Communications Underlaying LTE Networks 1 Optial Resource Allocation in Multicast Device-to-Device Counications Underlaying LTE Networks Hadi Meshgi 1, Dongei Zhao 1 and Rong Zheng 2 1 Departent of Electrical and Coputer Engineering, McMaster

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

Analyzing Simulation Results

Analyzing Simulation Results Analyzing Siulation Results Dr. John Mellor-Cruey Departent of Coputer Science Rice University johnc@cs.rice.edu COMP 528 Lecture 20 31 March 2005 Topics for Today Model verification Model validation Transient

More information