Data Dependent Convergence For Consensus Stochastic Optimization

Avleen S. Bijral, Anand D. Sarwate, Nathan Srebro

arXiv [math.OC], September 2016

Abstract

We study a distributed consensus-based stochastic gradient descent (SGD) algorithm and show that the rate of convergence involves the spectral properties of two matrices: the standard spectral gap of a weight matrix from the network topology and a new term depending on the spectral norm of the sample covariance matrix of the data. This data-dependent convergence rate shows that distributed SGD algorithms perform better on datasets with small spectral norm. Our analysis method also allows us to find data-dependent convergence rates as we limit the amount of communication. Spreading a fixed amount of data across more nodes slows convergence; for asymptotically growing data sets we show that adding more machines can help when minimizing twice-differentiable losses.

1 Introduction

Decentralized optimization algorithms for statistical computation and machine learning on large data sets try to trade off efficiency (in terms of estimation error) and speed (from parallelization). From an empirical perspective, it is often unclear when these methods will work for a particular data set, and to what degree additional communication can improve performance. For example, in high-dimensional problems communication can be costly. We would therefore like to know when limiting communication is feasible or beneficial. The theoretical analysis of distributed optimization methods has focused on providing strong data-independent convergence rates under analytic assumptions on the objective function such as convexity and smoothness. In this paper we show how the tradeoff between efficiency and speed is affected by the data distribution itself. We study a class of distributed optimization algorithms and prove an upper bound on the error that depends on the spectral norm of the data covariance.
By tuning the frequency with which nodes communicate, we obtain a bound that depends on the data distribution, the network size and topology, and the amount of communication. This allows us to interpolate between regimes where communication is cheap (e.g. shared-memory systems) and those where it is not (clusters and sensor networks). We study the problem of minimizing a regularized convex function [1] of the form

J(w) = \frac{1}{N}\sum_{i=1}^{N} \ell(w^\top x_i; y_i) + \frac{\mu}{2}\|w\|^2 = \mathbb{E}_{x\sim\hat{P}}\left[\ell(w^\top x; y)\right] + \frac{\mu}{2}\|w\|^2,  (1)

where \ell(\cdot) is convex and Lipschitz and the expectation is with respect to the empirical distribution \hat{P} corresponding to a given data set with N total data points {(x_i, y_i)}. We will assume x_i \in R^d and y_i \in R. This regularized empirical risk minimization formulation encompasses algorithms such as support vector machine classification, ridge regression, logistic regression, and others [2]. For example, x could represent d pixels in a grayscale image and y a binary label indicating whether the image is of a face: w^\top x gives a confidence value about whether the image is of a face or not. We would like to solve such problems using a network of processors connected via a network (represented by a graph indicating which nodes can communicate with each other). The system would distribute these N points across the m nodes, inducing local objective functions J_j(w) approximating (1).

In such a computational model, nodes can perform local computations and send messages to each other to jointly minimize (1). The strategy we analyze is what is referred to as distributed primal averaging [3]: each node in the network processes points sequentially, performing a SGD update locally and averaging the current iterate values of its neighbors after each gradient step. This can also be thought of as a distributed consensus-based version of Pegasos [4] when the loss function is the hinge loss. We consider a general topology with m nodes attempting to minimize a global objective function J(w) that decomposes into a sum of local objectives: J(w) = \sum_{i=1}^{m} J_i(w). This is a model for optimization in systems such as data centers, distributed control systems, and sensor networks.

Main Results. Our goal in this paper is to characterize how the spectral norm \rho = \sigma_1(\mathbb{E}_{\hat{P}}[xx^\top]) of the sample covariance affects the rate of convergence of stochastic consensus schemes under different communication requirements. Elucidating this dependence can help guide empirical practice by providing insight into when these methods will work well. We prove an upper bound on the suboptimality gap for distributed primal averaging that depends on \rho as well as the mixing time of the weight matrix associated to the algorithm. Our result shows that networks of size m < 1/\rho gain from parallelization. To understand the communication-limited regime, we extend our analysis to intermittent communication. In a setting with finite data and sparse connectivity, convergence will deteriorate with increasing m because we split the data across more machines that are farther apart. We also show that by using a mini-batching strategy we can offset the penalty of infrequent communication by communicating after a mini-batch (sub)gradient step. Finally, in an asymptotic regime with infinite data at every node we show, using results of Bianchi et al. [5], that for twice-differentiable loss functions this network effect disappears and that we gain from additional parallelization.
Related Work. Several authors have proposed distributed algorithms involving nodes computing local gradient steps and averaging iterates, gradients, or other functions of their neighbors [3, 6, 7]. By alternating local updates and consensus with neighbors, estimates at the nodes converge to the optimizer of J(\cdot). In these works no assumption is made on the local objective functions, which can be arbitrary. Consequently the convergence guarantees do not reflect the setting where the data is homogeneous (e.g. when the data has the same distribution); specifically, the error increases as we add more machines. This is counterintuitive, especially in the large-scale regime, since it suggests that despite homogeneity these methods perform worse than the centralized setting (all data on one node). We provide a first data-dependent analysis of a consensus-based stochastic gradient method in the homogeneous setting and demonstrate that there exist regimes where we benefit from having more machines in any network.

In contrast to our stochastic-gradient-based results, data dependence via the Hessian of the objective has also been demonstrated in the parallel coordinate descent approach of Liu et al. [8] and the Shotgun algorithm of Bradley et al. [9]. Their assumptions differ from ours in that the objective function is assumed to be smooth [8] or l_1-regularized [9]. Most importantly, our results hold for arbitrary networks of compute nodes, while the coordinate-descent-based results hold only for networks where all nodes communicate with a central aggregator (sometimes referred to as a master-slave architecture, or a star network), which can be used to model shared-memory systems. Another interesting line of work concerns the impact of delay on convergence in distributed optimization [10]. These results show that delays in the gradient computation for a star network are asymptotically negligible when optimizing smooth loss functions. We study general network topologies, but with intermittent, rather than delayed, communication.
Our results suggest that certain datasets are more tolerant of skipped communication rounds, based on the spectral norm of their covariance. We take an approach similar to that of Takáč et al. [11], who developed a spectral-norm-based analysis of mini-batching for non-smooth functions. We decompose the iterate in terms of the data points encountered in the sample path [12]. This differs from analyses based on smoothness considerations alone [10, 14] and gives practical insight into how communication (full or intermittent) impacts the performance of these algorithms. Note that our work is fundamentally different in that these other works either assume a centralized setting [14] or implicitly assume a specific network topology (e.g. [15] uses a star topology). For the main results we only assume strong convexity, while the existing guarantees for the cited methods depend on a variety of regularity and smoothness conditions.

Limitation. In the stochastic convex optimization setting (see e.g. [16]) the quantity of interest is the population objective corresponding to problem (1). When minimizing this population objective our results suggest that adding more machines worsens convergence (see Theorem 1). For finite data our convergence

results satisfy the intuition that adding more nodes in an arbitrary network will hurt convergence. The finite homogeneous setting is most relevant in settings such as data centers, where the processors hold data which essentially looks the same. In the infinite or large-scale data setting, common in machine learning applications, this is counterintuitive: when each node has infinite data, any distributed scheme, including one on an arbitrary network, shouldn't perform worse than the centralized scheme (all data on one node). Thus our analysis is limited in that it doesn't unify the stochastic optimization and consensus settings in a completely satisfactory manner. To partially remedy this we explore consensus SGD for smooth strongly convex objectives in the asymptotic regime and show that one can gain from adding more machines in any network.

In this paper we focus on a simple and well-studied protocol [3]. However, our analysis approach and insights may yield data-dependent bounds for other more complex algorithms such as distributed dual averaging [6]. More sophisticated gradient averaging schemes such as that of Mokhtari and Ribeiro [17] can exploit dependence across iterations [18, 19] to improve the convergence rate; analyzing the impact of the data distribution is considerably more complex in these algorithms. We believe that our results provide a first step towards understanding data-dependent bounds for distributed stochastic optimization in settings common to machine learning. Our analysis coincides with phenomena seen in practice: for data sets with small \rho, distributing the computation across many machines is beneficial, but for data with larger \rho more machines is not necessarily better. Our work suggests that taking the data dependence into account can improve the empirical performance of these methods.

2 Model

We will use boldface for vectors. Let [k] = {1, 2, ..., k}. Unless otherwise specified, the norm \|\cdot\| is the standard Euclidean norm.
The spectral norm of a matrix A is defined to be the largest singular value \sigma_1(A) of A, or equivalently the square root of the largest eigenvalue of A^\top A. For a graph G = (V, E) with vertex set V and edge set E, we will denote the neighbors of a vertex i \in V by N(i) \subseteq V.

Data model. Let P be a distribution on R^{d+1} such that for (x, y) \sim P we have \|x\| \le 1 almost surely. Let S = {x_1, x_2, ..., x_N} be an i.i.d. sample of d-dimensional vectors from P and let \hat{P} be the empirical distribution of S. Let \hat{\Sigma} = \mathbb{E}_{x\sim\hat{P}}[xx^\top] be the sample second-moment matrix of S. Our goal is to express the performance of our algorithms in terms of \rho = \sigma_1(\hat{\Sigma}), the spectral norm of \hat{\Sigma}. The spectral norm \rho can vary significantly across different data sets. For example, for sparse data sets \rho is often small. This can also happen if the data lie in a low-dimensional subspace (smaller than the ambient dimension d).

Problem. Our problem is to minimize a particular instance of (1) where the expectation is over a finite collection of data points:

w^* := \operatorname{argmin}_{w} J(w).  (2)

Let \hat{w}_j(t) be the estimate of w^* at node j \in [m] in the t-th iteration. We bound the expected gap (over the data distribution) at iteration T between J(w^*) and the value J(\hat{w}_j(T)) of the global objective at the output \hat{w}_j(T) of each node j in our distributed network. We will denote the subgradient set of J(w) by \partial J(w) and a subgradient of J(w) by \nabla J(w) \in \partial J(w). In our analysis we will make the following assumptions about the individual functions \ell(w^\top x): (a) the loss functions {\ell(\cdot)} are convex, and (b) the loss functions {\ell(\cdot; y)} are L-Lipschitz for some L > 0 and all y. Note that J(w) is \mu-strongly convex due to the l_2-regularization. Our analysis will not depend on the response y except through the Lipschitz bound L, so we will omit the explicit dependence on y to simplify notation.

Network model. We consider a model in which the minimization in (2) must be carried out by m nodes.
These nodes are arranged in a network whose topology is given by a graph G: an edge (i, j) in the graph means nodes i and j can communicate. A matrix P is called graph conformant if P_{ij} > 0 only if the edge (i, j) is in the graph. We will consider algorithms which use a doubly stochastic and graph conformant sequence of matrices P(t).

Sampling model. We assume the N data points are divided evenly and uniformly at random among the m nodes, and define n := N/m to be the number of points at each node. This is a necessary assumption

since our bounds are data dependent and rely on subsampling bounds on the spectral norm of certain random submatrices. However, our data-independent bound holds for arbitrary splits. Let S_i be the subset of n points at node i. The local stochastic gradient procedure consists of each node i \in [m] sampling from S_i with replacement. This is an approximation to the local objective function

J_i(w) = \frac{1}{n}\sum_{j \in S_i} \ell(w^\top x_{i,j}) + \frac{\mu}{2}\|w\|^2.  (3)

Algorithm. In the subsequent sections we analyze the distributed version (Algorithm 1) of standard SGD. This algorithm is not new [3, 7] and has been analyzed extensively in the literature. The step size \eta_t = 1/(\mu t) is commonly used for large-scale strongly convex machine learning problems like SVMs (e.g. [4]) and ridge regression; to avoid an extra parameter in the bounds, we adopt this setting. In Algorithm 1, node i samples a point uniformly with replacement from a local pool of n points and then updates its iterate by computing a weighted sum with its neighbors followed by a local subgradient step. The selection is uniform to guarantee that the subgradient is an unbiased estimate of a true subgradient of the local objective J_i(w), and it greatly simplifies the analysis. Different choices of P(t) will allow us to understand the effect of limiting communication in this distributed optimization algorithm.

Algorithm 1: Consensus Strongly Convex Optimization
Input: {x_{i,j}}, where i \in [m] and j \in [n] with N = mn; matrix sequence P(t); \mu > 0; T
{Each i \in [m] executes}
Initialize: set w_i(1) = 0 \in R^d.
for t = 1 to T do
  Sample x_{i,t} uniformly with replacement from S_i.
  Compute g_i(t) \in \partial\ell(w_i(t)^\top x_{i,t})\, x_{i,t} + \mu w_i(t)
  w_i(t+1) = \sum_{j=1}^{m} w_j(t) P_{ij}(t) - \eta_t g_i(t)
end for
Output: \hat{w}_i(T) = \frac{1}{T}\sum_{t=1}^{T} w_i(t) for any i \in [m].

Expectations and probabilities. There are two sources of stochasticity in our model: the first in the split of data points to the individual nodes, and the second in sampling the points during the gradient descent procedure.
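As a concrete (unofficial) illustration of Algorithm 1, the following sketch simulates consensus SGD for the l_2-regularized hinge loss on synthetic separable data. The ring topology, the constants, and all variable names are our own choices for illustration, not prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d, mu, T = 4, 200, 10, 0.1, 2000   # nodes, points per node, dim, reg, iters

# Synthetic linearly separable data, split evenly across the m nodes.
w_true = rng.standard_normal(d)
X = rng.standard_normal((m, n, d))
X /= np.linalg.norm(X, axis=2, keepdims=True)   # enforce ||x|| <= 1
Y = np.sign(X @ w_true)

# Fixed doubly stochastic matrix for a ring: average self + two neighbours.
P = np.zeros((m, m))
for i in range(m):
    P[i, [i, (i - 1) % m, (i + 1) % m]] = 1.0 / 3.0

def J(w):
    """Global regularized hinge-loss objective over all N = m*n points."""
    margins = 1.0 - Y * (X @ w)
    return np.mean(np.maximum(margins, 0.0)) + 0.5 * mu * (w @ w)

W = np.zeros((m, d))       # one iterate per node
W_sum = np.zeros((m, d))   # running sums for the averaged outputs
for t in range(1, T + 1):
    eta = 1.0 / (mu * t)
    idx = rng.integers(n, size=m)              # each node samples one local point
    x = X[np.arange(m), idx]
    y = Y[np.arange(m), idx]
    active = (y * np.sum(W * x, axis=1) < 1.0).astype(float)
    G = -active[:, None] * y[:, None] * x + mu * W   # hinge subgradient + reg
    W = P @ W - eta * G                        # consensus step, then gradient step
    W_sum += W

w_hat = W_sum[0] / T   # node 0's averaged iterate, i.e. its output after T rounds
```

Since J(0) = 1 for the hinge loss, any useful run should end with J(w_hat) well below 1.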
We assume that the split is done uniformly at random, which implies that the expected covariance matrix at each node is the same as the population covariance matrix \hat{\Sigma}. Conditioned on the split, we assume that the sampling at each node is uniform over the data points at that node, which makes the stochastic subgradient an unbiased estimate of a subgradient of the local objective function. Let F_t be the sigma algebra generated by the random point selections of the algorithm up to time t, so that the iterates {w_i(t) : i \in [m]} are measurable with respect to F_t.

3 Convergence and Implications

Methods like Algorithm 1, also referred to as primal averaging, have been analyzed previously [3, 7, 10]. In these works it is shown that the convergence properties depend on the structure of the underlying network via the second largest eigenvalue of P. We consider in this section the case P(t) = P for all t, where P is a fixed Markov matrix. This corresponds to a synchronous setting where communication occurs at every iteration. We analyze the step size \eta_t = 1/(\mu t) in Algorithm 1 and show that the convergence depends on the spectral norm \rho = \sigma_1(\hat{\Sigma}) of the sample covariance matrix.

Theorem 1. Fix a Markov matrix P and let \rho = \sigma_1(\hat{\Sigma}) denote the spectral norm of the covariance matrix of the data distribution. Consider Algorithm 1 when the objective J(w) is strongly convex, P(t) = P for all t, and \eta_t = 1/(\mu t). Let \lambda_2(P) denote the second largest eigenvalue of P. Then if the number of samples on

each machine n satisfies

n > \frac{4 \log d}{3\rho}  (4)

and the number of iterations T satisfies

T > e \log\left(1/\sqrt{\lambda_2(P)}\right)  (5)

\frac{T}{\log T} > \max\left\{ \frac{4 \log d}{3\rho},\ \frac{8\sqrt{5}\, m^2/\sqrt{\rho}}{\log(1/\lambda_2(P))} \right\},  (6)

then the expected error for each node i satisfies

\mathbb{E}\left[J(\hat{w}_i(T)) - J(w^*)\right] \le \left( \frac{1}{m} + \frac{100\sqrt{m\rho}\,\log T}{1 - \sqrt{\lambda_2(P)}} \right) \cdot \frac{L^2}{\mu} \cdot \frac{\log T}{T}.  (7)

Remark 1: Theorem 1 indicates that the number of machines should be chosen as a function of \rho. We can identify three sub-cases of interest:

Case (a): m \le 1/\rho^{1/3}: In this regime, since 1/m > \sqrt{m\rho} (ignoring the constants and the \log T term), we always benefit from adding more machines.

Case (b): 1/\rho^{1/3} < m \le 1/\rho: The result tells us that there is no degradation in the error, since \sqrt{m\rho} \le 1. Sparse data sets generally have a smaller value of \rho (as seen in Takáč et al. [11]); Theorem 1 suggests that for such data sets we can use a larger number of machines without losing performance. However, the requirement on the number of iterations also increases. This provides additional perspective on the observation by Takáč et al. [11] that sparse datasets are more amenable to parallelization via mini-batching. The same holds for our type of parallelization as well.

Case (c): m > 1/\rho: In this case we pay a penalty \sqrt{m\rho} > 1, suggesting that for datasets with large \rho we should expect to lose performance even with relatively few machines.

Note that m > 1 is implicit in the condition (5), since \lambda_2 = 0 for m = 1. This excludes the single-node Pegasos [4] case. Additionally, in the case of general strongly convex losses (not necessarily dependent on w^\top x) we can obtain a convergence rate of O(\log^2(T)/T). We do not provide the proof here.

4 Stochastic Communication

In this section we generalize our analysis in Theorem 1 to handle time-varying and stochastic communication matrices P(t). In particular, we study the case where the matrices are chosen i.i.d. over time. Any strategy that doesn't involve communicating at every step will incur a larger gap between the local node estimates and their average. We call this the network error.
Our goal is to show how knowing \rho can help us balance the network error and the optimality gap. First we bound the network error for the case of stochastic time-varying communication matrices P(t); then a simple extension leads to a generalized version of Theorem 1.

Lemma 2. Let {P(t)} be an i.i.d. sequence of doubly stochastic Markov matrices and consider Algorithm 1 when the objective J(w) is strongly convex. We have the following inequality for the expected squared error between the iterate w_i(t) at node i at time t and the average \bar{w}(t) defined in Algorithm 1:

\mathbb{E}\left[\|\bar{w}(t) - w_i(t)\|^2\right] \le \frac{L^2}{\mu^2} \cdot \frac{b^2 \log^2(b e t)}{t^2},  (8)

where b = 1/\log\left(1/\lambda_2(\mathbb{E}[P(t)])\right).

Armed with Lemma 2 we prove the following theorem for Algorithm 1 in the case of stochastic communication.

Theorem 3. Let {P(t)} be an i.i.d. sequence of doubly stochastic matrices and let \rho = \sigma_1(\hat{\Sigma}) denote the spectral norm of the sample covariance matrix. Consider Algorithm 1 when the objective J(w) is strongly convex and \eta_t = 1/(\mu t). Then if the number of samples on each machine n satisfies

n > \frac{4 \log d}{3\rho}  (9)

and the number of iterations T satisfies

T > e \log\left(1/\sqrt{\lambda_2(\mathbb{E}[P(t)])}\right)  (10)

and

\frac{T}{\log T} > \max\left\{ \frac{4 \log d}{3\rho},\ \frac{8\sqrt{5}\, m^2/\sqrt{\rho}}{\log(1/\lambda_2(\mathbb{E}[P(t)]))} \right\},  (11)

then the expected error for the output of each node i satisfies

\mathbb{E}\left[J(\hat{w}_i(T)) - J(w^*)\right] \le \left( \frac{1}{m} + \frac{100\sqrt{m\rho}\,\log T}{1 - \sqrt{\lambda_2(\mathbb{E}[P(t)])}} \right) \cdot \frac{L^2}{\mu} \cdot \frac{\log T}{T}.  (12)

Remark: This result generalizes the conclusions of Theorem 1 to the case of stochastic communication schemes, allowing the data-dependent interpretation of convergence in a more general setting.

5 Limiting Communication

As an application of the stochastic communication scenario of Theorem 3, we now analyze the effect of reducing the communication overhead of Algorithm 1. This reduction can improve the overall running time ("wall time") of the algorithm, because communication latency can hinder the convergence of many algorithms in practice [21]. A natural way of limiting communication is to communicate only in a fraction \nu of the T total iterations; at other times nodes simply perform local gradient steps. We consider a sequence of i.i.d. random matrices {P(t)} for Algorithm 1 where P(t) \in {I, P} with probabilities 1 - \nu and \nu, respectively, where I is the identity matrix (implying no communication, since P_{ij}(t) = 0 for i \ne j) and, as in the previous section, P is a fixed doubly stochastic matrix respecting the graph constraints. For this model the expected number of times communication takes place is simply \nu T. Note that we now have an additional randomization due to the Bernoulli distribution over the doubly stochastic matrices. Analyzing a matrix P(t) that depends on the current value of the iterates is considerably more complicated. A straightforward application of Theorem 3 reveals that the optimization error is proportional to 1/\nu and decays as O\left(\frac{\log^2 T}{\nu T}\right).
However, this ignores the effect of the local communication-free iterations.

A mini-batch approach. To account for local communication-free iterations we modify the intermittent communication scheme to follow a deterministic schedule of communication every 1/\nu steps. However, instead of taking single gradient steps between communication rounds, each node gathers the (sub)gradients and then takes an aggregate gradient step. That is, after the t-th round of communication, the node samples a batch I_t of indices with replacement from its local data set, with |I_t| = 1/\nu. We can think of this as the base algorithm with a better gradient estimate at each step. The update rule is now

w_i(t+1) = \sum_{j \in N_i} w_j(t) P_{ij}(t) - \eta_t\, \nu \sum_{i' \in I_t} g_{i'}(t).  (13)

We define g_i^{1/\nu}(t) = \sum_{i' \in I_t} g_{i'}(t). Now the iteration count is over the communication steps, and g_i^{1/\nu}(t) is the aggregated mini-batch (sub)gradient of size 1/\nu. Note that this is analogous to the random scheme above, but the analysis is more tractable.
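The mini-batch update (13) can be sketched as follows: between communication rounds each node averages b = 1/\nu hinge subgradients over a sampled batch before mixing with its neighbours. The data, topology, constants, and names below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, d, mu = 4, 100, 5, 0.1
b = 8   # mini-batch size, i.e. one aggregate step per 1/nu local samples

X = rng.standard_normal((m, n, d))
X /= np.linalg.norm(X, axis=2, keepdims=True)
Y = np.sign(X @ rng.standard_normal(d))

# Ring topology: average self + two neighbours (doubly stochastic).
P = np.zeros((m, m))
for i in range(m):
    P[i, [i, (i - 1) % m, (i + 1) % m]] = 1.0 / 3.0

def minibatch_subgradient(w, Xi, Yi, batch_idx):
    # Average hinge subgradient over the sampled batch, plus the regularizer.
    xb, yb = Xi[batch_idx], Yi[batch_idx]
    active = (yb * (xb @ w) < 1.0).astype(float)
    return -(active[:, None] * yb[:, None] * xb).mean(axis=0) + mu * w

def consensus_minibatch_step(W, t):
    # One round of rule (13): mix with neighbours, then take one aggregate
    # (sub)gradient step built from b local samples drawn with replacement.
    eta = 1.0 / (mu * t)
    G = np.stack([
        minibatch_subgradient(W[i], X[i], Y[i], rng.integers(n, size=b))
        for i in range(m)
    ])
    return P @ W - eta * G

W = np.zeros((m, d))
for t in range(1, 201):
    W = consensus_minibatch_step(W, t)
```

With frequent mixing and a decaying step size, the node iterates should stay close to their network average.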

Theorem 4. Fix a Markov matrix P and let \rho = \sigma_1(\hat{\Sigma}) denote the spectral norm of the covariance matrix of the data distribution. Consider Algorithm 1 when the objective J(w) is strongly convex, P(t) = P for all t, and \eta_t = 1/(\mu t), under scheme (13). Let \lambda_2(P) denote the second largest eigenvalue of P. Then if the number of samples on each machine n satisfies

n > \frac{4 \log d}{3\rho}  (14)

and

T > \frac{e}{\nu} \log\left(1/\sqrt{\lambda_2(P)}\right), \quad \frac{T}{\log(\nu T)} > \max\left\{ \frac{4 \log d}{3\nu\rho},\ \frac{8\sqrt{5}\, m^2/\sqrt{\rho}}{\log(1/\lambda_2(P))} \right\}, \quad \frac{1}{\nu} > \frac{4 \log d}{3\rho},  (15)

then the expected error for each node i satisfies

\mathbb{E}\left[J(\hat{w}_i(T)) - J(w^*)\right] \le \left( \frac{1}{m} + \frac{100\sqrt{m}\,\rho\,\log(\nu T)}{1 - \sqrt{\lambda_2(P)}} \right) \cdot \frac{L^2}{\mu} \cdot \frac{\log(\nu T)}{T},  (16)

where \nu is the frequency of communication and \lambda_2 = \lambda_2(P).

Remark: Theorem 4 suggests that if the inverse frequency of communication 1/\nu is large enough, then we can obtain a sharper bound on the error, by a factor of \sqrt{\rho}. This is significantly better than the O\left(\sqrt{m\rho}\,\frac{\log^2(\nu T)}{\nu T}\right) baseline guarantee from a direct application of Theorem 1 when the number of iterations is \nu T. Additionally, the result suggests that if we communicate on a mini-batch (where the batch size is b = 1/\nu) that is large enough, we can improve on Theorem 1; specifically, we now get a 1/m improvement when m^{4/3} \le 1/\rho.

6 Asymptotic Regime

In this section we explore the suboptimality of distributed primal averaging as T \to \infty for smooth strongly convex objectives. The results of Section 3 suggest that we never gain from adding more machines in any network. Now we investigate the behaviour of Algorithm 1 in the asymptotic regime and show that the network effect disappears and we do indeed gain from more machines in any network. Our analysis depends on the asymptotic normality of a variation of Algorithm 1 [5, Theorem 5]. The main difference between Algorithm 1 and the consensus algorithm of Bianchi et al. [5] is that we average the iterates before making the local update.
We make the following assumptions for the analysis in this section: (1) the loss functions {\ell(\cdot)} are differentiable with G-Lipschitz gradients for some G > 0; (2) the stochastic gradients are of the form g_i(t) = \nabla J(w_i(t)) + \xi_t, where \mathbb{E}[\xi_t] = 0 and \mathbb{E}[\xi_t \xi_t^\top] = C; and (3) there exists p > 0 such that \mathbb{E}[\|\xi_t\|^{2+p}] < \infty. Our results hold for all smooth strongly convex objectives, not necessarily dependent on w^\top x.

Lemma 5. Fix a Markov matrix P. Consider Algorithm 1 when the objective J(w) is strongly convex and twice differentiable, P(t) = P for all t, and \eta_t = 1/(\lambda t). Then the expected error for each node i satisfies, for

an arbitrary split of N samples into m nodes,

\limsup_{T\to\infty} T \cdot \mathbb{E}\left[ J\left( \sum_{j \in N(i)} P_{ij} w_j(T) \right) - J(w^*) \right] \le \sum_{j=1}^{m} (P_{ij})^2 \cdot \frac{\mathrm{Tr}(H)\, G^2}{2\mu},  (17)

where H is the solution to the equation

\nabla^2 J(w^*) H + H\, \nabla^2 J(w^*)^\top = C.  (18)

Remark: This result shows that asymptotically the network effect from Theorem 3 disappears and that additional nodes can speed convergence. An application of Lemma 5 to problem (1) gives us the following result for the specialized case of a k-regular graph with constant weight matrix P.

Theorem 6. Consider Algorithm 1 when the objective J(w) has the form (1), P(t) = P corresponds to a k-regular graph with uniform weights for all t, and \eta_t = 1/(\lambda t). Then the expected error for each node i satisfies

\limsup_{T\to\infty} T \cdot \mathbb{E}\left[ J\left( \sum_{j=1}^{m} P_{ij} w_j(T) \right) - J(w^*) \right] \le \frac{25\rho L^2}{mk} \cdot \frac{\mathrm{Tr}\left( \nabla^2 J(w^*)^{-1} \right) G^2}{2\mu},

where the expectation is with respect to the history of the sampled gradients as well as the uniform random splits of N data points across the m machines.

Remark: For objective (1) we obtain a 1/(mk) variance reduction and the network effect disappears.

Table 1: Data sets and parameters for experiments

data set  | training | test   | dim.   | \rho
RCV1      | 781,265  | 23,149 | 47,236 | 0.01
Covertype | 522,911  | 58,101 | 54     | 0.2

7 Experiments

Our goals in our experimental evaluation are to validate the theoretical dependence of the convergence rate on \rho and to see if the conclusions hold when the assumptions we make in the analysis are violated. Note that all our experiments are based on simulations on a multicore computer.

7.1 Data sets, tasks, and parameter settings

The data sets used in our experiments are summarized in Table 1. Covertype is the forest covertype dataset [22] used in [4], obtained from the UC Irvine Machine Learning Repository [23], and rcv1 is from the Reuters collection [13], obtained from the libsvm collection [24]. The RCV1 data set has a small value of \hat{\rho}, whereas Covertype has a larger value. In all the experiments we looked at l_2-regularized classification objectives for problem (1). Each plot is averaged over 5 runs. The data consists of pairs {(x_1, y_1), ..., (x_N, y_N)} where x_i \in R^d and y_i \in {-1, +1}.
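The quantity \rho = \sigma_1(\hat{\Sigma}) that drives these experiments is straightforward to estimate for a given data matrix. The sketch below (our own, on synthetic data) contrasts nearly isotropic normalized data, where \rho is close to 1/d, with strongly correlated data, where \rho approaches 1; for rows with \|x_i\| \le 1, \rho never exceeds 1.

```python
import numpy as np

def spectral_norm_second_moment(X):
    # rho = sigma_1((1/N) X^T X): squared top singular value of X, divided by N.
    return np.linalg.norm(X, ord=2) ** 2 / X.shape[0]

rng = np.random.default_rng(0)
N, d = 2000, 50

# Isotropic data on the unit sphere: energy spread over all d directions,
# so Sigma_hat is close to I/d and rho is close to 1/d.
X_iso = rng.standard_normal((N, d))
X_iso /= np.linalg.norm(X_iso, axis=1, keepdims=True)

# Strongly correlated data: most energy along one direction, so rho is near 1.
u = np.zeros(d)
u[0] = 1.0
X_corr = 0.95 * u + 0.05 * X_iso
X_corr /= np.linalg.norm(X_corr, axis=1, keepdims=True)

rho_iso = spectral_norm_second_moment(X_iso)
rho_corr = spectral_norm_second_moment(X_corr)
```

Since each row has unit norm, \mathrm{Tr}(\hat{\Sigma}) = 1 in both cases; \rho measures how concentrated that unit of "energy" is in a single direction.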
In all experiments we optimize the l_2-regularized empirical hinge loss, where

\ell(w^\top x) = (1 - w^\top x \cdot y)_{+}.  (19)

The values of the regularization parameter \mu are chosen to be the same as those in Shalev-Shwartz et al. [4].

Figure 1: Performance of Algorithm 1 with the intermittent communication scheme on datasets with very different \rho. The algorithm works better for smaller \rho, and there is less decay in performance for RCV1 as we decrease the number of communication rounds, as opposed to Covertype (\rho = 0.01 vs \rho = 0.2).

We simulated networks of compute nodes of varying size (m) arranged in a k-regular graph with k = 0.5m or a fixed degree (not dependent on m). Note that the dependence of the convergence rate of procedures like Algorithm 1 on the properties of the underlying network has been investigated before, and we refer the reader to Agarwal and Duchi [10] for more details. In this paper we experiment only with k-regular graphs. The weights on the Markov matrix P are set by using the max-degree Markov chain (see [25]). One can also optimize for the fastest mixing Markov chain ([25], [26]). Each node is randomly assigned n = N/m points.

7.2 Intermittent Communication

In this experiment we show the objective function for RCV1 and Covertype as we change the frequency of communication (Figure 1), communicating after every 1, 10, 50, and 500 iterations. Indeed, as predicted, we see that the dataset with the larger \rho appears to be affected more by intermittent communication. This indicates that network bandwidth can be conserved for datasets with a smaller \rho.

7.3 Comparison of Different Schemes

We compare the three different schemes proposed in this paper. On a network of m = 64 machines we plot the performance of the mini-batch extension of Algorithm 1 with batch size 8 against the intermittent scheme that communicates after every 8 iterations, and also against the standard version of the algorithm. In Figure 3(a) we see that, as predicted in Theorem 4, the mini-batch scheme proposed in (13) does better than the vanilla and intermittent schemes.
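The max-degree Markov chain mentioned above has a simple closed form: put weight 1/(d_max + 1) on every edge and the leftover mass on the diagonal. A minimal sketch (our own function and variable names), for a small ring:

```python
import numpy as np

def max_degree_markov(adj):
    """Doubly stochastic weights from the max-degree Markov chain:
    P_ij = 1/(d_max + 1) on edges, remaining mass on the diagonal.
    `adj` is a symmetric 0/1 adjacency matrix with a zero diagonal."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    d_max = deg.max()
    P = adj / (d_max + 1.0)
    np.fill_diagonal(P, 1.0 - deg / (d_max + 1.0))
    return P

# 2-regular ring on 6 nodes as a small example topology.
m = 6
adj = np.zeros((m, m))
for i in range(m):
    adj[i, (i - 1) % m] = adj[i, (i + 1) % m] = 1.0

P = max_degree_markov(adj)
# Second largest eigenvalue magnitude; the spectral gap 1 - lam2 controls mixing.
lam2 = np.sort(np.abs(np.linalg.eigvalsh(P)))[-2]
```

Because the adjacency matrix is symmetric and the edge weights are uniform, the resulting P is symmetric, hence doubly stochastic by construction.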
7.4 Infinite Data

To provide some empirical evidence for Lemma 5, we generate a very large (N = 10^7) synthetic dataset from a multivariate Normal distribution and create a simple binary classification task using a random hyperplane.

Figure 2: No network effect; increasing benefit of adding more machines in the case of infinite data.

As we can see in Figure 2, for the SVM problem and a k-regular network we continue to gain as we add more machines; eventually we stabilize, but we never lose from more machines. We only show the first few thousand iterations for clarity.

7.5 Diminishing Communication

To test whether our conclusions apply when the i.i.d. assumption on the matrices P(t) does not hold, we simulate a diminishing communication regime. Such a scheme can be useful when the nodes are already close to the optimal solution and communicating their respective iterates is wasteful. Intuitively, in the beginning the nodes should communicate more frequently. To formalize this intuition we propose the following communication model:

P(t) = P with probability C t^{-p}, and P(t) = I with probability 1 - C t^{-p},  (20)

where C, p > 0. Thus the sequence of matrices is not identically distributed and the conclusions of Theorem 3 do not apply. However, in Figure 3(b) (C = 1, p = 0.5) we see that on a network of m = 8 nodes the performance of the diminishing regime is similar to the full communication case, and we can hypothesize that our results also hold for non-i.i.d. communication matrices.
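The communication model (20) is easy to simulate. The sketch below (with our own constants C = 1 and p = 0.5) counts how many of T rounds actually communicate; for p = 1/2 the expected count grows like \sum_t C t^{-1/2} \approx 2\sqrt{T}, so communication becomes increasingly rare while never stopping entirely.

```python
import numpy as np

rng = np.random.default_rng(3)
C, p, T = 1.0, 0.5, 10_000

def communicate(t):
    # Model (20): use the mixing matrix P with probability min(1, C * t**-p),
    # otherwise the identity matrix (a purely local gradient step).
    return rng.random() < min(1.0, C * t ** (-p))

comm_rounds = sum(communicate(t) for t in range(1, T + 1))
# Expected number of communication rounds is roughly 2 * sqrt(T) = 200 here.
```

In a full implementation, each iteration would apply `P @ W` when `communicate(t)` is true and leave `W` unmixed otherwise.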

Figure 3: (a) Comparison of three different schemes on Covertype for m = 64 machines: Algorithm 1 with mini-batching, the standard scheme, and the intermittent scheme with b = 1/\nu = 8. As predicted, the mini-batch scheme performs much better than the others. (b) The performance on Covertype with a full and a diminishing communication scheme is similar.

8 Discussion and Implications

In this paper we described a consensus stochastic gradient descent algorithm and analyzed its performance in terms of the spectral norm \rho of the data covariance matrix under a homogeneity assumption. In the consensus problem this setting has not been analyzed before, and existing work yields weaker results when this assumption holds. For certain strongly convex objectives we showed that the objective value gap between any node's iterate and the optimal centralized estimate decreases as O(\log^2(T)/T); crucially, the constant depends on \rho and the spectral gap of the network matrix. We showed how limiting communication can improve the total runtime and reduce network costs, by extending our analysis to a similar data-dependent bound. Moreover, we showed that in the asymptotic regime the network penalty disappears. Our analysis suggests that distribution-dependent bounds can help us understand how data properties mediate the tradeoff between computation and communication in distributed optimization. In a sense, data distributions with smaller \rho are easier to optimize over in a distributed setting. This set of distributions includes sparse data sets, an important class for applications. In the future we will extend data-dependent guarantees to serial algorithms as well as the average-at-end scheme [14, 15]. Extending our fixed batch size to random sizes can help us understand the benefit of communication-free iterations. Finally, we can also study the impact of asynchrony and more general time-varying topologies.

9 Appendix

We gather here the proof details and technical lemmas needed to establish our results.

10 Proof of Theorem 1

Theorem 1 provides a bound on the suboptimality gap for the output \hat{w}_i(T) of Algorithm 1 at node i, which is the average of that node's iterates. In the analysis we relate this local average to the average iterate across nodes at time t:

\bar{w}(t) = \frac{1}{m}\sum_{i=1}^{m} w_i(t).  (21)

We will also consider the average of \bar{w}(t) over time. The proof consists of three main steps.
1. We establish the following inequality for the objective error:

E[J(w̄(t)) − J(w*)] ≤ (1/(2η_t) − μ/2) E[‖w̄(t) − w*‖²] − (1/(2η_t)) E[‖w̄(t+1) − w*‖²]
  + (η_t/2) E[‖(1/m) Σ_{i=1}^m g_i(t)‖²]
  + (1/m) Σ_{i=1}^m E[‖w̄(t) − w_i(t)‖²]^{1/2} · E[(‖∇J_i(w_i(t))‖ + ‖∇J_i(w̄(t))‖)²]^{1/2}, (21)

where w̄(t) is the average of the iterates at all nodes and the expectation is with respect to F_t while conditioned on the sample split across nodes. All expectations, except when explicitly stated, will be conditioned on this split.

2. We bound E[‖∇J_i(w̄(t))‖²] and η_t E[‖(1/m) Σ_{i=1}^m g_i(t)‖²] in terms of the spectral norm of the covariance matrix of the distribution P by additionally taking the expectation with respect to the sample S.

3. We bound the network error E[‖w̄(t) − w_i(t)‖²] in terms of the network size and a spectral property of the matrix P.

Combining the bounds using inequality (21) and applying the definition of subgradients yields the result of Theorem 1.

10.1 Spectral Norm of Random Submatrices

In this section we establish a lemma pertaining to the spectral norm of submatrices that is central to our results. Specifically, we prove the following inequality, which follows by applying the matrix Bernstein inequality of Tropp [7].

Lemma 7. Let P be a distribution on R^d with second moment matrix Σ = E_{Z∼P}[ZZ^⊤] such that ‖Z‖ ≤ 1 almost surely. Let ζ = σ₁(Σ). Let Z₁, Z₂, ..., Z_K be an i.i.d. sample from P and let

Q_K = Σ_{k=1}^K Z_k Z_k^⊤

be the empirical second moment matrix of the data. Then for K > (4/(3ζ)) log d,

E[σ₁(Q_K)/K] ≤ 5ζ. (23)

Thus when P is the empirical distribution we get that E[σ₁(Q_K)/K] ≤ 5ζ.

Remark: We can replace the ambient dimension d in the requirement on K by an intrinsic dimensionality term, but this requires a lower bound on the norm of any data point in the sample.

Proof. Let Z be the d × K matrix whose columns are {Z_k}. Define X_k = Z_k Z_k^⊤ − Σ. Then E[X_k] = 0 and

λ_max(X_k) = λ_max(Z_k Z_k^⊤ − Σ) ≤ ‖Z_k‖² ≤ 1,

because Σ is positive semidefinite and ‖Z_k‖ ≤ 1 almost surely. Furthermore,

σ₁(Σ_{k=1}^K E[X_k²]) = K σ₁(E[‖Z_k‖² Z_k Z_k^⊤] − Σ²) ≤ K σ₁(E[Z_k Z_k^⊤]) + K σ₁(Σ)² ≤ K(ζ + ζ²) ≤ 2Kζ,

since ζ ≤ 1. Applying the matrix Bernstein inequality of Tropp [7, Theorem 6.1]:

P(σ₁(Σ_{k=1}^K X_k) ≥ r) ≤ d exp(−3r²/(16Kζ))  for r ≤ 2Kζ,
P(σ₁(Σ_{k=1}^K X_k) ≥ r) ≤ d exp(−3r/8)  for r ≥ 2Kζ. (24)
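A quick numerical sanity check of Lemma 7 (the distribution and constants below are illustrative assumptions, not part of the paper): draw K i.i.d. points with ‖Z‖ ≤ 1 and compare σ₁(Q_K)/K against 5ζ.

```python
import numpy as np

# Empirical check of Lemma 7 (illustrative): for K i.i.d. points with
# ||Z|| <= 1, the spectral norm of the empirical second moment matrix
# Q_K / K should concentrate near zeta = sigma_1(Sigma), well below 5*zeta.
rng = np.random.default_rng(1)
d, K, trials = 20, 2000, 20

# Anisotropic distribution, rescaled so ||Z|| <= 1 almost surely.
scales = np.linspace(1.0, 0.05, d)
def sample(k):
    Z = rng.standard_normal((k, d)) * scales
    return Z / np.maximum(1.0, np.linalg.norm(Z, axis=1, keepdims=True))

# Estimate zeta from a large sample, then check the lemma's bound.
Zbig = sample(200_000)
zeta = np.linalg.norm(Zbig.T @ Zbig / len(Zbig), 2)

vals = []
for _ in range(trials):
    Z = sample(K)
    vals.append(np.linalg.norm(Z.T @ Z / K, 2))  # sigma_1(Q_K) / K
print(np.mean(vals), 5 * zeta)
```

Here K easily exceeds (4/(3ζ)) log d, and the empirical spectral norm sits close to ζ itself, comfortably inside the lemma's 5ζ guarantee.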

Now, note that

σ₁(Σ_{k=1}^K X_k) = σ₁(Σ_{k=1}^K Z_k Z_k^⊤ − KΣ),

so σ₁(Σ_{k=1}^K X_k) ≥ r is implied by

σ₁(Σ_{k=1}^K Z_k Z_k^⊤)/K − σ₁(Σ) ≥ r/K.

Therefore

P(σ₁(Q_K)/K − ζ ≥ r) ≤ d exp(−3Kr²/(16ζ))  for r ≤ 2ζ,
P(σ₁(Q_K)/K − ζ ≥ r) ≤ d exp(−3Kr/8)  for r ≥ 2ζ. (25)

Integrating (25) yields, for K > (4/(3ζ)) log d,

E[σ₁(Q_K)/K] = ∫₀^∞ P(σ₁(Q_K)/K ≥ x) dx
 ≤ 3ζ + ∫_{3ζ}^∞ P(σ₁(Q_K)/K − ζ ≥ x − ζ) dx
 ≤ 3ζ + ∫_{2ζ}^∞ d exp(−(3/8)Kr) dr
 = 3ζ + (8d/(3K)) exp(−(3/4)ζK)
 ≤ 3ζ + 2ζ = 5ζ,

where the last step uses K > (4/(3ζ)) log d. ∎

10.2 Decomposing the expected suboptimality gap

The proof in part follows [3]. It is easy to verify that, because P is doubly stochastic, the average w̄(t) of the iterates across the nodes satisfies the following update rule:

w̄(t+1) = w̄(t) − (η_t/m) Σ_{i=1}^m g_i(t). (26)

We emphasize that in Algorithm 1 we do not perform a final averaging across nodes at the end as in (2). Rather, we analyze the average at a single node across its iterates (sometimes called Polyak averaging). Analyzing (26) provides us with a way to understand how the objective J(w_i(t)) evaluated at any node i's iterate w_i(t) compares to the minimum value J(w*). The details can be found in Section 10.7.

To simplify notation, we treat all expectations as conditioned on the sample S. Then from (26),

E[‖w̄(t+1) − w*‖² | F_t] = ‖w̄(t) − w*‖² + η_t² E[‖(1/m) Σ_{i=1}^m g_i(t)‖² | F_t]
  − 2η_t E[(w̄(t) − w*)^⊤ (1/m) Σ_{i=1}^m g_i(t) | F_t]. (27)

Note that ∇J_i(w_i(t)) = E[g_i(t) | F_t], so for the last term, for each i we have

∇J_i(w_i(t))^⊤ (w̄(t) − w*)
 = ∇J_i(w_i(t))^⊤ (w̄(t) − w_i(t)) + ∇J_i(w_i(t))^⊤ (w_i(t) − w*)
 ≥ −‖∇J_i(w_i(t))‖ ‖w̄(t) − w_i(t)‖ + ∇J_i(w_i(t))^⊤ (w_i(t) − w*)
 ≥ −‖∇J_i(w_i(t))‖ ‖w̄(t) − w_i(t)‖ + J_i(w_i(t)) − J_i(w*) + (μ/2) ‖w_i(t) − w*‖²
 = −‖∇J_i(w_i(t))‖ ‖w̄(t) − w_i(t)‖ + J_i(w_i(t)) − J_i(w̄(t)) + (μ/2) ‖w_i(t) − w*‖² + J_i(w̄(t)) − J_i(w*)
 ≥ −‖∇J_i(w_i(t))‖ ‖w̄(t) − w_i(t)‖ + ∇J_i(w̄(t))^⊤ (w_i(t) − w̄(t)) + (μ/2) ‖w_i(t) − w*‖² + J_i(w̄(t)) − J_i(w*)
 ≥ −(‖∇J_i(w_i(t))‖ + ‖∇J_i(w̄(t))‖) ‖w̄(t) − w_i(t)‖ + (μ/2) ‖w_i(t) − w*‖² + J_i(w̄(t)) − J_i(w*), (28)

where the second and third lines come from applying the Cauchy-Schwarz inequality and strong convexity, the fifth line comes from the definition of the subgradient, and the last line is another application of the Cauchy-Schwarz inequality. Averaging over all the nodes, using convexity, the definition of J(·), and Jensen's inequality yields

the following inequality:

−(2η_t/m) Σ_{i=1}^m E[(w̄(t) − w*)^⊤ g_i(t) | F_t]
 ≤ (2η_t/m) Σ_{i=1}^m ‖w̄(t) − w_i(t)‖ (‖∇J_i(w_i(t))‖ + ‖∇J_i(w̄(t))‖)
  − 2η_t (J(w̄(t)) − J(w*)) − μη_t ‖w̄(t) − w*‖². (29)

Substituting inequality (29) into the recursion (27),

E[‖w̄(t+1) − w*‖² | F_t] ≤ (1 − μη_t) ‖w̄(t) − w*‖² + η_t² E[‖(1/m) Σ_{i=1}^m g_i(t)‖² | F_t]
  + (2η_t/m) Σ_{i=1}^m ‖w̄(t) − w_i(t)‖ (‖∇J_i(w_i(t))‖ + ‖∇J_i(w̄(t))‖)
  − 2η_t (J(w̄(t)) − J(w*)). (30)

Taking expectations with respect to the entire history F_t, and bounding the middle term via the Cauchy-Schwarz inequality,

E[‖w̄(t+1) − w*‖²] ≤ (1 − μη_t) E[‖w̄(t) − w*‖²] + η_t² E[‖(1/m) Σ_{i=1}^m g_i(t)‖²]
  + (2η_t/m) Σ_{i=1}^m E[‖w̄(t) − w_i(t)‖²]^{1/2} E[(‖∇J_i(w_i(t))‖ + ‖∇J_i(w̄(t))‖)²]^{1/2}
  − 2η_t E[J(w̄(t)) − J(w*)]. (31)

This lets us bound the expected suboptimality gap E[J(w̄(t)) − J(w*)] via three terms:

T1 = (1/(2η_t) − μ/2) E[‖w̄(t) − w*‖²] − (1/(2η_t)) E[‖w̄(t+1) − w*‖²] (32)
T2 = (η_t/2) E[‖(1/m) Σ_{i=1}^m g_i(t)‖²] (33)
T3 = (1/m) Σ_{i=1}^m E[‖w̄(t) − w_i(t)‖²]^{1/2} E[(‖∇J_i(w_i(t))‖ + ‖∇J_i(w̄(t))‖)²]^{1/2}, (34)

where

E[J(w̄(t)) − J(w*)] ≤ T1 + T2 + T3. (35)

The remainder of the proof bounds these three terms separately.

10.3 Network Error Bound

We first need an intermediate bound in order to handle the term T3.

Lemma 8. Fix a Markov matrix P and consider Algorithm 1 when the objective J(w) is strongly convex. We have the following inequality for the expected squared error between the iterate w_i(t) at node i at time t and the average w̄(t) defined in Algorithm 1:

E[‖w̄(t) − w_i(t)‖²]^{1/2} ≤ (2L√m/μ) · log(bet²)/(bt), (36)

where b = (1/2) log(1/λ₂(P)).

Proof. We follow a similar analysis as others [3, Prop. 3], [6, IV.A], [20]. Let W(t) be the m × d matrix whose i-th row is w_i(t) and G(t) be the m × d matrix whose i-th row is g_i(t). Then the iteration can be compactly written as

W(t+1) = P(t) W(t) − η_t G(t),

and the network average matrix is W̄(t) = (1/m) 1 1^⊤ W(t). We can then write the difference using the fact that

P(t) = P for all t:

W(t+1) − W̄(t+1) = (I − (1/m) 1 1^⊤)(P W(t) − η_t G(t))
 = (P − (1/m) 1 1^⊤) W(t) − η_t (I − (1/m) 1 1^⊤) G(t)
 = (P − (1/m) 1 1^⊤)(P W(t−1) − η_{t−1} G(t−1)) − η_t (I − (1/m) 1 1^⊤) G(t)
 = (P² − (1/m) 1 1^⊤) W(t−1) − η_{t−1} (P − (1/m) 1 1^⊤) G(t−1) − η_t (I − (1/m) 1 1^⊤) G(t). (37)

Continuing the expansion and using the fact that W(1) = 0,

W(t+1) − W̄(t+1) = (P^t − (1/m) 1 1^⊤) W(1) − Σ_{s=1}^{t−1} η_s (P^{t−s} − (1/m) 1 1^⊤) G(s) − η_t (I − (1/m) 1 1^⊤) G(t)
 = −Σ_{s=1}^{t−1} η_s (P^{t−s} − (1/m) 1 1^⊤) G(s) − η_t (I − (1/m) 1 1^⊤) G(t). (38)

Now looking at the norm of the i-th row of (38) and using the bound on the gradient norm:

‖w̄(t) − w_i(t)‖ ≤ Σ_{s=1}^{t−1} η_s Σ_{j=1}^m |(P^{t−s})_{ij} − 1/m| ‖g_j(s)‖ + η_t (1/m) Σ_{j=1}^m ‖g_j(t) − g_i(t)‖ (39)
 ≤ Σ_{s=1}^{t−1} (L/(μs)) ‖(P^{t−s})_i − (1/m) 1‖₁ + 2L/(μt). (40)

We handle the term ‖(P^{t−s})_i − (1/m) 1‖₁ using a bound on the mixing rate of Markov chains (c.f. (74) in Tsianos and Rabbat [20]):

Σ_{s=1}^{t−1} (L/(μs)) ‖(P^{t−s})_i − (1/m) 1‖₁ ≤ (L√m/μ) Σ_{s=1}^{t−1} λ₂(P)^{t−s} / s. (41)

Define a = √(λ₂(P)) and b = −log(a) > 0, so that λ₂(P)^{t−τ} ≤ a^{t−τ+1} for τ ≤ t−1. Then we have the following identities:

Σ_{τ=1}^{t−1} a^{t−τ+1}/τ = Σ_{τ=1}^{t−1} a^τ/(t−τ+1) = Σ_{τ=1}^{t−1} exp(−bτ)/(t−τ+1). (42)

Now using the fact that for x > 0 we have exp(−x) < 1/(1+x), and using an integral upper bound, we get

Σ_{τ=1}^{t−1} exp(−bτ)/(t−τ+1) ≤ Σ_{τ=1}^{t−1} 1/((1+bτ)(t−τ+1))
 ≤ 2/((1+b)t) + ∫₁^{t−1} dτ/((1+bτ)(t−τ+1))
 = 2/((1+b)t) + [log(bτ+1) − log(t−τ+1)]₁^{t−1}/(bt+b+1)
 ≤ log(et(bt+1))/(bt)
 ≤ log(bet²)/(bt). (43)

Using (42) and (43) in (40) we get

‖w̄(t) − w_i(t)‖ ≤ (L√m/μ) · log(bet²)/(bt) + 2L/(μt) ≤ (2L√m/μ) · log(bet²)/(bt), (44)

and therefore

E[‖w̄(t) − w_i(t)‖²]^{1/2} ≤ (2L√m/μ) · log(bet²)/(bt). (45) ∎
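The log(bet²)/(bt) disagreement decay of Lemma 8 can be observed numerically; the mixing matrix and the bounded "gradients" below are hypothetical stand-ins for the quantities in the lemma.

```python
import numpy as np

# Numerical look at Lemma 8 (illustrative setup): run the consensus
# recursion with bounded gradients and eta_t = 1/(mu * t), and watch the
# disagreement max_i ||wbar(t) - w_i(t)|| decay roughly like log(t)/t.
rng = np.random.default_rng(2)
m, d, mu, T = 8, 3, 0.5, 5000

# Lazy ring: lambda_2(P) < 1, so b = 0.5 * log(1/lambda_2(P)) > 0.
P = np.zeros((m, m))
for i in range(m):
    P[i, i], P[i, (i - 1) % m], P[i, (i + 1) % m] = 0.5, 0.25, 0.25

W = rng.standard_normal((m, d))
gaps = {}
for t in range(1, T + 1):
    G = np.clip(rng.standard_normal((m, d)) + W, -1, 1)  # bounded "gradients"
    W = P @ W - G / (mu * t)                             # mix, then step
    if t in (500, 5000):
        gaps[t] = np.max(np.linalg.norm(W - W.mean(axis=0), axis=1))
print(gaps)
```

The disagreement at t = 5000 is roughly an order of magnitude below its value at t = 500, consistent with the nearly 1/t decay the lemma predicts once the step size shrinks.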

10.4 Bounds for expected gradient norms

10.4.1 Bounding E[‖∇J_i(w̄(t))‖²]

Let β_{j,t} ∈ ∂l(w̄(t)^⊤ x_{i,j}) denote a subgradient for the j-th point at node i and β_t = (β_{1,t}, β_{2,t}, ..., β_{n,t}) be the vector of subgradients at time t. Let Q_{S_i} be the n × n Gram matrix of the data set S_i. From the definition of J_i(·) and using the Lipschitz property of the loss functions, we have the following bound:

‖∇J_i(w̄(t))‖ ≤ ‖(1/n) Σ_{j∈S_i} β_{j,t} x_{i,j}‖ + μ‖w̄(t)‖,

and since

‖Σ_{j∈S_i} β_{j,t} x_{i,j}‖² = Σ_{j,j'∈S_i} β_{j,t} β_{j',t} x_{i,j}^⊤ x_{i,j'} = β_t^⊤ Q_{S_i} β_t ≤ ‖β_t‖² σ₁(Q_{S_i}) ≤ n L² σ₁(Q_{S_i}),

we obtain

‖∇J_i(w̄(t))‖² ≤ 2L² σ₁(Q_{S_i})/n + 2μ² ‖w̄(t)‖². (46)

We rewrite the update (26) in terms of {x_{i,t}}, the points sampled at the nodes at time t:

w̄(t+1) = w̄(t)(1 − μη_t) − (η_t/m) Σ_{i=1}^m l'(w_i(t)^⊤ x_{i,t}) x_{i,t}. (47)

From equation (47), after unrolling the recursion as in Shalev-Shwartz et al. [4], we see

w̄(t) = −(1/(μ(t−1))) · (1/m) Σ_{i=1}^m Σ_{τ=1}^{t−1} γ_τ^i x_{i,τ}, (48)

where γ_τ^i ∈ ∂l(w_i(τ)^⊤ x_{i,τ}) is a subgradient for the point sampled at time τ at node i. Then we have

‖w̄(t)‖ ≤ (1/(μ(t−1)m)) Σ_{i=1}^m ‖Σ_{τ=1}^{t−1} γ_τ^i x_{i,τ}‖. (49)

Let us in turn bound the term ‖Σ_{τ=1}^{t−1} γ_τ^i x_{i,τ}‖ for each node i. Let γ^i = (γ_1^i, γ_2^i, ..., γ_{t−1}^i) be the vector of subgradients up to time t. We have

‖Σ_{τ=1}^{t−1} γ_τ^i x_{i,τ}‖² = Σ_{τ,τ'} γ_τ^i γ_{τ'}^i x_{i,τ}^⊤ x_{i,τ'} = (γ^i)^⊤ Q_{i,t} γ^i ≤ ‖γ^i‖² σ₁(Q_{i,t}) ≤ (t−1) L² σ₁(Q_{i,t}), (50)

where Q_{i,t} is the (t−1) × (t−1) Gram matrix corresponding to the points sampled at the i-th node until time t.

Further bounding (49):

‖w̄(t)‖² ≤ ((1/(μ(t−1)m)) Σ_{i=1}^m √((t−1) L² σ₁(Q_{i,t})))² = (L²/μ²) ((1/m) Σ_{i=1}^m √(σ₁(Q_{i,t})/(t−1)))².

Since, as stated before, everything is conditioned on the sample split, we take expectations with respect to the history and the random split; using the Cauchy-Schwarz inequality again, and the fact that the points are sampled i.i.d. from the same distribution,

E[‖w̄(t)‖²] ≤ (L²/μ²) (1/m²) Σ_{i=1}^m Σ_{j=1}^m E[σ₁(Q_{i,t})/(t−1)]^{1/2} E[σ₁(Q_{j,t})/(t−1)]^{1/2} = (L²/μ²) E[σ₁(Q_{i,t})/(t−1)]. (51)

The last line follows from the expectation over the sampling model: the data at node i and node j have the same expected covariance since they are sampled uniformly at random from the total data. Taking the expectation in (46) and substituting (51) we have

E[‖∇J_i(w̄(t))‖²] ≤ 2L² E[σ₁(Q_{S_i})/n] + 2L² E[σ₁(Q_{i,t})/(t−1)]. (52)

Since S_i is a uniform random draw from S, and assuming both t and n are greater than (4/(3ρ²)) log d, applying Lemma 7 gives us

E[‖∇J_i(w̄(t))‖²] ≤ 20 L² ρ². (53)

10.4.2 Bounding E[‖∇J_i(w_i(t))‖²]

Just as in the previous subsection, we have

‖∇J_i(w_i(t))‖² ≤ 2L² σ₁(Q_{S_i})/n + 2μ² ‖w_i(t)‖².

Using the triangle inequality, the fact that (a₁ + a₂)² ≤ 2a₁² + 2a₂², the bounds (45) and (51), and Lemma 7:

E[‖w_i(t)‖²] ≤ (8L²m/μ²) · log²(bet²)/(b²t²) + 5L²ρ²/μ². (54)

Since the second term does not scale with t, from (54) we can infer that for the second term to dominate the first we require

t/log(bet²) > √(8m/5)/(ρb).

This gives us

E[‖w_i(t)‖²] ≤ 10 L² ρ² / μ², (55)

and therefore

E[‖∇J_i(w_i(t))‖²] ≤ 2L² · 5ρ² + 2μ² · 10L²ρ²/μ² = 30 L² ρ². (56)

10.5 Bound for T2

Because the gradients are bounded,

E[‖(1/m) Σ_{i=1}^m g_i(t)‖²] = (1/m²) Σ_{i=1}^m E[‖g_i(t)‖²] + (1/m²) Σ_{i≠j} E[g_i(t)^⊤ g_j(t)]
 ≤ L²/m + (1/m²) Σ_{i≠j} E[E[g_i(t)^⊤ g_j(t) | F_t]].

Now, using the fact that the gradients g_i(t) are unbiased estimates of ∇J_i(w_i(t)), that g_i(t) and g_j(t) are independent given the past history, and inequality (56) for nodes i and j, we get

Σ_{i≠j} E[E[g_i(t)^⊤ g_j(t) | F_t]] = Σ_{i≠j} E[∇J_i(w_i(t))^⊤ ∇J_j(w_j(t))]
 ≤ Σ_{i≠j} E[‖∇J_i(w_i(t))‖²]^{1/2} E[‖∇J_j(w_j(t))‖²]^{1/2}
 ≤ m(m−1) · √(30L²ρ²) · √(30L²ρ²) = m(m−1) · 30L²ρ². (57)

Therefore, to bound the term T2 in (35) we can use

E[‖(1/m) Σ_{i=1}^m g_i(t)‖²] ≤ L²/m + 30 L² ρ². (58)
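The structure of the bound (58) — a diagonal term shrinking like 1/m plus a cross term governed by the expected gradients — can be checked on a toy surrogate. The Gaussian-plus-clipping model below is an assumption for illustration, not the paper's setting.

```python
import numpy as np

# Illustration of the T2 bound: for independent unbiased gradients g_i with
# ||g_i|| <= L, the second moment of their average splits into a variance
# term that shrinks like 1/m and a cross term controlled by ||E g_i||^2.
rng = np.random.default_rng(3)
d, L, trials = 10, 1.0, 20_000

def mean_sq_norm(m):
    # g_i = small common mean + independent noise, rescaled so ||g_i|| <= L.
    mean = np.full(d, 0.05)
    g = mean + 0.3 * rng.standard_normal((trials, m, d))
    nrm = np.linalg.norm(g, axis=2, keepdims=True)
    g = g * np.minimum(1.0, L / nrm)
    # Monte Carlo estimate of E || (1/m) sum_i g_i ||^2.
    return np.mean(np.sum(g.mean(axis=1) ** 2, axis=1))

v1, v8 = mean_sq_norm(1), mean_sq_norm(8)
print(v1, v8)
```

With m = 8 the averaged-gradient second moment drops well below the single-node value, but it does not vanish: the residual floor is the cross term that the spectral norm ρ² controls in (58).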

10.6 Bound for T3

Applying (45), (53), and (56) to T3 in (35), as well as Lemma 8 and the fact that (a₁ + a₂)² ≤ 2a₁² + 2a₂², we obtain the following bound:

T3 = (1/m) Σ_{i=1}^m E[‖w̄(t) − w_i(t)‖²]^{1/2} E[(‖∇J_i(w_i(t))‖ + ‖∇J_i(w̄(t))‖)²]^{1/2}
 ≤ (2L√m/μ) · (log(bet²)/(bt)) · √(2 · 30L²ρ² + 2 · 20L²ρ²)
 = (20 L² ρ √m / μ) · log(bet²)/(bt). (59)

10.7 Combining the Bounds

Finally, combining (58) and (59) in (35) and applying the step size assumption η_t = 1/(μt):

E[J(w̄(t)) − J(w*)] ≤ (1/(2η_t) − μ/2) E[‖w̄(t) − w*‖²] − (1/(2η_t)) E[‖w̄(t+1) − w*‖²]
  + (1/(2μt)) (30L²ρ² + L²/m) + (20L²ρ√m/μ) · log(bet²)/(bt)
 ≤ (μ(t−1)/2) E[‖w̄(t) − w*‖²] − (μt/2) E[‖w̄(t+1) − w*‖²] + K₀ L²/(μt), (60)

where K₀ = (30ρ² + 1/m)/2 + 60√m ρ log(T)/b, using t ≤ T and assuming T > be (so that log(bet²) ≤ 3 log T). Let us now define two new sequences, the average of the averages of the iterates over the nodes from t = 1 to T, and the average for any node i:

ŵ(T) = (1/T) Σ_{t=1}^T w̄(t), (61)
ŵ_i(T) = (1/T) Σ_{t=1}^T w_i(t). (62)

Then summing (60) from t = 1 to T, using the convexity of J, and collapsing the telescoping sum in the first two terms of (60),

E[J(ŵ(T)) − J(w*)] ≤ (1/T) Σ_{t=1}^T E[J(w̄(t)) − J(w*)]
 ≤ −(μ/2) E[‖w̄(T+1) − w*‖²] + (K₀L²/(μT)) Σ_{t=1}^T 1/t
 ≤ 2 K₀ L² log(T)/(μT). (63)

Now, using the definition of the subgradient, Cauchy-Schwarz, and Jensen's inequality, we have

J(ŵ_i(T)) − J(w*) ≤ J(ŵ(T)) − J(w*) + ∇J(ŵ_i(T))^⊤ (ŵ_i(T) − ŵ(T))
 ≤ J(ŵ(T)) − J(w*) + ‖∇J(ŵ_i(T))‖ ‖ŵ_i(T) − ŵ(T)‖
 ≤ J(ŵ(T)) − J(w*) + ‖∇J(ŵ_i(T))‖ · (1/T) Σ_{t=1}^T ‖w_i(t) − w̄(t)‖. (64)

To proceed we must bound E[‖∇J(ŵ_i(T))‖²] in a similar way as the bound (53). First, let α_i ∈ ∂l(ŵ_i(T)^⊤ x_i) denote the subgradient for the i-th loss function of J(·) in (1), evaluated at ŵ_i(T), and α = (α_1, α_2, ..., α_N) be the vector of subgradients. As before,

‖∇J(ŵ_i(T))‖² ≤ 2 ‖(1/N) Σ_{i=1}^N α_i x_i‖² + 2μ² ‖ŵ_i(T)‖²
 ≤ 2 α^⊤ Q α / N² + 2μ² ‖ŵ_i(T)‖²
 ≤ 2L² σ₁(Q)/N + 2μ² (1/T) Σ_{t=1}^T ‖w_i(t)‖².

Taking expectations of both sides and using (55) as before:

E[‖∇J(ŵ_i(T))‖²] ≤ 2L² · 5ρ² + 2μ² · 10L²ρ²/μ² = 30 L² ρ².

Taking expectations of both sides of (64) and using the Cauchy-Schwarz inequality, (63), the preceding gradient bound, Lemma 8, and the definition of K₀, we get

E[J(ŵ_i(T)) − J(w*)] ≤ 2K₀ L² log(T)/(μT) + √(30) L ρ · (2L√m/μ) · (1/T) Σ_{t=1}^T log(bet²)/(bt)
 ≤ (2K₀ + 12√(30m) ρ log(T)/b) · L² log(T)/(μT). (65)

Recalling that b = (1/2) log(1/λ₂(P)) ≥ (1 − λ₂(P))/2, assuming T > be, subsuming the first term in the third, and taking expectations with respect to the sample split, the above bound can be written as

E[J(ŵ_i(T)) − J(w*)] ≤ (1 + 100 √m ρ / (1 − λ₂(P))) · (L²/μ) · log²(T)/T. (66)

11 Proof of Lemma 2

Proof. Let us define the product of the sequence of random matrices {P(τ) : s ≤ τ ≤ t}:

Φ(s : t) = P(t) · · · P(s). (67)

Then, proceeding as in the proof of Lemma 8 and using the step size η_t = 1/(μt), we get

‖w̄(t) − w_i(t)‖ ≤ Σ_{s=1}^{t−1} η_s Σ_{j=1}^m |Φ(s : t)_{ij} − 1/m| ‖g_j(s)‖ + η_t (1/m) Σ_{j=1}^m ‖g_j(t) − g_i(t)‖ (68)
 ≤ Σ_{s=1}^{t−1} (L/(μs)) ‖Φ(s : t)^⊤ e_i − (1/m) 1‖₁ + 2L/(μt). (69)

Let e_i be the vector with 0s everywhere except at the i-th position. Then by Jensen's inequality,

E[‖Φ(s : t)^⊤ e_i − (1/m) 1‖₁]² ≤ m · E[‖Φ(s : t)^⊤ e_i − (1/m) 1‖²]. (70)

Consider the recursion u(t+1) = P(t)^⊤ u(t) and let v(t+1) = P(t)^⊤ u(t) − (1/m) 1 1^⊤ u(t). Then we have

E[‖v(t+1)‖² | v(t)] = v(t)^⊤ E[P(t) P(t)^⊤] v(t) ≤ ‖v(t)‖² λ₂(E[P(t) P(t)^⊤]), (71)

since v(t) is orthogonal to the leading eigenvector of E[P(t) P(t)^⊤]. Taking expectations with respect to v(t) we get

E[‖v(t+1)‖²] ≤ E[‖v(t)‖²] · λ₂(E[P(t) P(t)^⊤]). (72)

Recursively expanding (72) we obtain

E[‖v(t+1)‖²] ≤ ‖v(0)‖² · λ₂(E[P(t) P(t)^⊤])^{t−s+1}. (73)

Consider the initial vector u(0) = e_i, so that v(t+1) = Φ(s : t)^⊤ e_i − (1/m) 1. This finally gives us

E[‖Φ(s : t)^⊤ e_i − (1/m) 1‖₁] ≤ √m · E[‖Φ(s : t)^⊤ e_i − (1/m) 1‖²]^{1/2} ≤ √m · λ₂(E[P(t) P(t)^⊤])^{(t−s+1)/2}. (74)

Proceeding as in the proof of Lemma 8, with a = λ₂(E[P(t) P(t)^⊤])^{1/2} and b = −log(a), we get

E[‖w̄(t) − w_i(t)‖²]^{1/2} ≤ (2L√m/μ) · log(bet²)/(bt). (75) ∎

12 Proof of Theorem 3

The proof follows easily from the proof of Theorem 1.

Proof. Since (35) still holds, we merely apply Lemma 2 in (35) and continue in the same way as the proof of Theorem 1. ∎

13 Proof of Theorem 4

We first establish the network lemma for scheme (3).

Lemma 9. Fix a Markov matrix P and consider Algorithm 1 when the objective J(w) is strongly convex and the frequency of communication satisfies

1/ν > (4/(3ρ²)) log d. (76)

Then we have the following inequality for the expected squared error between the iterate w_i(t) at node i at time t and the average w̄(t) defined in Algorithm 1 for scheme (3):

E[‖w̄(t) − w_i(t)‖²]^{1/2} ≤ (4L√(5m) ρ/μ) · log(bet²)/(bt), (77)

where b = (1/2) log(1/λ₂(P)).

Proof. It is easy to see that we can write the update equation in Algorithm 1 as

w_i(t+1) = Σ_{j=1}^m P'_{ij}(t) w_j(t) − η_t g_i^{1/ν}(t), (78)

where P'_{ij}(t) = P_{ij}(t) at communication steps t and P'_{ij}(t) = 1{i = j} otherwise, and g_i(t) = g_i^{1/ν}(t) + μ w_i(t). We first need a bound on ‖g_j^{1/ν}(s) − g_i^{1/ν}(s)‖ using the definition of the mini-batch (sub)gradient

g_i^{1/ν}(s) = ν Σ_{k∈H_{i,s}} l'(w_i(s)^⊤ x_k) x_k, (79)

where H_{i,s} is the mini-batch of 1/ν points consumed at node i at step s. The mini-batch (sub)gradient satisfies

‖g_i^{1/ν}(s)‖ ≤ L √(ν σ₁(Q_i^{1/ν})), (80)

where Q_i^{1/ν} is the Gram matrix of the mini-batch, and proceeding as in (40),

‖w̄(t) − w_i(t)‖ ≤ Σ_{s=1}^{t−1} (L√(ν σ₁(Q_i^{1/ν}))/(μs)) ‖(P^{t−s})_i − (1/m) 1‖₁ + 2L√(ν σ₁(Q_i^{1/ν}))/(μt).

Continuing as in the proof of Lemma 8, taking expectations and using Lemma 7, for 1/ν > (4/(3ρ²)) log d we have

E[‖w̄(t) − w_i(t)‖²]^{1/2} ≤ (4L√m/μ) · E[ν σ₁(Q_i^{1/ν})]^{1/2} · log(bet²)/(bt) ≤ (4L√(5m) ρ/μ) · log(bet²)/(bt). (81) ∎

For scheme (3), all the steps up to the bound (35) from the proof of Theorem 3 remain the same. The difference in the rest of the proof arises primarily from the mini-batch gradient norm factor in Lemma 9. We have the same decomposition as (35), with T1, T2, and T3 as in (32), (33), and (34). The gradient norm bounds also do not change, since the mini-batch gradient is an unbiased estimate of the true gradient ∇J_i(·). Thus, substituting Lemma 9 in the above and following the same steps as the proof of Theorem 3, replacing T by νT, where T is now the total number of iterations including the communication as well as the mini-batch gathering rounds, we get Theorem 4.
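Scheme (3)'s intermittent pattern — one communication round per 1/ν local samples, consumed as a mini-batch — can be sketched as follows. The toy quadratic setup and all constants are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Sketch of the intermittent scheme: nodes gather 1/nu local samples as a
# mini-batch between communication rounds, so consensus happens only every
# 1/nu-th sample. Toy regularized least-squares setup.
rng = np.random.default_rng(4)
m, d, n, mu, nu_inv, rounds = 4, 5, 200, 0.1, 8, 2000

X = [rng.standard_normal((n, d)) / np.sqrt(d) for _ in range(m)]
y = [x @ np.ones(d) for x in X]

P = np.zeros((m, m))  # lazy ring, doubly stochastic
for i in range(m):
    P[i, i], P[i, (i - 1) % m], P[i, (i + 1) % m] = 0.5, 0.25, 0.25

W = np.zeros((m, d))
t = 0
for r in range(rounds):
    W = P @ W                       # communicate once per round
    t += 1
    for i in range(m):              # then take one 1/nu mini-batch step
        idx = rng.integers(n, size=nu_inv)
        err = X[i][idx] @ W[i] - y[i][idx]
        g = X[i][idx].T @ err / nu_inv + mu * W[i]
        W[i] -= g / (mu * t)        # eta_t = 1 / (mu * t)

Xall, yall = np.vstack(X), np.concatenate(y)
w_star = np.linalg.solve(Xall.T @ Xall / len(yall) + mu * np.eye(d),
                         Xall.T @ yall / len(yall))
print(np.linalg.norm(W.mean(axis=0) - w_star))
```

Despite communicating only once per mini-batch, the node average still approaches the centralized optimum; the price, per Theorem 4, is that T must be counted in total samples νT rather than rounds.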

27 have Continuing as in the proof of Lea 8, taking expectations and using Lea 7, for /ν > 4 3ρ log(d) we ν w(t) w i (t) ] 4L Q /ν ] µ 4L 5ρ µ log(bet ) bt log(bet ) bt (8) For the schee (3) all the steps until bound (35) fro proof of Theore 3 reain the sae. The difference in the rest of the proof arises priarily fro the ini batch gradient nor factor in Lea 9. We have the sae decoposition as (35) with T, T, and T3 as in (3), (33), and (34). The gradient nor bounds also don t change since the inibatch gradient is also an unbiased gradient of the true gradient J( ). Thus substituting Lea 9 in the above and following the sae steps as in proof of Theore 3, replacing T by νt where T is now the total iterations including the counication as well as the inibatch gathering rounds, we get Theore Proof of Lea 5 In the proof we will use the corresponding ultivariate norality result of Bianchi et al. 5, Theore 5]. Finally using soothness and strong convexity we shall get Lea 5. It is easy to verify that Algorith satisfies all the assuptions necessary (Assuptions, 4, 6, 7, 8a, and 8b in Bianchi et al. 5]) for the result to hold. Assuption requires the weight atrix P(t) to be row stochastic alost surely, identically distributed over tie, and that P(t)] is colun stochastic. Our Markov atrix is constant over tie and doubly stochastic. Assuption b follows because P is constant and independent of the stochastic gradients, which are sapled uniforly with replaceent. Assuption 4 requires square integrability of the gradients as well as a regularity condition. In our setting, this follows since the sapled gradients are bounded alost everywhere. Assuption 6 iposes soe analytic conditions at the optiu value. These hold since the gradient is assued to be differentiable and the Hessian atrix at w is positive definite with its sallest eigenvalue is at least µ > 0 (this follows fro strong convexity). Assuption 7 of Bianchi et al. 5] follows fro our existing assuptions. 
Assumptions 8a and 8b are standard stochastic approximation assumptions on the step size that are easily satisfied by η_t = 1/(μt). Next, it is straightforward to show that the averages over the nodes of the iterates w_i(t) for Algorithm 1 and for the distributed algorithm of [5] are the same and satisfy

w̄(t+1) = w̄(t) − (η_t/m) Σ_{i=1}^m g_i(t). (82)

Now note that

w_i(t) − w* = [w_i(t) − w̄(t)] + [w̄(t) − w*], (83)

where the first term is the network error (T1) and the second term is asymptotically normal (T2). From Lemma 8 we know that the network error (T1) decays, and from the update equation (82) we know that the averaged iterates for both versions are the same. Then the proof of Theorem 5 of Bianchi et


More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

PAC-Bayes Analysis Of Maximum Entropy Learning

PAC-Bayes Analysis Of Maximum Entropy Learning PAC-Bayes Analysis Of Maxiu Entropy Learning John Shawe-Taylor and David R. Hardoon Centre for Coputational Statistics and Machine Learning Departent of Coputer Science University College London, UK, WC1E

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution Testing approxiate norality of an estiator using the estiated MSE and bias with an application to the shape paraeter of the generalized Pareto distribution J. Martin van Zyl Abstract In this work the norality

More information

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010 A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING By Eanuel J Candès Yaniv Plan Technical Report No 200-0 Noveber 200 Departent of Statistics STANFORD UNIVERSITY Stanford, California 94305-4065

More information

A Simplified Analytical Approach for Efficiency Evaluation of the Weaving Machines with Automatic Filling Repair

A Simplified Analytical Approach for Efficiency Evaluation of the Weaving Machines with Automatic Filling Repair Proceedings of the 6th SEAS International Conference on Siulation, Modelling and Optiization, Lisbon, Portugal, Septeber -4, 006 0 A Siplified Analytical Approach for Efficiency Evaluation of the eaving

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

A Probabilistic and RIPless Theory of Compressed Sensing

A Probabilistic and RIPless Theory of Compressed Sensing A Probabilistic and RIPless Theory of Copressed Sensing Eanuel J Candès and Yaniv Plan 2 Departents of Matheatics and of Statistics, Stanford University, Stanford, CA 94305 2 Applied and Coputational Matheatics,

More information

Optimal Jamming Over Additive Noise: Vector Source-Channel Case

Optimal Jamming Over Additive Noise: Vector Source-Channel Case Fifty-first Annual Allerton Conference Allerton House, UIUC, Illinois, USA October 2-3, 2013 Optial Jaing Over Additive Noise: Vector Source-Channel Case Erah Akyol and Kenneth Rose Abstract This paper

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations Randoized Accuracy-Aware Progra Transforations For Efficient Approxiate Coputations Zeyuan Allen Zhu Sasa Misailovic Jonathan A. Kelner Martin Rinard MIT CSAIL zeyuan@csail.it.edu isailo@it.edu kelner@it.edu

More information

Physics 215 Winter The Density Matrix

Physics 215 Winter The Density Matrix Physics 215 Winter 2018 The Density Matrix The quantu space of states is a Hilbert space H. Any state vector ψ H is a pure state. Since any linear cobination of eleents of H are also an eleent of H, it

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Average Consensus and Gossip Algorithms in Networks with Stochastic Asymmetric Communications

Average Consensus and Gossip Algorithms in Networks with Stochastic Asymmetric Communications Average Consensus and Gossip Algoriths in Networks with Stochastic Asyetric Counications Duarte Antunes, Daniel Silvestre, Carlos Silvestre Abstract We consider that a set of distributed agents desire

More information

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors New Slack-Monotonic Schedulability Analysis of Real-Tie Tasks on Multiprocessors Risat Mahud Pathan and Jan Jonsson Chalers University of Technology SE-41 96, Göteborg, Sweden {risat, janjo}@chalers.se

More information

TEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES

TEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES TEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES S. E. Ahed, R. J. Tokins and A. I. Volodin Departent of Matheatics and Statistics University of Regina Regina,

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies Approxiation in Stochastic Scheduling: The Power of -Based Priority Policies Rolf Möhring, Andreas Schulz, Marc Uetz Setting (A P p stoch, r E( w and (B P p stoch E( w We will assue that the processing

More information

Ocean 420 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers

Ocean 420 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers Ocean 40 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers 1. Hydrostatic Balance a) Set all of the levels on one of the coluns to the lowest possible density.

More information

arxiv: v2 [cs.lg] 30 Mar 2017

arxiv: v2 [cs.lg] 30 Mar 2017 Batch Renoralization: Towards Reducing Minibatch Dependence in Batch-Noralized Models Sergey Ioffe Google Inc., sioffe@google.co arxiv:1702.03275v2 [cs.lg] 30 Mar 2017 Abstract Batch Noralization is quite

More information

A new type of lower bound for the largest eigenvalue of a symmetric matrix

A new type of lower bound for the largest eigenvalue of a symmetric matrix Linear Algebra and its Applications 47 7 9 9 www.elsevier.co/locate/laa A new type of lower bound for the largest eigenvalue of a syetric atrix Piet Van Mieghe Delft University of Technology, P.O. Box

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

Lecture 9 November 23, 2015

Lecture 9 November 23, 2015 CSC244: Discrepancy Theory in Coputer Science Fall 25 Aleksandar Nikolov Lecture 9 Noveber 23, 25 Scribe: Nick Spooner Properties of γ 2 Recall that γ 2 (A) is defined for A R n as follows: γ 2 (A) = in{r(u)

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

A method to determine relative stroke detection efficiencies from multiplicity distributions

A method to determine relative stroke detection efficiencies from multiplicity distributions A ethod to deterine relative stroke detection eiciencies ro ultiplicity distributions Schulz W. and Cuins K. 2. Austrian Lightning Detection and Inoration Syste (ALDIS), Kahlenberger Str.2A, 90 Vienna,

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Deflation of the I-O Series Some Technical Aspects. Giorgio Rampa University of Genoa April 2007

Deflation of the I-O Series Some Technical Aspects. Giorgio Rampa University of Genoa April 2007 Deflation of the I-O Series 1959-2. Soe Technical Aspects Giorgio Rapa University of Genoa g.rapa@unige.it April 27 1. Introduction The nuber of sectors is 42 for the period 1965-2 and 38 for the initial

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS An Iproved Method for Bandwidth Selection When Estiating ROC Curves Peter G Hall and Rob J Hyndan Working Paper 11/00 An iproved

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

Lecture 12: Ensemble Methods. Introduction. Weighted Majority. Mixture of Experts/Committee. Σ k α k =1. Isabelle Guyon

Lecture 12: Ensemble Methods. Introduction. Weighted Majority. Mixture of Experts/Committee. Σ k α k =1. Isabelle Guyon Lecture 2: Enseble Methods Isabelle Guyon guyoni@inf.ethz.ch Introduction Book Chapter 7 Weighted Majority Mixture of Experts/Coittee Assue K experts f, f 2, f K (base learners) x f (x) Each expert akes

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

Error Exponents in Asynchronous Communication

Error Exponents in Asynchronous Communication IEEE International Syposiu on Inforation Theory Proceedings Error Exponents in Asynchronous Counication Da Wang EECS Dept., MIT Cabridge, MA, USA Eail: dawang@it.edu Venkat Chandar Lincoln Laboratory,

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

Low-complexity, Low-memory EMS algorithm for non-binary LDPC codes

Low-complexity, Low-memory EMS algorithm for non-binary LDPC codes Low-coplexity, Low-eory EMS algorith for non-binary LDPC codes Adrian Voicila,David Declercq, François Verdier ETIS ENSEA/CP/CNRS MR-85 954 Cergy-Pontoise, (France) Marc Fossorier Dept. Electrical Engineering

More information