Data Dependent Convergence For Consensus Stochastic Optimization

Size: px
Start display at page:

Download "Data Dependent Convergence For Consensus Stochastic Optimization"

Transcription

1 Data Dependent Convergence For Consensus Stochastic Optiization Avleen S. Bijral, Anand D. Sarwate, Nathan Srebro Septeber 8, 08 arxiv: v ath.oc] Sep 06 Abstract We study a distributed consensus-based stochastic gradient descent (SGD) algorith and show that the rate of convergence involves the spectral properties of two atrices: the standard spectral gap of a weight atrix fro the network topology and a new ter depending on the spectral nor of the saple covariance atrix of the data. This data-dependent convergence rate shows that distributed SGD algoriths perfor better on datasets with sall spectral nor. Our analysis ethod also allows us to find data-dependent convergence rates as we liit the aount of counication. Spreading a fixed aount of data across ore nodes slows convergence; for asyptotically growing data sets we show that adding ore achines can help when iniizing twice-differentiable losses. Introduction Decentralized optiization algoriths for statistical coputation and achine learning on large data sets try to trade off efficiency (in ters of estiation error) and speed (fro parallelization). Fro an epirical perspective, it is often unclear when these ethods will work for a particular data set, and to what degree additional counication can iprove perforance. For exaple, in high-diensional probles counication can be costly. We would therefore like to know when liiting counication is feasible or beneficial. The theoretical analysis of distributed optiization ethods has focused on providing strong data-independent convergence rates under analytic assuptions on the objective function such as convexity and soothness. In this paper we show how the tradeoff between efficiency and speed is affected by the data distribution itself. We study a class of distributed optiization algoriths and prove an upper bound on the error that depends on the spectral nor of the data covariance. By tuning the frequency with which nodes counicate, we obtain a bound that depends on data distribution, network size and topology, and aount of counication. This allows us to interpolate between regies where counication is cheap (e.g. shared eory systes) and those where it is not (clusters and sensor networks). We study the proble of iniizing a regularized convex function ] of the for J(w) = N l(w x i ; y i ) + µ N w () i= = x ˆP l(w x; y) ] + µ w where l( ) is convex and Lipschitz and the expectation is with respect to the epirical distribution ˆP corresponding to a given data set with N total data points {(x i, y i )}. We will assue x i R d and y i R. This regularized epirical risk iniization forulation encopasses algoriths such as support vector achine classification, ridge regression, logistic regression, and others ]. For exaple x could represent d pixels in a grayscale iage and y a binary label indicating whether the iage is of a face: w x gives a confidence value about whether the iage is of a face or not. We would like to solve such probles using a network of processors connected via a network (represented by a graph indicating which nodes can counicate with each other). The syste would distribute these N points across the nodes, inducing local objective functions J j (w) approxiating ().

2 In such a coputational odel, nodes can perfor local coputations and send essages to each other to jointly iniize (). The strategy we analyze is what is referred to as distributed prial averaging 3]: each node in the network processes points sequentially, perforing a SGD update locally and averaging the current iterate values of their neighbors after each gradient step. This can also be thought of as a distributed consensus-based version of Pegasos 4] when the loss function is the hinge loss. We consider a general topology with nodes attepting to iniize a global objective function J(w) that decoposes into a su of local objectives: J(w) = i= J i(w). This is a odel for optiization in systes such as data centers, distributed control systes, and sensor networks. Main Results. Our goal in this paper is to characterize how the spectral nor ρ = σ ( ˆPxx ]) of the saple covariance affects the rate of convergence of stochastic consensus schees under different counication requireents. lucidating this dependence can help guide epirical practice by providing insight into when these ethods will work well. We prove an upper bound on the suboptiality gap for distributed prial averaging that depends on ρ as well as the ixing tie of the weight atrix associated to the algorith. Our result shows that networks of size < ρ gain fro parallelization. To understand the counication-liited regie, we extend our analysis to interittent counication. In a setting with finite data and sparse connectivity, convergence will deteriorate with increasing because we split the data to ore achines that are farther apart. We also show that by using a ini-batching strategy we can offset the penalty of infrequent counication by counicating after a ini-batch (sub)gradient step. Finally, in an asyptotic regie with infinite data at every node we show using results of Bianchi et al. 5] that for twice-differentiable loss functions this network effect disappears and that we gain fro additional parallelization. Related Work. Several authors have proposed distributed algoriths involving nodes coputing local gradient steps and averaging iterates, gradients, or other functions of their neighbors 3, 6, 7]. By alternating local updates and consensus with neighbors, estiates at the nodes converge to the optiizer of J( ). In these works no assuption is ade on the local objective functions and they can be arbitrary. Consequently the convergence guarantees do not reflect the setting when the data is hoogenous (for e.g. when data has the sae distribution), specifically error increases as we add ore achines. This is counterintuitive, especially in the large scale regie, since this suggests that despite hoogeneity the ethods perfor worse than the centralized setting (all data on one node). We provide a first data-dependent analysis of a consensus based stochastic gradient ethod in the hoogenous setting and deonstrate that there exist regies where we benefit fro having ore achines in any network. In contrast to our stochastic gradient based results, data dependence via the Hessian of the objective has also been deonstrated in parallel coordinate descent based approaches of Liu et al. 8] and the Shotgun algorith of Bradley et al. 9]. The assuptions differ fro us in that the objective function is assued to be sooth 8] or L regularized 9]. Most iportantly, our results hold for arbitrary networks of copute nodes, while the coordinate descent based results hold only for networks where all nodes counicate with a central aggregator (soeties referred to as a aster-slave architecture, or a star network), which can be used to odel shared-eory systes. Another interesting line of work is the ipact of delay on convergence in distributed optiization 0]. These results show that delays in the gradient coputation for a star network are asyptotically negligible when optiizing sooth loss functions. We study general network topologies but with interittent, rather than delayed counication. Our result suggest that certain datasets are ore tolerant of skipped counication rounds, based on the spectral nor of their covariance. We take an approach siilar to that of Takáč et al. ] who developed a spectral-nor based analysis of ini-batching for non-sooth functions. We decopose the iterate in ters of the data points encountered in the saple path ]. This differs fro analysis based on soothness considerations alone 0, 4] and gives practical insight into how counication (full or interittent) ipacts the perforance of these algoriths. Note that our work is fundaentally different in that these other works either assue a centralized setting 4] or iplicitly assue a specific network topology (e.g. 5] uses a star topology). For the ain results we only assue strong convexity while the existing guarantees for the cited ethods depend on a variety of regularity and soothness conditions. Liitation. In the stochastic convex optiization (see for e.g. 6]) setting the quantity of interest is the population objective corresponding to proble. When iniizing this population objective our results suggest that adding ore achines worsens convergence (See Theore ). For finite data our convergence

3 results satisfy the intuition that adding ore nodes in an arbitrary network will hurt convergence. The finite hoogenous setting is ost relevant in settings such as data centers, where the processors hold data which essentially looks the sae. In the infinite or large scale data setting, coon in achine learning applications, this is counterintuitive since when each node has infinite data, any distributed schee including one on arbitrary networks shouldn t perfor worse than the centralized schee (all data on one node). Thus our analysis is liited in that it doesn t unify the stochastic optiization and the consensus setting in a copletely satisfactory anner. To partially reedy this we explore consensus SGD for sooth strongly convex objectives in the asyptotic regie and show that one can gain fro adding ore achines in any network. In this paper we focus on a siple and well-studied protocol 3]. However, our analysis approach and insights ay yield data-dependent bounds for other ore coplex algoriths such as distributed dual averaging 6]. More sophisticated gradient averaging schees such as that of Mokhtari and Ribeiro 7] can exploit dependence across iterations 8, 9] to iprove the convergence rate; analyzing the ipact of the data distribution is considerably ore coplex in these algoriths. We believe that our results provide a first step towards understanding data-dependent bounds for distributed stochastic optiization in settings coon to achine learning. Our analysis coincides with phenoenon seen in practice: for data sets with sall ρ, distributing the coputation across any achines is beneficial, but for data with larger ρ ore achines is not necessarily better. Our work suggests that taking into account the data dependence can iprove the epirical perforance of these ethods. Model We will use boldface for vectors. Let k] = {,,..., k}. Unless otherwise specified, the nor is the standard uclidean nor. The spectral nor of a atrix A is defined to be the largest singular value σ (A) of the atrix A or equivalently the square root of the largest eigenvalue of A A. For a graph G = (V, ) with vertex set V and edge set, we will denote the neighbors of a vertex i V by N (i) V. Data odel. Let P be a distribution on R d+ such that for (x, y) P, we have x alost surely. Let S = {x, x,..., x N } be i.i.d saple of d-diensional vectors fro P and let ˆP be the epirical distribution of S. Let ˆΣ = x ˆPxx ] be the saple second-oent atrix of S. Our goal is to express the perforance of our algoriths in ters of ρ = σ ( ˆΣ), the spectral nor of ˆΣ. The spectral nor ρ can vary significantly across different data sets. For exaple, for sparse data sets ρ is often sall. This can also happen if the data happens to lie in low-diensional subspace (saller than the abient diension d). Proble. Our proble is to iniize a particular instance of () where the expectation is over a finite collection of data points: w def = argin J(w) () w Let ŵ j (t) be the estiate of w at node j ] in the t-th iteration. We bound the expected gap (over the data distribution) at iteration T between J(w ) and the value J(ŵ i (T )) of the global objective J(ŵ j (T )) at the output ŵ j (T ) of each node j in our distributed network. We will denote the subgradient set of J(w) by J(w) and a subgradient of J(w) by J(w) J(w). In our analysis we will ake the following assuptions about the individual functions l(w x): (a) The loss functions {l( )} are convex, and (b) The loss functions {l( ; y)} are L-Lipschitz for soe L > 0 and all y. Note that J(w) is µ-strongly convex due to the l -regularization. Our analysis will not depend on the the response y except through the Lipschitz bound L so we will oit the explicit dependence on y to siplify the notation in the future. Network Model. We consider a odel in which iniization in () ust be carried out by nodes. These nodes are arranged in a network whose topology is given by a graph G an edge (i, j) in the graph eans nodes i and j can counicate. A atrix P is called graph conforant if P ij > 0 only if the edge (i, j) is in the graph. We will consider algoriths which use a doubly stochastic and graph conforant sequence of atrices P(t). Sapling Model. We assue the N data points are divided evenly and uniforly at rando aong the nodes, and define n def = N/ to be the nuber of points at each node. This is a necessary assuption 3

4 since our bounds are data dependent and depend on subsapling bounds of spectral nor of certain rando subatrices. However our data independent bound holds for arbitrary splits. Let S i be the subset of n points at node i. The local stochastic gradient procedure consists of each node i ] sapling fro S i with replaceent. This is an approxiation to the local objective function J i (w) = j S i l(w x i,j ) n + µ w. (3) Algorith. In the subsequent sections we analyze the distributed version (Algorith ) of standard SGD. This algorith is not new 3, 7] and has been analyzed extensively in the literature. The step-size η t = /(µt) is coonly used for large scale strongly convex achine learning probles like SVMs (e.g.- 4]) and ridge regression: to avoid an extra paraeter in the bounds, we take this setting. In Algorith node i saples a point uniforly with replaceent fro a local pool of n points and then updates its iterate by coputing a weighted su with its neighbors followed by a local subgradient step. The selection is unifor to guarantee that the subgradient is an unbiased estiate of a true subgradient of the local objective J i (w), and greatly siplifies the analysis. Different choices of P(t) will allow us to understand the effect of liiting counication in this distributed optiization algorith. Algorith Consensus Strongly Convex Optiization Input: {x i,j },where i ] and j n] and N = n, atrix sequence P(t), µ > 0, T {ach i ] executes} Initialize: set w i () = 0 R d. for t = to T do Saple x i,t uniforly with replaceent fro S i. Copute g i (t) l(w i (t) x i,t )x i,t + µw i (t) w i (t + ) = j= w j(t)p ij (t) η t g i (t) end for Output: ŵ i (T ) = T T t= w i(t) for any i ]. xpectations and probabilities. There are two sources of stochasticity in our odel: the first in the split of data points to the individual nodes, and the second in sapling the points during the gradient descent procedure. We assue that the split is done uniforly at rando, which iplies that the expected covariance atrix at each node is the sae as the population covariance atrix ˆΣ. Conditioned on the split, we assue that the sapling at each node is uniforly at rando fro the data point at that node, which akes the stochastic subgradient an unbiased estiate of the subgradient of the local objective function. Let F t be the siga algebra generated by the rando point selections of the algorith up to tie t, so that the iterates {w i (t) : i ]} are easurable with respect to F t. 3 Convergence and Iplications Methods like Algorith, also referred to as prial averaging, have been analyzed previously 3, 7, 0]. In these works it is shown that the convergence properties depend on the structure of the underlying network via the second largest eigenvalue of P. We consider in this section the case when P(t) = P for all t where P is a fixed Markov atrix. This corresponds to a synchronous setting where counication occurs at every iteration. We analyze the use of the step-size η t = /(µt) in Algorith and show that the convergence depends on the spectral nor ρ = σ ( ˆΣ) of the saple covariance atrix. Theore. Fix a Markov atrix P and let ρ = σ ( ˆΣ) denote the spectral nor of the covariance atrix of the data distribution. Consider Algorith when the objective J(w) is strongly convex, P(t) = P for all t, and η t = /(µt). Let λ (P) denote the second largest eigenvalue of P. Then if the nuber of saples on 4

5 each achine n satisfies and the nuber of iterations T satisfies n > 4 log (d) (4) 3ρ then the expected error for each node i satisfies T > e log(/ λ (P)) (5) T log(t ) > ax 4 ( 8 5) 4 /ρ log (d),, (6) 3ρ log(/λ (P)) J(ŵ i (T )) J(w )] ( + 00 ) ρ log T λ (P) L µ log T T. (7) Reark : Theore indicates that the nuber of achines should be chosen as a function of ρ. We can identify three sub-cases of interest: Case (a): : In this regie since / > ρ ρ (ignoring the constants and the log T ter) we /3 always benefit fro adding ore achines. Case (b): < ρ /3 ρ : The result tells us that there is no degradation in the error and the bound iproves by a factor ρ. Sparse data sets generally have a saller value of ρ (as seen in Takáč et al. ]); Theore suggests that for such data sets we can use a larger nuber of achines without losing perforance. However the requireents on the nuber of iterations also increases. This provides additional perspective on the observation by Takáč et al ] that sparse datasets are ore aenable to parallelization via ini-batching. The sae holds for our type of parallelization as well. Case (c): > ρ : In this case we pay a penalty ρ suggesting that for datasets with large ρ we should expect to lose perforance even with relatively fewer achines. Note that > is iplicit in the condition T > e log(/ λ )) since λ = 0 for =. This excludes the single node Pegasos ] case. Additionally in the case of general strongly convex losses (not necessarily dependent on w x) we can obtain a convergence rate of O(log (T )/T ). We do not provide the proof here. 4 Stochastic Counication In this section we generalize our analysis in Theore to handle tie-varying and stochastic counication atrices P(t). In particular, we study the case where the atrices are chosen i.i.d. over tie. Any strategy that doesn t involve counicating at every step will incur a larger gap between the local node estiates and their average. We call this the network error. Our goal is to show how knowing ρ can help us balance the network error and optiality gap. First we bound the network error for the case of stochastic tie varying counication atrices P (t) and then a siple extension leads to a generalized version of Theore. Lea. Let {P(t)} be a i.i.d sequence of doubly stochastic Markov atrices and consider Algorith when the objective J(w) is strongly convex. We have the following inequality for the expected squared error between the iterate w i (t) at node i at tie t and the average w(t) defined in Algorith : where b = log ( /λ ( P (t) ])). w(t) w i (t) ] L µ b log(bet ). (8) t Ared with Lea we prove the following theore for Algorith in the case of stochastic counication. 5

6 Theore 3. Let {P(t)} be an i.i.d sequence of doubly stochastic atrices and ρ = σ ( ˆΣ) denote the spectral nor of the saple covariance atrix. Consider Algorith when the objective J(w) is strongly convex, and η t = /(µt). Then if the nuber of saples on each achine n satisfies and the nuber of iterations T satisfies and n > 4 log (d) (9) 3ρ T > e log(/ λ ( P (t)])) (0) ( T log(t ) > ax 4 3ρ log(d), 8 5 ρ then the expected error for the output of each node i satisfies J(ŵ i (T )) J(w )] ( + 00 ) ρ log T λ ( P (t)]) ) log(/λ ( P (t)])), () L µ log T T. () Reark: This result generalizes the conclusions of Theore to the case of stochastic counication schees. Thus allowing for the data dependent interpretations of convergence in a ore general setting. 5 Liiting Counication As an application of the stochastic counication scenario of Theore (3) we now analyze the effect of reducing the counication overhead of Algorith. This reduction can iprove the overall running tie ( wall tie ) of the algorith because counication latency can hinder the convergence of any algoriths in practice ]. A natural way of liiting counication is to counicate only a fraction ν of the T total iterations; at other ties nodes siply perfor local gradient steps. We consider a sequence of i.i.d rando atrices {P(t)} for Algorith where P(t) {I, P} with probabilities ν and ν, respectively, where I is the identity atrix (iplying no counication since P ij (t) = 0 for i j) and, as in the previous section, P is a fixed doubly stochastic atrix respecting the graph constraints. For this odel the expected nuber of ties counication takes place is siply νt. Note that now we have an additional randoization due to the Bernoulli distribution over the doubly stochastic atrices. Analyzing a atrix P(t) that depends on the current value of the iterates is considerably ore coplicated. A straightforward application of Theore 3 reveals that the optiization error is proportional to ν and decays as O( ν log (T ) T ). However, this ignores the effect of the local counication-free iterations. A ini-batch approach. To account for local counication free iterations we odify the interittent counication schee to follow a deterinistic schedule of counication every /ν steps. However, instead of taking single gradient steps between counication rounds, each node gathers the (sub)gradients and then takes an aggregate gradient step. That is, after the t-th round of counication, the node saples a batch I t of indices sapled with replaceent fro its local data set with I t = /ν. We can think of this as the base algorith with a better gradient estiate at each step. The update rule is now w i (t + ) = w j (t)p ij (t) η t ν g i (t). (3) j N i i I i We define g /ν i (t) = i I i g i (t). Now the iteration count is over the counication steps and g /ν i (t) is the aggregated ini-batch (sub)gradient of size /ν. Note that this is analogous to the rando schee above but the analysis is ore tractable. 6

7 Theore 4. Fix a Markov atrix P and let ρ = σ ( ˆΣ) denote the spectral nor of the covariance atrix of the data distribution. Consider Algorith when the objective J(w) is strongly convex, P(t) = P for all t, and η t = /(µt) for schee (3). Let λ (P) denote the second largest eigenvalue of P. Then if the nuber of saples on each achine n satisfies n > 4 log (d) (4) 3ρ and T > e ν log(/ λ (P)) T log(νt ) > ax 4 3νρ log(d), 4 /ρ log(/λ ) ( 8 5) ν > 4 log(d) (5) 3ρ and then the expected error for each node i satisfies J(ŵ i (T )) J(w )] ( ) + 00 ρ4 log(νt ) 5 λ µ log(νt ). (6) T L where ν is the frequency of counication and where λ = λ (P). Reark: Theore (4) suggests that if the inverse frequency of counication is large enough then we can obtain a sharper bound on the error by a factor of ρ. This is significantly better than a O( ρ log νt νt ) baseline guarantee fro a direct application of Theore when the nuber of iterations is νt. Additionally the result suggests that if we counicate on a ini batch(where batch size b = /ν) that is large enough we can iprove Theore, specifically now we get a / iproveent when /ρ 4/3. 6 Asyptotic Regie In this section we explore the sub-optiality of distributed prial averaging when T for the case of sooth strongly convex objectives. The results of Section (3) suggest that we never gain fro adding ore achines in any network. Now we investigate the behaviour of Algorith in the asyptotic regie and show that the network effect disappears and we do indeed gain fro ore achines in any network. Our analysis depends on the asyptotic norality of a variation of Algorith 5, Theore 5]. The ain differences between Algorith and the consensus algorith of Bianchi et al. 5] is that we average the iterates before aking the local update. We ake the following assuptions for the analysis in this section: () The loss function differentials { (l( ))} are differentiable and G-Lipschitz for soe G > 0, () the stochastic gradients are of the for g i (t) = J(w i (t))+ξ t where ξ t ] = 0 and ξ t ξ t ] = C, and (3) there exists p > 0 such that ξ t +p] <. Our results hold for all sooth strongly convex objectives not necessarily dependent on w x. Lea 5. Fix a Markov atrix P. Consider Algorith when the objective J(w) is strongly convex and twice differentiable, P(t) = P for all t, and η t = /(λt). then the expected error for each node i satisfies for 7

8 Table : Data sets and paraeters for experients data set training test di. λ ρ RCV 78, 65 3, 49 47, Covertype 5, 9 58, 00 47, a arbitrary split of N saples into nodes li sup T J P ij w j (T ) J(w ) T j N (i) j= (P ij ) Tr (H) G µ (7) where H is the solution to the equation J (w )H + H J (w ) T = C. (8) Reark: This result shows that asyptotically the network effect fro Theore 3 disappears and that additional nodes can speed convergence. An application of Lea 5 to proble () gives us the following result for the specialized case of a k-regular graph with constant weight atrix P. Theore 6. Consider Algorith when the objective J(w) has the for, P(t) = P and corresponds to a k-regular graph with unifor weights for all t, and η t = /(λt). then the expected error for each node i satisfies li sup T 5ρL k T J P ij w j (T ) J(w ) j= Tr ( J(w ) ) G µ where the expectation is with respect to the history of the sapled gradients as well as the unifor rando splits of N data points across achines. Reark: For objective () we obtain a /k variance reduction and the network effect disappears. 7 xperients Our goals in our experiental evaluation are to validate the theoretical dependence of the convergence rate on ρ and to see if the conclusions hold when the assuptions we ake in the analysis are violated. Note that all our experients are based on siulations on a ulticore coputer. 7. Data sets, tasks, and paraeter settings The data sets used in our experients are suarized in Table (7.). Covertype is the forest covertype dataset ] used in 4] obtained fro the UC Irvine Machine Learning Repository 3], and rcv is fro the Reuters collection 3] obtained fro libsv collection 4]. The RCV data set has a sall value of ˆρ, whereas Covertype has a larger value. In all the experients we looked at l -regularized classification objectives for proble (). ach plot is averaged over 5 runs. The data consists of pairs {(x, y ),..., (x N, y N )} where x i R d and y i {, +}. In all experients we optiize the l -regularized epirical hinge loss where l(w x) = ( w xy) +. The values of the regularization paraeter µ are chosen fro to be the sae as those in Shalev-Shwarz et al. 4]. (9) 8

9 0 8 Interittent on RCV Interittent on Covertype 6 8 log(prial) 4 Frequency log(prial) 4 Frequency e+00 e+05 e+05 3e+05 4e+05 Iteration Iteration Figure : Perforance of Algorith () with interittent counication schee on datasets with very different ρ. The algorith works better for saller ρ and there is less decay in perforance for RCV as we decrease the nuber of counication rounds as opposed to Covertype (ρ = 0.0 vs ρ = 0.). We siulated networks of copute nodes of varying size () arranged in a k-regular graph with k = 0.5 or a fixed degree (not dependent on ). Note that the dependence of the convergence rate of procedures like Algorith () on the properties of the underlying network has been investigated before and we refer the reader to Agarwal and Duchi 0] for ore details. In this paper we experient only with k-regular graphs. The weights on the Markov atrix P are set by using the ax-degree Markov chain (see 5]). One can also optiize for the fastest ixing Markov chain ( 5], 6]). ach node is randoly assigned n = N/ points. 7. Interittent Counication In this experient we show the objective function for RCV and Covertype as we change the frequency of counication (Figure ), counicating after every, 0, 50 and 500 iterations. Indeed as predicted we see that the dataset with the larger ρ appears to be affected ore by interittent counication. This indicates that network bandwidth can be conserved for datasets with a saller ρ. 7.3 Coparison of Different Schees We copare the three different schees proposed in this paper. On a network of = 64 achines we plot the perforance of the ini batch extension of Algorith () with batch size 8 against the interittent schee that counicates after every 8 iterations and also the standard version of the algorith. In Figure 3-(a) we see that as predicted in Theore (4) the ini batch schee proposed in (3) does better than the vanilla and the interittent schee. 7.4 Infinite Data To provide soe epirical evidence of Lea 5 we generate a very large (N = 0 7 ) synthetic dataset fro a ultivariate Noral distribution and created a siple binary classification task using a rando hyperplane. 9

10 Infinite Data 5.0 log(prial).5 Machines Iteration Figure : No network effect with increasing benefit of adding ore achines in the case of infinite data. As we can see in figure for the SVM proble and a k-regular network we continue to gain as we add ore achines and then eventually we stabilize but never lose fro ore achines. We only show the first few thousand iterations for clarity. 7.5 Diinishing Counication To test if our conclusions apply when the i.i.d assuption for the atrices P(t) does not hold we siulate a diinishing counication regie. Such a schee can be useful when the nodes are already close to the optial solution and counicating their respective iterate is wasteful. Intuitively it is in the beginning the nodes should counicate ore frequently. To foralize the intuition we propose the following counication odel { P w.p. Ct p P(t) = I w.p. Ct p (0) where C, p > 0. Thus the sequence of atrices are not identically distributed and the conclusions of Theore (3) do not apply. However in Figure 3-(b) (C=,p=0.5) we see that on a network of = 8 nodes the perforance for the diinishing regie is siilar to the full counication case and we can hypothesize that our results also hold for non i.i.d counication atrices. 0

11 0 Different Schees on Covertype Diinishing vs Full on Convertype log(prial) Schee =64 b=8 =64 b= Full =64 b= Int log(prial) 5.0 Schee Diinishing Full e+00 e+05 e+05 3e+05 4e Iteration Iteration Figure 3: a) Coparison of three different schees a) Algorith () with Mini-Batching b) Standard c) Interittent with b = (/ν) = 8. As predicted the ini-batch schee perfors uch better than the others. b) The perforance on Covertype with a full and a diinishing counication schee is siilar.

12 8 Discussion and Iplications In this paper we described a consensus stochastic gradient descent algorith and analyzed its perforance in ters of the spectral nor ρ of the data covariance atrix under a hoogenous assuption. In the consensus proble this setting has not been analyzed before and existing work corresponds to weaker results when this assuption holds. For certain strongly convex objectives we showed that the objective value gap between any node s iterate and the optiu centralized estiate decreases as O(log (T )/T ); crucially, the constant depended on ρ and the spectral gap of the network atrix. We showed how liiting counication can iprove the total runtie and reduce network costs by extending our analysis with a siilar data dependent bound. Moreover we show that in the asyptotic regie the network penalty disappears. Our analysis suggests that distribution-dependent bounds can help us understand how data properties can ediate the tradeoff between coputation and counication in distributed optiization. In a sense, data distributions with saller ρ are easier to optiize over in a distributed setting. This set of distributions includes sparse data sets, an iportant class for applications. In the future we will extend data dependent guarantees to serial algoriths as well as the average-atend schee 4, 5]. xtending our fixed batch-size to rando size can help us understand the benefit of counication-free iterations. Finally, we can also study the ipact of asynchrony and ore general tie-varying topologies. 9 Appendix We gather here the proof details and technical leas needed to establish our results. 0 Proof of Theore Theore provides a bound on the suboptiality gap for the output ŵ i (T ) of Algorith at node i, which is the average of that node s iterates. In the analysis we relate this local average to the average iterate across nodes at tie t: w(t) = We will also consider the average of w(t) over tie. The proof consists of three ain steps. We establish the following inequality for the objective error: i= J( w(t)) J(w )] (ηt µ) w(t) w ] η t w(t + ) w ] + η t g i (t) i= + w(t) w i (t) ] i= w i (t). () ( J i (w i (t)) + J i ( w(t)) ) ] /, () where w(t) is the average of the iterates at all nodes and the expectation is with respect to F t while conditioned on the saple split across nodes. All expectations, except when explicitly stated, will be conditioned on this split.

13 We bound J(w i (t)) ] and ηt i= g i(t) ] in ters of the spectral nor of the covariance atrix of the distribution P by additionally taking expectation with respect to the saple S. We bound the network error w(t) w i (t) ] in ter of the network size and a spectral property of the atrix P. Cobining the bounds using inequality () and applying the definition of subgradients yields the result of Theore. 0. Spectral Nor of Rando Subatrices In this section we establish a Lea pertaining to the spectral nor of subatrices that is central to our results. Specifically we prove the following inequality, which follows by applying the Matrix Bernstein inequality of Tropp 7]. Lea 7. Let P be a distribution on R d with second oent atrix Σ = Z P ZZ ] such that Z k alost surely. Let ζ = σ (Σ). Let Z, Z,..., Z K be an i.i.d. saple fro P and let Q K = K Z k Z k k= be the epirical second oent atrix of the data. Then for K > 4 3ζ log(d), ] σ (Q K ) 5ζ. (3) K ] Thus when P is the epirical distribution we get that σ(q K ) K 5ζ. Reark: We can replace the abient diension d in the requireent for K by an intrinsic diensionality ter but this requires a lower bound on the nor of any data point in the saple. Proof. Let Z be the d K atrix whose coluns are {Z k }. Define X k = Z k Z k Σ. Then X k] = 0 and λ ax (X k ) = λ ax ( Zk Z k Σ ) Z k, because Σ is positive seidefinite and x i for all i. Furtherore, σ ( K k= X k] ) ( = Kσ Zk Z k Z k Z ] k Σ ) ]) Kσ ( Y k Z k Z k + Kσ (Σ) K(ζ + ζ 4 ) Kζ, since ρ. Applying the Matrix Bernstein inequality of Tropp 7, Theore 6.]: ( ( K ) ) { d exp ( 3 r P σ X k r k= d exp ( ) 3 r 8 6Kζ ) r K ζ r K ζ. (4) 3

14 Now, note that ( K ) ( K ) σ X k = σ Z k Z k KΣ, ( K ) so σ k= X k r is iplied by k= k= ( K ) K σ Z k Z k σ (Σ) r K. k= Therefore ( σ (Q K ) P K ζ ( ) d exp r ( d exp ) 3Kr 6ζ 3Kr 8 ) r ζ r ζ. (5) Integrating (5) yields For K > 4 3ζ log d, ] σ (Q K ) = K ( ) σ (Q K ) P x dx 0 K ( ) 3ζ σ (Q t ) + P 3ζ K ζ x ζ dx ( ) 3ζ σ (Q t ) + P ζ K ζ r dr 3ζ + d exp ( 38 ) Kr dr ζ = 3ζ d ( K exp 3 ) 4 ζ K ] σ (Q K ) K 3ζ ζ. ζ log d 0. Decoposing the expected suboptiality gap The proof in part follows 3]. It is easy to verify that because P is doubly stochastic the average of the iterates across the nodes at tie t, the average of the iterates across the nodes in () satisfies the following update rule: w(t + ) = w(t) η t i= g i (t). (6) We ephasize that in Algorith we do not perfor a final averaging across nodes at the end as in (). Rather, we analyze the average at a single node across its iterates (soeties called Polyak averaging). Analyzing () provides us with a way to understand how the objective J(w i (t)) evaluated at any node i s iterate w i (t) copares to the iniu value J(w ). The details can be found in Section (0.7). 4

15 To siplify notation, we treat all expectations as conditioned on the saple S. Then (6), ] w(t + ) w Ft ] = w(t) w F t + ηt g i (t) Ft i= i= η t ( w(t) w ) g i (t) F t ] i= ] = w(t) w F t + ηt g i (t) Ft η t i= ( w(t) w ) g i(t) F t ] Note that J i (w i (t)) = g i (t) F t ], so for the last ter, for each i we have J i (w i (t)) ( w(t) w ) = J i (w i (t)) ( w(t) w i (t)) + J i (w i (t)) (w i (t) w ) J i (w i (t)) w(t) w i (t) + J i (w i (t)) (w i (t) w ) J i (w i (t)) w(t) w i (t) + J i (w i (t)) J i (w ) + µ w i(t) w = J i (w i (t)) w(t) w i (t) + J i (w i (t)) J i ( w(t)) + µ w i(t) w + J i ( w(t)) J i (w ) J i (w i (t)) w(t) w i (t) + J i ( w(t)) (w i (t) w(t)) + µ w i(t) w + J i ( w(t)) J i (w ) ( J i (w i (t)) + J i ( w(t)) ) w(t) w i (t). (7) + µ w i(t) w + J i ( w(t)) J i (w ), (8) where the second and third lines coes fro applying the Cauchy-Shwartz inequality and strong convexity, the fifth line coes fro the definition of subgradient, and the last line is another application of the Cauchy- Shwartz inequality. Averaging over all the nodes, using convexity of, the definition of J( ), and Jensen s inequality yields 5

16 the following inequality: η t i= ( w(t) w ) g i(t) F t ] w(t) w i (t) ( J i (w i (t)) + J i ( w(t)) ) η t i= ( ) J i ( w(t)) J i (w ) η t η t µη t i= i= i= w i (t) w Substituting inequality (9) in recursion (7), w(t + ) w ] Ft ] w(t) w F t + ηt g i (t) Ft + η t w(t) w i (t) ( J i (w i (t)) + J i ( w(t)) ) η t (J( w(t)) J(w )) µη t w(t) w (9) i= i= w(t) w i (t) ( J i (w i (t)) + J i ( w(t)) ) η t (J( w(t)) J(w )) µη t w(t) w. (30) Taking expectations with respect to the entire history F t, w(t + ) w ] w(t) w ] + ηt i= g i (t) + η t w(t) w i (t) ( J i (w i (t)) + J i ( w(t)) )] i= η t ( J( w(t)) J(w )]) µη t w(t) w ] η t ( J( w(t)) J(w )]) + ( µη t ) w(t) w ] + ηt i= + η t w(t) w i (t) ] i= g i (t) ( J i (w i (t)) + J i ( w(t)) ) ] (3) 6

17 This lets us bound the expected suboptiality gap J( w(t)) J(w )] via three ters: T = (η t µ) w(t) w ] η t w(t + ) w ] (3) T = η t g i (t) (33) i= T3 = w(t) w i (t) ] i= ( J i (w i (t)) + J i ( w(t)) ) ], (34) where J( w(t)) J(w )] T + T + T3. (35) The reainder of the proof is to bound these three ters separately. 0.3 Network rror Bound We need to prove an interediate bound first to handle ter T3. Lea 8. Fix a Markov atrix P and consider Algorith when the objective J(w) is strongly convex we have the following inequality for the expected squared error between the iterate w i (t) at node i at tie t and the average w(t) defined in Algorith : where b = (/) log(/λ (P)). w(t) w i (t) ] L µ b log(bet ), (36) t Proof. We follow a siilar analysis as others 3, Prop. 3] 6, IV.A] 0]. Let W(t) be the d atrix whose i-th row is w i (t) and G(t) be the d atrix whose i-th row is g i (t). Then the iteration can be copactly written as W(t + ) = P(t)W(t) η t G(t) and the network average atrix W(t) = W(t). Then we can write the difference using the fact that 7

18 P(t) = P for all t: W(t + ) W(t + ) = ( ) I (PW(t) η t G(t)) = = ( ) ( ) P W(t) η t I G(t) ( ) P (PW(t ) η t G(t )) ( ) η t I G(t) = ( ) P W(t ) ( ) η t P G(t ) ( ) η t I G(t) = ( ) P W(t ) t ( ) η s P t s G(s). (37) s=t Continuing the expansion and using the fact that W() = 0, W(t + ) W(t + ) = ( ) P t W() = t ( ) η s P t s G(s) s= t ( ) η s P t s G(s) s= t ( ) = η s P t s G(s) s= η t ( I ) G(t). (38) 8

19 Now looking at the nor of the i-th row of (38) and using the bound on the gradient nor: w(t) w i (t) t η s t s= s= j= ( ) (Pt s ) ij g j (s) + η t g j(t) g i (t) j= (39) L µs (Pt s ) i + L µt. (40) We handle the ter (Pt s ) i using a bound on the ixing rate of Markov chains (c.f. (74) in Tsianos and Rabbat 0]): t s= L µs (Pt s ) i L t µ s= ( ) t s λ (P) s. (4) Define a = λ (P) and b = log(a) > 0. Then we have the following identities: t τ= a t τ+ τ = t τ= a τ t τ + = t τ= exp( bτ) t τ +. (4) Now using the fact that when x > we have exp( x) < /( + x) and using the integral upper bound we get t τ= = Using (4) and (43) in (40) we get a t τ+ τ= τ t ( + bτ)(t τ + ) ( + b)t + ( + b)t + t dτ ( + bτ)(t τ + ) log(bτ + ) log(t τ + ) bt + b + ] t τ= log(bt + ) log(b + ) + log(t) = + ( + b)t bt + b + log(et(bt + )) bt log(bet ). (43) bt Therefore we have w(t) w i (t) L log(bet ) + L µ bt µt L µ w(t) w i (t) ] L µ log(bet ). (44) bt log(bet ). (45) bt 9

20 0.4 Bounds for expected gradient nors 0.4. Bounding J i ( w(t)) ] Let β j,t l( w(t) x i,j ) denote a subgradient for the j-th point at node i and β t = (β,t, β,t,..., β n,t ) be the vector of subgradients at tie t. Let Q Si be the n n Gra atrix of the data set S i. Fro the definition of J i ( w(t)) and using the Lipschitz property of the loss functions, we have the following bound: J i ( w(t)) n n j S i β j,tx i,j j S i β j,tx i,j + µ w(t) + µ w(t) = j S i j S i β j,t β j,tx i,j x i,j n + µ w(t) = n β t Q Si β t + µ w(t) n β t σ (Q Si ) + µ w(t) L σ (Q Si ) n + µ w(t). (46) We rewrite the update (6) in ters of {x i,t }, the points sapled at the nodes at tie t: w(t + ) = w(t)( µη t ) η t i= l(w i (t) x i,t )x i,t. (47) Now fro equation (47), after unrolling the recursion as in Shalev-Shwarz et al. 4] we see w(t) = t i= l(w i(τ) x i,τ )x i,τ. (48) µ(t ) τ= Let γτ i l(w i (τ) x i,τ ) the subgradient set for the ith node coputed at tie τ, then we have w(t) µ(t ) t γτ i x i,τ. (49) Let us in turn bound for each node i the ter t. τ= γi τ x i,τ Let γ i τ l(w i (τ) x i,τ ) denote a subgradient for the point sapled at tie τ at node i and γ i = (γ, i γ, i..., γt ) i be the vector of subgradients up to tie t. We have i= τ= t γτ i x i,τ = γτ i γτ i x i,τ x i,τ τ,τ τ= = (γ i ) Q i,t γ i γ i σ (Q i,t ) (t )L σ (Q i,t ), (50) where Q i,t is the (t ) (t ) Gra subatrix corresponding to the points sapled at the i-th node until tie t. 0

21 Further bounding (49): ( w(t) i= (t )L σ (Q i,t ) µ(t ) ( ) L σ (Q i,t ) µ. t i= Since as stated before everything is conditioned on the saple split we take expectations w.r.t the history and the rando split and using the Cauchy-Schwarz inequality again, and the fact that the points are sapled i.i.d. fro the sae distribution, w(t) ] L µ L µ = L µ ) ] σ (Q i,t )σ (Q j,t ) t i= j= ] ] σ (Q i,t ) σ (Q j,t ) t t i= j= σ (Q i,t ) t ]. (5) The last line follows fro the expectation over the sapling odel: the data at node i and node j have the sae expected covariance since they are sapled uniforly at rando fro the total data. Taking the expectation in (46) and substituting (5) we have J i ( w(t)) ] ] L σ (Q Si ) n ] + L σ (Q i,t ). (5) t Since S i is a unifor rando draw fro S and by assuing both t and n to be greater than 4/(3ρ ) log(d), applying Lea 7 gives us J i ( w(t)) ] 0L ρ. (53) 0.4. Bounding J i (w i (t)) ] We have just as in the previous subsection J i (w i (t)) L σ (Q Si ) n + µ w i (t). Using the triangle inequality, the fact that (a + a ) a + a, the bounds (44) and (5), and Lea 7: w i (t) ] w i (t) w(t) ] + w(t) ] 8L µ log (bet ) b (t ) + 5L ρ µ. (54) Since the second ter does not scale with t, fro (54) we can infer that for the second ter to doinate the first we require t 8 log(t) > 5 ρb.

22 This gives us w i (t) ] 0L ρ µ, (55) and therefore J i (w i (t)) ] 30L ρ. (56) 0.5 Bound for T Because the gradients are bounded, g i (t) i= = g i (t) g i (t) i,j g i (t) ] = + g i (t) g j (t) ] i= i j g i (t) g j (t) ] L + i j i j = L + ]] F t gi (t) g j (t) F t. Now using the fact that the gradients g i (t) are unbiased estiates of J i (w t ) and that g i (t) and g j (t) are independent given past history and inequality (56) for node i and j we get i j F t gi (t) g j (t) F t ]] = i j Ft Ji (w i (t)) J j (w j (t)) ] Ft J i (w i (t)) ] Ft J j (w j (t)) ] i j ( ) = 30L ρ 30L ρ. (57) Therefore to bound the ter T in (35) we can use g i (t) L + 30L ρ. (58) i=

23 0.6 Bound for T 3 Applying (45), (53), and (56) to T3 in (35), as well as Lea 8 and the fact that (a + a ) a + a we obtain the following bound: T 3 w(t) w i (t) ] i= ( J i (w i (t)) + J i ( w(t)) ) ] L µ i= 0L µ 0.7 Cobining the Bounds b log(bet ) bt log(t ) t 0Lρ ρ. (59) Finally cobining (58) and (59) in (35) and applying the step size assuption η t = /(µt): where K 0 = ( ( 30ρ + / + 60 ρ log(t ) J( w(t)) J(w )] (η t µ) w(t) w ] η t w(t + ) w ] ( 30L ρ ) + + L µ µ t + 0L µ log(bet ) ρ b t µ(t ) w(t) w ] µt w(t + ) w ] + K 0 L ) µt, (60) ) /b, using t T and assuing T > be. Let us now define two new sequences, the average of the average of iterates over nodes fro t = to T and the average for any node i ] ŵ(t ) = T ŵ i (T ) = T T w(t) (6) t= T w i (t). (6) Then suing (60) fro t = to T, using the convexity of J and collapsing the telescoping su in the first two ters of (60), t= J(ŵ(T )) J(w )] T J( w(t)) J(w )] T t= µt w(t + ) w ] + K 0 L µ T t= /t K 0 L µ log(t ). (63) T T 3

24 Now using the definition of subgradient, Cauchy-Schwarz, and Jensen s inequality we have J(ŵ i (T )) J(w ) J(ŵ(T )) J(w ) + J(ŵ i (T )) (ŵ i (t) ŵ(t )) J(ŵ(T )) J(w ) + J(ŵ i (T ) ŵ i (t) ŵ(t ) J(ŵ(T )) J(w ) T w i (t) w(t) + J(ŵ i (T )). (64) T t= To proceed we ust bound J(ŵ i (T )) ] in a siilar way as the bound (53). First, let α i = l(ŵ i (T ) x i ) denote the subgradient for the i-th loss function of J( ) in (), evaluated at ŵ i (T ), and α T = (α, α,..., α N ) be the vector of subgradients. As before, J(ŵ i (T )) N = α i x i + µŵ i (T ) N i= N α Qα + µ ŵ i (T ) 0L ρ + µ ŵ i (T ) 0L ρ + µ T Taking expectations of both sides and using (55) as before: J(ŵ i (T )) ] 30L ρ. T w i (t). Taking expectations of both sides of (64) and using the Cauchy-Schwarz inequality, (63), the preceding gradient bound, Lea 8 and the definition of K 0 we get t= J(ŵ i (T )) J(w )] K 0 L µ log(t ) + 30L ρ log(t ) T µ b T ( K ρ ) log T log T b T ( ) 30ρ ρ log T b T t= t L µ log T T. (65) Recalling that b = log(/λ (P)) λ (P ), assuing T > be and subsuing the first ter in the third and taking expectations with respect to the saple split the above bound can be written as ( J(ŵ i (T )) J(w )] + 00 ) ρ log T λ (P ) L µ log T T. (66) Proof of Lea Proof. Let us define the product of the sequence of rando atrices {P(τ) : s τ t}: Φ(s : t) = P(t) P(s). (67) 4

25 Then proceeding as in proof of Lea 8 and using the step size η t = /(µt), we get t ( ) w(t) w i (t) η s Φ(s : t) ij g j (s) s= j= + η t g j(t) g i (t) t s= j= (68) L µs Φ(s : t)e i + L µt. (69) Let e i be a vector with 0 s everywhere except at the the ith position, then ] Φ(s : t)e i ] Φ(s : t)e i. (70) Consider the recursion u(t + ) = P(t)u(t) and let v(t + ) = P(t)u(t) then we have v(t + ) v(t + ) v(t) ] = v(t) P (t)v(t) v(t) ] = v(t) P (t) ] v(t) v(t) λ ( P (t) ]), (7) since v(t) is orthogonal to the largest eigenvector of P(t). Taking expectations w.r.t to v(t) we get v(t + ) ] v(t) ] ( λ P (t) ]). (7) Recursively expanding (7) we obtain v(t + ) ] v(0) ( λ P (t) ]) t s+. (73) Consider an initial vector u(0) = e i. We see that v(t + ) = Φ(s : t) i, this finally gives us ] Φ(s : t) i ] Φ(s : t) i e i ( λ P (t) ]) t s+ λ ( P (t) ]) t s+. (74) ( Proceeding like the proof of Lea where a = λ P (t) ]) and b = log(a) we get w(t) w i (t) ] L log(bet ). (75) µ bt Proof of Theore 3 The proof follows easily fro the proof of Theore. Proof. Since (35) still holds, we erely apply Lea in (35) and continue in the sae way as the proof of Theore. 5

26 3 Proof of Theore 4 We will first establish the network lea for schee (3). Lea 9. Fix a Markov atrix P and consider Algorith when the objective J(w) is strongly convex and the frequency of counication satisfies /ν > 4 log(d) (76) 3ρ we have the following inequality for the expected squared error between the iterate w i (t) at node i at tie t and the average w(t) defined in Algorith for schee (3) w(t) w i (t) ] 4L 5ρ log(bet ) (77) µ bt where b = (/) log(/λ (P)). Proof. It is easy to see that we can write the update equation in Algorith as w i (t + ) = P ij (t)w j (t) η t g /ν i (t) (78) where and g i (t) = g /ν i (t) + µw i (t). We need first a bound on g /ν j g /ν i j= { Pij (t) when i j P ij (t) = P ii (t) t when i = j (s) using the definition of the inibatch (sub)gradient: (s) = Fro (40) and the inibatch (sub)gradient bound w(t) w i (t) t η s s= j= + η t i ks H l(w s i i (s) x kis )x kis /ν (79) L ν Q /ν (80) ( ) ( P t s ) ij j= g/ν L ν t Q/ν L ν Q/ν t s= s= + L ν Q/ν µt L ν t Q/ν g /ν j (s) j (t) g /ν i (t) ( P t s ) i µs + L ν Q /ν µt (Pt s ) i + (P t s ) i ( P t s ) i s= µs (Pt s ) i + L ν Q/ν µs µt 6

27 have Continuing as in the proof of Lea 8, taking expectations and using Lea 7, for /ν > 4 3ρ log(d) we ν w(t) w i (t) ] 4L Q /ν ] µ 4L 5ρ µ log(bet ) bt log(bet ) bt (8) For the schee (3) all the steps until bound (35) fro proof of Theore 3 reain the sae. The difference in the rest of the proof arises priarily fro the ini batch gradient nor factor in Lea 9. We have the sae decoposition as (35) with T, T, and T3 as in (3), (33), and (34). The gradient nor bounds also don t change since the inibatch gradient is also an unbiased gradient of the true gradient J( ). Thus substituting Lea 9 in the above and following the sae steps as in proof of Theore 3, replacing T by νt where T is now the total iterations including the counication as well as the inibatch gathering rounds, we get Theore Proof of Lea 5 In the proof we will use the corresponding ultivariate norality result of Bianchi et al. 5, Theore 5]. Finally using soothness and strong convexity we shall get Lea 5. It is easy to verify that Algorith satisfies all the assuptions necessary (Assuptions, 4, 6, 7, 8a, and 8b in Bianchi et al. 5]) for the result to hold. Assuption requires the weight atrix P(t) to be row stochastic alost surely, identically distributed over tie, and that P(t)] is colun stochastic. Our Markov atrix is constant over tie and doubly stochastic. Assuption b follows because P is constant and independent of the stochastic gradients, which are sapled uniforly with replaceent. Assuption 4 requires square integrability of the gradients as well as a regularity condition. In our setting, this follows since the sapled gradients are bounded alost everywhere. Assuption 6 iposes soe analytic conditions at the optiu value. These hold since the gradient is assued to be differentiable and the Hessian atrix at w is positive definite with its sallest eigenvalue is at least µ > 0 (this follows fro strong convexity). Assuption 7 of Bianchi et al. 5] follows fro our existing assuptions. Assuptions 8a and 8b are standard stochastic approxiation assuptions on the step size that are easily satisfied by η t = µt. Next it is straightforward to show that the average over the nodes of the iterates w i (t), w i (t) for Algorith and distributed algorith of 5] are the sae and satisfy Now note that w i (t) w = w(t + ) = w(t) η t i= g i(t) w i (t + ) = w i (t + ) η t i= g i(t) w i (t) w i (t) }{{} + w i (t) w }{{} T=Network rror T=Asyptotically Noral (8) (83) Fro Lea 8 we know that the network error (T) decays and fro update equation (8) we know that the averaged iterates for both the versions are the sae. Then the proof of Theore 5 of Bianchi et 7

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

A Simple Regression Problem

A Simple Regression Problem A Siple Regression Proble R. M. Castro March 23, 2 In this brief note a siple regression proble will be introduced, illustrating clearly the bias-variance tradeoff. Let Y i f(x i ) + W i, i,..., n, where

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory Proble sets 5 and 6 Due: Noveber th Please send your solutions to learning-subissions@ttic.edu Notations/Definitions Recall the definition of saple based Radeacher

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee227c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee227c@berkeley.edu October

More information

Distributed Subgradient Methods for Multi-agent Optimization

Distributed Subgradient Methods for Multi-agent Optimization 1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018: Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee7c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee7c@berkeley.edu October 15,

More information

E0 370 Statistical Learning Theory Lecture 6 (Aug 30, 2011) Margin Analysis

E0 370 Statistical Learning Theory Lecture 6 (Aug 30, 2011) Margin Analysis E0 370 tatistical Learning Theory Lecture 6 (Aug 30, 20) Margin Analysis Lecturer: hivani Agarwal cribe: Narasihan R Introduction In the last few lectures we have seen how to obtain high confidence bounds

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a

More information

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

1 Bounding the Margin

1 Bounding the Margin COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #12 Scribe: Jian Min Si March 14, 2013 1 Bounding the Margin We are continuing the proof of a bound on the generalization error of AdaBoost

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

Multi-Dimensional Hegselmann-Krause Dynamics

Multi-Dimensional Hegselmann-Krause Dynamics Multi-Diensional Hegselann-Krause Dynaics A. Nedić Industrial and Enterprise Systes Engineering Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu B. Touri Coordinated Science Laboratory

More information

Convex Programming for Scheduling Unrelated Parallel Machines

Convex Programming for Scheduling Unrelated Parallel Machines Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

Machine Learning Basics: Estimators, Bias and Variance

Machine Learning Basics: Estimators, Bias and Variance Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley osig 1 Winter Seester 2018 Lesson 6 27 February 2018 Outline Perceptrons and Support Vector achines Notation...2 Linear odels...3 Lines, Planes

More information

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13 CSE55: Randoied Algoriths and obabilistic Analysis May 6, Lecture Lecturer: Anna Karlin Scribe: Noah Siegel, Jonathan Shi Rando walks and Markov chains This lecture discusses Markov chains, which capture

More information

Interactive Markov Models of Evolutionary Algorithms

Interactive Markov Models of Evolutionary Algorithms Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

CS Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

Tight Complexity Bounds for Optimizing Composite Objectives

Tight Complexity Bounds for Optimizing Composite Objectives Tight Coplexity Bounds for Optiizing Coposite Objectives Blake Woodworth Toyota Technological Institute at Chicago Chicago, IL, 60637 blake@ttic.edu Nathan Srebro Toyota Technological Institute at Chicago

More information

The proofs of Theorem 1-3 are along the lines of Wied and Galeano (2013).

The proofs of Theorem 1-3 are along the lines of Wied and Galeano (2013). A Appendix: Proofs The proofs of Theore 1-3 are along the lines of Wied and Galeano (2013) Proof of Theore 1 Let D[d 1, d 2 ] be the space of càdlàg functions on the interval [d 1, d 2 ] equipped with

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

OPTIMIZATION in multi-agent networks has attracted

OPTIMIZATION in multi-agent networks has attracted Distributed constrained optiization and consensus in uncertain networks via proxial iniization Kostas Margellos, Alessandro Falsone, Sione Garatti and Maria Prandini arxiv:603.039v3 [ath.oc] 3 May 07 Abstract

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley ENSIAG 2 / osig 1 Second Seester 2012/2013 Lesson 20 2 ay 2013 Kernel ethods and Support Vector achines Contents Kernel Functions...2 Quadratic

More information

Chaotic Coupled Map Lattices

Chaotic Coupled Map Lattices Chaotic Coupled Map Lattices Author: Dustin Keys Advisors: Dr. Robert Indik, Dr. Kevin Lin 1 Introduction When a syste of chaotic aps is coupled in a way that allows the to share inforation about each

More information

Analyzing Simulation Results

Analyzing Simulation Results Analyzing Siulation Results Dr. John Mellor-Cruey Departent of Coputer Science Rice University johnc@cs.rice.edu COMP 528 Lecture 20 31 March 2005 Topics for Today Model verification Model validation Transient

More information

Constrained Consensus and Optimization in Multi-Agent Networks arxiv: v2 [math.oc] 17 Dec 2008

Constrained Consensus and Optimization in Multi-Agent Networks arxiv: v2 [math.oc] 17 Dec 2008 LIDS Report 2779 1 Constrained Consensus and Optiization in Multi-Agent Networks arxiv:0802.3922v2 [ath.oc] 17 Dec 2008 Angelia Nedić, Asuan Ozdaglar, and Pablo A. Parrilo February 15, 2013 Abstract We

More information

Recovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup)

Recovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup) Recovering Data fro Underdeterined Quadratic Measureents (CS 229a Project: Final Writeup) Mahdi Soltanolkotabi Deceber 16, 2011 1 Introduction Data that arises fro engineering applications often contains

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

Introduction to Machine Learning. Recitation 11

Introduction to Machine Learning. Recitation 11 Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

PAC-Bayes Analysis Of Maximum Entropy Learning

PAC-Bayes Analysis Of Maximum Entropy Learning PAC-Bayes Analysis Of Maxiu Entropy Learning John Shawe-Taylor and David R. Hardoon Centre for Coputational Statistics and Machine Learning Departent of Coputer Science University College London, UK, WC1E

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution Testing approxiate norality of an estiator using the estiated MSE and bias with an application to the shape paraeter of the generalized Pareto distribution J. Martin van Zyl Abstract In this work the norality

More information

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010 A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING By Eanuel J Candès Yaniv Plan Technical Report No 200-0 Noveber 200 Departent of Statistics STANFORD UNIVERSITY Stanford, California 94305-4065

More information

A Simplified Analytical Approach for Efficiency Evaluation of the Weaving Machines with Automatic Filling Repair

A Simplified Analytical Approach for Efficiency Evaluation of the Weaving Machines with Automatic Filling Repair Proceedings of the 6th SEAS International Conference on Siulation, Modelling and Optiization, Lisbon, Portugal, Septeber -4, 006 0 A Siplified Analytical Approach for Efficiency Evaluation of the eaving

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

A Probabilistic and RIPless Theory of Compressed Sensing

A Probabilistic and RIPless Theory of Compressed Sensing A Probabilistic and RIPless Theory of Copressed Sensing Eanuel J Candès and Yaniv Plan 2 Departents of Matheatics and of Statistics, Stanford University, Stanford, CA 94305 2 Applied and Coputational Matheatics,

More information

Optimal Jamming Over Additive Noise: Vector Source-Channel Case

Optimal Jamming Over Additive Noise: Vector Source-Channel Case Fifty-first Annual Allerton Conference Allerton House, UIUC, Illinois, USA October 2-3, 2013 Optial Jaing Over Additive Noise: Vector Source-Channel Case Erah Akyol and Kenneth Rose Abstract This paper

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations Randoized Accuracy-Aware Progra Transforations For Efficient Approxiate Coputations Zeyuan Allen Zhu Sasa Misailovic Jonathan A. Kelner Martin Rinard MIT CSAIL zeyuan@csail.it.edu isailo@it.edu kelner@it.edu

More information

Physics 215 Winter The Density Matrix

Physics 215 Winter The Density Matrix Physics 215 Winter 2018 The Density Matrix The quantu space of states is a Hilbert space H. Any state vector ψ H is a pure state. Since any linear cobination of eleents of H are also an eleent of H, it

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Average Consensus and Gossip Algorithms in Networks with Stochastic Asymmetric Communications

Average Consensus and Gossip Algorithms in Networks with Stochastic Asymmetric Communications Average Consensus and Gossip Algoriths in Networks with Stochastic Asyetric Counications Duarte Antunes, Daniel Silvestre, Carlos Silvestre Abstract We consider that a set of distributed agents desire

More information

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors New Slack-Monotonic Schedulability Analysis of Real-Tie Tasks on Multiprocessors Risat Mahud Pathan and Jan Jonsson Chalers University of Technology SE-41 96, Göteborg, Sweden {risat, janjo}@chalers.se

More information

TEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES

TEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES TEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES S. E. Ahed, R. J. Tokins and A. I. Volodin Departent of Matheatics and Statistics University of Regina Regina,

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies Approxiation in Stochastic Scheduling: The Power of -Based Priority Policies Rolf Möhring, Andreas Schulz, Marc Uetz Setting (A P p stoch, r E( w and (B P p stoch E( w We will assue that the processing

More information

Ocean 420 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers

Ocean 420 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers Ocean 40 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers 1. Hydrostatic Balance a) Set all of the levels on one of the coluns to the lowest possible density.

More information

arxiv: v2 [cs.lg] 30 Mar 2017

arxiv: v2 [cs.lg] 30 Mar 2017 Batch Renoralization: Towards Reducing Minibatch Dependence in Batch-Noralized Models Sergey Ioffe Google Inc., sioffe@google.co arxiv:1702.03275v2 [cs.lg] 30 Mar 2017 Abstract Batch Noralization is quite

More information

A new type of lower bound for the largest eigenvalue of a symmetric matrix

A new type of lower bound for the largest eigenvalue of a symmetric matrix Linear Algebra and its Applications 47 7 9 9 www.elsevier.co/locate/laa A new type of lower bound for the largest eigenvalue of a syetric atrix Piet Van Mieghe Delft University of Technology, P.O. Box

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

Lecture 9 November 23, 2015

Lecture 9 November 23, 2015 CSC244: Discrepancy Theory in Coputer Science Fall 25 Aleksandar Nikolov Lecture 9 Noveber 23, 25 Scribe: Nick Spooner Properties of γ 2 Recall that γ 2 (A) is defined for A R n as follows: γ 2 (A) = in{r(u)

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

A method to determine relative stroke detection efficiencies from multiplicity distributions

A method to determine relative stroke detection efficiencies from multiplicity distributions A ethod to deterine relative stroke detection eiciencies ro ultiplicity distributions Schulz W. and Cuins K. 2. Austrian Lightning Detection and Inoration Syste (ALDIS), Kahlenberger Str.2A, 90 Vienna,

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Deflation of the I-O Series Some Technical Aspects. Giorgio Rampa University of Genoa April 2007

Deflation of the I-O Series Some Technical Aspects. Giorgio Rampa University of Genoa April 2007 Deflation of the I-O Series 1959-2. Soe Technical Aspects Giorgio Rapa University of Genoa g.rapa@unige.it April 27 1. Introduction The nuber of sectors is 42 for the period 1965-2 and 38 for the initial

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS An Iproved Method for Bandwidth Selection When Estiating ROC Curves Peter G Hall and Rob J Hyndan Working Paper 11/00 An iproved

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

Lecture 12: Ensemble Methods. Introduction. Weighted Majority. Mixture of Experts/Committee. Σ k α k =1. Isabelle Guyon

Lecture 12: Ensemble Methods. Introduction. Weighted Majority. Mixture of Experts/Committee. Σ k α k =1. Isabelle Guyon Lecture 2: Enseble Methods Isabelle Guyon guyoni@inf.ethz.ch Introduction Book Chapter 7 Weighted Majority Mixture of Experts/Coittee Assue K experts f, f 2, f K (base learners) x f (x) Each expert akes

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

Error Exponents in Asynchronous Communication

Error Exponents in Asynchronous Communication IEEE International Syposiu on Inforation Theory Proceedings Error Exponents in Asynchronous Counication Da Wang EECS Dept., MIT Cabridge, MA, USA Eail: dawang@it.edu Venkat Chandar Lincoln Laboratory,

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

Low-complexity, Low-memory EMS algorithm for non-binary LDPC codes

Low-complexity, Low-memory EMS algorithm for non-binary LDPC codes Low-coplexity, Low-eory EMS algorith for non-binary LDPC codes Adrian Voicila,David Declercq, François Verdier ETIS ENSEA/CP/CNRS MR-85 954 Cergy-Pontoise, (France) Marc Fossorier Dept. Electrical Engineering

More information