Convergence rates for distributed stochastic optimization over random networks

Dusan Jakovetic, Dragana Bajovic, Anit Kumar Sahu, and Soummya Kar

Abstract. We establish the O(1/k) convergence rate for distributed stochastic gradient methods that operate over strongly convex costs and random networks. The considered class of methods is standard: each node performs a weighted average of its own and its neighbors' solution estimates (consensus), and takes a negative step with respect to a noisy version of its local function's gradient (innovation). The underlying communication network is modeled through a sequence of temporally independent identically distributed (i.i.d.) Laplacian matrices, connected on average, while the local gradient noises are also i.i.d. in time, have finite second moment, and possibly unbounded support. We show that, after a careful setting of the consensus and innovations potentials (weights), the distributed stochastic gradient method achieves the (order-optimal) O(1/k) convergence rate in the mean square distance from the solution. This is the first order-optimal convergence rate result on distributed strongly convex stochastic optimization when the network is random and/or the gradient noises have unbounded support. Simulation examples confirm the theoretical findings.

The work of DJ and DB was supported in part by the EU Horizon 2020 project I-BiDaaS. The work of D. Jakovetic was also supported in part by a grant of the Serbian Ministry of Education, Science, and Technological Development. The work of AKS and SK was supported in part by the National Science Foundation under a CCF grant. D. Bajovic is with the Faculty of Technical Sciences, University of Novi Sad, 21000 Novi Sad, Serbia (dbajovic@uns.ac.rs). D. Jakovetic is with the Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, 21000 Novi Sad, Serbia (djakovet@uns.ac.rs). A. K. Sahu and S. Kar are with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 ({anits,soummya}@andrew.cmu.edu).

1. INTRODUCTION

Distributed optimization and learning algorithms have attracted great interest in recent years, thanks to their widespread applications, including distributed estimation in networked systems, e.g., [1], distributed control, e.g., [2], and big data analytics, e.g., [3]. In this paper, we study distributed stochastic optimization algorithms that operate over random networks and minimize smooth strongly convex costs. We consider standard distributed stochastic gradient methods where, at each time step, each node makes a weighted average of its own and its neighbors' solution estimates, and performs a step in the negative direction of its noisy local gradient. The underlying network is allowed to be randomly varying, similarly to, e.g., the models in [4]-[6]. More specifically, the network is modeled through a sequence of independent identically distributed (i.i.d.) graph Laplacian matrices, where the network is assumed to be connected on average. (This translates into the requirement that the algebraic connectivity of the mean Laplacian matrix is strictly positive.) Random network models are highly relevant in, e.g., internet of things (IoT) and cyber-physical systems (CPS) applications, like, e.g., predictive maintenance and monitoring in industrial manufacturing systems, monitoring of smart buildings, etc. Therein, networked nodes often communicate through unreliable/intermittent wireless links, due to, e.g., low-power transmissions or harsh environments. The main contributions of the paper are as follows.
We show that, by carefully designing the consensus and the gradient weights (potentials), the considered distributed stochastic gradient algorithm achieves the order-optimal O(1/k) rate of decay of the mean squared distance from the solution (mean squared error, MSE). This is achieved for twice continuously differentiable strongly convex local costs, assuming also that the noisy gradients are unbiased estimates of the true gradients and that the noise in the gradients has bounded second moment. To the best of our knowledge, this is the first time an order-optimal convergence rate for distributed strongly convex stochastic optimization has been established for random networks.

We now briefly review the literature to help us contrast this paper with prior work. In the context of the extensive literature on distributed optimization, the most relevant to our work are the references on: 1) distributed strongly convex stochastic (sub)gradient methods; and 2) distributed (sub)gradient methods over random networks (both deterministic and stochastic methods). For the former thread of works, several papers give explicit convergence rates under different assumptions. Regarding the underlying network, references [7], [8] consider static networks, while the works [9]-[11] consider deterministic time-varying networks. References [7], [8] consider distributed strongly convex optimization for static networks, assuming that the data distributions that underlie each node's local cost function are equal (reference [7] considers empirical risks, while reference [8] considers risk functions in the form of an expectation); this essentially corresponds to each node's local function having the same minimizer. References [9]-[11] consider deterministically varying networks, assuming that the union graph over finite windows of iterations is connected. The papers [7]-[10] assume undirected networks, while [11] allows for directed networks but assumes a bounded support for the gradient noise. The works [7], [9]-[11] allow the local costs to be non-smooth, while [8] assumes smooth costs, as we do here. With respect to these works, we consider random networks, undirected networks, smooth costs, and allow the noise to have unbounded support.

Distributed optimization over random networks has been studied in [4]-[6]. References [4], [5] consider non-differentiable convex costs and no (sub)gradient noise, while reference [6] considers differentiable costs with Lipschitz continuous and bounded gradients, and it also does not allow for gradient noise, i.e., it considers methods with exact (deterministic) gradients. In [12], we consider a distributed Kiefer-Wolfowitz-type stochastic approximation method and establish the method's O(1/k^{1/2}) convergence rate. Reference [12] complements the current paper by assuming that nodes only have access to noisy function estimates (zeroth order optimization), and no gradient estimates are available. In contrast, by assuming that noisy first-order (gradient) information is available, we show here that strictly faster rates than in [12] can be achieved. In summary, to the best of our knowledge, this is the first work to establish an order-optimal convergence rate for distributed stochastic gradient methods under strongly convex costs and random networks.

Paper organization. The next paragraph introduces notation. Section 2 describes the model and the stochastic gradient method we consider. Section 3 states and proves the main result on the algorithm's MSE convergence rate. Section 4 provides a simulation example. Finally, we conclude in Section 5.

Notation. We denote by R the set of real numbers and by R^m the m-dimensional Euclidean real coordinate space. We use normal lower-case letters for scalars, lower-case boldface letters for vectors, and upper-case boldface letters for matrices. Further, we denote by: A_{ij} the entry in the i-th row and j-th column of a matrix A; A^T the transpose of a matrix A; ⊗ the Kronecker product of matrices; I, 0, and 1, respectively, the identity matrix, the zero matrix, and the column vector with unit entries; J the N x N matrix J := (1/N) 1 1^T. When necessary, we indicate the matrix or vector dimension through a subscript. Next, A ≻ 0 (A ⪰ 0) means that the symmetric matrix A is positive definite (respectively, positive semi-definite). We further denote by: ‖·‖ = ‖·‖_2 the Euclidean (respectively, spectral) norm of its vector (respectively, matrix) argument; λ_i(·) the i-th smallest eigenvalue; ∇h(w) and ∇^2 h(w) the gradient and the Hessian, respectively, evaluated at w, of a function h: R^m → R, m ≥ 1; P(A) and E[u] the probability of an event A and the expectation of a random variable u, respectively. Finally, for two positive sequences η_n and χ_n, we write η_n = O(χ_n) if lim sup_{n→∞} η_n/χ_n < ∞.

2. MODEL AND ALGORITHM

A. Optimization and network models

We consider the scenario where N networked nodes aim to collaboratively solve the following unconstrained problem:

minimize_{x ∈ R^m}  Σ_{i=1}^N f_i(x),    (1)

where f_i: R^m → R is a convex function available to node i, i = 1,...,N. We make the following standard assumption on the f_i's.

Assumption 1. For all i = 1,...,N, the function f_i: R^m → R is twice continuously differentiable, and there exist constants 0 < µ ≤ L < ∞ such that, for all x ∈ R^m, there holds:

µ I ⪯ ∇^2 f_i(x) ⪯ L I.

Assumption 1 implies that each f_i is strongly convex with modulus µ, and that it has Lipschitz continuous gradient with Lipschitz constant L, i.e., the following two inequalities hold for any x, y ∈ R^m:

f_i(y) ≥ f_i(x) + ∇f_i(x)^T (y - x) + (µ/2) ‖x - y‖^2,
‖∇f_i(x) - ∇f_i(y)‖ ≤ L ‖x - y‖.

Furthermore, under Assumption 1, problem (1) is solvable and has a unique solution, which we denote by x* ∈ R^m. For future reference, we also introduce the sum function f: R^m → R, f(x) = Σ_{i=1}^N f_i(x). We consider distributed stochastic gradient methods to solve (1) over random networks. Specifically, we adopt the following model.
At each time instant k = 0, 1,..., the underlying network G(k) = (V, E(k)) is undirected and random, with V = {1,...,N} the set of nodes and E(k) the random set of undirected edges. We denote by {i, j} the edge that connects nodes i and j. Further, denote by Ω_i(k) = {j ∈ V : {i, j} ∈ E(k)} the random neighborhood of node i at time k (excluding node i). We associate with G(k) its (symmetric) Laplacian matrix L(k), defined by: L_{ij}(k) = -1, if {i, j} ∈ E(k), i ≠ j; L_{ij}(k) = 0, if {i, j} ∉ E(k), i ≠ j; and L_{ii}(k) = -Σ_{j ≠ i} L_{ij}(k). Denote by L̄ = E[L(k)] (as explained ahead, the expectation does not depend on k), and let Ḡ = (V, Ē) be the graph induced by matrix L̄, i.e., Ē = {{i, j} : i ≠ j, -L̄_{ij} > 0}. We make the following assumption.

Assumption 2. The matrices {L(k)} are independent, identically distributed (i.i.d.). Furthermore, graph Ḡ is connected.

It is well-known that the connectedness of Ḡ is equivalent to the condition λ_2(L̄) > 0.

B. Gradient noise model and the algorithm

We consider the following distributed stochastic gradient method to solve (1). Each node i, i = 1,...,N, maintains over time steps (iterations) k = 0, 1,... its solution estimate x_i(k) ∈ R^m. Specifically, for arbitrary deterministic initial points x_i(0) ∈ R^m, i = 1,...,N, the update rule at node i and k = 0, 1,... is as follows:

x_i(k+1) = x_i(k) - β_k Σ_{j ∈ Ω_i(k)} ( x_i(k) - x_j(k) ) - α_k ( ∇f_i(x_i(k)) + v_i(k) ).    (2)

The update (2) is performed in parallel by all nodes i = 1,...,N. Iteration k of the algorithm is realized as follows. First, each node i broadcasts x_i(k) to all its available neighbors j ∈ Ω_i(k), and receives x_j(k) from all j ∈ Ω_i(k). Subsequently, each node i, i = 1,...,N, makes update (2), which completes iteration k. In (2), α_k is the step-size that we set to α_k = α_0/(k+1), k = 0, 1,..., with α_0 > 0; and β_k is the (possibly) time-varying weight that each node assigns to all its neighbors. We set β_k = β_0/(k+1)^ν, k = 0, 1,..., with ν ∈ (0, 1/2]. Here, β_0 > 0 is a constant that should be taken to be sufficiently small; e.g., one can set β_0 = 1/(1+θ), where θ is the maximal degree (number of neighbors of a node) across the network. Finally, v_i(k) is the noise in the calculation of the gradient of f_i at iteration k.

For future reference, we also present algorithm (2) in matrix format. Denote by x(k) = (x_1(k)^T,..., x_N(k)^T)^T ∈ R^{Nm} the vector that stacks the solution estimates of all nodes. Also, define the function F: R^{Nm} → R by F(x) = Σ_{i=1}^N f_i(x_i), with x = (x_1^T,..., x_N^T)^T ∈ R^{Nm}. Finally, let W_k = (I - L_k) ⊗ I_m, where L_k = β_k L(k). Then, for k = 0, 1,..., algorithm (2) can be compactly written as follows:

x(k+1) = W_k x(k) - α_k ( ∇F(x(k)) + v(k) ).    (3)

We make the following standard assumption on the gradient noises. First, denote by F_k the history of algorithm (2) up to time k; that is, F_k, k = 1, 2,..., is an increasing sequence of sigma-algebras, where F_k is the sigma-algebra generated by the collection of random variables {L(s), v_i(t)}, i = 1,...,N, s = 0,...,k, t = 0,...,k-1.

Assumption 3. For each i = 1,...,N, the sequence of noises {v_i(k)} satisfies, for all k = 0, 1,...:

E[ v_i(k) | F_k ] = 0, almost surely (a.s.),    (4)
E[ ‖v_i(k)‖^2 | F_k ] ≤ c_v ‖x_i(k)‖^2 + c'_v, a.s.,    (5)

where c_v and c'_v are nonnegative constants.

Assumption 3 is satisfied, for example, when {v_i(k)} is an i.i.d., zero-mean, finite second moment noise sequence such that v_i(k) is also independent of the history F_k. However, the assumption allows the gradient noise v_i(k) to depend on node i and also on the current point x_i(k); the next subsection gives some important machine learning settings encompassed by Assumption 3.

C. A machine learning motivation

The optimization-algorithmic model defined by Assumptions 1 and 3 subsumes, e.g., important machine learning applications. Consider the scenario where f_i corresponds to the risk function associated with node i's local data, i.e.,

f_i(x) = E_{d_i ~ P_i}[ l_i(x; d_i) ] + Ψ_i(x).    (6)

Here, P_i is node i's local distribution according to which its data samples d_i ∈ R^q are generated; l_i(·;·) is a loss function that is convex in its first argument for any fixed value of its second argument; and Ψ_i: R^m → R is a strongly convex regularizer. Similarly, f_i can be an empirical risk function:

f_i(x) = (1/n_i) Σ_{j=1}^{n_i} l_i(x; d_{i,j}) + Ψ_i(x),    (7)

where d_{i,j}, j = 1,...,n_i, is the set of training examples at node i. Examples of the loss l_i(·;·) include the following:

l_i(x; (a_i, b_i)) = (1/2) (a_i^T x - b_i)^2   (quadratic loss),    (8)
l_i(x; (a_i, b_i)) = ln( 1 + exp(-b_i a_i^T x) )   (logistic loss).

For the quadratic loss above, a data sample is d_i = (a_i, b_i), where a_i is a regressor vector and b_i is a response variable; for the logistic loss, a_i is a feature vector and b_i ∈ {-1, +1} is its class label. Clearly, both the risk (6) and the empirical risk (7) satisfy Assumption 1 for the losses in (8).

We next discuss the search directions in (2) and Assumption 3 for the gradient noise. A common search direction in machine learning algorithms is the gradient of the loss with respect to a single data point (similar considerations hold for a loss with respect to a mini-batch of data points; this discussion is omitted for simplicity):

g_i(x) = ∇l_i(x; d_i) + ∇Ψ_i(x).

In the case of the risk function (6), d_i is drawn from the distribution P_i; in the case of the empirical risk (7), d_i can be, e.g., drawn uniformly at random from the set of data points d_{i,j}, j = 1,...,n_i, with repetition along iterations. In both cases, the gradient noise v_i = g_i(x) - ∇f_i(x) clearly satisfies assumption (4).
To see this, consider, for example, the risk function (6), and fix iteration k and node i's estimate x_i(k) = x_i. Then,

E[ v_i(k) | F_k ] = E[ g_i(x_i(k)) - ∇f_i(x_i(k)) | x_i(k) = x_i ]
= E[ ∇l_i(x_i(k); d_i) | x_i(k) = x_i ] + ∇Ψ_i(x_i) - ∇f_i(x_i)
= E_{d_i ~ P_i}[ ∇l_i(x_i; d_i) ] + ∇Ψ_i(x_i) - ( E_{d_i ~ P_i}[ ∇l_i(x_i; d_i) ] + ∇Ψ_i(x_i) ) = 0.

Further, for the empirical risk (7), assumption (5) holds trivially. For the risk function (6), assumption (5) holds for a sufficiently regular distribution P_i. For instance, it is easy to show that the assumption holds for the logistic loss in (8) when P_i has a finite second moment, while it holds for the quadratic loss in (8) when P_i has a finite fourth moment.

Note that our setting allows the data at different nodes to be generated through different distributions P_i, as well as that the nodes utilize different losses l_i and regularizers Ψ_i. Mathematically, this means that, in general, ∇f_i(x*) ≠ 0. In words, if a node i relies only on its local data d_i, it cannot recover the true solution x*. The nodes therefore engage in the collaborative algorithm (2) through which, as shown ahead, they can recover the global solution x*.
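To make the single-sample search direction g_i and the noise model concrete in the empirical-risk setting (7) with the l_2-regularized logistic loss from (8), the following minimal Python sketch computes g_i(x) and the exact local gradient ∇f_i(x); the function and variable names are illustrative choices, not notation from the analysis above. Because the data point is drawn uniformly at random, the conditional mean of g_i(x) equals ∇f_i(x), so the noise v_i = g_i(x) - ∇f_i(x) satisfies (4).

import numpy as np

def logistic_loss_grad(x, a, b, kappa):
    # Gradient of the l2-regularized logistic loss
    #   l(x; (a, b)) = ln(1 + exp(-b * a^T x)) + (kappa/2) ||x||^2
    # at a single data point (a, b).
    s = -b * np.dot(a, x)
    sigma = 1.0 / (1.0 + np.exp(-s))   # sigmoid(s) = d/ds ln(1 + e^s)
    return -b * sigma * a + kappa * x

def stochastic_gradient(x, A_i, b_i, kappa, rng):
    # Search direction g_i(x): the loss gradient at one training example
    # drawn uniformly at random from node i's data set (A_i, b_i).
    j = rng.integers(len(b_i))
    return logistic_loss_grad(x, A_i[j], b_i[j], kappa)

def full_gradient(x, A_i, b_i, kappa):
    # Exact gradient of the local empirical risk f_i in (7).
    grads = [logistic_loss_grad(x, a, b, kappa) for a, b in zip(A_i, b_i)]
    return np.mean(grads, axis=0)

Averaging stochastic_gradient over many independent draws at a fixed x approaches full_gradient(x, ...), which gives a quick numerical check of the unbiasedness used above.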

3. PERFORMANCE ANALYSIS

A. Statement of main results and auxiliary lemmas

We are now ready to state our main result.

Theorem 3.1. Consider algorithm (2) with step-sizes α_k = α_0/(k+1) and β_k = β_0/(k+1)^ν, where β_0 > 0, α_0 > 2/µ, and ν ∈ (0, 1/2]. Further, let Assumptions 1-3 hold. Then, for each node i's solution estimate x_i(k) and the solution x* of problem (1), there holds:

E[ ‖x_i(k) - x*‖^2 ] = O(1/k).

We remark that the condition α_0 > 2/µ can be relaxed to require only a positive α_0, in which case the rate becomes O(ln(k)/k) instead of O(1/k). (This subtlety comes from equation (32) ahead and the requirement that c_20 > 1; if c_20 ≤ 1, it can be shown that the right-hand side in (33) modifies to an O(ln(k)/k) quantity.) Also, to avoid large step-sizes at the initial iterations for a large α_0, the step-size can be modified to α_k = α_0/(k + k_0), for an arbitrary positive constant k_0, and Theorem 3.1 continues to hold.

Theorem 3.1 establishes the O(1/k) MSE convergence rate of algorithm (2); due to the assumed strong convexity of the f_i's, the theorem also implies that E[ f(x_i(k)) - f(x*) ] = O(1/k). Note that the expectation in Theorem 3.1 is both with respect to the randomness in the gradient noises and with respect to the randomness in the underlying network. The O(1/k) rate does not depend on the statistics of the underlying random network, as long as the network is connected on average (i.e., satisfies Assumption 2). The hidden constant does depend on the underlying network statistics, but simulation examples suggest that the dependence is usually not strong (see Section 4).

Proof strategy and auxiliary lemmas. Our strategy for proving Theorem 3.1 is as follows. We first establish the mean square boundedness (uniform in k) of the iterates x_i(k), which also implies the uniform mean square boundedness of the gradients ∇f_i(x_i(k)) (Subsection 3-B). We then bound, in the mean square sense, the disagreements between different nodes' estimates, i.e., the quantities x_i(k) - x_j(k), showing that E[ ‖x_i(k) - x_j(k)‖^2 ] = O(1/k) (Subsection 3-C). This allows us to show that the (hypothetical) global average of the nodes' solution estimates, x̄(k) := (1/N) Σ_{i=1}^N x_i(k), evolves according to a stochastic gradient method whose gradient estimates have a sufficiently small bias and a finite second moment. This, in turn, allows us to show the O(1/k) rate on the mean square error at the global average, from which a similar bound at the individual nodes' estimates is derived (Subsection 3-D).

In completing the strategy above, we make use of the following lemma; the lemma is a minor modification of Lemmas 4 and 5 in [13].

Lemma 3.2. Let z(k) be a nonnegative (deterministic) sequence satisfying:

z(k+1) ≤ ( 1 - r_1(k) ) z(k) + r_2(k),

where {r_1(k)} and {r_2(k)} are deterministic sequences with

a_1/(k+1)^{δ_1} ≤ r_1(k) ≤ 1 and r_2(k) ≤ a_2/(k+1)^{δ_2},

with a_1, a_2, δ_1, δ_2 > 0. Then: (a) if δ_1 = δ_2 = 1, there holds z(k) = O(1); (b) if δ_1 = 1/2 and δ_2 = 3/2, then z(k) = O(1/k); and (c) if δ_1 = 1, δ_2 = 2, and a_1 > 1, then z(k) = O(1/k).

The subsequent analysis in Subsections 3-B through 3-D restricts to the case ν = 1/2, i.e., to the consensus weights β_k = β_0/(k+1)^{1/2}. That is, for simplicity of presentation, we prove Theorem 3.1 for the case ν = 1/2. As can be verified from the subsequent analysis, the proof of Theorem 3.1 extends to a generic ν ∈ (0, 1/2) as well. As another step in simplifying the notation, throughout Subsections 3-B and 3-C we let m = 1, to avoid extensive usage of Kronecker products; again, the proofs extend to a generic m > 1.

B. Mean square boundedness of the iterates

This subsection shows the uniform mean square boundedness of the algorithm iterates and of the gradients evaluated at the algorithm iterates.

Lemma 3.3. Consider algorithm (2), and let Assumptions 1-3 hold. Then, there exist nonnegative constants c_x and c_f such that, for all k = 0, 1,..., there holds:

E[ ‖x(k)‖^2 ] ≤ c_x and E[ ‖∇F(x(k))‖^2 ] ≤ c_f.

Proof. Denote x_o := x* 1 (the vector in R^N with all entries equal to x*; recall that we take m = 1), and recall (3).
Then, we have:

x(k+1) - x_o = W_k ( x(k) - x_o ) - α_k ( ∇F(x(k)) - ∇F(x_o) ) - α_k v(k) - α_k ∇F(x_o).    (9)

By the mean value theorem, we have:

∇F(x(k)) - ∇F(x_o) = [ ∫_0^1 ∇^2 F( x_o + s (x(k) - x_o) ) ds ] ( x(k) - x_o ) =: H_k ( x(k) - x_o ).    (10)

Note that µ I ⪯ H_k ⪯ L I. Using (10) in (9), we have:

x(k+1) - x_o = ( W_k - α_k H_k ) ( x(k) - x_o ) - α_k v(k) - α_k ∇F(x_o).    (11)

Denote ζ(k) := x(k) - x_o and ξ(k) := ( W_k - α_k H_k ) ( x(k) - x_o ) - α_k ∇F(x_o). Then, there holds:

E[ ‖ζ(k+1)‖^2 | F_k ] = ‖ξ(k)‖^2 - 2 α_k ξ(k)^T E[ v(k) | F_k ] + α_k^2 E[ ‖v(k)‖^2 | F_k ]
≤ ‖ξ(k)‖^2 + α_k^2 ( c_v ‖x(k)‖^2 + N c'_v ), a.s.,    (12)

where we used Assumption 3 and the fact that ξ(k) is measurable with respect to F_k. We next bound ‖ξ(k)‖^2. Note that ‖W_k - α_k H_k‖ ≤ 1 - µ α_k. Therefore, we have:

‖ξ(k)‖ ≤ ( 1 - µ α_k ) ‖ζ(k)‖ + α_k ‖∇F(x_o)‖.    (13)

We now use the following inequality:

( a + b )^2 ≤ ( 1 + θ ) a^2 + ( 1 + 1/θ ) b^2,    (14)

for any a, b ∈ R and θ > 0. We set θ = c_10/(k+1), with c_10 > 0. Using inequality (14) in (13), we have:

‖ξ(k)‖^2 ≤ ( 1 + c_10/(k+1) ) ( 1 - α_k µ )^2 ‖ζ(k)‖^2 + ( 1 + (k+1)/c_10 ) α_k^2 ‖∇F(x_o)‖^2.

Next, for c_10 < α_0 µ, the last inequality implies:

‖ξ(k)‖^2 ≤ ( 1 - c_1/(k+1) ) ‖ζ(k)‖^2 + c_2/(k+1) ‖∇F(x_o)‖^2,    (15)

for some constants c_1, c_2 > 0. Combining (15) and (12), we get:

E[ ‖ζ(k+1)‖^2 | F_k ] ≤ ( 1 - c_1/(k+1) ) ‖ζ(k)‖^2 + c_2/(k+1),    (16)

for some (possibly redefined) constants c_1, c_2 > 0. Taking expectation in (16) and applying Lemma 3.2(a), it follows that E[ ‖ζ(k)‖^2 ] = E[ ‖x(k) - x_o‖^2 ] is (uniformly in k) bounded from above by a positive constant. It is easy to see that the latter implies that E[ ‖x(k)‖^2 ] is also uniformly bounded. Using the Lipschitz continuity of ∇F, we finally also have that E[ ‖∇F(x(k))‖^2 ] is uniformly bounded. The proof of Lemma 3.3 is now complete.

C. Disagreement bounds

Recall the (hypothetically available) global average of the nodes' estimates, x̄(k) = (1/N) Σ_{i=1}^N x_i(k), and denote by x̃_i(k) = x_i(k) - x̄(k) the quantity that measures how far node i's solution estimate is from the global average. Introduce also the vector x̃(k) = ( x̃_1(k),..., x̃_N(k) )^T, and note that it can be represented as x̃(k) = ( I - J ) x(k), where we recall J = (1/N) 1 1^T. We have the following lemma.

Lemma 3.4. Consider algorithm (2) under Assumptions 1-3. Then, there holds:

E[ ‖x̃(k)‖^2 ] = O(1/k).

As detailed in the next subsection, Lemma 3.4 is important because it allows us to bound, sufficiently tightly, the bias in the gradient estimates according to which the global average x̄(k) evolves.

Proof. It is easy to show that the process {x̃(k)} follows the recursion:

x̃(k+1) = W̃(k) x̃(k) - α_k ( I - J ) ( ∇F(x(k)) + v(k) ),    (17)

where W̃(k) := W(k) - J = I - β_k L(k) - J. Denote w(k) := ( I - J )( ∇F(x(k)) + v(k) ), and note that E[ ‖w(k)‖^2 ] ≤ c_7 < ∞, which follows from the mean square boundedness of x(k) and ∇F(x(k)). Then, we have:

‖x̃(k+1)‖ ≤ ‖W̃(k) x̃(k)‖ + α_k ‖w(k)‖.

We now invoke Lemma 4.4 in [13] to note that, for an appropriately chosen k_1, we have, for k ≥ k_1:

‖x̃(k+1)‖ ≤ ( 1 - r(k) ) ‖x̃(k)‖ + α_k ‖w(k)‖,    (18)

with r(k) an F_k-adapted process that satisfies r(k) ∈ [0, 1], a.s., and:

E[ r(k) | F_{k-1} ] ≥ c_8 β_k = c_9/(k+1)^{1/2}, a.s.,    (19)

for some constants c_8, c_9 > 0. Using (14) in (18), we have:

‖x̃(k+1)‖^2 ≤ ( 1 + θ_k ) ( 1 - r(k) )^2 ‖x̃(k)‖^2 + ( 1 + 1/θ_k ) α_k^2 ‖w(k)‖^2,

for θ_k = c_10/(k+1)^{1/2}. Then, we have:

E[ ‖x̃(k+1)‖^2 | F_{k-1} ] ≤ ( 1 + θ_k ) ( 1 - c_9/(k+1)^{1/2} ) ‖x̃(k)‖^2 + ( 1 + 1/θ_k ) α_k^2 E[ ‖w(k)‖^2 | F_{k-1} ], a.s.

Next, for c_10 < c_9 (c_10 can be chosen freely), we have:

E[ ‖x̃(k+1)‖^2 ] ≤ ( 1 - c_11/(k+1)^{1/2} ) E[ ‖x̃(k)‖^2 ] + c_12/(k+1)^{3/2}.    (20)

Utilizing Lemma 3.2(b), inequality (20) finally yields E[ ‖x̃(k+1)‖^2 ] = O(1/k). The proof of the lemma is complete.

D. Proof of Theorem 3.1

We are now ready to prove Theorem 3.1.

Proof. Consider the global average x̄(k) = (1/N) Σ_{i=1}^N x_i(k). Averaging the updates (2) over i = 1,...,N, we have:

x̄(k+1) = x̄(k) - (α_k/N) Σ_{i=1}^N ( ∇f_i(x_i(k)) + v_i(k) ),

where we denote v̄(k) := (1/N) Σ_{i=1}^N v_i(k). This implies:

x̄(k+1) = x̄(k) - (α_k/N) Σ_{i=1}^N [ ∇f_i(x_i(k)) - ∇f_i(x̄(k)) + ∇f_i(x̄(k)) ] - α_k v̄(k).

Recall f(·) = Σ_{i=1}^N f_i(·). Then, we have:

x̄(k+1) = x̄(k) - (α_k/N) ∇f(x̄(k)) - (α_k/N) Σ_{i=1}^N [ ∇f_i(x_i(k)) - ∇f_i(x̄(k)) ] - α_k v̄(k).    (21)

This implies:

x̄(k+1) = x̄(k) - (α_k/N) [ ∇f(x̄(k)) + e(k) ],    (22)

where

e(k) := ε(k) + Σ_{i=1}^N v_i(k), and ε(k) := Σ_{i=1}^N ( ∇f_i(x_i(k)) - ∇f_i(x̄(k)) ).    (23)

Note that ‖∇f_i(x_i(k)) - ∇f_i(x̄(k))‖ ≤ L ‖x_i(k) - x̄(k)‖ = L ‖x̃_i(k)‖. Thus, by Lemma 3.4, we can conclude for ε(k) that:

E[ ‖ε(k)‖^2 ] ≤ c_15/(k+1).    (24)

Note here that (22) is an inexact gradient method for minimizing f, with step size α_k/N and random gradient error e(k). The term Σ_{i=1}^N v_i(k) is zero-mean, while the bias in the gradient estimate is induced by ε(k); as per (24), the bias is at most O(1/k) in the mean square sense.

With the above development in place, we rewrite (22) as follows:

x̄(k+1) = x̄(k) - (α_k/N) ∇f(x̄(k)) - (α_k/N) ε(k) - (α_k/N) Σ_{i=1}^N v_i(k).    (25)

This implies, recalling that x* is the solution to (1), so that ∇f(x*) = 0:

x̄(k+1) - x* = x̄(k) - x* - (α_k/N) ( ∇f(x̄(k)) - ∇f(x*) ) - (α_k/N) ε(k) - (α_k/N) Σ_{i=1}^N v_i(k).    (26)

By the mean value theorem, we have:

∇f(x̄(k)) - ∇f(x*) = [ ∫_0^1 ∇^2 f( x* + s (x̄(k) - x*) ) ds ] ( x̄(k) - x* ) =: H_k ( x̄(k) - x* ),    (27)

where it is to be noted that N µ I ⪯ H_k ⪯ N L I. Using (27) in (26), we have:

x̄(k+1) - x* = [ I - (α_k/N) H_k ] ( x̄(k) - x* ) - (α_k/N) ε(k) - (α_k/N) Σ_{i=1}^N v_i(k).    (28)

Denote by

m(k) := [ I - (α_k/N) H_k ] ( x̄(k) - x* ) - (α_k/N) ε(k).

Then, (28) is rewritten as:

x̄(k+1) - x* = m(k) - (α_k/N) Σ_{i=1}^N v_i(k),    (29)

and so:

‖x̄(k+1) - x*‖^2 = ‖m(k)‖^2 - 2 (α_k/N) m(k)^T Σ_{i=1}^N v_i(k) + (α_k/N)^2 ‖ Σ_{i=1}^N v_i(k) ‖^2.    (30)

The latter equality implies:

E[ ‖x̄(k+1) - x*‖^2 | F_k ] = ‖m(k)‖^2 - 2 (α_k/N) m(k)^T E[ Σ_{i=1}^N v_i(k) | F_k ] + (α_k/N)^2 E[ ‖ Σ_{i=1}^N v_i(k) ‖^2 | F_k ], a.s.

Taking expectation, using the fact that E[ v_i(k) | F_k ] = 0, Assumption 3, and Lemma 3.3, we obtain:

E[ ‖x̄(k+1) - x*‖^2 ] ≤ E[ ‖m(k)‖^2 ] + c_17/(k+1)^2,    (31)

for some constant c_17 > 0. Next, using (14), we have for m(k):

‖m(k)‖^2 ≤ ( 1 + θ_k ) ‖ I - (α_k/N) H_k ‖^2 ‖x̄(k) - x*‖^2 + ( 1 + 1/θ_k ) (α_k/N)^2 ‖ε(k)‖^2
≤ ( 1 + θ_k ) ( 1 - c_18 α_k )^2 ‖x̄(k) - x*‖^2 + ( 1 + 1/θ_k ) (α_k/N)^2 ‖ε(k)‖^2,

with c_18 = µ, because N µ I ⪯ H_k ⪯ N L I. After choosing θ_k = c_19/(k+1), with c_19 < α_0 c_18/2 (= α_0 µ/2), and after taking expectation, we obtain:

E[ ‖m(k)‖^2 ] ≤ ( 1 - c_20/(k+1) ) E[ ‖x̄(k) - x*‖^2 ] + c_21/(k+1)^2,    (32)

where c_20 ≥ α_0 µ/2 > 1 (because α_0 > 2/µ) and c_21 is a positive constant. Combining (32) and (31), we get:

E[ ‖x̄(k+1) - x*‖^2 ] ≤ ( 1 - c_20/(k+1) ) E[ ‖x̄(k) - x*‖^2 ] + (c_17 + c_21)/(k+1)^2.

Invoking Lemma 3.2(c), the latter inequality implies:

E[ ‖x̄(k+1) - x*‖^2 ] ≤ c_22/(k+1),    (33)

for some constant c_22 > 0. Therefore, for the global average x̄(k), we have obtained the mean square rate O(1/k). Finally, we note that

x_i(k) - x* = x̃_i(k) + ( x̄(k) - x* ),

and so, after using

‖x_i(k) - x*‖^2 ≤ 2 ‖x̃_i(k)‖^2 + 2 ‖x̄(k) - x*‖^2

and taking expectation, it follows from Lemma 3.4 and (33) that E[ ‖x_i(k) - x*‖^2 ] = O(1/k), for all i = 1,...,N. The proof is complete.
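The O(1/k) rate in Theorem 3.1 ultimately rests on the deterministic recursion of Lemma 3.2(c). As a quick numerical sanity check (with arbitrarily chosen constants, not values from the proof), the following Python sketch iterates z(k+1) = (1 - a_1/(k+1)) z(k) + a_2/(k+1)^2 with a_1 > 1 and verifies that k z(k) remains bounded, i.e., that z(k) = O(1/k).

a1, a2, z = 3.0, 1.0, 10.0        # a1 > 1 as in Lemma 3.2(c); z(0) = 10 is arbitrary
scaled = []
for k in range(1, 100001):
    r1 = min(a1 / k, 1.0)         # clip the first few iterations so the factor stays in [0, 1]
    z = (1.0 - r1) * z + a2 / k**2
    scaled.append(k * z)          # k * z(k) should stay bounded if z(k) = O(1/k)

print(max(scaled), scaled[-1])    # both the maximum and the tail value remain modest constants

With a_1 ≤ 1, the same recursion no longer decays at the 1/k rate, which is why the theorem's condition α_0 > 2/µ (ensuring c_20 > 1 in (32)) matters for the exact O(1/k) rate.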

4. SIMULATION EXAMPLE

We provide a simulation example on l_2-regularized logistic losses and random networks in which links fail independently over iterations and across different links, with probability p_fail. The simulation corroborates the derived O(1/k) rate of algorithm (2) over random networks and shows that the deterioration due to an increase of p_fail is small.

We consider empirical risk minimization (7) with the logistic loss in (8) and the regularization functions set to Ψ_i(x) = (κ/2) ‖x‖^2, i = 1,...,N, where κ > 0 is the regularization parameter, set here to κ = 0.5. The number of data points per node is n_i = 10. We generate the true classification vector x* = ( (x*_1)^T, x*_0 )^T by drawing its entries independently from the standard normal distribution. The class labels are then generated as b_{i,j} = sign( (x*_1)^T a_{i,j} + x*_0 + ε_{i,j} ), where the ε_{i,j} are drawn independently from a normal distribution with zero mean and standard deviation 2. The feature vectors a_{i,j}, j = 1,...,n_i, at node i are generated as follows: each entry of each vector is the sum of a standard normal random variable and a uniform random variable with support [0, 5i]. Different entries within a feature vector are drawn independently, and different vectors are also drawn independently, both intra-node and inter-node. Note that the feature vectors at different nodes are drawn from different distributions.

The algorithm parameters are set as follows. We let β_k = β_0/(k+1)^{1/2} with β_0 = 1/(1+θ), and α_k = 1/(k+1), k = 0, 1,... Here, θ is the maximal degree across all nodes in the network, which here equals θ = 6. Algorithm (2) is initialized with x_i(0) = 0, for all i = 1,...,N.

We consider a connected network Ḡ with N = 10 nodes and 23 links, generated as a random geometric graph: nodes are placed randomly (uniformly) on a unit square, and node pairs whose distance is less than a given radius are connected by an edge. We consider the random network model where each (undirected) link in network Ḡ fails independently across iterations, and independently from the other links, with probability p_fail. We consider the cases p_fail ∈ {0, 0.5, 0.9}. Note that the case p_fail = 0 corresponds to network Ḡ with all its links always online, more precisely, with links failing with zero probability. Algorithm (2) is then run on each of the described network models, i.e., for each p_fail ∈ {0, 0.5, 0.9}. This allows us to assess how much the algorithm performance degrades as p_fail increases.

We also include a comparison with the following centralized stochastic gradient method:

y(k+1) = y(k) - ( 1/(k+1) ) (1/N) Σ_{i=1}^N ∇l_i( y(k); ( a_i(k), b_i(k) ) ),    (34)

where ( a_i(k), b_i(k) ) is drawn uniformly from the set ( a_{i,j}, b_{i,j} ), j = 1,...,n_i. Note that algorithm (34) forms an unbiased gradient estimate by drawing a sample uniformly at random from each node's data set. Algorithm (34) is an idealization of (2): it shows how (2) would be implemented if there existed a fusion node that had access to all nodes' data. Hence, the comparison with (34) allows us to examine how much the performance of (2) degrades due to the lack of global information, i.e., due to the distributed nature of the considered problem. Note that the step-size in (34) is set to 1/(k+1) for a meaningful comparison with (2), as this is the step-size effectively utilized by the hypothetical global average of the nodes' iterates under (2).

As an error metric, we use the mean square error (MSE) estimate averaged across nodes: (1/N) Σ_{i=1}^N ‖x_i(k) - x*‖^2.
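For reproducibility of the qualitative behavior, the following Python sketch outlines one run of a simulation of this kind: a fixed base graph whose links fail independently with probability p_fail at every iteration, and update (2) with α_k = 1/(k+1) and β_k = β_0/(k+1)^{1/2}. It is a simplified sketch rather than the exact code behind Figure 1: the per-node data are placeholder quadratic-loss data (so that x* has a closed form), and the base graph is a ring instead of the random geometric graph described above.

import numpy as np

rng = np.random.default_rng(0)
N, m, n_i, p_fail, kappa, iters = 10, 5, 10, 0.5, 0.5, 5000

# Placeholder local data: regressors A[i] and responses b[i] per node (quadratic loss).
A = [rng.normal(size=(n_i, m)) for _ in range(N)]
b = [A[i] @ rng.normal(size=m) + rng.normal(size=n_i) for i in range(N)]

def noisy_grad(i, x):
    # Single-sample stochastic gradient of the l2-regularized quadratic local risk
    # (unbiased for the local empirical-risk gradient, as in Assumption 3).
    j = rng.integers(n_i)
    return A[i][j] * (A[i][j] @ x - b[i][j]) + kappa * x

# Exact minimizer of the aggregate cost, used only to track the MSE.
x_star = np.linalg.solve(
    sum(A[i].T @ A[i] / n_i for i in range(N)) + N * kappa * np.eye(m),
    sum(A[i].T @ b[i] / n_i for i in range(N)))

base_edges = [(i, (i + 1) % N) for i in range(N)]   # placeholder base graph: a ring
theta = 2                                           # maximal degree of the ring
beta0 = 1.0 / (1 + theta)

x = np.zeros((N, m))
for k in range(iters):
    alpha, beta = 1.0 / (k + 1), beta0 / np.sqrt(k + 1)
    live = [e for e in base_edges if rng.random() > p_fail]   # links alive at time k
    x_new = x.copy()
    for i, j in live:                      # consensus step over the currently live links
        x_new[i] -= beta * (x[i] - x[j])
        x_new[j] -= beta * (x[j] - x[i])
    for i in range(N):                     # local noisy-gradient (innovation) step
        x_new[i] -= alpha * noisy_grad(i, x[i])
    x = x_new

mse = np.mean(np.sum((x - x_star) ** 2, axis=1))    # (1/N) sum_i ||x_i(k) - x*||^2
print(mse)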
Figure 1 plots the estimated MSE, averaged across 100 algorithm runs, versus the iteration number, for the different values of the parameter p_fail, in log_10-log_10 scale. Note that the slope of a plotted curve corresponds to the sublinear rate of the method; e.g., a slope of -1 corresponds to a 1/k rate. First, note from the figure that, for any value of p_fail, algorithm (2) achieves (on this example, at least) the 1/k rate, thus corroborating our theory. Next, note that increasing the link failure probability only increases the constant in the MSE but does not affect the rate. (The curves that correspond to different values of p_fail are vertically shifted.) Interestingly, the loss due to the increase of p_fail is small; e.g., the curves that correspond to p_fail = 0.5 and p_fail = 0 (no link failures) practically match. Figure 1 also shows the performance of the centralized method (34). We can see that, except for the initial few iterations, the distributed method (2) is very close in performance to the centralized method.

Fig. 1: Estimated MSE versus iteration number for algorithm (2) with link failure probability p_fail = 0 (red, solid line); 0.5 (blue, dashed line); and 0.9 (green, dash-dot line). The figure also shows the performance of the centralized stochastic gradient method (34) (black, dotted line).

5. CONCLUSION

We considered a distributed stochastic gradient method for smooth strongly convex optimization. Through the analysis of the considered method, we established, for the first time, the order-optimal O(1/k) MSE convergence rate for the assumed optimization setting when the underlying network is randomly varying. A simulation example on l_2-regularized logistic losses corroborates the established O(1/k) rate and suggests that the effect of the underlying network's statistics on the rate's hidden constant is small.

REFERENCES

[1] S. Kar and J. M. F. Moura, "Convergence rate analysis of distributed gossip (linear parameter) estimation: Fundamental limits and tradeoffs," IEEE Journal of Selected Topics in Signal Processing (Signal Processing in Gossiping Algorithms Design and Applications), vol. 5, no. 4, Aug. 2011.
[2] F. Bullo, J. Cortes, and S. Martinez, Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms. Princeton University Press, 2009.
[3] A. Daneshmand, F. Facchinei, V. Kungurtsev, and G. Scutari, "Hybrid random/deterministic parallel algorithms for nonconvex big data optimization," submitted to IEEE Transactions on Signal Processing.
[4] I. Lobel and A. E. Ozdaglar, "Distributed subgradient methods for convex optimization over random networks," IEEE Transactions on Automatic Control, vol. 56, no. 6, 2011.
[5] I. Lobel, A. Ozdaglar, and D. Feijer, "Distributed multi-agent optimization with state-dependent communication," Mathematical Programming, vol. 129, no. 2, 2011.
[6] D. Jakovetic, J. Xavier, and J. M. F. Moura, "Convergence rates of distributed Nesterov-like gradient methods on random networks," IEEE Transactions on Signal Processing, vol. 62, no. 4, Feb. 2014.
[7] K. Tsianos and M. Rabbat, "Distributed strongly convex optimization," 50th Annual Allerton Conference on Communication, Control, and Computing, Oct. 2012.
[8] Z. J. Towfic, J. Chen, and A. H. Sayed, "Excess-risk of distributed stochastic learners," IEEE Transactions on Information Theory, vol. 62, no. 10, Oct. 2016.
[9] D. Yuan, Y. Hong, D. W. C. Ho, and G. Jiang, "Optimal distributed stochastic mirror descent for strongly convex optimization," Automatica, vol. 90, Apr. 2018.
[10] N. D. Vanli, M. O. Sayin, and S. S. Kozat, "Stochastic subgradient algorithms for strongly convex optimization over distributed networks," IEEE Transactions on Network Science and Engineering, vol. 4, no. 4, Oct.-Dec. 2017.
[11] A. Nedic and A. Olshevsky, "Stochastic gradient-push for strongly convex functions on time-varying directed graphs," IEEE Transactions on Automatic Control, vol. 61, no. 12, Dec. 2016.
[12] A. K. Sahu, D. Jakovetic, D. Bajovic, and S. Kar, "Distributed zeroth order optimization over random networks: A Kiefer-Wolfowitz stochastic approximation approach," 2018, available at dropbox.com/s/fc2hgbfcx5yhr8/maincdc208kwsa.pdf.
[13] S. Kar, J. M. F. Moura, and H. V. Poor, "Distributed linear parameter estimation: Asymptotically efficient adaptive strategies," SIAM Journal on Control and Optimization, vol. 51, no. 3, 2013.


UTILIZING PRIOR KNOWLEDGE IN ROBUST OPTIMAL EXPERIMENT DESIGN. EE & CS, The University of Newcastle, Australia EE, Technion, Israel. UTILIZING PRIOR KNOWLEDGE IN ROBUST OPTIMAL EXPERIMENT DESIGN Graham C. Goodwin James S. Welsh Arie Feuer Milan Depich EE & CS, The University of Newcastle, Australia 38. EE, Technion, Israel. Abstract:

More information

sparse and low-rank tensor recovery Cubic-Sketching

sparse and low-rank tensor recovery Cubic-Sketching Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru

More information

Large Deviations Rates for Distributed Inference. Doctor of Philosophy. Electrical and Computer Engineering. Dragana Bajović

Large Deviations Rates for Distributed Inference. Doctor of Philosophy. Electrical and Computer Engineering. Dragana Bajović Large Deviations Rates for Distributed Inference Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering Dragana Bajović Dipl.

More information

Distributed online optimization over jointly connected digraphs

Distributed online optimization over jointly connected digraphs Distributed online optimization over jointly connected digraphs David Mateos-Núñez Jorge Cortés University of California, San Diego {dmateosn,cortes}@ucsd.edu Southern California Optimization Day UC San

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

1216 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 54, NO. 6, JUNE /$ IEEE

1216 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 54, NO. 6, JUNE /$ IEEE 1216 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 54, NO 6, JUNE 2009 Feedback Weighting Mechanisms for Improving Jacobian Estimates in the Adaptive Simultaneous Perturbation Algorithm James C Spall, Fellow,

More information

MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing

MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing Afonso S. Bandeira April 9, 2015 1 The Johnson-Lindenstrauss Lemma Suppose one has n points, X = {x 1,..., x n }, in R d with d very

More information

Multilayer Perceptron

Multilayer Perceptron Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4

More information

Observer design for a general class of triangular systems

Observer design for a general class of triangular systems 1st International Symposium on Mathematical Theory of Networks and Systems July 7-11, 014. Observer design for a general class of triangular systems Dimitris Boskos 1 John Tsinias Abstract The paper deals

More information

Optimal Newton-type methods for nonconvex smooth optimization problems

Optimal Newton-type methods for nonconvex smooth optimization problems Optimal Newton-type methods for nonconvex smooth optimization problems Coralia Cartis, Nicholas I. M. Gould and Philippe L. Toint June 9, 20 Abstract We consider a general class of second-order iterations

More information

THE EFFECT OF DETERMINISTIC NOISE 1 IN SUBGRADIENT METHODS

THE EFFECT OF DETERMINISTIC NOISE 1 IN SUBGRADIENT METHODS Submitted: 24 September 2007 Revised: 5 June 2008 THE EFFECT OF DETERMINISTIC NOISE 1 IN SUBGRADIENT METHODS by Angelia Nedić 2 and Dimitri P. Bertseas 3 Abstract In this paper, we study the influence

More information

Cooperative Communication with Feedback via Stochastic Approximation

Cooperative Communication with Feedback via Stochastic Approximation Cooperative Communication with Feedback via Stochastic Approximation Utsaw Kumar J Nicholas Laneman and Vijay Gupta Department of Electrical Engineering University of Notre Dame Email: {ukumar jnl vgupta}@ndedu

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016 U.C. Berkeley CS294: Spectral Methods and Expanders Handout Luca Trevisan February 29, 206 Lecture : ARV In which we introduce semi-definite programming and a semi-definite programming relaxation of sparsest

More information

Randomized Gossip Algorithms

Randomized Gossip Algorithms 2508 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 52, NO 6, JUNE 2006 Randomized Gossip Algorithms Stephen Boyd, Fellow, IEEE, Arpita Ghosh, Student Member, IEEE, Balaji Prabhakar, Member, IEEE, and Devavrat

More information

Resilient Distributed Optimization Algorithm against Adversary Attacks

Resilient Distributed Optimization Algorithm against Adversary Attacks 207 3th IEEE International Conference on Control & Automation (ICCA) July 3-6, 207. Ohrid, Macedonia Resilient Distributed Optimization Algorithm against Adversary Attacks Chengcheng Zhao, Jianping He

More information

ARock: an algorithmic framework for asynchronous parallel coordinate updates

ARock: an algorithmic framework for asynchronous parallel coordinate updates ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,

More information

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 62, NO. 23, DECEMBER 1, Aryan Mokhtari and Alejandro Ribeiro, Member, IEEE

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 62, NO. 23, DECEMBER 1, Aryan Mokhtari and Alejandro Ribeiro, Member, IEEE IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 62, NO. 23, DECEMBER 1, 2014 6089 RES: Regularized Stochastic BFGS Algorithm Aryan Mokhtari and Alejandro Ribeiro, Member, IEEE Abstract RES, a regularized

More information

EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018

EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018 EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018 While these notes are under construction, I expect there will be many typos. The main reference for this is volume 1 of Hörmander, The analysis of liner

More information

Advanced computational methods X Selected Topics: SGD

Advanced computational methods X Selected Topics: SGD Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety

More information

arxiv: v1 [stat.ml] 12 Nov 2015

arxiv: v1 [stat.ml] 12 Nov 2015 Random Multi-Constraint Projection: Stochastic Gradient Methods for Convex Optimization with Many Constraints Mengdi Wang, Yichen Chen, Jialin Liu, Yuantao Gu arxiv:5.03760v [stat.ml] Nov 05 November 3,

More information

Consensus-Based Distributed Optimization with Malicious Nodes

Consensus-Based Distributed Optimization with Malicious Nodes Consensus-Based Distributed Optimization with Malicious Nodes Shreyas Sundaram Bahman Gharesifard Abstract We investigate the vulnerabilities of consensusbased distributed optimization protocols to nodes

More information

Discrete-time Consensus Filters on Directed Switching Graphs

Discrete-time Consensus Filters on Directed Switching Graphs 214 11th IEEE International Conference on Control & Automation (ICCA) June 18-2, 214. Taichung, Taiwan Discrete-time Consensus Filters on Directed Switching Graphs Shuai Li and Yi Guo Abstract We consider

More information