Convergence rates for distributed stochastic optimization over random networks


Dusan Jakovetic, Dragana Bajovic, Anit Kumar Sahu and Soummya Kar

Abstract: We establish the $O(1/k)$ convergence rate for distributed stochastic gradient methods that operate over strongly convex costs and random networks. The considered class of methods is standard: each node performs a weighted average of its own and its neighbors' solution estimates (consensus), and takes a negative step with respect to a noisy version of its local function's gradient (innovation). The underlying communication network is modeled through a sequence of temporally independent identically distributed (i.i.d.) Laplacian matrices, connected on average, while the local gradient noises are also i.i.d. in time, have finite second moment, and possibly unbounded support. We show that, after a careful setting of the consensus and innovations potentials (weights), the distributed stochastic gradient method achieves the (order-optimal) $O(1/k)$ convergence rate in the mean square distance from the solution. This is the first order-optimal convergence rate result for distributed strongly convex stochastic optimization when the network is random and/or the gradient noises have unbounded support. Simulation examples confirm the theoretical findings.

1. INTRODUCTION

Distributed optimization and learning algorithms have attracted great interest in recent years, thanks to their widespread applications, including distributed estimation in networked systems, e.g., [1], distributed control, e.g., [2], and big data analytics, e.g., [3]. In this paper, we study distributed stochastic optimization algorithms that operate over random networks and minimize smooth strongly convex costs. We consider standard distributed stochastic gradient methods where, at each time step, each node makes a weighted average of its own and its neighbors' solution estimates, and performs a step in the negative direction of its noisy local gradient. The underlying network is allowed to be randomly varying, similarly to, e.g., the models in [4]-[6]. More specifically, the network is modeled through a sequence of independent identically distributed (i.i.d.) graph Laplacian matrices, where the network is assumed to be connected on average. (This translates into the requirement that the algebraic connectivity of the mean Laplacian matrix is strictly positive.) Random network models are highly relevant in, e.g., internet of things (IoT) and cyber-physical systems (CPS) applications, like predictive maintenance and monitoring in industrial manufacturing systems, monitoring of smart buildings, etc. Therein, networked nodes often communicate through unreliable/intermittent wireless links, due to, e.g., low-power transmissions or harsh environments.

The work of DJ and DB was supported in part by the EU Horizon 2020 project I-BiDaaS, project number 780787. The work of D. Jakovetic was also supported in part by the Serbian Ministry of Education, Science, and Technological Development, grant 174030. The work of AKS and SK was supported in part by the National Science Foundation under grant CCF-1513936. D. Bajovic is with the Faculty of Technical Sciences, University of Novi Sad, 21000 Novi Sad, Serbia (dbajovic@uns.ac.rs). D. Jakovetic is with the Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, 21000 Novi Sad, Serbia (djakovet@uns.ac.rs). A. K. Sahu and S. Kar are with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 ({anits,soummyak}@andrew.cmu.edu).
The main contributions of the paper are as follows. We show that, by carefully designing the consensus and the gradient weights (potentials), the considered distributed stochastic gradient algorithm achieves the order-optimal $O(1/k)$ rate of decay of the mean squared distance from the solution (mean squared error, MSE). This is achieved for twice continuously differentiable strongly convex local costs, assuming also that the noisy gradients are unbiased estimates of the true gradients and that the gradient noise has bounded second moment. To the best of our knowledge, this is the first time an order-optimal convergence rate for distributed strongly convex stochastic optimization has been established for random networks.

We now briefly review the literature to contrast this paper with prior work. In the context of the extensive literature on distributed optimization, the most relevant to our work are the references on: 1) distributed strongly convex stochastic (sub)gradient methods; and 2) distributed (sub)gradient methods over random networks (both deterministic and stochastic methods). For the former thread of work, several papers give explicit convergence rates under different assumptions. Regarding the underlying network, references [7], [8] consider static networks, while the works [9]-[11] consider deterministic time-varying networks. References [7], [8] consider distributed strongly convex optimization for static networks, assuming that the data distributions that underlie each node's local cost function are equal (reference [7] considers empirical risks, while reference [8] considers risk functions in the form of an expectation); this essentially corresponds to each node's local function having the same minimizer. References [9]-[11] consider deterministically varying networks, assuming that the union graph over finite windows of iterations is connected. The papers [7]-[10] assume undirected networks, while [11] allows for directed networks but assumes a bounded support for the gradient noise. The works [7], [9]-[11] allow the local costs to be non-smooth, while [8] assumes smooth costs, as we do here. With respect to these works, we consider random, undirected networks and smooth costs, and we allow the noise to have unbounded support.

Distributed optimization over random networks has been studied in [4]-[6]. References [4], [5] consider nondifferentiable convex costs and no (sub)gradient noise, while reference [6] considers differentiable costs with Lipschitz continuous and bounded gradients, and it also does not allow for gradient noise, i.e., it considers methods with exact (deterministic) gradients. In [12], we consider a distributed Kiefer-Wolfowitz-type stochastic approximation method and establish the method's $O(1/k^{1/2})$ convergence rate. Reference [12] complements the current paper by assuming that nodes only have access to noisy function estimates (zeroth order optimization), with no gradient estimates available. In contrast, by assuming that noisy first-order (gradient) information is available, we show here that strictly faster rates than in [12] can be achieved. In summary, to the best of our knowledge, this is the first work to establish an order-optimal convergence rate for distributed stochastic gradient methods under strongly convex costs and random networks.

Paper organization. The next paragraph introduces notation. Section 2 describes the model and the stochastic gradient method we consider. Section 3 states and proves the main result on the algorithm's MSE convergence rate. Section 4 provides a simulation example. Finally, we conclude in Section 5.

Notation. We denote by $\mathbb{R}$ the set of real numbers and by $\mathbb{R}^m$ the $m$-dimensional Euclidean real coordinate space. We use normal lower-case letters for scalars, lower-case boldface letters for vectors, and upper-case boldface letters for matrices. Further, we denote by: $A_{ij}$ the entry in the $i$-th row and $j$-th column of a matrix $A$; $A^\top$ the transpose of a matrix $A$; $\otimes$ the Kronecker product of matrices; $I$, $0$, and $\mathbf{1}$, respectively, the identity matrix, the zero matrix, and the column vector with unit entries; $J$ the $N \times N$ matrix $J := (1/N)\,\mathbf{1}\mathbf{1}^\top$. When necessary, we indicate the matrix or vector dimension through a subscript. Next, $A \succ 0$ ($A \succeq 0$) means that the symmetric matrix $A$ is positive definite (respectively, positive semi-definite). We further denote by: $\|\cdot\| = \|\cdot\|_2$ the Euclidean (respectively, spectral) norm of its vector (respectively, matrix) argument; $\lambda_i(\cdot)$ the $i$-th smallest eigenvalue; $\nabla h(w)$ and $\nabla^2 h(w)$ the gradient and Hessian, respectively, of a function $h: \mathbb{R}^m \to \mathbb{R}$, $m \ge 1$, evaluated at $w$; $P(A)$ and $E[u]$ the probability of an event $A$ and the expectation of a random variable $u$, respectively. Finally, for two positive sequences $\eta_n$ and $\chi_n$, we write $\eta_n = O(\chi_n)$ if $\limsup_{n \to \infty} \eta_n/\chi_n < \infty$.

2. MODEL AND ALGORITHM

A. Optimization and network models

We consider the scenario where $N$ networked nodes aim to collaboratively solve the following unconstrained problem:

$$\underset{x \in \mathbb{R}^m}{\text{minimize}} \ \sum_{i=1}^N f_i(x), \qquad (1)$$

where $f_i: \mathbb{R}^m \to \mathbb{R}$ is a convex function available to node $i$, $i = 1, \dots, N$. We make the following standard assumption on the $f_i$'s.

Assumption 1. For all $i = 1, \dots, N$, the function $f_i: \mathbb{R}^m \to \mathbb{R}$ is twice continuously differentiable, and there exist constants $0 < \mu \le L < \infty$ such that, for all $x \in \mathbb{R}^m$, there holds:

$$\mu\, I \preceq \nabla^2 f_i(x) \preceq L\, I.$$

Assumption 1 implies that each $f_i$ is strongly convex with modulus $\mu$ and also has Lipschitz continuous gradient with Lipschitz constant $L$, i.e., the following two inequalities hold for any $x, y \in \mathbb{R}^m$:

$$f_i(y) \ge f_i(x) + \nabla f_i(x)^\top (y - x) + \frac{\mu}{2}\, \|x - y\|^2,$$
$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\, \|x - y\|.$$

Furthermore, under Assumption 1, problem (1) is solvable and has a unique solution, which we denote by $x^\star \in \mathbb{R}^m$. For future reference, we also introduce the sum function $f: \mathbb{R}^m \to \mathbb{R}$, $f(x) = \sum_{i=1}^N f_i(x)$. We consider distributed stochastic gradient methods to solve (1) over random networks. Specifically, we adopt the following model.
At each time instant $k = 0, 1, \dots$, the underlying network $G(k) = (V, E(k))$ is undirected and random, with $V = \{1, \dots, N\}$ the set of nodes and $E(k)$ the random set of undirected edges. We denote by $\{i, j\}$ the edge that connects nodes $i$ and $j$. Further, denote by $\Omega_i(k) = \{\, j \in V : \{i, j\} \in E(k) \,\}$ the random neighborhood of node $i$ at time $k$ (excluding node $i$). We associate to $G(k)$ its (symmetric) Laplacian matrix $L(k)$, defined by: $L_{ij}(k) = -1$, if $\{i,j\} \in E(k)$, $i \neq j$; $L_{ij}(k) = 0$, if $\{i,j\} \notin E(k)$, $i \neq j$; and $L_{ii}(k) = -\sum_{j \neq i} L_{ij}(k)$. Denote by $\overline{L} = E[\,L(k)\,]$ (as explained ahead, the expectation is independent of $k$), and let $\overline{G} = (V, \overline{E})$ be the graph induced by matrix $\overline{L}$, i.e., $\overline{E} = \{\, \{i,j\} : i \neq j,\ |\overline{L}_{ij}| > 0 \,\}$. We make the following assumption.

Assumption 2. The matrices $\{L(k)\}$ are independent, identically distributed (i.i.d.). Furthermore, the graph $\overline{G}$ is connected.

It is well known that the connectedness of $\overline{G}$ is equivalent to the condition $\lambda_2(\overline{L}) > 0$.

B. Gradient noise model and the algorithm

We consider the following distributed stochastic gradient method to solve (1). Each node $i$, $i = 1, \dots, N$, maintains over time steps (iterations) $k = 0, 1, \dots$ its solution estimate $x_i(k) \in \mathbb{R}^m$. Specifically, for arbitrary deterministic initial points $x_i(0) \in \mathbb{R}^m$, $i = 1, \dots, N$, the update rule at node $i$ and $k = 0, 1, \dots$ is as follows:

$$x_i(k+1) = x_i(k) - \beta_k \sum_{j \in \Omega_i(k)} \left( x_i(k) - x_j(k) \right) - \alpha_k \left( \nabla f_i(x_i(k)) + v_i(k) \right). \qquad (2)$$

The update (2) is performed in parallel by all nodes $i = 1, \dots, N$. The algorithm iteration is realized as follows. First, each node $i$ broadcasts $x_i(k)$ to all its available neighbors $j \in \Omega_i(k)$ and receives $x_j(k)$ from all $j \in \Omega_i(k)$. Subsequently, each node $i$, $i = 1, \dots, N$, makes update (2), which completes an iteration.

In (2), $\alpha_k$ is the step-size, which we set to $\alpha_k = \alpha_0/(k+1)$, $k = 0, 1, \dots$, with $\alpha_0 > 0$; and $\beta_k$ is the (possibly) time-varying weight that each node assigns to all its neighbors. We set $\beta_k = \beta_0/(k+1)^\nu$, $k = 0, 1, \dots$, with $\nu \in (0, 1/2]$. Here, $\beta_0 > 0$ is a constant that should be taken sufficiently small; e.g., one can set $\beta_0 = 1/(1 + \theta)$, where $\theta$ is the maximal degree (number of neighbors of a node) across the network. Finally, $v_i(k)$ is the noise in the calculation of $f_i$'s gradient at iteration $k$.

For future reference, we also present algorithm (2) in matrix format. Denote by $x(k) = [x_1(k)^\top, \dots, x_N(k)^\top]^\top \in \mathbb{R}^{Nm}$ the vector that stacks the solution estimates of all nodes. Also, define the function $F: \mathbb{R}^{Nm} \to \mathbb{R}$ by $F(x) = \sum_{i=1}^N f_i(x_i)$, with $x = [x_1^\top, \dots, x_N^\top]^\top \in \mathbb{R}^{Nm}$. Finally, let $W_k = (I - L_k) \otimes I_m$, where $L_k = \beta_k\, L(k)$. Then, for $k = 0, 1, \dots$, algorithm (2) can be compactly written as follows:

$$x(k+1) = W_k\, x(k) - \alpha_k \left( \nabla F(x(k)) + v(k) \right). \qquad (3)$$

We make the following standard assumption on the gradient noises. First, denote by $\mathcal{F}_k$ the history of algorithm (2) up to time $k$; that is, $\mathcal{F}_k$, $k = 1, 2, \dots$, is an increasing sequence of sigma algebras, where $\mathcal{F}_k$ is the sigma algebra generated by the collection of random variables $\{\, L(s),\ v_i(t) \,\}$, $i = 1, \dots, N$, $s = 0, \dots, k-1$, $t = 0, \dots, k-1$.

Assumption 3. For each $i = 1, \dots, N$, the sequence of noises $\{v_i(k)\}$ satisfies, for all $k = 0, 1, \dots$:

$$E[\, v_i(k) \mid \mathcal{F}_k \,] = 0, \ \text{almost surely (a.s.),} \qquad (4)$$
$$E[\, \|v_i(k)\|^2 \mid \mathcal{F}_k \,] \le c_v\, \|x_i(k)\|^2 + \bar c_v, \ \text{a.s.,} \qquad (5)$$

where $c_v$ and $\bar c_v$ are nonnegative constants. Assumption 3 is satisfied, for example, when $\{v_i(k)\}$ is an i.i.d. zero-mean noise sequence with finite second moment such that $v_i(k)$ is also independent of the history $\mathcal{F}_k$. However, the assumption allows the gradient noise $v_i(k)$ to depend on node $i$ and on the current point $x_i(k)$; the next subsection gives some important machine learning settings encompassed by Assumption 3.

C. A machine learning motivation

The optimization-algorithmic model defined by Assumptions 1 and 3 subsumes important machine learning applications. Consider the scenario where $f_i$ corresponds to the risk function associated with node $i$'s local data, i.e.,

$$f_i(x) = E_{d_i \sim P_i}\left[\, \ell_i(x;\, d_i) \,\right] + \Psi_i(x). \qquad (6)$$

Here, $P_i$ is node $i$'s local distribution according to which its data samples $d_i \in \mathbb{R}^q$ are generated; $\ell_i(\cdot\,;\cdot)$ is a loss function that is convex in its first argument for any fixed value of its second argument; and $\Psi_i: \mathbb{R}^m \to \mathbb{R}$ is a strongly convex regularizer. Similarly, $f_i$ can be an empirical risk function:

$$f_i(x) = \frac{1}{n_i} \sum_{j=1}^{n_i} \ell_i(x;\, d_{i,j}) + \Psi_i(x), \qquad (7)$$

where $d_{i,j}$, $j = 1, \dots, n_i$, is the set of training examples at node $i$. Examples of the loss $\ell_i(\cdot\,;\cdot)$ include the following:

$$\ell_i(x;\, a_i, b_i) = \frac{1}{2} \left( a_i^\top x - b_i \right)^2 \quad \text{(quadratic loss),} \qquad (8)$$
$$\ell_i(x;\, a_i, b_i) = \ln\left( 1 + \exp(-\,b_i\, a_i^\top x) \right) \quad \text{(logistic loss).}$$

For the quadratic loss above, a data sample is $d_i = (a_i, b_i)$, where $a_i$ is a regressor vector and $b_i$ is a response variable; for the logistic loss, $a_i$ is a feature vector and $b_i \in \{-1, +1\}$ is its class label. Clearly, both the risk (6) and the empirical risk (7) satisfy Assumption 1 for the losses in (8).

We next discuss the search directions in (2) and Assumption 3 for the gradient noise. A common search direction in machine learning algorithms is the gradient of the loss with respect to a single data point:¹

$$g_i(x) = \nabla \ell_i(x;\, d_i) + \nabla \Psi_i(x).$$

In the case of the risk function (6), $d_i$ is drawn from the distribution $P_i$; in the case of the empirical risk (7), $d_i$ can be, e.g., drawn uniformly at random from the set of data points $d_{i,j}$, $j = 1, \dots, n_i$, with repetition across iterations. In both cases, the gradient noise $v_i = g_i(x) - \nabla f_i(x)$ clearly satisfies assumption (4); a concrete implementation of this sampling within update (2) is sketched below.
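For concreteness, the following Python sketch (ours, for illustration only; all identifiers are our own, not the paper's) implements update (2) for the empirical risk (7) with the logistic loss in (8): at every iteration an i.i.d. random edge set $E(k)$ is drawn, each node averages with its current neighbors using the weight $\beta_k = \beta_0/(k+1)^{1/2}$, and then steps along a single-sample noisy gradient with step-size $\alpha_k = \alpha_0/(k+1)$.

import numpy as np

def logistic_grad(x, a, b, kappa):
    # Gradient of l(x; a, b) = ln(1 + exp(-b a^T x)) plus the gradient of the
    # regularizer Psi(x) = (kappa/2)||x||^2; this is g_i(x) in the text.
    return (-b / (1.0 + np.exp(b * a.dot(x)))) * a + kappa * x

def run(A, B, edges, p_fail, iters, alpha0=1.0, beta0=None, kappa=0.5, seed=0):
    # Update (2) over i.i.d. random links. A[i]: (n_i, m) features at node i;
    # B[i]: (n_i,) labels in {-1, +1}; edges: undirected pairs (i, j) of the
    # nominal graph G; each link fails independently with probability p_fail.
    rng = np.random.default_rng(seed)
    N, m = len(A), A[0].shape[1]
    deg = np.zeros(N, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    if beta0 is None:
        beta0 = 1.0 / (1 + deg.max())          # beta_0 = 1/(1 + theta)
    x = np.zeros((N, m))                       # x_i(0) = 0 for all i
    for k in range(iters):
        alpha = alpha0 / (k + 1)               # alpha_k = alpha_0/(k+1)
        beta = beta0 / np.sqrt(k + 1)          # beta_k, with nu = 1/2
        live = [e for e in edges if rng.random() > p_fail]   # E(k)
        x_new = x.copy()
        for i, j in live:                      # consensus term of (2)
            x_new[i] -= beta * (x[i] - x[j])
            x_new[j] -= beta * (x[j] - x[i])
        for i in range(N):                     # innovation term of (2)
            t = rng.integers(len(B[i]))        # uniform sample, with repetition
            x_new[i] -= alpha * logistic_grad(x[i], A[i][t], B[i][t], kappa)
        x = x_new
    return x

Since the sample index is drawn uniformly with repetition, $g_i$ is a conditionally unbiased estimate of $\nabla f_i$, so the resulting noise satisfies (4); with finitely many data points per node, the bound (5) holds as well.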
To see that (4) indeed holds, consider, for example, the risk function (6), and fix iteration $k$ and node $i$'s estimate $x_i(k) = x_i$. Then:

$$E[\, v_i(k) \mid \mathcal{F}_k \,] = E\left[\, g_i(x_i(k)) - \nabla f_i(x_i(k)) \mid x_i(k) = x_i \,\right]$$
$$= E\left[\, \nabla \ell_i(x_i(k);\, d_i) \mid x_i(k) = x_i \,\right] + \nabla \Psi_i(x_i) - \left( \nabla E_{d_i \sim P_i}\left[\, \ell_i(x_i;\, d_i) \,\right] + \nabla \Psi_i(x_i) \right)$$
$$= E_{d_i \sim P_i}\left[\, \nabla \ell_i(x_i;\, d_i) \,\right] + \nabla \Psi_i(x_i) - \left( E_{d_i \sim P_i}\left[\, \nabla \ell_i(x_i;\, d_i) \,\right] + \nabla \Psi_i(x_i) \right) = 0.$$

Further, for the empirical risk (7), assumption (5) holds trivially. For the risk function (6), assumption (5) holds for a sufficiently regular distribution $P_i$. For instance, it is easy to show that the assumption holds for the logistic loss in (8) when $P_i$ has finite second moment, while it holds for the quadratic loss in (8) when $P_i$ has finite fourth moment. Note that our setting allows the data at different nodes to be generated through different distributions $P_i$, and it allows the nodes to utilize different losses $\ell_i$ and regularizers $\Psi_i$. Mathematically, this means that, in general, $\nabla f_i(x^\star) \neq 0$. In words, if a node $i$ relies only on its local data $d_i$, it cannot recover the true solution $x^\star$. Nodes therefore engage in the collaborative algorithm (2), through which, as shown ahead, they can recover the global solution $x^\star$.

¹ Similar considerations hold for a loss with respect to a mini-batch of data points; this discussion is omitted for simplicity.

3. PERFORMANCE ANALYSIS

A. Statement of main results and auxiliary lemmas

We are now ready to state our main result.

Theorem 3.1. Consider algorithm (2) with step-sizes $\alpha_k = \frac{\alpha_0}{k+1}$ and $\beta_k = \frac{\beta_0}{(k+1)^\nu}$, where $\beta_0 > 0$, $\alpha_0 > 2/\mu$, and $\nu \in (0, 1/2]$. Further, let Assumptions 1-3 hold.

Then, for each node $i$'s solution estimate $x_i(k)$ and the solution $x^\star$ of problem (1), there holds:

$$E\left[\, \|x_i(k) - x^\star\|^2 \,\right] = O(1/k).$$

We remark that the condition $\alpha_0 > 2/\mu$ can be relaxed to require only a positive $\alpha_0$, in which case the rate becomes $O(\ln k / k)$ instead of $O(1/k)$.² Also, to avoid large step-sizes at the initial iterations when $\alpha_0$ is large, the step-size $\alpha_k$ can be modified to $\alpha_k = \alpha_0/(k + k_0)$, for an arbitrary positive constant $k_0$, and Theorem 3.1 continues to hold.

Theorem 3.1 establishes the $O(1/k)$ MSE convergence rate of algorithm (2); due to the assumed strong convexity of the $f_i$'s, the theorem also implies that $E[\, f(x_i(k)) - f(x^\star) \,] = O(1/k)$. Note that the expectation in Theorem 3.1 is taken both with respect to the randomness in the gradient noises and with respect to the randomness in the underlying network. The $O(1/k)$ rate does not depend on the statistics of the underlying random network, as long as the network is connected on average (i.e., satisfies Assumption 2). The hidden constant does depend on the underlying network statistics, but simulation examples suggest that the dependence is usually not strong (see Section 4).

Proof strategy and auxiliary lemmas. Our strategy for proving Theorem 3.1 is as follows. We first establish the mean square boundedness (uniform in $k$) of the iterates $x_i(k)$, which also implies the uniform mean square boundedness of the gradients $\nabla f_i(x_i(k))$ (Subsection 3-B). We then bound, in the mean square sense, the disagreements of different nodes' estimates, i.e., the quantities $x_i(k) - x_j(k)$, showing that $E[\, \|x_i(k) - x_j(k)\|^2 \,] = O(1/k)$ (Subsection 3-C). This allows us to show that the (hypothetical) global average of the nodes' solution estimates, $\overline{x}(k) := \frac{1}{N} \sum_{i=1}^N x_i(k)$, evolves according to a stochastic gradient method whose gradient estimates have a sufficiently small bias and finite second moment. This, in turn, yields the $O(1/k)$ rate on the mean square error at the global average, from which a similar bound at the individual nodes' estimates is derived (Subsection 3-D).

In completing the strategy above, we make use of the following lemma; the lemma is a minor modification of Lemmas 4 and 5 in [11].

Lemma 3.2. Let $z(k)$ be a nonnegative (deterministic) sequence satisfying:

$$z(k+1) \le \left( 1 - r_1(k) \right) z(k) + r_2(k),$$

where $\{r_1(k)\}$ and $\{r_2(k)\}$ are deterministic sequences with

$$\frac{a_1}{(k+1)^{\delta_1}} \le r_1(k) \le 1 \quad \text{and} \quad r_2(k) \le \frac{a_2}{(k+1)^{\delta_2}},$$

with $a_1, a_2, \delta_1, \delta_2 > 0$. Then: (a) if $\delta_1 = \delta_2 = 1$, there holds $z(k) = O(1)$; (b) if $\delta_1 = 1/2$ and $\delta_2 = 3/2$, then $z(k) = O(1/k)$; and (c) if $\delta_1 = 1$, $\delta_2 = 2$, and $a_1 > 1$, then $z(k) = O(1/k)$.

² This subtlety comes from equation (32) ahead and the requirement that $c_{20} > 1$. If $c_{20} \le 1$, it can be shown that the right-hand side in (33) modifies to an $O(\ln k / k)$ quantity.

The subsequent analysis, in Subsections 3-B through 3-D, restricts to the case $\nu = 1/2$, i.e., to the consensus weights $\beta_k = \frac{\beta_0}{(k+1)^{1/2}}$. That is, for simplicity of presentation, we prove Theorem 3.1 for the case $\nu = 1/2$; as can be verified from the subsequent analysis, the proof extends to a generic $\nu \in (0, 1/2)$ as well. As another notational simplification, throughout Subsections 3-B and 3-C we let $m = 1$, to avoid extensive usage of Kronecker products; again, the proofs extend to a generic $m > 1$.

B. Mean square boundedness of the iterates

This subsection shows the uniform mean square boundedness of the algorithm iterates and of the gradients evaluated at the iterates.

Lemma 3.3. Consider algorithm (2), and let Assumptions 1-3 hold. Then, there exist nonnegative constants $c_x$ and $c_f$ such that, for all $k = 0, 1, \dots$, there holds:

$$E[\, \|x(k)\|^2 \,] \le c_x \quad \text{and} \quad E[\, \|\nabla F(x(k))\|^2 \,] \le c_f.$$

Proof. Denote by $x^o = x^\star\, \mathbf{1}$ (recall that $m = 1$) and recall (3).
Then, we have (using $W_k\, x^o = x^o$, since $L(k)\,\mathbf{1} = 0$):

$$x(k+1) - x^o = W_k \left( x(k) - x^o \right) - \alpha_k \left( \nabla F(x(k)) - \nabla F(x^o) \right) - \alpha_k\, v(k) - \alpha_k\, \nabla F(x^o). \qquad (9)$$

By the mean value theorem, we have:

$$\nabla F(x(k)) - \nabla F(x^o) = \left[ \int_{s=0}^{1} \nabla^2 F\!\left( x^o + s\,(x(k) - x^o) \right) ds \right] \left( x(k) - x^o \right) = H_k \left( x(k) - x^o \right). \qquad (10)$$

Note that $\mu\, I \preceq H_k \preceq L\, I$. Using (10) in (9), we have:

$$x(k+1) - x^o = \left( W_k - \alpha_k\, H_k \right) \left( x(k) - x^o \right) - \alpha_k\, v(k) - \alpha_k\, \nabla F(x^o). \qquad (11)$$

Denote by $\zeta(k) = x(k) - x^o$ and by $\xi(k) = \left( W_k - \alpha_k\, H_k \right) \left( x(k) - x^o \right) - \alpha_k\, \nabla F(x^o)$. Then, there holds:

$$E[\, \|\zeta(k+1)\|^2 \mid \mathcal{F}_k \,] \le \|\xi(k)\|^2 - 2\,\alpha_k\, \xi(k)^\top E[\, v(k) \mid \mathcal{F}_k \,] + \alpha_k^2\, E[\, \|v(k)\|^2 \mid \mathcal{F}_k \,] \le \|\xi(k)\|^2 + \alpha_k^2 \left( c_v\, \|x(k)\|^2 + N\, \bar c_v \right), \ \text{a.s.,} \qquad (12)$$

where we used Assumption 3 and the fact that $\xi(k)$ is measurable with respect to the sigma algebra generated by $\mathcal{F}_k$ and $L(k)$, conditioned on which $v(k)$ remains zero-mean. We next bound $\|\xi(k)\|^2$. Note that $\|W_k - \alpha_k\, H_k\| \le 1 - \mu\, \alpha_k$. Therefore, we have:

$$\|\xi(k)\| \le \left( 1 - \mu\, \alpha_k \right) \|\zeta(k)\| + \alpha_k\, \|\nabla F(x^o)\|. \qquad (13)$$

We now use the following inequality, valid for any $a, b \in \mathbb{R}$ and $\theta > 0$:

$$(a + b)^2 \le (1 + \theta)\, a^2 + \left( 1 + \frac{1}{\theta} \right) b^2. \qquad (14)$$

We set $\theta = \frac{c_0}{k+1}$, with $c_0 > 0$. Using inequality (14) in (13), we have:

$$\|\xi(k)\|^2 \le \left( 1 + \frac{c_0}{k+1} \right) \left( 1 - \alpha_k\, \mu \right)^2 \|\zeta(k)\|^2 + \left( 1 + \frac{k+1}{c_0} \right) \alpha_k^2\, \|\nabla F(x^o)\|^2.$$

Next, for $c_0 < \alpha_0\, \mu$, the last inequality implies:

$$\|\xi(k)\|^2 \le \left( 1 - \frac{c_1}{k+1} \right) \|\zeta(k)\|^2 + \frac{c_2}{k+1}\, \|\nabla F(x^o)\|^2, \qquad (15)$$

for some constants $c_1, c_2 > 0$. Combining (15) and (12), we get:

$$E[\, \|\zeta(k+1)\|^2 \mid \mathcal{F}_k \,] \le \left( 1 - \frac{c_3}{k+1} \right) \|\zeta(k)\|^2 + \frac{c_4}{k+1}, \qquad (16)$$

for some constants $c_3, c_4 > 0$. Taking expectation in (16) and applying Lemma 3.2 (case (a)), it follows that $E[\, \|\zeta(k)\|^2 \,] = E[\, \|x(k) - x^o\|^2 \,]$ is (uniformly in $k$) bounded from above by a positive constant. It is easy to see that the latter implies that $E[\, \|x(k)\|^2 \,]$ is also uniformly bounded. Using the Lipschitz continuity of $\nabla F$, we finally have that $E[\, \|\nabla F(x(k))\|^2 \,]$ is uniformly bounded as well. The proof of Lemma 3.3 is now complete.

C. Disagreement bounds

Recall the (hypothetically available) global average of the nodes' estimates, $\overline{x}(k) = \frac{1}{N} \sum_{i=1}^N x_i(k)$, and denote by $\widetilde{x}_i(k) = x_i(k) - \overline{x}(k)$ the quantity that measures how far node $i$'s solution estimate is from the global average. Introduce also the vector $\widetilde{x}(k) = \left( \widetilde{x}_1(k), \dots, \widetilde{x}_N(k) \right)^\top$, and note that it can be represented as $\widetilde{x}(k) = (I - J)\, x(k)$, where we recall $J = \frac{1}{N}\, \mathbf{1} \mathbf{1}^\top$. We have the following lemma.

Lemma 3.4. Consider algorithm (2) under Assumptions 1-3. Then, there holds:

$$E[\, \|\widetilde{x}(k)\|^2 \,] = O(1/k).$$

As detailed in the next subsection, Lemma 3.4 is important because it allows us to bound, sufficiently tightly, the bias in the gradient estimates according to which the global average $\overline{x}(k)$ evolves.

Proof. It is easy to show that the process $\{\widetilde{x}(k)\}$ follows the recursion:

$$\widetilde{x}(k+1) = \widetilde{W}_k\, \widetilde{x}(k) - \alpha_k \underbrace{ (I - J) \left( \nabla F(x(k)) + v(k) \right) }_{w(k)}, \qquad (17)$$

where $\widetilde{W}_k = W_k - J = I - L_k - J$. Note that $E[\, \|w(k)\|^2 \,] \le c_7 < \infty$, which follows from the mean square boundedness of $x(k)$ and $\nabla F(x(k))$. Then, we have:

$$\|\widetilde{x}(k+1)\| \le \|\widetilde{W}_k\, \widetilde{x}(k)\| + \alpha_k\, \|w(k)\|.$$

We now invoke Lemma 4.4 in [13] to note that, for $k \ge k_1$ with $k_1$ appropriately chosen, we have:

$$\|\widetilde{x}(k+1)\| \le \left( 1 - r(k) \right) \|\widetilde{x}(k)\| + \alpha_k\, \|w(k)\|, \qquad (18)$$

with $r(k)$ an $\mathcal{F}_{k+1}$-adapted process that satisfies $r(k) \in [0, 1]$, a.s., and:

$$E[\, r(k) \mid \mathcal{F}_k \,] \ge c_8\, \beta_k = \frac{c_9}{(k+1)^{1/2}}, \ \text{a.s.,} \qquad (19)$$

for some constants $c_8, c_9 > 0$. Using (14) in (18), we have:

$$\|\widetilde{x}(k+1)\|^2 \le \left( 1 + \theta_k \right) \left( 1 - r(k) \right)^2 \|\widetilde{x}(k)\|^2 + \left( 1 + \frac{1}{\theta_k} \right) \alpha_k^2\, \|w(k)\|^2,$$

for $\theta_k = \frac{c_{10}}{(k+1)^{1/2}}$. Then, since $r(k) \in [0,1]$ implies $E[\, (1 - r(k))^2 \mid \mathcal{F}_k \,] \le 1 - E[\, r(k) \mid \mathcal{F}_k \,]$, we have:

$$E[\, \|\widetilde{x}(k+1)\|^2 \mid \mathcal{F}_k \,] \le \left( 1 + \theta_k \right) \left( 1 - \frac{c_9}{(k+1)^{1/2}} \right) \|\widetilde{x}(k)\|^2 + \left( 1 + \frac{1}{\theta_k} \right) \alpha_k^2\, E[\, \|w(k)\|^2 \mid \mathcal{F}_k \,], \ \text{a.s.}$$

Next, for $c_{10} < c_9$ (note that $c_{10}$ can be chosen freely), we have:

$$E[\, \|\widetilde{x}(k+1)\|^2 \,] \le \left( 1 - \frac{c_{11}}{(k+1)^{1/2}} \right) E[\, \|\widetilde{x}(k)\|^2 \,] + \frac{c_{12}}{(k+1)^{3/2}}. \qquad (20)$$

Utilizing Lemma 3.2 (case (b)), inequality (20) finally yields $E[\, \|\widetilde{x}(k+1)\|^2 \,] = O(1/k)$. The proof of the lemma is complete.

D. Proof of Theorem 3.1

We are now ready to prove Theorem 3.1.

Proof. Consider the global average $\overline{x}(k) = \frac{1}{N} \sum_{i=1}^N x_i(k)$. From (3), we have:

$$\overline{x}(k+1) = \overline{x}(k) - \frac{\alpha_k}{N} \sum_{i=1}^N \left( \nabla f_i(x_i(k)) + v_i(k) \right),$$

where we denote by $\overline{v}(k) = \frac{1}{N} \sum_{i=1}^N v_i(k)$. This implies:

$$\overline{x}(k+1) = \overline{x}(k) - \frac{\alpha_k}{N} \sum_{i=1}^N \left[ \nabla f_i(x_i(k)) - \nabla f_i(\overline{x}(k)) + \nabla f_i(\overline{x}(k)) \right] - \alpha_k\, \overline{v}(k).$$

Recall $f(\cdot) = \sum_{i=1}^N f_i(\cdot)$. Then, we have:

$$\overline{x}(k+1) = \overline{x}(k) - \frac{\alpha_k}{N}\, \nabla f(\overline{x}(k)) - \frac{\alpha_k}{N} \sum_{i=1}^N \left[ \nabla f_i(x_i(k)) - \nabla f_i(\overline{x}(k)) \right] - \alpha_k\, \overline{v}(k), \qquad (21)$$

which implies:

$$\overline{x}(k+1) = \overline{x}(k) - \frac{\alpha_k}{N} \left[\, \nabla f(\overline{x}(k)) + e(k) \,\right], \qquad (22)$$

where

$$e(k) = N\, \overline{v}(k) + \underbrace{ \sum_{i=1}^N \left( \nabla f_i(x_i(k)) - \nabla f_i(\overline{x}(k)) \right) }_{\epsilon(k)}. \qquad (23)$$

Note that $\|\nabla f_i(x_i(k)) - \nabla f_i(\overline{x}(k))\| \le L\, \|x_i(k) - \overline{x}(k)\| = L\, \|\widetilde{x}_i(k)\|$. Thus, using Lemma 3.4, we can conclude for $\epsilon(k) = \sum_{i=1}^N \left( \nabla f_i(x_i(k)) - \nabla f_i(\overline{x}(k)) \right)$ the following:

$$E[\, \|\epsilon(k)\|^2 \,] \le \frac{c_{15}}{k+1}. \qquad (24)$$

Note here that (22) is an inexact gradient method for minimizing $f$, with step-size $\alpha_k/N$ and random gradient error $e(k)$. The term $\overline{v}(k)$ is zero-mean, while the gradient estimate bias is induced by $\epsilon(k)$; per (24), the bias is at most $O(1/k)$ in the mean square sense. With the above development in place, we rewrite (22) as follows:

$$\overline{x}(k+1) = \overline{x}(k) - \frac{\alpha_k}{N}\, \nabla f(\overline{x}(k)) - \frac{\alpha_k}{N}\, \epsilon(k) - \alpha_k\, \overline{v}(k). \qquad (25)$$

This implies, recalling that $x^\star$ is the solution to (1), so that $\nabla f(x^\star) = 0$:

$$\overline{x}(k+1) - x^\star = \left( \overline{x}(k) - x^\star \right) - \frac{\alpha_k}{N} \Big[\, \nabla f(\overline{x}(k)) - \underbrace{\nabla f(x^\star)}_{=\,0} \,\Big] - \frac{\alpha_k}{N}\, \epsilon(k) - \alpha_k\, \overline{v}(k). \qquad (26)$$

By the mean value theorem, we have:

$$\nabla f(\overline{x}(k)) - \nabla f(x^\star) = \left[ \int_{s=0}^{1} \nabla^2 f\!\left( x^\star + s\,(\overline{x}(k) - x^\star) \right) ds \right] \left( \overline{x}(k) - x^\star \right) = \overline{H}_k \left( \overline{x}(k) - x^\star \right), \qquad (28)$$

where it is to be noted that $N \mu\, I \preceq \overline{H}_k \preceq N L\, I$. Using (28) in (26), we have:

$$\overline{x}(k+1) - x^\star = \left[ I - \frac{\alpha_k}{N}\, \overline{H}_k \right] \left( \overline{x}(k) - x^\star \right) - \frac{\alpha_k}{N}\, \epsilon(k) - \alpha_k\, \overline{v}(k). \qquad (29)$$

Denote by

$$m(k) = \left[ I - \frac{\alpha_k}{N}\, \overline{H}_k \right] \left( \overline{x}(k) - x^\star \right) - \frac{\alpha_k}{N}\, \epsilon(k).$$

Then, (29) is rewritten as:

$$\overline{x}(k+1) - x^\star = m(k) - \alpha_k\, \overline{v}(k), \qquad (30)$$

and so:

$$\|\overline{x}(k+1) - x^\star\|^2 = \|m(k)\|^2 - 2\,\alpha_k\, m(k)^\top \overline{v}(k) + \alpha_k^2\, \|\overline{v}(k)\|^2.$$

The latter equality implies:

$$E[\, \|\overline{x}(k+1) - x^\star\|^2 \mid \mathcal{F}_k \,] \le \|m(k)\|^2 - 2\,\alpha_k\, m(k)^\top E[\, \overline{v}(k) \mid \mathcal{F}_k \,] + \alpha_k^2\, E[\, \|\overline{v}(k)\|^2 \mid \mathcal{F}_k \,], \ \text{a.s.}$$

Taking expectation, using the fact that $E[\, \overline{v}(k) \mid \mathcal{F}_k \,] = 0$, Assumption 3, and Lemma 3.3, we obtain:

$$E[\, \|\overline{x}(k+1) - x^\star\|^2 \,] \le E[\, \|m(k)\|^2 \,] + \frac{c_{17}}{(k+1)^2}, \qquad (31)$$

for some constant $c_{17} > 0$. Next, using (14), we have for $m(k)$ the following:

$$\|m(k)\|^2 \le \left( 1 + \theta_k \right) \left\| I - \frac{\alpha_k}{N}\, \overline{H}_k \right\|^2 \|\overline{x}(k) - x^\star\|^2 + \left( 1 + \frac{1}{\theta_k} \right) \frac{\alpha_k^2}{N^2}\, \|\epsilon(k)\|^2 \le \left( 1 + \theta_k \right) \left( 1 - c_{18}\, \alpha_k \right)^2 \|\overline{x}(k) - x^\star\|^2 + \left( 1 + \frac{1}{\theta_k} \right) \frac{\alpha_k^2}{N^2}\, \|\epsilon(k)\|^2,$$

with $c_{18} = \mu$, because $N \mu\, I \preceq \overline{H}_k \preceq N L\, I$. After choosing $\theta_k = \frac{c_{19}}{k+1}$ such that $c_{19} < \alpha_0\, c_{18}/2 \ (= \alpha_0\, \mu/2)$, and after taking expectation, we obtain:

$$E[\, \|m(k)\|^2 \,] \le \left( 1 - \frac{c_{20}}{k+1} \right) E[\, \|\overline{x}(k) - x^\star\|^2 \,] + \frac{c_{21}}{(k+1)^2}, \qquad (32)$$

where $c_{20} > 1$ (indeed, $c_{20} \ge \alpha_0\, \mu/2 > 1$ because $\alpha_0 > 2/\mu$) and $c_{21}$ is a positive constant. Combining (32) and (31), we get:

$$E[\, \|\overline{x}(k+1) - x^\star\|^2 \,] \le \left( 1 - \frac{c_{20}}{k+1} \right) E[\, \|\overline{x}(k) - x^\star\|^2 \,] + \frac{c_{21}}{(k+1)^2} + \frac{c_{17}}{(k+1)^2}.$$

Invoking Lemma 3.2 (case (c)), the latter inequality implies:

$$E[\, \|\overline{x}(k+1) - x^\star\|^2 \,] \le \frac{c_{22}}{k+1}, \qquad (33)$$

for some constant $c_{22} > 0$. Therefore, for the global average $\overline{x}(k)$, we have obtained the mean square rate $O(1/k)$. Finally, we note that:

$$x_i(k) - x^\star = \widetilde{x}_i(k) + \left( \overline{x}(k) - x^\star \right).$$

After using

$$\|x_i(k) - x^\star\|^2 \le 2\, \|\widetilde{x}_i(k)\|^2 + 2\, \|\overline{x}(k) - x^\star\|^2$$

and taking expectation, it follows that $E[\, \|x_i(k) - x^\star\|^2 \,] = O(1/k)$, for all $i = 1, \dots, N$. The proof is complete.
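The workhorse behind both Lemma 3.4 and the rate just derived is Lemma 3.2: a contraction factor $1 - a_1/(k+1)$ with $a_1 > 1$, acting against a perturbation of order $1/(k+1)^2$, forces a $1/k$ decay. The following quick numerical check (ours, purely illustrative) confirms case (c) by verifying that $k\, z(k)$ stays bounded:

# Sanity check of Lemma 3.2, case (c): with r1(k) = a1/(k+1), a1 > 1, and
# r2(k) = a2/(k+1)^2, the sequence z(k) should satisfy z(k) = O(1/k),
# i.e., the product k * z(k) should remain bounded.
a1, a2, z = 1.5, 1.0, 10.0
for k in range(1, 100001):
    # k plays the role of k+1; the clamp keeps r1(k) <= 1, as the lemma requires
    z = max(0.0, 1.0 - a1 / k) * z + a2 / k**2
    if k in (10, 100, 1000, 10000, 100000):
        print(f"k = {k:6d}, k * z(k) = {k * z:.4f}")

The printed products settle near $a_2/(a_1 - 1)$, consistent with the $O(1/k)$ claim; re-running with $a_1 = 1$ exhibits the slower $\ln k / k$ behavior noted in the remark after Theorem 3.1.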

4. SIMULATION EXAMPLE

We provide a simulation example on $\ell_2$-regularized logistic losses and random networks where links fail independently over iterations and across different links, with probability $p_{\mathrm{fail}}$. The simulation corroborates the derived $O(1/k)$ rate of algorithm (2) over random networks and shows that the deterioration due to an increase of $p_{\mathrm{fail}}$ is small.

We consider the empirical risk minimization (7) with the logistic loss in (8) and the regularization functions set to $\Psi_i(x) = \frac{\kappa}{2}\, \|x\|^2$, $i = 1, \dots, N$, where $\kappa > 0$ is the regularization parameter, set here to $\kappa = 0.5$. The number of data points per node is $n_i = 10$. We generate the true classification vector $x^\star = \left( (x_1^\star)^\top, x_0^\star \right)^\top$ by drawing its entries independently from the standard normal distribution. The class labels are then generated as $b_{i,j} = \mathrm{sign}\left( (x_1^\star)^\top a_{i,j} + x_0^\star + \epsilon_{i,j} \right)$, where the $\epsilon_{i,j}$'s are drawn independently from a normal distribution with zero mean and standard deviation 2. The feature vectors $a_{i,j}$, $j = 1, \dots, n_i$, at node $i$ are generated as follows: each entry of each vector is the sum of a standard normal random variable and a uniform random variable with support $[0, 5i]$. Different entries within a feature vector are drawn independently, and different vectors are also drawn independently, both intra-node and inter-node. Note that, as a result, the feature vectors at different nodes are drawn from different distributions.

The algorithm parameters are set as follows. We let $\beta_k = \frac{1}{\theta\,(k+1)^{1/2}}$ and $\alpha_k = \frac{1}{k+1}$, $k = 0, 1, \dots$. Here, $\theta$ is the maximal degree across all nodes in the network, which here equals $\theta = 6$. Algorithm (2) is initialized with $x_i(0) = 0$, for all $i = 1, \dots, N$.

We consider a connected network $G$ with $N = 10$ nodes and 23 links, generated as a random geometric graph: nodes are placed uniformly at random on a unit square, and node pairs whose distance is less than a prescribed radius are connected by an edge. We consider the random network model in which each (undirected) link in network $G$ fails independently across iterations and independently from other links, with probability $p_{\mathrm{fail}}$. We consider the cases $p_{\mathrm{fail}} \in \{0,\, 0.5,\, 0.9\}$. Note that the case $p_{\mathrm{fail}} = 0$ corresponds to network $G$ with all its links always online, more precisely, with links failing with zero probability. Algorithm (2) is then run on each of the described network models, i.e., for each $p_{\mathrm{fail}} \in \{0,\, 0.5,\, 0.9\}$. This allows us to assess how much the algorithm's performance degrades as $p_{\mathrm{fail}}$ increases.

We also include a comparison with the following centralized stochastic gradient method:

$$y(k+1) = y(k) - \frac{1}{(k+1)\, N} \sum_{i=1}^N \nabla \ell_i\!\left( y(k);\, a_i(k), b_i(k) \right), \qquad (34)$$

where $(a_i(k), b_i(k))$ is drawn uniformly from the set $(a_{i,j}, b_{i,j})$, $j = 1, \dots, n_i$. Note that algorithm (34) forms an unbiased estimate of $\frac{1}{N} \sum_{i=1}^N \nabla f_i(y(k))$ by drawing a sample uniformly at random from each node's data set. Algorithm (34) is an idealization of (2): it shows how (2) would be implemented if there existed a fusion node with access to all nodes' data. Hence, the comparison with (34) allows us to examine how much the performance of (2) degrades due to the lack of global information, i.e., due to the distributed nature of the considered problem. Note that the step-size in (34) is set to $1/(k+1)$ for a meaningful comparison with (2), as this is the step-size effectively utilized by the hypothetical global average of the nodes' iterates under (2).

As an error metric, we use the mean square error (MSE) estimate averaged across nodes: $\frac{1}{N} \sum_{i=1}^N \|x_i(k) - x^\star\|^2$. Figure 1 plots the estimated MSE, averaged across 100 algorithm runs, versus the iteration number, for different values of the parameter $p_{\mathrm{fail}}$, in $\log_{10}$-$\log_{10}$ scale.
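A minimal Python sketch of this experimental setup (our reconstruction, reusing the run() helper from the sketch in Section 2; the geometric-graph radius, the random seed, and the use of the data-generating vector in the error metric are our simplifying assumptions, not the authors' exact choices):

import numpy as np

rng = np.random.default_rng(1)
N, m, n_i, radius = 10, 5, 10, 0.45

# Random geometric graph: uniform points on the unit square; connect pairs
# closer than `radius` (connectivity check omitted for brevity).
pts = rng.random((N, 2))
edges = [(i, j) for i in range(N) for j in range(i + 1, N)
         if np.linalg.norm(pts[i] - pts[j]) < radius]

# Heterogeneous data: each feature entry is a standard normal plus a
# Uniform[0, 5i] variable (i = 1, ..., N), so nodes see different
# distributions; the intercept x_0 is omitted here for brevity.
x_true = rng.standard_normal(m)
A = [rng.standard_normal((n_i, m)) + 5 * (i + 1) * rng.random((n_i, m))
     for i in range(N)]
B = [np.sign(A[i].dot(x_true) + 2 * rng.standard_normal(n_i)) for i in range(N)]

for p_fail in (0.0, 0.5, 0.9):
    x = run(A, B, edges, p_fail, iters=10000)         # update (2), sketched earlier
    mse = np.mean(np.sum((x - x_true) ** 2, axis=1))  # (1/N) sum_i ||x_i - x*||^2
    print(f"p_fail = {p_fail}: MSE ~ {mse:.3e}")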
Note that the slope of a plotted curve here corresponds to the sublinear rate of the method; e.g., a slope of $-1$ corresponds to a $1/k$ rate. First, note from the figure that, for any value of $p_{\mathrm{fail}}$, algorithm (2) achieves (on this example, at least) the $1/k$ rate, thus corroborating our theory. Next, note that increasing the link failure probability only increases the constant in the MSE but does not affect the rate (the curves that correspond to different values of $p_{\mathrm{fail}}$ are vertically shifted). Interestingly, the loss due to the increase of $p_{\mathrm{fail}}$ is small; e.g., the curves that correspond to $p_{\mathrm{fail}} = 0.5$ and $p_{\mathrm{fail}} = 0$ (no link failures) practically match. The figure also shows the performance of the centralized method (34). We can see that, except for the first few iterations, the distributed method (2) is very close in performance to the centralized method.

Fig. 1: Estimated MSE versus iteration number for algorithm (2) with link failure probability $p_{\mathrm{fail}} = 0$ (red, solid line); $0.5$ (blue, dashed line); and $0.9$ (green, dash-dot line). The figure also shows the performance of the centralized stochastic gradient method (34) (black, dotted line).

5. CONCLUSION

We considered a distributed stochastic gradient method for smooth strongly convex optimization. Through the analysis of the considered method, we established, for the first time in the assumed optimization setting, the order-optimal $O(1/k)$ MSE convergence rate when the underlying network is randomly varying.

A simulation example on $\ell_2$-regularized logistic losses corroborates the established $O(1/k)$ rate and suggests that the effect of the underlying network's statistics on the $O(1/k)$ rate's hidden constant is small.

REFERENCES

[1] S. Kar and J. M. F. Moura, "Convergence rate analysis of distributed gossip (linear parameter) estimation: Fundamental limits and tradeoffs," IEEE Journal of Selected Topics in Signal Processing, Signal Processing in Gossiping Algorithms Design and Applications, vol. 5, no. 4, pp. 674-690, Aug. 2011.
[2] F. Bullo, J. Cortes, and S. Martinez, Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms. Princeton University Press, 2009.
[3] A. Daneshmand, F. Facchinei, V. Kungurtsev, and G. Scutari, "Hybrid random/deterministic parallel algorithms for nonconvex big data optimization," submitted to IEEE Trans. on Signal Processing, 2014.
[4] I. Lobel and A. E. Ozdaglar, "Distributed subgradient methods for convex optimization over random networks," IEEE Trans. Automat. Contr., vol. 56, no. 6, pp. 1291-1306, Jan. 2011.
[5] I. Lobel, A. Ozdaglar, and D. Feijer, "Distributed multi-agent optimization with state-dependent communication," Mathematical Programming, vol. 129, no. 2, pp. 255-284, 2011.
[6] D. Jakovetic, J. Xavier, and J. M. F. Moura, "Convergence rates of distributed Nesterov-like gradient methods on random networks," IEEE Transactions on Signal Processing, vol. 62, no. 4, pp. 868-882, February 2014.
[7] K. Tsianos and M. Rabbat, "Distributed strongly convex optimization," 50th Annual Allerton Conference on Communication, Control, and Computing, Oct. 2012.
[8] Z. J. Towfic, J. Chen, and A. H. Sayed, "Excess-risk of distributed stochastic learners," IEEE Transactions on Information Theory, vol. 62, no. 10, Oct. 2016.
[9] D. Yuan, Y. Hong, D. W. C. Ho, and G. Jiang, "Optimal distributed stochastic mirror descent for strongly convex optimization," Automatica, vol. 90, pp. 196-203, April 2018.
[10] N. D. Vanli, M. O. Sayin, and S. S. Kozat, "Stochastic subgradient algorithms for strongly convex optimization over distributed networks," IEEE Transactions on Network Science and Engineering, vol. 4, no. 4, pp. 248-260, Oct.-Dec. 2017.
[11] A. Nedic and A. Olshevsky, "Stochastic gradient-push for strongly convex functions on time-varying directed graphs," IEEE Transactions on Automatic Control, vol. 61, no. 12, pp. 3936-3947, Dec. 2016.
[12] A. K. Sahu, D. Jakovetic, D. Bajovic, and S. Kar, "Distributed zeroth order optimization over random networks: A Kiefer-Wolfowitz stochastic approximation approach," 2018, available at https://www.dropbox.com/s/fc2hgbfcx5yhr8/maincdc208kwsa.pdf.
[13] S. Kar, J. M. F. Moura, and H. V. Poor, "Distributed linear parameter estimation: Asymptotically efficient adaptive strategies," SIAM J. Control and Optimization, vol. 51, no. 3, pp. 2200-2229, 2013.