Excess-Risk of Distributed Stochastic Learners


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. XX, NO. XX, MONTH 2016

Excess-Risk of Distributed Stochastic Learners

Zaid J. Towfic, Member, IEEE, Jianshu Chen, Member, IEEE, and Ali H. Sayed, Fellow, IEEE

Copyright © 2016 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org. This work was supported in part by NSF grants CCF-098, ECCS-4077, and CCF. A short version of this work appears in the conference presentations [1], [2]. Z. J. Towfic and J. Chen were with the Department of Electrical Engineering, University of California, Los Angeles. Z. J. Towfic is currently with MIT Lincoln Laboratory, Lexington, MA 04. J. Chen is currently with Microsoft Research, Redmond, WA. This work was performed while they were PhD students at UCLA. A. H. Sayed is with the Department of Electrical Engineering, University of California, Los Angeles, CA. Email: {ztowfic,cjs09,sayed}@ucla.edu.

Abstract: This work studies the learning ability of consensus and diffusion distributed learners from continuous streams of data arising from different but related statistical distributions. Four distinctive features of diffusion learners are revealed in relation to other decentralized schemes, even under left-stochastic combination policies. First, closed-form expressions for the evolution of their excess-risk are derived for strongly-convex risk functions under a diminishing step-size rule. Second, using these results, it is shown that the diffusion strategy improves the asymptotic convergence rate of the excess-risk relative to non-cooperative schemes. Third, it is shown that when the in-network cooperation rules are designed optimally, the performance of the diffusion implementation can outperform that of naive centralized processing. Finally, the arguments further show that diffusion outperforms consensus strategies asymptotically, and that the asymptotic excess-risk expression is invariant to the particular network topology. The framework adopted in this work studies convergence in the stronger mean-square-error sense, rather than in distribution, and develops tools that enable a close examination of the differences between distributed strategies in terms of asymptotic behavior, as well as in terms of convergence rates.

Index Terms: distributed stochastic optimization, diffusion strategies, consensus strategies, centralized processing, excess-risk, asymptotic behavior, convergence rate, combination policy

I. INTRODUCTION

Machine learning applications rely on the premise that it is possible to benefit from leveraging information collected from different users. The range of benefits, and the computational cost necessary to analyze the data, depend on how the information is mined. It is sometimes advantageous to aggregate the information from all users at a central location for processing and analysis. Many current implementations rely on this centralized approach. However, the rapid increase in the number of users, coupled with privacy and communication constraints related to transmitting, storing, and analyzing huge amounts of data at remote central locations, has been serving as strong motivation for the development of decentralized solutions to learning and data mining [3] [].

In this work, we study the distributed real-time prediction problem over a network of N learners. We assume the network is connected, meaning that any two arbitrary agents are either
connected directly or by means of a path passing through other agents. We do not expect the agents to share their data sets, but only a parameter vector or a statistic that is representative of their local information. Such networks serve as useful models for peer-to-peer networks and social networks. The objective of the learning process is for all nodes to minimize some objective function, termed the risk function, in a distributed manner. We shall compare the performance of cooperative and non-cooperative solutions by examining the gap between the risk achieved by the distributed implementations and the risk achieved by an oracle solution with access to the true distribution of the input data; we shall refer to this gap as the excess-risk.

Among other contributions, this work studies stochastic gradient-based distributed strategies that are shown here to converge in the mean-square-error sense when a decaying step-size sequence is used, and that are also shown to outperform other implementations, even under left-stochastic combination rules [] [5]. Specifically, we will show that the strategies under study achieve a better convergence rate than non-cooperative algorithms, and we will also explain why diffusion strategies outperform other distributed solutions, such as those relying on consensus constructions or on doubly-stochastic combination policies [4], [6]–[9], as well as naive centralized algorithms [3]. It was previously shown that diffusion strategies outperform their consensus-based counterparts in the constant step-size scenario [0]. We show analytically that the same conclusion holds in the diminishing step-size scenario, even as the step-size decays. We also illustrate in the simulations that, while diffusion and consensus-based algorithms have the same computational complexity, diffusion algorithms reduce the overshoot during the transient phase. In comparison to the useful work [6], our formulation studies convergence in the stronger mean-square-error sense, and develops analysis tools that do not depend on the central limit theorem or on studying convergence in the weaker distributional sense. In addition, unlike the works [3], [0], [], [], we are not solely interested in bounding the excess-risk. Instead, we are interested in obtaining a closed-form expression for the asymptotic excess-risk of distributed and non-distributed strategies, in order to compare and optimize their absolute asymptotic performance.

Recently, there has also been interest in primal-dual approaches for distributed optimization [6], [3]–[6]. Generally, these approaches are studied in the deterministic optimization context, where the iterates are not prone to noise, or when the risk function is non-differentiable. It was demonstrated in [3] that the primal diffusion strategy studied in this manuscript also outperforms augmented Lagrangian and Arrow-Hurwicz primal-dual algorithms in the stochastic constant step-size setting, in both stability and steady-state performance. It is possible to carry out similar comparisons in the diminishing

step-size scenario, but this manuscript is focused on the study of primal approaches. As the extended analysis and derivations in later sections and the appendices show, this case is already demanding enough to warrant separate consideration in this work.

The techniques developed here allow us to examine analytically and closely the differences between distributed strategies in terms of asymptotic behavior, as well as in terms of rates of convergence, by exploiting properties of Gamma functions and the convergence properties of products of infinitely many scaling coefficients. For instance, when the noise profile is uniform across all agents, one of our conclusions will be to show that the convergence rate of diffusion strategies is in the order of Θ(1/(Ni)), where the notation f(n) = Θ(g(n)) means that the sequence f(n) decays at the same rate as g(n) for sufficiently large n, i.e., there exist positive constants c₁ and c₂ and an integer n_o such that c₁ g(n) ≤ f(n) ≤ c₂ g(n) for all n > n_o. This rate is consistent with the result established for consensus implementations under doubly-stochastic combination policies in [6], [7], [9], albeit under the weaker sense of convergence in distribution, where it was argued that the estimation error approaches a Gaussian distribution whose covariance matrix scales as 1/(Ni). On the other hand, when the noise profile is non-uniform across the agents, the analysis will show that diffusion methods can surpass this rate. These and other useful conclusions will follow from the detailed mean-square and convergence analyses that are carried out in the sequel. The theoretical findings are illustrated by simulations in the last section. For ease of reference, we summarize here the main conclusions of the manuscript:

- We derive a closed-form expression, and not only a bound, for the asymptotic excess-risk curve of the distributed strategies.
- We analyze the derived expression to conclude that the asymptotic performance depends on the network topology solely through the Perron vector of the combination matrix used in the strategy. In this way, different topology structures with the same Perron vector are shown to attain the same asymptotic performance. That is, the full eigenstructure of the topology becomes irrelevant in the asymptotic regime.
- We show that once the Perron vector is optimized to minimize the asymptotic excess-risk, it is possible to construct a combination matrix with that Perron vector, in order to attain the optimal performance in a fully distributed manner.
- We compare the asymptotic excess-risk performance of the diffusion strategy to centralized and non-cooperative strategies, to conclude that the diffusion strategy can attain the performance of a weighted centralized strategy asymptotically.
- We compare the asymptotic excess-risk performance of the diffusion strategy to consensus distributed strategies, to conclude that the asymptotic excess-risk curve of the consensus strategy will be worse than that of the diffusion strategy.
- We verify our conclusions through simulations.

Notation: Random quantities are denoted in boldface. Throughout the manuscript, all vectors are column vectors. Matrices are denoted in capital letters, while vectors and scalars are denoted in lowercase letters. Variables that aggregate quantities across the network are denoted in calligraphic letters.
Unless otherwise noted, the notation ‖·‖ refers to the Euclidean norm for vectors and to the matrix norm induced by the Euclidean norm. Furthermore, the notation ⊗ denotes the Kronecker product operation [7, p. 39], and 1_N denotes a vector of dimension N with all its elements equal to one.

II. PROBLEM FORMULATION AND ALGORITHMS

Consider a network of N learners. Each learner k is subject to a streaming sequence of independent data samples x_{k,i}, for i = 1, 2, ..., arising from some fixed distribution X_k. The goal of each agent is to learn the vector w^o ∈ R^M that optimizes the average of some loss function, say, E Q(w, x_{k,i}), where the expectation is over the distribution of the data x_{k,i} and w ∈ R^M is the vector variable of optimization. For example, in order to learn the hyper-plane that best separates feature data h_{k,i} ∈ R^M belonging to one of two classes y_k(i) ∈ {+1, −1}, a regularized logistic-regression (RLR) algorithm would minimize the expected value of the following loss function over w, with the expectation computed over the distribution of the data x_{k,i} ≜ {h_{k,i}, y_k(i)} ∼ X_k []:

    Q_RLR(w, h_{k,i}, y_k(i)) ≜ ρ‖w‖² + log(1 + e^{−y_k(i) h_{k,i}^T w})    (1)

while a mean-square-error algorithm would minimize the expected value of the quadratic loss [8], also referred to as the delta rule [9]:

    Q_LMS(w, h_{k,i}, y_k(i)) ≜ ρ‖w‖² + (y_k(i) − h_{k,i}^T w)²    (2)

The expectation of the loss function over the distribution of the data is referred to as the risk function [30, p. 6]:

    J_k(w) ≜ E{Q(w, x_{k,i})}    [risk function]    (3)

and we denote the optimizer of (3) by w^o:

    w^o ≜ arg min_w J_k(w)    (4)

where w^o is unique when J_k(w) is strongly-convex, which we shall assume for the remainder of the manuscript. The assumption of strong-convexity of the risk function is important in practice, since the convergence rate of most stochastic approximation strategies is significantly reduced when the condition does not hold []. This is not a limitation in most problems arising in the context of adaptation and learning, since regularization (such as ℓ₂-regularization) is often used and it helps ensure strong convexity. The risk function can be viewed as a measure of the prediction-error of a classifier or regression method, since it evaluates the performance of the method against samples taken from the distribution of the input data that have not yet been observed by the classifier/regressor [30, p. 6]. The risk serves as a measure of how well an estimate w will perform on a new sample x_{k,i} ∼ X_k on
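To make the two losses concrete, the following is a minimal Python sketch of (1)-(2) and their gradients. The function names and the placement of the regularization factor ρ follow the reconstruction above and are illustrative assumptions only:

```python
import numpy as np

def q_rlr(w, h, y, rho):
    """Regularized logistic loss (1) and its gradient with respect to w.
    y is a label in {+1, -1}; rho is the regularization constant."""
    margin = y * (h @ w)
    loss = rho * (w @ w) + np.log1p(np.exp(-margin))
    # gradient of log(1 + exp(-y h^T w)) is -y h / (1 + exp(y h^T w))
    grad = 2 * rho * w - y * h / (1.0 + np.exp(margin))
    return loss, grad

def q_lms(w, h, y, rho):
    """Regularized quadratic (delta-rule) loss (2) and its gradient."""
    err = y - h @ w
    loss = rho * (w @ w) + err ** 2
    grad = 2 * rho * w - 2 * err * h
    return loss, grad
```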

average. For this reason, the risk is also referred to as the generalization ability of the classifier.

We will assume for the remainder of this exposition that the optimizer w^o is the same for all nodes k = 1, ..., N. This case is common both in machine learning, where, for example, X_k = X for all k, and in distributed inference applications, where the distributions X_k depend on a common parameter vector w^o to be optimized (see Sec. V further ahead). In order to measure the performance of each learner, we define the excess-risk (ER) at node k as:

    ER_k(i) ≜ E{J_k(w_{k,i}) − J_k(w^o)}    (5)

where w_{k,i} denotes the estimator of w^o that is computed by node k at time i (i.e., the estimator that is generated by observing past data within the neighborhood of node k). The excess-risk is non-negative because J_k(w) is strongly-convex and, therefore, J_k(w) > J_k(w^o) for all w ≠ w^o. The expectation in (5) is over the data, since w_{k,i} is a random quantity that depends on all the data samples up to time i, i.e., {x_{l,j} : j ≤ i, 1 ≤ l ≤ N}. The dependence on the data from the other agents arises from the network topology. Our interest in this work is to characterize the convergence rate, to zero, of the excess-risk for various distributed and non-distributed strategies of learning w_{k,i} for a given loss function. We also derive closed-form expressions for the asymptotic excess-risk and compare the absolute values of the excess-risk curves for algorithms that converge at the same rate.

There are various approaches to optimizing (4). We concentrate on fully-distributed strategies that operate over sparsely-connected networks. The concept of a fully distributed strategy is used here to mean the following: (i) there is no central node that coordinates the communication and computation during the learning process; (ii) a node does not need to be connected to all other nodes; indeed, as long as the network is connected (and it can be sparsely connected), the algorithm is able to approach the solution of the global learning problem; and (iii) only one-hop communication is allowed during the learning process. That is, we do not allow the routing of data packets over the network. Instead, each agent/node is only allowed to communicate directly with its immediate neighbors. Figure 1 illustrates the types of topologies examined in this manuscript. It is important to notice that the centralized and fully-connected topologies are theoretically equivalent but practically different, as the centralized topology greatly reduces the amount of information that is communicated throughout the network. The centralized topology, however, is not robust to node failure, since the entire solution breaks down if the central node fails. Throughout the remainder of the manuscript, we will refer to the centralized and fully-connected approaches interchangeably, since they have identical excess-risk performance.

A. Non-Cooperative Strategy

First, we examine the non-cooperative strategy for optimizing (4), which is for each node k to run independently a stochastic gradient algorithm of the following form for i ≥ 1 [3]–[34]:

    w_{k,i} = w_{k,i−1} − µ(i) ∇_w Q(w_{k,i−1}, x_{k,i})    [no cooperation]    (6)

where ∇_w Q denotes the gradient vector of the loss function, and µ(i) ≥ 0 is a step-size sequence. The gradient vector employed in (6) is an approximation for the actual gradient vector, ∇_w J_k, of the risk function. The difference between the true gradient vector and its approximation used in (6) is called gradient noise.
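A minimal sketch of the non-cooperative recursion (6) with the decaying step-size µ(i) = µ/i used later in the analysis; `sample` and `grad_q` are assumed, user-supplied callables:

```python
import numpy as np

def noncooperative_sgd(sample, grad_q, w0, mu, num_iters):
    """Stochastic gradient recursion (6) at a single, isolated node.

    sample(i)    -> data sample x_{k,i} observed at time i
    grad_q(w, x) -> stochastic gradient of the loss Q at w
    """
    w = np.array(w0, dtype=float)
    for i in range(1, num_iters + 1):
        x = sample(i)
        w -= (mu / i) * grad_q(w, x)   # w_{k,i} = w_{k,i-1} - mu(i) grad Q
    return w
```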
Due to the presence of the gradient noise, the estimate w_{k,i} generated by (6) becomes a random quantity; we use boldface letters to refer to random variables throughout the manuscript, which is already reflected in our notation in (6). It is shown in [3], [35] that for strongly-convex risk functions J_k(w), the non-cooperative scheme (6) achieves an asymptotic convergence rate in the order of O(1/i) under some conditions on the gradient noise and the step-size sequence µ(i), where the notation f(n) = O(g(n)) means that the sequence f(n) decays at a rate that is at most the rate of decay of g(n) for sufficiently large n, i.e., there exist a positive constant c and an integer n₀ such that f(n) ≤ c g(n) for all n > n₀. In this way, in order to achieve an excess-risk accuracy on the order of O(ε), the non-cooperative algorithm (6) requires O(1/ε) samples. It is further shown in [3], [36] that no algorithm can improve upon this rate under the same conditions. This implies that if no cooperation is to take place between the nodes, then the best asymptotic rate each learner can hope to achieve is on the order of O(1/i).

B. Centralized Strategy

Now, in place of the non-cooperative strategy, let us assume that the N nodes transmit their samples to a central processor, which executes the following algorithm:

    w_i = w_{i−1} − (µ(i)/N) Σ_{k=1}^N ∇_w Q(w_{i−1}, x_{k,i})    [unweighted cent.]    (7)

It can be shown that this implementation has an asymptotic convergence rate in the order of O(1/(Ni)) for step-size sequences of the form µ(i) = µ/i (see Corollary 2). In other words, the centralized implementation (7) provides an N-fold improvement in convergence rate relative to the non-cooperative solution (6). One of the questions we wish to answer in this work is whether it is possible to derive a fully distributed algorithm that allows every node in the network to converge in the mean-square-error sense at the same rate as the centralized solution, i.e., O(1/(Ni)), with only communication between neighboring nodes and for general ad-hoc networks. We show that this task is indeed possible. We will additionally show that the distributed strategy can outperform the naive centralized implementation (7) when the gradient noise profile across the agents is non-uniform, but that it will match the performance of a weighted version of (7), namely, the following weighted centralized strategy:

    w_i = w_{i−1} − µ(i) Σ_{k=1}^N π_k ∇_w Q(w_{i−1}, x_{k,i})    [weighted cent.]    (8)

where the {π_k} are convex combination weights that satisfy:

    π_k > 0,    Σ_{k=1}^N π_k = 1    (9)

and are meant to discount gradients with higher noise power relative to the others. We next describe two popular fully-distributed strategies.

Fig. 1: Different network topologies for distributed learning: (a) a non-cooperative topology, where agents do not interact with one another; (b) a fully-connected topology, where every agent is connected to every other agent in the network; (c) a centralized topology, which is equivalent in performance to the fully-connected topology but requires less communication overhead; and (d) a sparsely-connected network topology, in which each agent is connected to a subset of the agents in the network.

C. Diffusion Strategy

Following the approach of [4], [3], [34], the diffusion strategy for evaluating distributed estimates w_{k,i} of w^o in (4) takes the following form:

    ψ_{k,i} = w_{k,i−1} − µ(i) ∇_w Q(w_{k,i−1}, x_{k,i})    [adaptation]    (10)
    w_{k,i} = Σ_{l=1}^N a_{lk} ψ_{l,i}    [aggregation]    (11)

where x_{k,i} denotes the current data sample available at node k. Each node k begins with an estimate w_{k,0} and employs a diminishing positive step-size sequence µ(i). The non-negative coefficients {a_{lk}}, which form a left-stochastic N × N combination matrix A = [a_{lk}], are used to scale information arriving at node k from its neighbors. These coefficients are chosen to satisfy:

    Σ_{l=1}^N a_{lk} = 1,    a_{lk} = 0 if nodes l and k are not connected    (12)

We emphasize that we are only requiring A to be left-stochastic, meaning that each of its columns should add up to one, rather than each of its columns and rows. The neighborhood N_k of each node k is defined as the set of nodes {l} for which a_{lk} ≠ 0. The neighborhood N_k is typically known to agent k. The main difference between the above algorithm and the original adapt-then-combine (ATC) diffusion strategy studied in [4], [3], [34] is that we are employing a diminishing step-size sequence µ(i) as opposed to a constant step-size. Constant step-sizes have the distinct advantage that they allow nodes to continue adapting their estimates in response to drifts in the underlying data distribution [37]. In this work, we are interested in examining the generalization ability of distributed learners asymptotically when the underlying distribution X_k remains stationary, in which case the use of decaying step-size sequences is justified. If the statistical distribution of the data were subject to drifts, then constant step-sizes become a necessity; this scenario is already studied in some detail in [4], [5], [3], [34].
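The following sketch contrasts one network-wide iteration of the ATC diffusion strategy (10)-(11) with the consensus-type update introduced in the next subsection; the only difference is the point at which the neighborhood combination is applied. The array conventions (rows index agents) are assumptions of this sketch:

```python
import numpy as np

def diffusion_step(W, A, grads, mu_i):
    """One ATC diffusion iteration (10)-(11).

    W:     N x M array; row k holds the iterate w_{k,i-1} of agent k
    A:     N x N left-stochastic combination matrix; A[l, k] = a_{lk}
    grads: N x M array; row k holds grad Q(w_{k,i-1}, x_{k,i})
    """
    Psi = W - mu_i * grads      # adaptation (10): psi_{k,i}
    return A.T @ Psi            # aggregation (11): sum_l a_{lk} psi_{l,i}

def consensus_step(W, A, grads, mu_i):
    """One consensus iteration (13): combine the previous iterates first,
    while the gradient is still evaluated at the agent's own w_{k,i-1}."""
    return A.T @ W - mu_i * grads
```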
D. Consensus Strategy

In addition to (10)-(11), we shall examine the following consensus-based implementation [4], [7], [8], [33] for solving the same problem (4):

    w_{k,i} = Σ_{l=1}^N a_{lk} w_{l,i−1} − µ(i) ∇_w Q(w_{k,i−1}, x_{k,i})    [consensus]    (13)

The diffusion and consensus strategies (10)-(11) and (13) have exactly the same computational complexity, except that the computations are performed in a different order. We will see in Sec. IV-D that this difference enhances the performance of diffusion over consensus. Moreover, in the constant step-size case, the difference in the order in which the operations are performed causes an anomaly in the behavior of consensus solutions, in that they can become unstable even when all individual nodes are able to solve the inference task in a stable manner; see [0], [3], [34]. Furthermore, consensus strategies of the form (13) are usually limited to employing a doubly-stochastic combination matrix A. The analysis in the sequel will show that left-stochastic matrices actually lead to improved excess-risk performance (see the expressions in Theorem 2), while convergence of the distributed implementation continues to be guaranteed (see Theorem 1).

III. MAIN ASSUMPTIONS

Before proceeding with the analysis, we list in this section the main assumptions that are needed to facilitate the exposition. The conditions listed below are common in the broad stochastic optimization literature (see the explanations in [3]–[34]). The first condition assumes that the J_k(w)

are strongly-convex, with a common minimizer w^o for k = 1, ..., N. This condition ensures that the optimization problem (4) is well-conditioned.

Assumption 1 (Properties of risk functions): Each risk function J_k(w) is twice continuously-differentiable, and its Hessian matrix is uniformly bounded from below and from above, namely,

    0 < λ I_M ≤ ∇²_w J_k(w) ≤ λ̄ I_M < ∞    (14)

Furthermore, the risks at the various agents are minimized at the same location:

    w^o = arg min_w J_k(w),    k = 1, ..., N    (15)

and the Hessian matrices are locally Lipschitz continuous at w^o, i.e., for all ‖w^o − w‖ < ε, there exists some τ_k ≥ 0 such that:

    ‖∇²_w J_k(w^o) − ∇²_w J_k(w)‖ ≤ τ_k ‖w^o − w‖    (16)

We denote the value of the Hessian matrices at w = w^o (assumed uniform across the agents) by

    H ≜ ∇²_w J_k(w^o),    k = 1, ..., N    (17)

where H > 0. We let λ_min > 0 denote the smallest eigenvalue of H. In fact, condition (14) and the locally Lipschitz condition (16) jointly imply that (16) holds globally (see Lemma E.8 in [3]); i.e., for all w ∈ R^M, there exists some τ_k such that:

    ‖∇²_w J_k(w^o) − ∇²_w J_k(w)‖ ≤ τ_k ‖w^o − w‖    (18)

Observe also that

    λ ≤ λ_min ≤ λ̄    (19)

One useful implication that follows from Assumption 1 is the following. Consider the expected excess-risk (5) at node k. Using the following sequence of steps, we can bound the excess-risk by a squared weighted norm:

    ER_k(i) ≜ E_w{J_k(w_{k,i}) − J_k(w^o)}
           (a)= E_w ‖w̃_{k,i}‖²_{S_{k,i}}    (20a)
           (b)≤ (λ̄/2) E_w ‖w̃_{k,i}‖²    (20b)

where w̃_{k,i} ≜ w^o − w_{k,i}, E_w{·} denotes expectation over the distribution of w, and the weighting matrix S_{k,i} is defined as

    S_{k,i} ≜ ∫₀¹ ∫₀¹ t ∇²_w J_k(w^o − s t w̃_{k,i}) ds dt    (21)

Step (a) is a consequence of applying the following mean-value theorem [3, p. 4], [3] twice, for an arbitrary real-valued differentiable function f(·):

    f(a + b) = f(a) + (∫₀¹ ∇f(a + t b) dt)^T b    (22)

together with the fact that w^o optimizes J_k(w), so that ∇_w J_k(w^o) = 0. Step (b) is due to (14) and (21). Expression (20a) shows that the expected excess-risk at node k is equal to a weighted mean-square-error with weighting matrix (21). This means that one way to compute or bound the expected excess-risk is by evaluating weighted mean-square-error quantities of the form (20a) or (20b). This is the route we follow in this manuscript. We will analyze the right-hand side of (20a) in order to draw conclusions regarding the evolution of the expected excess-risk. In particular, once we establish that the distributed algorithm converges in the mean-square-error sense, inequality (20b) will immediately allow us to conclude that the algorithm also converges in excess-risk. Similarly, we can obtain the asymptotic expression for the excess-risk by leveraging the weighted mean-square-error analysis developed for constant step-size distributed strategies [4], [5], [3], adjusted for the decaying step-size case. Observe that these conclusions are different from the useful results in [6], which focused on studying convergence in distribution. The mean-square-error results will enable us to expose analytically various interesting differences in the performance of distributed strategies, such as diffusion and consensus.

Our second condition is on the gradient noise process, which is defined, for a generic vector w, as

    v_{k,i}(w) ≜ ∇_w Q(w, x_{k,i}) − ∇_w J_k(w)    (23)

We collect the noises from across the network into a column vector

    g_i(W) ≜ col{v_{1,i}(w_1), ..., v_{N,i}(w_N)}    (24)

where we are introducing the vector notation

    W ≜ col{w_1, ..., w_N}    (25)

for the collection of parameters across the agents.
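For completeness, the bound used in step (b) follows by applying the upper bound in (14) inside the integral in (21), since the factor t integrates to 1/2:

```latex
S_{k,i} = \int_{0}^{1}\!\!\int_{0}^{1} t\, \nabla^{2}_{w} J_{k}\!\left(w^{o} - s\,t\,\widetilde{w}_{k,i}\right) ds\, dt
\;\preceq\; \bar{\lambda} \left(\int_{0}^{1} t\, dt\right) I_{M}
\;=\; \frac{\bar{\lambda}}{2}\, I_{M}
```

so that ‖w̃_{k,i}‖²_{S_{k,i}} ≤ (λ̄/2) ‖w̃_{k,i}‖², which is (20b).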
We denote the (conditional) covariance matrix of the gradient noise vector by

    R_{g,i}(W) ≜ E{g_i(W) g_i^T(W) | F_{i−1}}    (26)

where the conditioning is in relation to the past history of the estimators, F_{i−1} ≜ {w_{k,j} : k = 1, ..., N and j ≤ i − 1}. The following conditions are relaxations of assumptions that are regularly considered in the stochastic approximation literature; they are generally satisfied in important scenarios, such as logistic regression or quadratic loss functions of the form (1)-(2) (see [3]).

Assumption 2 (Gradient noise model): We assume the gradient noise process satisfies:

    E{v_{k,i}(w) | F_{i−1}} = 0    (27)
    E{‖v_{k,i}(w)‖⁴ | F_{i−1}} ≤ α₄ ‖w^o − w‖⁴ + σ⁴_{v4}    (28)
    E{v_{k,i}^T(w^o) v_{l,i}(w^o)} = 0,    k ≠ l    (29)

for some α₄ ≥ 0 and σ⁴_{v4} ≥ 0, as well as:

    ‖R_{g,i}(W) − R_{g,i}(1_N ⊗ w^o)‖ ≤ κ ‖1_N ⊗ w^o − W‖^γ    (30)
    lim_{i→∞} E{R_{g,i}(1_N ⊗ w^o)} = R_g^o    (31)

for some κ ≥ 0 and 0 < γ ≤ 4, where

    W ≜ col{w_1, ..., w_N}    (32)

and (30) is assumed to hold for ‖1_N ⊗ w^o − W‖ ≤ ε, for some small ε > 0.

Observe that Assumption 2 implies that:

    E{‖v_{k,i}(w)‖² | F_{i−1}} ≤ α ‖w^o − w‖² + σ_v²    (33)

for some α, σ_v² ≥ 0 and any w that is measurable with respect to F_{i−1} (this follows from (28) by Jensen's inequality, since E{‖v_{k,i}(w)‖² | F_{i−1}} ≤ (E{‖v_{k,i}(w)‖⁴ | F_{i−1}})^{1/2}). In addition, the local Lipschitz condition (30) of order γ [38, p. 53] (sometimes referred to as the Hölder condition of order γ [39, p. 0]) implies, under Assumption 1, that the following global condition also holds [5], [3]:

    ‖R_{g,i}(W) − R_{g,i}(1_N ⊗ w^o)‖ ≤ κ₁ ‖1_N ⊗ w^o − W‖ + κ ‖1_N ⊗ w^o − W‖^γ    (34)

for some constant κ₁ ≥ 0 that depends on ε, and where κ is from (30). Furthermore, due to (29), the matrix R_g^o is block-diagonal:

    R_g^o ≜ blockdiag{R^o_{v,1}, ..., R^o_{v,N}}    (35)
    R^o_{v,k} ≜ lim_{i→∞} E{v_{k,i}(w^o) v_{k,i}^T(w^o)}    (36)

Since the nodes sample the data in an independent fashion, it is reasonable to expect the gradient noise to be uncorrelated across the nodes, as required by (29).

Our third condition relates to the structure of the network topology. We will assume that the network is strongly-connected, which means that (a) there exists at least one non-trivial self-loop, a_{kk} > 0 for some k, and (b) for any two agents k and l, there exists a path with nonzero weights {a_{k,m₁}, a_{m₁,m₂}, ..., a_{m_r,l}} from k to l, either directly if they are neighbors or through other agents. It is well known that the combination matrix A for such networks is primitive [40, p. 56]. That is, all entries of A are non-negative, and there exists some positive integer n_o > 0 such that all entries of A^{n_o} are strictly positive. One important property of primitive matrices follows from the Perron-Frobenius theorem [40, p. 534]: A has a single eigenvalue at one, while all other eigenvalues of A lie strictly inside the unit circle. Moreover, if we let p denote the right-eigenvector associated with the eigenvalue at one, and normalize its entries to add up to one, i.e.,

    A p = p,    1_N^T p = 1    (37)

then all entries of p will be strictly positive. We shall refer to p as the Perron eigenvector of A. We formalize this assumption in the following:

Assumption 3 (Network Topology): The network is strongly-connected, so that the combination matrix A is primitive, with 1^T p = 1 and p ≻ 0, where p denotes the Perron eigenvector of A.

IV. MAIN RESULTS

In this section, we list the main results and defer the detailed proofs to the appendices.

A. Convergence Properties

Our first result provides conditions on the step-size sequence under which the diffusion strategy converges both in the mean-square-error (MSE) sense and also almost surely. The difference between the two sets of conditions that appear below is that in one case the step-size sequence is additionally required to be square-summable.

Theorem 1 (Convergence rates): Let Assumptions 1-3 hold and let the step-size sequence satisfy

    µ(i) > 0,    Σ_{i=1}^∞ µ(i) = ∞,    lim_{i→∞} µ(i) = 0    (38)

Then, w_{k,i} generated by (10)-(11) converges in the MSE sense to w^o, i.e.,

    lim_{i→∞} E‖w^o − w_{k,i}‖² = 0    (39)

If the step-size sequence satisfies the additional square-summability condition:

    Σ_{i=1}^∞ µ²(i) < ∞    (40)

then w_{k,i} converges to w^o almost surely (i.e., with probability one) for all k = 1, ..., N. Furthermore, when the step-size sequence is of the form µ(i) = µ/i with µ satisfying λµ > 1, the second and fourth-order moments of the error vector converge at the rates O(1/i) and O(1/i²), respectively:

    lim sup_{i→∞} i · E‖w̃_{k,i}‖² ≤ σ_v² µ² / (2λµ − 1)    (41)
    lim sup_{i→∞} i² · E‖w̃_{k,i}‖⁴ ≤ constant    (42)

where σ_v² was introduced in (33).

Proof: See Appendix A.

Observe that (41) implies that each node converges in the mean-square-error sense at the rate 1/i. Combining this result with (20b), we conclude that each node also converges in excess-risk at this rate:

    lim sup_{i→∞} i · ER_k(i) ≤ λ̄ µ² σ_v² / (2(2λµ − 1))    (43)

when λµ > 1.
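As a quick check, the sequence µ(i) = µ/i adopted in the sequel satisfies both (38) and the square-summability condition (40):

```latex
\sum_{i=1}^{\infty} \mu(i) = \mu \sum_{i=1}^{\infty} \frac{1}{i} = \infty,
\qquad
\lim_{i\to\infty} \frac{\mu}{i} = 0,
\qquad
\sum_{i=1}^{\infty} \mu^{2}(i) = \mu^{2} \sum_{i=1}^{\infty} \frac{1}{i^{2}} = \frac{\mu^{2}\pi^{2}}{6} < \infty
```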
Note that this conclusion does not yet reveal the benefit of cooperation (for example, it does not show how the convergence rate depends on N). In the next section, we derive closed-form asymptotic expressions for the mean-square-error and the excess-risk, and from these expressions we will be able to highlight the benefit of network cooperation.

B. Evolution of the Excess-Risk Measure

We continue to assume that the step-size sequence is selected as µ(i) = µ/i for some µ > 0. This sequence satisfies conditions (38) and (40). Observe that in order to evaluate the excess-risk at node k, we must evaluate (20a). To do so, we first form the following network-wide error quantity:

    W̃_i ≜ col{w̃_{1,i}, ..., w̃_{N,i}}    (44)

and let E_{kk} denote the N × N matrix with a single entry equal to one at the (k, k)-th location and all other entries equal to zero. Then, using (20a), we can write:

    ER_k(i) = E‖W̃_i‖²_{E_{kk} ⊗ S_{k,i}}    (45)
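The weighting matrix in (45) simply selects the k-th block of the network error vector:

```latex
\mathbb{E}\,\big\| \widetilde{\mathcal{W}}_{i} \big\|^{2}_{E_{kk}\,\otimes\, S_{k,i}}
= \mathbb{E}\left[ \widetilde{w}_{k,i}^{T}\, S_{k,i}\, \widetilde{w}_{k,i} \right]
= \mathbb{E}\,\big\| \widetilde{w}_{k,i} \big\|^{2}_{S_{k,i}}
```

which matches (20a).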

In order to facilitate the analysis, we introduce the eigenvalue decomposition of the matrix H:

    H = Φ Λ Φ^T    (46)

where Φ is an orthogonal matrix and Λ is diagonal with positive entries λ₁, λ₂, ..., λ_M. Moreover, since the matrix A is left-stochastic and primitive by Assumption 3, we can express its Jordan decomposition in the form:

    A = p 1_N^T + Y D_{N−1} R^T    (47)

where R, Y ∈ R^{N×(N−1)} collect the remaining left and right eigenvectors, while D_{N−1} represents the Jordan structure associated with the eigenvalues inside the unit disc.

Theorem 2 (Asymptotic convergence of ER_k(i)): Let Assumptions 1-3 hold and let λ_min denote the smallest eigenvalue of the matrix H:

    λ_min ≜ min{λ₁, ..., λ_M}    (48)

Then, when λ_min µ > 1, it holds asymptotically that

    ER_k(i) ≃ (µ²/(2i)) Σ_{m=1}^M [λ_m/(2λ_m µ − 1)] p^T R_{v,m} p    (49)
           = (µ²/(2i)) Σ_{m=1}^M [λ_m/(2λ_m µ − 1)] Σ_{l=1}^N p_l² [Φ^T R^o_{v,l} Φ]_{mm}    (50)

where the notation f(i) ≃ g(i) means that lim_{i→∞} f(i)/g(i) = 1. Moreover, λ_m is the m-th eigenvalue of H and the N × N matrix R_{v,m} is defined as:

    R_{v,m} ≜ diag{[Φ^T R^o_{v,1} Φ]_{mm}, ..., [Φ^T R^o_{v,N} Φ]_{mm}}    (51)

where the notation [X]_{mm} denotes the m-th diagonal element of the matrix X.

Proof: See Appendix B.

Theorem 2 establishes a closed-form expression for the asymptotic excess-risk of the diffusion algorithm. We observe that the slowest rate at which the asymptotic term converges depends on the smallest eigenvalue of H, namely λ_min, and on the constant µ. Interestingly, the only dependence on the topology of the network asymptotically is encoded in the Perron vector p of the combination matrix A; i.e., most of the eigen-structure of the topology matrices becomes irrelevant asymptotically and only influences the convergence rate in the transient regime. We will see further ahead that the Perron vector p can be optimized to minimize the excess-risk in the asymptotic regime. It is natural that the transient stage should depend on the network geometry, because the networked agents are propagating their information over the entire network. The speed of information propagation over a sparsely connected network is determined by the second-largest eigenvalue of the combination matrix [4], which is influenced by the degree of network connectivity. Our results show, however, that there is an asymptotic regime where the performance of the diffusion strategy can be made invariant to the specific network topology, since the Perron vector p can be designed for general connected networks, as we will see further ahead in (60). Finally, we observe that all agents participating in the network achieve the same asymptotic performance, given by the right-hand side of (49), since this asymptotic expression for the excess-risk does not depend on any particular node index k, but only on the network-wide quantity p^T R_{v,m} p.

When λ_min is not known, and it is thus not clear how to choose µ to satisfy λ_min µ > 1, it is common to choose a large µ that forces λ_min µ ≫ 1. In this case, we obtain from (50) that

    ER_k(i) ≃ (µ/(4i)) Σ_{l=1}^N p_l² Tr(R^o_{v,l}),    [λ_min µ ≫ 1]    (52)

This approximation is close in form to the original steady-state performance expression derived for the diffusion algorithm when a constant step-size µ is used [4]. The main difference is that the steady-state term now diminishes at the rate 1/i when µ(i) = µ/i and λ_min µ ≫ 1. By specializing the previous results to the case N = 1 (a stand-alone node), we obtain as a corollary the following result for the expected excess-risk that is delivered by the non-cooperative stochastic gradient algorithm (6).
Corollary 1 (Stochastic gradient approximation): Consider recursion (6) and let Assumptions 1-2 hold with µ(i) = µ/i and λ_min µ > 1. Then, the excess-risk approaches asymptotically:

    E_w{J(w_i) − J(w^o)} ≃ (µ²/(2i)) Σ_{m=1}^M [λ_m/(2λ_m µ − 1)] [Φ^T R_v^o Φ]_{mm}    (53)
                        ≈ (µ/(4i)) Tr(R_v^o),    [λ_min µ ≫ 1]    (54)

where

    R_v^o ≜ lim_{i→∞} E{v_i(w^o) v_i^T(w^o)}    (55)

and v_i is the noise process modeled by Assumption 2.

Observe that (53)-(54) are stronger statements than those in Theorem 1, since we are not only stating that the convergence rate is O(1/i), but we are also giving the exact constant that multiplies the factor 1/i. In the next section, we examine the relationship between the derived constant and the network size and noise parameters across the network. Following this presentation, we will utilize our mean-square-error expressions to examine the differences between the diffusion strategy (10)-(11) and the consensus strategy (13).

C. Benefit of Cooperation

Up to this point in the discussion, the benefit of cooperation has not yet manifested itself explicitly; this benefit is actually already encoded in the vector p, and optimization over p will help bring forth these advantages. Observe that the expression for the asymptotic term in (49) is quadratic in p. We can therefore optimize the asymptotic expression over p in order to reduce the excess-risk. We re-write the asymptotic excess-risk (49) as:

    E{J_k(w_{k,i}) − J_k(w^o)} ≃ (µ²/(2i)) p^T Z p    (56)

TABLE I: Examples of fully-distributed combination rules and their corresponding Perron vectors (n_k = |N_k| denotes the size of the neighborhood of node k).

  Averaging rule:
      a_{lk} = 1/n_k if l ∈ N_k; 0 otherwise.
      Perron entries: p_k = n_k / Σ_{m=1}^N n_m

  Metropolis rule [4]:
      a_{lk} = 1/max{n_k, n_l} if l ∈ N_k, l ≠ k;
      a_{kk} = 1 − Σ_{j ∈ N_k\{k}} a_{jk}; a_{lk} = 0 otherwise.
      Perron entries: p_k = 1/N

  Hastings rule [43]:
      a_{lk} = p_l / max{n_k p_l, n_l p_k} if l ∈ N_k, l ≠ k;
      a_{kk} = 1 − Σ_{j ∈ N_k\{k}} a_{jk}; a_{lk} = 0 otherwise.
      Perron entries: p_k (any desired Perron vector p ≻ 0)

where

    Z ≜ Σ_{m=1}^M [λ_m/(2λ_m µ − 1)] R_{v,m}    (57)

Then, we consider the problem of optimizing (56) over p:

    min_{p, A ∈ 𝒜} p^T Z p    subject to A p = p, 1^T p = 1, p ≽ 0

where 𝒜 denotes the set of left-stochastic and primitive combination matrices that satisfy the network topology structure. It is generally not clear how to solve this optimization problem over both A and p, so we pursue an indirect route. We first remove the optimization over A and determine an optimal p. Subsequently, given the optimal p^o, we show that a left-stochastic and primitive matrix A in 𝒜 can be constructed. The relaxed problem is:

    min_p p^T Z p    subject to 1^T p = 1, p ≽ 0    (58)

whose solution is

    p^o = Z^{−1} 1 / (1^T Z^{−1} 1)    (59)

It is straightforward to verify that 1^T p^o = 1. A combination matrix A that has this p^o as its Perron eigenvector is given by the Hastings rule [4], [44], [3, Lemma .]:

    a_{lk} = p^o_l / max{n_k p^o_l, n_l p^o_k},    l ∈ N_k, l ≠ k
    a_{kk} = 1 − Σ_{j ∈ N_k\{k}} a_{jk}
    a_{lk} = 0,    otherwise    (60)

Observe that for agent k to implement the Hastings rule, it needs to know its neighborhood N_k (which is known to agent k), the number of neighbors that each of its neighbors has (this information is easily obtained from the immediate neighbors), and the Perron vector p^o that the network wishes to attain. Therefore, the design of the weighting matrix A can be done in a fully distributed manner. Table I lists three fully-distributed combination rules (combination rules that can be implemented in a fully-distributed manner) and their corresponding Perron vectors.

To see the effectiveness of this choice for p, we consider the case where λ_min µ ≫ 1, so that

    Z ≈ (1/(2µ)) diag{Tr(R^o_{v,1}), ..., Tr(R^o_{v,N})}    (61)

Substituting (59) into (56), we obtain:

    E{J_k(w_{k,i}) − J_k(w^o)} ≃ (µ²/(2i)) (1^T Z^{−1} 1)^{−1}    (62)
                              (a)≈ (µ/(4i)) [Σ_{k=1}^N (Tr R^o_{v,k})^{−1}]^{−1},    [λ_min µ ≫ 1]    (63)

where step (a) is due to (61). First, we compare this performance with that of the centralized algorithm (7). To do this, we establish the following result:

Corollary 2 (Unweighted centralized processing): Let Assumptions 1-2 hold with µ(i) = µ/i and λ_min µ > 1. Consider the centralized algorithm (7), which has access to all the data from across the network at each iteration. Then, the excess-risk asymptotically satisfies:

    E_w{J(w_i) − J(w^o)} ≃ (µ²/(2iN²)) Σ_{m=1}^M [λ_m/(2λ_m µ − 1)] Σ_{k=1}^N [Φ^T R^o_{v,k} Φ]_{mm}    (64)
                        ≈ (µ/(4iN²)) Σ_{k=1}^N Tr(R^o_{v,k}),    [λ_min µ ≫ 1]    (65)

where J(w) ≜ (1/N) Σ_{k=1}^N J_k(w).

Proof: The centralized algorithm (7) is a special case of the diffusion algorithm when A = p 1_N^T with p = (1/N) 1_N, which yields a network that satisfies Assumption 3. To see this, consider the diffusion algorithm (10)-(11) with A = p 1_N^T:

    ψ_{k,i} = w_{k,i−1} − µ(i) ∇_w Q(w_{k,i−1}, x_{k,i})    (66)
    w_{k,i} = Σ_{l=1}^N p_l ψ_{l,i}    (67)

First, observe that after the first iteration, the estimates across the network are uniform, since (67) does not depend on

k. We can therefore drop the subscript k from w_{k,i}:

    ψ_{k,i} = w_{i−1} − µ(i) ∇_w Q(w_{i−1}, x_{k,i})    (68)
    w_i = Σ_{l=1}^N p_l ψ_{l,i}    (69)

Substituting (68) into (69), we obtain:

    w_i = Σ_{l=1}^N p_l [w_{i−1} − µ(i) ∇_w Q(w_{i−1}, x_{l,i})]
        = (Σ_{l=1}^N p_l) w_{i−1} − µ(i) Σ_{l=1}^N p_l ∇_w Q(w_{i−1}, x_{l,i})
        (a)= w_{i−1} − µ(i) Σ_{l=1}^N p_l ∇_w Q(w_{i−1}, x_{l,i})
        (b)= w_{i−1} − (µ(i)/N) Σ_{l=1}^N ∇_w Q(w_{i−1}, x_{l,i})    (70)

where step (a) is due to 1^T p = 1 and step (b) is obtained by substituting p = (1/N) 1_N. Observe that (70) is identical to (7), the unweighted centralized algorithm. Then, using the analysis of the diffusion algorithm in Theorem 2, we have that

    ER_k(i) ≃ (µ²/(2i)) Σ_{m=1}^M [λ_m/(2λ_m µ − 1)] p^T R_{v,m} p
            = (µ²/(2iN²)) Σ_{m=1}^M [λ_m/(2λ_m µ − 1)] Σ_{l=1}^N [Φ^T R^o_{v,l} Φ]_{mm}    (71)

Since the right-hand side of (71) does not depend on the agent index k (all agents achieve the same asymptotic performance), the average excess-risk remains the same:

    E_w{J(w_i) − J(w^o)} = (1/N) Σ_{k=1}^N ER_k(i)
                         ≃ (µ²/(2iN²)) Σ_{m=1}^M [λ_m/(2λ_m µ − 1)] Σ_{l=1}^N [Φ^T R^o_{v,l} Φ]_{mm}
                         (a)≈ (µ/(4iN²)) Σ_{l=1}^N Tr(R^o_{v,l}),    [λ_min µ ≫ 1]    (72)

where step (a) is due to (61), which is the desired result.

Comparing (63) to (54) in the special case when Tr(R^o_{v,k}) = Tr(R_v^o) for all k = 1, ..., N, we find that the diffusion algorithm offers an N-fold improvement in the excess-risk over the non-cooperative solution. Also, comparing (63) to (65) in this case, we observe that asymptotically the diffusion algorithm achieves the same performance as the centralized algorithm (7). More generally, let us consider the case in which the noise profile is not uniform across the agents. We call upon the following inequality:

    [Σ_{k=1}^N (Tr R^o_{v,k})^{−1}]^{−1} ≤ (1/N²) Σ_{k=1}^N Tr(R^o_{v,k})    (73)

which follows from the fact that the harmonic mean of a set of numbers is upper-bounded by their arithmetic mean. We then conclude from (73) that the excess-risk of the diffusion strategy is upper-bounded by that of the centralized strategy (7), with equality holding only when the network experiences a spatially uniform gradient noise profile. This implies that the diffusion strategy actually outperforms the implementation studied in [6], which uses a doubly-stochastic combination matrix. Furthermore, in this case of a non-uniform noise profile, and since

    [Σ_{k=1}^N (Tr R^o_{v,k})^{−1}]^{−1} ≤ min_{1≤k≤N} Tr(R^o_{v,k})    (74)

we also find that the diffusion strategy continues to outperform the non-cooperative solution, and guarantees that every node that utilizes the diffusion strategy will outperform the best-performing node in the non-cooperative solution. On the other hand, the weighted centralized algorithm (8) achieves the same performance as the diffusion strategy, since it is a special case of diffusion with A = π 1^T.

Corollary 3 (Weighted centralized processing): Let Assumptions 1-2 hold with µ(i) = µ/i and λ_min µ > 1. Consider the centralized algorithm (8), which has access to all the data from across the network at each iteration. Then, the excess-risk asymptotically satisfies:

    E_w{J(w_i) − J(w^o)} ≃ (µ²/(2i)) Σ_{m=1}^M [λ_m/(2λ_m µ − 1)] Σ_{k=1}^N π_k² [Φ^T R^o_{v,k} Φ]_{mm}    (75)
                        ≈ (µ/(4i)) Σ_{k=1}^N π_k² Tr(R^o_{v,k}),    [λ_min µ ≫ 1]    (76)

where J(w) ≜ (1/N) Σ_{k=1}^N J_k(w).

Proof: The centralized algorithm (8) is a special case of the diffusion algorithm when A = π 1_N^T, where π = col{π₁, ..., π_N}, which yields a network that satisfies Assumption 3. The derivation closely mirrors that of Corollary 2, except with p = π, where 1^T π = 1, instead of p = (1/N) 1_N. We can then use the result of Theorem 2 to obtain (75) and (76).

Comparing (75) with (50), we observe that the diffusion strategy achieves the same performance as the weighted centralized strategy (8), since we are free to let p_k = π_k.
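A sketch of the two design steps discussed above: evaluating the reconstructed asymptotic constant in (50) for a given Perron vector, and building a left-stochastic combination matrix with a prescribed Perron vector via a Hastings-type construction in the spirit of (60). The exact normalization inside `max` is part of the reconstruction, so treat it as an assumption and verify the stated properties numerically, as the final asserts do:

```python
import numpy as np

def asymptotic_er_constant(mu, H, Rv_list, p):
    """Constant C in ER_k(i) ~ C / i from (50), assuming lambda_min(H) * mu > 1."""
    lam, Phi = np.linalg.eigh(H)                 # H = Phi Lambda Phi^T, as in (46)
    C = 0.0
    for m, lam_m in enumerate(lam):
        # p^T R_{v,m} p = sum_l p_l^2 [Phi^T R_{v,l} Phi]_{mm}
        s = sum(p[l] ** 2 * (Phi.T @ Rv_list[l] @ Phi)[m, m] for l in range(len(p)))
        C += lam_m / (2 * lam_m * mu - 1) * s
    return (mu ** 2 / 2) * C

def hastings_matrix(adj, p):
    """Left-stochastic A with Perron vector p over adjacency adj (cf. (60)).
    adj is 0/1, symmetric, with ones on the diagonal (self-loops)."""
    N = adj.shape[0]
    n = adj.sum(axis=1)                          # neighborhood sizes n_k
    A = np.zeros((N, N))
    for k in range(N):
        for l in range(N):
            if l != k and adj[l, k]:
                A[l, k] = p[l] / max(n[k] * p[l], n[l] * p[k])
        A[k, k] = 1.0 - A[:, k].sum()            # make column k sum to one
    return A

# sanity check on a small topology
rng = np.random.default_rng(0)
adj = np.ones((4, 4), dtype=int)                 # fully connected, self-loops included
p = rng.random(4); p /= p.sum()                  # desired Perron vector
A = hastings_matrix(adj, p)
assert np.allclose(A.sum(axis=0), 1.0)           # left-stochastic columns
assert np.allclose(A @ p, p)                     # A p = p, as in (37)
```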
Conversely, it is possible to optimize {π_k} in the same manner as we did in (58), to conclude that the optimal choice of {π_k} yields an excess-risk performance for the weighted centralized algorithm of:

    E_w{J(w_i) − J(w^o)} ≃ (µ²/(2i)) (1^T Z^{−1} 1)^{−1}    (77)

where Z was defined earlier in (57); this is identical to the performance (62) obtained by the diffusion strategy with the optimized Perron vector. The reader may also notice that it is further possible to optimize the asymptotic excess-risk curves (62) and (77) over the initial step-size µ for the diffusion and weighted centralized

strategies. This would entail solving an optimization problem of the form:

    min_{µ : λ_min µ > 1} µ² (1^T Z^{−1} 1)^{−1} = min_{µ : λ_min µ > 1} µ² [1^T (Σ_{m=1}^M [λ_m/(2λ_m µ − 1)] R_{v,m})^{−1} 1]^{−1}    (78)

While this problem is generally difficult to solve in closed form, it is evident that the solution is µ = 1/λ when λ_m = λ for all 1 ≤ m ≤ M, which is the case where the asymptotic Hessian matrix H is white. In general, however, the optimization problem can be solved numerically to obtain the optimal rate when H is known or can be estimated (such as in the case of the quadratic cost).

D. Comparison to Consensus Strategies

In this section, we use the performance results derived so far to show that the diffusion strategy (10)-(11) has several advantages over the consensus strategy (13). Specifically, we will show that, asymptotically, the consensus excess-risk curve is worse than the diffusion excess-risk curve. In addition, we will observe through simulations that the overshoot during the transient phase may be significantly worse for consensus implementations. To examine the differences in behavior, we will consider the network excess-risk:

    ER(i) ≜ (1/N) Σ_{k=1}^N ER_k(i)

For simplicity, we assume A is symmetric; more generally, we can consider combination policies A that are close to symmetric and employ arguments similar to [0]. The final conclusion will be similar to the arguments given here. We can now establish the following result.

Theorem 3 (Comparing network excess-risks): Let Assumptions 1-3 hold and λ_min µ > 1 with a symmetric A. Then, the asymptotic network excess-risk achieved by the diffusion strategy (10)-(11) is upper-bounded by (and, hence, better than) that achieved by the consensus strategy (13):

    ER_diff(i) ≤ ER_cons(i),    as i → ∞    (79)

Proof: See Appendix F.

Result (79) implies that, asymptotically, the curve for the network excess-risk of the diffusion algorithm is upper-bounded by the curve for the consensus algorithm. Even though both algorithms have the same computational complexity, the diffusion algorithm achieves better performance because it succeeds at diffusing the information more thoroughly through the network. This conclusion mirrors the result found in [0], which studied the difference in performance between the diffusion and consensus strategies in the constant step-size scenario. We conclude, therefore, that the manner by which the diffusion strategy propagates information through the network continues to outperform the consensus strategy, even in the diminishing step-size case.

V. APPLICATION TO DISTRIBUTED INFERENCE OVER REGRESSION AND CLASSIFICATION MODELS

In order to illustrate our results, we consider two applications related to distributed linear regression and distributed logistic regression (a classification model). In the former case, we will observe that our strategy for choosing the combination weights outperforms the doubly-stochastic weights of [6], and outperforms the consensus solution. We will also see that it is possible to select J_k(w) such that the excess-risk can be interpreted as the Kullback-Leibler (KL) divergence between the true likelihood function and an estimated likelihood function.

A. Quadratic Risk

Consider the estimation of an unknown vector w^o ∈ R^M from observations:

    y_k(i) = h_{k,i}^T w^o + z_k(i)    (80)

where z_k(i) is zero-mean Gaussian noise with variance σ²_{z,k}, and the regressors h_{k,i} ∈ R^M are zero-mean, spatially independent, and identically distributed. This is a common scenario in linear regression.
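A compact end-to-end sketch of the linear model (80) driving the diffusion strategy (10)-(11) with the quadratic loss used in this section; all sizes, seeds, and the uniform combination matrix are illustrative choices, not the paper's simulation settings:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T, mu = 10, 4, 5000, 1.0              # agents, dimension, iterations, mu
w_o = rng.standard_normal(M)                # true model w^o
sigma_z = 0.1 + rng.random(N)               # per-agent noise std (non-uniform profile)

A = np.full((N, N), 1.0 / N)                # doubly-stochastic, fully-connected choice
W = np.zeros((N, M))                        # initial iterates w_{k,0}

for i in range(1, T + 1):
    Hreg = rng.standard_normal((N, M))      # regressors h_{k,i}
    y = Hreg @ w_o + sigma_z * rng.standard_normal(N)   # observations, model (80)
    # stochastic gradient of (y - h^T w)^2 at each agent: -2 (y - h^T w) h
    resid = y - np.einsum('km,km->k', Hreg, W)
    grads = -2.0 * resid[:, None] * Hreg
    Psi = W - (mu / i) * grads              # adaptation (10) with mu(i) = mu / i
    W = A.T @ Psi                           # aggregation (11)

print(np.linalg.norm(W - w_o, axis=1))      # per-agent errors decay like 1/sqrt(i)
```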
The likelihood function is given by:

    p(y_k(i) | h_{k,i}, w) = (1/√(2πσ²_{z,k})) exp(−(y_k(i) − h_{k,i}^T w)² / (2σ²_{z,k}))    (81)

The maximum-likelihood estimate of w^o is obtained by minimizing the negative log-likelihood function; this step motivates the following choice for the loss function:

    Q(w, h_{k,i}, y_k(i)) ≜ (y_k(i) − h_{k,i}^T w)²    (82)

In this case, the risk function (3) becomes:

    J_k(w) = E (y_k(i) − h_{k,i}^T w)²    (83)

and, using (81), the excess-risk (5) is equivalent to measuring the KL divergence:

    D_KL( p(y_k(i) | h_{k,i}, w^o) ‖ p(y_k(i) | h_{k,i}, w) ) ≜ E{ log [ p(y_k(i) | h_{k,i}, w^o) / p(y_k(i) | h_{k,i}, w) ] }    (84)

The results derived in Sec. IV apply to this definition of excess-risk, and we can therefore expect that the KL divergence curve of the diffusion strategy (10)-(11) will lie below that of the consensus algorithm (13), and that the diffusion strategy will outperform both the non-cooperative solution (6) and the centralized algorithm (7).

We simulate the above setup by considering a network with N = 25 nodes, as illustrated in Figs. 2a-2c. The quadratic loss function optimized at each node is (82), where h_{k,i} ∈ R^M, with M = 4, is a Gaussian random vector with i.i.d. zero-mean, unit-variance elements. In addition, the scalar observation y_k(i) is generated according to (80), where z_k(i) is a Gaussian random variable with zero mean and variance σ²_{z,k}, as illustrated in Fig. 2d. The combination weights are chosen according to the Hastings rule (60) using either the optimized Perron vector (59) or

Fig. 2: Simulation parameters for quadratic risk optimization: (a) the fully-connected network topology used in Sec. V-A; (b) the sparsely-connected network topology used in Sec. V-A; (c) the line topology used in Sec. V-A; (d) the noise variances across the 25 nodes in the simulation; and (e) the magnitude of the second-largest eigenvalue of the combination matrices for the various topologies (a fully-connected A matrix always has a single non-zero eigenvalue, at 1).

the uniform Perron vector p = (1/N) 1_N, which leads to a doubly-stochastic combination matrix. The magnitudes of the second-largest eigenvalues of the left- and doubly-stochastic matrices are illustrated in Fig. 2e. The step-size was optimized via (78). Observe that we let two out of the 25 nodes have particularly bad noise conditions. This should highlight the benefit of using left-stochastic weights as opposed to doubly-stochastic weights. The simulation results are illustrated in Fig. 4. The curves are averaged over 100 experiments. The network topologies examined in the simulations in this section are illustrated in Figs. 2a-2c.

1) Asymptotic performance invariance to the network topology: We first illustrate one of the main conclusions we draw from Theorem 2, namely that the diffusion strategy's asymptotic performance is independent of the particular network topology. To do this, we simulate the diffusion strategy with three network topologies: (1) fully-connected (Fig. 2a); (2) sparsely-connected (Fig. 2b); and (3) line topology (Fig. 2c). We utilize the Hastings rule (60) to obtain both left-stochastic and doubly-stochastic combination matrices, by setting the desired Perron vector p to the optimal vector (59) or to (1/N) 1_N. Such combination matrices can be generated for any connected network topology. The eigen-spectra of the resulting combination matrices for each topology are illustrated in Fig. 2e. We observe that as the network becomes better connected, the second-largest eigenvalue is reduced. This eigenvalue generally determines the convergence speed of the transient terms. Figure 3 illustrates the simulation results for all of the simulated network topologies. We observe that, regardless of the network topology, the diffusion strategy converges to the curve described by (49), albeit at different speeds. The fully-connected network case is theoretically equivalent to the centralized strategies (7) and (8), depending on the vector p used. We conclude, therefore, that the diffusion strategy is invariant to the particular network topology, and its asymptotic excess-risk curve depends only on the Perron vector of the combination matrix, which can be freely designed for connected networks.

2) Performance comparisons: We now simulate the diffusion strategy as well as the other algorithms discussed in the manuscript. In addition to the non-cooperative (6), diffusion

Fig. 3: Simulation of the diffusion strategy for (1) a fully-connected network; (2) a sparsely-connected network; and (3) a line topology. The network topologies and simulation parameters are listed in Fig. 2. Best viewed in color.

(10)-(11), consensus (13), unweighted centralized (7), and weighted centralized (8) strategies, we also examine the performance of non-cooperative and distributed Polyak-Ruppert (PR) averaging strategies. The non-cooperative PR algorithm can be described as:

    φ_{k,i} = φ_{k,i−1} − µ(i) ∇_w Q(φ_{k,i−1}, x_{k,i})    [adaptation]    (85)
    w_{k,i} = w_{k,i−1} + (1/i) (φ_{k,i} − w_{k,i−1})    [averaging]    (86)

which is simply the non-cooperative algorithm (6) operating independently, followed by a time-averaging of the iterates to smooth the estimates. On the other hand, the distributed version that we implement for comparison takes the form:

    ψ_{k,i} = φ_{k,i−1} − µ(i) ∇_w Q(φ_{k,i−1}, x_{k,i})    [adaptation]    (87)
    φ_{k,i} = Σ_{l=1}^N a_{lk} ψ_{l,i}    [aggregation]    (88)
    w_{k,i} = w_{k,i−1} + (1/i) (φ_{k,i} − w_{k,i−1})    [averaging]    (89)

where the combination matrix with elements {a_{lk}} is the same as the one used for the diffusion and consensus strategies. It has been shown for single-agent stochastic approximation algorithms [], [35], [45], [46] that the averaging operation makes the algorithm robust to loss of strong-convexity in the risk function (the algorithm will continue to converge at the fastest possible rate for the non-strongly-convex problem, albeit more slowly than the rate it would have achieved for a strongly-convex problem), in addition to achieving the best possible rate regardless of the eigen-structure of the Hessian matrix H (which is important when H is ill-conditioned). Unfortunately, these algorithms have also been shown to be sensitive to initialization [], in that the transient terms in their excess-risk curve decay at the relatively slow rate of 1/i, in comparison to stochastic gradient descent, which forgets the initialization at a rate of 1/i^{λ_min µ} asymptotically (see Theorem 5), which can be made faster through the choice of µ, and is faster still in the earlier stages []. In addition, the step-size sequence µ(i) for the PR algorithms must decay more slowly than 1/i; for example, µ(i) = µ/i^{2/3}. Various methods have been suggested for avoiding such drawbacks, such as starting the averaging process after some number of iterations [47], or selecting a small initial step-size µ to reduce the size of the overshoot. Unfortunately, it is usually not clear how many iterations should elapse before averaging starts. This is still a useful research direction to explore. In our simulations, we use µ(i) = µ/i^{2/3}, as suggested in [], for the PR averaging strategies.

The curves illustrate that the difference in performance between non-cooperative processing (6) and the diffusion algorithm with doubly-stochastic weights (i.e., the algorithm studied in [6]) is about 14 dB, i.e., 10 log₁₀(N). We also observe that (10)-(11) achieves a 20 dB-per-decade decay in simulation (the 1/i rate), and that the diffusion strategy with left-stochastic weights outperforms the unweighted centralized solution (7) and the doubly-stochastic implementation. In comparison to the consensus algorithm (13), the diffusion algorithm (10)-(11) is seen to have better transient performance, and the network excess-risk for the consensus strategy remains higher than that for diffusion regardless of the combination matrix used, as predicted by Theorem 3. In addition, we observe that the left-stochastic diffusion algorithm approaches the same performance curve as the weighted centralized strategy (8), as predicted by the theory.
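For reference, here is one network iteration of the distributed Polyak-Ruppert scheme (87)-(89); the slower step-size decay µ(i) = µ/i^{2/3} mentioned above appears in the closing comment. Shapes and names are assumptions of this sketch:

```python
import numpy as np

def distributed_pr_step(Phi, Wavg, A, grads, mu_i, i):
    """One distributed Polyak-Ruppert iteration (87)-(89).

    Phi:   N x M array of fast iterates phi_{k,i-1}
    Wavg:  N x M array of running averages w_{k,i-1}
    A:     N x N left-stochastic combination matrix; A[l, k] = a_{lk}
    """
    Psi = Phi - mu_i * grads                  # adaptation (87)
    Phi_next = A.T @ Psi                      # aggregation (88)
    Wavg_next = Wavg + (Phi_next - Wavg) / i  # averaging (89)
    return Phi_next, Wavg_next

# typical PR step-size schedule: mu_i = mu / i ** (2.0 / 3.0)
```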
We also observe that the Polyak-Ruppert averaging diffusion scheme exhibits worse performance than the non-averaged case when the condition number of H is unity. This is due to the poor transient performance of the averaged algorithm [], [48]. However, as the condition number of H worsens,

the averaged diffusion algorithm outperforms the standard diffusion scheme.

Fig. 4: Comparison between the learning curves of non-cooperative processing (6), the diffusion algorithm (10)-(11), the consensus algorithm (13), unweighted centralized processing (7), weighted centralized processing (8), and the Polyak-Ruppert averaging scheme for quadratic loss minimization: (a) condition number of H equal to 1, comparing non-cooperative, centralized, diffusion, and consensus strategies; (b) condition number of H equal to 1, comparing non-averaged and averaged diffusion strategies; (c)-(d) the same comparisons with condition number of H equal to 2; (e)-(f) the same comparisons with condition number of H equal to 4. Best viewed in color.

B. Regularized Logistic Regression

Now consider an application where the dependent variable y_k(i) is binary (i.e., it assumes values from the set {+1, −1}) with h_{k,i} ∈ R^M. The log-odds function is assumed to obey the linear model [49, p. 7]:

    log [ p(y_k(i) | h_{k,i}; w) / p(−y_k(i) | h_{k,i}; w) ] = y_k(i) h_{k,i}^T w    (90)

Solving for p(y_k(i) | h_{k,i}; w), we have that the likelihood is given by

    p(y_k(i) | h_{k,i}; w) = 1 / (1 + e^{−y_k(i) h_{k,i}^T w})    (91)

Then, the maximum-likelihood estimate of w^o minimizes the negative log-likelihood function:

    Q(w, h_{k,i}, y_k(i)) = log(1 + e^{−y_k(i) h_{k,i}^T w})    (92)

Observe that Q(w, h_{k,i}, y_k(i)) is not strongly-convex. In order to alleviate this difficulty, it is possible to add a small

regularizer, so that [50], [5]:

    Q(w, h_{k,i}, y_k(i)) = ρ‖w‖² + log(1 + e^{−y_k(i) h_{k,i}^T w})    (93)

and the excess-risk then has the interpretation of being a regularized KL divergence:

    D_RKL( p(y_k(i) | h_{k,i}; w^o) ‖ p(y_k(i) | h_{k,i}; w) ) ≜ ρ‖w − w^o‖² + E{ log [ p(y_k(i) | h_{k,i}; w^o) / p(y_k(i) | h_{k,i}; w) ] }    (94)

Fig. 5: Simulation parameters and results for the regularized logistic regression simulation: (a) the network topology used to obtain the simulation results in Sec. V-B; (b) the noise variances across the 8 nodes in the simulation; and (c) the regularized KL divergence attained by nodes that utilize the non-cooperative algorithm (6), the unweighted centralized algorithm (7), the weighted centralized algorithm (8), the diffusion strategy (10)-(11), and the consensus strategy (13), with optimized weights (59) and with doubly-stochastic weights p = (1/N) 1_N. Theoretical curves are from (54), (63), and (65). Curves are averaged over 1000 experiments.

We generate a random ad-hoc network of N = 8 nodes, illustrated in Fig. 5a, where each node samples M-dimensional feature vectors from a two-component Gaussian mixture with probability density function h_{k,i} ∼ ½ N(γ₁, I_M) + ½ N(γ₂, I_M), where N(γ, Σ) denotes the multivariate Gaussian density function with mean vector γ and covariance matrix Σ. The labels y_k(i) are generated according to

    p(y_k(i) | h_{k,i}; w^o) = 1 / (1 + e^{−y_k(i) h_{k,i}^T w^o})    (95)

where w^o is chosen arbitrarily at the start of each experiment. Each node runs the diffusion algorithm listed in (10)-(11), computing the gradient according to the instantaneous approximation of the strongly-convex regularized KL divergence (94). Since the noise profile is uniform across the agents in this scenario, the optimal weights (59) become p^o = (1/N) 1_N, which leads to a doubly-stochastic combination policy A. In order to simulate a scenario involving a left-stochastic combination rule, we modify the instantaneous approximations for the gradient vectors across the agents by adding a small

These noise perturbations help model unknown effects or even round-off errors. The regularization constant ρ is chosen to be 5 in the simulation, since it determines λ_min; in order to ensure that our approximation in (63) is valid, we choose μ to be 5. The optimizer is found by minimizing the sum of the regularized KL divergences, using an empirical average in place of the expectation. Figure 5c shows the resulting curves, which are averaged over 500 experiments with a different random network topology per experiment. We plot the performance of the non-cooperative algorithm along with its theoretical performance, both averaged over the nodes. In addition, we show the performance of the centralized algorithms along with their theoretical performance obtained from the right-hand sides of (65) and (76). We see that the diffusion algorithm outperforms the non-cooperative algorithm and matches the unweighted centralized algorithm when the combination policy A is doubly-stochastic, as predicted by Theorem 2. We also observe that, when A is chosen to be left-stochastic, the diffusion strategy outperforms the unweighted centralized implementation, as predicted by (73), and approaches the same performance as the weighted centralized algorithm, as expected.

VI. CONCLUSIONS AND FUTURE WORK

We performed a detailed mean-square-error analysis of the asymptotic convergence behavior of distributed strategies of the consensus and diffusion types. We established that the algorithms converge asymptotically at the rate of O(1/i) and, more importantly, derived the exact constant that multiplies the convergence rate factor 1/i. By using the derived expressions, we were able to compare the performance of various implementations against each other, including non-cooperative strategies, centralized strategies, and distributed consensus and diffusion strategies. The analysis revealed that diffusion implementations outperform the other strategies and asymptotically deliver the same excess-risk performance as centralized strategies; we also showed how to optimize this performance over the choice of combination weights. We showed that the optimal weights are left-stochastic rather than doubly-stochastic and derived an expression for them in terms of the Hastings rule. We also observed in simulations that the diffusion implementation leads to a smaller overshoot during the transient phase in relation to the other strategies. It was also demonstrated that the asymptotic performance of the diffusion strategy is independent of the particular network topology that connects the agents.

The analysis in this work was carried out under the assumption of stationary data distributions, in which case decaying step-sizes are justified. If the statistical distribution of the data drifts with time, then constant step-sizes need to be employed. In this case, the dynamics of the learning network is modified in one important way: the gradient noise does not die out anymore (it is annihilated by decaying step-sizes but not by constant step-sizes). The performance of networks in the constant step-size regime is studied in [14], [15], [32], [34]. A useful extension of this work is to study the Polyak-Ruppert averaging version for robustness against ill-conditioning and lack of strong-convexity.

APPENDIX A
PROOF OF THEOREM 1

We denote the error vectors at node k at time i by ψ̃_{k,i} ≜ w^o − ψ_{k,i} and w̃_{k,i} ≜ w^o − w_{k,i}.
We subtract the diffusion recursion from w^o to get

ψ̃_{k,i} = w̃_{k,i−1} + μ(i) [∇_w J_k(w_{k,i−1}) + v_{k,i}(w_{k,i−1})]   (96)
w̃_{k,i} = Σ_{l=1}^{N} a_{lk} ψ̃_{l,i}   (97)

Using the mean-value relation, we can write

∇_w J_k(w_{k,i−1}) = −H_{k,i−1} w̃_{k,i−1}   (98)

where

H_{k,i−1} ≜ ∫₀¹ ∇²_w J_k(w^o − t w̃_{k,i−1}) dt   (99)

and where we used the fact that ∇_w J_k(w^o) = 0 since w^o optimizes J_k(w). Substituting (98) into (96), we get

ψ̃_{k,i} = [I_M − μ(i) H_{k,i−1}] w̃_{k,i−1} + μ(i) v_{k,i}(w_{k,i−1})   (100)

We now derive mean-square-error (MSE) recursions by noting that ‖x‖² = x^T x is a convex function of x. Therefore, applying Jensen's inequality [52, p. 77] to (97), we obtain:

E{‖w̃_{k,i}‖² | F_{i−1}} ≤ Σ_{l=1}^{N} a_{lk} E{‖ψ̃_{l,i}‖² | F_{i−1}}   (101)

for k = 1, ..., N. From (100) and using the gradient-noise model, we get

E{‖ψ̃_{k,i}‖² | F_{i−1}} = E{‖w̃_{k,i−1}‖²_{Σ_{k,i}} | F_{i−1}} + μ²(i) E{‖v_{k,i}(w_{k,i−1})‖² | F_{i−1}}   (102)

where Σ_{k,i} ≜ (I_M − μ(i) H_{k,i−1})². The matrices Σ_{k,i} can be seen to be positive semi-definite for large enough i and bounded by

0 ≤ Σ_{k,i} ≤ γ²(i) I_M   (103)

where

γ(i) ≜ max{ |1 − μ(i)λ̄|, |1 − μ(i)λ| }   (104)

with λ and λ̄ denoting the lower and upper bounds on the eigenvalues of the Hessians ∇²_w J_k(w). Now note that γ²(i) can be upper-bounded by:

γ²(i) = max{ 1 − 2μ(i)λ̄ + μ²(i)λ̄², 1 − 2μ(i)λ + μ²(i)λ² } ≤ 1 − 2μ(i)λ + μ²(i)λ̄²   (105)

In order to simplify the notation, we introduce the upper bound

β²(i) ≜ 1 − 2μ(i)λ + (λ̄² + α) μ²(i)   (106)

where α is defined by (33), namely,

E{‖v_{k,i}(w_{k,i−1})‖² | F_{i−1}} ≤ α ‖w̃_{k,i−1}‖² + σ_v²   (107)
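To make the recursion being analyzed concrete, here is a minimal Python sketch of the adapt-then-combine diffusion updates that (96)-(97) are subtracted from, run with the decaying step-size μ(i) = μ/i. The network, the gradient callables, and all parameter values are illustrative assumptions.

```python
import numpy as np

def diffusion_atc(A, grad_fns, M, mu=1.0, n_iter=1000, seed=0):
    """One realization of adapt-then-combine diffusion.
    A: N x N left-stochastic combination matrix (columns sum to one).
    grad_fns: list of N callables returning a stochastic gradient at w."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    w = np.zeros((N, M))          # w_{k,i-1} for each agent k
    for i in range(1, n_iter + 1):
        step = mu / i             # decaying step-size mu(i) = mu / i
        # Adaptation: psi_{k,i} = w_{k,i-1} - mu(i) * grad_k(w_{k,i-1})
        psi = np.array([w[k] - step * grad_fns[k](w[k], rng) for k in range(N)])
        # Combination: w_{k,i} = sum_l a_{lk} psi_{l,i}
        w = A.T @ psi
    return w
```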

Combining (101), (103), and (107), we obtain, for k = 1, ..., N:

E{‖ψ̃_{k,i}‖² | F_{i−1}} ≤ β²(i) ‖w̃_{k,i−1}‖² + μ²(i) σ_v²   (108)

We introduce the MSE vectors:

W_i ≜ [ ‖w̃_{1,i}‖², ..., ‖w̃_{N,i}‖² ]^T   (109)
Y_i ≜ [ ‖ψ̃_{1,i}‖², ..., ‖ψ̃_{N,i}‖² ]^T   (110)

and rewrite (101) and (108) as:

E{Y_i | F_{i−1}} ⪯ β²(i) W_{i−1} + μ²(i) σ_v² 1_N   (111)
E{W_i | F_{i−1}} ⪯ A^T E{Y_i | F_{i−1}}   (112)

where x ⪯ y indicates that each element of the vector x is less than or equal to the corresponding element of the vector y. Using the fact that if x ⪯ y then Bx ⪯ By for any matrix B with non-negative entries, we can combine the above inequality recursions into a single recursion for W_i:

E{W_i | F_{i−1}} ⪯ β²(i) A^T W_{i−1} + μ²(i) σ_v² 1_N   (113)

Now, we multiply both sides by p^T, where p is the Perron eigenvector of A. This yields the recursion:

E{p^T W_i | F_{i−1}} ≤ β²(i) p^T W_{i−1} + μ²(i) σ_v²   (114)

A. MSE Convergence

For the first part of Theorem 1 (asymptotic MSE convergence), we evaluate the expectation of both sides of the inequality in (114) to get

E{p^T W_i} ≤ β²(i) E{p^T W_{i−1}} + μ²(i) σ_v²   (115)

Now, since μ(i) → 0, we conclude that for large enough i the sequence μ²(i) will assume smaller values than μ(i). Therefore, a large enough time index i_o > 0 exists such that the following condition is satisfied for all i > i_o:

0 < β²(i) = 1 − 2μ(i)λ + (λ̄² + α) μ²(i) < 1   (116)

Furthermore, noting that Σ_i (1 − β²(i)) = ∞ and lim_{i→∞} μ²(i)σ_v² / (1 − β²(i)) = 0, we then invoke Lemma 4 (Appendix G) to arrive at the desired result.

B. Almost-Sure Convergence

We see that (114) fits the form of Lemma 6 (Appendix G), where (116) holds for large enough i, Σ_i (1 − β²(i)) = ∞, lim_{i→∞} μ²(i)σ_v²/(1 − β²(i)) = 0, and Σ_i μ²(i)σ_v² < ∞, and we conclude that p^T W_i → 0 almost surely, so W_i → 0 almost surely as well, since all the entries of p are strictly positive. This implies that w_{k,i} → w^o almost surely for all k = 1, ..., N.

C. MSE Convergence Rate

For the convergence rate, we take the expectation of (114) and, substituting μ(i) = μ/i, write

E{p^T W_i} ≤ (1 − a/i + b/i²) E{p^T W_{i−1}} + c/i²   (117)-(118)

where a ≜ 2μλ, b ≜ μ²(λ̄² + α), and c ≜ σ_v² μ². Using Lemma 5 (Appendix G), we have that when a = 2λμ > 1,

limsup_{i→∞} i E{p^T W_i} ≤ c/(a − 1)   (119)

which in turn implies that each E‖w̃_{k,i}‖² diminishes at the rate 1/i, which is our desired result.

D. Fourth-Order-Moment Convergence Rate

Observe that, by Jensen's inequality, we have that

E‖w̃_{k,i}‖⁴ ≤ Σ_{l=1}^{N} a_{lk} E‖ψ̃_{l,i}‖⁴   (120)

and

‖ψ̃_{k,i}‖⁴ = ‖ [I_M − μ(i)H_{k,i−1}] w̃_{k,i−1} + μ(i) v_{k,i}(w_{k,i−1}) ‖⁴   (121)

By utilizing Lemma 5 from [53], we obtain:

E{‖ψ̃_{k,i}‖⁴ | F_{i−1}} ≤ θ₁(i) ‖w̃_{k,i−1}‖⁴ + 3μ⁴(i) E{‖v_{k,i}(w_{k,i−1})‖⁴ | F_{i−1}} + 8θ₂(i) ‖w̃_{k,i−1}‖² E{‖v_{k,i}(w_{k,i−1})‖² | F_{i−1}}   (122)

where

θ₁(i) ≜ 1 − 4μ(i)λ + 2μ²(i)(λ² + λ̄²) + μ⁴(i)λ̄⁴   (123)
θ₂(i) ≜ β²(i) μ²(i)   (124)

Now, using the fourth-order moment condition on the gradient noise, we have that

E{‖v_{k,i}(w_{k,i−1})‖⁴ | F_{i−1}} ≤ α₄ ‖w̃_{k,i−1}‖⁴ + σ_{v4}⁴   (125)

and using (33) and (125) in (122), we obtain

E{‖ψ̃_{k,i}‖⁴ | F_{i−1}} ≤ θ₁(i) ‖w̃_{k,i−1}‖⁴ + 3μ⁴(i)(α₄‖w̃_{k,i−1}‖⁴ + σ_{v4}⁴) + 8θ₂(i) ‖w̃_{k,i−1}‖² (α‖w̃_{k,i−1}‖² + σ_v²)   (126)

or, equivalently,

E{‖ψ̃_{k,i}‖⁴ | F_{i−1}} ≤ (1 − a/i + O(1/i²)) ‖w̃_{k,i−1}‖⁴ + O(1/i²) ‖w̃_{k,i−1}‖² + O(1/i⁴)   (127)

where a ≜ 4λμ. Taking the unconditional expectation of E{‖ψ̃_{k,i}‖⁴ | F_{i−1}}, we obtain

E‖ψ̃_{k,i}‖⁴ ≤ (1 − a/i + O(1/i²)) E‖w̃_{k,i−1}‖⁴ + O(1/i²) E‖w̃_{k,i−1}‖² + O(1/i⁴)   (128)
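The Perron eigenvector p of the left-stochastic matrix A (satisfying Ap = p, with positive entries summing to one) appears in (114) and throughout the analysis. A minimal sketch of how it can be computed, assuming NumPy and an illustrative 3-agent policy:

```python
import numpy as np

def perron_vector(A):
    """Perron eigenvector p of a primitive left-stochastic A:
    Ap = p, with entries of p positive and summing to one."""
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmin(np.abs(eigvals - 1.0))   # eigenvalue closest to one
    p = np.abs(np.real(eigvecs[:, k]))     # Perron vector is positive
    return p / p.sum()                     # normalize so p^T 1 = 1

# Example: a 3-agent left-stochastic policy (columns sum to one)
A = np.array([[0.6, 0.2, 0.1],
              [0.3, 0.7, 0.3],
              [0.1, 0.1, 0.6]])
p = perron_vector(A)
assert np.allclose(A @ p, p)
```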

Now, observing that E‖w̃_{k,i−1}‖² = O(1/i) according to (119), we have that

E‖ψ̃_{k,i}‖⁴ ≤ (1 − a/i + O(1/i²)) E‖w̃_{k,i−1}‖⁴ + O(1/i³)   (129)

We can now form the network variable

W⁴_i ≜ [ E‖w̃_{1,i}‖⁴, ..., E‖w̃_{N,i}‖⁴ ]^T   (130)

so that

W⁴_i ⪯ (1 − a/i + O(1/i²)) A^T W⁴_{i−1} + O(1/i³) 1_N   (131)

and, taking the maximum norm, we obtain:

‖W⁴_i‖_∞ ≤ (1 − a/i + O(1/i²)) ‖W⁴_{i−1}‖_∞ + e/i³   (132)

where e is a constant independent of i. Using Lemma 5 (Appendix G), we have that when a = 4λμ > 2:

limsup_{i→∞} i² ‖W⁴_i‖_∞ ≤ e/(a − 2)   (133)

and therefore E‖w̃_{k,i}‖⁴ = O(1/i²) when 2λμ > 1.

APPENDIX B
PROOF OF THEOREM 2

First, we establish an equality that relates the excess-risk to weighted mean-square-error quantities.

Lemma 2 (Time-evolution of excess-risk): Let Assumptions 1-2 hold. Then, the excess-risk at agent k is given by

ER_k(i) = AT(i) + TT(i) + HO₁,k(i) + HO₂(i) + HO₃(i)   (134)

where

AT(i) ≜ Σ_{j=1}^{i} μ²(j) Tr(G_{j−1} Ω_{j,i})   [asymptotic term]   (135)
TT(i) ≜ E‖W̃_0‖²_{Ω_{0,i}}   [transient term]   (136)
HO₁,k(i) ≜ E{w̃^T_{k,i} S̄_{k,i} w̃_{k,i}}   [high-order term]   (137)
HO₂(i) ≜ Σ_{j=1}^{i} Tr(R_{d,j} Ω_{j,i})   [high-order term]   (138)
HO₃(i) ≜ 2 Σ_{j=1}^{i} Tr( Ω_{j,i} E{B_j W̃_{j−1} d^T_j} )   [high-order term]   (139)
Ω_{j,i} ≜ (∏_{t=j+1}^{i} B^T_t) Σ (∏_{t=j+1}^{i} B^T_t)^T   (140)
Σ ≜ E_kk ⊗ ½H   (141)
S̄_{k,i} ≜ S_{k,i} − ½H,  where  S_{k,i} ≜ ∫₀¹∫₀^t ∇²_w J_k(w^o − s w̃_{k,i}) ds dt   (142)
𝒜 ≜ A ⊗ I_M   (143)
G_{i−1} ≜ 𝒜^T E{R_{g,i}(W̃_{i−1})} 𝒜   (144)
ℋ ≜ I_N ⊗ H   (145)
B_i ≜ 𝒜^T (I_{MN} − μ(i) ℋ)   (146)
ℋ_{i−1} ≜ diag{ ∫₀¹ ∇²_w J_k(w^o − t w̃_{k,i−1}) dt }, k = 1, ..., N   (147)
d_i ≜ μ(i) 𝒜^T (ℋ − ℋ_{i−1}) W̃_{i−1}   (148)
R_{d,i} ≜ E{d_i d^T_i}   (149)

Moreover, the products that appear in expression (140) for Ω_{j,i} are performed in the following order:

Ω_{j,i} = B^T_{j+1} ⋯ B^T_i Σ B_i ⋯ B_{j+1}   (150)

Proof: First, observe that

ER_k(i) = E‖W̃_i‖²_{E_kk ⊗ S_{k,i}} = E{w̃^T_{k,i} S_{k,i} w̃_{k,i}} = E{w̃^T_{k,i} (½H + S̄_{k,i}) w̃_{k,i}} = E‖W̃_i‖²_Σ + HO₁,k(i)   (151)

Now, to study E‖W̃_i‖²_Σ, we first verify that the error quantity W̃_i evolves according to the following dynamics:

W̃_i = B_i W̃_{i−1} + μ(i) 𝒜^T g_i(W_{i−1}) + d_i   (152)

Equating the weighted energies of both sides of (152) and taking expectations, we arrive at the following recursion, which relates the weighted variances of two successive network error vectors, W̃_i and W̃_{i−1}:

E‖W̃_i‖²_Σ = E‖W̃_{i−1}‖²_{B^T_i Σ B_i} + μ²(i) Tr(G_{i−1} Σ) + E‖d_i‖²_Σ + 2 E{d^T_i Σ B_i W̃_{i−1}}   (153)

where, as in (144),

G_{i−1} ≜ 𝒜^T E{R_{g,i}(W̃_{i−1})} 𝒜   (154)

If we now iterate the recursion (153), we obtain that

E‖W̃_i‖²_Σ = AT(i) + TT(i) + HO₂(i) + HO₃(i)   (155)

where AT(i), TT(i), HO₂(i), and HO₃(i) were defined in (135), (136), (138), and (139), respectively. Then, combining (151) and (155), we obtain the following representation for the excess-risk:

ER_k(i) = AT(i) + TT(i) + HO₁,k(i) + HO₂(i) + HO₃(i)   (156)

where HO₁,k(i) was defined in (137).

The first term on the right-hand side of (134) models the asymptotic behavior of the network since, as we will see further ahead, it decays at the slowest rate in comparison to all other terms in (134). In comparison, the second term on the right-hand side of (134) models the transient behavior of the diffusion network, since it is governed by the initial error in estimating w^o at the different nodes. The remaining terms decay faster than the first two terms and will therefore be considered high-order terms. It is necessary to study the behavior of all terms in (134) to understand the dynamics of the network. First, we establish the convergence rate of the so-called asymptotic term AT(i):

Theorem 4 (Convergence rate of asymptotic term AT(i)): Let Assumptions 1-2 hold. It then holds asymptotically that

AT(i) ≃ (μ²/2) Σ_{m=1}^{M} λ_m α_m(i) p^T R_{v,m} p   (157)

where the notation f(i) ≃ g(i) implies that lim_{i→∞} f(i)/g(i) = 1, and α_m(i) is defined as

α_m(i) ≜ { 1/((2λ_mμ − 1) i) + Θ(1/i^{2λ_mμ}),  2λ_mμ ≠ 1;   log i / i,  2λ_mμ = 1 }   (158)

Moreover, λ_m is the m-th eigenvalue of H and the N×N matrix R_{v,m} is defined as:

R_{v,m} ≜ diag{ [Φ^T R^o_{v,1} Φ]_mm, ..., [Φ^T R^o_{v,N} Φ]_mm }   (159)

where the notation [X]_mm denotes the m-th diagonal element of the matrix X. Specifically, when 2λ_min μ > 1, we have that:

AT(i) ≃ (μ²/(2i)) Σ_{m=1}^{M} [λ_m/(2λ_mμ − 1)] p^T R_{v,m} p   (160)

Proof: Observe that the covariance matrix E{R_{g,i}(W̃_{i−1})} converges to a constant matrix when 2λμ > 1:

E{R_{g,i}(W̃_{i−1})} = E{R_{g,i}(1_N ⊗ w^o)} + O(1/i^{min{2,γ}/4})   (161)

where 0 < γ ≤ 4 is from (30). To see this, consider the quantity:

‖E R_{g,i}(W̃_{i−1}) − E{R_{g,i}(1_N ⊗ w^o)}‖
  (a)≤ E ‖R_{g,i}(W̃_{i−1}) − R_{g,i}(1_N ⊗ w^o)‖
  (b)≤ κ₁ E‖W̃_{i−1}‖ + κ₂ E‖W̃_{i−1}‖^{γ/2}
  (c)≤ κ₁ (E‖W̃_{i−1}‖²)^{1/2} + κ₂ (E‖W̃_{i−1}‖²)^{γ/4}
  (d)= O(1/i^{1/2}) + O(1/i^{γ/4})   (162)

where steps (a) and (c) are due to Jensen's inequality, step (b) is due to (34), and step (d) is due to the rates established in Theorem 1 when 2λμ > 1. We conclude, therefore, that:

lim_{i→∞} E{R_{g,i}(W̃_{i−1})} = R^o_g   (163)

where R^o_g is the limiting gradient-noise covariance defined earlier. Combining with (144), we obtain

lim_{i→∞} G_{i−1} = G ≜ 𝒜^T R^o_g 𝒜   (164)

where, as pointed out earlier, R^o_g is block-diagonal. Since we are interested in the asymptotic performance of the term AT(i), we may replace G_{i−1} with G in the following analysis. Now, with Σ = E_kk ⊗ ½H, we rewrite AT(i) as

AT(i) ≃ Σ_{j=1}^{i} μ²(j) Tr( R^o_g 𝒜 Ω_{j,i} 𝒜^T )   (165)

We observe that (165) is in the form required by Lemma 11 (Appendix G), where we make the identifications

L → R^o_g,  b → 0,  q → 1   (166)

and where R^o_g is a block-diagonal matrix with block matrices {R^o_{v,1}, ..., R^o_{v,N}}; the additional factors of 𝒜 do not alter the asymptotic analysis since 𝒜 acts on the Perron subspace as Ap = p. We thus obtain (157).

Next, we consider the remaining terms of (134), beginning with TT(i):

Theorem 5 (Convergence rate of transient term TT(i)): Let Assumptions 1-2 hold. The transient term in (134) decays asymptotically according to

TT(i) = Θ(1/i^{2λ_min μ}),  as i → ∞   (167)

Proof: See Appendix C.

Actually, in Appendix C, we derive upper and lower bounds for the transient term (see (191)-(194)). Examining these bounds, we notice that the transient term will grow initially before it starts to decay. The asymptotic rate of decay of the transient error is on the order of 1/i^{2λ_min μ}, for any value of λ_min μ.

We now establish the convergence rate of the remaining high-order terms HO₁,k(i), HO₂(i), and HO₃(i) in the case where 2λ_min μ > 1:

Theorem 6 (Convergence rate of HO₁,k(i)): Let Assumptions 1-2 hold. It then holds asymptotically that

HO₁,k(i) = O(1/i^{3/2})   (168)

when 2λμ > 1.

Proof: We have from (142) that

‖S̄_{k,i}‖ (a)≤ τ_k ‖w̃_{k,i}‖ ∫₀¹∫₀^t s ds dt = (τ_k/6) ‖w̃_{k,i}‖   (169)

where step (a) is due to the global Lipschitz condition on the Hessians. Therefore,

HO₁,k(i) ≤ (τ_k/6) E‖w̃_{k,i}‖³ (b)≤ (τ_k/6) (E‖w̃_{k,i}‖⁴)^{3/4} (c)= O(i^{−3/2}),  [2λμ > 1]   (170)

where step (b) is due to Jensen's inequality, which states that E f(x) ≤ f(E x) for a concave function f(x):

E‖w̃_{k,i}‖³ = E (‖w̃_{k,i}‖⁴)^{3/4} ≤ (E‖w̃_{k,i}‖⁴)^{3/4}   (171)

and step (c) is due to the fourth-moment rate in Theorem 1 when 2λμ > 1.

Observe that Theorem 6 establishes that HO₁,k(i) = o(1/i). This is an important result since the asymptotic term converges at the rate 1/i when 2λ_min μ > 1.
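As a numerical sanity check on the 1/i rate appearing in (157)-(160), the following sketch iterates the scalar recursion S(i) = (1 − μλ/i)² S(i−1) + μ²/i², which is the scalar skeleton of the sums appearing in AT(i), and verifies that i·S(i) approaches μ²/(2λμ − 1). The parameter values are arbitrary assumptions.

```python
mu, lam = 2.0, 1.0          # arbitrary values satisfying 2*lam*mu > 1
S = 0.0
for i in range(1, 200001):
    if i > mu * lam:         # skip early iterations where the factor is not in (0, 1)
        S = (1.0 - mu * lam / i) ** 2 * S + (mu / i) ** 2
print(200000 * S, mu ** 2 / (2 * mu * lam - 1))   # both approach ~1.333
```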

Next, we establish the following result for the convergence rate of the second high-order term HO₂(i):

Theorem 7 (Convergence rate of HO₂(i)): Let Assumptions 1-2 hold. It then holds asymptotically that

HO₂(i) = { Θ(1/i³) + Θ(1/i^{2λ_min μ}),  1 < 2λ_min μ ≠ 3;   Θ(log i / i³),  2λ_min μ = 3 }   (172)

when 2λ_min μ > 1.

Proof: See Appendix D.

Thus, Theorem 7 demonstrates that the second high-order term satisfies HO₂(i) = o(1/i), which is again faster than the asymptotic term AT(i). Finally, we examine the final high-order term, HO₃(i):

Theorem 8 (Convergence rate of HO₃(i)): Let Assumptions 1-2 hold. It holds asymptotically that

HO₃(i) = { Θ(1/i^{3/2}) + Θ(1/i^{2λ_min μ}),  1 < 2λ_min μ ≠ 3/2;   Θ(log i / i^{3/2}),  2λ_min μ = 3/2 }   (173)

when 2λ_min μ > 1.

Proof: See Appendix E.

We observe from Theorems 5-8 that TT(i), HO₁,k(i), HO₂(i), and HO₃(i) are higher-order terms in comparison to AT(i) when 2λ_min μ > 1, since they decay at a rate faster than that of AT(i), namely Θ(1/i). Notice that while Theorem 1 required 2λμ > 1, Theorem 4, which establishes the same asymptotic rate, requires 2λ_min μ > 1. However, since λ ≤ λ_min, we know that:

2λμ > 1  ⟹  2λ_min μ > 1   (174)

and thus, as long as the condition of Theorem 1 is satisfied, the condition 2λ_min μ > 1 will automatically be satisfied. We conclude from (134) and Theorems 4-8 that:

ER_k(i) ≃ (μ²/(2i)) Σ_{m=1}^{M} [λ_m/(2λ_mμ − 1)] p^T R_{v,m} p   (175)

when 2λ_min μ > 1, which is our desired result. It is important to observe that it is possible for all the terms TT(i), HO₁,k(i), HO₂(i), and HO₃(i) to depend on the network topology, just as the asymptotic term AT(i) depends on it through the Perron vector p. However, since all of the former terms decay faster than AT(i), they do not contribute to the asymptotic excess-risk curve, and thus their dependency on the network topology was not examined.

APPENDIX C
PROOF OF THEOREM 5

Introduce the Jordan canonical form of the matrix A:

A = T D T^{−1}   (176)

The relation between the quantities {T, D} and {Y, D_N, R} in (47) is as follows:

T = [ p  Y ],  D = diag{1, D_N},  T^{−1} = [ 1_N^T ; R^T ]   (177)

Substituting (141), (146), and (176) into (136), we observe that the weighting matrix Ω_{0,i} can be written as:

Ω_{0,i} = B_1^T ⋯ B_i^T Σ B_i ⋯ B_1 = ( T D^i T^{−1} E_kk T^{−T} D^i T^T ) ⊗ ( Φ K_{0,i} Φ^T )   (178)

where K_{j,i} is the following diagonal matrix:

K_{j,i} ≜ ( ∏_{t=j+1}^{i} (I_M − μ(t)Λ) ) (½Λ) ( ∏_{t=j+1}^{i} (I_M − μ(t)Λ) )   (179)

Now, using (177), we conclude that

T D^i T^{−1} = p 1_N^T + Y D_N^i R^T   (180)

so that

(T D^i T^{−1}) E_kk (T D^i T^{−1})^T
  (a)= (p 1_N^T + Y D_N^i R^T) E_kk (p 1_N^T + Y D_N^i R^T)^T
  = p 1_N^T E_kk 1_N p^T + Y D_N^i R^T E_kk p^T + p E_kk R D_N^{iT} Y^T + Y D_N^i R^T E_kk R D_N^{iT} Y^T
  (b)= p p^T + Y D_N^i R^T E_kk p^T + p E_kk R D_N^{iT} Y^T + Y D_N^i R^T E_kk R D_N^{iT} Y^T   (181)

where step (a) is due to (180), and in step (b) we used the fact that 1_N^T E_kk 1_N = 1 since E_kk contains a single unit element at the (k, k) entry. The first term of (181) is a constant and does not vary with i. On the other hand, all other terms vary with i. Using Lemma 1 (Appendix G), we can now see that all the time-varying terms in (181) decay to zero at least at an exponential rate:

‖Y D_N^i R^T E_kk p^T‖ ≤ ‖Y D_N^i R^T‖ ‖E_kk p^T‖   (182)
‖p E_kk R D_N^{iT} Y^T‖ = ‖Y D_N^i R^T E_kk p^T‖   (183)
‖Y D_N^i R^T E_kk R D_N^{iT} Y^T‖ ≤ ‖E_kk‖ ‖Y D_N^i R^T‖²   (184)

where, from Lemma 1,

‖Y D_N^i R^T‖ ≤ c ‖T‖ ‖T^{−1}‖ [ (ρ(D_N) + 1)/2 ]^i   (185)

And since the spectral radius of D_N is strictly less than one, we find that all terms in (181), with the exception of p p^T, decay to zero at an exponential rate. We can ignore the decaying terms since the convergence rate will be dominated by a slower term, which decays at the rate i^{−2λ_min μ}, in K_{0,i}. To see this, we first write down the transient term as:

E‖W̃_0‖²_{Ω_{0,i}} ≃ E{ W̃_0^T [ (p p^T) ⊗ (Φ K_{0,i} Φ^T) ] W̃_0 }   (186)

We now introduce a linear transformation of the initial error vector W̃_0 by the matrix (p ⊗ Φ) (cf. (187)), which allows us to rewrite (186) as

E‖W̃_0‖²_{Ω_{0,i}} ≃ Σ_{k=1}^{N} E{ w̄_{0,k}^T K_{0,i} w̄_{0,k} }   (187)-(188)

where w̄_{0,k} is the k-th M×1 block of the transformed block vector W̄_0, with W̄_0 = col{w̄_{0,1}, ..., w̄_{0,N}}. The only remaining dependence on the iteration index is now embedded in K_{0,i}, as defined in (179). Examining this diagonal matrix, we have that:

K_{0,i} = diag{ ½λ₁ ∏_{j=1}^{i} (1 − μ(j)λ₁)², ..., ½λ_M ∏_{j=1}^{i} (1 − μ(j)λ_M)² }   (189)

where λ_m is the m-th eigenvalue of the matrix H. Substituting (189) into (188), we obtain:

E‖W̃_0‖²_{Ω_{0,i}} ≃ Σ_{m=1}^{M} ½λ_m ∏_{j=1}^{i} (1 − μ(j)λ_m)² Σ_{k=1}^{N} E|w̄_{0,k,m}|²   (190)

where w̄_{0,k,m} denotes the m-th element of the vector w̄_{0,k}, so that w̄_{0,k} = col{w̄_{0,k,1}, ..., w̄_{0,k,M}}. In order to determine the rate of convergence of the quantity ∏_{j=1}^{i} (1 − μ(j)λ_m)², we appeal to Lemma 7 (Appendix G). Using the bounds (248) in conjunction with (190), we conclude that the transient term can be lower- and upper-bounded by:

E‖W̃_0‖²_{Ω_{0,i}} ≲ Σ_{m=1}^{M} ½λ_m s_m² c_{m,i}² Σ_{k=1}^{N} E|w̄_{0,k,m}|²   (191)-(192)

E‖W̃_0‖²_{Ω_{0,i}} ≳ Σ_{m=1}^{M} ½λ_m s_m² c̄_{m,i}² Σ_{k=1}^{N} E|w̄_{0,k,m}|²   (193)-(194)

where s_m, c_{m,i}, and c̄_{m,i} are the quantities defined in Lemma 7, and where we used the fact that the expressions (1 − λ_mμ/i)^{2i} and (1 − λ_mμ/(i+1))^{2(i+1)} converge asymptotically to e^{−2λ_mμ} [54], which is independent of i. In fact, the numerators in (191)-(194) account for the increase in the excess-risk at the beginning of the iterations. Eventually, however, the denominator terms of the form i^{2λ_mμ} and (i+1)^{2λ_mμ} overtake this increase, and the excess-risk begins to decay from that point onwards. Furthermore, examination of (191)-(194) shows that the m-th term decays at the rate O(i^{−2λ_mμ}), making the slowest term vanish at O(i^{−2λ_min μ}), where λ_min is the smallest eigenvalue of the matrix H.

APPENDIX D
PROOF OF THEOREM 7

Since the matrices R_{d,j} and B^T_{j+1} ⋯ B^T_i Σ B_i ⋯ B_{j+1} are positive semi-definite (as long as Σ is positive semi-definite), we have that

HO₂(i) ≤ Σ_{j=1}^{i} Tr( R_{d,j}/μ²(j) ) Tr( μ²(j) Ω_{j,i} ) = Σ_{j=1}^{i} E‖d_j/μ(j)‖² Tr( μ²(j) Ω_{j,i} )   (195)

where in the first inequality we used the property 0 ≤ Tr(AB) ≤ Tr(A)Tr(B) for positive semi-definite matrices A and B (Lemma 3 in Appendix G). To evaluate the convergence rate of E‖d_j/μ(j)‖², we proceed as follows. First, we denote the block-diagonal entries of the matrix ℋ_{j−1} defined by (147) as

H_{k,j−1} ≜ ∫₀¹ ∇²_w J_k(w^o − t w̃_{k,j−1}) dt   (196)

Then, we note that

‖d_j/μ(j)‖ = ‖ 𝒜^T col{ (H − H_{1,j−1}) w̃_{1,j−1}, ..., (H − H_{N,j−1}) w̃_{N,j−1} } ‖ ≤ ‖𝒜^T‖ Σ_{k=1}^{N} ‖(H − H_{k,j−1}) w̃_{k,j−1}‖   (197)

Using the global Lipschitz condition on the Hessians, we have that

‖H − H_{k,j−1}‖ = ‖ ∫₀¹ [∇²_w J_k(w^o) − ∇²_w J_k(w^o − t w̃_{k,j−1})] dt ‖ ≤ τ_k ‖w̃_{k,j−1}‖ ∫₀¹ t dt = (τ_k/2) ‖w̃_{k,j−1}‖   (198)

Substituting into (197), we obtain

‖d_j/μ(j)‖ ≤ (‖𝒜^T‖/2) Σ_{k=1}^{N} τ_k ‖w̃_{k,j−1}‖²   (199)

so that, upon squaring and using the Cauchy-Schwarz inequality,

‖d_j/μ(j)‖² ≤ (‖𝒜^T‖² τ² / 4) Σ_{k=1}^{N} ‖w̃_{k,j−1}‖⁴   (200)

where

τ² ≜ Σ_{k=1}^{N} τ_k²   (201)

Taking the expectation, we obtain

E‖d_j/μ(j)‖² ≤ (‖𝒜^T‖² τ² / 4) Σ_{k=1}^{N} E‖w̃_{k,j−1}‖⁴ ≤ (N ‖𝒜^T‖² τ² / 4) ‖W⁴_{j−1}‖_∞ (a)= O(1/j²),  [2λμ > 1]   (202)

where W⁴_{j−1} is defined in (130) and step (a) is due to the fourth-moment rate (133). This implies that there exist constants d > 0 and j_o such that, for all j ≥ j_o, we have E‖d_j/μ(j)‖² < d²/j². Therefore, we can split the sum in (195) into two sums:

HO₂(i) ≤ d² Σ_{j=j_o}^{i} (1/j²) Tr( μ²(j) Ω_{j,i} ) + Σ_{j=1}^{j_o−1} E‖d_j‖² Tr( Ω_{j,i} )   (203)

For the first term, we immediately recognize that it fits the form required by Lemma 11 (Appendix G) with the identifications b → 2, L → I, q → j_o. So we conclude that the first term asymptotically becomes

d² Σ_{j=j_o}^{i} (1/j²) Tr( μ²(j) Ω_{j,i} ) ≃ d² (μ²/2) Σ_{m=1}^{M} λ_m α_{m,2}(i) ‖p‖²   (204)

where

α_{m,2}(i) = { 1/((2λ_mμ − 3) i³) + Θ(1/i^{2λ_min μ}),  1 < 2λ_mμ ≠ 3;   log i / i³,  2λ_mμ = 3 }   (205)

and where we require 2λ_min μ > 1, since step (a) in (202) required 2λμ > 1, which automatically guarantees the condition 2λ_min μ > 1 because λ ≤ λ_min. Now, (204) converges at the rate of its slowest term and, hence, we have that

d² Σ_{j=j_o}^{i} (1/j²) Tr( μ²(j) Ω_{j,i} ) = { Θ(1/i³) + Θ(1/i^{2λ_min μ}),  1 < 2λ_min μ ≠ 3;   Θ(log i / i³),  2λ_min μ = 3 }   (206)

On the other hand, for the second term of (203), we observe that the trace can be computed as follows. Using Lemma 7 (step (a)), Lemma 8 (step (b)), the fact that D^{i−j} → E₁ for large i since j only increases up to j_o, which becomes small in comparison to large i (step (c)), and the fact that T E₁ T^{−1} = p 1_N^T (step (d)), we obtain

Tr(Ω_{j,i}) ≃ Σ_{m=1}^{M} ½λ_m (1/i^{2λ_mμ}) [Γ(j+1)/Γ(j+1−λ_mμ)]² ‖p‖²   (207)

where E₁ denotes the N×N matrix with a single unit entry in its (1,1) position; T E₁ T^{−1} is the rank-one matrix spanned by the right- and left-eigenvectors of A corresponding to the eigenvalue at one (the left eigenvector is 1_N since A is left-stochastic, and the right eigenvector is p, normalized so that p^T 1_N = 1 and Ap = p). Substituting (207) into the second term of (203), we obtain:

Σ_{j=1}^{j_o−1} E‖d_j‖² Tr(Ω_{j,i}) ≃ Σ_{m=1}^{M} (½λ_m ‖p‖² / i^{2λ_mμ}) Σ_{j=1}^{j_o−1} E‖d_j‖² [Γ(j+1)/Γ(j+1−λ_mμ)]² (a)= O(1/i^{2λ_min μ})   (208)

where step (a) is due to the fact that the inner sum has finite limits. Combining (206) and (208), we have that:

HO₂(i) = { Θ(1/i³) + Θ(1/i^{2λ_min μ}),  1 < 2λ_min μ ≠ 3;   Θ(log i / i³),  2λ_min μ = 3 }   (209)

APPENDIX E
PROOF OF THEOREM 8

The study of the final term of (134) is slightly more complicated than that of the others since, in this case, the matrix E{B_j W̃_{j−1} d^T_j} is no longer symmetric, let alone positive semi-definite. Nevertheless, using Lemma 3 (Appendix G), we can obtain:

HO₃(i) ≤ 2 Σ_{j=1}^{i} Tr( μ²(j) Ω_{j,i} ) ‖ E{B_j W̃_{j−1} d^T_j} / μ²(j) ‖   (210)

Now, using Jensen's inequality and the sub-multiplicative property of the spectral norm, we have that

‖ E{B_j W̃_{j−1} d^T_j} / μ²(j) ‖ ≤ (1/μ(j)) E{ ‖B_j‖ ‖W̃_{j−1}‖ ‖d_j/μ(j)‖ }   (211)

≤ (1/μ(j)) ‖𝒜^T (I − μ(j)ℋ)‖ E{ ‖W̃_{j−1}‖ ‖d_j/μ(j)‖ }

Using (199), we obtain

‖ E{B_j W̃_{j−1} d^T_j} / μ²(j) ‖ ≤ (1/μ(j)) (‖𝒜^T (I − μ(j)ℋ)‖ ‖𝒜^T‖ / 2) E{ ‖W̃_{j−1}‖ Σ_{k=1}^{N} τ_k ‖w̃_{k,j−1}‖² }   (212)

Now, since ‖W̃_{j−1}‖ ≤ Σ_{k=1}^{N} ‖w̃_{k,j−1}‖, we have that:

E{ ‖W̃_{j−1}‖ Σ_k τ_k ‖w̃_{k,j−1}‖² }
  ≤ E{ (Σ_k ‖w̃_{k,j−1}‖) (Σ_k τ_k ‖w̃_{k,j−1}‖²) }
  (a)≤ sqrt{ E (Σ_k ‖w̃_{k,j−1}‖)² · E (Σ_k τ_k ‖w̃_{k,j−1}‖²)² }
  (b)≤ sqrt{ N Σ_k E‖w̃_{k,j−1}‖² · N τ² Σ_k E‖w̃_{k,j−1}‖⁴ }   (213)

where steps (a)-(b) are due to the Cauchy-Schwarz inequality and τ² ≜ Σ_{k=1}^{N} τ_k². Now, from Theorem 1, we know that E‖w̃_{k,j−1}‖² = O(1/j) and E‖w̃_{k,j−1}‖⁴ = O(1/j²) when 2λμ > 1. Therefore, we can conclude that E{‖W̃_{j−1}‖ Σ_k τ_k ‖w̃_{k,j−1}‖²} = O(j^{−3/2}), and so, recalling that 1/μ(j) = j/μ,

‖ E{B_j W̃_{j−1} d^T_j} / μ²(j) ‖ = O(j^{−1/2})   (214)

when 2λμ > 1. That is, there exist constants c₂ > 0 and j₁ such that, for all j ≥ j₁, we have

‖ E{B_j W̃_{j−1} d^T_j} / μ²(j) ‖ < c₂ / j^{1/2}   (215)

So we can now mirror the technique in Appendix D and split the sum in (210) as

HO₃(i) ≤ 2 c₂ Σ_{j=j₁}^{i} (1/j^{1/2}) Tr( μ²(j) Ω_{j,i} ) + 2 Σ_{j=1}^{j₁−1} Tr( Ω_{j,i} ) ‖E{B_j W̃_{j−1} d^T_j}‖   (216)

For the first term, we use Lemma 11 (Appendix G) with the identifications b → 1/2, L → I, q → j₁ to conclude that

2 c₂ Σ_{j=j₁}^{i} (1/j^{1/2}) Tr( μ²(j) Ω_{j,i} ) ≃ 2 c₂ (μ²/2) Σ_{m=1}^{M} λ_m α_{m,1/2}(i) ‖p‖²   (217)

where

α_{m,1/2}(i) = { 1/((2λ_mμ − 3/2) i^{3/2}) + Θ(1/i^{2λ_mμ}),  1 < 2λ_mμ ≠ 3/2;   log i / i^{3/2},  2λ_mμ = 3/2 }   (218)

Since (217) converges at the rate of its slowest term, we have that:

2 c₂ Σ_{j=j₁}^{i} (1/j^{1/2}) Tr( μ²(j) Ω_{j,i} ) = { Θ(1/i^{3/2}) + Θ(1/i^{2λ_min μ}),  1 < 2λ_min μ ≠ 3/2;   Θ(log i / i^{3/2}),  2λ_min μ = 3/2 }   (219)

For the second term, we substitute (207) to obtain

Σ_{j=1}^{j₁−1} Tr(Ω_{j,i}) ‖E{B_j W̃_{j−1} d^T_j}‖ ≃ Σ_{m=1}^{M} (½λ_m ‖p‖² / i^{2λ_mμ}) Σ_{j=1}^{j₁−1} ‖E{B_j W̃_{j−1} d^T_j}‖ [Γ(j+1)/Γ(j+1−λ_mμ)]² (a)= O(1/i^{2λ_min μ})   (220)

where step (a) is due to the fact that B_j, W̃_{j−1}, and d_j depend on j only and are independent of i, so this finite sum leads to a constant term that is independent of i. Combining (219) and (220), we have that:

HO₃(i) = { Θ(1/i^{3/2}) + Θ(1/i^{2λ_min μ}),  1 < 2λ_min μ ≠ 3/2;   Θ(log i / i^{3/2}),  2λ_min μ = 3/2 }   (221)

APPENDIX F
PROOF OF THEOREM 3

The main difference between the dynamics of the diffusion and consensus implementations is in the definition of the B_i matrices in (146), where

B_i^diff ≜ 𝒜^T (I_{MN} − μ(i) ℋ)   (222)
B_i^cons ≜ 𝒜^T − μ(i) ℋ   (223)

and 𝒜 appears in a multiplicative form in (222) and in an additive form in (223). The apparently small change in the order in which operations take place within the consensus and diffusion strategies actually leads to significant differences in the evolution of the error vectors over the respective networks, leading to worse transient and asymptotic performance for consensus strategies; these conclusions are consistent with results reported in [20], [13], albeit for constant step-size adaptation. To examine the differences in behavior, we consider the network excess-risk:

ER(i) ≜ (1/N) Σ_{k=1}^{N} ER_k(i)   (224)

When 2λ_min μ > 1, we can average the asymptotic terms from (134) to obtain the following expression for the asymptotic excess-risk (which can also be obtained by substituting Σ = (1/N) I_N ⊗ ½H into (135)):

ER(i) ≃ (1/N) Σ_{j=1}^{i} μ²(j) Tr( ½ℋ (∏_{t=j+1}^{i} B_t^T) G (∏_{t=j+1}^{i} B_t^T)^T ),  as i → ∞   (225)

where B_t is either B_t^diff or B_t^cons, depending on which algorithm we wish to examine.
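The difference between (222) and (223) can be made tangible numerically: for an eigenvalue D_kk of the combination matrix A and a Hessian eigenvalue λ_m, the corresponding mode of B_i is D_kk(1 − μ(i)λ_m) under diffusion but D_kk − μ(i)λ_m under consensus (cf. Table II ahead). A minimal Python sketch with arbitrary parameter values:

```python
import numpy as np

mu_i, lam_m = 0.05, 2.0                 # arbitrary step-size and Hessian eigenvalue
D_kk = np.linspace(-0.9, 1.0, 5)        # sample eigenvalues of A (|D_kk| <= 1)

mode_diff = D_kk * (1.0 - mu_i * lam_m) # diffusion: eigenvalues of B_i^diff
mode_cons = D_kk - mu_i * lam_m         # consensus: eigenvalues of B_i^cons

# Diffusion scales every mode further inside the unit circle, while consensus
# shifts them, so modes near D_kk = -1 are pushed toward instability.
print(np.abs(mode_diff).max(), np.abs(mode_cons).max())
```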

TABLE II: Variables in the Diffusion and Consensus Implementations.
  B_i:              diffusion: 𝒜^T(I_MN − μ(i)ℋ);   consensus: 𝒜^T − μ(i)ℋ
  λ_{k,m}(B_i):     diffusion: D_kk (1 − μ(i)λ_m);   consensus: D_kk − μ(i)λ_m
  G:                diffusion: 𝒜^T R^o_g 𝒜;          consensus: R^o_g
  y^{bT}_{k,m} G y^b_{k,m}:  diffusion: D_kk² y^{bT}_{k,m} R^o_g y^b_{k,m};   consensus: y^{bT}_{k,m} R^o_g y^b_{k,m}

Likewise, the matrix G in (225) is either G^diff or G^cons, depending on the algorithm, where G^diff ≜ 𝒜^T R^o_g 𝒜 and G^cons ≜ R^o_g. Table II lists the parameters for both implementations. For simplicity, we assume A is symmetric; more generally, we can consider combination policies A that are close to symmetric and employ arguments similar to [20]. The final conclusion will be similar to the arguments given here. Now, since A is primitive, the matrix D in the Jordan canonical form A = T D T^{−1} is diagonal, with a single eigenvalue at one and all other eigenvalues strictly inside the unit circle. We let the vectors r_k and y_k, for k = 1, ..., N, represent the right and left eigenvectors, respectively, of the matrix A^T corresponding to the eigenvalue D_kk (the k-th diagonal element of D), i.e., A^T r_k = D_kk r_k and y_k^T A^T = D_kk y_k^T, where we normalize the vectors so that r_k^T y_l = δ_{k,l}. In fact, since A is symmetric, the eigenvectors {r_k} are themselves orthonormal, as are the eigenvectors {y_k}. When Ap = p, then D_11 = 1, r_1 = 1_N, and y_1 = p. Furthermore, let {s_m, m = 1, ..., M} denote the eigenvectors of the matrix H, i.e., H s_m = λ_m s_m, where λ_m is the m-th eigenvalue of H and the eigenvectors s_m are normalized so that ‖s_m‖ = 1.

We observe that the matrices B_i^diff and B_i^cons share the same eigenvectors but have different eigenvalues. We denote the right and left eigenvectors of B_i by r^b_{l,m} and y^b_{l,m}. They can be found to be r^b_{l,m} = r_l ⊗ s_m and y^b_{l,m} = y_l ⊗ s_m [20], [13]. We now introduce the eigendecomposition

B_i = Σ_{k=1}^{N} Σ_{m=1}^{M} r^b_{k,m} y^{bT}_{k,m} λ_{k,m}(B_i)   (226)

where the expressions for the eigenvalues λ_{k,m}(B_i) are listed in Table II for the diffusion and consensus strategies. Now, since the eigenvectors are orthonormal, the finite product of B_t matrices is:

∏_{t=j+1}^{i} B_t = Σ_{k=1}^{N} Σ_{m=1}^{M} r^b_{k,m} y^{bT}_{k,m} ∏_{t=j+1}^{i} λ_{k,m}(B_t)   (227)

We substitute (227) into (225) to get

ER(i) = (μ²/(2N)) Σ_{j=1}^{i} (1/j²) Tr( ℋ Σ_{k,l} Σ_{m₁,m₂} r^b_{k,m₁} (∏_t λ_{k,m₁}(B_t)) y^{bT}_{k,m₁} G y^b_{l,m₂} (∏_t λ_{l,m₂}(B_t)) r^{bT}_{l,m₂} )
     (b)= (μ²/(2N)) Σ_{j=1}^{i} Σ_{k=1}^{N} Σ_{m=1}^{M} (λ_m ‖r_k‖² / j²) y^{bT}_{k,m} G y^b_{k,m} ∏_{t=j+1}^{i} λ²_{k,m}(B_t)   (228)

where step (a) is due to ℋ = I_N ⊗ H and r^b_{k,m} = r_k ⊗ s_m, and step (b) is due to the fact that D is diagonal and H is symmetric, which implies that r_k^T r_l = ‖r_k‖² δ_{k,l} and s_{m₁}^T H s_{m₂} = λ_{m₁} δ_{m₁,m₂}, where δ_{i,j} = 1 only when i = j and is zero otherwise. Using the last two lines of Table II, Lemma 7, the asymptotic expansion (265), and the property x Γ(x) = Γ(x+1), we can write for the diffusion strategy:

ER^diff(i) ≃ (μ²/(2N)) Σ_{k=1}^{N} Σ_{m=1}^{M} λ_m ‖r_k‖² D_kk² y^{bT}_{k,m} R^o_g y^b_{k,m} (1/i^{2λ_mμ}) Σ_{j=1}^{i} (D_kk^{2(i−j)}/j²) [Γ(j+1)/Γ(j+1−λ_mμ)]²   (229)

For the consensus algorithm, we similarly have:

ER^cons(i) ≃ (μ²/(2N)) Σ_{k=1}^{N} Σ_{m=1}^{M} λ_m ‖r_k‖² y^{bT}_{k,m} R^o_g y^b_{k,m} (1/i^{2λ_mμ/D_kk}) Σ_{j=1}^{i} (D_kk^{2(i−j)}/j²) [Γ(j+1)/Γ(j+1−λ_mμ/D_kk)]²   (230)

where, in each case, we used the last two lines of Table II, Lemma 7, the asymptotic expansion (265), and the property x Γ(x) = Γ(x+1). By comparing (229) and (230), we observe that the expressions for the asymptotic excess-risk of the diffusion and consensus strategies are identical except for the innermost summation. When D_kk = 1, the summands inside the innermost terms are identical. In fact, this variation is the key to the performance difference between the two algorithms. Using Lemma 10 (Appendix G) with b = 2, we can obtain the following upper bound for the inner summation of the diffusion strategy for any |D_kk| < 1:

(1/i^{2λ_mμ}) Σ_{j=1}^{i} (D_kk^{2(i−j)}/j²) [Γ(j+1)/Γ(j+1−λ_mμ)]² ≲ −1/(log(D_kk²) i²)   (231)

Similarly, using Lemma 10 with b = 2, we can obtain the following lower bound for the inner summation of the consensus strategy for any |D_kk| < 1:

(1/i^{2λ_mμ/D_kk}) Σ_{j=1}^{i} (D_kk^{2(i−j)}/j²) [Γ(j+1)/Γ(j+1−λ_mμ/D_kk)]² ≳ −1/(log(D_kk²) i²)   (232)

as i → ∞. Therefore,

ER^cons(i) − ER^diff(i) ≳ (μ²/(2N i²)) Σ_{k: |D_kk|<1} Σ_{m=1}^{M} λ_m ‖r_k‖² y^{bT}_{k,m} R^o_g y^b_{k,m} · (D_kk² − 1)/log(D_kk²) ≥ 0   (233)

for large i, where step (a) is due to (231)-(232), and the final inequality holds because (D_kk² − 1)/log(D_kk²) > 0 whenever |D_kk| < 1. The consensus strategy therefore cannot asymptotically outperform the diffusion strategy.

APPENDIX G
USEFUL LEMMAS

In this appendix, we list several useful results that are used in the earlier appendices.

A. Matrix Results

We first introduce the following lemma, which relates the norm of a matrix power to the power of its spectral radius.

Lemma 1 (Bound on the norm of a matrix power): Let A denote a matrix whose spectral radius ρ(A) is strictly less than one. Then ‖A^n‖ ≤ c [(ρ(A)+1)/2]^n, where n ∈ N and c is some positive constant.

Proof: From [40, p. 299], we have ‖A^n‖ ≤ c (ρ(A) + ε)^n for any ε > 0. Now let ε = (1 − ρ(A))/2.

We can also bound the trace of the product of two matrices.

Lemma 3 (A trace inequality): Given a positive semi-definite matrix A ∈ R^{R×R}, A ≥ 0, and a matrix B ∈ R^{R×R}, it holds that:

Tr(AB) ≤ Tr(A) ‖B‖   (234)

where ‖·‖ is the induced 2-norm (maximum singular value). Furthermore, when B is also positive semi-definite, we have that:

0 ≤ Tr(AB) ≤ Tr(A) ‖B‖ ≤ Tr(A) Tr(B)   (235)

Proof: Relation (234) was proved in [55]. When the matrix B is also positive semi-definite, we have from [56], [57] that

λ_min(B) Tr(A) ≤ Tr(AB) ≤ λ_max(B) Tr(A) = ‖B‖ Tr(A)

Since the matrix B is positive semi-definite, we have that λ_min(B) ≥ 0. Furthermore, we may use the inequality λ_max(B) ≤ Tr(B), valid since B is positive semi-definite, to obtain (235).

B. Convergence of Inequality Recursions

We list the following lemma from [31, p. 45], which demonstrates the convergence of a deterministic recursion.

Lemma 4 (Convergence of a deterministic recursion): Let a sequence v(i) satisfy

v(i) ≤ (1 − η(i)) v(i−1) + ζ(i)

where

Σ_{i=0}^{∞} η(i) = ∞,  0 ≤ η(i) < 1,  ζ(i) ≥ 0,  ζ(i)/η(i) → 0
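A quick numeric illustration of Lemma 1 for an arbitrary matrix with spectral radius below one; NumPy is an assumed dependency and the matrix is a placeholder:

```python
import numpy as np

A = np.array([[0.5, 0.3],
              [0.1, 0.4]])
rho = max(abs(np.linalg.eigvals(A)))       # spectral radius, here < 1
rate = (rho + 1.0) / 2.0

norms = [np.linalg.norm(np.linalg.matrix_power(A, n), 2) for n in range(1, 30)]
c = max(norms[n] / rate ** (n + 1) for n in range(len(norms)))
# For every computed n, ||A^n|| <= c * ((rho(A)+1)/2)^n with this finite c
print(rho, rate, c)
```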

Then,

limsup_{i→∞} v(i) ≤ 0   (236)

and if v(i) ≥ 0, then v(i) → 0.

In addition to demonstrating convergence of recursions of the above form, we are also interested in the convergence rate for special cases of η(i). To this end, we establish the following lemma, which extends Lemma 4 [31, p. 45].

Lemma 5 (Convergence rate of a deterministic recursion): Assume that f > 0, d > 0, p > 0, with f > p, and let the deterministic sequence v(i) ≥ 0 satisfy

v(i) ≤ (1 − f/i + O(1/i²)) v(i−1) + d/i^{p+1}   (237)

Then,

limsup_{i→∞} i^p v(i) ≤ d/(f − p)   (238)

Proof: Define the sequence

u(i) ≜ i^p v(i) − d/(f − p)   (239)

Then, it holds that:

u(i+1) = (i+1)^p v(i+1) − d/(f − p)
  (a)≤ (i+1)^p [ (1 − f/(i+1) + O(1/i²)) v(i) + d/(i+1)^{p+1} ] − d/(f − p)
  (b)= (1 + p/i + O(1/i²)) (1 − f/(i+1) + O(1/i²)) i^p v(i) + d/(i+1) − d/(f − p)
  (c)= (1 − (f − p)/(i+1) + O(1/i²)) u(i) + O(1/i²)   (240)

where step (a) is due to (237), step (b) is due to the series expansion of (1 + 1/i)^p about 1/i = 0, and step (c) is due to:

(1 + p/i + O(1/i²)) (1 − f/(i+1) + O(1/i²)) = 1 − (f − p)/(i+1) + O(1/i²)   (241)

Recall that we assumed f > p and, therefore,

Σ_{i=1}^{∞} [ (f − p)/(i+1) + O(1/i²) ] = ∞   (242)

and

lim_{i→∞} O(1/i²) / [ (f − p)/(i+1) + O(1/i²) ] = 0   (243)

We now call upon Lemma 4 to deduce that limsup_{i→∞} u(i) ≤ 0, which leads to (238).

We also have the following stochastic analogue to Lemma 4 [31].

Lemma 6 (Convergence of a stochastic recursion): Let there be a sequence of random variables v(0), v(1), ... with v(i) ≥ 0 and E v(0) < ∞, satisfying

E{v(i) | v(0), ..., v(i−1)} ≤ (1 − η(i)) v(i−1) + ζ(i)   (244)

with

Σ_{i=0}^{∞} η(i) = ∞,  0 ≤ η(i) < 1,  ζ(i) ≥ 0,  ζ(i)/η(i) → 0,  Σ_{i=0}^{∞} ζ(i) < ∞   (245)

Then, v(i) → 0 almost surely.

C. Finite Products and the Behavior of Special Functions

We now examine the behavior of finite products. We will observe that these products are closely related to different forms of the Gamma function defined in (247) [54].

Lemma 7 (Bounds and identities on finite products): Let μ > 0, λ > 0, and j < i. It holds that

∏_{t=j+1}^{i} (1 − λμ/t) = [Γ(i+1−λμ)/Γ(i+1)] [Γ(j+1)/Γ(j+1−λμ)]   (246)

where the Gamma function is defined by:

Γ(x) ≜ ∫₀^∞ t^{x−1} e^{−t} dt   (247)

Furthermore, when i is large, it holds that:

c̄_{m,i} s_m ≤ ∏_{j=⌈λ_mμ⌉+1}^{i} (1 − λ_mμ/j) ≤ c_{m,i} s_m   (248)

where s_m is a constant that depends only on λ_mμ (and not on i), and

c_{m,i} ∝ f_{m,i},  c̄_{m,i} ∝ f_{m,i+1},  f_{m,i} ≜ (1 − λ_mμ/i)^i / i^{λ_mμ}   (249)-(252)

up to constants that depend only on λ_mμ; here ⌈·⌉ indicates the ceiling operator and log denotes the natural logarithm.

Proof: For (246), we observe, using the property Γ(x+1) = x Γ(x), that

1 − λμ/t = (t − λμ)/t = [Γ(t+1−λμ)/Γ(t−λμ)] [Γ(t)/Γ(t+1)]   (253)

so that, telescoping the product,

∏_{t=j+1}^{i} (1 − λμ/t) = [Γ(i+1−λμ)/Γ(i+1)] [Γ(j+1)/Γ(j+1−λμ)]   (254)
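The identity (246) is easy to confirm numerically with log-Gamma evaluations; a minimal check using SciPy, an assumed dependency, with arbitrary parameter values:

```python
import numpy as np
from scipy.special import gammaln

lam_mu, j, i = 0.7, 5, 200       # arbitrary values with lam_mu < j + 1

prod = np.prod([1.0 - lam_mu / t for t in range(j + 1, i + 1)])
gamma_form = np.exp(gammaln(i + 1 - lam_mu) - gammaln(i + 1)
                    + gammaln(j + 1) - gammaln(j + 1 - lam_mu))
assert np.isclose(prod, gamma_form)      # both sides of (246) agree
```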

Equivalently, writing the product as a ratio and using ∏_{t=j+1}^{i} t = Γ(i+1)/Γ(j+1) (255), we have

∏_{t=j+1}^{i} (1 − λμ/t) = ∏_{t=j+1}^{i} (t − λμ) / ∏_{t=j+1}^{i} t (a)= [Γ(i+1−λμ)/Γ(j+1−λμ)] [Γ(j+1)/Γ(i+1)]   (256)

where step (a) is due to (255) and the analogous identity for the shifted product; in the special case λμ = 0, both sides reduce to one. This establishes (246).

For (248), we start from

∏_{j} (1 − λ_mμ/j) = exp{ Σ_{j=⌈λ_mμ⌉+1}^{i} log(1 − λ_mμ/j) } (a)= s_m exp{ Σ_j log(1 − λ_mμ/j) − (constant part) }   (257)

where step (a) is valid since i is assumed to be large and s_m was defined in (249). Observe that s_m is constant and does not depend on i. On the other hand, the remaining sum in (257) can be bounded by using the following integral bounds, valid for any increasing function f(x):

∫_{a−1}^{b} f(x) dx ≤ Σ_{j=a}^{b} f(j) ≤ ∫_{a}^{b+1} f(x) dx   (258)

Applying (258) to the function f(x) = log(1 − λ_mμ/x), which is an increasing function of x for x > λ_mμ, we obtain exponential bounds of the form (259), with the integration limits shifted by one between the lower and upper bounds. We can evaluate the definite integrals by noting that the indefinite integral of log(1 − λ_mμ/x) is given by

∫ log(1 − λ_mμ/x) dx = x log(1 − λ_mμ/x) − λ_mμ log(x − λ_mμ) + constant   (260)

Evaluating the integrals in (259) using (260) and grouping the i-dependent factors leads to the quantities c_{m,i} and c̄_{m,i} in (249)-(252); combining with (257), we arrive at (248).

In (246), we established that a finite product can be interpreted in terms of Gamma functions. We are interested in the behavior of the Gamma function in the asymptotic regime for its argument. For this purpose, we call upon the following lemma from [54], [58], which describes some properties of the Gamma function.

Lemma 8 (Properties of Gamma functions): The Gamma function satisfies

lim_{s→∞} [Γ(s+a)/Γ(s+c)] s^{c−a} = 1   (265)

for |arg(s+a)| < π, and

Γ(s, x) = x^{s−1} e^{−x} [ Σ_{m=0}^{M−1} (Γ(s)/Γ(s−m)) x^{−m} + O(|x|^{−M}) ]   (266)

for x → ∞, −3π/2 < arg(x) < 3π/2, and any integer M. Here, the notation Γ(s, x) denotes the upper incomplete Gamma function:

Γ(s, x) ≜ ∫_x^∞ t^{s−1} e^{−t} dt   (267)

Using the above lemma, we can now examine the convergence rate of a series of Gamma functions. In Lemma 9, we observe that such a series converges at different rates, 1/i^ν or log i/i^ν. And in Lemma 10, we observe that when each term in the series is weighted exponentially, the convergence rate becomes faster.

Lemma 9 (Series of Gamma functions): Let b ≥ 0 and q ≥ 1. Then,

(1/i^{2λ_mμ}) Σ_{j=q}^{i} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² ≃ { i^{1−b}/(2λ_mμ + 1 − b) + Θ(1/i^{2λ_mμ}),  2λ_mμ ≠ b − 1;   log i / i^{b−1},  2λ_mμ = b − 1 }   (268)

Proof: Observe that (265) states that Γ(s+a)/Γ(s+c) ≃ s^{a−c} for large s. Thus, for any ε > 0, there exists q₁ such that for all s ≥ q₁, the difference between unity and the ratio of Γ(s+a)/Γ(s+c) to its asymptotic function s^{a−c} is small:

| [Γ(s+a)/Γ(s+c)] s^{c−a} − 1 | < ε   (269)

which implies that

(1 − ε) s^{a−c} < Γ(s+a)/Γ(s+c) < (1 + ε) s^{a−c}   (270)

Setting s = j+1, a = 0, c = −λ_mμ into (270) and squaring, we obtain the following bound for the summand in (268):

(1−ε)² j^{2λ_mμ−b} ≲ j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² ≲ (1+ε)² j^{2λ_mμ−b}   (271)

Therefore, we can split the sum into two terms, one covering values from q to q₁−1, while the other sums over values from q₁ to i:

(1/i^{2λ_mμ}) Σ_{j=q}^{i} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² (a)= O(1/i^{2λ_mμ}) + (1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]²   (272)

where step (a) is due to the fact that the first term is proportional to 1/i^{2λ_mμ} with no other dependence on i. We would like to obtain the convergence rate of the second term in (272). For this, using (271), we have

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² < (1+ε)² (1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{2λ_mμ−b}   (273)

and

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² > (1−ε)² (1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{2λ_mμ−b}   (274)

Therefore, we only need to study the behavior of (1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{2λ_mμ−b}. We rely on the following integral bounds, valid for any positive monotonic function f(x):

∫_{a−1}^{b₁} f(x) dx ≶ Σ_{i=a}^{b₁} f(i) ≶ ∫_{a}^{b₁+1} f(x) dx   (275)

(with the directions of the inequalities determined by whether f is increasing or decreasing) and use the following indefinite integral for the case 2λ_mμ − b ≠ −1:

∫ x^{2λ_mμ−b} dx = x^{2λ_mμ−b+1}/(2λ_mμ − b + 1) + constant   (276)

to obtain the following bounds:

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{2λ_mμ−b} ≃ (1/i^{2λ_mμ}) ∫_{q₁}^{i} x^{2λ_mμ−b} dx = i^{1−b}/(2λ_mμ − b + 1) + O(1/i^{2λ_mμ})   (277)

Substituting (277) into (273), we obtain the following upper bound for the second term of (272):

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² < (1+ε)² [ i^{1−b}/(2λ_mμ − b + 1) + O(1/i^{2λ_mμ}) ]   (278)

Similarly, we can use the integral bound (275) to obtain the lower bound:

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{2λ_mμ−b} ≥ i^{1−b}/(2λ_mμ − b + 1) + O(1/i^{2λ_mμ})   (279)

and, substituting (279) into (274),

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² > (1−ε)² [ i^{1−b}/(2λ_mμ − b + 1) + O(1/i^{2λ_mμ}) ]   (280)

We conclude from (278) and (280) that for large i, when 2λ_mμ − b + 1 ≠ 0 and for any ε > 0,

| (1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² / [ i^{1−b}/(2λ_mμ−b+1) + O(1/i^{2λ_mμ}) ] − 1 | < ε   (281)

which implies that, as i → ∞,

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² ≃ i^{1−b}/(2λ_mμ − b + 1) + O(1/i^{2λ_mμ})   (282)

when 2λ_mμ − b + 1 ≠ 0, which is the first line of (268). On the other hand, when 2λ_mμ − b + 1 = 0, we may use the following indefinite integral in place of (276):

∫ x^{2λ_mμ−b} dx = ∫ dx/x = log x + constant   (283)

Then, we have that

(1/i^{b−1}) Σ_{j=q₁}^{i} 1/j ≃ (1/i^{b−1}) ∫_{q₁}^{i} dx/x = log i / i^{b−1} + O(1/i^{b−1})   (284)

Substituting (284) into (273), we obtain the following upper bound for the second term of (272):

(1/i^{b−1}) Σ_{j=q₁}^{i} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² · i^{b−1−2λ_mμ+...} < (1+ε)² [ log i / i^{b−1} + O(1/i^{b−1}) ]   (285)

Similarly, we can use the integral bound (275) to obtain the lower bound:

(1/i^{b−1}) Σ_{j=q₁}^{i} 1/j ≥ (1/i^{b−1}) ∫_{q₁}^{i} dx/x = log i / i^{b−1} + O(1/i^{b−1})   (286)

Substituting (286) into (274), we obtain the following lower bound for the second term of (272):

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² > (1−ε)² [ log i / i^{b−1} + O(1/i^{b−1}) ]   (287)

We conclude from (285) and (287) that when 2λ_mμ = b − 1, and for any ε > 0, the normalized deviation of the sum from log i / i^{b−1} is smaller than ε (288), which implies that, as i → ∞,

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² ≃ log i / i^{b−1}   (289)

when 2λ_mμ = b − 1, which is the second line of (268).

Compared with Lemma 9, we now scale each term in the series by an exponentially decaying weight. We will observe that the convergence rate becomes faster than in Lemma 9.

Lemma 10 (Series of weighted Gamma functions): Let q ≥ 1, b ≥ 0, and 0 < ρ < 1. Then we have the following:

(1/i^{2λ_mμ}) Σ_{j=q}^{i} ρ^{i−j} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² = Θ(1/i^b),  as i → ∞   (290)

and

lim_{i→∞} i^b (1/i^{2λ_mμ}) Σ_{j=q}^{i} ρ^{i−j} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² = −1/log ρ   (291)

Proof: Notice that, due to (265), for any ε > 0 there exists an integer q₁ such that for all j ≥ q₁, (271) is satisfied. Therefore, we may divide the sum in (290) into two parts, j < q₁ and j ≥ q₁:

(1/i^{2λ_mμ}) Σ_{j=q}^{i} ρ^{i−j} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² = O(ρ^i / i^{2λ_mμ}) + (1/i^{2λ_mμ}) Σ_{j=q₁}^{i} ρ^{i−j} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]²   (292)

Using the right-most bound in (271), for any ε > 0 and large enough q₁:

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} ρ^{i−j} j^{−b} [Γ(j+1)/Γ(j+1−λ_mμ)]² ≤ O(ρ^i / i^{2λ_mμ}) + (1+ε)² (ρ^i / i^{2λ_mμ}) Σ_{j=q₁}^{i} ρ^{−j} j^{2λ_mμ−b} (a)≤ O(ρ^i / i^{2λ_mμ}) + (1+ε)² (ρ^i / i^{2λ_mμ}) ∫_{q₁}^{i+1} e^{−x log ρ} x^{2λ_mμ−b} dx   (293)

where step (a) is due to (275). Now, in order to compute the definite integral on the right-hand side of (293), we use the following indefinite integral [54, p. 108]:

∫ x^{m₁} e^{−βx^n} dx = −Γ(γ, βx^n)/(nβ^γ) + constant   (294)

where γ ≜ (m₁+1)/n, β ≠ 0, and n ≠ 0. Making the identifications m₁ → 2λ_mμ − b, n → 1, β → log ρ, we get

∫ e^{−x log ρ} x^{2λ_mμ−b} dx = −Γ(2λ_mμ − b + 1, x log ρ)/(log ρ)^{2λ_mμ−b+1} + constant   (295)

Using (295) in conjunction with (293), we obtain:

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} ρ^{i−j} (·) ≤ O(ρ^i / i^{2λ_mμ}) + (1+ε)² (ρ^i / i^{2λ_mμ}) [ Γ(2λ_mμ−b+1, q₁ log ρ) − Γ(2λ_mμ−b+1, i log ρ) ] / (log ρ)^{2λ_mμ−b+1}   (296)

Now, observe that the first term inside the bracket of (296) is a constant; its product with ρ^i / i^{2λ_mμ} may therefore be combined with the first term to obtain:

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} ρ^{i−j} (·) ≤ O(ρ^i / i^{2λ_mμ}) − (1+ε)² (ρ^i / i^{2λ_mμ}) Γ(2λ_mμ−b+1, i log ρ)/(log ρ)^{2λ_mμ−b+1}   (297)

What remains is to characterize the convergence rate of the second term in (297). To do this, we utilize the asymptotic expansion of the upper incomplete Gamma function listed in (266), with the identifications s → 2λ_mμ − b + 1, x → i log ρ, M → 1, and write

Γ(2λ_mμ − b + 1, i log ρ) = (i log ρ)^{2λ_mμ−b} e^{−i log ρ} [1 + O(1/i)] = i^{2λ_mμ−b} (log ρ)^{2λ_mμ−b} ρ^{−i} + O(ρ^{−i} i^{2λ_mμ−b−1})   (298)

Substituting (298) into (297), we obtain

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} ρ^{i−j} (·) ≤ O(ρ^i / i^{2λ_mμ}) + (1+ε)² / (−log ρ · i^b) + O(1/i^{b+1})   (299)

The lower bound is derived in a similar way starting from the left-most bound in (271), except that the integration limits for the lower bound are taken to be {q₁, i−1}:

(1/i^{2λ_mμ}) Σ_{j=q₁}^{i} ρ^{i−j} (·) ≥ O(ρ^i / i^{2λ_mμ}) + (1−ε)² / (−log ρ · i^b) + O(1/i^{b+1})   (300)

as i → ∞. Since the upper and lower bounds hold for any ε > 0 and large i, we obtain (291).

The final intermediate result we establish concerns the convergence rate of a series of traces.

Lemma 11 (Convergence rate of a series of traces): Assume that Σ = E_kk ⊗ ½H, b ≥ 0, q ≥ 1, and let L denote a block-diagonal matrix:

L ≜ blockdiag{L₁, ..., L_N}   (301)

where each L_k ∈ R^{M×M}. Then,

C(i) ≜ Σ_{j=q}^{i} j^{−b} μ²(j) Tr(L Ω_{j,i}) ≃ (μ²/2) Σ_{m=1}^{M} λ_m α_{m,b}(i) p^T ℒ_m p   (302)

where μ(i) = μ/i, Ω_{j,i} is defined by (140), α_{m,b}(i) is given by

α_{m,b}(i) ≜ { 1/((2λ_mμ − b − 1) i^{b+1}) + Θ(1/i^{2λ_mμ}),  2λ_mμ ≠ b + 1;   log i / i^{b+1},  2λ_mμ = b + 1 }   (303)

and

ℒ_m ≜ diag{ [Φ^T L₁ Φ]_mm, ..., [Φ^T L_N Φ]_mm }   (304)

where Φ is the eigenbasis of the matrix H, as defined earlier.

Proof: Introduce the Jordan canonical form of the matrix A:

A = T D T^{−1}   (305)

Now, recall the definition of the weighting matrix Ω_{j,i} from (140), reproduced here for convenience:

Ω_{j,i} ≜ (∏_{t=j+1}^{i} B^T_t) Σ (∏_{t=j+1}^{i} B^T_t)^T   (306)

Substituting (141), (146), and (305) into Ω_{j,i}, we obtain:

Ω_{j,i} = ( T D^{i−j} T^{−1} E_kk T^{−T} D^{i−j} T^T ) ⊗ ( Φ K_{j,i} Φ^T )   (307)

where K_{j,i} was defined in (179), repeated here for convenience:

K_{j,i} ≜ ( ∏_{t=j+1}^{i} (I_M − μ(t)Λ) ) (½Λ) ( ∏_{t=j+1}^{i} (I_M − μ(t)Λ) )   (308)

Thus, we can rewrite C(i) as:

C(i) = Σ_{j=q}^{i} j^{−b} μ²(j) Tr( L [ T D^{i−j} T^{−1} E_kk T^{−T} D^{i−j} T^T ] ⊗ [ Φ K_{j,i} Φ^T ] )   (309)

Expanding the trace elementwise, denote by b_kk the diagonal entries of the N×N matrix T D^{i−j} T^{−1} E_kk T^{−T} D^{i−j} T^T (310)-(311), and let L'_k ≜ Φ^T L_k Φ (314). Since K_{j,i} is diagonal with entries

[K_{j,i}]_mm = ½λ_m ∏_{t=j+1}^{i} (1 − μ(t)λ_m)²   (313)

the trace in (309) separates across the blocks (312), (315), and across the diagonal entries of K_{j,i}, so that

C(i) = Σ_{m=1}^{M} Σ_{j=q}^{i} [K_{j,i}]_mm j^{−b} μ²(j) Σ_{k=1}^{N} b_kk [L'_k]_mm   (316)

Finally, note that

Tr( [ T D^{i−j} T^{−1} E_kk T^{−T} D^{i−j} T^T ] ℒ_m ) = Σ_{k=1}^{N} b_kk [L'_k]_mm   (317)

Combining (310), (313), (314), (316), and (317), we obtain:

C(i) = (1/2) Σ_{m=1}^{M} λ_m Σ_{j=q}^{i} j^{−b} μ²(j) ∏_{t=j+1}^{i} (1 − μ(t)λ_m)² Tr( [ T D^{i−j} T^{−1} E_kk T^{−T} D^{i−j} T^T ] ℒ_m )   (318)

where ℒ_m is defined by (304). Next, we observe that the product ∏_{t=j+1}^{i} (1 − μ(t)λ_m) has the same form as the term in (246) when μ(t) = μ/t and, hence, it can be described by ratios of Gamma functions:

∏_{t=j+1}^{i} (1 − μ(t)λ_m) = [Γ(i+1−λ_mμ)/Γ(i+1)] [Γ(j+1)/Γ(j+1−λ_mμ)]   (319)

We now utilize (265) with the identifications s → i+1, a → −λ_mμ, c → 0 to deduce that the first fraction in (319) converges asymptotically as i^{−λ_mμ}. Therefore, using the property x Γ(x) = Γ(x+1), we have

μ²(j) ∏_{t=j+1}^{i} (1 − μ(t)λ_m)² ≃ (μ²/j²) (1/i^{2λ_mμ}) [Γ(j+1)/Γ(j+1−λ_mμ)]²   (320)

Substituting (320) into (318), we get, as i → ∞:

C(i) ≃ (μ²/2) Σ_{m=1}^{M} λ_m (1/i^{2λ_mμ}) Σ_{j=q}^{i} j^{−b−2} [Γ(j+1)/Γ(j+1−λ_mμ)]² Tr( [ T D^{i−j} T^{−1} E_kk T^{−T} D^{i−j} T^T ] ℒ_m )   (321)

We use Tr(A^T B C D^T) = vec(A)^T (D ⊗ B) vec(C) to expand the trace as:

C(i) ≃ (μ²/2) Σ_{m=1}^{M} λ_m vec(T^T ℒ_m T)^T [ (1/i^{2λ_mμ}) Σ_{j=q}^{i} j^{−b−2} [Γ(j+1)/Γ(j+1−λ_mμ)]² (D^{i−j} ⊗ D^{i−j}) ] vec(T^{−1} E_kk T^{−T})   (322)

First, we will examine the following sum:

(1/i^{2λ_mμ}) Σ_{j=q}^{i} j^{−b−2} [Γ(j+1)/Γ(j+1−λ_mμ)]² (D^{i−j} ⊗ D^{i−j})   (323)

Using (177), we have

D^{i−j} ⊗ D^{i−j} = diag{1, D_N^{i−j}} ⊗ diag{1, D_N^{i−j}} = diag{ 1, D_N^{i−j}, D_N^{i−j}, D_N^{i−j} ⊗ D_N^{i−j} }   (324)

Substituting (324) into (323), we obtain:

(1/i^{2λ_mμ}) Σ_{j=q}^{i} j^{−b−2} [Γ(j+1)/Γ(j+1−λ_mμ)]² (D^{i−j} ⊗ D^{i−j}) = diag{ τ(i), 𝒜₁(i), 𝒜₁(i), 𝒜₂(i) }   (325)

where

τ(i) ≜ (1/i^{2λ_mμ}) Σ_{j=q}^{i} j^{−b−2} [Γ(j+1)/Γ(j+1−λ_mμ)]²   (326)
𝒜₁(i) ≜ (1/i^{2λ_mμ}) Σ_{j=q}^{i} j^{−b−2} [Γ(j+1)/Γ(j+1−λ_mμ)]² D_N^{i−j}   (327)
𝒜₂(i) ≜ (1/i^{2λ_mμ}) Σ_{j=q}^{i} j^{−b−2} [Γ(j+1)/Γ(j+1−λ_mμ)]² D_N^{i−j} ⊗ D_N^{i−j}   (328)

From Lemma 9 (applied with b+2 in place of b), we conclude that τ(i) converges asymptotically according to (268):

τ(i) ≃ α_{m,b}(i)   (329)

where α_{m,b}(i) was defined in (303). We now examine the convergence rate of the other sub-matrices in (325). The second and third sub-matrices can be shown to converge to zero at a faster rate via:

‖𝒜₁(i)‖ (a)≤ (c/i^{2λ_mμ}) Σ_{j=q}^{i} j^{−b−2} [Γ(j+1)/Γ(j+1−λ_mμ)]² [ (ρ(D_N)+1)/2 ]^{i−j}   (330)

where step (a) is due to Lemma 1. In order to evaluate the rate of decay of the above term, we appeal to Lemma 10 by making the identification ρ → (ρ(D_N)+1)/2, where ρ < 1 since the matrix A is primitive. We conclude that the matrix 𝒜₁(i) converges to the zero matrix at a faster rate than τ(i); i.e.,

‖𝒜₁(i)‖ = Θ(1/i^{2+b})   (331)

In a similar manner, we observe that

‖𝒜₂(i)‖ ≤ (c²/i^{2λ_mμ}) Σ_{j=q}^{i} j^{−b−2} [Γ(j+1)/Γ(j+1−λ_mμ)]² [ ((ρ(D_N)+1)/2)² ]^{i−j}   (332)

Using Lemma 10 with the following identification:

ρ → [ (ρ(D_N)+1)/2 ]²   (333)

we also conclude that ‖𝒜₂(i)‖ = Θ(1/i^{2+b}). Since the matrices 𝒜₁(i) and 𝒜₂(i) in (325) converge to zero at a relatively fast rate, we can now deduce the following asymptotic relationship for (325):

(1/i^{2λ_mμ}) Σ_{j=q}^{i} j^{−b−2} [Γ(j+1)/Γ(j+1−λ_mμ)]² (D^{i−j} ⊗ D^{i−j}) ≃ α_{m,b}(i) (E₁ ⊗ E₁)   (334)

where α_{m,b}(i) was defined in (303) and E₁ is an N×N matrix with a single one at the (1,1) entry and all other elements equal to zero. Now, substituting (334) back into (322) and using the identity Tr(A^T B C D^T) = vec(A)^T (D ⊗ B) vec(C) again, we get

C(i) ≃ (μ²/2) Σ_{m=1}^{M} λ_m α_{m,b}(i) Tr( ℒ_m T E₁ T^{−1} E_kk T^{−T} E₁ T^T )   (335)

We observe that T E₁ T^{−1} is a rank-one matrix that is spanned by the right- and left-eigenvectors of A corresponding to the eigenvalue at one. The left eigenvector is 1_N since A is left-stochastic. The right eigenvector is the Perron vector p, which is normalized so that the sum of its entries is equal to one, i.e., p^T 1_N = 1 and Ap = p. Then, T E₁ T^{−1} = p 1_N^T. Substituting into (335), and using 1_N^T E_kk 1_N = 1, we arrive at the desired expression (302).

REFERENCES

[1] Z. J. Towfic, J. Chen, and A. H. Sayed, "On the generalization ability of distributed online learners," in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Santander, Spain, Sep. 2012, pp. 1-6.
[2] Z. J. Towfic, J. Chen, and A. H. Sayed, "Distributed inference over regression and classification models," in Proc. IEEE ICASSP, Vancouver, BC, May 2013.
[3] M. Zinkevich, M. Weimer, A. Smola, and L. Li, "Parallelized stochastic gradient descent," in Proc. Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, Dec. 2010.
[4] F. Yan, S. Sundaram, S. V. N. Vishwanathan, and Y. Qi, "Distributed autonomous online learning: Regrets and intrinsic privacy-preserving properties," IEEE Transactions on Knowledge and Data Engineering, vol. 25, Nov. 2013.
[5] S. S. Ram, A. Nedic, and V. V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," Journal of Optimization Theory and Applications, vol. 147, no. 3, Jul. 2010.
[6] S. Barbarossa, S. Sardellitti, and P. Di Lorenzo, "Distributed detection and estimation in wireless sensor networks," in Academic Press Library in Signal Processing, R. Chellapa and S. Theodoridis, Eds., Elsevier, 2014.
[7] S. Theodoridis, K. Slavakis, and I. Yamada, "Adaptive learning in a world of projections," IEEE Signal Processing Magazine, vol. 28, no. 1, Jan. 2011.
[8] M. Zargham, A. Ribeiro, A. Ozdaglar, and A. Jadbabaie, "Accelerated dual descent for network optimization," in Proc. American Control Conference (ACC), San Francisco, CA, Jun. 2011.
[9] M. E. Yildiz, F. Ciaramello, and A. Scaglione, "Distributed distance estimation for manifold learning and dimensionality reduction," in Proc. IEEE ICASSP, Taipei, Taiwan, Apr. 2009.
[10] O. Shamir and N. Srebro, "Distributed stochastic optimization and learning," in Proc. 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton, IL, 2014.
[11] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, "Consensus in ad hoc WSNs with noisy links - Part I: Distributed estimation of deterministic signals," IEEE Transactions on Signal Processing, vol. 56, no. 1, Jan. 2008.
[12] T. C. Aysal, M. E. Yildiz, A. D. Sarwate, and A. Scaglione, "Broadcast gossip algorithms for consensus," IEEE Transactions on Signal Processing, vol. 57, no. 7, Jul. 2009.
[13] G. Morral, P. Bianchi, and G. Fort, "Success and failure of adaptation-diffusion algorithms for consensus in multi-agent networks," in Proc. 53rd IEEE Conference on Decision and Control (CDC), Los Angeles, CA, Dec. 2014.
[14] J. Chen and A. H. Sayed, "On the learning behavior of adaptive networks - Part I: Transient analysis," IEEE Transactions on Information Theory, vol. 61, no. 6, Jun. 2015.
[15] J. Chen and A. H. Sayed, "On the learning behavior of adaptive networks - Part II: Performance analysis," IEEE Transactions on Information Theory, vol. 61, no. 6, Jun. 2015.
[16] P. Bianchi, G. Fort, and W. Hachem, "Performance of a distributed stochastic approximation algorithm," IEEE Transactions on Information Theory, vol. 59, Nov. 2013.
[17] S. Kar and J. M. F. Moura, "Sensor networks with random links: Topology design for distributed consensus," IEEE Transactions on Signal Processing, vol. 56, no. 7, Jul. 2008.
[18] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48-61, Jan. 2009.
[19] S. Kar, J. M. F. Moura, and K. Ramanan, "Distributed parameter estimation in sensor networks: Nonlinear observation models and imperfect communication," IEEE Transactions on Information Theory, vol. 58, no. 6, Jun. 2012.
[20] S.-Y. Tu and A. H. Sayed, "Diffusion strategies outperform consensus strategies for distributed estimation over adaptive networks," IEEE Transactions on Signal Processing, vol. 60, no. 12, Dec. 2012.
[21] A. Agarwal and J. Duchi, "Distributed delayed stochastic optimization," in Proc. Neural Information Processing Systems (NIPS), Granada, Spain, Dec. 2011.
[22] F. Bach and E. Moulines, "Non-asymptotic analysis of stochastic approximation algorithms for machine learning," in Proc. Neural Information Processing Systems (NIPS), Granada, Spain, Dec. 2011.
[23] Z. J. Towfic and A. H. Sayed, "Stability and performance limits of adaptive primal-dual networks," IEEE Transactions on Signal Processing, vol. 63, Jun. 2015.
[24] H. Ouyang, N. He, L. Tran, and A. Gray, "Stochastic alternating direction method of multipliers," in Proc. 30th International Conference on Machine Learning (ICML), Atlanta, GA, Jun. 2013.
[25] S. P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, Jul. 2011.
[26] I. D. Schizas, G. Mateos, and G. B. Giannakis, "Distributed LMS for consensus-based in-network adaptive processing," IEEE Transactions on Signal Processing, vol. 57, no. 6, Jun. 2009.
[27] A. J. Laub, Matrix Analysis for Scientists and Engineers, Society for Industrial and Applied Mathematics (SIAM), PA, 2005.
[28] B. Widrow and M. E. Hoff, "Adaptive switching circuits," in IRE Western Electronic Show and Convention Record, Los Angeles, CA, Aug. 1960, vol. 4, pp. 96-104.
[29] L. Bottou, "Stochastic learning," in Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelligence, LNAI 3176, Springer, Berlin, Germany, 2004.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, NY, 1995.
[31] B. T. Polyak, Introduction to Optimization, Optimization Software, NY, 1987.
[32] A. H. Sayed, "Adaptation, learning, and optimization over networks," Foundations and Trends in Machine Learning, vol. 7, no. 4-5, pp. 311-801, Jul. 2014.
[33] J. Tsitsiklis, D. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803-812, Sep. 1986.
9, pp , Sep [34] A. H. Sayed, Adaptive networks, Proceedings of the IEEE, vol. 0, no. 4, pp , Apr. 04.

[35] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, "Robust stochastic approximation approach to stochastic programming," SIAM Journal on Optimization, vol. 19, no. 4, pp. 1574-1609, Jan. 2009.
[36] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright, "Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization," IEEE Transactions on Information Theory, vol. 58, no. 5, May 2012.
[37] X. Zhao, S.-Y. Tu, and A. H. Sayed, "Diffusion adaptation over networks under imperfect information exchange and non-stationary data," IEEE Transactions on Signal Processing, vol. 60, no. 7, Jul. 2012.
[38] H. Jeffreys and B. S. Jeffreys, Methods of Mathematical Physics, Cambridge University Press, 3rd edition, 1946.
[39] J. M. Howie, Real Analysis, Springer-Verlag, London, 2001.
[40] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, NY, 1990.
[41] X. Zhao and A. H. Sayed, "Performance limits of distributed estimation over LMS adaptive networks," IEEE Transactions on Signal Processing, vol. 60, no. 10, Oct. 2012.
[42] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, and A. H. Teller, "Equation of state calculations by fast computing machines," J. Chem. Phys., vol. 21, 1953.
[43] W. K. Hastings, "Monte Carlo sampling methods using Markov chains and their applications," Biometrika, vol. 57, no. 1, pp. 97-109, Apr. 1970.
[44] S. P. Boyd, P. Diaconis, and L. Xiao, "Fastest mixing Markov chain on a graph," SIAM Review, vol. 46, no. 4, Dec. 2004.
[45] B. T. Polyak and A. B. Juditsky, "Acceleration of stochastic approximation by averaging," SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838-855, 1992.
[46] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, "Stochastic convex optimization," in Proc. Conference on Learning Theory (COLT), Montreal, QC, Jun. 2009.
[47] H. J. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35, Springer-Verlag, NY, 2003.
[48] L. Bottou, "Stochastic gradient descent tricks," in Neural Networks: Tricks of the Trade, Springer, 2012.
[49] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, MA, 4th edition, 2008.
[50] J. Zhu and E. P. Xing, "Maximum entropy discrimination Markov networks," The Journal of Machine Learning Research, vol. 10, Nov. 2009.
[51] G. Lebanon and J. Lafferty, "Boosting and maximum likelihood for exponential models," in Proc. Neural Information Processing Systems (NIPS), Vancouver, British Columbia, Canada, Dec. 2001.
[52] S. P. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, NY, 2004.
[53] X. Zhao and A. H. Sayed, "Asynchronous adaptation and learning over networks - Part I: Modeling and stability analysis," IEEE Transactions on Signal Processing, vol. 63, no. 4, pp. 811-826, Feb. 2015.
[54] I. S. Gradshteyn and I. M. Ryzhik, Table of Integrals, Series, and Products, Academic Press, NY, 2007.
[55] J. Saniuk and I. Rhodes, "A matrix inequality associated with bounds on solutions of algebraic Riccati and Lyapunov equations," IEEE Transactions on Automatic Control, vol. 32, no. 8, Aug. 1987.
[56] D. Kleinman and M. Athans, "The design of suboptimal linear time-varying systems," IEEE Transactions on Automatic Control, vol. 13, no. 2, Apr. 1968.
[57] Y. Fang, K. A. Loparo, and X. Feng, "Inequalities for the trace of matrix product," IEEE Transactions on Automatic Control, vol. 39, Dec. 1994.
[58] F. G. Tricomi and A. Erdélyi, "The asymptotic expansion of a ratio of gamma functions," Pacific Journal of Mathematics, vol. 1, pp. 133-142, Mar. 1951.
Zaid J. Towfic (S'06-M'15) received his B.S. degrees in electrical engineering, computer science, and mathematics from the University of Iowa in 2007, and his M.S. and Ph.D. degrees in electrical engineering from the University of California, Los Angeles (UCLA), in 2009 and 2014, respectively. Between 2008 and 2014, he was a member of the Adaptive Systems Laboratory at UCLA. From 2007 to 2008, he worked at the Rockwell Collins Advanced Technology Center (Cedar Rapids, IA) on MIMO, embedded sensing, and digital modulation classification. He is currently a Technical Staff member at MIT Lincoln Laboratory (Lexington, MA). His research interests include machine learning, stochastic optimization, signal processing, adaptive systems, and information theory.

Jianshu Chen (S'10-M'15) received his B.S. and M.S. degrees in electronic engineering from Harbin Institute of Technology in 2005 and from Tsinghua University in 2009, respectively. He received his Ph.D. degree in electrical engineering from the University of California, Los Angeles (UCLA), in 2014. Between 2009 and 2014, he was a member of the Adaptive Systems Laboratory at UCLA, and between 2014 and 2015 he was a postdoctoral researcher at Microsoft Research, Redmond, WA, where he is currently a researcher. His research interests include statistical signal processing, stochastic approximation, distributed optimization, machine learning, reinforcement learning, and smart grids.

Ali H. Sayed (S'90-M'92-SM'99-F'01) received the M.S. degree in electrical engineering from the University of Sao Paulo, Brazil, in 1989, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, in 1992. He is professor and former chairman of electrical engineering at the University of California, Los Angeles, USA, where he directs the UCLA Adaptive Systems Laboratory. An author of more than 480 scholarly publications and six books, his research involves several areas including adaptation and learning, statistical signal processing, distributed processing, network science, and biologically inspired designs. Dr. Sayed has received several awards, including the 2015 Education Award from the IEEE Signal Processing Society, the 2014 Athanasios Papoulis Award from the European Association for Signal Processing, the 2013 Meritorious Service Award and the 2012 Technical Achievement Award from the IEEE Signal Processing Society, the 2005 Terman Award from the American Society for Engineering Education, the 2003 Kuwait Prize, and the 1996 IEEE Donald G. Fink Prize. He served as Distinguished Lecturer for the IEEE Signal Processing Society in 2005 and as Editor-in-Chief of the IEEE TRANSACTIONS ON SIGNAL PROCESSING (2003-2005). His articles received several Best Paper Awards from the IEEE Signal Processing Society (2002, 2005, 2012, 2014). He is a Fellow of the American Association for the Advancement of Science (AAAS) and is recognized as a Highly Cited Researcher by Thomson Reuters.
