Asynchronous Gossip Algorithms for Stochastic Optimization

S. Sundhar Ram, ECE Dept., University of Illinois, Urbana, IL 61801, ssrini@illinois.edu
A. Nedić, IESE Dept., University of Illinois, Urbana, IL 61801, angelia@illinois.edu
V. V. Veeravalli, ECE Dept., University of Illinois, Urbana, IL 61801, vvv@illinois.edu

Abstract—We consider a distributed multi-agent network system where the goal is to minimize an objective function that can be written as the sum of component functions, each of which is known partially (with stochastic errors) to a specific network agent. We propose an asynchronous algorithm that is motivated by random gossip schemes, where each agent has a local Poisson clock. At each tick of its local clock, the agent averages its estimate with a randomly chosen neighbor and adjusts the average using the gradient of its local function, which is computed with stochastic errors. We investigate the convergence properties of the algorithm for two different classes of functions. First, we consider differentiable, but not necessarily convex, functions, and prove that the gradients converge to zero with probability 1. Then, we consider convex, but not necessarily differentiable, functions, and show that the iterates converge to an optimal solution almost surely.

I. INTRODUCTION

The problem of minimizing a sum of functions, when each component function is available partially (with stochastic errors) to a specific network agent, is an important problem in the context of wired and wireless networks [ ], [4], [7], [8]. These problems require the design of optimization algorithms that are distributed, i.e., without a central coordinator, and local, in the sense that each agent can only use its local objective function and can exchange some limited information with its immediate neighbors. In this paper, we propose an asynchronous distributed algorithm that is inspired by the random gossip averaging scheme of [7]. Each agent has a local Poisson clock and maintains an iterate sequence.
At each tick of its local clock, the agent first randomly selects a neighbor and computes the average of its current iterate and the iterate received from the selected neighbor. Then, the agent adjusts the computed average using the gradient of its local function, which is known only with stochastic errors. We investigate the convergence properties of the algorithm under two different assumptions on the objective functions: (a) differentiable but not necessarily convex, and (b) convex but not necessarily differentiable. (This research is supported by a Vodafone Graduate Fellowship, and by NSF Awards CNS-08670 and CMMI-0748.)

The algorithm in this paper is related to the distributed consensus-based optimization algorithm proposed in [ ] and further studied in [4], [6], [8], [ ], [ ], [7], [8]. In consensus-based algorithms, each agent maintains an iterate sequence and updates it using its local function gradient information. These algorithms are synchronous and require the agents to update simultaneously, in contrast with the asynchronous algorithm proposed in this paper. A different distributed model has been proposed in [ ] and also studied in [ ], [ ], [ ], where the complete objective function information is available to each agent, with the aim of distributing the processing by allowing an agent to update only a part of the decision vector. Also related to the algorithm of this paper is the literature on incremental algorithms [4], [ ], [4], [ ], [7], [9], [0], [4], [6], [7], [0], where the network agents sequentially update a single iterate sequence, and only one agent updates at any given time, in a cyclic or a random order. While being local, the incremental algorithms differ fundamentally from the algorithm studied in this paper, where all agents maintain and update their own iterate sequences. In addition, the work in this paper is related to a much broader class of gossip algorithms used for averaging [ ], [8].
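The gossip update just described can be sketched in a few lines of Python. This is an illustrative toy, not code from the paper: the path graph, the quadratic local functions f_i(x) = (x − t_i)²/2 (whose gradient is x − t_i), the target values t_i, and the Gaussian error level are all our own assumptions.

```python
import random

# Toy sketch of the asynchronous gossip update (hypothetical setup, not from the paper):
# agent i holds f_i(x) = (x - t_i)^2 / 2, so grad f_i(x) = x - t_i, and the
# network optimum of sum_i f_i is mean(t_i).
random.seed(0)
targets = [1.0, 2.0, 3.0, 4.0]                        # t_i, hypothetical local data
m = len(targets)
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}    # a connected path graph
x = [0.0] * m                                         # agent iterates
updates = [0] * m                                     # Gamma_i: local update counts

for _ in range(200000):
    i = random.randrange(m)              # agent whose local Poisson clock ticks
    j = random.choice(neighbors[i])      # uniformly chosen neighbor
    avg = (x[i] + x[j]) / 2.0            # gossip averaging step
    for a in (i, j):                     # both endpoints then take a gradient step
        updates[a] += 1
        noise = random.gauss(0.0, 0.1)   # stochastic gradient error
        # local stepsize 1/Gamma_a based only on the agent's own update count
        x[a] = avg - (avg - targets[a] + noise) / updates[a]

print([round(v, 2) for v in x])          # all iterates should be near mean(targets) = 2.5
```

With quadratic local functions, the minimizer of the sum is the mean of the t_i, so all four iterates should cluster near 2.5; the stepsize 1/Γ_i is exactly the update-count-based, uncoordinated stepsize that the paper analyzes.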
Since we are interested in the effect of stochastic errors, our work is also related to the stochastic subgradient methods [ ], [0], [ ]. The novelty of our work is in several directions. First, our gossip-based asynchronous algorithms allow the agents to use a stepsize based on the number of their local updates; thus, the stepsize is not coordinated among the agents. Second, we study the convergence of the algorithm when the functions are non-convex, which is unlike the recent trend in distributed network optimization, where typically convex functions are considered (see, e.g., [6], [8], [ ], [ ], [ ], [7], [8]). (There are papers that discuss the convergence of cyclic incremental algorithms when the functions are non-convex, e.g., [9]. However, cyclic incremental algorithms are not distributed, since the agents have to be organized in a cycle by a central coordinator.) Third, we deal with the general case where the agents compute their subgradients with stochastic errors. Due to agent information exchange, the stochastic errors propagate across agents and time, which, together with the stochastic nature of the agent stepsizes, highly complicates the convergence analysis. Our analysis combines the ideas used to study the basic gossip-averaging algorithm [7] with the tools that are generally used to study the convergence of stochastic gradient schemes.

The rest of the paper is organized in the following manner. In the next section, we describe the problem of our interest,
present our algorithm and assumptions. In Section III, among other preliminaries, we investigate the asymptotic properties of the agent disagreements. In Section IV, the convergence properties of the algorithm are studied. We conclude with a discussion in Section V.

II. PROBLEM, ALGORITHM AND ASSUMPTIONS

We consider a network of m agents that are indexed by 1, ..., m; when convenient, we will use V = {1, ..., m}. The network has a static topology that is represented by the bidirectional graph (V, E), where E is the set of links in the network. We have {i, j} ∈ E if agent i and agent j can communicate with each other. We assume that the network [i.e., the graph (V, E)] is connected. The network objective is to solve the following optimization problem:

  minimize f(x) := Σ_{i=1}^m f_i(x)  subject to x ∈ ℝ,   (1)

where f_i : ℝ → ℝ for all i. (By componentwise application, our results and proofs can be extended to the case when x is a finite-dimensional vector.) The function f_i is known only to agent i, which can compute the gradient ∇f_i(x) with stochastic errors. (See [6] for the motivation for studying stochastic errors.) The goal is to solve problem (1) using an algorithm that is distributed and local.

A. Asynchronous Gossip Optimization Algorithm

Let N(i) be the set of neighbors of agent i, i.e., N(i) = {j ∈ V : {i, j} ∈ E}. Each agent has a local clock that ticks at a Poisson rate of 1. (The model and the analysis can be easily extended to the case when the clocks have different rates.) At each tick of its clock, agent i averages its iterate with that of a randomly selected neighbor j ∈ N(i), where each neighbor has an equal chance of being selected. Agents i and j then adjust their averages along the negative direction of ∇f_i and ∇f_j, respectively, which are computed with stochastic errors.

As in [7], we will find it easier to study the gossip algorithms in terms of a single virtual clock that ticks whenever any of the local Poisson clocks ticks. Thus, the virtual clock ticks according to a Poisson process with rate m. Let Z_k denote the k-th tick of the virtual clock, and let I_k denote the index of the agent whose local clock actually ticked at that instant. The fact that the Poisson clocks at the agents are independent implies that I_k is uniformly distributed on the set V. In addition, the memoryless property of the Poisson arrival process ensures that the process {I_k} is i.i.d. Let J_k denote the random index of the agent communicating with agent I_k. Observe that J_k, conditioned on I_k, is uniformly distributed on the set N(I_k). Let x_{i,k} denote agent i's iterate at the time immediately before Z_k. The iterates evolve according to

  x_{i,k+1} = x_{I_k J_k, k} − (1/Γ_k(i)) (∇f_i(x_{I_k J_k, k}) + ε_{i,k})   if i ∈ {I_k, J_k},
  x_{i,k+1} = x_{i,k}   otherwise,   (2)

where x_{i,0}, i ∈ V, are the initial iterates of the agents, x_{I_k J_k, k} = (x_{I_k, k} + x_{J_k, k})/2, ∇f_i(x) denotes the gradient of f_i at x, ε_{i,k} is the stochastic error, and Γ_k(i) denotes the total number of updates made by agent i up to the time Z_k. (If the function f_i is not differentiable but convex, then ∇f_i(x) denotes a subgradient. We will discuss this later.)

B. Assumptions

We make the following assumption on the functions.

Assumption 1: The gradients are uniformly bounded, i.e., sup_{x∈ℝ} |∇f_i(x)| ≤ C for some C > 0 and for all i ∈ V.

In addition to this, we will use two complementary sets of assumptions on the functions f_i, as discussed later. Let F_k be the σ-algebra generated by the entire history of the algorithm up to time Z_k, i.e., F_k = σ{I_l, J_l, ε_{I_l,l}, ε_{J_l,l}; 0 ≤ l ≤ k}. We make the following assumptions on the stochastic errors.

Assumption 2: With probability 1, we have:
(a) E[ε_{i,k}² | F_{k−1}] ≤ ν² for all k and i ∈ V, and some ν > 0.
(b) E[ε_{I_k,k} | F_{k−1}, I_k, J_k] = 0 and E[ε_{J_k,k} | F_{k−1}, I_k, J_k] = 0.

The assumption is satisfied, for example, when the errors are zero mean, independent across time, and have bounded second moments.

III. PRELIMINARIES

All vectors are column vectors, x_i denotes the i-th component of a vector x, and ‖x‖ denotes the Euclidean norm of a vector x. We use 1 to denote a vector with all components equal to 1. In our analysis, we frequently invoke the following result due to Robbins and Siegmund (see the Lemma in Chapter 2.2 of [ ]).

Lemma 1: Let (Ω, F, P) be a probability space and let F_0 ⊂ F_1 ⊂ ... be a sequence of sub-σ-fields of F. Let {u_k}, {v_k}, {q_k} and {w_k} be F_k-measurable random variables, where {u_k} is uniformly bounded below, and {v_k}, {q_k} and {w_k} are non-negative. Let Σ_{k=0}^∞ w_k < ∞ and Σ_{k=0}^∞ q_k < ∞, and let

  E[u_{k+1} | F_k] ≤ (1 + q_k) u_k − v_k + w_k

hold with probability 1. Then, with probability 1, the sequence {u_k} converges and Σ_{k=0}^∞ v_k < ∞.

A. Relative Frequency of Agent Updates

We characterize the number Γ_k(i) of times agent i updates its iterate until time Z_k inclusively [see Eq. (2)]. Define the event E_{i,k} = {I_k = i} ∪ {J_k = i}. This is essentially the event that agent i updates its iterate at time Z_k. It is easy to see that the {E_{i,k}} are independent events with the same time-invariant probability distribution. Define γ_i to be the probability of the event E_{i,k}. Since I_k is uniformly distributed on the set V and J_k, conditioned on I_k = j, is uniformly distributed on the set N(j), it follows that

  γ_i = (1/m) (1 + Σ_{j ∈ N(i)} 1/|N(j)|).
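As a quick sanity check on this update-probability formula, the following sketch (the graph and the trial count are hypothetical choices, not from the paper) simulates the (I_k, J_k) pair selection and compares the empirical update frequencies with γ_i = (1/m)(1 + Σ_{j∈N(i)} 1/|N(j)|):

```python
import random

# Monte Carlo check (illustrative, not from the paper) of
# gamma_i = (1/m) * (1 + sum_{j in N(i)} 1/|N(j)|) on a small connected graph.
random.seed(1)
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}  # bidirectional links
m = len(neighbors)

predicted = {i: (1.0 + sum(1.0 / len(neighbors[j]) for j in neighbors[i])) / m
             for i in neighbors}

trials = 500000
hits = {i: 0 for i in neighbors}
for _ in range(trials):
    i = random.randrange(m)               # I_k: uniform over the agents
    j = random.choice(neighbors[i])       # J_k: uniform over N(I_k)
    hits[i] += 1                          # event E_{i,k} = {I_k = i} or {J_k = i}
    hits[j] += 1

for i in neighbors:
    print(i, round(predicted[i], 4), round(hits[i] / trials, 4))
```

Note that Σ_i γ_i = 2, since exactly two agents update at every tick of the virtual clock.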
Define χ_A to be the indicator function of an event A, and note that Γ_k(i) = Σ_{l=1}^k χ_{E_{i,l}}. Since the events {E_{i,l}} are i.i.d., from the law of the iterated logarithm [9], we can conclude that for any p, q > 0, with probability 1,

  lim sup_{k→∞} |Γ_k(i) − γ_i k| / k^{(1/2)+q} ≤ p   for all i ∈ V.   (3)

We can therefore conclude that, with probability 1, for all i ∈ V and all sufficiently large k,

  Γ_k(i) ≥ γ_i k / 2   and   |1/Γ_k(i) − 1/(γ_i k)| ≤ p / k^{(3/2)−q}.   (4)

B. Alternative Representation of the Algorithm

We next give the algorithm in a form more convenient for our analysis. Let e_i denote the unit vector with only its i-th component non-zero. Define

  W_k = I − (1/2)(e_{I_k} − e_{J_k})(e_{I_k} − e_{J_k})^T.

Since {I_k} and {J_k} are i.i.d. sequences, {W_k} is also an i.i.d. sequence. Define W̄ = E[W_k]. Since each W_k is symmetric and doubly stochastic with probability 1, W̄ is also symmetric and doubly stochastic. Further, the maximum eigenvalue of W̄ is 1, and it is not a repeated eigenvalue when the network is connected. (In this case, W̄ is a stochastic irreducible matrix and λ_1 = 1 is its largest real eigenvalue with a unique right eigenvector 1; see, e.g., [ ].) We also have E[W_k^T W_k] = W̄ (see [7]). Let x_k be the vector with components x_{i,k}, i = 1, ..., m. Then, from the definition of the method in (2), we have

  x_{k+1} = W_k x_k − p_k   for k ≥ 1,   (5)

where

  p_k = Σ_{i ∈ {I_k, J_k}} (1/Γ_k(i)) (∇f_i(x_{I_k J_k, k}) + ε_{i,k}) e_i

and x_{I_k J_k, k} = (x_{I_k, k} + x_{J_k, k})/2. Define y_k = (1/m) 1^T x_k. We then have

  y_{k+1} = (1/m) 1^T x_{k+1} = (1/m) 1^T W_k x_k − (1/m) 1^T p_k.

By the double stochasticity of W_k, with probability 1, it follows that

  y_{k+1} = y_k − (1/m) 1^T p_k.   (6)

C. Agent Consensus

We use ‖x_k − y_k 1‖ to quantify the disagreement between the agents, and we show that the disagreements converge to 0.

Lemma 2: Let Assumptions 1 and 2(a) hold. Then, with probability 1, we have

  lim_{k→∞} ‖x_k − y_k 1‖ = 0   and   Σ_{k=1}^∞ (1/k) ‖x_k − y_k 1‖ < ∞.

Proof: From (5) and (6) it follows that

  E[‖x_{k+1} − y_{k+1} 1‖ | F_{k−1}]
   = E[‖W_k x_k − p_k − (y_k − (1/m) 1^T p_k) 1‖ | F_{k−1}]
   ≤ E[‖W_k (x_k − y_k 1)‖ | F_{k−1}] + 2 E[‖p_k‖ | F_{k−1}],   (7)

where the inequality follows from the triangle inequality for norms and the double stochasticity of W_k.
The first term can be estimated using the relation E[W_k^T W_k] = E[W_k] = W̄ (implying that W̄ is positive semi-definite) as follows:

  E[‖W_k (x_k − y_k 1)‖² | F_{k−1}] = (x_k − y_k 1)^T E[W_k^T W_k] (x_k − y_k 1)
   = (x_k − y_k 1)^T W̄ (x_k − y_k 1)
   = Σ_{i=1}^m λ_i (v_i^T (x_k − y_k 1))²,

where λ_i is the i-th largest eigenvalue of W̄ and v_i is the corresponding eigenvector. The last step follows from the eigenvector decomposition of the symmetric positive semi-definite matrix W̄. Recall that λ_1 = 1 is the largest eigenvalue of W̄ and the corresponding eigenvector is 1, which is orthogonal to x_k − y_k 1. Hence,

  E[‖W_k (x_k − y_k 1)‖² | F_{k−1}] ≤ λ_2 ‖x_k − y_k 1‖².   (8)

We next estimate the second term in (7). Using (4) and the boundedness of the gradients (Assumption 1), we can conclude that for sufficiently large k,

  E[‖p_k‖² | F_{k−1}] = E[ Σ_{i ∈ {I_k, J_k}} (1/Γ_k(i))² (∇f_i(x_{I_k J_k, k}) + ε_{i,k})² | F_{k−1} ]
   ≤ (4/(γ² k²)) E[ Σ_{i ∈ {I_k, J_k}} 2 ((∇f_i(x_{I_k J_k, k}))² + ε_{i,k}²) | F_{k−1} ]
   ≤ 16 (C² + ν²) / (γ² k²),   (9)

where γ = min_i γ_i. From (7), (8), (9), and Jensen's inequality, we see that

  E[‖x_{k+1} − y_{k+1} 1‖] ≤ √λ_2 E[‖x_k − y_k 1‖] + 8 √(C² + ν²) / (γ k),

where √λ_2 < 1. This recursion implies E[‖x_k − y_k 1‖] = O(1/k), so that Σ_k (1/k) E[‖x_k − y_k 1‖] < ∞, which implies Σ_k (1/k) ‖x_k − y_k 1‖ < ∞ with probability 1.

We next prove the first part of the statement. As a consequence of the preceding result, it follows that lim inf_{k→∞} ‖x_k − y_k 1‖ = 0. We only need to prove almost sure convergence of ‖x_k − y_k 1‖ to complete the proof. From the definitions of x_{k+1} and y_{k+1} in (5) and (6), we obtain

  E[‖x_{k+1} − y_{k+1} 1‖² | F_{k−1}]
   = E[‖W_k (x_k − y_k 1) − (p_k − (1/m)(1^T p_k) 1)‖² | F_{k−1}]
   ≤ E[‖W_k (x_k − y_k 1)‖² | F_{k−1}]
    + 2 √(E[‖W_k (x_k − y_k 1)‖² | F_{k−1}]) √(E[‖p_k − (1/m)(1^T p_k) 1‖² | F_{k−1}])
    + E[‖p_k − (1/m)(1^T p_k) 1‖² | F_{k−1}],   (10)

where in the last step we use the Cauchy–Schwarz inequality. We next estimate the last term in (10), as follows:

  E[‖p_k − (1/m)(1^T p_k) 1‖² | F_{k−1}] ≤ 2 E[‖p_k‖² | F_{k−1}] + 2 E[‖(1/m)(1^T p_k) 1‖² | F_{k−1}] ≤ 4 E[‖p_k‖² | F_{k−1}].   (11)

In the last step we use the fact that only two components of p_k are non-zero, so that (1^T p_k)² ≤ 2‖p_k‖² and hence ‖(1/m)(1^T p_k) 1‖² = (1/m)(1^T p_k)² ≤ ‖p_k‖². Using this in (10), substituting from (8) and (9), and taking into account λ_2 < 1, we obtain

  E[‖x_{k+1} − y_{k+1} 1‖² | F_{k−1}] ≤ ‖x_k − y_k 1‖² + (16 √(C² + ν²) / (γ k)) ‖x_k − y_k 1‖ + 64 (C² + ν²) / (γ² k²).

As shown earlier, we have Σ_k (1/k) ‖x_k − y_k 1‖ < ∞ with probability 1. We can therefore invoke Lemma 1 to conclude that ‖x_k − y_k 1‖ converges with probability 1. Since lim inf_{k→∞} ‖x_k − y_k 1‖ = 0, the limit is 0, which completes the proof.

IV. CONVERGENCE ANALYSIS

We here study the convergence of the algorithm under two different sets of conditions. The first requires the function to be differentiable with Lipschitz continuous gradient, i.e.,

  |∇f(x) − ∇f(y)| ≤ L |x − y|.

A point x* ∈ ℝ is a stationary point of f if ∇f(x*) = 0. A global minimum of f is also a stationary point of f. Typically, when the objective function is non-convex and iterative methods are employed, the iterates may converge only to a stationary point.

Theorem 1: Let Assumptions 1 and 2 hold, and let the function f be bounded below, with the gradients ∇f_i Lipschitz continuous with constant L. Then, with probability 1, we have lim_{k→∞} |x_{i,k} − y_k| = 0 for all i ∈ V, {f(x_{i,k})} converges, and lim inf_{k→∞} |∇f(x_{i,k})| = 0.

Proof: Lemma 2 asserts that lim_{k→∞} |x_{i,k} − y_k| = 0. Next, from the definition of p_k in (5) we obtain

  (1/m) 1^T p_k = Σ_{i ∈ {I_k, J_k}} (1/(m Γ_k(i))) (∇f_i(x_{I_k J_k, k}) + ε_{i,k})
   = (1/(m k)) Σ_{i ∈ {I_k, J_k}} (1/γ_i) ∇f_i(y_k)
    + Σ_{i ∈ {I_k, J_k}} (1/(m Γ_k(i))) ε_{i,k}
    + Σ_{i ∈ {I_k, J_k}} (1/(m Γ_k(i))) (∇f_i(x_{I_k J_k, k}) − ∇f_i(y_k))
    + (1/m) Σ_{i ∈ {I_k, J_k}} (1/Γ_k(i) − 1/(γ_i k)) ∇f_i(y_k).

Since γ_i is the probability that agent i updates at time Z_k, and this event is independent of F_{k−1}, it follows that

  E[ (1/(m k)) Σ_{i ∈ {I_k, J_k}} (1/γ_i) ∇f_i(y_k) | F_{k−1} ] = (1/(m k)) Σ_{i=1}^m ∇f_i(y_k) = (1/(m k)) ∇f(y_k).

Using Assumption 2(b), we can see that the second term has zero conditional expectation. For the third term, the Lipschitz continuity of the gradients gives |∇f_i(x_{I_k J_k, k}) − ∇f_i(y_k)| ≤ L ‖x_k − y_k 1‖, and 1/Γ_k(i) ≤ 2/(γ k) for sufficiently large k by (4); hence its conditional mean is bounded by (4L/(m γ k)) ‖x_k − y_k 1‖, which is summable over k with probability 1 by Lemma 2. For the fourth term, relation (4) and Assumption 1 bound its conditional mean by (2C/m) p / k^{(3/2)−q}, which is summable for q < 1/2. Thus, from (6) we obtain

  y_{k+1} = y_k − (1/(m k)) ∇f(y_k) + a_k,

where Σ_k |E[a_k | F_{k−1}]| < ∞ with probability 1. Additionally, from Assumption 2(a), Lemma 2 (showing that ‖x_k − y_k 1‖ → 0, and hence is bounded) and relation (4), it follows that Σ_k E[a_k² | F_{k−1}] < ∞ with probability 1.
The result now follows from classic stochastic optimization theory (see [ ] or [6]).

Observe that, in view of the Lipschitz continuity of the gradient, the assumption that the gradients are bounded is equivalent to the following standard assumption.

Assumption 3: The sequences {x_{i,k}}, i ∈ V, are bounded with probability 1.

This assumption is implicit and not very easy to establish. We refer the reader to [6] for some discussion of techniques to verify this assumption.

We will next investigate the convergence when the functions are convex, but not necessarily differentiable. At points
where the gradient does not exist, we use the notion of a subgradient. A vector ∇g(x̂) is a subgradient of a function g at a point x̂ ∈ dom g if the following relation holds:

  ∇g(x̂)(y − x̂) ≤ g(y) − g(x̂)   for all y ∈ dom g.   (12)

We next discuss the convergence of the algorithm.

Theorem 2: Let Assumptions 1 and 2 hold. Assume that the set X* = Argmin_{x∈ℝ} f(x) is non-empty and that f_i is convex for each i ∈ V. Then, with probability 1, the sequences {x_{i,k}}, i ∈ V, converge to the same point in X*.

Proof: Let x* be an arbitrary point in X*. Using (6), we obtain

  (y_{k+1} − x*)² = (y_k − (1/m) 1^T p_k − x*)²
   = (y_k − x*)² − (2/m)(1^T p_k)(y_k − x*) + ((1/m) 1^T p_k)².

Writing y_k − x* = (y_k − x_{I_k J_k, k}) + (x_{I_k J_k, k} − x*) and using |y_k − x_{I_k J_k, k}| ≤ Σ_{i=1}^m |y_k − x_{i,k}|, this yields

  (y_{k+1} − x*)² ≤ (y_k − x*)² − (2/m)(1^T p_k)(x_{I_k J_k, k} − x*)
   + (2/m) |1^T p_k| Σ_{i=1}^m |y_k − x_{i,k}| + ((1/m) 1^T p_k)².

From the definition of p_k in (5) and the subgradient inequality in (12), we can write

  (1^T p_k)(x_{I_k J_k, k} − x*)
   ≥ Σ_{i ∈ {I_k, J_k}} (1/Γ_k(i)) (f_i(x_{I_k J_k, k}) − f_i(x*))
    + Σ_{i ∈ {I_k, J_k}} (1/Γ_k(i)) ε_{i,k} (x_{I_k J_k, k} − x*).

Using the subgradient inequality and the subgradient boundedness (Assumption 1) once more, we get

  f_i(x_{I_k J_k, k}) − f_i(x*) ≥ f_i(y_k) − f_i(x*) − C |x_{I_k J_k, k} − y_k| ≥ f_i(y_k) − f_i(x*) − C Σ_{j=1}^m |y_k − x_{j,k}|.

Combining the last three relations, taking conditional expectations, and using Assumption 2(b) (which makes the error term vanish) together with the bound (9) on E[‖p_k‖² | F_{k−1}] and (1^T p_k)² ≤ 2‖p_k‖², we obtain, for sufficiently large k,

  E[(y_{k+1} − x*)² | F_{k−1}] ≤ (y_k − x*)²
   − (2/m) E[ Σ_{i ∈ {I_k, J_k}} (1/Γ_k(i)) (f_i(y_k) − f_i(x*)) | F_{k−1} ]
   + (c₁/k) Σ_{i=1}^m |y_k − x_{i,k}| + c₂/k²,

for some constants c₁, c₂ > 0 depending only on C, ν, m and γ = min_i γ_i. Since γ_i is the probability that agent i updates at time Z_k, we have

  E[ Σ_{i ∈ {I_k, J_k}} (1/(γ_i k)) (f_i(y_k) − f_i(x*)) | F_{k−1} ] = (1/k)(f(y_k) − f(x*)),

so that

  E[ Σ_{i ∈ {I_k, J_k}} (1/Γ_k(i)) (f_i(y_k) − f_i(x*)) | F_{k−1} ]
   ≥ (1/k)(f(y_k) − f(x*)) − E[ Σ_{i ∈ {I_k, J_k}} |f_i(y_k) − f_i(x*)| |1/Γ_k(i) − 1/(γ_i k)| | F_{k−1} ].

Using the subgradient inequality and the inequality 2|a| ≤ 1 + a², we can bound the last term as follows:

  |f_i(y_k) − f_i(x*)| |1/Γ_k(i) − 1/(γ_i k)| ≤ C |y_k − x*| |1/Γ_k(i) − 1/(γ_i k)| ≤ C |1/Γ_k(i) − 1/(γ_i k)| (1 + (y_k − x*)²).
Combining the two preceding relations, we obtain

  E[(y_{k+1} − x*)² | F_{k−1}]
   ≤ (1 + c₃ E[ Σ_{i=1}^m |1/Γ_k(i) − 1/(γ_i k)| | F_{k−1} ]) (y_k − x*)²
    − (2/(m k))(f(y_k) − f(x*))
    + c₃ E[ Σ_{i=1}^m |1/Γ_k(i) − 1/(γ_i k)| | F_{k−1} ]
    + (c₁/k) Σ_{i=1}^m |y_k − x_{i,k}| + c₂/k²,

for some constants c₁, c₂, c₃ > 0 depending only on C, ν, m and min_i γ_i. Using (4) and Lemma 2, we can see that the conditions of Lemma 1 are satisfied. Therefore, {(y_k − x*)²} converges and Σ_k (1/k)(f(y_k) − f(x*)) < ∞ with probability 1, which implies that {y_k} converges to a point in the set X* with probability 1. This, and the fact that lim_{k→∞} |x_{i,k} − y_k| = 0 for all i ∈ V with probability 1 (shown in Lemma 2), imply that the sequences {x_{i,k}} converge to the same point in X* with probability 1.

V. DISCUSSION

Using very similar ideas, the algorithm and the proof of convergence can be extended to the case when x is a finite-dimensional vector. When the problem in (1) is a constrained optimization problem, where x is restricted to a convex and closed set X, the algorithm in (2) can be extended by projecting onto the set X at each iteration. It is easy to obtain a convergence result similar to Theorem 2 for this case using Euclidean projection inequalities. As part of our future work, we plan to investigate optimization algorithms based on different gossip schemes.

REFERENCES

[1] T. Aysal, M. Yildiz, A. Sarwate, and A. Scaglione, Broadcast gossip algorithms: Design and analysis for consensus, Proceedings of the 47th IEEE Conference on Decision and Control, 2008.
[2] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: Numerical methods, Athena Scientific, 1997.
[3] ——, Gradient convergence in gradient methods with errors, SIAM Journal on Optimization 10 (2000), no. 3, 627–642.
[4] D. Blatt, A. O. Hero, and H. Gauchman, A convergent incremental gradient method with constant stepsize, SIAM Journal on Optimization 18 (2007), no. 1, 29–51.
[5] V. Borkar, Asynchronous stochastic approximations, SIAM Journal on Control and Optimization 36 (1998), no. 3, 840–851.
[6] ——, Stochastic approximation: A dynamical viewpoint, Cambridge University Press, 2008.
[7] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, Randomized gossip algorithms, IEEE Transactions on Information Theory 52 (2006), no. 6, 2508–2530.
[8] A. Dimakis, A. Sarwate, and M.
Wainwright, Geographic gossip: Efficient averaging for sensor networks, IEEE Transactions on Signal Processing 56 (2008), no. 3, 1205–1216.
[9] R. Dudley, Real analysis and probability, Cambridge University Press, 2002.
[10] Y. Ermoliev, Stochastic programming methods, Nauka, Moscow, 1976.
[11] ——, Stochastic quasi-gradient methods and their application to system optimization, Stochastics 9 (1983), 1–36.
[12] A. A. Gaivoronski, Convergence properties of backpropagation for neural nets via theory of stochastic gradient methods. Part 1, Optimization Methods and Software 4 (1994), no. 2, 117–134.
[13] R. G. Gallager, Discrete stochastic processes, Kluwer Academic Publishers, Norwell, Massachusetts, USA, 1996.
[14] B. Johansson, On distributed optimization in networked systems, Ph.D. thesis, Royal Institute of Technology, Stockholm, Sweden, 2008.
[15] B. Johansson, M. Rabi, and M. Johansson, A simple peer-to-peer algorithm for distributed optimization in sensor networks, Proceedings of the 46th IEEE Conference on Decision and Control, 2007, pp. 4705–4710.
[16] B. Johansson, T. Keviczky, M. Johansson, and K. Johansson, Subgradient methods and consensus algorithms for solving convex optimization problems, Proceedings of the 47th IEEE Conference on Decision and Control, 2008, pp. 4185–4190.
[17] K. C. Kiwiel, Convergence of approximate and incremental subgradient methods for convex optimization, SIAM Journal on Optimization 14 (2003), no. 3, 807–840.
[18] I. Lobel and A. Ozdaglar, Distributed subgradient methods over random networks, Lab. for Information and Decision Systems, MIT, Report 2800, 2008.
[19] A. Nedić and D. P. Bertsekas, Incremental subgradient methods for nondifferentiable optimization, SIAM Journal on Optimization 12 (2001), 109–138.
[20] ——, The effect of deterministic noise in subgradient methods, Tech. report, Lab. for Information and Decision Systems, MIT, 2007.
[21] A. Nedić, A. Olshevsky, A. Ozdaglar, and J. N. Tsitsiklis, Distributed subgradient algorithms and quantization effects, Proceedings of the 47th IEEE Conference on Decision and Control, 2008.
[22] A. Nedić and A.
Ozdaglar, On the rate of convergence of distributed asynchronous subgradient methods for multi-agent optimization, Proceedings of the 46th IEEE Conference on Decision and Control, 2007, pp. 4711–4716.
[23] B. T. Polyak, Introduction to optimization, Optimization Software Inc., 1987.
[24] M. G. Rabbat and R. D. Nowak, Quantized incremental algorithms for distributed optimization, IEEE Journal on Selected Areas in Communications 23 (2005), no. 4, 798–808.
[25] S. Sundhar Ram, A. Nedić, and V. V. Veeravalli, Distributed stochastic subgradient algorithm for convex optimization, available at http://arxiv.org/abs/08.9, 2008.
[26] S. Sundhar Ram, A. Nedić, and V. V. Veeravalli, Incremental stochastic subgradient algorithms for convex optimization, available at http://arxiv.org/abs/0806.09, 2008.
[27] S. Sundhar Ram, V. V. Veeravalli, and A. Nedić, Sensor networks: When theory meets practice, ch. Distributed and recursive estimation, Springer, 2009.
[28] S. Sundhar Ram, V. V. Veeravalli, and A. Nedić, Distributed and nonautonomous power control through distributed convex optimization, IEEE INFOCOM, 2009.
[29] M. V. Solodov, Incremental gradient algorithms with stepsizes bounded away from zero, Computational Optimization and Applications 11 (1998), no. 1, 23–35.
[30] M. V. Solodov and S. K. Zavriev, Error stability properties of generalized gradient-type algorithms, Journal of Optimization Theory and Applications 98 (1998), no. 3, 663–680.
[31] J. N. Tsitsiklis, Problems in decentralized decision making and computation, Ph.D. thesis, Massachusetts Institute of Technology, 1984.
[32] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, Distributed asynchronous deterministic and stochastic gradient optimization algorithms, IEEE Transactions on Automatic Control 31 (1986), no. 9, 803–812.