Network Newton Distributed Optimization Methods

Aryan Mokhtari, Qing Ling, and Alejandro Ribeiro

Abstract: We study the problem of minimizing a sum of convex objective functions where the components of the objective are available at different nodes of a network and nodes are allowed to communicate only with their neighbors. The use of distributed gradient methods is a common approach to solve this problem. Their popularity notwithstanding, these methods exhibit slow convergence and a consequent large number of communications between nodes to approach the optimal argument because they rely on first order information only. This paper proposes the network Newton (NN) method as a distributed algorithm that incorporates second order information. This is done via distributed implementation of approximations of a suitably chosen Newton step. The approximations are obtained by truncation of the Newton step's Taylor expansion. This leads to a family of methods defined by the number K of Taylor series terms kept in the approximation. When keeping K terms of the Taylor series, the method is called NN-K and can be implemented through the aggregation of information in K-hop neighborhoods. Convergence to a point close to the optimal argument at a rate that is at least linear is proven, and the existence of a tradeoff between convergence time and the distance to the optimal argument is shown. The numerical experiments corroborate reductions in the number of iterations and the communication cost that are necessary to achieve convergence relative to first-order alternatives.

Index Terms: Multi-agent network, distributed optimization, Newton's method.

I. INTRODUCTION

Distributed optimization algorithms are used to solve the problem of minimizing a global cost function over a set of nodes in situations where the objective function is defined as a sum of local functions. To be more precise, consider a variable x ∈ R^p and a connected network containing n agents, each of which has access to a local function f_i : R^p → R. The agents cooperate in minimizing the aggregate cost function f : R^p → R taking values f(x) := Σ_{i=1}^n f_i(x). I.e., agents cooperate in solving the problem

x* := argmin_{x ∈ R^p} f(x) = argmin_{x ∈ R^p} Σ_{i=1}^n f_i(x). (1)

Problems of this form arise often in, e.g., decentralized control systems [3], [4], wireless systems [5], [6], sensor networks [7]-[9], and large scale machine learning [10]-[12]. There are different algorithms to solve (1) in a distributed manner. The most popular choices are decentralized gradient descent (DGD) [13]-[16], distributed implementations of the alternating direction method of multipliers [7], [17]-[20], and decentralized dual averaging [21], [22]. Although there are substantial differences between them, these methods can be generically abstracted as combinations of local descent steps followed by variable exchanges and averaging of information among neighbors. A feature common to all of these algorithms is their slow convergence rate in ill-conditioned problems, since they operate on first order information only.

Work in this paper is supported by NSF CAREER, ONR, and NSFC grants. Aryan Mokhtari and Alejandro Ribeiro are with the Department of Electrical and Systems Engineering, University of Pennsylvania, 200 South 33rd Street, Philadelphia, PA 19104, USA. {aryanm, aribeiro}@seas.upenn.edu. Qing Ling is with the Department of Automation, University of Science and Technology of China, 96 Jinzhai Road, Hefei, Anhui, 230026, China. qingling@mail.ustc.edu.cn. Part of the results in this paper appeared in [1] and [2]. This paper expands the results and presents convergence proofs that are referenced in [1] and [2].
This is not surprising because gradient descent methods in centralized settings, where the aggregate function gradient is available at a single server, have the same difficulties in problems with skewed curvature [see Chapter 9 of [23]]. This issue is addressed in centralized optimization by Newton's method, which uses second order information to determine a descent direction adapted to the objective's curvature [see Chapter 9 of [23]]. In general, second order methods are not available in distributed settings because distributed approximations of Newton steps are difficult to devise. In the particular case of flow optimization problems, these approximations are possible when operating in the dual domain and have led to the development of the accelerated dual descent methods [24], [25]. As would be expected, these methods result in large reductions of convergence times. Our goal is to develop approximate Newton's methods to solve (1) in distributed settings where agents have access to their local functions only and exchange variables with neighboring agents. We do so by introducing network Newton (NN), a method that relies on distributed approximations of Newton steps for the global cost function f to accelerate convergence of DGD.

We begin the paper with an alternative formulation of (1) and a brief discussion of DGD (Section II). We then introduce a reinterpretation of DGD as an algorithm that utilizes gradient descent to solve a penalized version of (1) in lieu of the original optimization problem (Section II-A). This reinterpretation explains convergence of DGD to a neighborhood of x*. The volume of this neighborhood is given by the relative weight of the penalty function and the original objective, which is controlled by a penalty coefficient. If gradient descent on the penalized function finds an approximate solution to the original problem, the same solution can be found with a much smaller number of iterations by using Newton's method. Alas, distributed computation of Newton steps requires global communication between all nodes in the network and is therefore impractical (Section III). To resolve this issue we approximate the Newton step of the penalized objective function by truncating the Taylor series of the exact Newton step (Section III-A). This approximation results in a family of methods indexed by the number of terms of the Taylor expansion that are kept in the approximation. The method that results from keeping K of these terms is termed NN-K.

A fundamental observation here is that the Hessian of the penalized function has a sparsity structure that is the same sparsity pattern of the graph. Thus, when computing terms in the Hessian inverse expansion, the first order term is as sparse as the graph, the second term is as sparse as the two hop neighborhood, and, in general, the k-th term is as sparse as the k-hop neighborhood of the graph. Thus, implementation of the NN-K method requires aggregating information from K hops away. Increasing K makes NN-K arbitrarily close to Newton's method at the cost of increasing the communication overhead of each iteration. We point out that the same Taylor series is used in the development of the ADD algorithms, but this is done to solve a network utility maximization problem in the dual domain [24]. The Taylor expansion is utilized here to solve a consensus optimization problem in the primal domain.

Convergence of NN-K to the optimal argument of the penalized objective is established (Section IV). We do so by establishing several auxiliary bounds on the eigenvalues of the matrices involved in the definition of the method (Propositions 1-3 and Lemma 2). We show that a measure of the error between the Hessian inverse approximation utilized by NN-K and the actual inverse Hessian decays exponentially with the method index K. This exponential decrease hints that using a small value of K should suffice in practice. Convergence is formally claimed in Theorem 1, which shows that the convergence rate is at least linear. It follows from this convergence analysis that larger penalty coefficients result in faster convergence that comes at the cost of increasing the distance between the optimal solutions of the original and penalized objectives. We also study the convergence rate of the NN method as an approximation of Newton's method (Section IV-A). We show that for all iterations except the first few, a weighted gradient norm associated with NN-K iterates follows a decreasing path akin to the path that would be followed by Newton iterates (Lemma 3). The only difference between these residual paths is that the NN-K path contains a term that captures the error of the Hessian inverse approximation. Leveraging this similarity, it is possible to show that the rate of convergence is quadratic in a specific interval whose length depends on the order K of the selected network Newton method (Theorem 2). Existence of this quadratic convergence phase explains why NN-K methods converge faster than DGD, as we observe in experiments. It is also worth remarking that the error in the Hessian inverse approximation can be made arbitrarily small by increasing the method's order K and, as a consequence, the quadratic phase can be made arbitrarily long.

We wrap up the paper with numerical analyses (Section V). We first demonstrate the advantages of NN-K relative to alternative primal and dual methods for the minimization of a family of quadratic objective functions (Section V-A). Then, we study the effect of the objective function condition number and show that the NN method outperforms first-order alternatives significantly in ill-conditioned problems (Section V-B). Further, we study the effect of network topology on the performance of NN (Section V-C). Moreover, we compare the convergence rate of NN in theory and practice to show the tightness of the bounds in this paper (Section V-D). The paper closes with concluding remarks (Section VI).

Notation. Vectors are written as x ∈ R^n and matrices as A ∈ R^{n×n}. The null space of a matrix A is denoted by null(A) and the span of a vector x by span(x). We use ‖x‖ and ‖A‖ to denote the Euclidean norm of vector x and matrix A, respectively. The gradient of a function f(x) is denoted as ∇f(x) and the Hessian matrix is denoted as ∇²f(x). The i-th largest eigenvalue of matrix A is denoted by μ_i(A).
II. DISTRIBUTED GRADIENT DESCENT

The network that connects the n agents is assumed connected, symmetric, and specified by the neighborhoods N_i that contain the list of nodes that can communicate with i for i = 1, ..., n. In problem (1) agent i has access to the local cost f_i(x) and agents cooperate to minimize the global cost f(x). This specification is more naturally formulated by an alternative representation of (1) in which node i selects a local decision vector x_i ∈ R^p. Nodes then try to achieve the minimum of their local objective functions f_i(x_i), while keeping their variables equal to the variables x_j of neighbors j ∈ N_i. This alternative formulation can be written as

{x_i*}_{i=1}^n := argmin_{ {x_i}_{i=1}^n } Σ_{i=1}^n f_i(x_i), s.t. x_i = x_j, for all i, j ∈ N_i. (2)

Since the network is connected, the constraints x_i = x_j for all i and j ∈ N_i imply that (1) and (2) are equivalent and we have x_i* = x* for all i. This must be the case because for a connected network the constraints x_i = x_j for all i and j ∈ N_i collapse the feasible space of (2) to a hyperplane in which all local variables are equal. When all variables are equal, the objectives in (1) and (2) coincide and so do their optima.

DGD is an established distributed method to solve (2) which relies on the introduction of nonnegative weights w_ij ≥ 0 that are null if and only if j ∉ N_i ∪ {i} (the use of time varying weights w_ij is common in DGD implementations but not done here; see, e.g., [13]). Letting t ∈ N be a discrete time index and α a given stepsize, DGD is defined by the recursion

x_{i,t+1} = Σ_{j=1}^n w_ij x_{j,t} − α ∇f_i(x_{i,t}), i = 1, ..., n. (3)

Since w_ij = 0 when j ≠ i and j ∉ N_i, it follows from (3) that each agent i updates its variable x_i by performing an average over its own estimate x_{i,t} and the estimates x_{j,t} of its neighbors j ∈ N_i, and descending through the negative local gradient −α∇f_i(x_{i,t}).

The weights in (3) cannot be arbitrary. To express conditions on the set of allowable weights, define the matrix W ∈ R^{n×n} with entries w_ij. We require the weights to be symmetric, i.e., w_ij = w_ji for all i, j, and such that the weights of a given node sum up to 1, i.e., Σ_{j=1}^n w_ij = 1 for all i. If the weights sum up to 1 we must have W1 = 1, which implies that I − W is rank deficient. It is also customary to require the rank of I − W to be exactly equal to n − 1 so that the null space of I − W is null(I − W) = span(1). We therefore have the following three restrictions on the matrix W:

W^T = W, W1 = 1, null(I − W) = span(1). (4)
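To make the recursion concrete, the following is a minimal sketch of a synchronous DGD simulation. This code is illustrative and not part of the original paper: the quadratic local costs, the ring network, the stepsize, and all variable names are our own assumptions, chosen to anticipate the experiments in Section V.

```python
import numpy as np

def dgd_step(X, W, grads, alpha):
    # One DGD iteration (3): x_{i,t+1} = sum_j w_ij x_{j,t} - alpha * grad f_i(x_{i,t}).
    # X is (n, p) with row i equal to x_{i,t}; W satisfies the conditions in (4);
    # grads(X) returns the (n, p) array of local gradients evaluated at the rows of X.
    return W @ X - alpha * grads(X)

# Toy setup: quadratic local costs f_i(x) = 0.5 x'A_i x + b_i'x, so grad f_i(x) = A_i x + b_i.
n, p, alpha = 4, 3, 0.01
rng = np.random.default_rng(0)
A = [np.diag(rng.uniform(1.0, 2.0, p)) for _ in range(n)]
b = [rng.uniform(0.0, 1.0, p) for _ in range(n)]
grads = lambda X: np.stack([A[i] @ X[i] + b[i] for i in range(n)])

# Ring network: symmetric, doubly stochastic weights with null(I - W) = span(1).
W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

X = np.zeros((n, p))
for t in range(2000):
    X = dgd_step(X, W, grads, alpha)
# The rows of X agree with each other up to O(alpha) and sit near the minimizer of (1).
```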

If the conditions in (4) are true, it is possible to show that (3) approaches the solution of (1) in the sense that x_{i,t} ≈ x* for all i and large t [13]. The accepted interpretation of why (3) converges is that nodes are gradient descending towards their local minima because of the term −α∇f_i(x_{i,t}), but also perform an average of neighboring variables Σ_{j=1}^n w_ij x_{j,t}. This latter consensus operation drives the agents to agreement. In the following section we show that (3) can be alternatively interpreted as a penalty method.

A. Penalty method interpretation

It is illuminating to define matrices and vectors so as to rewrite (3) as a single equation. To do so, define the vectors y := [x_1; ...; x_n] and h(y) := [∇f_1(x_1); ...; ∇f_n(x_n)]. The vector y ∈ R^{np} concatenates the local vectors x_i, and the vector h(y) ∈ R^{np} concatenates the gradients of the local functions f_i taken with respect to the local variables x_i. Notice that h(y) is not the gradient of f(x) and that a vector y with h(y) = 0 does not necessarily solve (1). To solve (1) we need to have x_i = x_j for all i and j with Σ_{i=1}^n ∇f_i(x_i) = 0. In any event, to rewrite (3) we also define the matrix Z := W ⊗ I ∈ R^{np×np} as the Kronecker product of the weight matrix W ∈ R^{n×n} and the identity matrix I ∈ R^{p×p}. It is then easy to see that (3) is equivalent to

y_{t+1} = Z y_t − α h(y_t) = y_t − [ (I − Z) y_t + α h(y_t) ], (5)

where in the second equality we added and subtracted y_t and regrouped terms. Inspection of (5) reveals that the DGD update formula at step t is equivalent to a (regular) gradient descent algorithm being used to solve the program

y* := argmin_y F(y) := min_y (1/2) y^T (I − Z) y + α Σ_{i=1}^n f_i(x_i). (6)

This interpretation has been previously used in [14], [26] to design a Nesterov type acceleration of DGD. Indeed, given the definition of the function F(y) := (1/2) y^T (I − Z) y + α Σ_{i=1}^n f_i(x_i), it follows that the gradient ∇F(y_t) is given by

g_t := ∇F(y_t) = (I − Z) y_t + α h(y_t). (7)

Using (7) we rewrite (5) as y_{t+1} = y_t − g_t and conclude that DGD descends along the negative gradient of F(y) with unit stepsize. The expression in (3) is just a distributed implementation of gradient descent that uses the gradient in (7). To confirm that this is true, observe that the i-th element of the gradient g_t = [g_{1,t}; ...; g_{n,t}] is given by

g_{i,t} = (1 − w_ii) x_{i,t} − Σ_{j∈N_i} w_ij x_{j,t} + α ∇f_i(x_{i,t}). (8)

The gradient descent iteration y_{t+1} = y_t − g_t is then equivalent to (3) if we entrust node i with the implementation of the descent x_{i,t+1} = x_{i,t} − g_{i,t}, where, we recall, x_{i,t} and x_{i,t+1} are the i-th components of the vectors y_t and y_{t+1}. Observe that the local gradient component g_{i,t} can be computed using local information and the x_{j,t} iterates of the neighbors j ∈ N_i. This is as it should be, because the descent x_{i,t+1} = x_{i,t} − g_{i,t} is equivalent to (3).

Is it a good idea to descend on F(y) to solve (1)? To some extent. Since we know that the null space of I − W is null(I − W) = span(1) and that Z = W ⊗ I, we know that the null space of I − Z is the set of consensus vectors, i.e., null(I − Z) = { y = [x_1; ...; x_n] : x_1 = ... = x_n }. Thus, (I − Z) y = 0 holds if and only if x_1 = ... = x_n. Since the matrix I − Z is positive semidefinite and symmetric, the same is true of the square root matrix (I − Z)^{1/2}. Therefore, the optimization problem in (2) is equivalent to the optimization problem

ỹ* := argmin_y Σ_{i=1}^n f_i(x_i), s.t. (I − Z)^{1/2} y = 0. (9)

Indeed, for y = [x_1; ...; x_n] to be feasible in (9) we must have x_1 = ... = x_n. This is the same constraint imposed in (2), from where it follows that we must have ỹ* = [x_1*; ...; x_n*] with x_i* = x* for all i. The unconstrained minimization in (6) is a penalty version of (9). The penalty function associated with the constraint (I − Z)^{1/2} y = 0 is the squared norm (1/2) ‖(I − Z)^{1/2} y‖², and the corresponding penalty coefficient is 1/α.
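The equivalence between (3) and unit-stepsize descent on F(y) is easy to verify numerically. The sketch below is illustrative (not from the paper) and assumes the same toy quadratic setup as the previous snippet; it computes g_t blockwise as in (8) and checks that y_t − g_t reproduces the DGD update (5).

```python
import numpy as np

n, p, alpha = 4, 3, 0.01
rng = np.random.default_rng(0)
A = [np.diag(rng.uniform(1.0, 2.0, p)) for _ in range(n)]
b = [rng.uniform(0.0, 1.0, p) for _ in range(n)]
grads = lambda X: np.stack([A[i] @ X[i] + b[i] for i in range(n)])
W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25
X = rng.normal(size=(n, p))

def penalized_grad(X, W, grads, alpha):
    # Gradient (7) of F(y), computed blockwise as in (8); in stacked form this
    # is (I - W) X + alpha * h(X).
    return (np.eye(W.shape[0]) - W) @ X + alpha * grads(X)

# Unit-stepsize descent on F reproduces the DGD recursion (3)/(5).
assert np.allclose(W @ X - alpha * grads(X), X - penalized_grad(X, W, grads, alpha))
```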
Inasmuch as the penalty coefficient 1/α is sufficiently large, the optimal arguments y* and ỹ* are not too far apart. The reinterpretation of (3) as a penalty method demonstrates that DGD is an algorithm that finds the optimal solution of (6), not of (9) or its equivalent original formulations in (1) and (2). Using a fixed α, the distance between y* and ỹ* is of order O(α) [15]. To solve (9) we would need to introduce a rule to progressively decrease α. In the following section we exploit the reinterpretation of (5) as a method to minimize (6) to propose an approximate Newton algorithm that can be implemented in a distributed manner.

III. NETWORK NEWTON

Instead of solving (6) with a gradient descent method as in DGD, we can solve (6) using Newton's method. To implement Newton's method we need to compute the Hessian H_t := ∇²F(y_t) of F evaluated at y_t so as to determine the Newton step d_t := −H_t^{-1} g_t. Start by differentiating twice in (6) in order to write H_t as

H_t := ∇²F(y_t) = I − Z + α G_t, (10)

where G_t ∈ R^{np×np} is a block diagonal matrix formed by blocks G_{ii,t} ∈ R^{p×p} defined as

G_{ii,t} = ∇²f_i(x_{i,t}). (11)

It follows from (10) and (11) that the Hessian H_t is block sparse with blocks H_{ij,t} ∈ R^{p×p} having the sparsity pattern of Z, which is the sparsity pattern of the graph. The diagonal blocks are of the form H_{ii,t} = (1 − w_ii) I + α ∇²f_i(x_{i,t}) and the off diagonal blocks are not null only when j ∈ N_i, in which case H_{ij,t} = −w_ij I.

While the Hessian H_t is sparse, the inverse H_t^{-1} is not. It is the latter that we need to compute the Newton step d_t := −H_t^{-1} g_t. To overcome this problem we split the diagonal and off diagonal blocks of H_t and rely on a Taylor expansion of the inverse; this splitting technique is inspired by the Taylor expansion used in [24]. To be precise, write H_t = D_t − B, where the matrix D_t is defined as

D_t := α G_t + 2(I − diag(Z)) := α G_t + 2(I − Z_d), (12)

where in the second equality we defined Z_d := diag(Z) for future reference. Since the diagonal weights must satisfy w_ii < 1, the matrix I − Z_d is positive definite. The same is true of the block diagonal matrix G_t because the local functions are assumed strongly convex. Therefore, the matrix D_t is block diagonal and positive definite. The i-th diagonal block D_{ii,t} ∈ R^{p×p} of D_t can be computed and stored by node i as D_{ii,t} = α ∇²f_i(x_{i,t}) + 2(1 − w_ii) I. To have H_t = D_t − B we must define B := D_t − H_t. Considering the definitions of H_t and D_t in (10) and (12), it follows that

B = I − 2 Z_d + Z. (13)

Note that B is time-invariant and depends on the weight matrix Z only. As in the case of the Hessian H_t, the matrix B is block sparse with blocks B_ij ∈ R^{p×p} having the sparsity pattern of Z, which is the sparsity pattern of the graph. Node i can compute the diagonal blocks B_ii = (1 − w_ii) I and the off diagonal blocks B_ij = w_ij I using information about its own and its neighbors' weights.

Proceed now to factor D_t^{1/2} from both sides of the splitting relationship to write H_t = D_t^{1/2} (I − D_t^{-1/2} B D_t^{-1/2}) D_t^{1/2}. When we consider the Hessian inverse H_t^{-1}, we can use the Taylor series (I − X)^{-1} = Σ_{j=0}^∞ X^j with X = D_t^{-1/2} B D_t^{-1/2} to write

H_t^{-1} = D_t^{-1/2} Σ_{k=0}^∞ ( D_t^{-1/2} B D_t^{-1/2} )^k D_t^{-1/2}. (14)

The sum in (14) converges if the absolute values of all the eigenvalues of the matrix D_t^{-1/2} B D_t^{-1/2} are strictly less than 1. For the time being we assume this to be the case, but we will prove that this is true in Section IV. When the series converges, we can use truncations of it to define approximations to the Newton step, as we explain in the following section.

Remark 1. The Hessian decomposition H_t = D_t − B with the matrices D_t and B in (12) and (13), respectively, is not the only valid decomposition that we can use for network Newton. Any decomposition of the form H_t = D_t ± B_t is valid if D_t is positive definite and the eigenvalues of the matrix D_t^{-1/2} B_t D_t^{-1/2} are in the interval (−1, 1). An example alternative decomposition is given by the matrices D_t = α G_t and B_t = −(I − Z). This decomposition has the advantage of separating the effects of the function in D_t and the effects of the network in B_t. The decomposition in (12) and (13) exhibits faster convergence of the series in (14) because the matrix D_t in (12) accumulates more weight in the diagonal than the matrix D_t = α G_t. The study of alternative decompositions is beyond the scope of this paper.

A. Distributed approximations of the Newton step

Network Newton (NN) is defined as a family of algorithms that rely on truncations of the series in (14). The K-th member of this family, NN-K, considers the first K + 1 terms of the series to define the approximate Hessian inverse

Ĥ_t^{(K)-1} := D_t^{-1/2} Σ_{k=0}^K ( D_t^{-1/2} B D_t^{-1/2} )^k D_t^{-1/2}. (15)

NN-K uses the approximate Hessian inverse Ĥ_t^{(K)-1} as a curvature correction matrix in lieu of the exact Hessian inverse H_t^{-1} to estimate the Newton step. I.e., instead of descending along the Newton step d_t := −H_t^{-1} g_t, we descend along the NN-K step d_t^{(K)} := −Ĥ_t^{(K)-1} g_t as an approximation of d_t. Using the explicit expression for Ĥ_t^{(K)-1} in (15) we write the NN-K step as

d_t^{(K)} = −D_t^{-1/2} Σ_{k=0}^K ( D_t^{-1/2} B D_t^{-1/2} )^k D_t^{-1/2} g_t, (16)

where, we recall, g_t is the gradient of the function F(y) defined in (7). The NN-K update can then be written as

y_{t+1} = y_t + ε d_t^{(K)}, (17)

where ε is a properly selected stepsize (see Theorem 1 for specific conditions). The algorithm defined by recursive application of (17) can be implemented in a distributed manner because the truncated series in (15) has a local structure controlled by the parameter K. To explain this statement better, define the components d_{i,t}^{(K)} ∈ R^p of the NN-K step d_t^{(K)} = [d_{1,t}^{(K)}; ...; d_{n,t}^{(K)}].
; d(k) (17) requires ha node i compues d (K) i, he local descen x i,+1 = x i, + ɛd (K) i, R p of he NN-K sep n, ]. A disribued implemenaion of so as o implemen. The key observaion here is ha he sep componen d (K) i, can indeed be compued hrough local operaions. Specificially, begin by noing ha as per he definiion of he NN-K descen direcion in (16) he sequence of NN descen direcions saisfies d (k+1) = D 1 Bd (k) D 1 g = D 1 ( Bd (k) g ). (18) Since he marix B has he sparsiy paern of he graph, his recursion can be decomposed ino local componens ( ) d (k+1) i, = D 1 ii, B ij d (k) j, g i,, (19) j N i {i} The marix D ii, = α f i (x i, ) + (1 w ii )I is sored and compued a node i. The gradien componen g i, = (1 w ii )x i, j N i w ij x j, + α f i (x i, ) is also sored and compued a i. Node i can also evaluae he values of he marix blocks B ii = (1 w ii )I and B ij = w ij I. Thus, if he NN-k sep componens d (k) j, are available a neighbors j, node i can deermine he NN-(k + 1) sep componen d (k+1) i, upon being communicaed ha informaion. The expression in (19) represens an ieraive compuaion embedded inside he NN-K recursion in (17). A ime index, we compue he local componen of he NN-0 sep = D 1 ii, g i,. Upon exchanging his informaion wih neighbors we use (19) o deermine he NN-1 sep d (1) i,. These d (0) i, can be exchanged o compuer d () i, as in (19). Repeaing his procedure K imes, nodes ends up having deermined heir

The resulting NN-K method is summarized in Algorithm 1.

Algorithm 1 Network Newton-K method at node i
Require: Initial iterate x_{i,0}. Weights w_ij. Penalty coefficient α.
1: B matrix blocks: B_ii = (1 − w_ii) I and B_ij = w_ij I
2: for t = 0, 1, 2, ... do
3: D matrix block: D_{ii,t} = α ∇²f_i(x_{i,t}) + 2(1 − w_ii) I
4: Exchange iterates x_{i,t} with neighbors j ∈ N_i.
5: Gradient: g_{i,t} = (1 − w_ii) x_{i,t} − Σ_{j∈N_i} w_ij x_{j,t} + α ∇f_i(x_{i,t})
6: Compute NN-0 descent direction d_{i,t}^{(0)} = −D_{ii,t}^{-1} g_{i,t}
7: for k = 0, ..., K − 1 do
8: Exchange elements d_{i,t}^{(k)} of the NN-k step with neighbors
9: NN-(k+1) step: d_{i,t}^{(k+1)} = D_{ii,t}^{-1} [ Σ_{j∈N_i∪{i}} B_ij d_{j,t}^{(k)} − g_{i,t} ]
10: end for
11: Update local iterate: x_{i,t+1} = x_{i,t} + ε d_{i,t}^{(K)}.
12: end for

The descent iteration in (17) is implemented in Step 11. Implementation of this descent requires access to the NN-K descent direction d_{i,t}^{(K)}, which is computed by the loop in Steps 6-10. Step 6 initializes the loop by computing the NN-0 step d_{i,t}^{(0)} = −D_{ii,t}^{-1} g_{i,t}. The core of the loop is in Step 9, which corresponds to the recursion in (19). Step 8 stands for the variable exchange that is required to implement Step 9. After K iterations through this loop, the NN-K descent direction d_{i,t}^{(K)} is computed and can be used in Step 11. Both Steps 6 and 9 require access to the local gradient component g_{i,t}. This is evaluated in Step 5 after receiving the prerequisite information from neighbors in Step 4. Steps 1 and 3 compute the blocks B_ii, B_ij, and D_{ii,t} required in Steps 6 and 9.

Remark 2. By trying to approximate the Newton step, NN-K ends up reducing the number of iterations required for convergence. Furthermore, the larger K is, the closer the NN-K step gets to the Newton step, and the faster NN-K converges. We will justify these assertions both analytically in Section IV and numerically in Section V. It is important to observe, however, that reducing the number of iterations reduces the computational cost but not necessarily the communication cost. In DGD, each node i shares its vector x_{i,t} ∈ R^p with each of its neighbors j ∈ N_i. In NN-K, node i exchanges not only the vector x_{i,t} ∈ R^p with its neighboring nodes, but also communicates iteratively the local components of the descent directions {d_{i,t}^{(k)}}_{k=0}^{K−1} ∈ R^p so as to compute the descent direction d_{i,t}^{(K)}. Hence, at each iteration, node i sends |N_i| vectors of size p to its neighbors in DGD, while in NN-K it sends (K+1)|N_i| vectors of the same size. Unless the original problem is well conditioned, NN-K still reduces the total communication cost until convergence, even though the cost of each individual iteration is larger. However, the use of a large K is unwarranted because the added benefit of better approximating the Newton step does not compensate the increase in communication cost.

IV. CONVERGENCE ANALYSIS

In this section we show that as time progresses the sequence of objective function values F(y_t) [cf. (6)] approaches the optimal objective function value F(y*). In proving this claim we make the following assumptions.

Assumption 1. There exist constants 0 < δ ≤ Δ < 1 that lower and upper bound the diagonal weights for all i,

0 < δ ≤ w_ii ≤ Δ < 1, i = 1, ..., n. (20)

Assumption 2. The local objective functions f_i(x) are twice differentiable and the eigenvalues of the local Hessians are bounded with positive constants 0 < m ≤ M < ∞, i.e.,

m I ⪯ ∇²f_i(x) ⪯ M I. (21)

Assumption 3. The local objective function Hessians ∇²f_i(x) are Lipschitz continuous with respect to the Euclidean norm with parameter L, i.e., for all x, x̂ ∈ R^p it holds

‖∇²f_i(x) − ∇²f_i(x̂)‖ ≤ L ‖x − x̂‖. (22)

The lower bound in Assumption 1 is more a definition than a constraint. To be more precise, the weights w_ij are positive if and only if j ∈ N_i or j = i.
This observation verifies existence of a lower bound for the local weights w_ii, which is defined as δ > 0 in Assumption 1. The upper bound Δ < 1 on the weights w_ii holds for all connected networks as long as neighbors j ∈ N_i are assigned nonzero weights w_ij > 0. This is because the matrix W is doubly stochastic [cf. (4)], which implies that w_ii = 1 − Σ_{j∈N_i} w_ij < 1 as long as w_ij > 0. The lower bound m on the eigenvalues of the local objective function Hessians ∇²f_i(x) is equivalent to strong convexity of the local objective functions f_i(x) with parameter m. The strong convexity assumption stated in Assumption 2 is customary in Newton-based methods, since the Hessian of the objective function should be invertible to implement Newton's method [Chapter 9 of [23]]. The upper bound M on the eigenvalues of the local objective function Hessians ∇²f_i(x) is equivalent, for twice differentiable functions, to the condition that the gradients ∇f_i(x) are Lipschitz continuous with parameter M. The restriction imposed by Assumption 3 is customary in the analysis of second order methods [see, e.g., [23]] and guarantees that the Hessians ∇²F(y) are also Lipschitz continuous, as we show in the following lemma.

Lemma 1. Consider the definition of the objective function F(y) in (6). If Assumption 3 holds, then the objective function Hessian H(y) := ∇²F(y) is Lipschitz continuous with parameter αL, i.e., for all y, ŷ ∈ R^{np} we have

‖H(y) − H(ŷ)‖ ≤ αL ‖y − ŷ‖. (23)

Proof: See Appendix A.

Lemma 1 states that the penalized objective function introduced in (6) has Lipschitz continuous Hessians, with a Lipschitz constant that is a function of the penalty coefficient 1/α. Thus, if we increase the penalty coefficient 1/α, or, equivalently, decrease α, the objective function F(y) approaches a quadratic form because the curvature becomes constant.

To prove convergence properties of NN we need bounds for the eigenvalues of the block diagonal matrix D_t, the block sparse matrix B, and the Hessian H_t. These eigenvalue bounds are established in the following proposition using the conditions imposed by Assumptions 1 and 2.

Proposition 1. Consider the definitions of the matrices H_t, D_t, and B in (10), (12), and (13), respectively. If Assumptions 1 and 2 hold true, then the eigenvalues of the matrices H_t, D_t, and B are uniformly bounded as

αm I ⪯ H_t ⪯ (2(1 − δ) + αM) I, (24)

(2(1 − Δ) + αm) I ⪯ D_t ⪯ (2(1 − δ) + αM) I, (25)

0 ⪯ B ⪯ 2(1 − δ) I. (26)

Proof: See Appendix B.

Proposition 1 states that the Hessian matrix H_t and the block diagonal matrix D_t are positive definite, while the matrix B is positive semidefinite. As we noted in Section III, for the expansion in (14) to be valid the eigenvalues of the matrix D_t^{-1/2} B D_t^{-1/2} must be nonnegative and strictly smaller than 1. The following proposition states that this is true for all times t.

Proposition 2. Consider the definitions of the matrices D_t in (12) and B in (13). If Assumptions 1 and 2 hold true, the matrix D_t^{-1/2} B D_t^{-1/2} is positive semidefinite and its eigenvalues are bounded above by a constant ρ < 1,

0 ⪯ D_t^{-1/2} B D_t^{-1/2} ⪯ ρ I, (27)

where ρ := 2(1 − δ) / (2(1 − δ) + αm).

Proof: See Appendix C.

The results in Proposition 1 would lead to the trivial upper bound 2(1 − δ)/(2(1 − Δ) + αm) for the eigenvalues of D_t^{-1/2} B D_t^{-1/2}. The upper bound in Proposition 2 is tighter and follows from the structure of the matrix D_t^{-1/2} B D_t^{-1/2}. The bounds for the eigenvalues of D_t^{-1/2} B D_t^{-1/2} in (27) guarantee convergence of the Taylor series in (14).

As mentioned in Section III, NN-K truncates the Hessian inverse Taylor series in (14) after the first K + 1 summands to approximate the Hessian inverse of the objective function of the optimization problem (6). To evaluate the performance of NN-K we study the error of the Hessian inverse approximation by defining the error matrix E_t ∈ R^{np×np} as

E_t := I − Ĥ_t^{(K)-1/2} H_t Ĥ_t^{(K)-1/2}. (28)

The error matrix E_t measures the closeness of the Hessian inverse approximation matrix Ĥ_t^{(K)-1} and the exact Hessian inverse H_t^{-1} at time t. Based on the definition of the error matrix E_t, if the Hessian inverse approximation Ĥ_t^{(K)-1} approaches the exact Hessian inverse H_t^{-1}, the error matrix E_t approaches the zero matrix 0. We therefore bound the error of the Hessian inverse approximation by developing a bound for the eigenvalues of E_t. This bound is provided in the following proposition.

Proposition 3. Consider the NN-K method in (12)-(17) and the definition of the error matrix E_t in (28). Further, recall the definition of the constant ρ := 2(1 − δ)/(2(1 − δ) + αm) < 1 in Proposition 2. The error matrix E_t is positive semidefinite and all its eigenvalues are upper bounded by ρ^{K+1},

0 ⪯ E_t ⪯ ρ^{K+1} I. (29)

Proof: See Appendix D.

Proposition 3 asserts that the error in the approximation of the Hessian inverse, and thereby in the approximation of the Newton step, is bounded by ρ^{K+1}. This result corroborates the intuition that the larger K is, the closer d_{i,t}^{(K)} approximates the Newton step. This closer approximation comes at the cost of increasing the communication cost of each descent iteration. The decrease of this error being proportional to ρ^{K+1} hints that using a small value of K should suffice in practice. Further, to decrease ρ we can increase δ or increase α. Increasing δ calls for assigning substantial weight to w_ii. Increasing α comes at the cost of moving the solution of (6) away from the solution of (9) and its equivalent (1).
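Propositions 1-3 are straightforward to probe numerically. The following sketch is illustrative (not from the paper): it assumes the quadratic setting where ∇²f_i = A_i, builds H_t, D_t, and B explicitly, forms the truncated series (15), and checks the eigenvalue bound (29) on the error matrix E_t; SciPy is used for matrix square roots.

```python
import numpy as np
from scipy.linalg import block_diag, sqrtm, eigvalsh

def check_prop3(W, A, alpha, K):
    # Verify 0 <= eig(E) <= rho^{K+1} with E = I - Hhat^{-1/2} H Hhat^{-1/2}, cf. (28)-(29).
    n, p = W.shape[0], A[0].shape[0]
    Inp, Ip = np.eye(n * p), np.eye(p)
    Z = np.kron(W, Ip)                                              # Z = W kron I
    G = block_diag(*A)                                              # grad^2 f_i = A_i
    H = Inp - Z + alpha * G                                         # Hessian (10)
    D = alpha * G + 2.0 * (Inp - np.kron(np.diag(np.diag(W)), Ip))  # splitting block (12)
    B = D - H                                                       # B = I - 2 Z_d + Z, cf. (13)
    Dh = np.real(sqrtm(np.linalg.inv(D)))                           # D^{-1/2}
    M = Dh @ B @ Dh
    Hhat_inv = Dh @ sum(np.linalg.matrix_power(M, k) for k in range(K + 1)) @ Dh   # (15)
    S = np.real(sqrtm(Hhat_inv))
    E = Inp - S @ H @ S                                             # error matrix (28)
    delta = np.diag(W).min()
    m = min(eigvalsh(Ai).min() for Ai in A)
    rho = 2 * (1 - delta) / (2 * (1 - delta) + alpha * m)           # Proposition 2
    ev = eigvalsh(E)
    return ev.min() >= -1e-8 and ev.max() <= rho ** (K + 1) + 1e-8
```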
Bounds on the eigenvalues of the objective function Hessian H_t are central to the convergence analysis of Newton's method [Chapter 9 of [23]]. Lower bounds on the Hessian eigenvalues guarantee that the matrix is nonsingular. Upper bounds imply that the minimum eigenvalue of the Hessian inverse H_t^{-1} is strictly larger than zero, which, in turn, implies a strict decrement in each Newton step. Analogous bounds for the eigenvalues of the NN approximate Hessian inverses Ĥ_t^{(K)-1} are required. These bounds are studied in the following lemma.

Lemma 2. Consider the NN-K method as defined in (12)-(17). If Assumptions 1 and 2 hold true, we have

λ I ⪯ Ĥ_t^{(K)-1} ⪯ Λ I, (30)

where the constants λ and Λ are defined as

λ := 1 / (2(1 − δ) + αM) and Λ := (1 − ρ^{K+1}) / ( (1 − ρ)(2(1 − Δ) + αm) ). (31)

Proof: See Appendix E.

According to the result in Lemma 2, the NN-K approximate Hessian inverses Ĥ_t^{(K)-1} are strictly positive definite and have all of their eigenvalues bounded between the positive and finite constants λ and Λ. This is true for all K and uniform across all iteration indexes t. Considering these eigenvalue bounds and the fact that −g_t is a descent direction, the approximate Newton step −Ĥ_t^{(K)-1} g_t enforces convergence of the iterates y_t to the optimal argument y* of the penalized objective function F(y) in (6). In the following theorem we show that if the stepsize ε is properly chosen, the sequence of objective function values F(y_t) converges at least linearly to the optimal objective function value F(y*).

Theorem 1. Consider the NN-K method as defined in (12)-(17) and the objective function F(y) as introduced in (6). Further, recall the definitions in (31) of the lower and upper bounds λ and Λ, respectively, for the eigenvalues of the approximate Hessian inverse Ĥ_t^{(K)-1}.

If the stepsize ε is chosen as

ε ≤ min { 1, [ 3mλ^{5/2} / ( 2^{1/2} L Λ³ (F(y_0) − F(y*))^{1/2} ) ]^{1/2} }, (32)

and Assumptions 1-3 hold, the sequence F(y_t) converges to the optimal objective value F(y*) at least linearly as

F(y_t) − F(y*) ≤ (1 − ζ)^t (F(y_0) − F(y*)), (33)

where the constant 0 < ζ < 1 is explicitly given by

ζ := (2 − ε) ε α m λ − ( 2^{3/2} α ε³ L Λ³ (F(y_0) − F(y*))^{1/2} ) / ( 6 λ^{3/2} ). (34)

Proof: See Appendix F.

Theorem 1 shows that the objective function error sequence F(y_t) − F(y*) asymptotically converges to zero and that the rate of convergence is at least linear. Note that according to the definition of the convergence parameter ζ in Theorem 1 and the definitions of λ and Λ in (31), increasing α leads to faster convergence. This observation verifies existence of a tradeoff between rate and accuracy of convergence. For large values of α, the sequence generated by network Newton converges faster to the optimal solution of (6). This faster convergence comes at the cost of increasing the distance between the optimal solutions of (6) and (1). Conversely, a smaller α implies a smaller gap between the optimal solutions of (6) and (1), but the convergence rate of NN-K is slower. In the following section, we illustrate the connection between network Newton and the centralized Newton's method.

A. Analysis of network Newton as a Newton-like method

To connect the proposed NN method with the classic Newton's method, we first study the difference between these methods. In particular, the following lemma shows that the convergence of the norm of the weighted gradient D_t^{-1/2} g_t in NN-K is akin to the convergence of Newton's method with constant stepsize. The difference is the appearance of a term associated with the error of the Hessian inverse approximation, as we formally state next.

Lemma 3. Consider the NN-K method as defined in (12)-(17). If Assumptions 1-3 hold, the sequence of weighted gradients D_{t+1}^{-1/2} g_{t+1} satisfies

‖D_{t+1}^{-1/2} g_{t+1}‖ ≤ (1 − ε + ε ρ^{K+1}) [ 1 + Γ_1 (1 − ζ)^{(t−1)/4} ] ‖D_t^{-1/2} g_t‖ + ε² Γ_2 ‖D_t^{-1/2} g_t‖², (35)

where the constants Γ_1 and Γ_2 are defined as

Γ_1 := (2αεLΛ)^{1/2} (F(y_0) − F(y*))^{1/4} / ( λ^{3/4} (2(1 − Δ) + αm)^{1/2} ), Γ_2 := αLΛ² / ( 2λ (2(1 − Δ) + αm)^{1/2} ). (36)

Proof: See Appendix G.

As per Lemma 3, the weighted gradient norm ‖D_{t+1}^{-1/2} g_{t+1}‖ is upper bounded by terms that are linear and quadratic in the weighted norm ‖D_t^{-1/2} g_t‖ associated with the previous iterate. This is akin to the gradient norm decrease of Newton's method with constant stepsize. Note that if the error of the Hessian inverse approximation, which is characterized by ρ^{K+1}, were zero, then by setting ε = 1 we could simplify (35) to ‖D_{t+1}^{-1/2} g_{t+1}‖ ≤ Γ_2 ‖D_t^{-1/2} g_t‖². This result shows quadratic convergence when Γ_2 ‖D_t^{-1/2} g_t‖ < 1. However, the term ρ^{K+1} is not zero in general. Although the error of the Hessian inverse approximation is not zero, the result in (35) is very similar to the one for the classic Newton's method. To make this connection clearer, further note that for all t except the first few iterations the term Γ_1 (1 − ζ)^{(t−1)/4} is close to 0 and the relation in (35) can be simplified to

‖D_{t+1}^{-1/2} g_{t+1}‖ ≤ (1 − ε + ε ρ^{K+1}) ‖D_t^{-1/2} g_t‖ + ε² Γ_2 ‖D_t^{-1/2} g_t‖². (37)

In (37), the coefficient of the linear term is reduced to (1 − ε + ε ρ^{K+1}) and the coefficient of the quadratic term stays at ε² Γ_2. If, for discussion purposes, we set ε = 1 as in Newton's quadratic phase, the upper bound in (37) is further reduced to

‖D_{t+1}^{-1/2} g_{t+1}‖ ≤ ρ^{K+1} ‖D_t^{-1/2} g_t‖ + Γ_2 ‖D_t^{-1/2} g_t‖². (38)

The expression in (38) makes the connection between NN and Newton's method clear, because the exact same result would hold for Newton's method if we set ρ = 0. The NN method cannot have a quadratic convergence phase for the rest of the iterations like the one for Newton's method because of the term ρ^{K+1} ‖D_t^{-1/2} g_t‖. However, since the constant ρ (cf. Proposition 2) is smaller than 1, the term ρ^{K+1} can be made arbitrarily small by increasing the approximation order K.
Equivalently, this means that by selecting K to be large enough, we can make the quadratic term in (38) dominant and observe a quadratic convergence phase. The boundaries of this quadratic convergence phase are formally determined in the following theorem using the result in (35).

Theorem 2. Consider the NN-K method as defined in (12)-(17). Define the sequence η_t := (1 − ε + ε ρ^{K+1}) [1 + Γ_1 (1 − ζ)^{(t−1)/4}] and the time t_0 as the first time at which the sequence η_t is smaller than 1, i.e., t_0 := argmin_t { η_t < 1 }. If Assumptions 1-3 hold, then for all t ≥ t_0, when the sequence ‖D_t^{-1/2} g_t‖ satisfies

η_t (1 − η_t) / ( (1 + η_t) ε² Γ_2 ) ≤ ‖D_t^{-1/2} g_t‖ < (1 − η_t) / (2 ε² Γ_2), (39)

the sequence of scaled gradient norms is such that

‖D_{t+1}^{-1/2} g_{t+1}‖ ≤ ( 2 ε² Γ_2 / (1 − η_t) ) ‖D_t^{-1/2} g_t‖². (40)

Proof: Based on the definition of η_t, we can rewrite (35) as

‖D_{t+1}^{-1/2} g_{t+1}‖ ≤ η_t ‖D_t^{-1/2} g_t‖ + ε² Γ_2 ‖D_t^{-1/2} g_t‖². (41)

We use this expression to prove the inequality in (40). To do so, rearrange terms in the first inequality in (39) and write

η_t ≤ ( (1 + η_t) ε² Γ_2 / (1 − η_t) ) ‖D_t^{-1/2} g_t‖. (42)

Multiplying both sides of (42) by ‖D_t^{-1/2} g_t‖ yields

η_t ‖D_t^{-1/2} g_t‖ ≤ ( (1 + η_t) ε² Γ_2 / (1 − η_t) ) ‖D_t^{-1/2} g_t‖². (43)

Substituting the upper bound in (43) for η_t ‖D_t^{-1/2} g_t‖ in (41) implies that

‖D_{t+1}^{-1/2} g_{t+1}‖ ≤ ( (1 + η_t) ε² Γ_2 / (1 − η_t) ) ‖D_t^{-1/2} g_t‖² + ε² Γ_2 ‖D_t^{-1/2} g_t‖² = ( 2 ε² Γ_2 / (1 − η_t) ) ‖D_t^{-1/2} g_t‖². (44)

To verify quadratic convergence, it is necessary to prove that the sequence ‖D_t^{-1/2} g_t‖ of weighted gradient norms is decreasing. For this to be true we must have

( 2 ε² Γ_2 / (1 − η_t) ) ‖D_t^{-1/2} g_t‖ < 1. (45)

But (45) is true because we are looking at a range of gradients that satisfy the second inequality in (39).

As per Theorem 1, y_t converges to y* at a rate that is at least linear. Thus, the gradients g_t will be such that at some point in time they satisfy the rightmost inequality in (39). At that point in time, progress towards y* proceeds at a quadratic rate as indicated by (40). This quadratic rate of progress is maintained until the leftmost inequality in (39) ceases to hold, at which point the linear term in (35) dominates and the convergence rate goes back to linear. Furthermore, by making K sufficiently large it is possible to reduce η_t arbitrarily and make the quadratic convergence region last longer. In practice, this calls for making K large enough so that η_t is close to the desired gradient norm accuracy.

Remark 3. For a quadratic function F, the Lipschitz constant for the Hessian is L = 0. Then, the optimal choice of stepsize for NN-K is ε = 1, as a result of the stepsize rule in (32). Moreover, the constants for the linear and quadratic terms in (35) are Γ_1 = Γ_2 = 0, as follows from their definitions in (36). For quadratic functions we also have that the Hessian of the objective function H_t = H and the block diagonal matrix D_t = D are time-invariant. Thus, we can rewrite (35) as

‖D^{-1/2} g_{t+1}‖ ≤ ρ^{K+1} ‖D^{-1/2} g_t‖. (46)

Note that Newton's method converges in a single step in quadratic programming. This property follows from (46) because Newton's method is equivalent to NN-K as K → ∞. The expression in (46) states that NN-K converges linearly with a constant decrease factor of ρ^{K+1} per iteration. This is in contrast with first order methods like DGD that converge with a linear rate that depends on the problem condition number.

V. NUMERICAL ANALYSIS

In this section, we study the performance of NN-K in the minimization of a distributed quadratic objective. For each agent i we consider a positive definite diagonal matrix A_i ∈ S_{++}^p and a vector b_i ∈ R^p to define the local objective function f_i(x) := (1/2) x^T A_i x + b_i^T x. Therefore, the global cost function f(x) is written as

f(x) := Σ_{i=1}^n ( (1/2) x^T A_i x + b_i^T x ). (47)

The difficulty of solving (47) is given by the condition number of the matrices A_i. To tune condition numbers we generate diagonal matrices A_i with random diagonal elements a_ii. The first p/2 diagonal elements a_ii are drawn uniformly at random from the discrete set {1, 10^{-1}, ..., 10^{-ξ}} and the next p/2 are uniformly and randomly chosen from the set {1, 10^1, ..., 10^ξ}. This choice of coefficients yields local matrices A_i with eigenvalues in the interval [10^{-ξ}, 10^{ξ}] and global matrices Σ_{i=1}^n A_i with eigenvalues in the interval [n 10^{-ξ}, n 10^{ξ}]. The linear terms b_i^T x are added so that the different local functions have different minima. The vectors b_i are chosen uniformly at random from the box [0, 1]^p. The graph is d-regular and generated by creating a cycle and then connecting each node with the d/2 nodes that are closest in each direction. The diagonal weights in the matrix W are set to w_ii = 1/2 + 1/(2(d + 1)) and the off diagonal weights to w_ij = 1/(2(d + 1)) when j ∈ N_i.
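The setup just described can be generated as follows. This is an illustrative sketch; the sampling details and the random seed are our own choices within the stated distributions.

```python
import numpy as np

def make_problem(n, p, xi, rng):
    # Local costs f_i(x) = 0.5 x'A_i x + b_i'x with diagonal A_i as in Section V:
    # first p/2 diagonal entries from {1, 10^-1, ..., 10^-xi}, rest from {1, 10, ..., 10^xi}.
    A, b = [], []
    for _ in range(n):
        small = 10.0 ** (-rng.integers(0, xi + 1, p // 2))
        large = 10.0 ** (rng.integers(0, xi + 1, p - p // 2))
        A.append(np.diag(np.concatenate([small, large])))
        b.append(rng.uniform(0.0, 1.0, p))          # b_i uniform on the box [0, 1]^p
    return A, b

def dregular_weights(n, d):
    # d-regular ring lattice: each node links to the d/2 closest nodes in each direction,
    # with w_ii = 1/2 + 1/(2(d+1)) and w_ij = 1/(2(d+1)) for neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        for k in range(1, d // 2 + 1):
            W[i, (i + k) % n] = W[i, (i - k) % n] = 1.0 / (2.0 * (d + 1))
        W[i, i] = 0.5 + 1.0 / (2.0 * (d + 1))
    return W

rng = np.random.default_rng(1)
A, b = make_problem(n=100, p=20, xi=2, rng=rng)
W = dregular_weights(n=100, d=4)
```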
A. Comparison with existing methods

In this section we compare the performance of the proposed NN method with primal methods such as DGD in [13] and the accelerated version of DGD (Acc. DGD) in [14]. For the Acc. DGD method, we assume that the stepsize parameter and the momentum coefficients are constant, as in the case of centralized accelerated gradient descent. This makes the comparison between Acc. DGD, DGD, and NN fair, since our aim is to compare their performances in solving the penalized objective function. Moreover, we consider the convergence paths of the distributed ADMM (DADMM) in [18] and the exact first order method EXTRA in [16]. Although EXTRA operates in the primal domain, it has been shown that it can be interpreted as a saddle-point method [27]. Thus, we consider EXTRA in the category of dual methods, which have a linear convergence rate as DADMM does.

We compare these methods in solving (47) for the case that there are n = 100 nodes in the network and the dimension of the vector x is p = 20. We assume that the graph is 4-regular. Further, we set the condition number parameter to ξ = 2 and the penalty parameter to α = 10^{-2}. The momentum coefficient for the accelerated DGD is 0.9. Note that among the values {0.1, 0.2, ..., 0.9, 1}, the best performance belongs to the momentum coefficient 0.9, which we use in the experiments. As the condition number of the problem is relatively large, i.e., of order 10^4, the NN method performs better than DGD and Acc. DGD in terms of the number of iterations and the total number of local information exchanges, as illustrated in Fig. 1 and Fig. 2, respectively. In the case that the condition number of the objective function is not significantly large with respect to the dimension of the problem, the accelerated DGD would be a better choice relative to NN.

The comparison with dual methods shows that in terms of iterations and rounds of communications, DADMM and the different variants of NN perform relatively well, and after some point DADMM outperforms NN and the other primal methods because it converges to the optimal argument of the original problem instead of the penalized function.

Fig. 1: Comparison of DGD, Acc. DGD, DADMM, EXTRA, NN-0, NN-1, and NN-2 in terms of number of iterations.

Fig. 2: Comparison of DGD, Acc. DGD, DADMM, EXTRA, NN-0, NN-1, and NN-2 in terms of rounds of local information exchanges.

Fig. 3: Relative error ‖x_t − x*‖/‖x_0 − x*‖ of DGD, Acc. DGD, NN-0, NN-1, and NN-2 vs. number of local information exchanges for a well-conditioned problem.

Fig. 4: Relative error ‖x_t − x*‖/‖x_0 − x*‖ of DGD, Acc. DGD, NN-0, NN-1, and NN-2 vs. number of local information exchanges for an ill-conditioned problem.

On the other hand, each step of DADMM requires solving a convex program, which can be computationally costly. We observe that EXTRA also has a linear convergence rate to the exact optimal solution, and its accuracy eventually becomes better than that of all the primal methods. However, EXTRA is a first-order method and its convergence at the beginning is relatively slower than NN. This advantage of NN results from the incorporation of the curvature information of the objective function. These observations suggest that by combining the ideas of NN and EXTRA we should be able to come up with a second-order method that has a linear convergence rate to the exact solution of (47) while performing well in ill-conditioned problems.

B. Effect of objective function condition number

We study the effect of the condition number on the convergence rate of NN and show that NN is less sensitive to the objective function condition number than primal first-order methods, e.g., DGD in [13] and accelerated DGD in [14]. To do so, we compare the performances of the mentioned methods in solving the problem in (47) for small and large condition numbers. The parameters are the same as the parameters in Fig. 1 except the choice of the condition number parameter ξ. We first consider the case ξ = 1, which leads to condition number 10². The convergence paths of DGD, accelerated DGD, NN-0, NN-1, and NN-2 in terms of the number of local information exchanges are shown in Fig. 3. The performances of the variations of NN are not significantly better than DGD and accelerated DGD. In particular, DGD and Acc. DGD both outperform NN-1 and NN-2 in terms of the total communications until convergence. Thus, accelerated DGD is the best option among the primal methods for problems with small condition number.

To explore the performance of these methods for an ill-conditioned problem we set the condition number parameter to ξ = 3, which leads to condition number 10⁶ for the considered realization. Fig. 4 illustrates the convergence paths of the considered primal methods in terms of the number of local information exchanges. As we observe, the advantage of the network Newton methods is substantial in this setting, and they outperform DGD and accelerated DGD in terms of communication cost.

C. Effect of network topology

We proceed to compare the performance of NN in different network topologies. In particular, we consider five different topologies: random graphs with connectivity probabilities p_c = 0.25 and p_c = 0.35, the complete graph, the cycle, and the line. Note that in the random graphs, we generate the edges between nodes with probability p_c. The complete graph is a graph in which all nodes are connected to each other directly.
A cycle graph is a connected graph in which each node has degree 2. A line graph is a cycle graph that is missing an edge. The parameters are the same as the parameters in Fig. 1 except the network graph and the way that we generate the weight matrix W: here we generate the weight matrix W using the formula W = I − L/τ, where L is the Laplacian matrix of the graph and τ/2 is the largest eigenvalue of the Laplacian L.
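A minimal sketch of this weight construction follows (illustrative code; the cycle adjacency is just an example graph):

```python
import numpy as np

def laplacian_weights(adj):
    # W = I - L / tau, where L is the graph Laplacian and tau equals twice the
    # largest eigenvalue of L, as used in Section V-C.
    L = np.diag(adj.sum(axis=1)) - adj
    tau = 2.0 * np.linalg.eigvalsh(L).max()
    return np.eye(adj.shape[0]) - L / tau

# Example: a cycle on 100 nodes; the diagonal weights come out to w_ii = 0.75.
n = 100
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[i, (i - 1) % n] = 1.0
W = laplacian_weights(adj)
```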

Fig. 5: Relative error ‖x_t − x*‖/‖x_0 − x*‖ of NN-2 vs. number of iterations for random graphs with p_c = {0.25, 0.35}, the complete graph, the cycle graph, and the line graph.

Fig. 6: Relative error ‖x_t − x*‖/‖x_0 − x*‖ of NN-2 vs. total communications between nodes for random graphs with p_c = {0.25, 0.35}, the complete graph, the cycle graph, and the line graph.

Fig. 7: Comparison of the theoretical bound (T.B.) in (46) on the weighted gradient norm ‖D^{-1/2} g_t‖ with the empirical results for NN-0, NN-1, and NN-2 in a quadratic program.

We compare the performance of NN-2 for all these networks in terms of the number of iterations and the total number of communications between nodes. Notice that in this section we use total communications between nodes instead of the number of local information exchanges (rounds of local communications), since the degrees of the nodes in the different networks are not equal. The convergence paths of NN-2 for the considered topologies in terms of the number of iterations and the total number of communications are demonstrated in Fig. 5 and Fig. 6, respectively.

The first important observation is the accuracy of convergence. According to the results in [15], if we define β < 1 as the second largest magnitude of the eigenvalues of W, then the accuracy of convergence is proportional to 1/(1 − β). Thus, the graphs with smaller β converge to a smaller neighborhood of the optimal argument. In particular, the parameter β for the complete graph, which has the most accurate convergence, is β = 0.5, while the line graph has the least accurate convergence path.

The second important observation is the rate of convergence of NN-2 in these network topologies. It follows from the result in Theorem 1 that for a quadratic objective function the constant of linear convergence becomes 1 − αmλ. Therefore, for larger values of λ we expect faster convergence. Note that λ is large when δ = min_i w_ii is large and close to 1. These observations imply that for the graphs with larger δ we expect faster linear convergence. The convergence paths in Fig. 5 reinforce this claim: the values of δ for the considered graphs (e.g., δ ≈ 0.51 for the complete graph and δ = 0.75 for the cycle) justify the similarity of the convergence paths of the line and cycle graphs and the slow convergence rate of the complete graph.

D. Tightness of the bounds

In this section, we study the tightness of the theoretical bounds in the paper. To do so, we compare the empirical convergence rates of NN-0, NN-1, and NN-2 with the theoretical result in Lemma 3. As we discussed in Remark 3, for a quadratic objective function the sequence of weighted gradients of NN-K satisfies the inequality ‖D^{-1/2} g_{t+1}‖ ≤ ρ^{K+1} ‖D^{-1/2} g_t‖. We refer to this rate as T.B., which stands for theoretical bound. Figure 7 illustrates the theoretical bounds and the empirical convergence paths of NN-0, NN-1, and NN-2 for the quadratic problem in (47). As we observe, the convergence rates of all methods are faster than their theoretical bounds at the beginning, but after almost 10 iterations their convergence rates become similar to the theoretical bound in (46). To be clearer, the slopes of the actual convergence paths and their corresponding theoretical bounds become equal after almost 10 iterations. This observation shows that the bound in (46) is reasonably tight and the sequence of weighted gradients for NN-K diminishes with factor ρ^{K+1}.
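The comparison in Fig. 7 can be reproduced along the following lines. This is an illustrative sketch that reuses make_problem, dregular_weights, and nn_step from the earlier snippets; note that for diagonal A_i the matrix D is diagonal, which keeps the weighted norm cheap to evaluate.

```python
import numpy as np
from scipy.linalg import block_diag

# Track ||D^{-1/2} g_t|| for NN-K and compare successive ratios with rho^{K+1}, cf. (46).
n, p, alpha, K, eps = 100, 20, 1e-2, 1, 1.0
rng = np.random.default_rng(2)
A, b = make_problem(n, p, xi=2, rng=rng)
W = dregular_weights(n, d=4)
D = block_diag(*[alpha * A[i] + 2.0 * (1.0 - W[i, i]) * np.eye(p) for i in range(n)])
Dh = np.diag(1.0 / np.sqrt(np.diag(D)))              # D^{-1/2}; D is diagonal here
delta = np.diag(W).min()
m = min(np.diag(Ai).min() for Ai in A)
rho = 2 * (1 - delta) / (2 * (1 - delta) + alpha * m)

X, norms = np.zeros((n, p)), []
for t in range(60):
    g = (np.eye(n) - W) @ X + alpha * np.stack([A[i] @ X[i] + b[i] for i in range(n)])
    norms.append(np.linalg.norm(Dh @ g.reshape(-1)))
    X = nn_step(X, W, A, b, alpha, K, eps)

ratios = np.array(norms[1:]) / np.array(norms[:-1])  # should settle below rho ** (K + 1)
```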
VI. CONCLUSIONS

We developed the network Newton method as an approximate Newton method for solving consensus optimization problems. The algorithm builds on a reinterpretation of distributed gradient descent as a penalty method and relies on an approximation of the Newton step of the corresponding penalized objective function. To approximate the Newton direction we truncate the Taylor series of the exact Newton step. This leads to a family of methods defined by the number K of Taylor series terms kept in the approximation. When we keep K terms of the Taylor series, the method is called NN-K and can be implemented through the aggregation of information in K-hop neighborhoods. We showed that NN converges at least linearly to the solution of the penalized objective and, consequently, to a neighborhood of the optimal argument of the original optimization problem.

We completed the convergence analysis of NN-K by showing that the sequence of iterates generated by NN-K converges at a quadratic rate in a specific interval. Numerical analyses compared the performances of NN-K with different choices of K for minimizing quadratic objectives. We observed that all NN-K methods work faster than distributed gradient descent in terms of both the number of iterations and the number of communications.

APPENDIX A
PROOF OF LEMMA 1

Consider two vectors y := [x_1; ...; x_n] ∈ R^{np} and ŷ := [x̂_1; ...; x̂_n] ∈ R^{np}. Based on the Hessian expression in (10), we simplify the Euclidean norm ‖H(y) − H(ŷ)‖ as

‖H(y) − H(ŷ)‖ = α ‖G(y) − G(ŷ)‖ = α max_{i=1,...,n} ‖∇²f_i(x_i) − ∇²f_i(x̂_i)‖. (48)

By using Assumption 3 and (48) we obtain that

‖H(y) − H(ŷ)‖ ≤ αL max_i ‖x_i − x̂_i‖ ≤ αL ‖y − ŷ‖. (49)

Therefore, the claim in (23) follows.

APPENDIX B
PROOF OF PROPOSITION 1

The Gershgorin circle theorem states that each eigenvalue of a matrix A lies within at least one of the Gershgorin discs D(a_ii, R_ii), where the center a_ii is the i-th diagonal element of A and the radius R_ii := Σ_{j≠i} |a_ij| is the sum of the absolute values of the non-diagonal elements of the i-th row. For the symmetric matrix I − W the Gershgorin discs can be considered as intervals [a_ii − R_ii, a_ii + R_ii], where a_ii = 1 − w_ii and R_ii = Σ_{j≠i} |−w_ij| = Σ_{j≠i} w_ij. Therefore, all the eigenvalues of I − W are in at least one of the intervals [1 − w_ii − Σ_{j≠i} w_ij, 1 − w_ii + Σ_{j≠i} w_ij]. Since Σ_j w_ij = 1, it can be derived that 1 − w_ii = Σ_{j≠i} w_ij. Thus, the Gershgorin intervals can be simplified to [0, 2(1 − w_ii)] for i = 1, ..., n. This observation, in association with the fact that 2(1 − w_ii) ≤ 2(1 − δ), implies that the eigenvalues of I − W are in the interval [0, 2(1 − δ)] and consequently the eigenvalues of I − Z are bounded as

0 ⪯ I − Z ⪯ 2(1 − δ) I. (50)

Since the matrix G_t is block diagonal and the eigenvalues of each diagonal block G_{ii,t} = ∇²f_i(x_{i,t}) are bounded by the constants 0 < m ≤ M < ∞ as mentioned in (21), we obtain

m I ⪯ G_t ⪯ M I. (51)

Considering the definition of the Hessian H_t := I − Z + α G_t and the bounds in (50) and (51), the first claim follows.

The definition of the matrix D_t in (12) yields

D_t = α G_t + 2(I_n − W_d) ⊗ I_p, (52)

where W_d is defined as W_d := diag(W). Note that the matrix I_n − W_d is diagonal and its i-th diagonal component is 1 − w_ii. Since the local weights satisfy δ ≤ w_ii ≤ Δ, we obtain that the eigenvalues of I_n − W_d are bounded below and above by 1 − Δ and 1 − δ, respectively. Since the eigenvalues of (I_n − W_d) and (I_n − W_d) ⊗ I_p are identical, we obtain

2(1 − Δ) I_{np} ⪯ 2(I_n − W_d) ⊗ I_p ⪯ 2(1 − δ) I_{np}. (53)

Considering the relation in (52) and the bounds in (51) and (53), the second claim follows.

Based on the definition of B in (13), we can write

B = (I − 2W_d + W) ⊗ I. (54)

Note that in the i-th row of the matrix I − 2W_d + W, the diagonal component is 1 − w_ii and the j-th component is w_ij for all j ≠ i. Using the Gershgorin theorem and the same argument that we established for the eigenvalues of I − Z, we can write

0 ⪯ I − 2W_d + W ⪯ 2(1 − δ) I. (55)

Based on (55) and (54), the last claim follows.

APPENDIX C
PROOF OF PROPOSITION 2

According to the result of Proposition 1, D_t is positive definite and B is positive semidefinite, which immediately implies that D_t^{-1/2} B D_t^{-1/2} is positive semidefinite. Recall the definition of D_t in (12) and define the matrix D̂ as a special case of the matrix D_t for α = 0, i.e., D̂ := 2(I − Z_d). Notice that D̂ is diagonal, time invariant, and only depends on the structure of the network. Since D̂ is diagonal and each diagonal component 2(1 − w_ii) is strictly larger than 0, D̂ is positive definite and invertible. Hence, we can write

D_t^{-1/2} B D_t^{-1/2} = ( D_t^{-1/2} D̂^{1/2} ) ( D̂^{-1/2} B D̂^{-1/2} ) ( D̂^{1/2} D_t^{-1/2} ). (56)
We proceed to find an upper bound for the eigenvalues of the matrix D̂^{-1/2} B D̂^{-1/2} in (56). Observing the fact that the matrices D̂^{-1/2} B D̂^{-1/2} and B D̂^{-1} are similar, the eigenvalues of these matrices are identical. Hence, we proceed to characterize an upper bound for the eigenvalues of the matrix B D̂^{-1}. Based on the definitions of B and D̂, the product B D̂^{-1} is given by B D̂^{-1} = (I − 2Z_d + Z)(2(I − Z_d))^{-1}. Therefore, the blocks of the matrix B D̂^{-1} are given by

[B D̂^{-1}]_ii = (1/2) I and [B D̂^{-1}]_ij = ( w_ij / (2(1 − w_jj)) ) I. (57)

Thus, each diagonal component of the matrix B D̂^{-1} is 1/2 and the sum of the non-diagonal components of column i is

Σ_{j=1, j≠i}^{np} [B D̂^{-1}]_{ji} = (1/2) Σ_{j=1, j≠i}^{np} ( w_ji / (1 − w_ii) ) = 1/2. (58)

Consider (58) and apply the Gershgorin theorem to obtain

0 ≤ μ_i(B D̂^{-1}) ≤ 1, i = 1, ..., n, (59)

where μ_i(B D̂^{-1}) indicates the i-th eigenvalue of the matrix B D̂^{-1}. The bounds in (59) and the similarity of the matrices B D̂^{-1} and D̂^{-1/2} B D̂^{-1/2} show that the eigenvalues of the matrix D̂^{-1/2} B D̂^{-1/2} are uniformly bounded in the interval

0 ≤ μ_i( D̂^{-1/2} B D̂^{-1/2} ) ≤ 1. (60)

Based on (56), to characterize the bounds for the eigenvalues of D_t^{-1/2} B D_t^{-1/2}, the bounds for the eigenvalues of the matrix D̂^{1/2} D_t^{-1/2} should be studied as well.

Notice that according to the definitions of D̂ and D_t, the product D̂^{1/2} D_t^{-1/2} is block diagonal and its i-th diagonal block is

[ D̂^{1/2} D_t^{-1/2} ]_ii = ( (α / (2(1 − w_ii))) ∇²f_i(x_{i,t}) + I )^{-1/2}. (61)

Observe that according to Assumption 2, the eigenvalues of the local Hessian matrices ∇²f_i(x_i) are bounded by m and M. Further notice that the diagonal elements w_ii of the weight matrix are bounded by δ and Δ, i.e., δ ≤ w_ii ≤ Δ. Considering these bounds we can show that the eigenvalues of the matrices (α/(2(1 − w_ii))) ∇²f_i(x_{i,t}) + I are lower and upper bounded as

[ αm / (2(1 − δ)) + 1 ] I ⪯ (α / (2(1 − w_ii))) ∇²f_i(x_{i,t}) + I ⪯ [ αM / (2(1 − Δ)) + 1 ] I. (62)

By considering the bounds in (62) and the expression in (61), the eigenvalues of the matrix D̂^{1/2} D_t^{-1/2} are bounded as

[ 2(1 − Δ) / (2(1 − Δ) + αM) ]^{1/2} ≤ μ_i( D̂^{1/2} D_t^{-1/2} ) ≤ [ 2(1 − δ) / (2(1 − δ) + αm) ]^{1/2}, (63)

for i = 1, ..., n. Observing the decomposition in (56), the norm of the matrix D_t^{-1/2} B D_t^{-1/2} is upper bounded as

‖D_t^{-1/2} B D_t^{-1/2}‖ ≤ ‖D̂^{1/2} D_t^{-1/2}‖² ‖D̂^{-1/2} B D̂^{-1/2}‖. (64)

Considering the symmetry of the matrices D̂^{1/2} D_t^{-1/2} and D̂^{-1/2} B D̂^{-1/2}, and the upper bounds for their eigenvalues in (63) and (60), respectively, we can substitute the norms of these two matrices by the upper bounds of their eigenvalues and simplify the upper bound in (64) to

‖D_t^{-1/2} B D_t^{-1/2}‖ ≤ 2(1 − δ) / (2(1 − δ) + αm). (65)

Since D_t^{-1/2} B D_t^{-1/2} is positive semidefinite and symmetric, the result in (27) follows.

APPENDIX D
PROOF OF PROPOSITION 3

In this proof and the rest of the proofs we denote the Hessian inverse approximation as Ĥ_t^{-1} instead of Ĥ_t^{(K)-1} for simplification of equations. To prove lower and upper bounds for the eigenvalues of the error matrix E_t, we first develop a simplification of the matrix I − H_t Ĥ_t^{-1} in the following lemma.

Lemma 4. Consider the NN-K method as defined in (12)-(17). The matrix I − H_t Ĥ_t^{-1} can be simplified as

I − H_t Ĥ_t^{-1} = ( B D_t^{-1} )^{K+1}. (66)

Proof: See Lemma 2 in [24].

Proof of Proposition 3: Recall the result in Proposition 2. Since the matrices D_t^{-1/2} B D_t^{-1/2} and B D_t^{-1} are similar (conjugate), the sets of eigenvalues of these two matrices are identical. Thus, the eigenvalues of B D_t^{-1} are bounded as

0 ≤ μ_i(B D_t^{-1}) ≤ ρ, (67)

for i = 1, 2, ..., np. This result, in association with (66), yields

0 ≤ μ_i( I − H_t Ĥ_t^{-1} ) ≤ ρ^{K+1}. (68)

Observe that the error matrix E_t = I − Ĥ_t^{-1/2} H_t Ĥ_t^{-1/2} is a conjugate of the matrix I − H_t Ĥ_t^{-1}. Hence, the bounds for the eigenvalues of the matrix I − H_t Ĥ_t^{-1} also hold for the eigenvalues of the error matrix E_t, and the claim in (29) follows.

APPENDIX E
PROOF OF LEMMA 2

Based on the Cauchy-Schwarz inequality, the norm of a product is smaller than the product of the norms. This observation and the definition of Ĥ_t^{-1} in (15) lead to

‖Ĥ_t^{-1}‖ ≤ ‖D_t^{-1/2}‖² ‖ I + D_t^{-1/2} B D_t^{-1/2} + ... + ( D_t^{-1/2} B D_t^{-1/2} )^K ‖. (69)

As a result of Proposition 1, the eigenvalues of D_t are bounded below by 2(1 − Δ) + αm. Thus, the maximum eigenvalue of the inverse D_t^{-1} is smaller than 1/(2(1 − Δ) + αm), and, therefore, the norm of the matrix D_t^{-1/2} is bounded above as

‖D_t^{-1/2}‖ ≤ [ 2(1 − Δ) + αm ]^{-1/2}. (70)

Based on the result in Proposition 2, the eigenvalues of D_t^{-1/2} B D_t^{-1/2} are smaller than ρ. Further, using the symmetry and positive semidefiniteness of D_t^{-1/2} B D_t^{-1/2} we obtain

‖D_t^{-1/2} B D_t^{-1/2}‖ ≤ ρ. (71)

Using the triangle inequality in (69) to claim that the norm of the sum is smaller than the sum of the norms, and substituting the bounds in (70) and (71) into the resulting expression, yields

‖Ĥ_t^{-1}‖ ≤ ( 1 / (2(1 − Δ) + αm) ) Σ_{k=0}^K ρ^k. (72)

Since ρ < 1, the sum Σ_{k=0}^K ρ^k can be simplified to (1 − ρ^{K+1})/(1 − ρ). Considering this simplification for the sum in (72), the upper bound in (30) for the eigenvalues of the approximate Hessian inverse Ĥ_t^{-1} follows.

In the expression (15), all the summands except the first one, D_t^{-1}, are positive semidefinite.
APPENDIX F
PROOF OF THEOREM 1

To prove global convergence of the Network Newton method we first introduce two technical lemmas. In the first lemma, we develop an upper bound for the objective function value F(y) using the first three terms of its Taylor expansion. In the second lemma, we construct an upper bound for the error F(y_{t+1}) - F(y*) in terms of F(y_t) - F(y*).

Lemma 5 Consider the function F(y) defined in (6). If Assumptions 2 and 3 hold, then for any y, ŷ ∈ R^{np}

F(ŷ) ≤ F(y) + ∇F(y)^T (ŷ - y) + (1/2)(ŷ - y)^T ∇²F(y)(ŷ - y) + (αL/6) ‖ŷ - y‖³.   (75)

Proof: The claim follows from the Lipschitz continuity of the Hessian with constant αL and Theorem 7.7 in [28], which characterizes the error of Taylor's expansion.

In the following lemma, we use the result in Lemma 5 to establish an upper bound for the error F(y_{t+1}) - F(y*).

Lemma 6 Consider the NN-K method as defined in (12)-(17). Further, recall the definition of y* as the optimal argument of the objective function F(y). If Assumptions 1-3 hold, then

F(y_{t+1}) - F(y*) ≤ [1 - (2ɛ - ɛ²) αmλ] [F(y_t) - F(y*)] + ( 2^{3/2} αLɛ³Λ³ / (6λ^{3/2}) ) [F(y_t) - F(y*)]^{3/2}.   (76)

Proof: By setting ŷ := y_{t+1} and y := y_t in (75) we obtain

F(y_{t+1}) ≤ F(y_t) + g_t^T (y_{t+1} - y_t) + (1/2)(y_{t+1} - y_t)^T H_t (y_{t+1} - y_t) + (αL/6) ‖y_{t+1} - y_t‖³,   (77)

where g_t := ∇F(y_t) and H_t := ∇²F(y_t). From the definition of the NN-K update in (16) we can write the difference of two consecutive variables as y_{t+1} - y_t = -ɛ Ĥ_t^{-1} g_t. Making this substitution into (77) implies

F(y_{t+1}) ≤ F(y_t) - ɛ g_t^T Ĥ_t^{-1} g_t + (ɛ²/2) g_t^T Ĥ_t^{-1} H_t Ĥ_t^{-1} g_t + (αLɛ³/6) ‖Ĥ_t^{-1} g_t‖³.   (78)

According to (28), we can substitute Ĥ_t^{-1/2} H_t Ĥ_t^{-1/2} in (78) by I - E_t, which leads to

F(y_{t+1}) ≤ F(y_t) - ɛ g_t^T Ĥ_t^{-1} g_t + (ɛ²/2) g_t^T Ĥ_t^{-1/2} (I - E_t) Ĥ_t^{-1/2} g_t + (αLɛ³/6) ‖Ĥ_t^{-1} g_t‖³.   (79)

Proposition 3 shows that E_t is positive semidefinite, and, therefore, the quadratic form g_t^T Ĥ_t^{-1/2} E_t Ĥ_t^{-1/2} g_t is nonnegative. Considering this lower bound we can simplify (79) to

F(y_{t+1}) ≤ F(y_t) - (ɛ - ɛ²/2) g_t^T Ĥ_t^{-1} g_t + (αLɛ³/6) ‖Ĥ_t^{-1} g_t‖³.   (80)

Since ɛ ≤ 1, we obtain that ɛ - ɛ²/2 is positive. Moreover, recall the result of Lemma 2 that all the eigenvalues of the Hessian inverse approximation Ĥ_t^{-1} are lower and upper bounded by λ and Λ, respectively. These two observations imply that we can replace the term g_t^T Ĥ_t^{-1} g_t by its lower bound λ‖g_t‖². Moreover, the existence of the upper bound Λ for the eigenvalues of the Hessian inverse approximation Ĥ_t^{-1} implies that the term ‖Ĥ_t^{-1} g_t‖³ is upper bounded by Λ³‖g_t‖³. Substituting these bounds for the second and third terms of (80) and subtracting F(y*) from both sides of inequality (80) leads to

F(y_{t+1}) - F(y*) ≤ F(y_t) - F(y*) - (ɛ - ɛ²/2) λ ‖g_t‖² + (αLɛ³Λ³/6) ‖g_t‖³.   (81)

Since the function F is strongly convex with constant αm we can write [see Eq. (9.9) in [23]]

F(y_t) - F(y*) ≤ (1/(2αm)) ‖∇F(y_t)‖².   (82)

Rearrange terms in (82) to obtain 2αm(F(y_t) - F(y*)) as a lower bound for ‖∇F(y_t)‖² = ‖g_t‖². Now substitute the lower bound 2αm(F(y_t) - F(y*)) for the squared norm of the gradient ‖g_t‖² in the second summand of (81) to obtain

F(y_{t+1}) - F(y*) ≤ [1 - (2ɛ - ɛ²) αmλ] (F(y_t) - F(y*)) + (αLɛ³Λ³/6) ‖g_t‖³.   (83)

Since the eigenvalues of the Hessian are upper bounded by 2(1 - δ) + αM, for any vectors ŷ and y in R^{np} we can write

F(y) ≤ F(ŷ) + ∇F(ŷ)^T (y - ŷ) + ( (2(1 - δ) + αM) / 2 ) ‖y - ŷ‖².   (84)

According to the definition of λ in (31), we can substitute 2(1 - δ) + αM by 1/λ. Implementing this substitution and minimizing both sides of the inequality with respect to y yields

F(y*) ≥ F(ŷ) - (λ/2) ‖∇F(ŷ)‖².   (85)

Setting ŷ = y_t, replacing ∇F(y_t) by g_t, and taking the square root of both sides of the resulting inequality yields

‖g_t‖ ≤ [ 2λ^{-1} (F(y_t) - F(y*)) ]^{1/2}.   (86)

Replace the norm of the gradient g_t in the last term of (83) by the upper bound in (86) to obtain (76).

Proof of Theorem 1: To simplify upcoming derivations define the sequence β_t as

β_t := (2 - ɛ) ɛ αmλ - ( 2^{3/2} ɛ³ αLΛ³ / (6λ^{3/2}) ) [F(y_t) - F(y*)]^{1/2}.   (87)

Recall the result of Lemma 6. Factorizing F(y_t) - F(y*) from the terms of the right hand side of (76), in association with the definition of β_t in (87), implies that we can simplify (76) as

F(y_{t+1}) - F(y*) ≤ (1 - β_t)(F(y_t) - F(y*)).   (88)
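Before establishing 0 < β_t < 1, it may help to see the recursion (87)-(88) in action. The following minimal sketch assumes illustrative stand-in values for αm, αL, λ, Λ and ɛ (chosen only so that β_0 ∈ (0, 1)); it iterates the scalar error bound and shows β_t increasing as the error shrinks, which is exactly the mechanism the rest of the proof formalizes.

# Minimal sketch of the recursion (87)-(88); am and aL stand for the
# products alpha*m and alpha*L, and all values are assumed illustrations.
am, aL, lam, Lam, eps = 0.5, 1.0, 0.4, 0.9, 0.3
e = 1.0                                          # e_t = F(y_t) - F(y*)
for t in range(10):
    beta = (2 - eps) * eps * am * lam \
        - (2 ** 1.5) * eps ** 3 * aL * Lam ** 3 * e ** 0.5 / (6 * lam ** 1.5)
    e *= 1 - beta                                # the bound (88)
    print(t, round(beta, 4), e)                  # beta_t grows, e_t decays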
It remains to show that for all time steps t, the constants β_t satisfy 0 < β_t < 1. We first show that β_t < 1 for all t ≥ 0. Based on (87) we can write

β_t ≤ (2 - ɛ) ɛ αmλ.   (89)

Considering (ɛ - 1)² ≥ 0 we have ɛ(2 - ɛ) ≤ 1. Further, by the inequalities m < M and 1 - δ > 0, we obtain αm < αM + 2(1 - δ). Thus, αm / (αM + 2(1 - δ)) < 1, which is equivalent to αmλ < 1. It follows from these inequalities that

(2 - ɛ) ɛ αmλ < 1.   (90)

That β_t < 1 follows by combining (89) with (90).

To prove that 0 < β_t for all t ≥ 0 we prove that this is true for t = 0 and then prove that the β_t sequence is increasing. According to (32), we can write

ɛ ≤ [ 3mλ^{5/2} / ( 2^{1/2} LΛ³ (F(y_0) - F(y*))^{1/2} ) ]^{1/2}.   (91)

By computing the squares of both sides of (91), multiplying the right hand side of the resulting inequality by 2 to make the inequality strict, and factorizing 2αmλ, we obtain

ɛ² < ( 6λ^{3/2} / ( 2^{3/2} αLΛ³ [F(y_0) - F(y*)]^{1/2} ) ) × 2αmλ.   (92)

If we now divide both sides of the inequality in (92) by the first multiplicand in the right hand side of (92) we obtain

2^{3/2} ɛ² αLΛ³ [F(y_0) - F(y*)]^{1/2} / (6λ^{3/2}) < 2αmλ.   (93)

Observe that based on the hypothesis in (32) the step size ɛ is smaller than 1, and it is then trivially true that ɛ ≤ 1. This observation shows that if we multiply the right hand side of (93) by (1 - ɛ/2) the inequality still holds,

2^{3/2} ɛ² αLΛ³ (F(y_0) - F(y*))^{1/2} / (6λ^{3/2}) < αm (2 - ɛ) λ.   (94)

Multiply both sides of (94) by ɛ and rearrange terms to obtain

(2 - ɛ) ɛ αmλ - ( 2^{3/2} ɛ³ αLΛ³ [F(y_0) - F(y*)]^{1/2} / (6λ^{3/2}) ) > 0.   (95)

Based on (87), the result in (95) yields β_0 > 0.

Observing that β_0 is positive, to show that the sequence β_t is positive for all t it is sufficient to prove that the sequence β_t is increasing. We use strong induction to prove β_t < β_{t+1} for all t ≥ 0. By setting t = 0 in (88) we obtain

F(y_1) - F(y*) ≤ (1 - β_0)(F(y_0) - F(y*)).   (96)

Considering the result in (96) and the fact that 0 < β_0 < 1, we obtain that the objective function error at time t = 1 is strictly smaller than the error at time t = 0, i.e.

F(y_1) - F(y*) < F(y_0) - F(y*).   (97)

According to (87), a smaller objective function error F(y_t) - F(y*) leads to a larger coefficient β_t. This observation combined with the result in (97) leads to

β_0 < β_1.   (98)

To complete the strong induction argument assume now that β_0 < β_1 < ... < β_{t-1} < β_t and proceed to prove that if this is true we must have β_t < β_{t+1}. Begin by observing that since 0 < β_0 the induction hypothesis implies that for all u ∈ {0, ..., t} the constant β_u is also positive, i.e., 0 < β_u. Further recall that for all t the sequence β_t is also smaller than 1, as already proved. Combining these two observations we have 0 < β_u < 1 for all u ∈ {0, ..., t}. Consider now the inequality in (88) and utilize the fact that 0 < β_u < 1 for all u ∈ {0, ..., t} to conclude that

F(y_{u+1}) - F(y*) < F(y_u) - F(y*),   (99)

for all u ∈ {0, ..., t}. Setting u = t in (99) we conclude that F(y_{t+1}) - F(y*) < F(y_t) - F(y*). By further repeating the argument leading from (97) to (98) we can conclude that

β_t < β_{t+1}.   (100)

The strong induction proof is complete and we can claim that

0 < β_0 < β_1 < ... < β_t < 1,   (101)

for all times t. The results in (88) and (101) imply lim_{t→∞} F(y_t) - F(y*) = 0. To conclude that the rate is at least linear, simply observe that if the sequence β_t is increasing as per (101), the sequence 1 - β_t is decreasing and satisfies

0 < 1 - β_t < 1 - β_0 < 1,   (102)

for all time steps t. Applying the inequality in (88) recursively and considering the inequality in (102) yields

F(y_t) - F(y*) ≤ (1 - β_0)^t (F(y_0) - F(y*)).   (103)

Considering ζ = β_0, the claim in (33) follows.
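To connect the rate (103) back to the algorithm itself, here is a compact end-to-end sketch. It assumes the same hypothetical 4-node weight matrix as in the earlier snippets, scalar decision variables (p = 1), quadratic local costs f_i(y_i) = (1/2)(y_i - a_i)², and, purely for illustration, ɛ = 1; it runs the NN-K update and prints F(y_t) - F(y*), which decays geometrically as Theorem 1 predicts.

# End-to-end sketch of the NN-K iteration on a toy decentralized quadratic;
# the weights, data a_i, and step size are assumed illustrative values.
import numpy as np

n, alpha, K, eps = 4, 0.1, 2, 1.0
W = np.array([[0.4, 0.3, 0.3, 0.0],
              [0.3, 0.4, 0.0, 0.3],
              [0.3, 0.0, 0.4, 0.3],
              [0.0, 0.3, 0.3, 0.4]])
I = np.eye(n); Zd = np.diag(np.diag(W))
a = np.array([1.0, 2.0, 3.0, 4.0])       # f_i(y_i) = 0.5 (y_i - a_i)^2
G = I                                    # every local Hessian equals 1
H = I - W + alpha * G                    # Hessian of F (constant for quadratics)
D = alpha * G + 2 * (I - Zd); B = I - 2 * Zd + W
Dm = np.linalg.inv(D)
Hinv = sum(np.linalg.matrix_power(Dm @ B, k) for k in range(K + 1)) @ Dm

F = lambda y: alpha * 0.5 * np.sum((y - a) ** 2) + 0.5 * y @ (I - W) @ y
grad = lambda y: alpha * (y - a) + (I - W) @ y
ystar = np.linalg.solve(H, alpha * a)    # minimizer of the penalized objective F
y = np.zeros(n)
for t in range(15):
    y = y - eps * Hinv @ grad(y)         # NN-K step (16)
    print(t, F(y) - F(ystar))            # decreases at a geometric rate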
APPENDIX G
PROOF OF LEMMA 3

To simplify notation we use Ĥ_t^{-1} to indicate the approximate Hessian inverse Ĥ_t(K)^{-1}. Based on Lemma 1.2.3 in [29], the Lipschitz continuity of the Hessians with constant αL yields

‖g_{t+1} - g_t + ɛ H_t Ĥ_t^{-1} g_t‖ ≤ (ɛ²αL/2) ‖Ĥ_t^{-1} g_t‖²,   (104)

where we have used y_{t+1} - y_t = -ɛ Ĥ_t^{-1} g_t. Based on the definition of the matrix norm we can write

‖D_t^{-1/2} [g_{t+1} - g_t + ɛ H_t Ĥ_t^{-1} g_t]‖ ≤ ‖D_t^{-1/2}‖ ‖g_{t+1} - g_t + ɛ H_t Ĥ_t^{-1} g_t‖.   (105)

Substituting ‖g_{t+1} - g_t + ɛ H_t Ĥ_t^{-1} g_t‖ in the right hand side of (105) by the upper bound in (104) leads to

‖D_t^{-1/2} [g_{t+1} - g_t + ɛ H_t Ĥ_t^{-1} g_t]‖ ≤ (ɛ²αL/2) ‖D_t^{-1/2}‖ ‖Ĥ_t^{-1} g_t‖².   (106)

Based on the triangle inequality, for any vectors a and b and a positive constant C, if the relation ‖a - b‖ ≤ C holds, then ‖a‖ ≤ ‖b‖ + C. Thus, we can use the result in (106) to write

‖D_t^{-1/2} g_{t+1}‖ ≤ ‖D_t^{-1/2} [g_t - ɛ H_t Ĥ_t^{-1} g_t]‖ + (ɛ²αL/2) ‖D_t^{-1/2}‖ ‖Ĥ_t^{-1} g_t‖².   (107)

Write D_t^{-1/2} [g_t - ɛ H_t Ĥ_t^{-1} g_t] as the sum (1 - ɛ)(D_t^{-1/2} g_t) + ɛ(D_t^{-1/2} [I - H_t Ĥ_t^{-1}] g_t) and use the triangle inequality to obtain

‖D_t^{-1/2} g_{t+1}‖ ≤ (1 - ɛ) ‖D_t^{-1/2} g_t‖ + ɛ ‖D_t^{-1/2} [I - H_t Ĥ_t^{-1}] g_t‖ + (ɛ²αL/2) ‖D_t^{-1/2}‖ ‖Ĥ_t^{-1} g_t‖².   (108)

Use the result in Lemma 4 to write

D_t^{-1/2} [I - H_t Ĥ_t^{-1}] g_t = [D_t^{-1/2} B D_t^{-1/2}]^{K+1} D_t^{-1/2} g_t.   (109)

The result in Proposition 2 implies that ‖[D_t^{-1/2} B D_t^{-1/2}]^{K+1}‖ ≤ ρ^{K+1}. Considering this upper bound and the simplification in (109) we can write

‖D_t^{-1/2} [I - H_t Ĥ_t^{-1}] g_t‖ ≤ ρ^{K+1} ‖D_t^{-1/2} g_t‖.   (110)

Substitute the upper bound in (110) into (108) and use the inequality ‖Ĥ_t^{-1} g_t‖ ≤ ‖Ĥ_t^{-1}‖ ‖g_t‖ to write

‖D_t^{-1/2} g_{t+1}‖ ≤ (1 - ɛ + ɛρ^{K+1}) ‖D_t^{-1/2} g_t‖ + (αɛ²L/2) ‖D_t^{-1/2}‖ ‖Ĥ_t^{-1}‖² ‖g_t‖².   (111)

To relate the weighted norm ‖D_t^{-1/2} g_t‖ to the norm weighted by D_{t-1}^{-1/2}, note that ‖D_t^{-1} - D_{t-1}^{-1}‖ is bounded above as

‖D_t^{-1} - D_{t-1}^{-1}‖ ≤ ‖D_t^{-1}‖ ‖D_t - D_{t-1}‖ ‖D_{t-1}^{-1}‖.   (112)

The eigenvalues of D_t and D_{t-1} are bounded below by αm + 2(1 - Δ). Thus, the eigenvalues of D_t^{-1} and D_{t-1}^{-1} are bounded above by 1/(αm + 2(1 - Δ)). Hence,

‖D_t^{-1} - D_{t-1}^{-1}‖ ≤ ‖D_t - D_{t-1}‖ / (2(1 - Δ) + αm)².   (113)

The difference D_t - D_{t-1} can be simplified as α(G_t - G_{t-1}). Moreover, H_t - H_{t-1} = α(G_t - G_{t-1}). Thus, D_t - D_{t-1} = H_t - H_{t-1}. This observation in conjunction with the Lipschitz continuity of the Hessians with parameter αL implies that

‖D_t - D_{t-1}‖ ≤ αL ‖y_t - y_{t-1}‖.   (114)

Replace ‖D_t - D_{t-1}‖ in (113) by the bound in (114) to obtain

‖D_t^{-1} - D_{t-1}^{-1}‖ ≤ αL ‖y_t - y_{t-1}‖ / (2(1 - Δ) + αm)².   (115)

Note that g_t^T (D_t^{-1} - D_{t-1}^{-1}) g_t is bounded above by ‖D_t^{-1} - D_{t-1}^{-1}‖ ‖g_t‖². Considering the upper bound for ‖D_t^{-1} - D_{t-1}^{-1}‖ in (115), the term g_t^T (D_t^{-1} - D_{t-1}^{-1}) g_t is bounded above by

g_t^T (D_t^{-1} - D_{t-1}^{-1}) g_t ≤ αL ‖y_t - y_{t-1}‖ ‖g_t‖² / (2(1 - Δ) + αm)².   (116)

Using the result in (116), and the simplifications g_t^T D_t^{-1} g_t = ‖D_t^{-1/2} g_t‖² and g_t^T D_{t-1}^{-1} g_t = ‖D_{t-1}^{-1/2} g_t‖², we can write

‖D_t^{-1/2} g_t‖² ≤ ‖D_{t-1}^{-1/2} g_t‖² + αL ‖y_t - y_{t-1}‖ ‖g_t‖² / (2(1 - Δ) + αm)².   (117)

For any constants a, b, and c, if a² ≤ b² + c holds, then a ≤ b + c^{1/2} holds. Using this result and (117) we obtain

‖D_t^{-1/2} g_t‖ ≤ ‖D_{t-1}^{-1/2} g_t‖ + ( (αL ‖y_t - y_{t-1}‖)^{1/2} / (2(1 - Δ) + αm) ) ‖g_t‖.   (118)

Considering the update in (17) we can substitute y_t - y_{t-1} by -ɛ Ĥ_{t-1}^{-1} g_{t-1}. Applying this substitution into (118) yields

‖D_t^{-1/2} g_t‖ ≤ ‖D_{t-1}^{-1/2} g_t‖ + ( [αɛL ‖Ĥ_{t-1}^{-1} g_{t-1}‖]^{1/2} / (2(1 - Δ) + αm) ) ‖g_t‖.   (119)

If we substitute ‖D_t^{-1/2} g_t‖ by the upper bound in (119) and substitute ‖Ĥ_{t-1}^{-1} g_{t-1}‖ by the upper bound ‖Ĥ_{t-1}^{-1}‖ ‖g_{t-1}‖, the inequality in (111) can be written as

‖D_t^{-1/2} g_{t+1}‖ ≤ (1 - ɛ + ɛρ^{K+1}) ‖D_{t-1}^{-1/2} g_t‖ + (1 - ɛ + ɛρ^{K+1}) ( [αɛL ‖Ĥ_{t-1}^{-1}‖ ‖g_{t-1}‖]^{1/2} / (2(1 - Δ) + αm) ) ‖g_t‖ + (αɛ²L/2) ‖D_t^{-1/2}‖ ‖Ĥ_t^{-1}‖² ‖g_t‖².   (120)

Note that µ_min(D_{t-1}^{-1/2}) ‖g_t‖ ≤ ‖D_{t-1}^{-1/2} g_t‖. Considering this inequality and the lower bound (2(1 - δ) + αM)^{-1/2} for the eigenvalues of D_{t-1}^{-1/2} we can write

‖g_t‖ ≤ (2(1 - δ) + αM)^{1/2} ‖D_{t-1}^{-1/2} g_t‖.   (121)

Substitute ‖g_t‖ by the upper bound in (121), use the definition λ := 1/(2(1 - δ) + αM), replace the norms ‖Ĥ_t^{-1}‖ and ‖Ĥ_{t-1}^{-1}‖ by their upper bound Λ, and use the fact that ‖D_t^{-1/2}‖ is bounded above by 1/(2(1 - Δ) + αm)^{1/2} to rewrite the right hand side of (120) as

‖D_t^{-1/2} g_{t+1}‖ ≤ (1 - ɛ + ɛρ^{K+1}) [1 + C_1 ‖g_{t-1}‖^{1/2}] ‖D_{t-1}^{-1/2} g_t‖ + ( αɛ²LΛ² / (2λ (2(1 - Δ) + αm)^{1/2}) ) ‖D_{t-1}^{-1/2} g_t‖²,   (122)

where C_1 := [ αɛLΛ / (λ (2(1 - Δ) + αm)²) ]^{1/2}. According to (31), we can substitute 1/(2(1 - δ) + αM) by λ. Applying this substitution into (84) and minimizing both sides of (84) with respect to y yields

F(y*) ≥ F(ŷ) - (λ/2) ‖∇F(ŷ)‖².   (123)

Since (123) holds for any ŷ, we set ŷ := y_{t-1}. By rearranging the terms and taking their square roots, we obtain an upper bound for the gradient norm ‖∇F(y_{t-1})‖ = ‖g_{t-1}‖ as

‖g_{t-1}‖ ≤ [ 2λ^{-1} [F(y_{t-1}) - F(y*)] ]^{1/2}.   (124)

The result in Theorem 1 and the relation in (124) allow us to show that ‖g_{t-1}‖^{1/2} is upper bounded by

‖g_{t-1}‖^{1/2} ≤ [ 2λ^{-1} (1 - ζ)^{t-1} (F(y_0) - F(y*)) ]^{1/4}.   (125)

Consider the definition of Γ_2 in (36) and substitute the upper bound in (125) for ‖g_{t-1}‖^{1/2} to update (122) as

‖D_t^{-1/2} g_{t+1}‖ ≤ (1 - ɛ + ɛρ^{K+1}) [ 1 + C_2 (1 - ζ)^{(t-1)/4} ] ‖D_{t-1}^{-1/2} g_t‖ + ɛ² Γ_2 ‖D_{t-1}^{-1/2} g_t‖²,   (126)

where C_2 := C_1 [2(F(y_0) - F(y*))/λ]^{1/4}. Based on the definitions of C_2 and Γ_1 we obtain that C_2 = Γ_1. This observation in association with (126) leads to the claim in (35).
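The recursion just established is easy to explore numerically. The following minimal sketch assumes illustrative stand-in values for ρ^{K+1}, ɛ, ζ, and the constants Γ_1, Γ_2 (they are placeholders, not the quantities defined in (36)); it iterates the scalar bound (126) and shows the weighted gradient norm contracting once it is small enough for the linear factor to dominate the quadratic term.

# Scalar sketch of the recursion (126); all constants are assumed
# illustrative stand-ins, chosen only to make the contraction visible.
rho_K1, eps, Gamma1, Gamma2, zeta = 0.3, 0.9, 0.5, 2.0, 0.05
u = 0.05                                       # u_t stands for ||D_{t-1}^{-1/2} g_t||
for t in range(1, 9):
    lin = (1 - eps + eps * rho_K1) * (1 + Gamma1 * (1 - zeta) ** ((t - 1) / 4))
    u = lin * u + eps ** 2 * Gamma2 * u ** 2   # the bound (126)
    print(t, u)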
REFERENCES

[1] A. Mokhtari, Q. Ling, and A. Ribeiro, "Network Newton," in Signals, Systems and Computers, 2014 48th Asilomar Conference on. IEEE, 2014.
[2] A. Mokhtari, Q. Ling, and A. Ribeiro, "An approximate Newton method for distributed optimization," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015.
[3] Y. Cao, W. Yu, W. Ren, and G. Chen, "An overview of recent progress in the study of distributed multi-agent coordination," IEEE Transactions on Industrial Informatics, vol. 9, no. 1, pp. 427-438, 2013.

[4] C. G. Lopes and A. H. Sayed, "Diffusion least-mean squares over adaptive networks: Formulation and performance analysis," IEEE Transactions on Signal Processing, vol. 56, no. 7, pp. 3122-3136, 2008.
[5] A. Ribeiro, "Ergodic stochastic optimization algorithms for wireless communication and networking," IEEE Transactions on Signal Processing, vol. 58, no. 12, pp. 6369-6386, 2010.
[6] M. G. Rabbat and R. D. Nowak, "Decentralized source localization and tracking [wireless sensor networks]," in Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP '04). IEEE International Conference on, vol. 3. IEEE, 2004.
[7] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, "Consensus in ad hoc WSNs with noisy links - Part I: Distributed estimation of deterministic signals," IEEE Transactions on Signal Processing, vol. 56, no. 1, pp. 350-364, 2008.
[8] U. A. Khan, S. Kar, and J. M. Moura, "DILAND: An algorithm for distributed sensor localization with noisy distance measurements," IEEE Transactions on Signal Processing, vol. 58, no. 3, 2010.
[9] M. Rabbat and R. Nowak, "Distributed optimization in sensor networks," in Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks. ACM, 2004, pp. 20-27.
[10] R. Bekkerman, M. Bilenko, and J. Langford, Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2011.
[11] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, "Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning," in Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, 2012, pp. 1543-1550.
[12] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, "Distributed GraphLab: A framework for machine learning and data mining in the cloud," Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 716-727, 2012.
[13] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48-61, 2009.
[14] D. Jakovetic, J. Xavier, and J. M. Moura, "Fast distributed gradient methods," IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131-1146, 2014.
[15] K. Yuan, Q. Ling, and W. Yin, "On the convergence of decentralized gradient descent," arXiv preprint arXiv:1310.7063, 2013.
[16] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," arXiv preprint arXiv:1404.6264, 2014.
[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2011.
[18] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, "On the linear convergence of the ADMM in decentralized consensus optimization," IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750-1761, 2014.
[19] T.-H. Chang, M. Hong, and X. Wang, "Multi-agent distributed optimization via inexact consensus ADMM," IEEE Transactions on Signal Processing, vol. 63, no. 2, pp. 482-497, 2015.
[20] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, "DQM: Decentralized quadratically approximated alternating direction method of multipliers," IEEE Transactions on Signal Processing, vol. 64, no. 19, pp. 5158-5173, Oct. 2016.
[21] J. C. Duchi, A. Agarwal, and M. J. Wainwright, "Dual averaging for distributed optimization: Convergence analysis and network scaling," IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592-606, 2012.
[22] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, "Push-sum distributed dual averaging for convex optimization," in Proc. IEEE Conference on Decision and Control (CDC), 2012.
[23] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[24] M. Zargham, A. Ribeiro, A. Ozdaglar, and A. Jadbabaie, "Accelerated dual descent for network flow optimization," IEEE Transactions on Automatic Control, vol. 59, no. 4, pp. 905-920, 2014.
[25] E. Wei, A. Ozdaglar, and A. Jadbabaie, "A distributed Newton method for network utility maximization - I: Algorithm," IEEE Transactions on Automatic Control, vol. 58, no. 9, pp. 2162-2175, 2013.
[26] D. Jakovetic, J. M. Moura, and J. Xavier, "Distributed Nesterov-like gradient algorithms," in Decision and Control (CDC), 2012 IEEE 51st Annual Conference on. IEEE, 2012.
[27] A. Mokhtari and A. Ribeiro, "DSA: Decentralized double stochastic averaging gradient algorithm," Journal of Machine Learning Research, vol. 17, no. 61, pp. 1-35, 2016.
[28] T. M. Apostol, Calculus, Volume I. John Wiley & Sons, 2007.
[29] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013, vol. 87.

Aryan Mokhtari received the B.Sc. degree in electrical engineering from Sharif University of Technology, Tehran, Iran, in 2011, and the M.S. degree in electrical engineering from the University of Pennsylvania, Philadelphia, PA, in 2014. Since 2012, he has been working towards the Ph.D. degree in the Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA. From June to August 2010, he was an intern at the Advanced Digital Sciences Center, Singapore, Singapore. He was a research intern with the Bigdata Machine Learning group at Yahoo!, Sunnyvale, CA, from June to August 2016. His research interests lie in the areas of optimization, machine learning, control, and signal processing. His current research focuses on developing methods for large-scale optimization problems.

Qing Ling received the B.E. degree in automation and the Ph.D. degree in control theory and control engineering from the University of Science and Technology of China in 2001 and 2006, respectively. From 2006 to 2009, he was a Post-Doctoral Research Fellow in the Department of Electrical and Computer Engineering, Michigan Technological University. Since 2009, he has been an Associate Professor in the Department of Automation, University of Science and Technology of China. His current research focuses on decentralized optimization of networked multi-agent systems.

Alejandro Ribeiro received the B.Sc. degree in electrical engineering from the Universidad de la Republica Oriental del Uruguay, Montevideo, in 1998 and the M.Sc. and Ph.D. degrees in electrical engineering from the Department of Electrical and Computer Engineering, the University of Minnesota, Minneapolis, in 2005 and 2007. From 1998 to 2003, he was a member of the technical staff at Bellsouth Montevideo. After his M.Sc. and Ph.D. studies, in 2008 he joined the University of Pennsylvania, Philadelphia, where he is currently the Rosenbluth Associate Professor at the Department of Electrical and Systems Engineering. His research interests are in the applications of statistical signal processing to the study of networks and networked phenomena. His focus is on structured representations of networked data structures, graph signal processing, network optimization, robot teams, and networked control. Dr. Ribeiro received the 2014 O. Hugo Schuck best paper award, the 2012 S. Reid Warren, Jr. Award presented by Penn's undergraduate student body for outstanding teaching, the NSF CAREER Award in 2010, and paper awards at the 2016 SSP Workshop, 2016 SAM Workshop, 2015 Asilomar SSC Conference, ACC 2013, ICASSP 2006, and ICASSP 2005. Dr. Ribeiro is a Fulbright scholar and a Penn Fellow.
