Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology of China
1 Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology of China. Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, November 4, 2014. Mokhtari, Ling, Ribeiro, Network Newton, 1
2 Distributed optimization
Network with n nodes, where each node i has access to a local function f_i(x). The nodes collaborate to minimize the global objective
f(x) = Σ_{i=1}^n f_i(x)
e.g., sampling subsets of a data set to train a classifier.
[Figure: 10-node network, node i holding f_i(x)]
Nodes can operate (train or estimate) locally but would benefit from sharing. The cost of aggregating all functions at a central location is large, in both communication and computation. Instead, recursive exchanges with neighbors j ∈ N_i aggregate global information.
3 Methods for distributed optimization
Replicate the common variable at each node,
f(x_1, ..., x_n) = Σ_{i=1}^n f_i(x_i)
and enforce equality between neighbors, x_i = x_j (and thus between all nodes).
[Figure: 10-node network, node i holding f_i(x_i)]
Methods operate recursively to enforce equality asymptotically; they differ in how:
- Distributed gradient descent, recursive averaging [Nedic, Ozdaglar 09]
- Distributed dual descent, prices [Rabbat et al 05]
- Distributed ADMM, prices [Schizas et al 08]
All are first-order methods, so convergence times are not always reasonable.
5 (Approximate) Network Newton (NN)
Reinterpret distributed gradient descent (DGD) as a penalty method. The Newton step for the objective plus penalty requires global coordination, so approximate it with local operations by truncating the Taylor series of the Hessian inverse. The Hessian is neighbor sparse, and the kth term of the series is k-hop neighbor sparse, so NN-K aggregates information from the K-hop neighborhood.
[Figure: the 1-, 2-, and 3-hop neighborhoods (NN-1, NN-2, NN-3) of a node in the 10-node example network]
NN-K always converges linearly and exhibits a quadratic phase in a given range.
9 Decentralized Gradient Descent (DGD)
Problem in distributed form:
min_{x_1,...,x_n} Σ_{i=1}^n f_i(x_i)   s.t. x_i = x_j for j ∈ N_i
With nonnegative doubly stochastic weights W = [w_ij], the DGD update at node i is
x_{i,t+1} = Σ_{j ∈ N_i ∪ {i}} w_ij x_{j,t} − α ∇f_i(x_{i,t})
i.e., an average of the local and neighboring variables plus a local gradient descent step.
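As a concrete illustration, here is a minimal DGD sketch in Python; the ring graph, quadratic local costs, and all names are assumptions made for the example, not part of the slides:

```python
import numpy as np

# Minimal DGD sketch (illustrative). Each node i holds a scalar variable
# x_i and a quadratic local cost f_i(x) = (a_i/2) x^2 - b_i x, so the
# local gradient is grad f_i(x) = a_i x - b_i.

rng = np.random.default_rng(0)
n = 10
a = rng.uniform(1.0, 2.0, n)          # local curvatures
b = rng.uniform(-1.0, 1.0, n)         # local linear terms
x_star = b.sum() / a.sum()            # minimizer of sum_i f_i(x)

# Doubly stochastic weights on a ring: average with the two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

alpha = 0.01
x = np.zeros(n)
for t in range(2000):
    grad = a * x - b                  # local gradients, computed in parallel
    x = W @ x - alpha * grad          # DGD: weighted average + gradient step

# For small alpha the iterates cluster near (not exactly at) the optimum.
print(np.max(np.abs(x - x_star)))
```

The residual error illustrates the penalty-method bias discussed two slides ahead: DGD converges to a point in an O(α) neighborhood of the optimum, not to the optimum itself.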
10 Decentralized Gradient Descent (DGD)
Rewrite DGD in vector form with the aggregate variable y := [x_1; ...; x_n]:
y_{t+1} = W y_t − α h(y_t)
where the block weight matrix is W := W ⊗ I (each weight w_ij scales an identity block) and the stacked gradient is h(y) := [∇f_1(x_1); ...; ∇f_n(x_n)]. Reordering terms, the vector form of DGD is
y_{t+1} = y_t − [(I − W) y_t + α h(y_t)]
i.e., gradient descent (with unit stepsize) on the function
F(y) := (1/2) yᵀ (I − W) y + α Σ_{i=1}^n f_i(x_i)
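The reordering above is a one-line identity; a small numerical check (with hypothetical scalar quadratic costs and a path-graph W, so W ⊗ I reduces to W) confirms that the DGD update coincides with a unit-step gradient step on F:

```python
import numpy as np

# Sketch (illustrative): verify that y+ = W y - alpha*h(y) equals the
# unit-step gradient descent y+ = y - [(I - W) y + alpha*h(y)] on
# F(y) = 0.5 * y^T (I - W) y + alpha * sum_i f_i(x_i), scalar case.

rng = np.random.default_rng(1)
n = 5
# Symmetric doubly stochastic weights on a path graph (hypothetical example).
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 0.25
W += np.diag(1.0 - W.sum(axis=1))

a = rng.uniform(1.0, 2.0, n)                    # local curvatures
b = rng.uniform(-1.0, 1.0, n)                   # local linear terms
alpha = 0.1
y = rng.standard_normal(n)

h = a * y - b                                   # stacked local gradients
dgd_step = W @ y - alpha * h                    # DGD form of the update
grad_F = (np.eye(n) - W) @ y + alpha * h        # gradient of the penalty function F
gd_step = y - grad_F                            # unit-step gradient descent on F

print(np.allclose(dgd_step, gd_step))           # the two updates coincide
```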
11 DGD as a penalty method (reconsidering the mystery of DGD)
Why do gradient descent on F(y) := (1/2) yᵀ (I − W) y + α Σ_{i=1}^n f_i(x_i)?
The weight matrix W is constructed so that null(I − W) = span(1). Thus null(I − W ⊗ I) = span(1 ⊗ I), and (I − W ⊗ I) y = 0 if and only if x_i = x_j for all i, j. The same is true of (I − W ⊗ I)^{1/2}, so the problem in distributed form is equivalent to
min_y Σ_{i=1}^n f_i(x_i)   s.t. (I − W ⊗ I)^{1/2} y = 0
DGD is a penalty method for solving this (equivalent) problem, with squared norm penalty (1/2) ‖(I − W ⊗ I)^{1/2} y‖² and penalty coefficient 1/α. It converges to the wrong solution, though not far from the right one if α is small. Since gradient descent on F(y) works, why not use Newton steps on F(y)?
12 Newton method for the penalized objective function
Penalized objective function: F(y) = (1/2) yᵀ (I − W) y + α Σ_{i=1}^n f_i(x_i)
To implement Newton's method on F(y) we need the Hessians
H_t = I − W + α G_t
where G_t is block diagonal with blocks G_{ii,t} = ∇²f_i(x_{i,t}). The Hessian H_t has the sparsity pattern of W, which is the sparsity pattern of the graph, so it can be computed with local information plus exchanges with neighbors. The Newton step, however, depends on the Hessian inverse, d_t := −H_t⁻¹ g_t, and the inverse of H_t is, in general, neither block sparse nor locally computable.
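The contrast between a sparse Hessian and a dense inverse can be checked numerically. The sketch below (the path graph and local Hessian values are illustrative assumptions) builds H = I − W + αG and compares the sparsity patterns of H and H⁻¹:

```python
import numpy as np

# Sketch (illustrative): the penalized Hessian H = I - W + alpha*G has
# the sparsity pattern of the graph, but its inverse is dense in general.

n = 6
# Path graph weights (hypothetical): neighbors get 0.3, rest on the diagonal.
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 0.3
W += np.diag(1.0 - W.sum(axis=1))

alpha = 0.1
G = np.diag(np.linspace(1.0, 2.0, n))   # local Hessians (scalar case)
H = np.eye(n) - W + alpha * G

sparsity_H = np.count_nonzero(np.abs(H) > 1e-12)
sparsity_Hinv = np.count_nonzero(np.abs(np.linalg.inv(H)) > 1e-12)
print(sparsity_H, sparsity_Hinv)        # H is tridiagonal; inv(H) has no zero entries
```

On a path graph H is tridiagonal (3n − 2 nonzeros), while H⁻¹ is fully dense, which is why the exact Newton step cannot be computed by local exchanges.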
13 Network Newton Hessian approximation
Define the block diagonal matrix D_t := α G_t + 2(I − diag(W)) and the block graph-sparse matrix B := I − 2 diag(W) + W. Then the Hessian splits as
H_t = D_t − B = D_t^{1/2} ( I − D_t^{−1/2} B D_t^{−1/2} ) D_t^{1/2}
Using the Taylor series (I − X)⁻¹ = Σ_{k=0}^∞ Xᵏ, write the Hessian inverse as
H_t⁻¹ = D_t^{−1/2} Σ_{k=0}^∞ ( D_t^{−1/2} B D_t^{−1/2} )ᵏ D_t^{−1/2}
Define the NN-K step d_t^{(K)} := −Ĥ_t^{(K)−1} g_t by truncating the sum at the Kth term:
Ĥ_t^{(K)−1} := D_t^{−1/2} Σ_{k=0}^K ( D_t^{−1/2} B D_t^{−1/2} )ᵏ D_t^{−1/2}
Since D_t^{−1/2} B D_t^{−1/2} is graph sparse, its kth power is k-hop neighborhood sparse.
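The split and the truncated series can be sketched directly; the scalar variables, ring graph, and random local Hessians below are illustrative assumptions. The approximate NN-K direction approaches the exact Newton direction as K grows:

```python
import numpy as np

# Sketch (illustrative). Split H = D - B with D = alpha*G + 2*(I - diag(W))
# and B = I - 2*diag(W) + W, then approximate the Newton step with
# Hhat_K^{-1} = D^{-1/2} * sum_{k=0}^K (D^{-1/2} B D^{-1/2})^k * D^{-1/2}.

rng = np.random.default_rng(2)
n = 8
W = np.zeros((n, n))
for i in range(n):                      # doubly stochastic ring weights
    W[i, i] = 0.5
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

alpha = 0.1
G = np.diag(rng.uniform(1.0, 2.0, n))   # local Hessians (scalar case)
H = np.eye(n) - W + alpha * G
D = alpha * G + 2.0 * (np.eye(n) - np.diag(np.diag(W)))
B = np.eye(n) - 2.0 * np.diag(np.diag(W)) + W
assert np.allclose(H, D - B)            # the split recovers H exactly

g = rng.standard_normal(n)              # stacked gradient
d_exact = -np.linalg.solve(H, g)        # exact Newton step

Dq = np.diag(np.diag(D) ** -0.5)        # D^{-1/2}
X = Dq @ B @ Dq                         # graph sparse, spectral radius < 1
errs = {}
for K in (0, 1, 5, 20):
    S = sum(np.linalg.matrix_power(X, k) for k in range(K + 1))
    d_K = -(Dq @ S @ Dq) @ g            # NN-K step: series truncated at k = K
    errs[K] = np.linalg.norm(d_K - d_exact)
print(errs)                             # error shrinks as K grows
```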
14 Distributed computation of the NN-K step
Recursion for the NN-k steps: define d_t^{(0)} := −D_t⁻¹ g_t, and for all other k
d_t^{(k+1)} = D_t⁻¹ B d_t^{(k)} − D_t⁻¹ g_t
Given that D_t is block diagonal, the recursion can be rewritten componentwise:
d_{i,t}^{(k+1)} = D_{ii,t}⁻¹ Σ_{j=1}^n B_{ij} d_{j,t}^{(k)} − D_{ii,t}⁻¹ g_{i,t}
But B is graph sparse, so B_{ij} = 0 unless j = i or i and j are neighbors:
d_{i,t}^{(k+1)} = D_{ii,t}⁻¹ Σ_{j ∈ N_i ∪ {i}} B_{ij} d_{j,t}^{(k)} − D_{ii,t}⁻¹ g_{i,t}
The local piece of the NN-(k+1) step is thus computed as a function of local matrices, local gradient components, the local piece of the NN-k step, and the pieces of the NN-k step at neighboring nodes, which can be exchanged.
15 NN-K Algorithm at node i
(0) Initialize at x_{i,0}. Repeat for times t = 0, 1, ...
(1) Exchange local iterates x_{i,t} with neighboring nodes j ∈ N_i.
(2) Compute the local gradient component
g_{i,t} = (1 − w_ii) x_{i,t} − Σ_{j ∈ N_i} w_ij x_{j,t} + α ∇f_i(x_{i,t})
(3) Initialize the NN step computation with the NN-0 step d_{i,t}^{(0)} = −D_{ii,t}⁻¹ g_{i,t}
(4) Repeat for k = 0, 1, ..., K − 1:
(5) Exchange the local elements d_{i,t}^{(k)} of the NN-k step with neighbors j ∈ N_i.
(6) Compute the local component of the NN-(k+1) step
d_{i,t}^{(k+1)} = D_{ii,t}⁻¹ Σ_{j ∈ N_i ∪ {i}} B_{ij} d_{j,t}^{(k)} − D_{ii,t}⁻¹ g_{i,t}
(7) Update the local iterate: x_{i,t+1} = x_{i,t} + ε d_{i,t}^{(K)}
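The steps above can be sketched as follows; the ring graph, quadratic local costs, and parameter values are illustrative assumptions, and each node uses only its own data plus values exchanged with its neighbors:

```python
import numpy as np

# Sketch of the per-node NN-K iteration (illustrative): scalar variables,
# quadratic local costs f_i(x) = (a_i/2) x^2 - b_i x, ring graph.

rng = np.random.default_rng(3)
n, K, alpha, eps = 10, 2, 0.01, 1.0
a = rng.uniform(1.0, 2.0, n)                             # local curvatures
b = rng.uniform(-1.0, 1.0, n)                            # local linear terms
nbrs = [((i - 1) % n, (i + 1) % n) for i in range(n)]    # ring neighborhoods
w_self, w_nbr = 0.5, 0.25                                # doubly stochastic weights

x = np.zeros(n)
for t in range(200):
    # (2) local gradient components of the penalized objective
    g = np.array([(1 - w_self) * x[i]
                  - sum(w_nbr * x[j] for j in nbrs[i])
                  + alpha * (a[i] * x[i] - b[i]) for i in range(n)])
    # D_ii = alpha * f_i'' + 2 (1 - w_ii);  B_ii = 1 - w_ii, B_ij = w_ij
    D = alpha * a + 2 * (1 - w_self)
    # (3) NN-0 step, then (4)-(6) K rounds of neighbor exchanges
    d = -g / D
    for _ in range(K):
        d = np.array([((1 - w_self) * d[i]
                       + sum(w_nbr * d[j] for j in nbrs[i])) / D[i]
                      - g[i] / D[i] for i in range(n)])
    # (7) local update
    x = x + eps * d

x_star = b.sum() / a.sum()          # minimizer of the original problem
print(np.max(np.abs(x - x_star)))   # small but nonzero: penalty-method bias
```

As with DGD, the iterates settle near the minimizer of the penalized objective, so a small residual bias of order α remains; only K extra exchange rounds per iteration are needed beyond DGD's one.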
16 Assumptions
Assumption 1: The local objective functions f_i(x) are twice differentiable, and the Hessians ∇²f_i(x) have bounded eigenvalues, mI ⪯ ∇²f_i(x) ⪯ MI.
Assumption 2: The local Hessians are Lipschitz continuous, ‖∇²f_i(x) − ∇²f_i(x̂)‖ ≤ L ‖x − x̂‖.
Assumption 3: The local weights w_ii are bounded, 0 < δ ≤ w_ii < 1 for i = 1, ..., n. The upper bound is implied by the connectivity condition.
17 Linear convergence of NN-K
Theorem: For a specific choice of stepsize ε, the sequence F(y_t) converges to the optimal value F(y*) at least linearly with constant 0 < 1 − ζ_α < 1, i.e.,
F(y_t) − F(y*) ≤ (1 − ζ_α)^t (F(y_0) − F(y*))
where ε is the minimum of 1 and a constant depending on problem parameters.
There is a trade-off between convergence rate and accuracy: larger α yields faster convergence, while smaller choices of α yield more accurate convergence.
18 Superlinear convergence lemma
Lemma: For specific values of Γ_1 and Γ_2, the sequence of weighted gradient norms satisfies
‖D_t^{1/2} g_{t+1}‖ ≤ (1 − ε + ε ρ^{K+1}) [1 + Γ_1 (1 − ζ_α)^{(t−1)/4}] ‖D_{t−1}^{1/2} g_t‖ + ε² Γ_2 ‖D_{t−1}^{1/2} g_t‖²
where ρ < 1. So ‖D_t^{1/2} g_{t+1}‖ is upper bounded by linear and quadratic terms of ‖D_{t−1}^{1/2} g_t‖, similar to the convergence analysis of Newton's method with constant stepsize. For t large enough, Γ_1 (1 − ζ_α)^{(t−1)/4} ≈ 0, so there must be intervals in which the quadratic term dominates the linear term; the rate of convergence is quadratic in those intervals.
19 Quadratic phase of NN-K convergence
Theorem: For η_t := (1 − ε + ε ρ^{K+1}) [1 + Γ_1 (1 − ζ)^{(t−1)/4}] and t_0 := argmin_t {t : η_t < 1}, we have for all t ≥ t_0 that if
√(η_t (1 − η_t)) / (ε² Γ_2) < ‖D_{t−1}^{1/2} g_t‖ < (1 − η_t) / (ε² Γ_2)
then
‖D_t^{1/2} g_{t+1}‖ ≤ (ε² Γ_2 / (1 − η_t)) ‖D_{t−1}^{1/2} g_t‖²
i.e., ‖D_{t−1}^{1/2} g_t‖ converges quadratically in the specified interval.
20 Numerical results
Convergence path for f(x) := Σ_{i=1}^{100} xᵀ A_i x / 2 + b_iᵀ x, with condition number 10³, α = 10⁻², a d-regular graph with d = 4, and ε = 1. The error is the average relative distance to the optimum, e_t = (1/n) Σ_{i=1}^n ‖x_{i,t} − x*‖ / ‖x*‖.
[Figure: error versus number of iterations for DGD, NN-0, NN-1, and NN-2]
DGD is slower than all versions of NN-K, and NN-K with larger K converges faster in terms of the number of iterations.
21 Numerical results
Number of information exchanges required to achieve accuracy e = 10⁻², for n = 100, d ∈ {4, ..., 10}, and condition numbers in {10², 10³, 10⁴}.
[Figure: empirical distributions of the number of information exchanges for DGD, NN-0, NN-1, and NN-2]
The different versions of NN-K perform almost identically, while DGD is slower than all versions of NN-K by an order of magnitude.
22 Numerical results
Convergence of NN-K and DGD with decreasing α: divide α by 10 each time the algorithm has converged.
[Figure: error versus number of iterations for DGD, NN-0, NN-1, and NN-2, for two initial values of the penalty coefficient; panel (b) uses α_0 = 10⁻¹]
Exact convergence is achieved by decreasing α, and a larger initial value of α leads to faster convergence for both DGD and NN-K.
23 Conclusions
We introduced a network optimization formulation in which each agent has a local cost function f_i and the global cost is f = Σ_{i=1}^n f_i. Network Newton was proposed as a second-order distributed method that approximates the Newton step by truncating the Taylor series of the Hessian inverse. Linear convergence was established, and quadratic convergence in a specific interval was shown. According to the numerical results, NN converges faster than DGD.
More informationConvex Optimization. 9. Unconstrained minimization. Prof. Ying Cui. Department of Electrical Engineering Shanghai Jiao Tong University
Convex Optimization 9. Unconstrained minimization Prof. Ying Cui Department of Electrical Engineering Shanghai Jiao Tong University 2017 Autumn Semester SJTU Ying Cui 1 / 40 Outline Unconstrained minimization
More informationShiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method
Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 3 Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 2 3.1. Gradient method Classical gradient method: to minimize a differentiable convex
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationAccelerated Block-Coordinate Relaxation for Regularized Optimization
Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth
More informationA Second-Order Method for Strongly Convex l 1 -Regularization Problems
Noname manuscript No. (will be inserted by the editor) A Second-Order Method for Strongly Convex l 1 -Regularization Problems Kimon Fountoulakis and Jacek Gondzio Technical Report ERGO-13-11 June, 13 Abstract
More informationarxiv: v2 [cs.lg] 8 Nov 2018
An Exact Quantized Decentralized Gradient Descent Algorithm Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ramtin Pedarsani arxiv:806.536v cs.lg] 8 Nov 08 Abstract We consider the problem of decentralized
More informationLECTURE 22: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION
15-382 COLLECTIVE INTELLIGENCE - S19 LECTURE 22: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION TEACHER: GIANNI A. DI CARO WHAT IF WE HAVE ONE SINGLE AGENT PSO leverages the presence of a swarm: the outcome
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationInverse problems Total Variation Regularization Mark van Kraaij Casa seminar 23 May 2007 Technische Universiteit Eindh ove n University of Technology
Inverse problems Total Variation Regularization Mark van Kraaij Casa seminar 23 May 27 Introduction Fredholm first kind integral equation of convolution type in one space dimension: g(x) = 1 k(x x )f(x
More informationDistributed Optimization over Networks Gossip-Based Algorithms
Distributed Optimization over Networks Gossip-Based Algorithms Angelia Nedić angelia@illinois.edu ISE Department and Coordinated Science Laboratory University of Illinois at Urbana-Champaign Outline Random
More informationOn the Linear Convergence of Distributed Optimization over Directed Graphs
1 On the Linear Convergence of Distributed Optimization over Directed Graphs Chenguang Xi, and Usman A. Khan arxiv:1510.0149v1 [math.oc] 7 Oct 015 Abstract This paper develops a fast distributed algorithm,
More informationSimple Iteration, cont d
Jim Lambers MAT 772 Fall Semester 2010-11 Lecture 2 Notes These notes correspond to Section 1.2 in the text. Simple Iteration, cont d In general, nonlinear equations cannot be solved in a finite sequence
More information5 Quasi-Newton Methods
Unconstrained Convex Optimization 26 5 Quasi-Newton Methods If the Hessian is unavailable... Notation: H = Hessian matrix. B is the approximation of H. C is the approximation of H 1. Problem: Solve min
More informationParallel Coordinate Optimization
1 / 38 Parallel Coordinate Optimization Julie Nutini MLRG - Spring Term March 6 th, 2018 2 / 38 Contours of a function F : IR 2 IR. Goal: Find the minimizer of F. Coordinate Descent in 2D Contours of a
More informationarxiv: v3 [math.oc] 1 Jul 2015
On the Convergence of Decentralized Gradient Descent Kun Yuan Qing Ling Wotao Yin arxiv:1310.7063v3 [math.oc] 1 Jul 015 Abstract Consider the consensus problem of minimizing f(x) = n fi(x), where x Rp
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationTrade-Offs in Distributed Learning and Optimization
Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed
More informationAdaptive Piecewise Polynomial Estimation via Trend Filtering
Adaptive Piecewise Polynomial Estimation via Trend Filtering Liubo Li, ShanShan Tu The Ohio State University li.2201@osu.edu, tu.162@osu.edu October 1, 2015 Liubo Li, ShanShan Tu (OSU) Trend Filtering
More informationSparse Gaussian conditional random fields
Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian
More informationNonlinear Programming
Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week
More informationLecture 17: October 27
0-725/36-725: Convex Optimiation Fall 205 Lecturer: Ryan Tibshirani Lecture 7: October 27 Scribes: Brandon Amos, Gines Hidalgo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationOptimization Methods for Machine Learning
Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction
More informationIterative Methods. Splitting Methods
Iterative Methods Splitting Methods 1 Direct Methods Solving Ax = b using direct methods. Gaussian elimination (using LU decomposition) Variants of LU, including Crout and Doolittle Other decomposition
More information