Distributed online optimization over jointly connected digraphs
Distributed online optimization over jointly connected digraphs
David Mateos-Núñez, Jorge Cortés
University of California, San Diego
Southern California Optimization Day, UC San Diego, May 23
Overview: Distributed online optimization
- Distributed optimization
- Online optimization
- Case study: medical diagnosis
Distributed online optimization

Why distributed?
- information is distributed across a group of agents
- agents need to interact to optimize performance

Why online?
- information becomes incrementally available
- need an adaptive solution
Machine learning in healthcare

Medical findings, symptoms:
- age factor
- amnesia before impact
- deterioration in GCS score
- open skull fracture
- loss of consciousness
- vomiting

Any acute brain finding revealed on Computerized Tomography? (-1 = not present, 1 = present)

The Canadian CT Head Rule for patients with minor head injury
Binary classification
- feature vector of patient $s$: $w^s = ((w^s)_1, \ldots, (w^s)_{d-1})$
- true diagnosis: $y^s \in \{-1, 1\}$
- wanted weights: $x = (x_1, \ldots, x_d)$
- predictor: $h(x, w^s) = x^\top (w^s, 1)$
- margin: $m^s(x) = y^s\, h(x, w^s)$

Given the data set $\{w^s\}_{s=1}^{P}$, estimate $x \in \mathbb{R}^d$ by solving
$$\min_{x \in \mathbb{R}^d} f(x) = \min_{x \in \mathbb{R}^d} \sum_{s=1}^{P} l(m^s(x)),$$
where the loss function $l : \mathbb{R} \to \mathbb{R}$ is decreasing and convex.
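To make this objective concrete, here is a minimal Python sketch; the logistic-type loss $l(m) = \log(1 + e^{-2m})$, which reappears in the simulations later, stands in for any decreasing convex loss, and the random data is purely illustrative.

```python
import numpy as np

# A minimal sketch of the classification objective above.

def margin(x, w_s, y_s):
    """m^s(x) = y^s * h(x, w^s), with predictor h(x, w) = x^T (w, 1)."""
    return y_s * np.dot(x, np.append(w_s, 1.0))

def loss(m):
    """Decreasing convex loss l(m) = log(1 + exp(-2m))."""
    return np.log1p(np.exp(-2.0 * m))

def f(x, W, y):
    """f(x) = sum_s l(m^s(x)) over the data set {(w^s, y^s)}."""
    return sum(loss(margin(x, W[s], y[s])) for s in range(len(y)))

# Illustrative data: d = 4, so features have dimension d - 1 = 3
rng = np.random.default_rng(0)
W, y = rng.normal(size=(10, 3)), rng.choice([-1, 1], size=10)
print(f(np.zeros(4), W, y))  # loss of the all-zeros predictor: 10 * log(2)
```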
Review of distributed convex optimization
In the diagnosis example
Health center $i \in \{1, \ldots, N\}$ manages a set of patients $P_i$:
$$f(x) = \sum_{i=1}^{N} \sum_{s \in P_i} l(y^s\, h(x, w^s)) = \sum_{i=1}^{N} f^i(x)$$

Goal: find the best predicting model $w \mapsto h(x, w)$,
$$\min_{x \in \mathbb{R}^d} \sum_{i=1}^{N} f^i(x),$$
using local information.
What do we mean by using local information?
- Agent $i$ maintains an estimate $x^i_t$ of $x^* \in \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{N} f^i(x)$
- Agent $i$ has access to $f^i$
- Agent $i$ can share its estimate $x^i_t$ with neighboring agents

[Figure: communication digraph over agents $f^1, \ldots, f^5$, with adjacency matrix $A$ whose nonzero entries are $a_{13}, a_{14}, a_{21}, a_{25}, a_{31}, a_{32}, a_{41}, a_{52}$.]

Application to distributed estimation in wireless sensor networks and beyond... a "sensor" is any channel for the machine to learn.
How do agents agree on the optimizer?
- Spreading of information (gossip, time-varying topologies, B-joint connectivity)
- Relation between consensus & local minimization

A. Nedić and A. Ozdaglar, TAC '09:
$$z^i_{k+1} = \sum_{j=1}^{N} a_{ij,k}\, z^j_k - \eta_k\, g, \qquad x^i_{k+1} = \Pi_X(z^i_{k+1}),$$
where $A_k = (a_{ij,k})$ is doubly stochastic and $g \in \partial f^i(x^i_k)$

J. C. Duchi, A. Agarwal, and M. J. Wainwright, TAC '12
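As a point of reference, here is a sketch of one round of that consensus-based projected subgradient scheme; the ball constraint of radius R is an illustrative assumption, not part of the original method.

```python
import numpy as np

# One round of the Nedic-Ozdaglar scheme: mix with a doubly stochastic
# matrix A_k, step along local subgradients, project onto X.
# Assumption for illustration: X is the Euclidean ball of radius R.

def project_ball(z, R):
    n = np.linalg.norm(z)
    return z if n <= R else (R / n) * z

def nedic_ozdaglar_step(X_est, A_k, subgrads, eta_k, R=10.0):
    """X_est: N x d stacked estimates x^i_k; subgrads: N x d matrix whose
    row i lies in the subdifferential of f^i at x^i_k."""
    Z = A_k @ X_est - eta_k * subgrads      # consensus + subgradient step
    return np.vstack([project_ball(z, R) for z in Z])
```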
Saddle-point dynamics
The minimization problem can be regarded as
$$\min_{x \in \mathbb{R}^d} \sum_{i=1}^{N} f^i(x) = \min_{\substack{x^1, \ldots, x^N \in \mathbb{R}^d \\ x^1 = \cdots = x^N}} \sum_{i=1}^{N} f^i(x^i) = \min_{\substack{x \in (\mathbb{R}^d)^N \\ Lx = 0}} \sum_{i=1}^{N} f^i(x^i),$$
where $(Lx)^i = \sum_{j=1}^{N} a_{ij}(x^i - x^j)$.

The augmented Lagrangian when $L$ is symmetric is
$$F(x, z) := f(x) + \frac{a}{2}\, x^\top L\, x + z^\top L\, x,$$
which is convex-concave, and the saddle-point dynamics are
$$\dot{x} = -\frac{\partial F(x, z)}{\partial x} = -\nabla f(x) - a L x - L z, \qquad \dot{z} = \frac{\partial F(x, z)}{\partial z} = L x.$$
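As a sanity check, a forward-Euler discretization of these dynamics for scalar agent states on a symmetric Laplacian; the quadratic objectives, step size, and horizon below are illustrative assumptions.

```python
import numpy as np

# Forward-Euler integration of the saddle-point dynamics
#   xdot = -grad f(x) - a L x - L z,   zdot = L x,
# for N agents with scalar states and a symmetric Laplacian L.

def saddle_point_flow(x, z, L, grad_f, a=1.0, dt=0.01, steps=20000):
    for _ in range(steps):
        x = x + dt * (-grad_f(x) - a * (L @ x) - L @ z)
        z = z + dt * (L @ x)
    return x, z

# Illustrative objectives f^i(x) = (x - i)^2 on a 3-agent path graph;
# the network optimizer is mean(1, 2, 3) = 2.
L = np.array([[1., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])
grad_f = lambda x: 2.0 * (x - np.array([1.0, 2.0, 3.0]))
x, z = saddle_point_flow(np.zeros(3), np.zeros(3), L, grad_f)
print(x)  # all entries approach 2.0
```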
Weight-balanced digraphs
$$\dot{x} = -\nabla f(x) - a L x - L z, \qquad \dot{z} = L x$$
Lagrangian $F(x, z) = f(x) + \frac{a}{2}\, x^\top L\, x + z^\top L\, x$, with $f(x) = \sum_{i=1}^{N} f^i(x^i)$.

For a directed graph, the exact saddle-point dynamics
$$\dot{x} = -\frac{\partial F(x, z)}{\partial x} = -\nabla f(x) - a\, \tfrac{1}{2}(L + L^\top)\, x - L^\top z \quad \text{(not distributed!)}$$
$$\dot{z} = \frac{\partial F(x, z)}{\partial z} = L x$$
are changed to $\dot{x} = -\nabla f(x) - a L x - L z$.

J. Wang and N. Elia (with $L = L^\top$), Allerton
B. Gharesifard and J. Cortés, CDC '12

Note: $a > 0$; otherwise the linear part of the saddle-point dynamics is a Hamiltonian system.
Example: 4 agents in a directed cycle, with
$$f^1(x_1, x_2) = \tfrac{1}{2}\big((x_1 - 4)^2 + (x_2 - 3)^2\big), \qquad f^2(x_1, x_2) = x_1 + 3 x_2^2,$$
$$f^3(x_1, x_2) = \log\big(e^{x_1 + 3} + e^{x_2 + 1}\big), \qquad f^4(x_1, x_2) = (x_1 + 2 x_2 + 5)^2 + (x_1 - x_2 - 4)^2$$

[Figure: evolution of the agents' estimates $\{(x^i_1(t), x^i_2(t))\}$ over time $t$.]

- convergence to a neighborhood of the optimizer $(1.10, 2.74)$
- the size of the neighborhood depends on the size of the noise [DMN-JC, '13]
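The directed cycle is relevant because the modified dynamics only require a weight-balanced digraph. A quick numpy check, assuming unit edge weights:

```python
import numpy as np

# The directed 4-cycle with unit weights is weight-balanced: every node's
# in-degree equals its out-degree, so both L 1 = 0 and 1^T L = 0.

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)        # edge i -> j iff A[i, j] = 1
L = np.diag(A.sum(axis=1)) - A                   # out-degree Laplacian
print(np.allclose(A.sum(axis=1), A.sum(axis=0)))                        # True
print(np.allclose(L @ np.ones(4), 0), np.allclose(np.ones(4) @ L, 0))  # True True
```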
Review of online convex optimization
Different kind of optimization: sequential decision making
Resuming the diagnosis example, each round $t \in \{1, \ldots, T\}$ brings:
- question (features, medical findings): $w_t$
- decision (about using CT): $h(x_t, w_t)$
- outcome (by CT findings / follow-up of the patient): $y_t$
- loss: $l(y_t\, h(x_t, w_t))$

Choose $x_t$ & incur loss $f_t(x_t) := l(y_t\, h(x_t, w_t))$

Goal: sublinear regret
$$R(u, T) := \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u) \in o(T),$$
using historical observations.
Why regret? If the regret is sublinear,
$$\sum_{t=1}^{T} f_t(x_t) \leq \sum_{t=1}^{T} f_t(u) + o(T),$$
then
$$\frac{1}{T} \sum_{t=1}^{T} f_t(x_t) \leq \frac{1}{T} \sum_{t=1}^{T} f_t(u) + \frac{o(T)}{T}.$$
In temporal average, the online decisions $\{x_t\}_{t=1}^{T}$ perform as well as the best fixed decision in hindsight.

No regrets, my friend.
What about generalization error?
- Sublinear regret does not imply that $x_{t+1}$ will do well against $f_{t+1}$
- No assumptions are made about the sequence $\{f_t\}$; it can follow an unknown stochastic or deterministic model, or be chosen adversarially
- In our example, $f_t := l(y_t\, h(\cdot, w_t))$. If some model $w \mapsto h(x, w)$ explains the data reasonably well in hindsight, then the online models $w \mapsto h(x_t, w)$ perform just as well on average

Other applications:
- portfolio selection
- online advertisement placement
- interactive learning
Some classical results
Projected gradient descent:
$$x_{t+1} = \Pi_S\big(x_t - \eta_t \nabla f_t(x_t)\big), \qquad (1)$$
where $\Pi_S$ is the projection onto a compact set $S \subseteq \mathbb{R}^d$, and $\|\nabla f_t\|_2 \leq H$.

Follow-the-Regularized-Leader:
$$x_{t+1} = \arg\min_{y \in S} \Big( \sum_{s=1}^{t} f_s(y) + \psi(y) \Big)$$

- Martin Zinkevich, '03: (1) achieves $O(\sqrt{T})$ regret under convexity with $\eta_t = \frac{1}{\sqrt{t}}$
- Elad Hazan, Amit Agarwal, and Satyen Kale, '07: (1) achieves $O(\log T)$ regret under $p$-strong convexity with $\eta_t = \frac{1}{pt}$
- Others: Online Newton Step, Follow the Regularized Leader, etc.
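A sketch of (1) on toy losses, comparing the two step-size schedules; the quadratic losses, ball constraint, and horizon are illustrative assumptions.

```python
import numpy as np

# Online projected gradient descent (1) on toy losses
# f_t(x) = (p/2) * ||x - c_t||^2, with projection onto a ball of radius R.
# Compares eta_t = 1/sqrt(t) (convex case) with eta_t = 1/(p t)
# (p-strongly convex case).

def ogd_total_loss(centers, eta, p=1.0, R=5.0):
    x, total = np.zeros(centers.shape[1]), 0.0
    for t, c in enumerate(centers, start=1):
        total += 0.5 * p * np.sum((x - c) ** 2)  # incur f_t(x_t)
        x = x - eta(t) * p * (x - c)             # gradient step
        n = np.linalg.norm(x)
        x = x if n <= R else (R / n) * x         # projection onto S
    return total

rng = np.random.default_rng(1)
centers = rng.normal(size=(1000, 2))
print(ogd_total_loss(centers, lambda t: 1 / np.sqrt(t)))  # O(sqrt(T)) schedule
print(ogd_total_loss(centers, lambda t: 1 / (1.0 * t)))   # O(log T) schedule, p = 1
```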
Our contribution: Combining both aspects
Back again to the diagnosis example
Health center $i \in \{1, \ldots, N\}$ takes care of a set of patients $P^i_t$ at time $t$:
$$f^i(x) = \sum_{t=1}^{T} \sum_{s \in P^i_t} l(y^s\, h(x, w^s)) = \sum_{t=1}^{T} f^i_t(x)$$

Goal: sublinear agent regret
$$R^j(u, T) := \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(x^j_t) - \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(u) \in o(T),$$
using local information & historical observations.
Challenge: coordinate hospitals
[Figure: communication digraph among hospitals with local objectives $f^1_t, \ldots, f^5_t$.]
Need to design distributed online algorithms.
Previous work on consensus-based online algorithms
- F. Yan, S. Sundaram, S. V. N. Vishwanathan, and Y. Qi, TAC: Projected Subgradient Descent; $\log(T)$ regret (local strong convexity & bounded subgradients); $\sqrt{T}$ regret (convexity & bounded subgradients). Both analyses require a projection onto a compact set.
- S. Hosseini, A. Chapman, and M. Mesbahi, CDC '13: Dual Averaging; $\sqrt{T}$ regret (convexity & bounded subgradients); general regularized projection onto a closed convex set.
- K. I. Tsianos and M. G. Rabbat, arXiv '12: Projected Subgradient Descent; empirical risk as opposed to regret analysis.

The communication digraph in all cases is fixed, strongly connected & weight-balanced.
Our contributions (informally)
- time-varying communication digraphs, weight-balanced and under B-joint connectivity
- unconstrained optimization (no projection step onto a bounded set)
- $\log T$ regret (local strong convexity & bounded subgradients)
- $\sqrt{T}$ regret (convexity & bounded subgradients)
Coordination algorithm
$$x^i_{t+1} = x^i_t - \eta_t\, g_{x^i_t} + \sigma \Big( a \sum_{j=1, j \neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t) + \sum_{j=1, j \neq i}^{N} a_{ij,t}\, (z^j_t - z^i_t) \Big)$$
$$z^i_{t+1} = z^i_t - \sigma \sum_{j=1, j \neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t)$$

The update combines:
- subgradient descent on previous local objectives, $g_{x^i_t} \in \partial f^i_t(x^i_t)$
- proportional (linear) feedback on the disagreement with neighbors
- integral (linear) feedback on the disagreement with neighbors

The union of the graphs over intervals of length B is strongly connected.
[Figure: time-varying digraph over the objectives $f^1_t, \ldots, f^5_t$; e.g., the edges $a_{41,t+1}$ and $a_{14,t+1}$ are only present at time $t+1$.]

Compact representation:
$$\begin{bmatrix} x_{t+1} \\ z_{t+1} \end{bmatrix} = \begin{bmatrix} x_t \\ z_t \end{bmatrix} - \sigma \begin{bmatrix} a L_t & L_t \\ -L_t & 0 \end{bmatrix} \begin{bmatrix} x_t \\ z_t \end{bmatrix} - \eta_t \begin{bmatrix} g_{x_t} \\ 0 \end{bmatrix}$$

Generalization:
$$v_{t+1} = (I - \sigma\, G \otimes L_t)\, v_t - \eta_t\, g_t$$
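A runnable sketch of the coordination algorithm for scalar agent states; the alternating graphs, the objectives, and the gains $\sigma$, $a$ are illustrative assumptions, chosen so that the union of the two graphs is connected (B = 2).

```python
import numpy as np

def coordination_step(x, z, L_t, g, eta_t, sigma=0.1, a=1.0):
    """One round of the proportional-integral coordination algorithm for
    scalar agent states: x, z are length-N vectors, L_t is the Laplacian
    of the communication graph at time t, g stacks the local
    subgradients g_{x^i_t}."""
    x_new = x - eta_t * g - sigma * (a * (L_t @ x) + L_t @ z)
    z_new = z + sigma * (L_t @ x)
    return x_new, z_new

# Toy run: 3 agents alternate between two graphs whose union is connected
# (B-joint connectivity with B = 2); f^i_t(x) = (x - c_i)^2, c = (0, 1, 2).
A1 = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)  # edge {1,2}
A2 = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0]], dtype=float)  # edge {2,3}
laplacians = [np.diag(A.sum(axis=1)) - A for A in (A1, A2)]
c = np.array([0.0, 1.0, 2.0])
x, z = np.zeros(3), np.zeros(3)
for t in range(1, 5001):
    g = 2.0 * (x - c)                  # subgradient of f^i_t at x^i_t
    x, z = coordination_step(x, z, laplacians[t % 2], g, eta_t=1.0 / (2.0 * t))
print(x)  # the agents approach the network optimizer mean(c) = 1.0
```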
Our contributions
Theorem. Assume that:
- $\{f^1_t, \ldots, f^N_t\}_{t=1}^{T}$ are convex functions in $\mathbb{R}^d$ with $H$-bounded subgradient sets, nonempty and uniformly bounded sets of minimizers, and $p$-strongly convex in a sufficiently large neighborhood of their minimizers;
- the sequence of weight-balanced communication digraphs is nondegenerate and B-jointly connected;
- $G \in \mathbb{R}^{K \times K}$ is diagonalizable with positive real eigenvalues.

Then, taking learning rates $\eta_t = \frac{1}{pt}$, for any $j \in \{1, \ldots, N\}$ and $u \in \mathbb{R}^d$,
$$R^j(u, T) \leq C\,(\|u\|_2^2 + \log T).$$

Relaxing strong convexity to convexity and using the Doubling Trick scheme (see S. Shalev-Shwartz) for the learning rates,
$$R^j(u, T) \leq C\, \|u\|_2^2\, \sqrt{T},$$
for any $j \in \{1, \ldots, N\}$ and $u \in \mathbb{R}^d$.
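For reference, a small sketch of the Doubling Trick schedule mentioned above (after Shalev-Shwartz's online-learning survey): run an $O(\sqrt{T'})$-regret step-size schedule on epochs of doubling length, so the horizon need not be known in advance; the concrete $1/\sqrt{k+1}$ rate inside each epoch is an illustrative choice.

```python
import numpy as np

def doubling_trick_rates(T):
    """Yield (t, eta_t): within epoch m (of length 2^m), eta_t = 1/sqrt(k+1).
    Summing the per-epoch sqrt regrets keeps the total O(sqrt(T))."""
    t, m = 1, 0
    while t <= T:
        for k in range(2 ** m):              # epoch m covers 2^m rounds
            if t > T:
                break
            yield t, 1.0 / np.sqrt(k + 1)    # the step size restarts each epoch
            t += 1
        m += 1

for t, eta in doubling_trick_rates(10):
    print(t, round(eta, 3))
```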
Idea of the proof
Network regret:
$$R(u, T) := \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(x^i_t) - \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(u)$$
- Disagreement dynamics under B-joint connectivity
- Bound on the trajectories, uniform in T
Simulations: acute brain finding revealed on Computerized Tomography
[Figure: agents' estimates $\{(x^i_t)_7\}_{i=1}^{N}$ over time $t$, and average regret $\max_j \frac{1}{T} R^j\big(x_T, \{\sum_{i=1}^{N} f^i_t\}_{t=1}^{T}\big)$ versus the time horizon $T$, comparing centralized, proportional, and proportional-integral disagreement feedback.]

$$f^i_t(x) = \sum_{s \in P^i_t} l(y^s\, h(x, w^s)) + \|x\|_2^2, \quad \text{where } l(m) = \log\big(1 + e^{-2m}\big)$$
Conclusions
- Distributed online unconstrained convex optimization with sublinear regret under B-joint connectivity
- Relevant for regression & classification, which play a crucial role in machine learning, computer vision, etc.

Future work
- Refine the guarantees under a model for the evolution of the objective functions
- Enable agents to cooperatively select features that strike the balance between sensitivity and specificity
- Effect of noise on the performance
Future horizons for distributed optimization in healthcare
Engage & detect disease before it happens
Thank you for listening!