Distributed online optimization over jointly connected digraphs
Distributed online optimization over jointly connected digraphs
David Mateos-Núñez, Jorge Cortés
University of California, San Diego
Mathematical Theory of Networks and Systems, University of Groningen, July 7, 2014

Overview of this talk
- Distributed optimization
- Online optimization
- Case study: medical diagnosis

Motivation: a data-driven optimization problem
Medical findings, symptoms:
- age factor
- amnesia before impact
- deterioration in GCS score
- open skull fracture
- loss of consciousness
- vomiting
Any acute brain finding revealed on Computerized Tomography? (-1 = not present, 1 = present)
The Canadian CT Head Rule for patients with minor head injury

Binary classification
- feature vector of patient $s$: $w_s = ((w_s)_1, \ldots, (w_s)_{d-1})$
- true diagnosis: $y_s \in \{-1, 1\}$
- wanted weights: $x = (x_1, \ldots, x_d)$
- predictor: $h(x, w_s) = x^\top (w_s, 1)$
- margin: $m_s(x) = y_s\, h(x, w_s)$

Given the data set $\{w_s\}_{s=1}^P$, estimate $x \in \mathbb{R}^d$ by solving
$$\min_{x \in \mathbb{R}^d} \sum_{s=1}^P \ell(m_s(x)),$$
where the loss function $\ell : \mathbb{R} \to \mathbb{R}$ is decreasing and convex.

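To make the setup concrete, here is a minimal numerical sketch of the margin and total loss above, assuming the logistic loss $\ell(m) = \log(1 + e^{-2m})$ that appears later in the simulations; the data arrays are hypothetical:

```python
# Sketch of the classification setup; W (P x (d-1)) and y are hypothetical data.
import numpy as np

def margin(x, w_s, y_s):
    # m_s(x) = y_s * h(x, w_s), with affine predictor h(x, w) = x^T (w, 1)
    return y_s * np.dot(x, np.append(w_s, 1.0))

def total_loss(x, W, y):
    # sum_{s=1}^P l(m_s(x)) with the decreasing convex loss l(m) = log(1 + e^{-2m})
    return sum(np.log1p(np.exp(-2.0 * margin(x, W[s], y[s]))) for s in range(len(y)))
```
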
Why distributed online optimization?
Why distributed?
- information is distributed across a group of agents
- need to interact to optimize performance
Why online?
- information becomes incrementally available
- need an adaptive solution

Review of distributed convex optimization

A distributed scenario
In the diagnosis example, health center $i \in \{1, \ldots, N\}$ manages a set of patients $P_i$:
$$f(x) = \sum_{i=1}^N \sum_{s \in P_i} \ell(y_s\, h(x, w_s)) = \sum_{i=1}^N f^i(x)$$
Goal: best predicting model $w \mapsto h(x, w)$,
$$\min_{x \in \mathbb{R}^d} \sum_{i=1}^N f^i(x),$$
using local information.

What do we mean by using local information?
- Agent $i$ maintains an estimate $x^i_t$ of $x^\star \in \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^N f^i(x)$
- Agent $i$ has access to $f^i$
- Agent $i$ can share its estimate $x^i_t$ with neighboring agents

[Figure: digraph of five agents $f^1, \ldots, f^5$ with sparse adjacency matrix $A$, nonzero entries $a_{13}, a_{14}, a_{21}, a_{25}, a_{31}, a_{32}, a_{41}, a_{52}$]

Distributed algorithms: Tsitsiklis 84, Bertsekas and Tsitsiklis 95
Consensus: Jadbabaie et al. 03, Olfati-Saber and Murray 04, Boyd et al. 05

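As a concrete (hypothetical) example of the graph objects used throughout, the sketch below builds the Laplacian of a directed cycle on five agents, the simplest strongly connected, weight-balanced digraph:

```python
# A 5-agent directed cycle: strongly connected and weight-balanced (unit weights).
import numpy as np

N = 5
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = 1.0            # a_{ij} > 0: agent i receives x^j from agent j

L = np.diag(A.sum(axis=1)) - A          # (L x)^i = sum_j a_{ij} (x^i - x^j)
assert np.allclose(A.sum(axis=0), A.sum(axis=1))  # weight-balanced: in-degree == out-degree
```
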
How do agents agree on the optimizer?
Local minimization vs. local consensus

A. Nedić and A. Ozdaglar, TAC, 09:
$$x^i_{k+1} = \sum_{j=1}^N a_{ij,k}\, x^j_k - \eta_k\, g^i_k,$$
where $g^i_k \in \partial f^i(x^i_k)$ and $A_k = (a_{ij,k})$ is doubly stochastic

J. C. Duchi, A. Agarwal, and M. J. Wainwright, TAC, 12

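A minimal sketch of one round of this consensus-plus-subgradient update, with all agents' estimates stacked row-wise in a matrix:

```python
# One round of x^i_{k+1} = sum_j a_{ij,k} x^j_k - eta_k g^i_k (all agents at once).
import numpy as np

def nedic_ozdaglar_step(X, A_k, eta_k, G):
    # X: N x d rows of estimates x^i_k; A_k: doubly stochastic N x N matrix;
    # G: N x d rows of subgradients g^i_k of f^i at x^i_k
    return A_k @ X - eta_k * G
```
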
Saddle-point dynamics
The minimization problem can be regarded as
$$\min_{x \in \mathbb{R}^d} \sum_{i=1}^N f^i(x) = \min_{\substack{x^1, \ldots, x^N \in \mathbb{R}^d \\ x^1 = \cdots = x^N}} \sum_{i=1}^N f^i(x^i) = \min_{\substack{x \in (\mathbb{R}^d)^N \\ \mathbf{L}x = 0}} \tilde f(x),$$
where $(\mathbf{L}x)^i = \sum_{j=1}^N a_{ij}(x^i - x^j)$ and $\tilde f(x) = \sum_{i=1}^N f^i(x^i)$.

Convex-concave augmented Lagrangian when $\mathbf{L}$ is symmetric:
$$F(x, z) := \tilde f(x) + \frac{\gamma}{2}\, x^\top \mathbf{L} x + z^\top \mathbf{L} x$$
Saddle-point dynamics:
$$\dot x = -\frac{\partial F(x, z)}{\partial x} = -\nabla \tilde f(x) - \gamma \mathbf{L} x - \mathbf{L} z, \qquad \dot z = \frac{\partial F(x, z)}{\partial z} = \mathbf{L} x$$

In the case of directed graphs
Weight-balanced: $\mathbf{1}^\top \mathbf{L} = 0$, which implies $\mathbf{L} + \mathbf{L}^\top \succeq 0$, so
$$F(x, z) := \tilde f(x) + \frac{\gamma}{2}\, x^\top \mathbf{L} x + z^\top \mathbf{L} x$$
is still convex-concave. The saddle-point dynamics become
$$\dot x = -\frac{\partial F(x, z)}{\partial x} = -\nabla \tilde f(x) - \gamma\, \tfrac{1}{2}(\mathbf{L} + \mathbf{L}^\top)\, x - \mathbf{L}^\top z \quad \text{(not distributed!)}$$
$$\dot z = \frac{\partial F(x, z)}{\partial z} = \mathbf{L} x$$
The non-distributed $x$-dynamics are changed to $-\nabla \tilde f(x) - \gamma \mathbf{L} x - \mathbf{L} z$, which each agent can compute locally.

J. Wang and N. Elia (with $\mathbf{L} = \mathbf{L}^\top$), Allerton, 10
B. Gharesifard and J. Cortés, CDC, 12

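The continuous-time dynamics can be sampled with a forward-Euler step; the sketch below implements the distributed variant of the $x$-dynamics (the step size `dt` is an assumption of the discretization, not part of the original analysis):

```python
# Forward-Euler sketch of the distributed saddle-point dynamics.
import numpy as np

def saddle_point_step(x, z, grad_f, L, gamma, dt):
    # x, z: N x d stacked states; grad_f(x): N x d stacked gradients of f^i at x^i;
    # distributed dynamics: x' = -grad f(x) - gamma L x - L z,   z' = L x
    x_next = x + dt * (-grad_f(x) - gamma * (L @ x) - L @ z)
    z_next = z + dt * (L @ x)
    return x_next, z_next
```
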
Review of online convex optimization

A different kind of optimization: sequential decision making
In the diagnosis example, each round $t \in \{1, \ldots, T\}$:
- question (features, medical findings): $w_t$
- decision (about using CT): $h(x_t, w_t)$
- outcome (by CT findings / follow-up of patient): $y_t$
- loss: $\ell(y_t\, h(x_t, w_t))$
Choose $x_t$ & incur loss $f_t(x_t) := \ell(y_t\, h(x_t, w_t))$

Goal: sublinear regret,
$$R(u, T) := \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(u) \le o(T),$$
using historical observations.

Why regret?
If the regret is sublinear,
$$\sum_{t=1}^T f_t(x_t) \le \sum_{t=1}^T f_t(u) + o(T),$$
then
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^T f_t(x_t) \le \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^T f_t(u).$$
In temporal average, online decisions $\{x_t\}_{t=1}^T$ perform as well as the best fixed decision in hindsight.
No regrets, my friend

Why regret? What about generalization error?
- Sublinear regret does not imply $x_{t+1}$ will do well with $f_{t+1}$
- No assumptions about the sequence $\{f_t\}$; it can follow an unknown stochastic or deterministic model, or be chosen adversarially
- In our example, $f_t := \ell(y_t\, h(x_t, w_t))$. If some model $w \mapsto h(x, w)$ explains the data reasonably well in hindsight, then the online models $w \mapsto h(x_t, w)$ perform just as well on average

Some classical results
Projected gradient descent:
$$x_{t+1} = \Pi_S\big(x_t - \eta_t \nabla f_t(x_t)\big), \tag{1}$$
where $\Pi_S$ is the projection onto a compact set $S \subseteq \mathbb{R}^d$ and $\|\nabla f_t\|_2 \le H$.

Follow-the-Regularized-Leader:
$$x_{t+1} = \arg\min_{y \in S} \Big( \sum_{s=1}^t f_s(y) + \psi(y) \Big)$$

Martin Zinkevich, 03: (1) achieves $O(\sqrt{T})$ regret under convexity with $\eta_t = \frac{1}{\sqrt{t}}$
Elad Hazan, Amit Agarwal, and Satyen Kale, 07: (1) achieves $O(\log T)$ regret under $p$-strong convexity with $\eta_t = \frac{1}{pt}$
Others: Online Newton Step, Follow-the-Regularized-Leader, etc.

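A self-contained sketch of update (1) on a Euclidean ball with Zinkevich's step size $\eta_t = 1/\sqrt{t}$; the ball radius and the loss/gradient oracles are assumptions for illustration:

```python
# Online projected gradient descent (1) with eta_t = 1/sqrt(t).
import numpy as np

def project_ball(x, radius):
    n = np.linalg.norm(x)
    return x if n <= radius else (radius / n) * x

def online_pgd(losses, loss_grads, d, radius, T):
    # losses[t](x), loss_grads[t](x): f_t and its gradient, revealed after x_t is chosen
    x, total = np.zeros(d), 0.0
    for t in range(1, T + 1):
        total += losses[t - 1](x)                            # incur f_t(x_t)
        x = project_ball(x - loss_grads[t - 1](x) / np.sqrt(t), radius)
    return x, total   # subtract min_u sum_t f_t(u) from 'total' to get the regret
```
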
Done with the review. Now... we combine the two frameworks:
- previous work and limitations
- our coordination algorithm in the online case
- theorems of $O(\sqrt{T})$ & $O(\log T)$ agent regrets
- simulations
- outline of the proof

Combining both frameworks

Resuming the diagnosis example
Health center $i \in \{1, \ldots, N\}$ takes care of a set of patients $P^i_t$ at time $t$:
$$f^i(x) = \sum_{t=1}^T \sum_{s \in P^i_t} \ell(y_s\, h(x, w_s)) = \sum_{t=1}^T f^i_t(x)$$
Goal: sublinear agent regret,
$$R^j(u, T) := \sum_{t=1}^T \sum_{i=1}^N f^i_t(x^j_t) - \sum_{t=1}^T \sum_{i=1}^N f^i_t(u) \le o(T),$$
using local information & historical observations.

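Given stored estimates and revealed losses, the agent regret above can be evaluated directly; a sketch (the container names are hypothetical):

```python
# Evaluating R^j(u, T) from histories of estimates and losses.
def agent_regret(j, u, f, X, T, N):
    # f[t][i]: callable f^i_t; X[t][j]: agent j's estimate x^j_t at round t
    online = sum(f[t][i](X[t][j]) for t in range(T) for i in range(N))
    hindsight = sum(f[t][i](u) for t in range(T) for i in range(N))
    return online - hindsight
```
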
Challenge: coordinate hospitals
[Figure: time-varying digraph among the local objectives $f^1_t, \ldots, f^5_t$]
Need to design distributed online algorithms.

Previous work on consensus-based online algorithms
F. Yan, S. Sundaram, S. V. N. Vishwanathan and Y. Qi, TAC — Projected Subgradient Descent:
- $\log(T)$ regret (local strong convexity & bounded subgradients)
- $\sqrt{T}$ regret (convexity & bounded subgradients)
- both analyses require a projection onto a compact set
S. Hosseini, A. Chapman and M. Mesbahi, CDC, 13 — Dual Averaging:
- $\sqrt{T}$ regret (convexity & bounded subgradients)
- general regularized projection onto a closed convex set
K. I. Tsianos and M. G. Rabbat, arXiv, 12 — Projected Subgradient Descent:
- empirical risk as opposed to regret analysis
Limitation: in all cases the communication digraph is fixed and strongly connected.

Our contributions (informally)
- time-varying communication digraphs under $B$-joint connectivity & weight-balanced
- unconstrained optimization (no projection step onto a bounded set)
- $\log T$ regret (local strong convexity & bounded subgradients)
- $\sqrt{T}$ regret (convexity & $\beta$-centrality with $\beta \in (0, 1]$ & bounded subgradients)

Definition: $f^i_t$ is $\beta$-central in $Z \subseteq \mathbb{R}^d \setminus X$, where $X = \{x^\star : 0 \in \partial f^i_t(x^\star)\}$, if for each $x \in Z$ there exists $x^\star \in X$ such that
$$\frac{\xi_x^\top (x - x^\star)}{\|\xi_x\|_2\, \|x - x^\star\|_2} \ge \beta, \quad \text{for each } \xi_x \in \partial f^i_t(x).$$

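A numeric spot-check of the $\beta$-centrality condition at a single point, assuming the set of minimizers $X$ is available and the subgradient oracle returns one element of $\partial f^i_t(x)$; a sketch, not part of the paper:

```python
# Spot-check of beta-centrality at a point x outside the minimizer set X.
import numpy as np

def beta_central_at(x, subgrad, minimizers, beta):
    # condition: xi^T (x - x*) / (||xi||_2 ||x - x*||_2) >= beta for some x* in X
    xi = subgrad(x)
    for x_star in minimizers:
        gap = x - x_star
        denom = np.linalg.norm(xi) * np.linalg.norm(gap)
        if denom > 0 and np.dot(xi, gap) / denom >= beta:
            return True
    return False
```
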
Coordination algorithm
$$x^i_{t+1} = x^i_t + \sigma\Big(\gamma \sum_{j=1, j \ne i}^N a_{ij,t}\,\big(x^j_t - x^i_t\big) + \sum_{j=1, j \ne i}^N a_{ij,t}\,\big(z^j_t - z^i_t\big)\Big) - \eta_t\, g_{x^i_t}$$
$$z^i_{t+1} = z^i_t - \sigma \sum_{j=1, j \ne i}^N a_{ij,t}\,\big(x^j_t - x^i_t\big)$$
- subgradient descent on previous local objectives, $g_{x^i_t} \in \partial f^i_t(x^i_t)$
- proportional (linear) feedback on disagreement with neighbors
- integral (linear) feedback on disagreement with neighbors

The union of the graphs over intervals of length $B$ is strongly connected ($B$-joint connectivity).
[Figure: digraph sequence at times $t$, $t+1$, $t+2$ among $f^1_t, \ldots, f^5_t$, with edges such as $a_{41,t+1}$, $a_{14,t+1}$ appearing and disappearing over time]

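$B$-joint connectivity can be checked numerically by taking the union of the adjacency matrices over each window of length $B$ and testing strong connectivity; a sketch using SciPy's strongly connected components routine:

```python
# Check that the union over every length-B window is strongly connected.
import numpy as np
from scipy.sparse.csgraph import connected_components

def b_jointly_connected(adjacencies, B):
    # adjacencies: list of N x N matrices A_t for t = 0, 1, 2, ...
    for start in range(0, len(adjacencies) - B + 1, B):
        union = sum(adjacencies[start:start + B])
        n_comp, _ = connected_components(union, directed=True, connection='strong')
        if n_comp != 1:
            return False
    return True
```
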
Compact representation & generalization:
$$\begin{bmatrix} x_{t+1} \\ z_{t+1} \end{bmatrix} = \begin{bmatrix} x_t \\ z_t \end{bmatrix} - \sigma \begin{bmatrix} \gamma \mathbf{L}_t & \mathbf{L}_t \\ -\mathbf{L}_t & 0 \end{bmatrix} \begin{bmatrix} x_t \\ z_t \end{bmatrix} - \eta_t \begin{bmatrix} g_{x_t} \\ 0 \end{bmatrix}$$
$$v_{t+1} = \big(I - \sigma\, E \otimes \mathbf{L}_t\big)\, v_t - \eta_t\, g_t$$

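One step of the algorithm in its compact form is a few lines of linear algebra; a sketch in which each agent's row of the update only uses neighbor information through the current Laplacian $\mathbf{L}_t$:

```python
# One step of the coordination algorithm (compact form).
import numpy as np

def coordination_step(x, z, g_x, L_t, sigma, gamma, eta_t):
    # x, z: N x d stacked estimates; g_x: N x d subgradients g_{x^i_t} of f^i_t at x^i_t
    x_next = x - sigma * (gamma * (L_t @ x) + L_t @ z) - eta_t * g_x
    z_next = z + sigma * (L_t @ x)
    return x_next, z_next
```
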
Theorem (DMN-JC)
Assume:
- $\{f^1_t, \ldots, f^N_t\}_{t=1}^T$ are convex functions in $\mathbb{R}^d$ with $H$-bounded subgradient sets, nonempty and uniformly bounded sets of minimizers, and $p$-strongly convex in a sufficiently large neighborhood of their minimizers
- the sequence of weight-balanced communication digraphs is nondegenerate and $B$-jointly connected
- $E \in \mathbb{R}^{K \times K}$ is diagonalizable with positive real eigenvalues
Then, taking $\sigma \in \big[\frac{\delta}{\lambda_{\min}(E)}, \frac{1 - \delta}{\lambda_{\max}(E)\, d_{\max}}\big]$ and $\eta_t = \frac{1}{pt}$, for any $j \in \{1, \ldots, N\}$ and $u \in \mathbb{R}^d$,
$$R^j(u, T) \le C\,\big(1 + \|u\|_2^2\big) \log T.$$
Substituting strong convexity by $\beta$-centrality, for $\beta \in (0, 1]$, for any $j \in \{1, \ldots, N\}$ and $u \in \mathbb{R}^d$,
$$R^j(u, T) \le C\, \|u\|_2^2\, \sqrt{T}.$$

Simulations: acute brain finding revealed on Computerized Tomography

Simulations
[Figure: agents' estimates $(x^i_t)_7$ versus time $t$, and average regret $\max_j \frac{1}{T} R^j\big(x_T, \{\sum_{i=1}^N f^i_t\}_{t=1}^T\big)$ versus time horizon $T$, comparing the centralized method with proportional and proportional-integral disagreement feedback]
$$f^i_t(x) = \sum_{s \in P^i_t} \ell(y_s\, h(x, w_s)) + \|x\|_2^2, \quad \text{where } \ell(m) = \log\big(1 + e^{-2m}\big)$$

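For reference, a sketch of this local online objective as code; the unit weight on the regularizer is read off the slide and may differ from the paper:

```python
# Local online objective f^i_t from the simulations slide.
import numpy as np

def f_it(x, W, y):
    # W: |P^i_t| x (d-1) features, y: labels in {-1, 1}; h(x, w) = x^T (w, 1)
    margins = y * (W @ x[:-1] + x[-1])
    return np.sum(np.log1p(np.exp(-2.0 * margins))) + np.dot(x, x)
```
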
Outline of the proof
- Analysis of network regret $R^N(u, T) := \sum_{t=1}^T \sum_{i=1}^N f^i_t(x^i_t) - \sum_{t=1}^T \sum_{i=1}^N f^i_t(u)$
- ISS with respect to agreement:
$$\|\hat{\mathbf{L}}_K\, v_t\|_2 \le C_I\, \|v_1\|_2 \Big(1 - \frac{\delta}{4N^2}\Big)^{\lfloor \frac{t-1}{B} \rfloor} + C_U \max_{1 \le s \le t-1} \|u_s\|_2$$
- Agent regret bound in terms of $\{\eta_t\}_{t \ge 1}$ and initial conditions
- Boundedness of online estimates under $\beta$-centrality ($\beta \in (0, 1]$)
- Doubling trick to obtain $O(\sqrt{T})$ agent regret
- Local strong convexity $\Rightarrow$ $\beta$-centrality; $\eta_t = \frac{1}{pt}$ to obtain $O(\log T)$ agent regret

Conclusions and future work
Conclusions:
- "Distributed online convex optimization over jointly connected digraphs," submitted to IEEE Transactions on Network Science and Engineering (available on the web page of Jorge Cortés)
- relevant for regression & classification, which play a crucial role in machine learning, computer vision, etc.
Future work:
- refine guarantees under a model for the evolution of the objective functions

Future directions in data-driven optimization problems
Distributed strategies for:
- feature selection
- semi-supervised learning

Thank you for listening!
Contact: David Mateos-Núñez and Jorge Cortés, at UC San Diego