ADMM and Fast Gradient Methods for Distributed Optimization
1 ADMM and Fast Gradient Methods for Distributed Optimization
João Xavier, Instituto de Sistemas e Robótica (ISR), Instituto Superior Técnico (IST)
European Control Conference, ECC'13, July 16, 2013
2 Joint work
Fast gradient methods: Dušan Jakovetić (IST-CMU), José M. F. Moura (CMU)
ADMM: João Mota (IST-CMU), Markus Püschel (ETH), Pedro Aguiar (IST)
3 Multi-agent optimization
minimize  f(x) := f_1(x) + f_2(x) + ... + f_n(x)
subject to  x ∈ X
f_i is convex, private to agent i
X ⊆ R^d is closed, convex (hereafter, d = 1)
f* = inf_{x ∈ X} f(x) is attained at x*
network is connected and static
applications: distributed learning, cognitive radio, consensus, ...
4 Distributed subgradient method
Update at each agent i (with constant stepsize):
x_i(t) = P_X ( Σ_{j ∈ N_i} W_ij x_j(t−1) − α ∇f_i(x_i(t−1)) )
N_i is the neighborhood of node i (including i)
W_ij are weights
α > 0 is the stepsize
∇f_i(x) is a subgradient of f_i at x
P_X is the projector onto X
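As a concrete illustration, here is a minimal Python/NumPy sketch of this update; the quadratic losses, Metropolis-style weight matrix, interval constraint set, and stepsize are illustrative assumptions, not values from the talk:

```python
import numpy as np

def distributed_subgradient(subgrads, W, proj, x0, alpha, iters):
    """Distributed projected subgradient method (d = 1: one scalar per agent).

    subgrads[i](x) returns a subgradient of f_i at x, W is a symmetric
    stochastic weight matrix matching the graph (W_ij = 0 if j is not a
    neighbor of i), proj projects onto X, alpha is the constant stepsize.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        mixed = W @ x                      # each agent averages its neighbors' estimates
        grads = np.array([subgrads[i](x[i]) for i in range(len(x))])
        x = proj(mixed - alpha * grads)    # x_i(t) = P_X( sum_j W_ij x_j(t-1) - alpha * subgrad_i(x_i(t-1)) )
    return x

# toy setup (assumed): f_i(x) = (x - theta_i)^2 on X = [-10, 10], 3-node path graph
theta = np.array([1.0, 2.0, 3.0])
subgrads = [lambda x, th=th: 2.0 * (x - th) for th in theta]
W = np.array([[2/3, 1/3, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 1/3, 2/3]])
proj = lambda v: np.clip(v, -10.0, 10.0)
print(distributed_subgradient(subgrads, W, proj, np.zeros(3), alpha=0.05, iters=1000))
```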
5 Several variations exist and time-varying networks are supported
Small sample of representative work:
J. Tsitsiklis et al., Distributed asynchronous deterministic and stochastic gradient optimization algorithms, IEEE TAC, 31(9), 1986
A. Nedić and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE TAC, 54(1), 2009
B. Johansson et al., A randomized incremental subgradient method for distributed optimization in networked systems, SIAM J. Optimization, 20(3), 2009
A. Nedić et al., Constrained consensus and optimization in multi-agent networks, IEEE TAC, 55(4), 2010
J. Duchi et al., Dual averaging for distributed optimization: convergence and network scaling, IEEE TAC, 57(3), 2012
6 Convergence analysis: under appropriate conditions,
f(x_i(t)) − f* = O( α + 1/(α t) )
For optimized α, O(1/ε²) iterations suffice to reach ε-suboptimality.
Distributed subgradient method in matrix form:
x(t) = P ( W x(t−1) − α ∇F(x(t−1)) )
x(t) := (x_1(t), ..., x_n(t)) is the network state
F(x_1, ..., x_n) := f_1(x_1) + ... + f_n(x_n)
∇F(x_1, ..., x_n) = ( ∇f_1(x_1), ..., ∇f_n(x_n) )
P is the projector onto X^n
7 Interpretation: when W = I − αρL (L = network Laplacian, ρ > 0),
x(t) = P ( x(t−1) − α ∇Ψ_ρ(x(t−1)) )
is the classical subgradient method applied to the penalized objective
Ψ_ρ(x) = F(x) + ρ x^⊤ L x = Σ_{i=1}^n f_i(x_i) + ρ Σ_{i∼j} (x_i − x_j)²
Key idea: apply instead Nesterov's fast gradient method
x(t) = P ( y(t−1) − α ∇Ψ_ρ(y(t−1)) )
y(t) = x(t) + (t−1)/(t+2) (x(t) − x(t−1))
y(t) = (y_1(t), ..., y_n(t)) is an auxiliary variable
8 Distributed Nesterov gradient method (D-NG) with constant stepsize
x(t) = P ( W y(t−1) − α ∇F(y(t−1)) )
y(t) = x(t) + (t−1)/(t+2) (x(t) − x(t−1))
Convergence analysis: if the f_i's are differentiable, the ∇f_i's are Lipschitz continuous, and X is compact,
f(x_i(t)) − f* = O( 1/ρ + 1/t² + ρ/t² )
For optimized ρ, O(1/ε) iterations suffice to reach ε-suboptimality.
D. Jakovetić et al., Distributed Nesterov-like gradient algorithms, IEEE 51st Annual Conference on Decision and Control (CDC), 2012
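A matching Python/NumPy sketch of the D-NG iteration with constant stepsize; as before, the losses, weights, constraint set, and stepsize are assumed toy values:

```python
import numpy as np

def dng_constant(grads, W, proj, x0, alpha, iters):
    """D-NG with constant stepsize: mix, take a gradient step, add Nesterov momentum."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for t in range(1, iters + 1):
        g = np.array([grads[i](y[i]) for i in range(len(y))])
        x = proj(W @ y - alpha * g)                    # x(t) = P( W y(t-1) - alpha * grad F(y(t-1)) )
        y = x + (t - 1.0) / (t + 2.0) * (x - x_prev)   # y(t) = x(t) + (t-1)/(t+2) (x(t) - x(t-1))
        x_prev = x
    return x_prev

# same assumed toy setup as in the subgradient sketch above
theta = np.array([1.0, 2.0, 3.0])
grads = [lambda x, th=th: 2.0 * (x - th) for th in theta]
W = np.array([[2/3, 1/3, 0.0], [1/3, 1/3, 1/3], [0.0, 1/3, 2/3]])
proj = lambda v: np.clip(v, -10.0, 10.0)
print(dng_constant(grads, W, proj, np.zeros(3), alpha=0.05, iters=1000))
```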
9 Proof
Step 1: plug in Nesterov's classical analysis
‖∇f_i(x) − ∇f_i(y)‖ ≤ L ‖x − y‖ implies ‖∇Ψ_ρ(x) − ∇Ψ_ρ(y)‖ ≤ L_ρ ‖x − y‖ for L_ρ = L + 2ρ λ_max(L)
notation: Ψ_ρ* := inf_{x ∈ X^n} Ψ_ρ(x) is attained at x_ρ*
with α := 1/L_ρ, the classical analysis yields
Ψ_ρ(x(t)) − Ψ_ρ* ≤ L_ρ ‖x(0) − x_ρ*‖² / t² ≤ L_ρ B / t²   (1)
for some B ≥ 0 (since X is compact)
10 Step 2: relate f(x_i) to Ψ_ρ(x), x = (x_1, ..., x_n)
f(x_i) = Σ_{j=1}^n f_j(x_i)
       = Σ_{j=1}^n f_j(x_j) + ρ x^⊤ L x   [= Ψ_ρ(x)]
         + Σ_{j=1}^n ( f_j(x_i) − f_j(x_j) ) − ρ x^⊤ L x   [= Δ(x)]   (2)
Step 3: upper bound Δ(x) ≤ C/ρ for some C ≥ 0
use Lipschitz continuity of the f_j's to obtain
Σ_{j=1}^n ( f_j(x_i) − f_j(x_j) ) ≤ G Σ_{j=1}^n |x_i − x_j|   (3)
for some G ≥ 0
11 Since L1 = 0,
x^⊤ L x = (x − x_i 1)^⊤ L (x − x_i 1)   (4)
Combine (3) and (4) to obtain
Δ(x) ≤ G ‖x − x_i 1‖_1 − ρ (x − x_i 1)^⊤ L (x − x_i 1) = G ‖x̃‖_1 − ρ x̃^⊤ L̃ x̃
x̃ is x − x_i 1 with the ith entry removed
L̃ is L with the ith row and ith column removed
it is easy to see that L̃ is positive definite (network is connected)
12 It follows that
Δ(x) ≤ max_y { G ‖y‖_1 − ρ y^⊤ L̃ y }
     = max_y max_{‖z‖_∞ ≤ 1} { G z^⊤ y − ρ y^⊤ L̃ y }
     = max_{‖z‖_∞ ≤ 1} max_y { G z^⊤ y − ρ y^⊤ L̃ y }
     = (1/ρ) (G²/4) max_{‖z‖_∞ ≤ 1} z^⊤ L̃^{−1} z =: C/ρ   (5)
Use Ψ_ρ* ≤ f*, and combine (1), (2) with (5) to conclude
f(x_i(t)) − f* ≤ (L + 2ρ λ_max(L)) B / t² + C/ρ = O( 1/ρ + 1/t² + ρ/t² )
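For the inner maximization in (5) (a concave quadratic in y for each fixed z), setting the gradient to zero at y = (G/(2ρ)) L̃^{−1} z gives
\[
\max_{y}\ \bigl\{ G\, z^\top y - \rho\, y^\top \tilde L y \bigr\}
= \frac{G^2}{4\rho}\, z^\top \tilde L^{-1} z ,
\]
which is how the constant C = (G²/4) max_{‖z‖_∞ ≤ 1} z^⊤ L̃^{−1} z arises.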
13 Numerical example: distributed logistic regression
minimize over (s, r):  Σ_{i=1}^n Σ_{j=1}^5 φ( −b_ij (s^⊤ a_ij + r) )   (the inner sum over j is f_i(s, r))
{(a_ij, b_ij) ∈ R³ × R : j = 1, ..., 5}: training data for agent i
φ(t) = log(1 + e^t)
geometric graph, n = 20 nodes and 86 edges
14 Constant stepsize
e_f(t) := (1/n) Σ_{i=1}^n ( f(x_i(t)) − f* ) / f*
[Plot: e_f versus iteration number k]
Red: D-NG [1], Blue: subgradient [2], Black: dual averaging [3]
[1] D. Jakovetić et al., Distributed Nesterov-like gradient algorithms, IEEE 51st Annual Conference on Decision and Control (CDC), 2012
[2] A. Nedić and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE TAC, 54(1), 2009
[3] J. Duchi et al., Dual averaging for distributed optimization: convergence and network scaling, IEEE TAC, 57(3), 2012
15 Diminishing stepsize
e_f(t) := (1/n) Σ_{i=1}^n ( f(x_i(t)) − f* ) / f*
[Plot: e_f versus iteration number k]
D-NG: α(t) = 1/t
Subgradient, dual averaging: α(t) = 1/√t
16 Distributed Nesterov gradient method (D-NG) with diminishing stepsize
Unconstrained problem:
minimize  f(x) := f_1(x) + f_2(x) + ... + f_n(x),  x ∈ R^d
Diminishing stepsize α(t) = c/(t+1)
x(t) = W y(t−1) − α(t−1) ∇F(y(t−1))   (6)
y(t) = x(t) + (t−1)/(t+2) (x(t) − x(t−1))   (7)
D. Jakovetić et al., Fast cooperative distributed learning, 46th Asilomar Conference on Signals, Systems and Computers (ASILOMAR), 2012
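For the unconstrained, diminishing-stepsize variant in (6)-(7), the only changes to the sketch above are the stepsize schedule α(t) = c/(t+1) and the absence of the projection; c is an assumed tuning constant:

```python
import numpy as np

def dng_diminishing(grads, W, x0, c, iters):
    """D-NG with diminishing stepsize alpha(t) = c/(t+1), unconstrained case (equations (6)-(7))."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for t in range(1, iters + 1):
        alpha = c / t                                  # alpha(t-1) = c / ((t-1) + 1)
        g = np.array([grads[i](y[i]) for i in range(len(y))])
        x = W @ y - alpha * g                          # (6)
        y = x + (t - 1.0) / (t + 2.0) * (x - x_prev)   # (7)
        x_prev = x
    return x_prev
```

With the toy setup above it can be called as, e.g., dng_diminishing(grads, W, np.zeros(3), c=0.5, iters=2000).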
17 Convergence analysis: if the f_i's are differentiable, the ∇f_i's are bounded and L-Lipschitz continuous, and W is symmetric, stochastic and positive definite,
f(x_i(t)) − f* = O( log t / t )
dependence on the network spectral gap 1 − λ₂(W) is also known
agents may ignore L and λ₂(W)
Sample of related work (different assumptions):
K. Tsianos and M. Rabbat, Distributed strongly convex optimization, 50th Allerton Conference on Communication, Control and Computing, 2012
E. Ghadimi et al., Accelerated gradient methods for networked optimization, arXiv preprint, 2012
18 Sketch of proof
Step 1: look at the network averages
x̄(t) := (1/n) Σ_{i=1}^n x_i(t),  ȳ(t) := (1/n) Σ_{i=1}^n y_i(t)
From (6) and (7),
x̄(t) = ȳ(t−1) − (α(t−1)/n) Σ_{i=1}^n ∇f_i(y_i(t−1))   [α(t−1)/n =: 1/L(t−1);  Σ_i ∇f_i(y_i(t−1)) ≈ ∇f(ȳ(t−1))]
ȳ(t) = x̄(t) + (t−1)/(t+2) (x̄(t) − x̄(t−1))
Interpretation: inexact Nesterov gradient method
19 Using ideas from optimization with inexact oracles [4],
f(x̄(t)) − f* ≤ (n/(c t²)) ‖x̄(0) − x*‖² + (L/t²) Σ_{s=0}^{t−1} ((s+2)²/(s+1)) ‖δ_y(s)‖²   (8)
where δ_y(t) := y(t) − ȳ(t)1 = ( y_1(t) − ȳ(t), ..., y_n(t) − ȳ(t) )
[4] O. Devolder et al., First-order methods of smooth convex optimization with inexact oracle, submitted, Mathematical Programming, 2011
20 Step 2: show ‖δ_y(t)‖ = O(1/t)
Rewrite (6) and (7) as the time-varying linear system
[ δ_x(t) ; δ_x(t−1) ] = A(t) [ δ_x(t−1) ; δ_x(t−2) ] + u(t−1)   (9)
where
A(t) := [ ((2t−1)/(t+1)) W̃   −((t−2)/(t+1)) W̃ ; I   0 ],   u(t) := [ −α(t) (I − J) ∇F(y(t)) ; 0 ]
δ_x(t) := x(t) − x̄(t)1,  J := (1/n) 1 1^⊤,  W̃ := W − J
‖u(t)‖ = O(1/t) due to α(t) = c/(t+1) and the bounded gradient assumption
21 Fact: if x(t) = λ x(t−1) + O(1/t) with |λ| < 1, then x(t) = O(1/t)   (10)
Hand-waving argument: upon approximating
A(t) ≈ [ 2W̃   −W̃ ; I   0 ]
and diagonalizing, system (9) reduces to (10)
Since ‖δ_x(t)‖ = O(1/t),
δ_y(t) = δ_x(t) + (t−1)/(t+2) ( δ_x(t) − δ_x(t−1) ) = O(1/t)
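A quick sanity check of this fact, assuming for simplicity 0 ≤ λ < 1, c ≥ 0 and x(t) = λ x(t−1) + c/t: unrolling the recursion and splitting the sum at s = t/2,
\[
x(t) = \lambda^t x(0) + \sum_{s=1}^{t} \lambda^{t-s}\,\frac{c}{s}
\;\le\; \lambda^t x(0) + \lambda^{t/2}\, c \sum_{s=1}^{\lfloor t/2 \rfloor}\frac{1}{s} + \frac{2c}{t}\sum_{k \ge 0}\lambda^{k}
\;=\; O\!\left(\frac{1}{t}\right),
\]
since the first two terms decay geometrically (up to a log factor) and the last equals 2c/((1−λ) t).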
22 Step 3: relate f(x_i) to f(x̄)
f(x_i) = Σ_{j=1}^n f_j(x_i) = Σ_{j=1}^n f_j(x̄) + Σ_{j=1}^n ( f_j(x_i) − f_j(x̄) ) = f(x̄) + Δ(x)   (11)
By the bounded gradient assumption,
Δ(x) = Σ_{j=1}^n ( f_j(x_i) − f_j(x̄) ) ≤ G n |x_i − x̄| ≤ G n ‖δ_x‖   (12)
Combine (8), (11) and (12) to conclude
f(x_i(t)) − f* = O( log t / t )
23 Numerical example: acoustic source localization in sensor networks
agent i measures y_i = 1/‖x − r_i‖² + noise, where r_i = position of agent i
goal: determine the source position x
convex approach [5]:
minimize_x  Σ_{i=1}^n dist²(x, C_i)
with C_i = { x : ‖x − r_i‖² ≤ 1/y_i }
geometric graph, n = 70 nodes and 99 edges
[5] A. O. Hero and D. Blatt, Sensor network source localization via projection onto convex sets (POCS), IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2005
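To make the convex reformulation concrete, here is a small sketch of one local term dist²(x, C_i) for the ball C_i = {z : ‖z − r_i‖ ≤ 1/√y_i}, together with its gradient 2(x − P_{C_i}(x)); the sensor position and measurement are made-up test values:

```python
import numpy as np

def dist2_ball(x, center, radius):
    """Squared Euclidean distance from x to the ball {z : ||z - center|| <= radius}
    and its gradient 2 * (x - P_ball(x))."""
    d = x - center
    nrm = np.linalg.norm(d)
    if nrm <= radius:                      # x already inside the ball: distance 0, gradient 0
        return 0.0, np.zeros_like(x)
    proj = center + radius * d / nrm       # projection onto the ball's boundary
    return (nrm - radius) ** 2, 2.0 * (x - proj)

# made-up test values: sensor at r_i, measurement y_i -> radius 1/sqrt(y_i)
r_i, y_i = np.array([0.3, 0.8]), 4.0
val, grad = dist2_ball(np.array([1.5, 0.8]), r_i, 1.0 / np.sqrt(y_i))
print(val, grad)                           # 0.49 and [1.4, 0.0] for these values
```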
24 e_f(t) := (1/n) Σ_{i=1}^n ( f(x_i(t)) − f* ) / f*
[Plot: avg. relative error (log₁₀) versus iteration number k (log₁₀)]
Red: D-NG [6], α(t) = 1/(t+1)
Others: distributed subgradient [7] with α(t) = 1/(t+1)^{1/2}, 1/(t+1)^{1/3}, 1/(t+1)^{1/10}, and 1/(t+1)
[6] D. Jakovetić et al., Fast cooperative distributed learning, 46th Asilomar Conference on Signals, Systems and Computers (ASILOMAR), 2012
[7] A. Nedić and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE TAC, 54(1), 2009
25 Numerical example: distributed regularized logistic regression
minimize over (s, r):  Σ_{i=1}^n [ φ( −b_i (s^⊤ a_i + r) ) + β ‖s‖² ]   (each bracketed term is f_i(s, r))
(a_i, b_i): training data for agent i
φ(t) = log(1 + e^t)
geometric graph, n = 20 nodes and 67 edges
26 e_f(t) := (1/n) Σ_{i=1}^n ( f(x_i(t)) − f* ) / f*
[Plot: avg. relative error (log₁₀) versus iteration number k (log₁₀)]
Red: D-NG, α(t) = 1/(t+1)
Others: distributed subgradient with α(t) = 1/(t+1)^γ, γ ∈ {1, 1/2, 1/3, ...}
27 Distributed Nesterov gradient method with consensus iterations (D-NC)
x(t) = W^{a(t)} ( y(t−1) − α ∇F(y(t−1)) )
y(t) = W^{b(t)} ( x(t) + (t−1)/(t+2) (x(t) − x(t−1)) )
a(t) = ⌈ log t / (−log λ₂(W)) ⌉  and  b(t) = ⌈ (log 3 + log t) / (−log λ₂(W)) ⌉
λ₂(W) must be known by all agents
D. Jakovetić et al., Fast distributed gradient methods, arXiv preprint, 2011
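A sketch of D-NC in the same Python/NumPy style; the consensus schedules a(t) and b(t) are passed in by the caller as functions returning positive integers (e.g. growing like log t), and the losses and weights are again assumed:

```python
import numpy as np

def dnc(grads, W, x0, alpha, iters, a, b):
    """Sketch of D-NC: each outer iteration t runs a(t) and b(t) consensus rounds
    (repeated multiplication by W) around the gradient and momentum steps."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for t in range(1, iters + 1):
        g = np.array([grads[i](y[i]) for i in range(len(y))])
        x = np.linalg.matrix_power(W, a(t)) @ (y - alpha * g)
        y = np.linalg.matrix_power(W, b(t)) @ (x + (t - 1.0) / (t + 2.0) * (x - x_prev))
        x_prev = x
    return x_prev
```

In a real implementation each multiplication by W is one communication round with the neighbors, which is where the O(t log t) communication count quoted on the next slide comes from when a(t) and b(t) grow logarithmically.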
28 Convergence analysis: if the f_i's are differentiable, the ∇f_i's are bounded and L-Lipschitz continuous, W is symmetric and stochastic, and α = 1/(2L),
f(x_i(t)) − f* = O( 1/t² )
t iterations involve O(t log t) communication rounds
dependence on the network spectral gap is also known
Similar rate guarantees in:
A. Chen and A. Ozdaglar, A fast distributed proximal gradient method, 50th Allerton Conference on Communication, Control and Computing, 2012
29 Distributed optimization via ADMM (D-ADMM)
ADMM = Alternating Direction Method of Multipliers
Recent review: S. Boyd et al., Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 2011
[Figure: star network]
30 ADMM for distributed optimization:
I. Schizas et al., Consensus in ad hoc WSNs with noisy links - part I: distributed estimation of deterministic signals, IEEE TSP, 56(1), 2008
many others
This talk: generic networks
J. Mota et al., D-ADMM: a communication-efficient distributed algorithm for separable optimization, IEEE TSP, 61(10), 2013
J. Mota et al., Distributed optimization with local domains: applications in MPC and network flows, arXiv preprint, 2013
31 Illustrative example
[Figure: 3-node network; node 1 holds f_1(x_1, x_2), node 2 holds f_2(x_2), node 3 holds f_3(x_1, x_2)]
minimize over x_1, x_2:  f_1(x_1, x_2) + f_2(x_2) + f_3(x_1, x_2)
the f_i's are proper, closed, convex functions with values in R ∪ {+∞}
the f_i's may depend on different subsets of the variables
32 minimize over x_1, x_2:  f_1(x_1, x_2) + f_2(x_2) + f_3(x_1, x_2)
Roadmap to obtain a distributed algorithm
Step 1: split each variable across the associated nodes
node 1: x_1^(1), x_2^(1);  node 2: x_2^(2);  node 3: x_1^(3), x_2^(3)
Step 2: introduce consistency constraints
minimize  f_1( x_1^(1), x_2^(1) ) + f_2( x_2^(2) ) + f_3( x_1^(3), x_2^(3) )
subject to  x_1^(1) = x_1^(3),  x_2^(1) = x_2^(3),  x_2^(1) = x_2^(2)
33 Step 3: pass to the augmented Lagrangian dual
maximize over λ_1^(1,3), λ_2^(1,3), λ_2^(1,2):  inf_x L_ρ( x; λ_1^(1,3), λ_2^(1,3), λ_2^(1,2) )
with
L_ρ( x; λ_1^(1,3), λ_2^(1,3), λ_2^(1,2) )
  = f_1( x_1^(1), x_2^(1) ) + f_2( x_2^(2) ) + f_3( x_1^(3), x_2^(3) )
  + ⟨ λ_1^(1,3), x_1^(1) − x_1^(3) ⟩ + (ρ/2) ‖ x_1^(1) − x_1^(3) ‖²
  + ⟨ λ_2^(1,3), x_2^(1) − x_2^(3) ⟩ + (ρ/2) ‖ x_2^(1) − x_2^(3) ‖²
  + ⟨ λ_2^(1,2), x_2^(1) − x_2^(2) ⟩ + (ρ/2) ‖ x_2^(1) − x_2^(2) ‖²
34 Step 4: color the network
[Figure: node 1 receives one color; nodes 2 and 3, which are not adjacent, share the other color]
(the augmented Lagrangian L_ρ is the one on the previous slide)
35 Step 5: apply the extended ADMM
Primal update at node 1:
( x_1^(1), x_2^(1) )(t+1) = argmin_{x_1, x_2}  f_1(x_1, x_2)
  + ⟨ λ_1^(1,3)(t), x_1 ⟩ + (ρ/2) ‖ x_1 − x_1^(3)(t) ‖²
  + ⟨ λ_2^(1,3)(t), x_2 ⟩ + (ρ/2) ‖ x_2 − x_2^(3)(t) ‖²
  + ⟨ λ_2^(1,2)(t), x_2 ⟩ + (ρ/2) ‖ x_2 − x_2^(2)(t) ‖²
36 Primal update at node 2:
x_2^(2)(t+1) = argmin_{x_2}  f_2(x_2) − ⟨ λ_2^(1,2)(t), x_2 ⟩ + (ρ/2) ‖ x_2 − x_2^(1)(t+1) ‖²
Primal update at node 3:
( x_1^(3), x_2^(3) )(t+1) = argmin_{x_1, x_2}  f_3(x_1, x_2)
  − ⟨ λ_1^(1,3)(t), x_1 ⟩ + (ρ/2) ‖ x_1 − x_1^(1)(t+1) ‖²
  − ⟨ λ_2^(1,3)(t), x_2 ⟩ + (ρ/2) ‖ x_2 − x_2^(1)(t+1) ‖²
Key point: nodes 2 and 3 work in parallel (same color)
37 Step 6: dual update at all relevant nodes
λ_1^(1,3)(t+1) = λ_1^(1,3)(t) + ρ ( x_1^(1)(t+1) − x_1^(3)(t+1) )
λ_2^(1,3)(t+1) = λ_2^(1,3)(t) + ρ ( x_2^(1)(t+1) − x_2^(3)(t+1) )
λ_2^(1,2)(t+1) = λ_2^(1,2)(t) + ρ ( x_2^(1)(t+1) − x_2^(2)(t+1) )
Convergence analysis:
for 2 colors (bipartite network): classical ADMM results apply
for 3 or more colors [8]: convergence for strongly convex functions and suitable ρ
[8] D. Han and X. Yuan, A note on the alternating direction method of multipliers, JOTA, 155(1), 2012
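To see the whole color-ordered loop in one place, here is a runnable sketch of Steps 5-6 for this three-node example with made-up quadratic terms f_1(x_1,x_2) = (x_1−1)² + (x_2−2)², f_2(x_2) = (x_2−4)², f_3(x_1,x_2) = (x_1−3)² + x_2², so every argmin has a closed form obtained by setting the gradient to zero:

```python
rho = 1.0
x1_1 = x2_1 = x2_2 = x1_3 = x2_3 = 0.0       # per-node copies x_k^(i)
lam13_1 = lam13_2 = lam12_2 = 0.0            # duals for x1^(1)=x1^(3), x2^(1)=x2^(3), x2^(1)=x2^(2)

for t in range(200):
    # first color: node 1 minimizes its augmented-Lagrangian terms (closed form for quadratics)
    x1_1 = (2*1 - lam13_1 + rho*x1_3) / (2 + rho)
    x2_1 = (2*2 - lam13_2 - lam12_2 + rho*(x2_3 + x2_2)) / (2 + 2*rho)
    # second color: nodes 2 and 3 update in parallel, using node 1's fresh values
    x2_2 = (2*4 + lam12_2 + rho*x2_1) / (2 + rho)
    x1_3 = (2*3 + lam13_1 + rho*x1_1) / (2 + rho)
    x2_3 = (lam13_2 + rho*x2_1) / (2 + rho)
    # Step 6: dual updates on the three consistency constraints
    lam13_1 += rho * (x1_1 - x1_3)
    lam13_2 += rho * (x2_1 - x2_3)
    lam12_2 += rho * (x2_1 - x2_2)

print(x1_1, x2_1)   # approaches x1 = 2, x2 = 2, the minimizer of f1 + f2 + f3
```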
38 Numerical example: consensus
minimize_x  Σ_{i=1}^n (x − θ_i)²   (f_i(x) = (x − θ_i)²)
θ_i = measurement of agent i  (θ_i i.i.d. N(10, 10⁴))
Network  Model (parameters)        # Colors
1        Erdős-Rényi (0.1)          5
2        Watts-Strogatz (4, 0.4)    4
3        Barabási-Albert            3
4        Geometric (0.3)            10
5        Lattice (5 × 10)           2
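For reference, the minimizer of this separable quadratic is just the average of the measurements, which is why the problem is called consensus:
\[
x^\star = \arg\min_x \sum_{i=1}^n (x-\theta_i)^2 = \frac{1}{n}\sum_{i=1}^n \theta_i .
\]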
39 Communication steps
[Plot: communication steps versus network number, comparing D-ADMM [9] with methods A [10], B [11], C [12]]
[9] J. Mota et al., D-ADMM: a communication-efficient distributed algorithm for separable optimization, IEEE TSP, 61(10), 2013
[10] I. Schizas et al., Consensus in ad hoc WSNs with noisy links - part I: distributed estimation of deterministic signals, IEEE TSP, 56(1), 2008
[11] H. Zhu et al., Distributed in-network channel decoding, IEEE TSP, 57(10), 2009
[12] B. Oreshkin et al., Optimization and analysis of distributed averaging with short node memory, IEEE TSP, 58(5), 2010
40 Numerical example: LASSO
minimize_x ‖x‖₁ subject to ‖Ax − b‖ ≤ σ
Column partition: node i holds A_i ∈ R^{200×20}
After regularization and dualization:
minimize_λ  Σ_{i=1}^n f_i(λ),   f_i(λ) := (1/n)( b^⊤λ + σ‖λ‖ ) − inf_{x_i} { ‖x_i‖₁ + λ^⊤ A_i x_i + δ ‖x_i‖² }
41 Communication steps
[Plot: communication steps versus network number, comparing D-ADMM [13] with methods A [14], B [15]]
[13] J. Mota et al., D-ADMM: a communication-efficient distributed algorithm for separable optimization, IEEE TSP, 61(10), 2013
[14] I. Schizas et al., Consensus in ad hoc WSNs with noisy links - part I: distributed estimation of deterministic signals, IEEE TSP, 56(1), 2008
[15] H. Zhu et al., Distributed in-network channel decoding, IEEE TSP, 57(10), 2009
42 What if a variable is not connected?
Example: variable x_2 is not connected
[Figure: node 1 holds f_1(x_1), node 2 holds f_2(x_2), node 3 holds f_3(x_1, x_2); the nodes that use x_2 do not form a connected subgraph]
Propagate the variable across a Steiner tree
J. Mota et al., Distributed optimization with local domains: applications in MPC and network flows, arXiv preprint, 2013
43 Thank you!