Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables


Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
2014 Workshop on Optimization for Modern Computation, BICMR, Beijing, China
September 2, 2014

Outline
ADMM for N = 2
Existing work on ADMM for N ≥ 3
Convergence rates of ADMM for N ≥ 3
BSUM-M

Alternating Direction Method of Multipliers (ADMM)
Convex optimization:
    min  f_1(x_1) + f_2(x_2) + ... + f_N(x_N)
    s.t. A_1 x_1 + A_2 x_2 + ... + A_N x_N = b,  x_j ∈ X_j, j = 1, 2, ..., N,
where each f_j is a closed convex function and each X_j is a closed convex set.
Augmented Lagrangian function:
    L_γ(x_1, ..., x_N; λ) := Σ_{j=1}^N f_j(x_j) − ⟨λ, Σ_{j=1}^N A_j x_j − b⟩ + (γ/2) ‖Σ_{j=1}^N A_j x_j − b‖²

Augmented Lagrangian function:
    L_γ(x_1, ..., x_N; λ) := Σ_{j=1}^N f_j(x_j) − ⟨λ, Σ_{j=1}^N A_j x_j − b⟩ + (γ/2) ‖Σ_{j=1}^N A_j x_j − b‖²
Multi-block ADMM:
    x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k, ..., x_N^k; λ^k)
    x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2, x_3^k, ..., x_N^k; λ^k)
    ...
    x_N^{k+1} := argmin_{x_N ∈ X_N} L_γ(x_1^{k+1}, x_2^{k+1}, ..., x_{N−1}^{k+1}, x_N; λ^k)
    λ^{k+1} := λ^k − γ (Σ_{j=1}^N A_j x_j^{k+1} − b)
Update the primal variables in a Gauss-Seidel manner.
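
To make the loop above concrete, here is a minimal numerical sketch of the Gauss-Seidel sweep, under the simplifying assumption that each f_j(x_j) = (σ_j/2)‖x_j‖² and X_j = R^{n_j}, so every subproblem has a closed-form solution; the data and parameters in the code are illustrative placeholders, not values from the talk.

```python
import numpy as np

def multi_block_admm(A_list, b, sigmas, gamma=1.0, iters=500):
    """Direct Gauss-Seidel extension of ADMM for
       min sum_j (sigma_j/2)*||x_j||^2  s.t.  sum_j A_j x_j = b,
    with quadratic blocks so each subproblem is solvable in closed form.
    Note: for N >= 3 this direct extension need not converge in general
    (see the counterexample later in the talk)."""
    N = len(A_list)
    x = [np.zeros(A.shape[1]) for A in A_list]
    lam = np.zeros(b.shape[0])
    for _ in range(iters):
        for j in range(N):
            # residual of the coupling constraint with block j removed
            # (blocks i < j already hold their new k+1 values: Gauss-Seidel)
            r = sum(A_list[i] @ x[i] for i in range(N) if i != j) - b
            # argmin_{x_j} (sigma_j/2)||x_j||^2 - <lam, A_j x_j> + (gamma/2)||A_j x_j + r||^2
            H = sigmas[j] * np.eye(A_list[j].shape[1]) + gamma * A_list[j].T @ A_list[j]
            x[j] = np.linalg.solve(H, A_list[j].T @ (lam - gamma * r))
        # dual update: lambda^{k+1} = lambda^k - gamma * (sum_j A_j x_j^{k+1} - b)
        lam = lam - gamma * (sum(A_list[j] @ x[j] for j in range(N)) - b)
    return x, lam
```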

ADMM for N = 2
    x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k; λ^k)
    x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2; λ^k)
    λ^{k+1} := λ^k − γ (A_1 x_1^{k+1} + A_2 x_2^{k+1} − b)
Long history, going back to variational methods for PDEs in the 1950s; related to the Douglas-Rachford and Peaceman-Rachford operator splitting methods for finding a zero of monotone operators: find x such that 0 ∈ A(x) + B(x).
Revisited recently for sparse optimization [Wang-Yang-Yin-Zhang-2008] [Goldstein-Osher-2009] [Boyd-et-al-2011].

Global Convergence of ADMM for N = 2
ADMM for N = 2:
    x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k; λ^k)
    x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2; λ^k)
    λ^{k+1} := λ^k − γ (A_1 x_1^{k+1} + A_2 x_2^{k+1} − b)
Global convergence for any γ > 0 (Fortin-Glowinski-1983; Gabay-1983; Glowinski-Le Tallec-1989; Eckstein-Bertsekas-1992).
ADMM for N = 2 with fixed dual step size:
    x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k; λ^k)
    x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2; λ^k)
    λ^{k+1} := λ^k − αγ (A_1 x_1^{k+1} + A_2 x_2^{k+1} − b),
where α > 0 is a fixed dual step size. Global convergence for any γ > 0 and α ∈ (0, (1+√5)/2).

Sublinear Convergence of ADMM for N = 2
Ergodic O(1/k) convergence (He-Yuan-2012)
Non-ergodic O(1/k) convergence (He-Yuan-2012)
Ergodic O(1/k) convergence (Monteiro-Svaiter-2013)

Linear Convergence Rate of ADMM for N = 2
The Douglas-Rachford splitting method converges linearly if B is coercive and Lipschitz (Lions-Mercier-1979).
Linear convergence for solving linear programs (Eckstein-Bertsekas-1990).
Linear convergence for quadratic programs (Han-Yuan-2013; Boley-2013).

Generalized ADMM
Generalized ADMM for N = 2 (Deng-Yin-2012):
    x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k; λ^k) + (1/2) ‖x_1 − x_1^k‖²_P
    x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2; λ^k) + (1/2) ‖x_2 − x_2^k‖²_Q
    λ^{k+1} := λ^k − αγ (A_1 x_1^{k+1} + A_2 x_2^{k+1} − b)
One sufficient condition guaranteeing global linear convergence: P = Q = 0, α = 1, f_2 strongly convex, ∇f_2 Lipschitz continuous, A_2 full row rank.

ADMM for N ≥ 3: a counterexample
A negative result (Chen-He-Ye-Yuan-2013): the direct extension of multi-block ADMM is not necessarily convergent.
A counterexample: solve A_1 x_1 + A_2 x_2 + A_3 x_3 = 0, where
    A = (A_1, A_2, A_3) = [1 1 1; 1 1 2; 1 2 2].
The update of multi-block ADMM with γ = 1 is
    [3 0 0 0 0 0] [x_1^{k+1}]   [0 −4 −5 1 1 1] [x_1^k]
    [4 6 0 0 0 0] [x_2^{k+1}]   [0  0 −7 1 1 2] [x_2^k]
    [5 7 9 0 0 0] [x_3^{k+1}] = [0  0  0 1 2 2] [x_3^k]
    [1 1 1 1 0 0] [λ_1^{k+1}]   [0  0  0 1 0 0] [λ_1^k]
    [1 1 2 0 1 0] [λ_2^{k+1}]   [0  0  0 0 1 0] [λ_2^k]
    [1 2 2 0 0 1] [λ_3^{k+1}]   [0  0  0 0 0 1] [λ_3^k]

ADMM for N ≥ 3: a counterexample
Equivalently,
    (x_2^{k+1}, x_3^{k+1}, λ^{k+1})^T = M (x_2^k, x_3^k, λ^k)^T,  where
    M = (1/162) [ 144   −9   −9   −9   18
                    8  157   −5   13   −8
                   64  122  122  −58  −64
                   56  −35  −35   91  −56
                  −88  −26  −26  −62   88 ].
Note that ρ(M) > 1.
Theorem (Chen-He-Ye-Yuan-2013). There exists an example for which the direct extension of ADMM with three blocks fails to converge from some real initial point, for any choice of γ > 0.
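
The divergence claim can be checked numerically. The sketch below (a verification aid, not part of the original slides) builds the two 6×6 matrices of the update displayed above, eliminates x_1, and prints the spectral radius of the resulting 5×5 iteration matrix M.

```python
import numpy as np

# ADMM update with gamma = 1 for the counterexample, unknowns ordered as
# (x1, x2, x3, lambda_1, lambda_2, lambda_3):  L z^{k+1} = R z^k
L = np.array([[3, 0, 0, 0, 0, 0],
              [4, 6, 0, 0, 0, 0],
              [5, 7, 9, 0, 0, 0],
              [1, 1, 1, 1, 0, 0],
              [1, 1, 2, 0, 1, 0],
              [1, 2, 2, 0, 0, 1]], dtype=float)
R = np.array([[0, -4, -5, 1, 1, 1],
              [0,  0, -7, 1, 1, 2],
              [0,  0,  0, 1, 2, 2],
              [0,  0,  0, 1, 0, 0],
              [0,  0,  0, 0, 1, 0],
              [0,  0,  0, 0, 0, 1]], dtype=float)

T = np.linalg.solve(L, R)    # full update map; x1 does not feed back (first column of R is 0)
M = T[1:, 1:]                # iteration matrix acting on (x2, x3, lambda)
print(np.round(162 * M))     # matches the matrix shown above
print("spectral radius:", max(abs(np.linalg.eigvals(M))))   # > 1, hence divergence
```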

ADMM for N ≥ 3: Strong convexity?
    min  0.05 x_1² + 0.05 x_2² + 0.05 x_3²
    s.t. [1 1 1; 1 1 2; 1 2 2] (x_1, x_2, x_3)^T = 0.
For γ = 1, ρ(M) = 1.0087 > 1, so one can find an initial point from which ADMM diverges.
Even for strongly convex programming, the extended ADMM is not necessarily convergent for a certain γ > 0.

ADMM for N ≥ 3: Strong convexity works!
Global convergence. Theorem (Han-Yuan-2012). If f_i, i = 1, ..., N, are strongly convex with parameters σ_i, and
    0 < γ < min_{i=1,...,N} { 2σ_i / (3(N−1) λ_max(A_i^T A_i)) },
then multi-block ADMM globally converges.
Convergence rate?
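
As a worked example of evaluating this stepsize bound, one can compute it for the 3×3 coupling matrix of the earlier counterexample; the strong-convexity moduli σ_i below are illustrative placeholders, not values from the talk.

```python
import numpy as np

A = [np.array([[1.], [1.], [1.]]),   # A_1
     np.array([[1.], [1.], [2.]]),   # A_2
     np.array([[1.], [2.], [2.]])]   # A_3
sigma = [0.1, 0.1, 0.1]              # illustrative strong-convexity parameters
N = len(A)

# Han-Yuan bound: 0 < gamma < min_i 2*sigma_i / (3*(N-1)*lambda_max(A_i^T A_i))
bound = min(2 * s / (3 * (N - 1) * np.linalg.eigvalsh(Ai.T @ Ai).max())
            for s, Ai in zip(sigma, A))
print("gamma must satisfy 0 < gamma <", bound)
```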

ADMM for N ≥ 3: weaker condition and convergence rate
Define u := (x_1, ..., x_N), x̄_i^t := (1/(t+1)) Σ_{k=0}^t x_i^{k+1} for 1 ≤ i ≤ N, and λ̄^t := (1/(t+1)) Σ_{k=0}^t λ^{k+1}.
Theorem (Lin-Ma-Zhang-2014a). If f_2, ..., f_N are strongly convex, f_1 is convex, and
    γ ≤ min { min_{2 ≤ i ≤ N−1} 2σ_i / ((2N−i)(i−1) λ_max(A_i^T A_i)),  2σ_N / ((N−2)(N+1) λ_max(A_N^T A_N)) },
then f(ū^t) − f(u*) = O(1/t) and ‖Σ_{i=1}^N A_i x̄_i^t − b‖ = O(1/t).
Weaker condition; ergodic O(1/t) convergence rate in terms of objective value and primal feasibility.

ADMM for N ≥ 3: non-ergodic convergence rate
Optimality measure: if A_2 x_2^{k+1} − A_2 x_2^k = 0, A_3 x_3^{k+1} − A_3 x_3^k = 0, and A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} − b = 0, then (x_1^{k+1}, x_2^{k+1}, x_3^{k+1}, λ^{k+1}) is optimal.
Define
    R_{k+1} := ‖A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} − b‖² + 2 ‖A_2 x_2^{k+1} − A_2 x_2^k‖² + 3 ‖A_3 x_3^{k+1} − A_3 x_3^k‖².
We can prove: R_k = o(1/k).
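
A small helper (a sketch with illustrative argument names) that evaluates this optimality measure from two consecutive iterates; driving R_k to zero certifies approximate optimality.

```python
import numpy as np

def residual_R(A1, A2, A3, b, x_new, x_prev):
    """R_{k+1} for the three-block case: primal feasibility plus the
    successive differences of A_2 x_2 and A_3 x_3 (with weights 2 and 3)."""
    x1, x2, x3 = x_new
    _, x2_prev, x3_prev = x_prev
    feas = A1 @ x1 + A2 @ x2 + A3 @ x3 - b
    return (np.linalg.norm(feas) ** 2
            + 2 * np.linalg.norm(A2 @ (x2 - x2_prev)) ** 2
            + 3 * np.linalg.norm(A3 @ (x3 - x3_prev)) ** 2)
```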

ADMM for N ≥ 3: non-ergodic convergence rate
Theorem (Lin-Ma-Zhang-2014a). If f_2 and f_3 are strongly convex, and
    γ ≤ min { σ_2 / (2 λ_max(A_2^T A_2)),  σ_3 / (2 λ_max(A_3^T A_3)) },
then Σ_{k=1}^∞ R_k < +∞ and R_k = o(1/k).

ADMM for N ≥ 3: non-ergodic convergence rate
Theorem (Lin-Ma-Zhang-2014a). If f_2, ..., f_N are strongly convex, and
    γ ≤ min { min_{2 ≤ i ≤ N−1} 2σ_i / ((2N−i)(i−1) λ_max(A_i^T A_i)),  2σ_N / ((N−2)(N+1) λ_max(A_N^T A_N)) },
then Σ_{k=1}^∞ R_k < +∞ and R_k = o(1/k), where
    R_{k+1} := ‖Σ_{i=1}^N A_i x_i^{k+1} − b‖² + Σ_{i=2}^N ((2N−i)(i−1)/2) ‖A_i x_i^k − A_i x_i^{k+1}‖².

ADMM for N ≥ 3: global linear convergence
Global linear convergence of ADMM for N ≥ 3 (Lin-Ma-Zhang-2014b):

Scenario | strongly convex | Lipschitz continuous gradient | full row rank | full column rank
1        | f_2, ..., f_N   | ∇f_N                          | A_N           | —
2        | f_1, ..., f_N   | ∇f_1, ..., ∇f_N               | —             | —
3        | f_2, ..., f_N   | ∇f_1, ..., ∇f_N               | —             | A_1

Table: Three scenarios leading to global linear convergence.
Reduces to the conditions in (Deng-Yin-2012) when N = 2.

Variants: Modified Proximal Jacobian ADMM (Deng-Lai-Peng-Yin-2014)
    x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k, ..., x_N^k; λ^k) + (1/2) ‖x_1 − x_1^k‖²_{P_1}
    x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^k, x_2, x_3^k, ..., x_N^k; λ^k) + (1/2) ‖x_2 − x_2^k‖²_{P_2}
    ...
    x_N^{k+1} := argmin_{x_N ∈ X_N} L_γ(x_1^k, x_2^k, ..., x_{N−1}^k, x_N; λ^k) + (1/2) ‖x_N − x_N^k‖²_{P_N}
    λ^{k+1} := λ^k − αγ (Σ_{j=1}^N A_j x_j^{k+1} − b)
Conditions for convergence:
    P_i ≻ γ(1/ε_i − 1) A_i^T A_i, i = 1, 2, ..., N,  and  Σ_{i=1}^N ε_i < 2 − α.
o(1/k) convergence rate in the non-ergodic sense.
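
For comparison with the Gauss-Seidel sketch given earlier, here is a minimal sketch of this Jacobian (fully parallel) variant, again under the assumption of quadratic blocks f_i(x_i) = (σ_i/2)‖x_i‖² and the simple choice P_i = τ_i I; the parameters τ_i, α, and γ are illustrative placeholders.

```python
import numpy as np

def prox_jacobian_admm(A_list, b, sigmas, taus, gamma=1.0, alpha=1.0, iters=500):
    """Proximal Jacobian ADMM: every block is updated from the SAME iterate x^k
    (hence parallelizable), with proximal term (1/2)||x_i - x_i^k||_{P_i}^2 for
    P_i = tau_i * I, and dual step lambda <- lambda - alpha*gamma*(A x^{k+1} - b)."""
    N = len(A_list)
    x = [np.zeros(A.shape[1]) for A in A_list]
    lam = np.zeros(b.shape[0])
    for _ in range(iters):
        x_old = [xi.copy() for xi in x]
        for i in range(N):
            # residual uses only the previous iterate x^k (Jacobi, not Gauss-Seidel)
            r = sum(A_list[j] @ x_old[j] for j in range(N) if j != i) - b
            H = (sigmas[i] + taus[i]) * np.eye(A_list[i].shape[1]) + gamma * A_list[i].T @ A_list[i]
            x[i] = np.linalg.solve(H, A_list[i].T @ (lam - gamma * r) + taus[i] * x_old[i])
        lam = lam - alpha * gamma * (sum(A_list[i] @ x[i] for i in range(N)) - b)
    return x, lam
```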

Variants: Modified Proximal Gauss-Seidel ADMM
    x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k, x_3^k; λ^k) + (1/2) ‖x_1 − x_1^k‖²_{P_1}
    x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2, x_3^k; λ^k) + (1/2) ‖x_2 − x_2^k‖²_{P_2}
    x_3^{k+1} := argmin_{x_3 ∈ X_3} L_γ(x_1^{k+1}, x_2^{k+1}, x_3; λ^k) + (1/2) ‖x_3 − x_3^k‖²_{P_3}
    λ^{k+1} := λ^k − αγ (Σ_{j=1}^3 A_j x_j^{k+1} − b)

Proximal Gauss-Seidel ADMM
Theorem (Lin-Ma-Zhang-2014c). Ergodic O(1/k) convergence rate in terms of both objective value and primal feasibility, under the conditions: f_3 is strongly convex, and
    P_1 ⪰ γ(3 + 5/ε_1) A_1^T A_1,
    P_2 ⪰ γ(1 + 3/ε_2) A_2^T A_2,
    P_3 ⪰ γ(1/ε_3 − 1) A_3^T A_3,
    3(ε_1 + ε_2 + ε_3) < 1 − α,
    γ < σ_3 ε_3 / (2(ε_3 + 1) λ_max(A_3^T A_3)).
Ongoing work; more coming soon.

The General Problem Formulation
We consider the following convex optimization problem:
    min  f(x) := g(x_1, ..., x_K) + Σ_{k=1}^K h_k(x_k)
    s.t. E_1 x_1 + E_2 x_2 + ... + E_K x_K = q,  x_k ∈ X_k, k = 1, 2, ..., K,     (P)
where g(·) is a smooth convex function and each h_k is a nonsmooth convex function.
Notation: x := (x_1^T, ..., x_K^T)^T ∈ R^n (block variables), X := Π_{k=1}^K X_k (feasible set), E := (E_1, ..., E_K), and h(x) := Σ_{k=1}^K h_k(x_k).
Augmented Lagrangian function:
    L(x; y) = g(x) + h(x) + ⟨y, q − Ex⟩ + (ρ/2) ‖q − Ex‖²

Small dual step size and error bound condition
ADMM for N ≥ 3 without the strong convexity assumption (Hong-Luo-2012):
    x_1^{k+1} := argmin_{x_1 ∈ X_1} L(x_1, x_2^k, ..., x_N^k; y^k)
    x_2^{k+1} := argmin_{x_2 ∈ X_2} L(x_1^{k+1}, x_2, x_3^k, ..., x_N^k; y^k)
    ...
    x_N^{k+1} := argmin_{x_N ∈ X_N} L(x_1^{k+1}, x_2^{k+1}, ..., x_{N−1}^{k+1}, x_N; y^k)
    y^{k+1} := y^k − α (Σ_{j=1}^N A_j x_j^{k+1} − b)
Strong convexity is not assumed, but other conditions are needed.

Small dual step size and error bound condition
Error bound condition: there exist positive scalars τ and δ such that
    dist(x, X(y)) ≤ τ ‖∇_x L(x; y)‖   for all (x, y) such that ‖∇_x L(x; y)‖ ≤ δ.
[Hong-Luo-2012]: provided α is small enough (upper bounded by a constant related to τ and δ), the small-step-size variant of ADMM converges linearly.
However, this α is very small and not computable, so the scheme is not practical.

The BSUM-M Algorithm: Main Ideas
BSUM-M: a Block Successive Upper-bound Minimization Method of Multipliers.
Introduce the augmented Lagrangian function for problem (P):
    L(x; y) = g(x) + h(x) + ⟨y, q − Ex⟩ + (ρ/2) ‖q − Ex‖²,
where y is the dual variable and ρ ≥ 0 is the primal stepsize.
Main idea, primal update:
1. Update the primal variables successively (Gauss-Seidel).
2. Optimize an approximate version of L(x, y).
Main idea, dual update:
1. Inexact dual ascent with proper step size control.

The BSUM-M Algorithm: Details
At iteration r+1, a block variable x_k is updated by solving
    min_{x_k ∈ X_k}  u_k(x_k; x_1^{r+1}, ..., x_{k−1}^{r+1}, x_k^r, ..., x_K^r) + ⟨y^{r+1}, q − E_k x_k⟩ + h_k(x_k),
where u_k(· ; x_1^{r+1}, ..., x_{k−1}^{r+1}, x_k^r, ..., x_K^r) is an upper bound of g(x) + (ρ/2) ‖q − Ex‖² at the current iterate (x_1^{r+1}, ..., x_{k−1}^{r+1}, x_k^r, ..., x_K^r).

The BSUM-M Algorithm: Details (cont.)
The BSUM-M Algorithm. At each iteration r ≥ 1:
    y^{r+1} = y^r + α^r (q − E x^r) = y^r + α^r (q − Σ_{k=1}^K E_k x_k^r),
    x_k^{r+1} = argmin_{x_k ∈ X_k}  u_k(x_k; w_k^{r+1}) − ⟨y^{r+1}, E_k x_k⟩ + h_k(x_k),  ∀ k,
where α^r > 0 is the dual stepsize. To simplify notation, we have defined
    w_k^{r+1} := (x_1^{r+1}, ..., x_{k−1}^{r+1}, x_k^r, x_{k+1}^r, ..., x_K^r).

The BSUM-M Algorithm: Randomized Version
Select probabilities {p_k > 0}_{k=0}^K such that Σ_{k=0}^K p_k = 1.
Each iteration t updates only a single, randomly selected primal or dual variable.
At iteration t ≥ 1, pick k ∈ {0, ..., K} with probability p_k:
If k = 0:
    y^{t+1} = y^t + α^t (q − E x^t),  x_k^{t+1} = x_k^t for k = 1, ..., K.
Else, if k ∈ {1, ..., K}:
    x_k^{t+1} = argmin_{x_k ∈ X_k}  u_k(x_k; x^t) − ⟨y^t, E_k x_k⟩ + h_k(x_k),
    x_j^{t+1} = x_j^t for j ≠ k,  y^{t+1} = y^t.

Convergence Analysis: Assumptions
Assumption A (on the problem):
(a) Problem (P) is a convex problem.
(b) g(x) = ℓ(Ax) + ⟨x, b⟩, where ℓ(·) is smooth and strictly convex, and A is not necessarily of full column rank.
(c) The nonsmooth functions have the form h_k(x_k) = λ_k ‖x_k‖_1 + Σ_J w_J ‖x_{k,J}‖_2, where x_k = (..., x_{k,J}, ...) is a partition of x_k, and λ_k ≥ 0 and w_J ≥ 0 are constants.
(d) The feasible sets X_k are compact polyhedral sets, given by X_k := {x_k | C_k x_k ≤ c_k}.

Convergence Analysis: Assumptions
Assumption B (on u_k):
(a) u_k(v_k; x) ≥ g(v_k, x_{−k}) + (ρ/2) ‖E_k v_k − q + E_{−k} x_{−k}‖²,  ∀ v_k ∈ X_k, ∀ x, k (upper bound).
(b) u_k(x_k; x) = g(x) + (ρ/2) ‖Ex − q‖²,  ∀ x, k (locally tight).
(c) ∇u_k(x_k; x) = ∇_k (g(x) + (ρ/2) ‖Ex − q‖²),  ∀ x, k.
(d) For any given x, u_k(v_k; x) is strongly convex in v_k.
(e) For any given x, u_k(v_k; x) has a Lipschitz continuous gradient.
Figure: Illustration of the upper bound.
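
For concreteness, one standard surrogate satisfying (a)-(e) (an illustration, not a choice made on the slides) is the proximal linearization obtained by assuming ∇_k g is Lipschitz continuous in the block x_k with constant L_k:
    u_k(v_k; x) = g(x) + ⟨∇_k g(x), v_k − x_k⟩ + (L_k/2) ‖v_k − x_k‖² + (ρ/2) ‖E_k v_k − q + E_{−k} x_{−k}‖².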

The Convergence Result
Theorem (Hong, Chang, Wang, Razaviyayn, Ma and Luo 2014). Suppose Assumptions A-B hold, and assume the dual stepsize α^r satisfies
    Σ_{r=1}^∞ α^r = ∞  and  lim_{r→∞} α^r = 0.
Then we have the following:
1. For the BSUM-M, lim_{r→∞} ‖E x^r − q‖ = 0, and every limit point of {x^r, y^r} is a primal and dual optimal solution.
2. For the RBSUM-M, lim_{t→∞} ‖E x^t − q‖ = 0 w.p.1. Further, every limit point of {x^t, y^t} is a primal and dual optimal solution w.p.1.

Counterexample for multi-block ADMM
Recently, [Chen-He-Ye-Yuan 13] showed (through an example) that applying ADMM to a multi-block problem can diverge.
We show that applying (R)BSUM-M to the same problem converges (with a diminishing dual step size).
Main message: dual stepsize control is crucial.
Consider the following linear system of equations (unique solution x_1 = x_2 = x_3 = 0):
    E_1 x_1 + E_2 x_2 + E_3 x_3 = 0,  with  [E_1 E_2 E_3] = [1 1 1; 1 1 2; 1 2 2].
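
A minimal runnable sketch of BSUM-M on this example, taking exact block minimization of the augmented Lagrangian as the upper bound u_k (valid here since g ≡ 0, h ≡ 0, and each column E_k is nonzero) and a diminishing dual stepsize α^r = α^0/r; the values of ρ, α^0, and the iteration count are illustrative placeholders.

```python
import numpy as np

E = np.array([[1., 1., 1.],
              [1., 1., 2.],
              [1., 2., 2.]])
q = np.zeros(3)
rho, alpha0, iters = 1.0, 1.0, 2000      # illustrative parameters

rng = np.random.default_rng(0)
x = rng.standard_normal(3)               # random starting point
y = np.zeros(3)
for r in range(1, iters + 1):
    # dual update first, with diminishing stepsize alpha^r = alpha0 / r
    y = y + (alpha0 / r) * (q - E @ x)
    # primal Gauss-Seidel sweep: exact minimization of L(x; y) in each block,
    # i.e. min_{x_k} -<y, E_k x_k> + (rho/2)||E x - q||^2 with the other blocks fixed
    for k in range(3):
        Ek = E[:, k]
        resid = E @ x - Ek * x[k] - q    # coupling residual without block k
        x[k] = Ek @ (y / rho - resid) / (Ek @ Ek)

# per the slides, the iterates approach the unique solution x = (0, 0, 0)
print("x =", x, " ||Ex - q|| =", np.linalg.norm(E @ x - q))
```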

Counterexample for multi-block ADMM (cont.) 0.2 x 1 0.5 x 1 0.15 x 2 0.4 x 2 0.1 0.05 x 3 x 1 +x 2 +x 3 0.3 0.2 x 3 x 1 +x 2 +x 3 0 0.05 0.1 0 0.1 0.1 0.2 0.15 0.3 0.2 0.4 0.25 0 50 100 150 200 250 300 350 400 iteration (r) 0.5 0 200 400 600 800 1000 1200 iteration (t) Figure: Iterates generated by the BSUM-M. Each curve is averaged over 1000 runs (with random starting points). Figure: Iterates generated by the RBSUM-M algorithm. Each curve is averaged over 1000 runs (with random starting points)

Thank you for your attention!