Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables
1 Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
2014 Workshop on Optimization for Modern Computation, BICMR, Beijing, China, September 2, 2014
2 Outline
- ADMM for N = 2
- Existing work on ADMM for N ≥ 3
- Convergence rates of ADMM for N ≥ 3
- BSUM-M
3 Alternating Direction Method of Multipliers (ADMM)
Convex optimization:
  min  f_1(x_1) + f_2(x_2) + ... + f_N(x_N)
  s.t. A_1 x_1 + A_2 x_2 + ... + A_N x_N = b,  x_j ∈ X_j, j = 1, 2, ..., N,
where each f_j is a closed convex function and each X_j is a closed convex set.
Augmented Lagrangian function:
  L_γ(x_1, ..., x_N; λ) := Σ_{j=1}^N f_j(x_j) − ⟨λ, Σ_{j=1}^N A_j x_j − b⟩ + (γ/2) ‖Σ_{j=1}^N A_j x_j − b‖²
4 ADMM updates the primal variables in a Gauss-Seidel manner:
  x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k, ..., x_N^k; λ^k)
  x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2, x_3^k, ..., x_N^k; λ^k)
  ...
  x_N^{k+1} := argmin_{x_N ∈ X_N} L_γ(x_1^{k+1}, x_2^{k+1}, ..., x_{N−1}^{k+1}, x_N; λ^k)
  λ^{k+1} := λ^k − γ (Σ_{j=1}^N A_j x_j^{k+1} − b).
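To make the sweep concrete, here is a minimal sketch (our own illustration, not code from the talk) for the separable quadratic choice f_j(x_j) = ½‖x_j − c_j‖², where every block subproblem of the augmented Lagrangian has a closed-form solution:

```python
import numpy as np

def admm_gauss_seidel(A_list, b, c_list, gamma=1.0, iters=200):
    """Gauss-Seidel (multi-block) ADMM for
       min sum_j 0.5*||x_j - c_j||^2  s.t.  sum_j A_j x_j = b."""
    x = [np.zeros_like(c) for c in c_list]
    lam = np.zeros_like(b)
    for _ in range(iters):
        for j, (A, c) in enumerate(zip(A_list, c_list)):
            # residual contribution of all other blocks (latest values: Gauss-Seidel)
            r = sum(A_list[l] @ x[l] for l in range(len(x)) if l != j)
            # closed-form minimizer of the block's augmented-Lagrangian subproblem
            H = np.eye(A.shape[1]) + gamma * A.T @ A
            x[j] = np.linalg.solve(H, c + A.T @ lam + gamma * A.T @ (b - r))
        lam = lam - gamma * (sum(A @ xj for A, xj in zip(A_list, x)) - b)
    return x, lam

# two-block consensus demo: min 0.5(x1-1)^2 + 0.5(x2-3)^2  s.t.  x1 = x2
A1, A2 = np.array([[1.0]]), np.array([[-1.0]])
x, lam = admm_gauss_seidel([A1, A2], np.zeros(1),
                           [np.array([1.0]), np.array([3.0])])
```

For N = 2 this reduces to classical two-block ADMM; in the demo the iterates converge to the consensus point x_1 = x_2 = 2.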
5 ADMM for N = 2
  x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k; λ^k)
  x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2; λ^k)
  λ^{k+1} := λ^k − γ (A_1 x_1^{k+1} + A_2 x_2^{k+1} − b).
- Its history goes back to variational methods for PDEs in the 1950s.
- Related to the Douglas-Rachford and Peaceman-Rachford operator splitting methods for finding a zero of monotone operators: find x such that 0 ∈ A(x) + B(x).
- Revisited recently for sparse optimization [Wang-Yang-Yin-Zhang-2008] [Goldstein-Osher-2009] [Boyd-etal-2011].
6 Global Convergence of ADMM for N = 2
ADMM for N = 2 (as above) converges globally for any γ > 0 (Fortin-Glowinski-1983; Gabay-1983; Glowinski-Le Tallec-1989; Eckstein-Bertsekas-1992).
ADMM for N = 2 with a fixed dual step size:
  x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k; λ^k)
  x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2; λ^k)
  λ^{k+1} := λ^k − αγ (A_1 x_1^{k+1} + A_2 x_2^{k+1} − b),
where α > 0 is a fixed dual step size. Global convergence holds for any γ > 0 and any α ∈ (0, (1+√5)/2).
7 Sublinear Convergence of ADMM for N = 2
- Ergodic O(1/k) convergence (He-Yuan-2012)
- Non-ergodic O(1/k) convergence (He-Yuan-2012)
- Ergodic O(1/k) convergence (Monteiro-Svaiter-2013)
8 Linear Convergence Rate of ADMM for N = 2
- The Douglas-Rachford splitting method converges linearly if B is coercive and Lipschitz (Lions-Mercier-1979).
- Linear convergence for solving linear programs (Eckstein-Bertsekas-1990).
- Linear convergence for quadratic programs (Han-Yuan-2013; Boley-2013).
9 Generalized ADMM
Generalized ADMM for N = 2 (Deng-Yin-2012):
  x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k; λ^k) + (1/2) ‖x_1 − x_1^k‖²_P
  x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2; λ^k) + (1/2) ‖x_2 − x_2^k‖²_Q
  λ^{k+1} := λ^k − αγ (A_1 x_1^{k+1} + A_2 x_2^{k+1} − b).
One sufficient condition guaranteeing global linear convergence: P = Q = 0, α = 1, f_2 strongly convex, ∇f_2 Lipschitz continuous, A_2 full row rank.
10 ADMM for N ≥ 3: a counterexample
A negative result (Chen-He-Ye-Yuan-2013): the direct extension of multi-block ADMM is not necessarily convergent.
A counterexample: minimize 0 subject to A_1 x_1 + A_2 x_2 + A_3 x_3 = 0, where
  A = (A_1, A_2, A_3) = [1 1 1; 1 1 2; 1 2 2].
With γ = 1, each sweep of multi-block ADMM maps (x_2^k, x_3^k, λ^k) to (x_2^{k+1}, x_3^{k+1}, λ^{k+1}) by a fixed linear transformation.
11 ADMM for N ≥ 3: a counterexample (cont.)
Equivalently, (x_2^{k+1}, x_3^{k+1}, λ^{k+1})^T = M (x_2^k, x_3^k, λ^k)^T for a fixed matrix M. Note that ρ(M) > 1, so the iteration diverges from suitable starting points.
Theorem (Chen-He-Ye-Yuan-2013). There exists an example for which the direct extension of ADMM with three blocks, started from a suitable real initial point, fails to converge for any choice of γ > 0.
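The divergence can be checked numerically. The sketch below (our illustration; γ = 1, f_i ≡ 0, b = 0, scalar blocks) assembles the linear map M of one ADMM sweep for the Chen-He-Ye-Yuan matrix (restated in the code) and inspects its spectral radius:

```python
import numpy as np

# counterexample data (Chen-He-Ye-Yuan): objective 0, constraint A x = 0
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 2.0]])

def admm_sweep(x2, x3, lam, gamma=1.0):
    """One pass of directly extended 3-block ADMM on min 0 s.t. A x = 0."""
    A1, A2, A3 = A[:, 0], A[:, 1], A[:, 2]
    # each block subproblem is a scalar unconstrained least squares
    x1 = A1 @ (lam / gamma - A2 * x2 - A3 * x3) / (A1 @ A1)
    x2 = A2 @ (lam / gamma - A1 * x1 - A3 * x3) / (A2 @ A2)
    x3 = A3 @ (lam / gamma - A1 * x1 - A2 * x2) / (A3 @ A3)
    lam = lam - gamma * (A1 * x1 + A2 * x2 + A3 * x3)
    return x2, x3, lam

def iteration_matrix(gamma=1.0):
    """Assemble M with (x2^{k+1}, x3^{k+1}, lam^{k+1}) = M (x2^k, x3^k, lam^k)."""
    M = np.zeros((5, 5))
    for i in range(5):
        e = np.zeros(5)
        e[i] = 1.0
        x2, x3, lam = admm_sweep(e[0], e[1], e[2:], gamma)
        M[:, i] = np.concatenate(([x2], [x3], lam))
    return M

rho = max(abs(np.linalg.eigvals(iteration_matrix())))
print(rho)  # spectral radius of the sweep; > 1 means the iteration can diverge
```

The map is linear because b = 0 and the objective is zero, so one sweep applied to unit vectors recovers the columns of M directly.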
12 ADMM for N ≥ 3: does strong convexity help?
  min 0.05 x_1² + 0.05 x_2² + 0.05 x_3²  s.t.  A_1 x_1 + A_2 x_2 + A_3 x_3 = 0, with the same coefficient matrix A as above.
For γ = 1 the iteration matrix again satisfies ρ(M) > 1, so one can find a proper initial point from which ADMM diverges. Even for strongly convex programming, the directly extended ADMM is not necessarily convergent for certain γ > 0.
13 ADMM for N ≥ 3: strong convexity works!
Global convergence. Theorem (Han-Yuan-2012). If the f_i, i = 1, ..., N, are strongly convex with parameters σ_i, and
  0 < γ < min_{i=1,...,N} { 2σ_i / (3(N−1) λ_max(A_i^T A_i)) },
then multi-block ADMM converges globally.
Convergence rate?
14 ADMM for N ≥ 3: weaker condition and convergence rate
Notation: u := (x_1, ..., x_N);  x̄_i^t := (1/(t+1)) Σ_{k=0}^t x_i^{k+1}, 1 ≤ i ≤ N;  λ̄^t := (1/(t+1)) Σ_{k=0}^t λ^{k+1}.
Theorem (Lin-Ma-Zhang-2014a). If f_2, ..., f_N are strongly convex, f_1 is convex, and
  γ ≤ min_{2 ≤ i ≤ N−1} { 2σ_i / ((2N−i)(i−1) λ_max(A_i^T A_i)),  2σ_N / ((N−2)(N+1) λ_max(A_N^T A_N)) },
then f(ū^t) − f(u*) = O(1/t) and ‖Σ_{i=1}^N A_i x̄_i^t − b‖ = O(1/t).
- Weaker condition
- Ergodic O(1/t) convergence rate in terms of objective value and primal feasibility
15 ADMM for N ≥ 3: non-ergodic convergence rate
Optimality measure: if A_2 x_2^{k+1} − A_2 x_2^k = 0, A_3 x_3^{k+1} − A_3 x_3^k = 0, and A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} − b = 0, then (x_1^{k+1}, x_2^{k+1}, x_3^{k+1}, λ^{k+1}) is optimal. Define
  R^{k+1} := ‖A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} − b‖² + 2 ‖A_2 x_2^{k+1} − A_2 x_2^k‖² + 3 ‖A_3 x_3^{k+1} − A_3 x_3^k‖².
We can prove: R^k = o(1/k).
16 ADMM for N ≥ 3: non-ergodic convergence rate (cont.)
Theorem (Lin-Ma-Zhang-2014a). If f_2 and f_3 are strongly convex, and
  γ ≤ min { σ_2 / (2 λ_max(A_2^T A_2)),  σ_3 / (2 λ_max(A_3^T A_3)) },
then Σ_{k=1}^∞ R^k < +∞ and R^k = o(1/k).
17 ADMM for N ≥ 3: non-ergodic convergence rate (general N)
Theorem (Lin-Ma-Zhang-2014a). If f_2, ..., f_N are strongly convex, and
  γ ≤ min_{2 ≤ i ≤ N−1} { 2σ_i / ((2N−i)(i−1) λ_max(A_i^T A_i)),  2σ_N / ((N−2)(N+1) λ_max(A_N^T A_N)) },
then Σ_{k=1}^∞ R^k < +∞ and R^k = o(1/k), where
  R^{k+1} := ‖Σ_{i=1}^N A_i x_i^{k+1} − b‖² + Σ_{i=2}^N ((2N−i)(i−1)/2) ‖A_i x_i^k − A_i x_i^{k+1}‖².
18 ADMM for N ≥ 3: global linear convergence
Global linear convergence of ADMM for N ≥ 3 (Lin-Ma-Zhang-2014b):

  Scenario | strongly convex | Lipschitz gradient  | rank condition
  1        | f_2, ..., f_N   | ∇f_N                | A_N full row rank
  2        | f_1, ..., f_N   | ∇f_1, ..., ∇f_N     |
  3        | f_2, ..., f_N   | ∇f_1, ..., ∇f_N     | A_1 full column rank

Table: three scenarios leading to global linear convergence. They reduce to the conditions in (Deng-Yin-2012) when N = 2.
19 Variants: Modified Proximal Jacobian ADMM (Deng-Lai-Peng-Yin-2014)
  x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k, ..., x_N^k; λ^k) + (1/2) ‖x_1 − x_1^k‖²_{P_1}
  x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^k, x_2, x_3^k, ..., x_N^k; λ^k) + (1/2) ‖x_2 − x_2^k‖²_{P_2}
  ...
  x_N^{k+1} := argmin_{x_N ∈ X_N} L_γ(x_1^k, x_2^k, ..., x_{N−1}^k, x_N; λ^k) + (1/2) ‖x_N − x_N^k‖²_{P_N}
  λ^{k+1} := λ^k − αγ (Σ_{j=1}^N A_j x_j^{k+1} − b).
Conditions for convergence: P_i ⪰ γ(1/ε_i − 1) A_i^T A_i, i = 1, 2, ..., N, and Σ_{i=1}^N ε_i < 2 − α.
o(1/k) convergence rate in the non-ergodic sense.
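A minimal sketch of the Jacobian variant (our own illustration, again for the quadratic choice f_j(x_j) = ½‖x_j − c_j‖², with P_j = τ I): all blocks are updated in parallel from the iterate-k values, and the proximal term damps each update. In the two-block demo below, τ = 2 satisfies P_i ⪰ γ(1/ε_i − 1) A_i^T A_i with ε_i = 0.4 and α = 1:

```python
import numpy as np

def prox_jacobi_admm(A_list, b, c_list, gamma=1.0, tau=2.0, alpha=1.0, iters=300):
    """Proximal Jacobian ADMM (sketch) for
       min sum_j 0.5*||x_j - c_j||^2  s.t.  sum_j A_j x_j = b,
    with proximal terms P_j = tau * I."""
    x = [np.zeros_like(c) for c in c_list]
    lam = np.zeros_like(b)
    for _ in range(iters):
        x_new = []
        for j, (A, c) in enumerate(zip(A_list, c_list)):
            # all other blocks use the OLD iterates (Jacobi / parallel update)
            r = sum(A_list[l] @ x[l] for l in range(len(x)) if l != j)
            H = (1.0 + tau) * np.eye(A.shape[1]) + gamma * A.T @ A
            x_new.append(np.linalg.solve(H, c + A.T @ lam
                                         + gamma * A.T @ (b - r) + tau * x[j]))
        x = x_new
        lam = lam - alpha * gamma * (sum(A @ xj for A, xj in zip(A_list, x)) - b)
    return x, lam

# consensus demo: min 0.5(x1-1)^2 + 0.5(x2-3)^2  s.t.  x1 = x2
A1, A2 = np.array([[1.0]]), np.array([[-1.0]])
x, lam = prox_jacobi_admm([A1, A2], np.zeros(1),
                          [np.array([1.0]), np.array([3.0])])
```

The parallel update makes each pass embarrassingly parallel across blocks; the proximal damping is what restores convergence without Gauss-Seidel ordering.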
20 Variants: Modified Proximal Gauss-Seidel ADMM
  x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k, x_3^k; λ^k) + (1/2) ‖x_1 − x_1^k‖²_{P_1}
  x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2, x_3^k; λ^k) + (1/2) ‖x_2 − x_2^k‖²_{P_2}
  x_3^{k+1} := argmin_{x_3 ∈ X_3} L_γ(x_1^{k+1}, x_2^{k+1}, x_3; λ^k) + (1/2) ‖x_3 − x_3^k‖²_{P_3}
  λ^{k+1} := λ^k − αγ (Σ_{j=1}^3 A_j x_j^{k+1} − b).
21 Proximal Gauss-Seidel ADMM
Theorem (Lin-Ma-Zhang-2014c). Ergodic O(1/k) convergence rate in terms of both objective value and primal feasibility, under the conditions: f_3 is strongly convex, and
  P_1 ⪰ γ(3 + 5/ε_1) A_1^T A_1,
  P_2 ⪰ γ(1 + 3/ε_2) A_2^T A_2,
  P_3 ⪰ γ(1/ε_3 − 1) A_3^T A_3,
  3(ε_1 + ε_2 + ε_3) < 1/α,
  γ < σ_3 ε_3 / (2(ε_3 + 1) λ_max(A_3^T A_3)).
Ongoing work; more coming soon.
22 The General Problem Formulation
We consider the following convex optimization problem:
  min  f(x) := g(x_1, ..., x_K) + Σ_{k=1}^K h_k(x_k)
  s.t. E_1 x_1 + E_2 x_2 + ... + E_K x_K = q,  x_k ∈ X_k, k = 1, 2, ..., K,   (P)
where g(·) is a smooth convex function and each h_k is a nonsmooth convex function.
Notation: x := (x_1^T, ..., x_K^T)^T ∈ R^n are the block variables; X := Π_{k=1}^K X_k is the feasible set; E := (E_1, ..., E_K) and h(x) := Σ_{k=1}^K h_k(x_k).
Augmented Lagrangian function:
  L(x; y) = g(x) + h(x) + ⟨y, q − Ex⟩ + (ρ/2) ‖q − Ex‖².
23 Small dual step size and error bound condition
ADMM for N ≥ 3 without a strong convexity assumption (Hong-Luo-2012):
  x_1^{k+1} := argmin_{x_1 ∈ X_1} L(x_1, x_2^k, ..., x_N^k; y^k)
  x_2^{k+1} := argmin_{x_2 ∈ X_2} L(x_1^{k+1}, x_2, x_3^k, ..., x_N^k; y^k)
  ...
  x_N^{k+1} := argmin_{x_N ∈ X_N} L(x_1^{k+1}, x_2^{k+1}, ..., x_{N−1}^{k+1}, x_N; y^k)
  y^{k+1} := y^k − α (Σ_{j=1}^N A_j x_j^{k+1} − b).
Strong convexity is not assumed, but other conditions are needed.
24 Small dual step size and error bound condition
Error bound condition: there exist positive scalars τ and δ such that
  dist(x, X(y)) ≤ τ ‖∇̃_x L(x; y)‖  for all (x, y) such that ‖∇̃_x L(x; y)‖ ≤ δ.
[Hong-Luo-2012]: provided α is small enough (upper bounded by a constant depending on τ and δ), this small-step-size variant of ADMM converges linearly. However, the required α is too small and not computable, hence not practical.
25 The BSUM-M Algorithm: Main Ideas
BSUM-M: a Block Successive Upper-bound Minimization Method of Multipliers.
Introduce the augmented Lagrangian function for problem (P):
  L(x; y) = g(x) + h(x) + ⟨y, q − Ex⟩ + (ρ/2) ‖q − Ex‖²,
where y is the dual variable and ρ ≥ 0 is the primal stepsize.
Main idea, primal update: (1) update the primal variables successively (Gauss-Seidel); (2) optimize an approximate version of L(x, y).
Main idea, dual update: inexact dual ascent with proper step size control.
26 The BSUM-M Algorithm: Details
At iteration r+1, a block variable x_k is updated by solving
  min_{x_k ∈ X_k} u_k(x_k; x_1^{r+1}, ..., x_{k−1}^{r+1}, x_k^r, ..., x_K^r) − ⟨y^{r+1}, E_k x_k⟩ + h_k(x_k),
where u_k(·; x_1^{r+1}, ..., x_{k−1}^{r+1}, x_k^r, ..., x_K^r) is an upper bound of g(x) + (ρ/2) ‖q − Ex‖² at the current iterate (x_1^{r+1}, ..., x_{k−1}^{r+1}, x_k^r, ..., x_K^r).
27 The BSUM-M Algorithm: Details (cont.)
The BSUM-M Algorithm. At each iteration r ≥ 1:
  y^{r+1} = y^r + α^r (q − E x^r) = y^r + α^r (q − Σ_{k=1}^K E_k x_k^r),
  x_k^{r+1} = argmin_{x_k ∈ X_k} u_k(x_k; w_k^{r+1}) − ⟨y^{r+1}, E_k x_k⟩ + h_k(x_k),  ∀k,
where α^r > 0 is the dual stepsize and, to simplify notation,
  w_k^{r+1} := (x_1^{r+1}, ..., x_{k−1}^{r+1}, x_k^r, x_{k+1}^r, ..., x_K^r).
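The dual-first ordering and the upper-bound block step can be sketched as follows (our own instantiation, not from the talk): take scalar blocks, h_k = 0, X_k = R, and the proximal-linearized upper bound u_k(v; x) = φ(x) + ⟨∇_k φ(x), v − x_k⟩ + (L/2)(v − x_k)², where φ(x) := g(x) + (ρ/2)‖Ex − q‖² and L bounds each block's curvature. The toy instance at the end is an assumption for illustration:

```python
import numpy as np

def bsum_m(grad_phi, E, q, x0, rho=1.0, L=2.0, iters=5000):
    """BSUM-M sketch: dual ascent with diminishing stepsize, then a
    Gauss-Seidel pass minimizing the proximal-linearized upper bound
    u_k(v; x) - <y, E_k v> over each scalar block (h_k = 0, X_k = R)."""
    x, y = x0.copy(), np.zeros(len(q))
    for r in range(1, iters + 1):
        y = y + (1.0 / r) * (q - E @ x)       # sum alpha^r = inf, alpha^r -> 0
        for k in range(len(x)):               # successive block updates
            gk = grad_phi(x, k, rho) - E[:, k] @ y
            x[k] -= gk / L                    # minimizer of u_k - <y, E_k v>

    return x, y

# assumed toy instance: min 0.5*||x||^2  s.t.  x1 + x2 = 1 (solution x = (0.5, 0.5))
E, q = np.array([[1.0, 1.0]]), np.array([1.0])
phi_grad = lambda x, k, rho: x[k] + rho * E[:, k] @ (E @ x - q)
x, y = bsum_m(phi_grad, E, q, np.zeros(2))
```

For this quadratic instance, L = 2 equals the exact block curvature of φ, so the upper-bound step coincides with exact per-block minimization of the augmented Lagrangian.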
28 The BSUM-M Algorithm: Randomized Version
Select probabilities {p_k > 0}_{k=0}^K such that Σ_{k=0}^K p_k = 1. Each iteration t updates only a single randomly selected primal or dual variable. At iteration t ≥ 1, pick k ∈ {0, ..., K} with probability p_k:
- If k = 0:  y^{t+1} = y^t + α^t (q − E x^t),  and x_k^{t+1} = x_k^t for k = 1, ..., K.
- Else, if k ∈ {1, ..., K}:  x_k^{t+1} = argmin_{x_k ∈ X_k} u_k(x_k; x^t) − ⟨y^t, E_k x_k⟩ + h_k(x_k),  with x_j^{t+1} = x_j^t for j ≠ k, and y^{t+1} = y^t.
29 Convergence Analysis: Assumptions
Assumption A (on the problem):
(a) Problem (P) is a convex problem.
(b) g(x) = ℓ(Ax) + ⟨x, b⟩, where ℓ(·) is smooth and strictly convex; A is not necessarily full column rank.
(c) The nonsmooth functions have the form h_k(x_k) = λ_k ‖x_k‖_1 + Σ_J w_J ‖x_{k,J}‖_2, where x_k = (..., x_{k,J}, ...) is a partition of x_k, and λ_k ≥ 0 and w_J ≥ 0 are constants.
(d) The feasible sets {X_k} are compact polyhedral sets, given by X_k := {x_k : C_k x_k ≤ c_k}.
30 Convergence Analysis: Assumptions (cont.)
Assumption B (on u_k):
(a) u_k(v_k; x) ≥ g(v_k, x_{−k}) + (ρ/2) ‖E_k v_k − q + E_{−k} x_{−k}‖², for all v_k ∈ X_k and all x (upper bound).
(b) u_k(x_k; x) = g(x) + (ρ/2) ‖Ex − q‖², for all x (locally tight).
(c) ∇u_k(x_k; x) = ∇_k (g(x) + (ρ/2) ‖Ex − q‖²), for all x (gradient consistency).
(d) For any given x, u_k(v_k; x) is strongly convex in v_k.
(e) For any given x, u_k(v_k; x) has a Lipschitz continuous gradient.
Figure: illustration of the upper bound.
31 The Convergence Result
Theorem (Hong, Chang, Wang, Razaviyayn, Ma and Luo 2014). Suppose Assumptions A-B hold, and the dual stepsizes α^r satisfy
  Σ_{r=1}^∞ α^r = ∞,  lim_{r→∞} α^r = 0.
Then we have the following:
1. For BSUM-M, lim_{r→∞} ‖E x^r − q‖ = 0, and every limit point of {x^r, y^r} is a primal and dual optimal solution.
2. For RBSUM-M, lim_{t→∞} ‖E x^t − q‖ = 0 w.p.1. Further, every limit point of {x^t, y^t} is a primal and dual optimal solution w.p.1.
32 Counterexample for multi-block ADMM
Recently, [Chen-He-Ye-Yuan 13] showed through an example that applying ADMM to a multi-block problem can diverge. We show that applying (R)BSUM-M to the same problem converges (with a diminishing dual step size). Main message: dual stepsize control is crucial.
Consider the following linear system of equations (unique solution x_1 = x_2 = x_3 = 0):
  E_1 x_1 + E_2 x_2 + E_3 x_3 = 0,  with  [E_1 E_2 E_3] = [1 1 1; 1 1 2; 1 2 2].
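The claimed behavior can be sketched numerically (our illustration: ρ = 1, scalar blocks, exact block minimization of the augmented Lagrangian as a valid choice of u_k, and α^r = 1/r, which satisfies the stepsize conditions). In contrast with the divergent direct ADMM iteration, the feasibility residual shrinks toward zero:

```python
import numpy as np

E = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 2.0]])   # same data as the divergent ADMM example
q = np.zeros(3)

def bsum_m(rho=1.0, iters=5000):
    """BSUM-M sketch on min 0 s.t. E x = q (scalar blocks, X_k = R):
    dual update first with diminishing stepsize, then exact Gauss-Seidel
    block minimization of the augmented Lagrangian."""
    x = np.ones(3)
    y = np.zeros(3)
    for r in range(1, iters + 1):
        alpha = 1.0 / r                    # sum alpha^r = inf, alpha^r -> 0
        y = y + alpha * (q - E @ x)        # inexact dual ascent
        for k in range(3):                 # successive exact block updates
            Ek = E[:, k]
            rest = E @ x - Ek * x[k]
            x[k] = Ek @ (y / rho + q - rest) / (Ek @ Ek)
    return x, y

x, y = bsum_m()
print(np.linalg.norm(E @ x - q))  # feasibility residual, shrinking toward 0
```

With a constant dual stepsize this same sweep is exactly the divergent 3-block ADMM; the diminishing α^r is the only change, which is the point of the slide.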
33 Counterexample for multi-block ADMM (cont.)
Figure: iterates x_1, x_2, x_3 and the residual x_1 + x_2 + x_3 generated by BSUM-M; each curve is averaged over 1000 runs with random starting points.
Figure: the same quantities generated by the RBSUM-M algorithm, again averaged over 1000 runs with random starting points.
34 Thank you for your attention!
Noname manuscript No. (will be inserted by the editor) A Simple Convergence Analysis of Bregman Proximal Gradient Algorithm Yi Zhou Yingbin Liang Lixin Shen arxiv:1503.05601v3 [math.oc] 17 Dec 2017 Received:
More informationMath 273a: Optimization Subgradient Methods
Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R
More informationAlgorithms for constrained local optimization
Algorithms for constrained local optimization Fabio Schoen 2008 http://gol.dsi.unifi.it/users/schoen Algorithms for constrained local optimization p. Feasible direction methods Algorithms for constrained
More informationWHY DUALITY? Gradient descent Newton s method Quasi-newton Conjugate gradients. No constraints. Non-differentiable ???? Constrained problems? ????
DUALITY WHY DUALITY? No constraints f(x) Non-differentiable f(x) Gradient descent Newton s method Quasi-newton Conjugate gradients etc???? Constrained problems? f(x) subject to g(x) apple 0???? h(x) =0
More informationFAST ALTERNATING DIRECTION OPTIMIZATION METHODS
FAST ALTERNATING DIRECTION OPTIMIZATION METHODS TOM GOLDSTEIN, BRENDAN O DONOGHUE, SIMON SETZER, AND RICHARD BARANIUK Abstract. Alternating direction methods are a common tool for general mathematical
More informationPreconditioned Alternating Direction Method of Multipliers for the Minimization of Quadratic plus Non-Smooth Convex Functionals
SpezialForschungsBereich F 32 Karl Franzens Universita t Graz Technische Universita t Graz Medizinische Universita t Graz Preconditioned Alternating Direction Method of Multipliers for the Minimization
More informationMath 273a: Optimization Subgradients of convex functions
Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions
More informationCoordinate Update Algorithm Short Course Subgradients and Subgradient Methods
Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n
More informationIterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming
Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study
More informationConductivity imaging from one interior measurement
Conductivity imaging from one interior measurement Amir Moradifam (University of Toronto) Fields Institute, July 24, 2012 Amir Moradifam (University of Toronto) A convergent algorithm for CDII 1 / 16 A
More information1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method
L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationFirst-order methods for structured nonsmooth optimization
First-order methods for structured nonsmooth optimization Sangwoon Yun Department of Mathematics Education Sungkyunkwan University Oct 19, 2016 Center for Mathematical Analysis & Computation, Yonsei University
More informationRelaxed linearized algorithms for faster X-ray CT image reconstruction
Relaxed linearized algorithms for faster X-ray CT image reconstruction Hung Nien and Jeffrey A. Fessler University of Michigan, Ann Arbor The 13th Fully 3D Meeting June 2, 2015 1/20 Statistical image reconstruction
More informationSolving DC Programs that Promote Group 1-Sparsity
Solving DC Programs that Promote Group 1-Sparsity Ernie Esser Contains joint work with Xiaoqun Zhang, Yifei Lou and Jack Xin SIAM Conference on Imaging Science Hong Kong Baptist University May 14 2014
More informationA New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction
A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with H. Bauschke and J. Bolte Optimization
More informationAlgorithms for Nonsmooth Optimization
Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization
More informationNESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization
ESTT: A onconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization Davood Hajinezhad, Mingyi Hong Tuo Zhao Zhaoran Wang Abstract We study a stochastic and distributed algorithm for
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationM. Marques Alves Marina Geremia. November 30, 2017
Iteration complexity of an inexact Douglas-Rachford method and of a Douglas-Rachford-Tseng s F-B four-operator splitting method for solving monotone inclusions M. Marques Alves Marina Geremia November
More informationLecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem
Lecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem Michael Patriksson 0-0 The Relaxation Theorem 1 Problem: find f := infimum f(x), x subject to x S, (1a) (1b) where f : R n R
More informationOptimization for Machine Learning
Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html
More informationAN AUGMENTED LAGRANGIAN METHOD FOR l 1 -REGULARIZED OPTIMIZATION PROBLEMS WITH ORTHOGONALITY CONSTRAINTS
AN AUGMENTED LAGRANGIAN METHOD FOR l 1 -REGULARIZED OPTIMIZATION PROBLEMS WITH ORTHOGONALITY CONSTRAINTS WEIQIANG CHEN, HUI JI, AND YANFEI YOU Abstract. A class of l 1 -regularized optimization problems
More informationarxiv: v4 [math.oc] 29 Jan 2018
Noname manuscript No. (will be inserted by the editor A new primal-dual algorithm for minimizing the sum of three functions with a linear operator Ming Yan arxiv:1611.09805v4 [math.oc] 29 Jan 2018 Received:
More informationAlternating Direction Augmented Lagrangian Algorithms for Convex Optimization
Alternating Direction Augmented Lagrangian Algorithms for Convex Optimization Joint work with Bo Huang, Shiqian Ma, Tony Qin, Katya Scheinberg, Zaiwen Wen and Wotao Yin Department of IEOR, Columbia University
More information