Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
2014 Workshop on Optimization for Modern Computation, BICMR, Beijing, China
September 2, 2014
Outline
- ADMM for N = 2
- Existing work on ADMM for N ≥ 3
- Convergence rates of ADMM for N ≥ 3
- BSUM-M
Alternating Direction Method of Multipliers (ADMM)
Convex optimization
min f_1(x_1) + f_2(x_2) + ... + f_N(x_N)
s.t. A_1 x_1 + A_2 x_2 + ... + A_N x_N = b,
     x_j ∈ X_j, j = 1, 2, ..., N.
f_j: closed convex function; X_j: closed convex set
Augmented Lagrangian function
L_γ(x_1, ..., x_N; λ) := Σ_{j=1}^N f_j(x_j) − ⟨λ, Σ_{j=1}^N A_j x_j − b⟩ + (γ/2) ‖Σ_{j=1}^N A_j x_j − b‖²
Augmented Lagrangian function
L_γ(x_1, ..., x_N; λ) := Σ_{j=1}^N f_j(x_j) − ⟨λ, Σ_{j=1}^N A_j x_j − b⟩ + (γ/2) ‖Σ_{j=1}^N A_j x_j − b‖²
Direct (multi-block) extension of ADMM:
x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k, ..., x_N^k; λ^k)
x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2, x_3^k, ..., x_N^k; λ^k)
...
x_N^{k+1} := argmin_{x_N ∈ X_N} L_γ(x_1^{k+1}, x_2^{k+1}, ..., x_{N−1}^{k+1}, x_N; λ^k)
λ^{k+1} := λ^k − γ (Σ_{j=1}^N A_j x_j^{k+1} − b)
Update the primal variables in a Gauss-Seidel manner.
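For concreteness, here is a minimal NumPy sketch of this Gauss-Seidel scheme (an illustration, not code from the talk), for the special case f_j(x_j) = ½‖x_j − c_j‖² with X_j = R^{n_j}, so every subproblem is a linear solve; the function name and the quadratic choice of f_j are assumptions.

```python
import numpy as np

def admm_gauss_seidel(A, c, b, gamma=1.0, iters=100):
    """Direct multi-block extension of ADMM for the illustrative problem
    min sum_j 0.5*||x_j - c_j||^2  s.t.  sum_j A_j x_j = b,  x_j free.
    A: list of (m, n_j) arrays; c: list of (n_j,) arrays; b: (m,) array."""
    N, m = len(A), b.shape[0]
    x = [np.zeros(Aj.shape[1]) for Aj in A]
    lam = np.zeros(m)
    for _ in range(iters):
        for j in range(N):                       # Gauss-Seidel sweep
            # residual of all other blocks (blocks i < j already updated)
            r = sum(A[i] @ x[i] for i in range(N) if i != j) - b
            # closed-form minimizer of the j-th augmented-Lagrangian subproblem:
            # (I + gamma*A_j'A_j) x_j = c_j + A_j'lam - gamma*A_j'r
            lhs = np.eye(A[j].shape[1]) + gamma * A[j].T @ A[j]
            x[j] = np.linalg.solve(lhs, c[j] + A[j].T @ lam - gamma * A[j].T @ r)
        lam = lam - gamma * (sum(A[j] @ x[j] for j in range(N)) - b)
    return x, lam
```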
ADMM for N = 2
ADMM for N = 2:
x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k; λ^k)
x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2; λ^k)
λ^{k+1} := λ^k − γ (A_1 x_1^{k+1} + A_2 x_2^{k+1} − b)
- Long history, going back to variational methods for PDEs in the 1950s
- Related to the Douglas-Rachford and Peaceman-Rachford operator splitting methods for finding a zero of monotone operators: find x such that 0 ∈ A(x) + B(x)
- Revisited recently for sparse optimization [Wang-Yang-Yin-Zhang-2008] [Goldstein-Osher-2009] [Boyd-etal-2011]
Global Convergence of ADMM for N = 2
ADMM for N = 2:
x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k; λ^k)
x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2; λ^k)
λ^{k+1} := λ^k − γ (A_1 x_1^{k+1} + A_2 x_2^{k+1} − b)
Global convergence for any γ > 0 (Fortin-Glowinski-1983; Gabay-1983; Glowinski-Le Tallec-1989; Eckstein-Bertsekas-1992).
ADMM for N = 2 with fixed dual step size:
x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k; λ^k)
x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2; λ^k)
λ^{k+1} := λ^k − αγ (A_1 x_1^{k+1} + A_2 x_2^{k+1} − b)
α > 0 is a fixed dual step size. Global convergence for any γ > 0 and α ∈ (0, (1 + √5)/2).
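As a toy check of these guarantees (not from the slides), the sketch below runs the α-variant on a two-block consensus problem whose solution is known in closed form; the data a, c and the function name are illustrative assumptions.

```python
import numpy as np

def admm_two_block(a, c, gamma=1.0, alpha=1.0, iters=200):
    """Two-block ADMM with fixed dual step size alpha for the toy problem
    min 0.5*||x1 - a||^2 + 0.5*||x2 - c||^2  s.t.  x1 - x2 = 0,
    i.e. A1 = I, A2 = -I, b = 0; the optimum is x1 = x2 = (a + c)/2."""
    x2 = np.zeros_like(a)
    lam = np.zeros_like(a)
    for _ in range(iters):
        x1 = (a + lam + gamma * x2) / (1.0 + gamma)   # x1-subproblem
        x2 = (c - lam + gamma * x1) / (1.0 + gamma)   # x2-subproblem
        lam = lam - alpha * gamma * (x1 - x2)         # dual step with size alpha
    return x1, x2, lam

x1, x2, _ = admm_two_block(np.array([1.0, 3.0]), np.array([3.0, 1.0]))
print(x1, x2)   # both approach [2, 2] for any gamma > 0, alpha in (0, (1+sqrt(5))/2)
```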
Sublinear Convergence of ADMM for N = 2
- Ergodic O(1/k) convergence (He-Yuan-2012)
- Non-ergodic O(1/k) convergence (He-Yuan-2012)
- Ergodic O(1/k) convergence (Monteiro-Svaiter-2013)
Linear Convergence Rate of ADMM for N = 2
- The Douglas-Rachford splitting method converges linearly if B is coercive and Lipschitz (Lions-Mercier-1979)
- Linear convergence for solving linear programs (Eckstein-Bertsekas-1990)
- Linear convergence for quadratic programs (Han-Yuan-2013; Boley-2013)
Generalized ADMM
Generalized ADMM for N = 2 (Deng-Yin-2012):
x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k; λ^k) + (1/2) ‖x_1 − x_1^k‖²_P
x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2; λ^k) + (1/2) ‖x_2 − x_2^k‖²_Q
λ^{k+1} := λ^k − αγ (A_1 x_1^{k+1} + A_2 x_2^{k+1} − b)
One sufficient condition for guaranteeing global linear convergence: P = Q = 0, α = 1, f_2 strongly convex, ∇f_2 Lipschitz continuous, A_2 full row rank.
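One standard use of the proximal terms (an illustration consistent with, but not stated on, this slide): choosing P = τI − γA_1ᵀA_1 with τ ≥ γλ_max(A_1ᵀA_1) cancels the quadratic coupling through A_1 and "linearizes" the penalty, so the x_1-subproblem becomes a single proximal step on f_1. In the sketch, prox_f1 is a hypothetical user-supplied proximal operator.

```python
import numpy as np

def linearized_x1_update(prox_f1, x1, x2, lam, A1, A2, b, gamma, tau):
    """x1-step of generalized ADMM with P = tau*I - gamma*A1'A1, where
    tau >= gamma*lambda_max(A1'A1) ensures P is positive semidefinite.
    prox_f1(v, t) should return argmin_u f1(u) + (1/(2t))*||u - v||^2."""
    # gradient of the augmented quadratic at the current point x1
    grad = gamma * A1.T @ (A1 @ x1 + A2 @ x2 - b) - A1.T @ lam
    # the P-weighted subproblem collapses to one prox step of f1
    return prox_f1(x1 - grad / tau, 1.0 / tau)
```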
ADMM for N ≥ 3: a counterexample
A negative result (Chen-He-Ye-Yuan-2013): the direct extension of multi-block ADMM is not necessarily convergent.
A counterexample (with f_1 = f_2 = f_3 = 0):
A_1 x_1 + A_2 x_2 + A_3 x_3 = 0, where A = (A_1, A_2, A_3) = [1 1 1; 1 1 2; 1 2 2].
The update of multi-block ADMM with γ = 1 solves, at each iteration,
[ 3 0 0 | 0 0 0 ]              [ 0 −4 −5 | 1 1 1 ]
[ 4 6 0 | 0 0 0 ]              [ 0  0 −7 | 1 1 2 ]
[ 5 7 9 | 0 0 0 ]  u^{k+1}  =  [ 0  0  0 | 1 2 2 ]  u^k,
[ 1 1 1 | 1 0 0 ]              [ 0  0  0 | 1 0 0 ]
[ 1 1 2 | 0 1 0 ]              [ 0  0  0 | 0 1 0 ]
[ 1 2 2 | 0 0 1 ]              [ 0  0  0 | 0 0 1 ]
where u^k := (x_1^k; x_2^k; x_3^k; λ^k) and λ^k ∈ R³.
ADMM for N ≥ 3: a counterexample
Equivalently, eliminating x_1,
(x_2^{k+1}; x_3^{k+1}; λ^{k+1}) = M (x_2^k; x_3^k; λ^k), where
M = (1/162) ×
[ 144   −9   −9   −9   18 ]
[   8  157   −5   13   −8 ]
[  64  122  122  −58  −64 ]
[  56  −35  −35   91  −56 ]
[ −88  −26  −26  −62   88 ]
Note that ρ(M) > 1.
Theorem (Chen-He-Ye-Yuan-2013)
There exists an example (with real data and a real initial point) for which the direct extension of three-block ADMM is not necessarily convergent for any choice of γ > 0.
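The divergence can be checked numerically; this snippet (illustrative, with the signs of M as reconstructed from the update equations above) just forms M and computes its spectral radius.

```python
import numpy as np

# Iteration matrix M from the Chen-He-Ye-Yuan counterexample (gamma = 1),
# acting on the reduced state (x2, x3, lambda_1, lambda_2, lambda_3)
M = np.array([[144,  -9,  -9,  -9,  18],
              [  8, 157,  -5,  13,  -8],
              [ 64, 122, 122, -58, -64],
              [ 56, -35, -35,  91, -56],
              [-88, -26, -26, -62,  88]], dtype=float) / 162.0

rho = max(abs(np.linalg.eigvals(M)))
print(rho)  # spectral radius > 1: the linear iteration diverges generically
```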
ADMM for N ≥ 3: Strong convexity?
min 0.05 x_1² + 0.05 x_2² + 0.05 x_3²
s.t. [1 1 1; 1 1 2; 1 2 2] (x_1; x_2; x_3) = 0.
- For γ = 1, ρ(M) = 1.0087 > 1
- One can find a proper initial point from which ADMM diverges
- Even for strongly convex programming, the directly extended ADMM is not necessarily convergent for a certain γ > 0.
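The same check for this strongly convex example, building the iteration matrix numerically (a sketch): the map (x_2^k, x_3^k, λ^k) ↦ (x_2^{k+1}, x_3^{k+1}, λ^{k+1}) is linear, so applying one ADMM sweep to the five unit vectors recovers M column by column. The closed-form block solves below follow from f_i(x_i) = 0.05 x_i² and γ = 1.

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 2.0]])
a1, a2, a3 = A[:, 0], A[:, 1], A[:, 2]

def admm_step(s):
    """One ADMM sweep on the state s = (x2, x3, lambda) in R^5."""
    x2, x3, lam = s[0], s[1], s[2:]
    x1 = (a1 @ lam - 4*x2 - 5*x3) / (0.1 + 3.0)   # (0.1 + a1'a1) x1 = ...
    x2 = (a2 @ lam - 4*x1 - 7*x3) / (0.1 + 6.0)
    x3 = (a3 @ lam - 5*x1 - 7*x2) / (0.1 + 9.0)
    lam = lam - (a1*x1 + a2*x2 + a3*x3)           # gamma = 1 dual update
    return np.concatenate(([x2, x3], lam))

# columns of M = images of the unit vectors under the (linear) iteration map
M = np.column_stack([admm_step(e) for e in np.eye(5)])
print(max(abs(np.linalg.eigvals(M))))  # about 1.0087 > 1, as stated above
```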
ADMM for N ≥ 3: Strong convexity works!
Global convergence
Theorem (Han-Yuan-2012)
If f_i, i = 1, ..., N, are strongly convex with parameters σ_i, and
0 < γ < min_{i=1,...,N} { 2σ_i / (3(N−1) λ_max(A_iᵀ A_i)) },
then multi-block ADMM globally converges.
Convergence rate?
ADMM for N ≥ 3: weaker condition and convergence rate
u := (x_1, ..., x_N), x̄_i^t := (1/(t+1)) Σ_{k=0}^t x_i^{k+1}, 1 ≤ i ≤ N, λ̄^t := (1/(t+1)) Σ_{k=0}^t λ^{k+1}.
Theorem (Lin-Ma-Zhang-2014a)
If f_2, ..., f_N are strongly convex, f_1 is convex, and
γ ≤ min { min_{2 ≤ i ≤ N−1} 2σ_i / ((2N−i)(i−1) λ_max(A_iᵀ A_i)), 2σ_N / ((N−2)(N+1) λ_max(A_Nᵀ A_N)) },
then f(ū^t) − f(u*) = O(1/t) and ‖Σ_{i=1}^N A_i x̄_i^t − b‖ = O(1/t).
- Weaker condition
- Ergodic O(1/t) convergence rate in terms of objective value and primal feasibility
ADMM for N ≥ 3: non-ergodic convergence rate
Optimality measure: if
A_2 x_2^{k+1} − A_2 x_2^k = 0, A_3 x_3^{k+1} − A_3 x_3^k = 0, and A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} − b = 0,
then (x_1^{k+1}, x_2^{k+1}, x_3^{k+1}, λ^{k+1}) is optimal.
Define
R_{k+1} := ‖A_1 x_1^{k+1} + A_2 x_2^{k+1} + A_3 x_3^{k+1} − b‖² + 2 ‖A_2 x_2^{k+1} − A_2 x_2^k‖² + 3 ‖A_3 x_3^{k+1} − A_3 x_3^k‖².
We can prove: R_k = o(1/k).
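A small helper computing this residual from two consecutive iterates (illustrative; assumes NumPy arrays of compatible shapes):

```python
import numpy as np

def optimality_residual(A1, A2, A3, b, x_new, x_old):
    """R_{k+1} from the slide: primal infeasibility plus the weighted
    successive differences of A2*x2 and A3*x3; R = 0 certifies optimality."""
    x1, x2, x3 = x_new
    _, x2_old, x3_old = x_old
    r = A1 @ x1 + A2 @ x2 + A3 @ x3 - b
    return (np.linalg.norm(r)**2
            + 2 * np.linalg.norm(A2 @ (x2 - x2_old))**2
            + 3 * np.linalg.norm(A3 @ (x3 - x3_old))**2)
```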
ADMM for N ≥ 3: non-ergodic convergence rate
Theorem (Lin-Ma-Zhang-2014a)
If f_2 and f_3 are strongly convex, and
γ ≤ min { σ_2 / (2 λ_max(A_2ᵀ A_2)), σ_3 / (2 λ_max(A_3ᵀ A_3)) },
then Σ_{k=1}^∞ R_k < +∞ and R_k = o(1/k).
ADMM for N ≥ 3: non-ergodic convergence rate
Theorem (Lin-Ma-Zhang-2014a)
If f_2, ..., f_N are strongly convex, and
γ ≤ min { min_{2 ≤ i ≤ N−1} 2σ_i / ((2N−i)(i−1) λ_max(A_iᵀ A_i)), 2σ_N / ((N−2)(N+1) λ_max(A_Nᵀ A_N)) },
then Σ_{k=1}^∞ R_k < +∞ and R_k = o(1/k), where
R_{k+1} := ‖Σ_{i=1}^N A_i x_i^{k+1} − b‖² + Σ_{i=2}^N ((2N−i)(i−1)/2) ‖A_i x_i^k − A_i x_i^{k+1}‖².
ADMM for N ≥ 3: global linear convergence
Global linear convergence of ADMM for N ≥ 3 (Lin-Ma-Zhang-2014b):

Scenario | strongly convex | Lipschitz gradient | full row rank | full column rank
   1     | f_2, ..., f_N   | ∇f_N               | A_N           | —
   2     | f_1, ..., f_N   | ∇f_1, ..., ∇f_N    | —             | —
   3     | f_2, ..., f_N   | ∇f_1, ..., ∇f_N    | —             | A_1

Table: Three scenarios leading to global linear convergence.
Reduces to the conditions in (Deng-Yin-2012) when N = 2.
Variants: Modified Proximal Jacobian ADMM (Deng-Lai-Peng-Yin-2014)
x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k, ..., x_N^k; λ^k) + (1/2) ‖x_1 − x_1^k‖²_{P_1}
x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^k, x_2, x_3^k, ..., x_N^k; λ^k) + (1/2) ‖x_2 − x_2^k‖²_{P_2}
...
x_N^{k+1} := argmin_{x_N ∈ X_N} L_γ(x_1^k, x_2^k, ..., x_{N−1}^k, x_N; λ^k) + (1/2) ‖x_N − x_N^k‖²_{P_N}
λ^{k+1} := λ^k − αγ (Σ_{j=1}^N A_j x_j^{k+1} − b)
Conditions for convergence:
P_i ⪰ γ (1/ε_i − 1) A_iᵀ A_i, i = 1, 2, ..., N, and Σ_{i=1}^N ε_i < 2 − α
o(1/k) convergence rate in the non-ergodic sense
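A sketch of this Jacobian scheme on the same quadratic test problem as before, with diagonal proximal weights P_j = τ_j I; the particular ε_j below is only one way to satisfy the displayed conditions, not a choice prescribed by the paper. All blocks read only x^k, so the inner loop is parallelizable.

```python
import numpy as np

def prox_jacobian_admm(A, c, b, gamma=1.0, alpha=1.0, iters=200):
    """Proximal Jacobian ADMM sketch for
    min sum_j 0.5*||x_j - c_j||^2  s.t.  sum_j A_j x_j = b,
    with P_j = tau_j * I.  Requires alpha < 2 for the eps below."""
    N, m = len(A), b.shape[0]
    # eps_j = (2 - alpha)/(2N) gives sum_j eps_j < 2 - alpha (an assumption);
    # tau_j then satisfies P_j >= gamma*(1/eps_j - 1)*A_j'A_j
    eps = (2.0 - alpha) / (2.0 * N)
    tau = [gamma * (1.0/eps - 1.0) * np.linalg.norm(Aj.T @ Aj, 2) for Aj in A]
    x = [np.zeros(Aj.shape[1]) for Aj in A]
    lam = np.zeros(m)
    for _ in range(iters):
        r = sum(A[j] @ x[j] for j in range(N)) - b
        x_new = []
        for j in range(N):                      # Jacobi: all blocks use x^k
            rj = r - A[j] @ x[j]                # residual of the other blocks
            lhs = (1.0 + tau[j]) * np.eye(A[j].shape[1]) + gamma * A[j].T @ A[j]
            rhs = c[j] + A[j].T @ lam - gamma * A[j].T @ rj + tau[j] * x[j]
            x_new.append(np.linalg.solve(lhs, rhs))
        x = x_new
        lam = lam - alpha * gamma * (sum(A[j] @ x[j] for j in range(N)) - b)
    return x, lam
```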
Variants: Modified Proximal Gauss-Seidel ADMM
x_1^{k+1} := argmin_{x_1 ∈ X_1} L_γ(x_1, x_2^k, x_3^k; λ^k) + (1/2) ‖x_1 − x_1^k‖²_{P_1}
x_2^{k+1} := argmin_{x_2 ∈ X_2} L_γ(x_1^{k+1}, x_2, x_3^k; λ^k) + (1/2) ‖x_2 − x_2^k‖²_{P_2}
x_3^{k+1} := argmin_{x_3 ∈ X_3} L_γ(x_1^{k+1}, x_2^{k+1}, x_3; λ^k) + (1/2) ‖x_3 − x_3^k‖²_{P_3}
λ^{k+1} := λ^k − αγ (Σ_{j=1}^3 A_j x_j^{k+1} − b)
Proximal Gauss-Seidel ADMM
Theorem (Lin-Ma-Zhang-2014c)
Ergodic O(1/k) convergence rate in terms of both objective value and primal feasibility, under the conditions: f_3 is strongly convex, and
P_1 ⪰ γ (3 + 5/ε_1) A_1ᵀ A_1
P_2 ⪰ γ (1 + 3/ε_2) A_2ᵀ A_2
P_3 ⪰ γ (1/ε_3 − 1) A_3ᵀ A_3
3(ε_1 + ε_2 + ε_3) < 1/α
γ < σ_3 ε_3 / (2(ε_3 + 1) λ_max(A_3ᵀ A_3))
Ongoing work. More coming soon.
The General Problem Formulation
We consider the following convex optimization problem:
min f(x) := g(x_1, ..., x_K) + Σ_{k=1}^K h_k(x_k)
s.t. E_1 x_1 + E_2 x_2 + ... + E_K x_K = q, x_k ∈ X_k, k = 1, 2, ..., K. (P)
- g(·): a smooth convex function; h_k: a nonsmooth convex function
- x := (x_1ᵀ, ..., x_Kᵀ)ᵀ ∈ R^n: block variables; X := Π_{k=1}^K X_k: feasible set
- E := (E_1, ..., E_K) and h(x) := Σ_{k=1}^K h_k(x_k)
Augmented Lagrangian function:
L(x; y) = g(x) + h(x) + ⟨y, q − Ex⟩ + (ρ/2) ‖q − Ex‖²
Small dual step size and error bound condition
ADMM for N ≥ 3 without the strong convexity assumption (Hong-Luo-2012):
x_1^{k+1} := argmin_{x_1 ∈ X_1} L(x_1, x_2^k, ..., x_N^k; y^k)
x_2^{k+1} := argmin_{x_2 ∈ X_2} L(x_1^{k+1}, x_2, x_3^k, ..., x_N^k; y^k)
...
x_N^{k+1} := argmin_{x_N ∈ X_N} L(x_1^{k+1}, x_2^{k+1}, ..., x_{N−1}^{k+1}, x_N; y^k)
y^{k+1} := y^k − α (Σ_{j=1}^N A_j x_j^{k+1} − b)
Strong convexity is not assumed, but other conditions are needed.
Small dual step size and error bound condition
Error bound condition: there exist positive scalars τ and δ such that
dist(x, X(y)) ≤ τ ‖∇_x L(x; y)‖, for all (x, y) such that ‖∇_x L(x; y)‖ ≤ δ.
[Hong-Luo-2012]: provided α is small enough (upper bounded by a constant depending on τ and δ), the small-step-size variant of ADMM converges linearly.
The required α is too small and not computable, hence not practical.
The BSUM-M Algorithm: Main Ideas
A Block Successive Upper-bound Minimization Method of Multipliers.
Introduce the augmented Lagrangian function for problem (P):
L(x; y) = g(x) + h(x) + ⟨y, q − Ex⟩ + (ρ/2) ‖q − Ex‖²
y: dual variable; ρ ≥ 0: primal stepsize
Main idea, primal update:
1 Update the primal variables successively (Gauss-Seidel)
2 Optimize some approximate version of L(x, y)
Main idea, dual update:
1 Inexact dual ascent + proper step size control
The BSUM-M Algorithm: Details
At iteration r + 1, block variable x_k is updated by solving
min_{x_k ∈ X_k} u_k(x_k; x_1^{r+1}, ..., x_{k−1}^{r+1}, x_k^r, ..., x_K^r) + ⟨y^{r+1}, q − E_k x_k⟩ + h_k(x_k)
u_k(·; x_1^{r+1}, ..., x_{k−1}^{r+1}, x_k^r, ..., x_K^r) is an upper bound of g(x) + (ρ/2) ‖q − Ex‖² at the current iterate (x_1^{r+1}, ..., x_{k−1}^{r+1}, x_k^r, ..., x_K^r).
The BSUM-M Algorithm: Details (cont.)
The BSUM-M Algorithm. At each iteration r ≥ 1:
y^{r+1} = y^r + α^r (q − E x^r) = y^r + α^r (q − Σ_{k=1}^K E_k x_k^r),
x_k^{r+1} = argmin_{x_k ∈ X_k} u_k(x_k; w_k^{r+1}) − ⟨y^{r+1}, E_k x_k⟩ + h_k(x_k), ∀k,
where α^r > 0 is the dual stepsize. To simplify notation, we have defined
w_k^{r+1} := (x_1^{r+1}, ..., x_{k−1}^{r+1}, x_k^r, x_{k+1}^r, ..., x_K^r).
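A minimal sketch of BSUM-M on the three-block counterexample discussed later (g = h = 0, q = 0), using exact block minimization as the upper bound u_k and the illustrative diminishing stepsize α^r = α⁰/r, one choice satisfying the conditions of the convergence theorem below; the constants rho and alpha0 are assumptions.

```python
import numpy as np

E = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 2.0]])
q = np.zeros(3)
rho, alpha0 = 1.0, 0.1
x, y = np.random.randn(3), np.zeros(3)   # scalar blocks x_1, x_2, x_3

for r in range(1, 2001):
    y = y + (alpha0 / r) * (q - E @ x)   # dual step FIRST, diminishing stepsize
    for k in range(3):                   # Gauss-Seidel primal sweep
        e = E[:, k]
        res = E @ x - e * x[k] - q       # contribution of the other blocks
        # exact minimizer of rho/2*||e*x_k + res||^2 - <y, e*x_k> over x_k
        x[k] = (e @ y / rho - e @ res) / (e @ e)

print(np.linalg.norm(E @ x - q))         # feasibility residual, tends to 0
```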
The BSUM-M Algorithm: Randomized Version
Select {p_k > 0}_{k=0}^K such that Σ_{k=0}^K p_k = 1.
Each iteration t updates only a single, randomly selected primal or dual variable.
At iteration t ≥ 1, pick k ∈ {0, ..., K} with probability p_k:
If k = 0:
y^{t+1} = y^t + α^t (q − E x^t), x_k^{t+1} = x_k^t, k = 1, ..., K.
Else if k ∈ {1, ..., K}:
x_k^{t+1} = argmin_{x_k ∈ X_k} u_k(x_k; x^t) − ⟨y^t, E_k x_k⟩ + h_k(x_k),
x_j^{t+1} = x_j^t, j ≠ k, y^{t+1} = y^t.
End
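The corresponding randomized sketch on the same example (self-contained; the uniform probabilities p_k are an arbitrary illustrative choice):

```python
import numpy as np

# Randomized BSUM-M sketch: each iteration updates ONE randomly chosen
# coordinate -- the dual vector (k = 0) or a single primal block.
E = np.array([[1.0, 1.0, 1.0], [1.0, 1.0, 2.0], [1.0, 2.0, 2.0]])
q, rho, alpha0 = np.zeros(3), 1.0, 0.1
p = np.ones(4) / 4.0                     # selection probabilities p_0, ..., p_K
x, y = np.random.randn(3), np.zeros(3)

for t in range(1, 5001):
    k = np.random.choice(4, p=p)
    if k == 0:
        y = y + (alpha0 / t) * (q - E @ x)        # dual update only
    else:
        e = E[:, k - 1]                           # one primal block only
        res = E @ x - e * x[k - 1] - q
        x[k - 1] = (e @ y / rho - e @ res) / (e @ e)

print(np.linalg.norm(E @ x - q))         # feasibility residual -> 0 w.p.1
```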
Convergence Analysis: Assumptions
Assumption A (on the problem)
(a) Problem (P) is a convex problem.
(b) g(x) = ℓ(Ax) + ⟨x, b⟩; ℓ(·) smooth and strictly convex; A not necessarily of full column rank.
(c) The nonsmooth functions h_k have the form
h_k(x_k) = λ_k ‖x_k‖_1 + Σ_J w_J ‖x_{k,J}‖_2,
where x_k = (..., x_{k,J}, ...) is a partition of x_k; λ_k ≥ 0 and w_J ≥ 0 are constants.
(d) The feasible sets {X_k} are compact polyhedral sets, given by X_k := {x_k | C_k x_k ≤ c_k}.
Convergence Analysis: Assumptions
Assumption B (on u_k)
(a) u_k(v_k; x) ≥ g(v_k, x_{−k}) + (ρ/2) ‖E_k v_k + E_{−k} x_{−k} − q‖², ∀ v_k ∈ X_k, ∀ x, k (upper bound)
(b) u_k(x_k; x) = g(x) + (ρ/2) ‖Ex − q‖², ∀ x, k (locally tight)
(c) ∇u_k(x_k; x) = ∇_k (g(x) + (ρ/2) ‖Ex − q‖²), ∀ x, k (gradient consistency)
(d) For any given x, u_k(v_k; x) is strongly convex in v_k.
(e) For any given x, u_k(v_k; x) has Lipschitz continuous gradient.
Figure: Illustration of the upper bound (omitted).
The Convergence Result
Theorem (Hong, Chang, Wang, Razaviyayn, Ma and Luo 2014)
Suppose Assumptions A-B hold, and assume the dual stepsize α^r satisfies
Σ_{r=1}^∞ α^r = ∞, lim_{r→∞} α^r = 0.
Then we have the following:
1 For BSUM-M, lim_{r→∞} ‖E x^r − q‖ = 0, and every limit point of {x^r, y^r} is a primal and dual optimal solution.
2 For RBSUM-M, lim_{t→∞} ‖E x^t − q‖ = 0 w.p.1. Further, every limit point of {x^t, y^t} is a primal and dual optimal solution w.p.1.
Counterexample for multi-block ADMM
Recently, [Chen-He-Ye-Yuan 13] showed (through an example) that applying ADMM to a multi-block problem can diverge.
We show that applying (R)BSUM-M to the same problem converges (with diminishing dual step size).
Main message: dual stepsize control is crucial.
Consider the following linear system of equations (unique solution x_1 = x_2 = x_3 = 0):
E_1 x_1 + E_2 x_2 + E_3 x_3 = 0, with [E_1 E_2 E_3] = [1 1 1; 1 1 2; 1 2 2].
Counterexample for multi-block ADMM (cont.)
Figure: Iterates x_1, x_2, x_3 and x_1 + x_2 + x_3 generated by BSUM-M (left, 400 iterations) and by RBSUM-M (right, 1200 iterations); each curve is averaged over 1000 runs with random starting points. (Plots omitted.)
Thank you for your attention!