Shiqian Ma, MAT-258A: Numerical Optimization. Chapter 9: Alternating Direction Method of Multipliers



Separable convex optimization

    \min_{x,y}\ f(x) + g(y) \quad \text{s.t.}\quad Ax + By = b

A special case is

    \min_x\ f(x) + g(x),

because it is equivalent to (by variable splitting)

    \min_{x,y}\ f(x) + g(y) \quad \text{s.t.}\quad x - y = 0.

- Both f and g are closed and convex.
- Both f and g have special structure: easy proximal mappings.
- Possibly both f and g are nonsmooth, so the proximal gradient method cannot be used.

Example: Robust PCA

    \min_{X \in \mathbb{R}^{m\times n}}\ \|X\|_* + \rho\|M - X\|_1

or, equivalently,

    \min_{X,Y \in \mathbb{R}^{m\times n}}\ \|X\|_* + \rho\|Y\|_1 \quad \text{s.t.}\quad X + Y = M.

Can we use ALM to solve this? The augmented Lagrangian function is

    L_t(X,Y;\Lambda) = \|X\|_* + \rho\|Y\|_1 - \langle\Lambda, X+Y-M\rangle + \tfrac{t}{2}\|X+Y-M\|_F^2.

ALM:

    (X^{k+1}, Y^{k+1}) = \mathrm{argmin}_{X,Y}\ L_t(X,Y;\Lambda^k)
    \Lambda^{k+1} = \Lambda^k - t(X^{k+1}+Y^{k+1}-M)

Any disadvantage?

Surveillance video background extraction

Example: Sparse Inverse Covariance Selection

    \min_X\ -\log\det(X) + \langle\Sigma, X\rangle + \rho\|X\|_1

Can we use PGM to solve this? Proximal gradient method (note that g(X) = -\log\det(X) + \langle\Sigma, X\rangle is smooth):

    X^{k+1} := \mathrm{argmin}_X\ \rho\|X\|_1 + \tfrac{1}{2\tau}\|X - (X^k - \tau\nabla g(X^k))\|_F^2

Any disadvantage?

The same problem

    \min_X\ -\log\det(X) + \langle\Sigma, X\rangle + \rho\|X\|_1

is equivalent to (by variable splitting)

    \min_{X,Y}\ -\log\det(X) + \langle\Sigma, X\rangle + \rho\|Y\|_1 \quad \text{s.t.}\quad X - Y = 0.

Can we use ALM to solve this? The augmented Lagrangian function is

    L_t(X,Y;\Lambda) = -\log\det(X) + \langle\Sigma,X\rangle + \rho\|Y\|_1 - \langle\Lambda, X-Y\rangle + \tfrac{t}{2}\|X-Y\|_F^2.

ALM:

    (X^{k+1}, Y^{k+1}) = \mathrm{argmin}_{X,Y}\ L_t(X,Y;\Lambda^k)
    \Lambda^{k+1} = \Lambda^k - t(X^{k+1}-Y^{k+1})

Any disadvantage?

Alternating Direction Method of Multipliers (ADMM)

Robust PCA:

    \min_{X,Y \in \mathbb{R}^{m\times n}}\ \|X\|_* + \rho\|Y\|_1 \quad \text{s.t.}\quad X + Y = M

Augmented Lagrangian function:

    L_t(X,Y;\Lambda) = \|X\|_* + \rho\|Y\|_1 - \langle\Lambda, X+Y-M\rangle + \tfrac{t}{2}\|X+Y-M\|_F^2

ALM:

    (X^{k+1}, Y^{k+1}) = \mathrm{argmin}_{X,Y}\ L_t(X,Y;\Lambda^k)
    \Lambda^{k+1} = \Lambda^k - t(X^{k+1}+Y^{k+1}-M)

Alternating Direction Method of Multipliers:

    X^{k+1} = \mathrm{argmin}_X\ L_t(X, Y^k; \Lambda^k)
    Y^{k+1} = \mathrm{argmin}_Y\ L_t(X^{k+1}, Y; \Lambda^k)
    \Lambda^{k+1} = \Lambda^k - t(X^{k+1}+Y^{k+1}-M)

The X-subproblem is

    \min_X\ \|X\|_* + \rho\|Y^k\|_1 - \langle\Lambda^k, X+Y^k-M\rangle + \tfrac{t}{2}\|X+Y^k-M\|_F^2,

which is equivalent to

    \min_X\ \|X\|_* + \tfrac{t}{2}\|X + Y^k - M - \Lambda^k/t\|_F^2,

i.e., the proximal mapping of the nuclear norm \|X\|_*.

The Y-subproblem is

    \min_Y\ \|X^{k+1}\|_* + \rho\|Y\|_1 - \langle\Lambda^k, X^{k+1}+Y-M\rangle + \tfrac{t}{2}\|X^{k+1}+Y-M\|_F^2,

which is equivalent to

    \min_Y\ \rho\|Y\|_1 + \tfrac{t}{2}\|X^{k+1} + Y - M - \Lambda^k/t\|_F^2,

i.e., the proximal mapping of \rho\|Y\|_1.
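To make the two proximal mappings concrete, here is a minimal NumPy sketch of this ADMM for robust PCA; the helper names soft_threshold, svt and admm_rpca are mine, not from the slides.

```python
import numpy as np

def soft_threshold(Z, tau):
    # proximal mapping of tau*||.||_1 (entrywise shrinkage)
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def svt(Z, tau):
    # proximal mapping of tau*||.||_* (singular value thresholding)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def admm_rpca(M, rho=0.1, t=1.0, iters=200):
    """ADMM for min ||X||_* + rho*||Y||_1  s.t.  X + Y = M."""
    X, Y, Lam = np.zeros_like(M), np.zeros_like(M), np.zeros_like(M)
    for _ in range(iters):
        X = svt(M - Y + Lam / t, 1.0 / t)             # X-subproblem
        Y = soft_threshold(M - X + Lam / t, rho / t)  # Y-subproblem
        Lam = Lam - t * (X + Y - M)                   # multiplier update
    return X, Y
```

Each iteration costs one SVD (for the nuclear-norm proximal mapping) plus an entrywise shrinkage.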

Alternating Direction Method of Multipliers

Sparse inverse covariance selection:

    \min_{X,Y}\ -\log\det(X) + \langle\Sigma, X\rangle + \rho\|Y\|_1 \quad \text{s.t.}\quad X - Y = 0

Augmented Lagrangian function:

    L_t(X,Y;\Lambda) = -\log\det(X) + \langle\Sigma,X\rangle + \rho\|Y\|_1 - \langle\Lambda, X-Y\rangle + \tfrac{t}{2}\|X-Y\|_F^2

ALM:

    (X^{k+1}, Y^{k+1}) = \mathrm{argmin}_{X,Y}\ L_t(X,Y;\Lambda^k)
    \Lambda^{k+1} = \Lambda^k - t(X^{k+1}-Y^{k+1})

ADMM:

    X^{k+1} = \mathrm{argmin}_X\ L_t(X, Y^k; \Lambda^k)
    Y^{k+1} = \mathrm{argmin}_Y\ L_t(X^{k+1}, Y; \Lambda^k)
    \Lambda^{k+1} = \Lambda^k - t(X^{k+1}-Y^{k+1})

The X-subproblem is

    \min_X\ -\log\det(X) + \langle\Sigma,X\rangle + \rho\|Y^k\|_1 - \langle\Lambda^k, X-Y^k\rangle + \tfrac{t}{2}\|X-Y^k\|_F^2,

which is equivalent to

    \min_X\ -\log\det(X) + \tfrac{t}{2}\|X - Y^k + (\Sigma - \Lambda^k)/t\|_F^2,

i.e., the proximal mapping of -\log\det(X).

The Y-subproblem is

    \min_Y\ -\log\det(X^{k+1}) + \langle\Sigma,X^{k+1}\rangle + \rho\|Y\|_1 - \langle\Lambda^k, X^{k+1}-Y\rangle + \tfrac{t}{2}\|X^{k+1}-Y\|_F^2,

which is equivalent to

    \min_Y\ \rho\|Y\|_1 + \tfrac{t}{2}\|X^{k+1} - Y - \Lambda^k/t\|_F^2,

i.e., the proximal mapping of \rho\|Y\|_1.
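The X-subproblem above has a closed-form solution via one eigenvalue decomposition: writing its input as B, the optimality condition tX - X^{-1} = tB can be solved eigenvalue-wise. A small NumPy sketch (the name prox_neg_logdet is mine):

```python
import numpy as np

def prox_neg_logdet(B, t):
    """Solve min_X -log det(X) + (t/2)*||X - B||_F^2 for symmetric B."""
    d, Q = np.linalg.eigh((B + B.T) / 2)         # eigendecomposition of (symmetrized) B
    x = (d + np.sqrt(d**2 + 4.0 / t)) / 2.0      # positive root of t*x^2 - t*d*x - 1 = 0
    return (Q * x) @ Q.T

# X-subproblem above: X_next = prox_neg_logdet(Y_k - (Sigma - Lam_k) / t, t)
```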

General form of ADMM

Convex minimization with two-block separable structure:

    \min\ f(x) + g(y) \quad \text{s.t.}\quad Ax + By = b

Augmented Lagrangian function:

    L_t(x,y;\lambda) = f(x) + g(y) - \langle\lambda, Ax+By-b\rangle + \tfrac{t}{2}\|Ax+By-b\|_2^2

ADMM:

    x^{k+1} = \mathrm{argmin}_x\ L_t(x, y^k; \lambda^k)
    y^{k+1} = \mathrm{argmin}_y\ L_t(x^{k+1}, y; \lambda^k)
    \lambda^{k+1} = \lambda^k - t(Ax^{k+1} + By^{k+1} - b)

The two subproblems are

    x^{k+1} = \mathrm{argmin}_x\ f(x) + \tfrac{t}{2}\|Ax + By^k - b - \lambda^k/t\|_2^2
    y^{k+1} = \mathrm{argmin}_y\ g(y) + \tfrac{t}{2}\|Ax^{k+1} + By - b - \lambda^k/t\|_2^2
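As a reference point, here is a generic two-block ADMM skeleton matching the updates above, with the two subproblem solvers supplied as callables (this interface is my own, illustrative choice):

```python
import numpy as np

def admm(argmin_x, argmin_y, A, B, b, y0, lam0, t=1.0, iters=100):
    """Two-block ADMM for min f(x) + g(y) s.t. Ax + By = b.

    argmin_x(v, t) must return argmin_x f(x) + (t/2)*||Ax - v||^2,
    argmin_y(v, t) must return argmin_y g(y) + (t/2)*||By - v||^2.
    """
    y, lam = y0, lam0
    for _ in range(iters):
        x = argmin_x(b - B @ y + lam / t, t)    # x-update
        y = argmin_y(b - A @ x + lam / t, t)    # y-update
        lam = lam - t * (A @ x + B @ y - b)     # multiplier update
    return x, y, lam
```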

Variable splitting and reformulation

For many applications, the objective is the sum of two structured functions (one of which could be an indicator function). One can apply variable splitting to reformulate the problem as

    \min\ f(x) + g(y) \quad \text{s.t.}\quad Ax + By = b

such that both f and g have easy proximal mappings; then one can apply ADMM. For example,

    \min\ f(x) + g(Ax - b)

is equivalent to

    \min\ f(x) + g(y) \quad \text{s.t.}\quad Ax - y = b.

Compressed sensing with noise

    \min\ \|x\|_1 \quad \text{s.t.}\quad \|Ax - b\|_2 \le \sigma

can be reformulated as

    \min\ \|x\|_1 \quad \text{s.t.}\quad Ax - y = b,\ \|y\|_2 \le \sigma,

or

    \min\ \|x\|_1 + I_{\{\|y\|_2 \le \sigma\}}(y) \quad \text{s.t.}\quad Ax - y = b.

Augmented Lagrangian function:

    L_t(x,y;\lambda) = \|x\|_1 + I_{\{\|y\|_2\le\sigma\}}(y) - \langle\lambda, Ax - y - b\rangle + \tfrac{t}{2}\|Ax - y - b\|_2^2

Apply ADMM:

    x^{k+1} = \mathrm{argmin}_x\ L_t(x, y^k; \lambda^k)
    y^{k+1} = \mathrm{argmin}_y\ L_t(x^{k+1}, y; \lambda^k)
    \lambda^{k+1} = \lambda^k - t(Ax^{k+1} - y^{k+1} - b)

The x-subproblem:

    x^{k+1} = \mathrm{argmin}_x\ \|x\|_1 + \tfrac{t}{2}\|Ax - y^k - b - \lambda^k/t\|_2^2

The y-subproblem:

    y^{k+1} = \mathrm{argmin}_y\ I_{\{\|y\|_2\le\sigma\}}(y) + \tfrac{t}{2}\|Ax^{k+1} - y - b - \lambda^k/t\|_2^2;

this is the projection onto the ℓ2-ball.
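The y-subproblem is just a Euclidean projection; a minimal sketch (the function name is mine):

```python
import numpy as np

def project_l2_ball(z, sigma):
    # projection of z onto {y : ||y||_2 <= sigma}
    nrm = np.linalg.norm(z)
    return z if nrm <= sigma else (sigma / nrm) * z

# y-subproblem above: y_next = project_l2_ball(A @ x_next - b - lam / t, sigma)
```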

Portfolio Selection

- r_i: random variable, the rate of return of stock i
- x_i: the relative amount invested in stock i
- Return: r = r_1 x_1 + r_2 x_2 + ... + r_n x_n
- Expected return: R = E(r) = \sum_i E(r_i) x_i = \sum_i \mu_i x_i
- Risk: V = \mathrm{Var}(r) = \sum_{ij} \sigma_{ij} x_i x_j = x^T \Sigma x

The model:

    \min\ \tfrac12 x^T\Sigma x
    \text{s.t.}\ \sum_i \mu_i x_i = r_0,\ \sum_i x_i = 1,\ x_i \ge 0,\ i = 1,\dots,n

The problem can be reformulated as (define the set C as the probability simplex)

    \min\ \tfrac12 x^T\Sigma x
    \text{s.t.}\ \mu^T x = r_0,\ x - y = 0,\ y \in C.

Augmented Lagrangian function:

    L_t(x,y;\lambda_1,\lambda_2) = \tfrac12 x^T\Sigma x + I_{\{y\in C\}}(y) - \langle\lambda_1, \mu^T x - r_0\rangle - \langle\lambda_2, x - y\rangle + \tfrac{t}{2}(\mu^T x - r_0)^2 + \tfrac{t}{2}\|x-y\|_2^2

ADMM:

    x^{k+1} = \mathrm{argmin}_x\ L_t(x, y^k; \lambda_1^k, \lambda_2^k)
    y^{k+1} = \mathrm{argmin}_y\ L_t(x^{k+1}, y; \lambda_1^k, \lambda_2^k)
    \lambda_1^{k+1} = \lambda_1^k - t(\mu^T x^{k+1} - r_0)
    \lambda_2^{k+1} = \lambda_2^k - t(x^{k+1} - y^{k+1})
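The y-subproblem here is the Euclidean projection of x^{k+1} - \lambda_2^k/t onto the probability simplex C, for which the standard sort-based procedure works; a short NumPy sketch (names are mine):

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection of z onto {y : y >= 0, sum(y) = 1}."""
    u = np.sort(z)[::-1]                   # sort in decreasing order
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(z) + 1) > css - 1.0)[0][-1]
    tau = (css[k] - 1.0) / (k + 1)
    return np.maximum(z - tau, 0.0)

# y-subproblem above: y_next = project_simplex(x_next - lam2 / t)
```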

Total variation image deblurring

Use u ∈ R^{n²} to denote an n × n gray-scale image, and use K ∈ R^{n²×n²} to represent a blurring operator. An observation of the image is obtained by (ε is noise)

    b = Ku + \epsilon,

so one wants to minimize \|Ku - b\|_2^2. A widely used technique in image processing is to add a total variation (TV) term to preserve sharp edges:

    TV(u) = \sum_{i,j=1}^{n} \sqrt{(u_{i+1,j} - u_{i,j})^2 + (u_{i,j+1} - u_{i,j})^2}.

By a slight abuse of notation (now u is an n²-dimensional vector), TV can also be written as

    TV(u) = \sum_{i=1}^{n^2} \|D_i u\|_2,

and the TV image deblurring model is

    \min_u\ \sum_{i=1}^{n^2} \|D_i u\|_2 + \tfrac{\rho}{2}\|Ku - b\|_2^2.

By variable splitting, reformulate it as

    \min_{u,w}\ \sum_{i=1}^{n^2} \|w_i\|_2 + \tfrac{\rho}{2}\|Ku - b\|_2^2 \quad \text{s.t.}\quad D_i u - w_i = 0,\ i = 1,\dots,n^2.

Augmented Lagrangian function:

    L_t(u,w;\lambda) = \sum_{i=1}^{n^2}\|w_i\|_2 + \tfrac{\rho}{2}\|Ku-b\|_2^2 - \sum_{i=1}^{n^2}\langle\lambda_i, D_i u - w_i\rangle + \sum_{i=1}^{n^2}\tfrac{t}{2}\|D_i u - w_i\|_2^2

ADMM:

    u^{k+1} = \mathrm{argmin}_u\ L_t(u, w^k; \lambda^k)
    w^{k+1} = \mathrm{argmin}_w\ L_t(u^{k+1}, w; \lambda^k)
    \lambda_i^{k+1} = \lambda_i^k - t(D_i u^{k+1} - w_i^{k+1}),\ i = 1,\dots,n^2

The w-subproblem is separable over the w_i.
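Because the w-subproblem is separable, each w_i is obtained by a two-dimensional shrinkage (the proximal mapping of (1/t) times the 2-norm); a vectorized NumPy sketch (names are mine):

```python
import numpy as np

def shrink2(Z, tau):
    """Row-wise proximal mapping of tau*||.||_2: argmin_w tau*||w||_2 + 0.5*||w - z_i||^2."""
    nrm = np.linalg.norm(Z, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(nrm, 1e-12), 0.0)
    return scale * Z

# w-subproblem above, with z_i = D_i u^{k+1} - lambda_i^k / t stacked as rows of Z:
# W_next = shrink2(Z, 1.0 / t)
```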

TV+L1 model for image reconstruction

The image u is sparse under a wavelet transform Ψ, i.e., Ψu is sparse:

    \min_u\ \sum_{i=1}^{n^2}\|D_i u\|_2 + \gamma\|\Psi u\|_1 + \tfrac{\rho}{2}\|Ku - b\|_2^2

Reformulate as

    \min\ \sum_{i=1}^{n^2}\|w_i\|_2 + \gamma\|v\|_1 + \tfrac{\rho}{2}\|Ku-b\|_2^2
    \text{s.t.}\ D_i u - w_i = 0,\ i=1,\dots,n^2;\quad \Psi u - v = 0.

Augmented Lagrangian function:

    L_t(u,w,v;\lambda,\mu) = \sum_{i=1}^{n^2}\|w_i\|_2 + \gamma\|v\|_1 + \tfrac{\rho}{2}\|Ku-b\|_2^2 - \sum_{i=1}^{n^2}\langle\lambda_i, D_i u - w_i\rangle + \sum_{i=1}^{n^2}\tfrac{t}{2}\|D_i u - w_i\|_2^2 - \langle\mu, \Psi u - v\rangle + \tfrac{t}{2}\|\Psi u - v\|_2^2

ADMM:

    u^{k+1} = \mathrm{argmin}_u\ L_t(u, w^k, v^k; \lambda^k, \mu^k)
    (w^{k+1}, v^{k+1}) = \mathrm{argmin}_{w,v}\ L_t(u^{k+1}, w, v; \lambda^k, \mu^k)
    \lambda_i^{k+1} = \lambda_i^k - t(D_i u^{k+1} - w_i^{k+1}),\ i=1,\dots,n^2
    \mu^{k+1} = \mu^k - t(\Psi u^{k+1} - v^{k+1})

Note that the subproblem for (w, v) is separable in w and v.

Semidefinite Programming

The standard SDP:

    \min_{X\in\mathcal{S}^n}\ \langle C, X\rangle
    \text{s.t.}\ \langle A^{(i)}, X\rangle = b_i,\ i=1,\dots,m;\quad X \succeq 0,

where C, A^{(i)} ∈ S^n, i = 1, ..., m. The dual problem:

    \min_{y\in\mathbb{R}^m,\, S\in\mathcal{S}^n}\ -b^T y
    \text{s.t.}\ \mathcal{A}^*(y) + S = C,\quad S \succeq 0

Augmented Lagrangian function (X is the Lagrange multiplier):

    L_t(y,S;X) = -b^T y + I_{\{S\succeq 0\}}(S) - \langle X, \mathcal{A}^*(y)+S-C\rangle + \tfrac{t}{2}\|\mathcal{A}^*(y)+S-C\|_F^2

ADMM:

    y^{k+1} = \mathrm{argmin}_y\ L_t(y, S^k; X^k)
    S^{k+1} = \mathrm{argmin}_S\ L_t(y^{k+1}, S; X^k)
    X^{k+1} = X^k - t(\mathcal{A}^*(y^{k+1}) + S^{k+1} - C)

Sparse covariance matrix estimation

    \min_{X\in\mathcal{S}^n}\ \tfrac12\|X-\Sigma\|_F^2 + \rho\|X\|_1 \quad \text{s.t.}\quad X \succeq 0,

where Σ is the sample covariance matrix, which may be neither sparse nor positive semidefinite. Reformulation (by variable splitting):

    \min_{X,Y\in\mathcal{S}^n}\ \tfrac12\|X-\Sigma\|_F^2 + \rho\|X\|_1 \quad \text{s.t.}\quad X - Y = 0,\ Y \succeq 0.

Augmented Lagrangian function:

    L_t(X,Y;\Lambda) = \tfrac12\|X-\Sigma\|_F^2 + \rho\|X\|_1 + I_{\{Y\succeq 0\}}(Y) - \langle\Lambda, X-Y\rangle + \tfrac{t}{2}\|X-Y\|_F^2

ADMM:

    X^{k+1} = \mathrm{argmin}_X\ L_t(X, Y^k; \Lambda^k)
    Y^{k+1} = \mathrm{argmin}_Y\ L_t(X^{k+1}, Y; \Lambda^k)
    \Lambda^{k+1} = \Lambda^k - t(X^{k+1} - Y^{k+1})
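Both subproblems here are explicit: the X-update is an entrywise soft-thresholding of a weighted average of Σ and Y^k, and the Y-update is a projection onto the positive semidefinite cone. A minimal NumPy sketch, using the same sign convention as the other slides (multiplier term subtracted); the names are mine:

```python
import numpy as np

def soft_threshold(Z, tau):
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def project_psd(Z):
    # projection onto the PSD cone: zero out negative eigenvalues
    d, Q = np.linalg.eigh((Z + Z.T) / 2)
    return (Q * np.maximum(d, 0.0)) @ Q.T

def admm_sparse_cov(Sigma, rho=0.1, t=1.0, iters=200):
    """ADMM for min 0.5*||X - Sigma||_F^2 + rho*||X||_1  s.t.  X = Y, Y PSD."""
    n = Sigma.shape[0]
    X, Y, Lam = np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))
    for _ in range(iters):
        X = soft_threshold((Sigma + Lam + t * Y) / (1.0 + t), rho / (1.0 + t))  # X-subproblem
        Y = project_psd(X - Lam / t)                                            # Y-subproblem
        Lam = Lam - t * (X - Y)                                                 # multiplier update
    return X
```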

Nonconvex model: optimization on the sphere

    \min_x\ f(x) + \|x\|_1 \quad \text{s.t.}\quad \|x\|_2 = 1,

where f(x) is a differentiable function. Reformulation:

    \min_{x,y}\ f(x) + \|x\|_1 \quad \text{s.t.}\quad x - y = 0,\ \|y\|_2 = 1.

Augmented Lagrangian function:

    L_t(x,y;\lambda) = f(x) + \|x\|_1 + I_{\{\|y\|_2=1\}}(y) - \langle\lambda, x-y\rangle + \tfrac{t}{2}\|x-y\|_2^2

ADMM:

    x^{k+1} = \mathrm{argmin}_x\ L_t(x, y^k; \lambda^k)
    y^{k+1} = \mathrm{argmin}_y\ L_t(x^{k+1}, y; \lambda^k)
    \lambda^{k+1} = \lambda^k - t(x^{k+1} - y^{k+1})

Linearized ADMM

The standard form of the problem:

    \min_{x,y}\ f(x) + g(y) \quad \text{s.t.}\quad Ax + By = b

Augmented Lagrangian function:

    L_t(x,y;\lambda) = f(x) + g(y) - \langle\lambda, Ax+By-b\rangle + \tfrac{t}{2}\|Ax+By-b\|_2^2

ADMM:

    x^{k+1} = \mathrm{argmin}_x\ f(x) + \tfrac{t}{2}\|Ax + By^k - b - \lambda^k/t\|_2^2
    y^{k+1} = \mathrm{argmin}_y\ g(y) + \tfrac{t}{2}\|Ax^{k+1} + By - b - \lambda^k/t\|_2^2
    \lambda^{k+1} = \lambda^k - t(Ax^{k+1} + By^{k+1} - b)

The two subproblems are not easy if A and B are not identity matrices.

Use the proximal gradient method to solve them. For example, the x-subproblem has the form

    \min_x\ f(x) + h(x),

where h(x) = \tfrac{t}{2}\|Ax + By^k - b - \lambda^k/t\|_2^2 is smooth. PGM iterates

    x^{i+1} = \mathrm{argmin}_x\ f(x) + \tfrac{1}{2\tau}\|x - (x^i - \tau\nabla h(x^i))\|_2^2,

where τ < 1/L and L is the Lipschitz constant of ∇h. But, in the end, this is just a subproblem; we do not want to solve it to very high accuracy. In fact, one iteration of PGM is enough. This leads to the linearized ADMM.

Linearized ADMM:

    x^{k+1} = \mathrm{argmin}_x\ f(x) + \tfrac{1}{2\tau_1}\|x - (x^k - \tau_1 t A^T(Ax^k + By^k - b - \lambda^k/t))\|_2^2
    y^{k+1} = \mathrm{argmin}_y\ g(y) + \tfrac{1}{2\tau_2}\|y - (y^k - \tau_2 t B^T(Ax^{k+1} + By^k - b - \lambda^k/t))\|_2^2
    \lambda^{k+1} = \lambda^k - t(Ax^{k+1} + By^{k+1} - b)

where \tau_1 < 1/\lambda_{\max}(A^TA) and \tau_2 < 1/\lambda_{\max}(B^TB). Now the two subproblems are easy: they are the proximal mappings of f and g.
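A sketch of linearized ADMM with the two proximal mappings supplied as callables. To stay conservative numerically, the step sizes below also divide by the penalty parameter t, which is a more cautious choice than the bound stated above; the interface and names are mine.

```python
import numpy as np

def linearized_admm(prox_f, prox_g, A, B, b, x0, y0, lam0, t=1.0, iters=300):
    """prox_f(v, s) must return argmin_x f(x) + (1/(2s))*||x - v||^2; similarly prox_g."""
    x, y, lam = x0, y0, lam0
    tau1 = 0.9 / (t * np.linalg.norm(A, 2) ** 2)   # conservative step sizes
    tau2 = 0.9 / (t * np.linalg.norm(B, 2) ** 2)
    for _ in range(iters):
        r = A @ x + B @ y - b - lam / t
        x = prox_f(x - tau1 * t * (A.T @ r), tau1)   # linearized x-update
        r = A @ x + B @ y - b - lam / t
        y = prox_g(y - tau2 * t * (B.T @ r), tau2)   # linearized y-update
        lam = lam - t * (A @ x + B @ y - b)          # multiplier update
    return x, y, lam
```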

Global Convergence of ADMM

The problem:

    \min_{x,y}\ f(x) + g(y) \quad \text{s.t.}\quad Ax + By = b

Lagrangian function:

    L(x,y;\lambda) = f(x) + g(y) - \langle\lambda, Ax+By-b\rangle

Optimality conditions: (x^*, y^*; \lambda^*) is optimal if

    A^T\lambda^* \in \partial f(x^*),\quad B^T\lambda^* \in \partial g(y^*),\quad Ax^* + By^* = b.

ADMM:

    x^{k+1} = \mathrm{argmin}_x\ f(x) + \tfrac{t}{2}\|Ax + By^k - b - \lambda^k/t\|_2^2
    y^{k+1} = \mathrm{argmin}_y\ g(y) + \tfrac{t}{2}\|Ax^{k+1} + By - b - \lambda^k/t\|_2^2
    \lambda^{k+1} = \lambda^k - t(Ax^{k+1} + By^{k+1} - b)

Theorem: If A and B have full column rank, ADMM converges globally to an optimal solution (x^*, y^*; \lambda^*) for any t > 0 and any initial point (y^0, \lambda^0).

Proof. The optimality conditions for the two subproblems are

    0 \in \partial f(x^{k+1}) + tA^T(Ax^{k+1} + By^k - b - \lambda^k/t)
    0 \in \partial g(y^{k+1}) + tB^T(Ax^{k+1} + By^{k+1} - b - \lambda^k/t)

Using the updating formula for \lambda^{k+1}, we have

    A^T(\lambda^{k+1} - tB(y^k - y^{k+1})) \in \partial f(x^{k+1})    (1)
    B^T\lambda^{k+1} \in \partial g(y^{k+1})                          (2)

Because \partial f(\cdot) and \partial g(\cdot) are monotone operators, we have

    \langle x^{k+1} - x^*,\ A^T(\lambda^{k+1} - \lambda^* - tB(y^k - y^{k+1}))\rangle \ge 0
    \langle y^{k+1} - y^*,\ B^T(\lambda^{k+1} - \lambda^*)\rangle \ge 0

Summing these two inequalities, we have

    (x^{k+1}-x^*)^T A^T(\lambda^{k+1}-\lambda^*) - t(x^{k+1}-x^*)^T A^T B(y^k-y^{k+1}) + (y^{k+1}-y^*)^T B^T(\lambda^{k+1}-\lambda^*) \ge 0,

which is equivalent to

    (\lambda^{k+1}-\lambda^*)^T(Ax^{k+1}+By^{k+1}-b) - t(x^{k+1}-x^*)^T A^T B(y^k-y^{k+1}) \ge 0.    (*)

Note that by

    Ax^{k+1} + By^{k+1} - b = (\lambda^k - \lambda^{k+1})/t,\quad Ax^* + By^* - b = 0,

we get

    A(x^{k+1} - x^*) = -B(y^{k+1} - y^*) + (\lambda^k - \lambda^{k+1})/t.

Substituting this into (*), we get

    \tfrac{1}{t}(\lambda^{k+1}-\lambda^*)^T(\lambda^k-\lambda^{k+1}) + t(By^{k+1}-By^*)^T(By^k-By^{k+1}) \ge (\lambda^k-\lambda^{k+1})^T(By^k-By^{k+1}).

Define

    u = \begin{pmatrix} y \\ \lambda \end{pmatrix},\qquad H = \begin{pmatrix} tB^TB & 0 \\ 0 & \tfrac{1}{t}I \end{pmatrix};

we get

    \langle u^{k+1} - u^*,\ u^k - u^{k+1}\rangle_H \ge \langle \lambda^k - \lambda^{k+1},\ By^k - By^{k+1}\rangle.

Because

    B^T\lambda^{k+1} \in \partial g(y^{k+1}),\quad B^T\lambda^k \in \partial g(y^k),

we have

    \langle y^k - y^{k+1},\ B^T\lambda^k - B^T\lambda^{k+1}\rangle \ge 0.

Thus

    \langle u^{k+1} - u^*,\ u^k - u^{k+1}\rangle_H \ge 0.

Because

    \|u^{k+1}-u^*\|_H^2 = \|u^{k+1}-u^k\|_H^2 - 2\langle u^k - u^{k+1},\ u^k - u^*\rangle_H + \|u^k - u^*\|_H^2,

we have

    \|u^k-u^*\|_H^2 - \|u^{k+1}-u^*\|_H^2 = 2\langle u^k-u^{k+1},\ u^k-u^*\rangle_H - \|u^{k+1}-u^k\|_H^2
      = 2\langle u^k-u^{k+1},\ (u^k-u^{k+1}) + (u^{k+1}-u^*)\rangle_H - \|u^{k+1}-u^k\|_H^2
      = \|u^{k+1}-u^k\|_H^2 + 2\langle u^k-u^{k+1},\ u^{k+1}-u^*\rangle_H
      \ge \|u^{k+1}-u^k\|_H^2.    (**)

From (**) we have the following conclusions:
(i) \|u^k - u^{k+1}\|_H \to 0;
(ii) {u^k} lies in a compact region;
(iii) \|u^k - u^*\|_H^2 is monotonically non-increasing and thus converges.

From (i) we have By^k - By^{k+1} \to 0 and \lambda^k - \lambda^{k+1} \to 0. Then Ax^k + By^k - b \to 0 and Ax^k - Ax^{k+1} \to 0. Since A and B have full column rank, we have x^k - x^{k+1} \to 0 and y^k - y^{k+1} \to 0.

From (ii) we know {u^k} has a subsequence {u^{k_j}} that converges to û = (ŷ, λ̂). Therefore x^{k_j} \to x̂. So (x̂, ŷ, λ̂) is a limit point of {(x^k, y^k, λ^k)} and Ax̂ + Bŷ - b = 0. From (1) and (2) we know that

    0 \in \partial f(\hat{x}) - A^T\hat{\lambda},\qquad 0 \in \partial g(\hat{y}) - B^T\hat{\lambda};

thus (x̂, ŷ, λ̂) satisfies the KKT conditions and hence is an optimal solution. Therefore we have shown that any limit point of {(x^k, y^k, λ^k)} is an optimal solution.

To complete the proof, it remains to show that {(x^k, y^k, λ^k)} has a unique limit point. Let (x̂_1, ŷ_1, λ̂_1) and (x̂_2, ŷ_2, λ̂_2) be any two limit points of {(x^k, y^k, λ^k)}. As shown above, both are optimal solutions. Thus u^* in (**) can be replaced by û_1 := (ŷ_1, λ̂_1) and û_2 := (ŷ_2, λ̂_2). This results in

    \|u^{k+1} - \hat{u}_i\|_H^2 \le \|u^k - \hat{u}_i\|_H^2,\quad i = 1, 2,

and we thus obtain the existence of the limits

    \lim_{k\to\infty} \|u^k - \hat{u}_i\|_H = \eta_i < +\infty,\quad i = 1, 2.

Now using the identity

    \|u^k - \hat{u}_1\|_H^2 - \|u^k - \hat{u}_2\|_H^2 = -2\langle u^k,\ \hat{u}_1 - \hat{u}_2\rangle_H + \|\hat{u}_1\|_H^2 - \|\hat{u}_2\|_H^2

and passing to the limit along the two subsequences, we get

    \eta_1^2 - \eta_2^2 = -2\langle\hat{u}_1,\ \hat{u}_1-\hat{u}_2\rangle_H + \|\hat{u}_1\|_H^2 - \|\hat{u}_2\|_H^2 = -\|\hat{u}_1-\hat{u}_2\|_H^2

and

    \eta_1^2 - \eta_2^2 = -2\langle\hat{u}_2,\ \hat{u}_1-\hat{u}_2\rangle_H + \|\hat{u}_1\|_H^2 - \|\hat{u}_2\|_H^2 = \|\hat{u}_1-\hat{u}_2\|_H^2.

Thus we must have \|\hat{u}_1 - \hat{u}_2\|_H^2 = 0, and hence the limit point of {(x^k, y^k, λ^k)} is unique.

Convergence of Linearized ADMM

Linearized ADMM:

    x^{k+1} = \mathrm{argmin}_x\ f(x) + \tfrac{1}{2\tau_1}\|x - (x^k - \tau_1 t A^T(Ax^k + By^k - b - \lambda^k/t))\|_2^2
    y^{k+1} = \mathrm{argmin}_y\ g(y) + \tfrac{1}{2\tau_2}\|y - (y^k - \tau_2 t B^T(Ax^{k+1} + By^k - b - \lambda^k/t))\|_2^2
    \lambda^{k+1} = \lambda^k - t(Ax^{k+1} + By^{k+1} - b)

Theorem: If \tau_1 < 1/\lambda_{\max}(A^TA) and \tau_2 < 1/\lambda_{\max}(B^TB), linearized ADMM converges globally to an optimal solution (x^*, y^*; \lambda^*) for any t > 0 and any initial point (y^0, \lambda^0).

Proof. See the posted paper for the proof.

Extensions: Multi-block ADMM

What if the objective and variables have three parts (blocks)?

    \min\ f_1(x_1) + f_2(x_2) + f_3(x_3) \quad \text{s.t.}\quad A_1x_1 + A_2x_2 + A_3x_3 = b

Augmented Lagrangian function:

    L_t(x_1,x_2,x_3;\lambda) = f_1(x_1)+f_2(x_2)+f_3(x_3) - \langle\lambda,\ A_1x_1+A_2x_2+A_3x_3-b\rangle + \tfrac{t}{2}\|A_1x_1+A_2x_2+A_3x_3-b\|_2^2

Multi-block ADMM:

    x_1^{k+1} = \mathrm{argmin}_{x_1}\ L_t(x_1, x_2^k, x_3^k; \lambda^k)
    x_2^{k+1} = \mathrm{argmin}_{x_2}\ L_t(x_1^{k+1}, x_2, x_3^k; \lambda^k)
    x_3^{k+1} = \mathrm{argmin}_{x_3}\ L_t(x_1^{k+1}, x_2^{k+1}, x_3; \lambda^k)
    \lambda^{k+1} = \lambda^k - t(A_1x_1^{k+1} + A_2x_2^{k+1} + A_3x_3^{k+1} - b)

Applications

RPCA with noise:

    \min\ \|X\|_* + \rho\|Y\|_1 \quad \text{s.t.}\quad X + Y + Z = M,\ \|Z\|_F \le \sigma

Latent variable graphical model selection (see Lecture 1):

    \min_{R,S,L}\ \langle R, \hat\Sigma_X\rangle - \log\det(R) + \alpha\|S\|_1 + \beta\,\mathrm{tr}(L)
    \text{s.t.}\ R = S - L,\ R \succ 0,\ L \succeq 0.

Convergence

Without further conditions, multi-block ADMM is not necessarily convergent. Counter-example by Chen, He, Ye and Yuan (2013):

    \min\ 0 \quad \text{s.t.}\quad A_1x_1 + A_2x_2 + A_3x_3 = 0,

where

    (A_1, A_2, A_3) = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 1 & 2 & 2 \end{pmatrix}.

With t = 1, each multi-block ADMM iteration is a linear map taking (x_1^k, x_2^k, x_3^k, \lambda^k) to (x_1^{k+1}, x_2^{k+1}, x_3^{k+1}, \lambda^{k+1}). Eliminating x_1, the iteration can be written equivalently as

    (x_2^{k+1};\ x_3^{k+1};\ \lambda^{k+1}) = M\,(x_2^k;\ x_3^k;\ \lambda^k)

for a fixed iteration matrix M (the explicit matrix is worked out in Chen, He, Ye and Yuan (2013)). Note that \rho(M) > 1, so the iteration diverges from generic starting points.

Theorem (Chen-He-Ye-Yuan, 2013): There exists an example for which the direct extension of ADMM to three blocks, started from some real initial point, fails to converge for any choice of t > 0.

Sufficient conditions for convergence of multi-block ADMM

This is a trendy topic for ADMM and is still under development:
- Han and Yuan (2012): global convergence if f_1, ..., f_N are all strongly convex and t is restricted to be small.
- Lin, Ma and Zhang (2014): sublinear convergence rate if f_2, ..., f_N are strongly convex and t is restricted to be small.
- Lin, Ma and Zhang (2014): globally linear convergence rate if f_2, ..., f_N are strongly convex, ∇f_N is Lipschitz continuous, A_N has full row rank, and t is restricted to be small.
- Cai, Han and Yuan (2014): sublinear convergence rate for N = 3 if f_3 is strongly convex and t is restricted to be small.

- Li, Sun and Toh (2014): global convergence with proximal terms for N = 3 if f_3 is strongly convex.

A lot of follow-up work is going on...

Variants: transform multi-block to two-block

What if one does not want to impose additional conditions? There are many variants of multi-block ADMM with guaranteed convergence, but they usually perform worse than the original multi-block ADMM, although the latter is not theoretically guaranteed.

One variant is the following (Wang, Hong, Ma and Luo (2013)): first transform the original problem

    \min\ f_1(x_1) + f_2(x_2) + \dots + f_N(x_N) \quad \text{s.t.}\quad A_1x_1 + A_2x_2 + \dots + A_Nx_N = b

into

    \min\ f_1(x_1) + f_2(x_2) + \dots + f_N(x_N)
    \text{s.t.}\ A_ix_i - b/N = y_i,\ i = 1,\dots,N;\quad y_1 + y_2 + \dots + y_N = 0.

Then apply two-block ADMM to the transformed problem, with augmented Lagrangian function

    L_t(x, y; \lambda) = \sum_{i=1}^N f_i(x_i) - \sum_{i=1}^N\langle\lambda_i,\ A_ix_i - b/N - y_i\rangle + \sum_{i=1}^N\tfrac{t}{2}\|A_ix_i - b/N - y_i\|^2.

- The x-subproblems are separable.
- The y-subproblem is an easy projection:

    \min\ \tfrac12\|y - z\|^2 \quad \text{s.t.}\quad y_1 + \dots + y_N = 0.

Its Lagrangian function is

    L(y, \mu) = \sum_{i=1}^N \tfrac12\|y_i - z_i\|^2 - \langle\mu,\ y_1 + \dots + y_N\rangle,

and the KKT conditions are

    y_i - z_i - \mu = 0,\quad y_1 + \dots + y_N = 0.

So we get

    \mu = -\tfrac{1}{N}\sum_{i=1}^N z_i \quad\text{and}\quad y_i = z_i - \tfrac{1}{N}\sum_{i=1}^N z_i.

This variant is theoretically guaranteed to converge.
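In code, the y-subproblem of this variant is just mean subtraction; a one-line NumPy sketch (the name is mine):

```python
import numpy as np

def project_zero_sum(Z):
    """Project rows z_1, ..., z_N of Z onto {y : y_1 + ... + y_N = 0} (subtract the mean)."""
    return Z - Z.mean(axis=0, keepdims=True)
```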

Gradient-based ADMM

    \min\ f(x) + g(y) \quad \text{s.t.}\quad Ax + By = b

What if f has an easy proximal mapping but g does not? Assume g is smooth.

ADMM:

    x^{k+1} = \mathrm{argmin}_x\ L_t(x, y^k; \lambda^k)
    y^{k+1} = \mathrm{argmin}_y\ L_t(x^{k+1}, y; \lambda^k)
    \lambda^{k+1} = \lambda^k - t(Ax^{k+1} + By^{k+1} - b)

Gradient-based ADMM: take a gradient step for the y-subproblem:

    x^{k+1} = \mathrm{argmin}_x\ L_t(x, y^k; \lambda^k)
    y^{k+1} = y^k - t\,\nabla_y L_t(x^{k+1}, y^k; \lambda^k)
    \lambda^{k+1} = \lambda^k - t(Ax^{k+1} + By^{k+1} - b)

Sparse logistic regression:

    \min\ \|x\|_1 + l(x, c),

where

    l(x, c) = \tfrac{1}{m}\sum_{i=1}^m \log(1 + \exp(-b_i(x^Ta_i + c))).

Reformulation of sparse logistic regression:

    \min\ \|x\|_1 + l(y, c) \quad \text{s.t.}\quad x - y = 0;

take a gradient step for the (y, c)-subproblem.

Fused logistic regression:

    \min\ \|x\|_1 + \sum_{i=2}^n |x_i - x_{i-1}| + l(x, c)

Reformulation of fused logistic regression (B is the matrix of first-order differences, so (By)_i = y_i - y_{i-1}):

    \min\ \|x\|_1 + \|w\|_1 + l(y, c) \quad \text{s.t.}\quad w = By,\ x = y.

Augmented Lagrangian function:

    L_t(x,w,y,c;\lambda_1,\lambda_2) = \|x\|_1 + \|w\|_1 + l(y,c) - \langle\lambda_1, w - By\rangle + \tfrac{t}{2}\|w - By\|^2 - \langle\lambda_2, x - y\rangle + \tfrac{t}{2}\|x - y\|^2

Exactly solve the (x, w)-subproblem; take a gradient step for the (y, c)-subproblem.

One more example on a nonconvex problem

Semidefinite programming:

    \min_{X\in\mathcal{S}^n}\ \langle C, X\rangle
    \text{s.t.}\ \langle A^{(i)}, X\rangle = b_i,\ i = 1,\dots,m;\quad X \succeq 0

Any positive semidefinite matrix X can be written as X = VV^T, where V ∈ R^{n×n}. Reformulation of the SDP:

    \min_{V\in\mathbb{R}^{n\times n}}\ \langle C, VV^T\rangle \quad \text{s.t.}\quad \langle A^{(i)}, VV^T\rangle = b_i,\ i = 1,\dots,m.

This is a nonconvex equality-constrained problem: one can use the augmented Lagrangian method as long as there is a good way to minimize the augmented Lagrangian function

    L_t(V; \lambda) = \langle C, VV^T\rangle - \langle\lambda,\ \mathcal{A}(VV^T) - b\rangle + \tfrac{t}{2}\|\mathcal{A}(VV^T) - b\|^2.

The augmented Lagrangian method:

    V^{k+1} := \mathrm{argmin}_V\ L_t(V, \lambda^k)
    \lambda^{k+1} := \lambda^k - t(\mathcal{A}(V^{k+1}(V^{k+1})^T) - b)

Two-block reformulation (X = UV^T and U = V):

    \min_{U,V\in\mathbb{R}^{n\times n}}\ \langle C, UV^T\rangle
    \text{s.t.}\ \langle A^{(i)}, UV^T\rangle = b_i,\ i = 1,\dots,m;\quad U - V = 0

Augmented Lagrangian function:

    L_t(U,V;\lambda,\Lambda) = \langle C, UV^T\rangle - \langle\lambda,\ \mathcal{A}(UV^T) - b\rangle + \tfrac{t}{2}\|\mathcal{A}(UV^T) - b\|^2 - \langle\Lambda, U - V\rangle + \tfrac{t}{2}\|U - V\|_F^2

ADMM:

    U^{k+1} := \mathrm{argmin}_U\ L_t(U, V^k; \lambda^k, \Lambda^k)
    V^{k+1} := \mathrm{argmin}_V\ L_t(U^{k+1}, V; \lambda^k, \Lambda^k)
    \lambda^{k+1} := \lambda^k - t(\mathcal{A}(U^{k+1}(V^{k+1})^T) - b)
    \Lambda^{k+1} := \Lambda^k - t(U^{k+1} - V^{k+1})

Lots of recent developments of ADMM:
- sufficient conditions for multi-block ADMM for convex problems
- convergence analysis of ADMM for nonconvex problems
- stochastic ADMM
- online ADMM
- ...

Relation with operator-splitting methods

Operator-splitting methods solve the inclusion problem for monotone operators:

    find u such that 0 \in S(u) + T(u),

where S, T : R^n → R^n are maximal monotone operators.
- T is a monotone operator if (u - v)^T(T(u) - T(v)) \ge 0 for all u, v.
- T is called maximal monotone if there is no monotone operator that properly contains it.

Douglas-Rachford operator splitting method

    find u such that 0 \in S(u) + T(u)

Douglas-Rachford operator splitting method:

    v^{k+1} = J_S^\tau(2J_T^\tau - I)v^k + (I - J_T^\tau)v^k
    u^{k+1} = J_T^\tau v^{k+1}

J_T^\tau = (I + \tau T)^{-1} is called the resolvent of the operator T.

Example:

    \min\ f(x) + g(x).

The optimality condition is

    0 \in \partial f(x) + \partial g(x),

so S = \partial f and T = \partial g.

Now

    y = J_S^\tau(x) = (I + \tau S)^{-1}(x) = (I + \tau\partial f)^{-1}(x)

means that

    x \in y + \tau\partial f(y).

This is the optimality condition of

    \min_y\ \tau f(y) + \tfrac12\|y - x\|_2^2,

i.e., the proximal mapping of f.
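A compact sketch of the Douglas-Rachford iteration above, with the two resolvents passed in as proximal-mapping callables (the interface and names are mine):

```python
def douglas_rachford(prox_f, prox_g, v0, tau=1.0, iters=300):
    """Douglas-Rachford splitting for min f(x) + g(x), with S = df and T = dg.

    prox_f(x, tau) must return argmin_y tau*f(y) + 0.5*||y - x||^2 (the resolvent J_S^tau);
    prox_g plays the role of J_T^tau.
    """
    v = v0
    for _ in range(iters):
        w = prox_g(v, tau)                     # J_T^tau v^k
        v = prox_f(2 * w - v, tau) + (v - w)   # v^{k+1} = J_S^tau(2 J_T^tau - I)v^k + (I - J_T^tau)v^k
    return prox_g(v, tau)                      # u = J_T^tau v
```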

Separable convex minimization

Primal problem:

    \min\ f(x) + g(y) \quad \text{s.t.}\quad Ax + By = b

Dual problem:

    \min_\lambda\ f^*(A^T\lambda) + g^*(B^T\lambda) - b^T\lambda

Optimality condition of the dual problem:

    find \lambda such that 0 \in A\,\partial f^*(A^T\lambda) + B\,\partial g^*(B^T\lambda) - b.

Define S(\cdot) = A\,\partial f^*(A^T\cdot) and T(\cdot) = B\,\partial g^*(B^T\cdot) - b. Applying the Douglas-Rachford splitting method to

    find \lambda such that 0 \in S(\lambda) + T(\lambda)

is equivalent to applying ADMM to the primal problem.

Peaceman-Rachford operator splitting method

    find u such that 0 \in S(u) + T(u)

Peaceman-Rachford operator splitting method:

    v^{k+1} = (2J_S^\tau - I)(2J_T^\tau - I)v^k
    u^{k+1} = J_T^\tau v^{k+1}

If applied to the dual problem of

    \min\ f(x) + g(y) \quad \text{s.t.}\quad Ax + By = b,

it is equivalent to the following algorithm (symmetric ADMM):

    x^{k+1} = \mathrm{argmin}_x\ L_t(x, y^k; \lambda^k)
    \lambda^{k+1/2} = \lambda^k - t(Ax^{k+1} + By^k - b)
    y^{k+1} = \mathrm{argmin}_y\ L_t(x^{k+1}, y; \lambda^{k+1/2})
    \lambda^{k+1} = \lambda^{k+1/2} - t(Ax^{k+1} + By^{k+1} - b)

Lots of recent developments

Operator splitting methods have broader applications. Recent research questions:
- three (or more) operators?
- other operator-splitting schemes?
- sufficient conditions?
- convergence rates?
- ...
