Linearized Alternating Direction Method: Two Blocks and Multiple Blocks. Zhouchen Lin 林宙辰 北京大学
1 Linearized Alternating Direction Method: Two Blocks and Multiple Blocks. Zhouchen Lin 林宙辰 北京大学. Dec. 3, 2014
2 Outline: Alternating Direction Method (ADM); Linearized Alternating Direction Method (LADM), Two Blocks; LADM, Multiple Blocks; Proximal LADM, Multiple Blocks; Conclusions. References: Lin et al., Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation, NIPS 2011; Lin et al., Linearized Alternating Direction Method with Parallel Splitting and Adaptive Penalty for Separable Convex Programs in Machine Learning, Machine Learning, 2015.
3 Background. Optimization is everywhere.
Compressed Sensing: $\min_x \|x\|_1$, s.t. $Ax = b$.
RPCA w/ Missing Values: $\min_{A,E} \|A\|_* + \lambda\|E\|_1$, s.t. $\pi_\Omega(A + E) = d$.
LASSO: $\min_x \|Ax - b\|^2$, s.t. $\|x\|_1 \le \varepsilon$.
Image Restoration: $\min_x \|Ax - b\|^2 + \lambda\|\nabla x\|_1$, s.t. $0 \le x \le 255$.
Covariance Selection: $\min_X \mathrm{tr}(\Sigma X) - \log\det(X) + \rho\, e^T|X|e$, s.t. $X \in S_n$, where $S_n = \{X \succeq 0 \mid \lambda_{\min} I \preceq X \preceq \lambda_{\max} I\}$.
Pose Estimation: $\min_Q \mathrm{tr}(WQ)$, s.t. $\mathrm{tr}(A_i Q) = 0$, $i = 1,\dots,m$; $Q \succeq 0$; $\mathrm{rank}(Q) \le 1$.
4 Alternating Direction Method (ADM). Model problem: $\min_{x_1,x_2} f_1(x_1) + f_2(x_2)$, s.t. $A_1(x_1) + A_2(x_2) = b$, where the $f_i$ are convex functions and the $A_i$ are linear mappings. Augmented Lagrangian function: $\tilde{L}(x_1, x_2, \lambda) = f_1(x_1) + f_2(x_2) + \langle \lambda, A_1(x_1) + A_2(x_2) - b \rangle + \frac{\beta}{2}\|A_1(x_1) + A_2(x_2) - b\|_F^2$. Updates: $x_1^{k+1} = \arg\min_{x_1} \tilde{L}(x_1, x_2^k, \lambda^k)$; $x_2^{k+1} = \arg\min_{x_2} \tilde{L}(x_1^{k+1}, x_2, \lambda^k)$; $\lambda^{k+1} = \lambda^k + \beta_k[A_1(x_1^{k+1}) + A_2(x_2^{k+1}) - b]$. Assumption: the subproblems are easy to solve.
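The ADM iteration above can be sketched in NumPy on a toy instance; this is an illustrative sketch (not from the slides), taking $f_1 = \|\cdot\|_1$, $f_2 = \frac{1}{2}\|\cdot\|^2$ and $A_1 = A_2 = I$, so that both subproblems have closed forms:

```python
import numpy as np

def soft(v, eps):
    # proximal operator of eps * ||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - eps, 0.0)

def adm_toy(b, beta=1.0, iters=200):
    # min ||x1||_1 + 0.5 ||x2||^2  s.t.  x1 + x2 = b   (A1 = A2 = I)
    x1 = np.zeros_like(b); x2 = np.zeros_like(b); lam = np.zeros_like(b)
    for _ in range(iters):
        # x1-step: argmin ||x1||_1 + beta/2 ||x1 + x2 - b + lam/beta||^2
        x1 = soft(b - x2 - lam / beta, 1.0 / beta)
        # x2-step: argmin 0.5 ||x2||^2 + beta/2 ||x1 + x2 - b + lam/beta||^2
        x2 = beta * (b - x1 - lam / beta) / (1.0 + beta)
        # multiplier update
        lam = lam + beta * (x1 + x2 - b)
    return x1, x2, lam
```

For this separable toy problem the limit of $x_1$ is the soft-thresholding of $b$, which gives a quick correctness check.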
5 Linearized Alternating Direction Method (LADM). $x_1^{k+1} = \arg\min_{x_1} f_1(x_1) + \frac{\beta_k}{2}\|A_1(x_1) + A_2(x_2^k) - b + \lambda^k/\beta_k\|^2$; $x_2^{k+1} = \arg\min_{x_2} f_2(x_2) + \frac{\beta_k}{2}\|A_1(x_1^{k+1}) + A_2(x_2) - b + \lambda^k/\beta_k\|^2$. These subproblems are easy when they reduce to proximal operations $\min_x f_i(x) + \frac{\sigma}{2}\|x - w\|_F^2$. Examples: $\arg\min_x \varepsilon\|x\|_1 + \frac{1}{2}\|x - w\|^2 = T_\varepsilon(w)$, where $T_\varepsilon(x) = \mathrm{sgn}(x)\max(|x| - \varepsilon, 0)$ is soft-thresholding; $\arg\min_X \varepsilon\|X\|_* + \frac{1}{2}\|X - W\|_F^2 = \Theta_\varepsilon(W) = U T_\varepsilon(S) V^T$, where $W = USV^T$ is the singular value decomposition (SVD) of $W$.
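The two proximal operations on this slide can be sketched directly in NumPy; `soft_threshold` and `svt` below are illustrative names for $T_\varepsilon$ and $\Theta_\varepsilon$:

```python
import numpy as np

def soft_threshold(x, eps):
    # T_eps(x) = sgn(x) * max(|x| - eps, 0): prox of eps * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

def svt(W, eps):
    # Singular value thresholding: prox of eps * ||.||_* at W,
    # obtained by soft-thresholding the singular values of W
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(soft_threshold(s, eps)) @ Vt
```

Note that `svt` shrinks every singular value by `eps` and truncates those that fall below zero, which is why it promotes low rank.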
6 Linearized Alternating Direction Method (LADM). A common workaround is to introduce auxiliary variables: $\min_{x_1,x_2,x_3,x_4} f_1(x_1) + f_2(x_2)$, s.t. $x_1 = x_3$, $x_2 = x_4$, $A_1(x_3) + A_2(x_4) = b$, with augmented Lagrangian $\tilde{L}(x_1,x_2,x_3,x_4,\lambda_1,\lambda_2,\lambda_3) = f_1(x_1) + f_2(x_2) + \langle\lambda_1, x_1 - x_3\rangle + \langle\lambda_2, x_2 - x_4\rangle + \langle\lambda_3, A_1(x_3) + A_2(x_4) - b\rangle + \frac{\beta}{2}\left(\|x_1 - x_3\|_F^2 + \|x_2 - x_4\|_F^2 + \|A_1(x_3) + A_2(x_4) - b\|_F^2\right)$. Three drawbacks: 1. More blocks: more memory & slower convergence. 2. Matrix inversion is expensive. 3. Convergence is NOT guaranteed!
7 Linearized Alternating Direction Method (LADM). Instead, linearize the quadratic term at the current iterate:
$x_1^{k+1} = \arg\min_{x_1} f_1(x_1) + \langle A_1^*(\lambda^k) + \beta_k A_1^*(A_1(x_1^k) + A_2(x_2^k) - b),\, x_1 - x_1^k\rangle + \frac{\beta_k\eta_1}{2}\|x_1 - x_1^k\|^2 = \arg\min_{x_1} f_1(x_1) + \frac{\beta_k\eta_1}{2}\left\|x_1 - x_1^k + A_1^*\left(\lambda^k + \beta_k(A_1(x_1^k) + A_2(x_2^k) - b)\right)/(\beta_k\eta_1)\right\|^2$;
$x_2^{k+1} = \arg\min_{x_2} f_2(x_2) + \frac{\beta_k\eta_2}{2}\left\|x_2 - x_2^k + A_2^*\left(\lambda^k + \beta_k(A_1(x_1^{k+1}) + A_2(x_2^k) - b)\right)/(\beta_k\eta_2)\right\|^2$.
Here $A^*$ is the adjoint operator: $\langle A(x), y\rangle = \langle x, A^*(y)\rangle$, $\forall x, y$.
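The linearized updates above can be sketched for a concrete two-block problem, $\min \|x_1\|_1 + \|x_2\|_1$ s.t. $A_1 x_1 + A_2 x_2 = b$, where each step is a soft-thresholding; this is an illustrative sketch with a fixed penalty $\beta$ and names of our own choosing:

```python
import numpy as np

def soft(v, eps):
    # prox of eps * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - eps, 0.0)

def ladm_l1(A1, A2, b, beta=1.0, iters=5000):
    # min ||x1||_1 + ||x2||_1  s.t.  A1 x1 + A2 x2 = b, by linearized ADM
    eta1 = 1.02 * np.linalg.norm(A1, 2) ** 2   # eta_i > ||A_i||^2
    eta2 = 1.02 * np.linalg.norm(A2, 2) ** 2
    x1 = np.zeros(A1.shape[1]); x2 = np.zeros(A2.shape[1])
    lam = np.zeros_like(b)
    for _ in range(iters):
        r = A1 @ x1 + A2 @ x2 - b
        # linearized x1-step: prox of ||.||_1 at a gradient point
        x1 = soft(x1 - A1.T @ (lam + beta * r) / (beta * eta1), 1.0 / (beta * eta1))
        r = A1 @ x1 + A2 @ x2 - b           # recompute with the new x1
        x2 = soft(x2 - A2.T @ (lam + beta * r) / (beta * eta2), 1.0 / (beta * eta2))
        lam = lam + beta * (A1 @ x1 + A2 @ x2 - b)
    return x1, x2
```

On a tiny instance such as $\min |x_1| + |x_2|$ s.t. $x_1 + x_2 = 2$, the iterates become feasible with nonnegative components, matching the KKT analysis.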
8 LADM with Adaptive Penalty (LADMAP). Theorem: If $\{\beta_k\}$ is non-decreasing and upper bounded and $\eta_i > \|A_i\|^2$, $i = 1, 2$, then the sequence $\{(x_1^k, x_2^k, \lambda^k)\}$ converges to a KKT point of the model problem.
9 LADM with Adaptive Penalty (LADMAP). The updates
$x_1^{k+1} = \arg\min_{x_1} f_1(x_1) + \frac{\beta_k\eta_1}{2}\left\|x_1 - x_1^k + A_1^*\left(\lambda^k + \beta_k(A_1(x_1^k) + A_2(x_2^k) - b)\right)/(\beta_k\eta_1)\right\|^2$,
$x_2^{k+1} = \arg\min_{x_2} f_2(x_2) + \frac{\beta_k\eta_2}{2}\left\|x_2 - x_2^k + A_2^*\left(\lambda^k + \beta_k(A_1(x_1^{k+1}) + A_2(x_2^k) - b)\right)/(\beta_k\eta_2)\right\|^2$
imply the optimality conditions
$-\beta_k\eta_1(x_1^{k+1} - x_1^k) - A_1^*\left(\lambda^k + \beta_k(A_1(x_1^k) + A_2(x_2^k) - b)\right) \in \partial f_1(x_1^{k+1})$,
$-\beta_k\eta_2(x_2^{k+1} - x_2^k) - A_2^*\left(\lambda^k + \beta_k(A_1(x_1^{k+1}) + A_2(x_2^k) - b)\right) \in \partial f_2(x_2^{k+1})$.
KKT condition: $\exists\,(x_1^*, x_2^*, \lambda^*)$ such that $A_1(x_1^*) + A_2(x_2^*) - b = 0$, $-A_1^*(\lambda^*) \in \partial f_1(x_1^*)$, $-A_2^*(\lambda^*) \in \partial f_2(x_2^*)$.
10 LADM with Adaptive Penalty (LADMAP). Both $\beta_k\eta_1\|x_1^{k+1} - x_1^k\|/\|A_1^*(b)\|$ and $\beta_k\eta_2\|x_2^{k+1} - x_2^k\|/\|A_2^*(b)\|$ should be small. Since $\eta_i \geq \|A_i\|^2$, approximate $\|A_i^*(b)\|$ by $\sqrt{\eta_i}\|b\|$. Adaptive penalty: $\beta_{k+1} = \min(\beta_{\max}, \rho\beta_k)$, where $\rho = \rho_0$ if $\beta_k\max\left(\sqrt{\eta_1}\|x_1^{k+1} - x_1^k\|_F, \sqrt{\eta_2}\|x_2^{k+1} - x_2^k\|_F\right)/\|b\|_F < \varepsilon_2$ and $\rho = 1$ otherwise, with $\rho_0 \geq 1$ a constant. Loop until $\|A_1(x_1^{k+1}) + A_2(x_2^{k+1}) - b\|_F/\|b\|_F < \varepsilon_1$ and $\beta_k\max\left(\sqrt{\eta_1}\|x_1^{k+1} - x_1^k\|_F, \sqrt{\eta_2}\|x_2^{k+1} - x_2^k\|_F\right)/\|b\|_F < \varepsilon_2$.
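The adaptive rule above amounts to a one-line helper: the penalty grows by a factor $\rho_0$ only when the (scaled) change in the variables has fallen below $\varepsilon_2$, and is capped at $\beta_{\max}$. A minimal sketch, with argument names of our own choosing:

```python
def update_penalty(beta, change, eps2, rho0=1.9, beta_max=1e10):
    # beta_{k+1} = min(beta_max, rho * beta_k), where rho = rho0 if the
    # scaled variable change fell below eps2 (iterates have nearly stalled,
    # so the penalty may grow), and rho = 1 otherwise
    rho = rho0 if change < eps2 else 1.0
    return min(beta_max, rho * beta)
```

Here `change` stands for the quantity $\beta_k\max_i\sqrt{\eta_i}\|x_i^{k+1}-x_i^k\|/\|b\|$ computed by the caller.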
11 LADM with Adaptive Penalty (LADMAP). Choice of parameters: 1. $\beta_0 = \alpha\varepsilon_2$, where $\alpha$ is proportional to the size of $b$; $\beta_0$ should not be too large, so that $\beta_k$ increases in the first few iterations. 2. $\rho_0 \geq 1$ should be chosen such that $\beta_k$ increases steadily (but not necessarily every iteration).
12 LADM with Adaptive Penalty (LADMAP). An example (LRR): $\min_{Z,E} \|Z\|_* + \mu\|E\|_1$, s.t. $X = XZ + E$. Here $A_1(Z) = XZ$, $A_2(E) = E$; $A_1^*(Z) = X^T Z$, $A_2^*(E) = E$; $\eta_1 = \|X\|^2$, $\eta_2 = 1$.
13 Experiment. Table 1: Comparison among APG, ADM, LADM and LADMAP on synthetic data. For each quadruple $(s, p, d, \tilde{r})$, the LRR problem, with $\mu = 0.1$, was solved for the same data using different algorithms. Reported are typical running time (in $10^3$ seconds), iteration number, relative errors $\|\hat{Z} - Z_0\|/\|Z_0\|$ and $\|\hat{E} - E_0\|/\|E_0\|$ of the output solution $(\hat{E}, \hat{Z})$, and the clustering accuracy (%). Sizes tested: (10, 20, 200, 5), (15, 20, 300, 5), (20, 25, 500, 5), (30, 30, 900, 5). [The numerical entries of the table are not recoverable from this transcription; some entries are N.A.]
14 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Model problem: $\min_{x_1,\dots,x_n} \sum_{i=1}^n f_i(x_i)$, s.t. $\sum_{i=1}^n A_i(x_i) = b$. Example (nonnegative noisy matrix completion, reformulated step by step):
$\min_X \|X\|_* + \frac{1}{2\mu}\|b - \pi_\Omega(X)\|^2$, s.t. $X \geq 0$;
$\Leftrightarrow \min_{X,e} \|X\|_* + \frac{1}{2\mu}\|e\|^2$, s.t. $b = \pi_\Omega(X) + e$, $X \geq 0$;
$\Leftrightarrow \min_{X,Y,e} \|X\|_* + \frac{1}{2\mu}\|e\|^2$, s.t. $b = \pi_\Omega(Y) + e$, $X = Y$, $Y \geq 0$;
$\Leftrightarrow \min_{X,Y,e} \|X\|_* + \frac{1}{2\mu}\|e\|^2 + \chi_{Y \geq 0}(Y)$, s.t. $b = \pi_\Omega(Y) + e$, $X = Y$.
15 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Can we naively generalize two-block LADMAP to multi-block problems? No! The naive generalization of LADMAP may diverge, e.g., when applied to the following problem with $n \geq 5$: $\min_{x_1,\dots,x_n} \sum_{i=1}^n \|x_i\|_1$, s.t. $\sum_{i=1}^n A_i x_i = b$. See also C. Chen et al., The Direct Extension of ADMM for Multi-block Convex Minimization Problems is Not Necessarily Convergent, preprint.
16 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Update all blocks in parallel, each using only the previous iterate:
$x_i^{k+1} = \arg\min_{x_i} f_i(x_i) + \frac{\sigma_i^{(k)}}{2}\left\|x_i - x_i^k + A_i^*(\hat{\lambda}^k)/\sigma_i^{(k)}\right\|^2$, $i = 1,\dots,n$, where $\hat{\lambda}^k = \lambda^k + \beta_k\left(\sum_{j=1}^n A_j(x_j^k) - b\right)$ and $\sigma_i^{(k)} = \eta_i\beta_k$ (Parallel!);
$\lambda^{k+1} = \lambda^k + \beta_k\left(\sum_{i=1}^n A_i(x_i^{k+1}) - b\right)$;
$\beta_{k+1} = \min(\beta_{\max}, \rho\beta_k)$, where $\rho = \rho_0$ if $\beta_k\max\left\{\sqrt{\eta_i}\|x_i^{k+1} - x_i^k\|,\ i = 1,\dots,n\right\}/\|b\| < \varepsilon_2$ and $\rho = 1$ otherwise, with $\rho_0 > 1$ a constant and $0 < \varepsilon_2 \ll 1$ a threshold.
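The parallel updates and the adaptive penalty can be sketched together for the $\ell_1$ problem of the previous slide; this is an illustrative sketch with our own variable names, using $\eta_i > n\|A_i\|^2$ as the convergence theorem requires:

```python
import numpy as np

def soft(v, eps):
    # prox of eps * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - eps, 0.0)

def ladmpsap_l1(As, b, beta=0.1, rho0=1.5, eps2=1e-3, beta_max=1e8, iters=3000):
    # min sum_i ||x_i||_1  s.t.  sum_i A_i x_i = b, all blocks in parallel
    n = len(As)
    etas = [1.02 * n * np.linalg.norm(A, 2) ** 2 for A in As]  # eta_i > n ||A_i||^2
    xs = [np.zeros(A.shape[1]) for A in As]
    lam = np.zeros_like(b)
    for _ in range(iters):
        r = sum(A @ x for A, x in zip(As, xs)) - b
        lam_hat = lam + beta * r
        # parallel step: every block uses only the previous iterate via lam_hat
        new_xs = [soft(x - A.T @ lam_hat / (beta * eta), 1.0 / (beta * eta))
                  for A, x, eta in zip(As, xs, etas)]
        change = max(np.sqrt(eta) * np.linalg.norm(nx - x)
                     for nx, x, eta in zip(new_xs, xs, etas))
        xs = new_xs
        lam = lam + beta * (sum(A @ x for A, x in zip(As, xs)) - b)
        # adaptive penalty: grow beta only when the variables have nearly stalled
        if beta * change / np.linalg.norm(b) < eps2:
            beta = min(beta_max, rho0 * beta)
    return xs
```

On the toy instance $\min \sum_{i=1}^3 |x_i|$ s.t. $x_1 + x_2 + x_3 = 3$, every feasible nonnegative split is optimal with objective 3, so near-feasibility and near-optimal objective are easy to check.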
17 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Theorem: If $\{\beta_k\}$ is non-decreasing and upper bounded and $\eta_i > n\|A_i\|^2$, $i = 1,\dots,n$, then $\{(\{x_i^k\}, \lambda^k)\}$ generated by LADMPSAP converges to a KKT point of the problem. Remark: When $n = 2$, LADMPSAP is weaker than LADMAP: it requires $\eta_i > 2\|A_i\|^2$ vs. $\eta_i > \|A_i\|^2$. Related work: He & Yuan, Linearized Alternating Direction Method with Gaussian Back Substitution for Separable Convex Programming, preprint.
18 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Model problem with set constraints: $\min_{x_1,\dots,x_n} \sum_{i=1}^n f_i(x_i)$, s.t. $\sum_{i=1}^n A_i(x_i) = b$, $x_i \in X_i$, $i = 1,\dots,n$, where each $X_i \subseteq \mathbb{R}^{d_i}$ is a closed convex set. Reformulate with auxiliary variables $x_{n+1},\dots,x_{2n}$: $\min_{x_1,\dots,x_{2n}} \sum_{i=1}^n f_i(x_i) + \sum_{i=n+1}^{2n} \chi_{X_{i-n}}(x_i)$, s.t. $\sum_{i=1}^n A_i(x_i) = b$, $x_i = x_{n+i}$, $i = 1,\dots,n$. Theorem: If $\{\beta_k\}$ is non-decreasing and upper bounded, and $\eta_i > n(\|A_i\|^2 + 1)$, $\eta_{n+i} > n$, $i = 1,\dots,n$, then $\{(\{x_i^k\}, \lambda^k)\}$ generated by LADMPSAP converges to a KKT point of the problem.
19 Experiment. Solve the nonnegative noisy matrix completion problem $\min_X \|X\|_* + \frac{1}{2\mu}\|b - \pi_\Omega(X)\|^2$, s.t. $X \geq 0$, via its reformulation $\min_{X,Y,e} \|X\|_* + \frac{1}{2\mu}\|e\|^2 + \chi_{Y \geq 0}(Y)$, s.t. $b = \pi_\Omega(Y) + e$, $X = Y$.
20 Experiment. Table 1: Numerical comparison on the NMC problem with synthetic data, average of 10 runs. $q$, $t$ and $d_r$ denote, respectively, the sample ratio, the number of measurements $t = q(mn)$, and the "degree of freedom" defined by $d_r = r(m + n - r)$ for a matrix with rank $r$. Here $m = n$ and $r = 10$ in all tests; reported are iteration number, time (s), relative error, and FA for LADM and LADMPSAP. Table 2: Numerical comparison on the image inpainting problem (\#Iter., time, PSNR, FA) for FPCA, LADM, and LADMPSAP. [The numerical entries of both tables are not recoverable from this transcription.]
21 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Enhanced convergence results. Theorem 1: If $\{\beta_k\}$ is non-decreasing with $\sum_{k=1}^{+\infty} \beta_k^{-1} = +\infty$, $\eta_i > n\|A_i\|^2$, and $\partial f_i(x)$ is bounded, $i = 1,\dots,n$, then the sequence $\{x_i^k\}$ generated by LADMPSAP converges to an optimal solution to the model problem. Theorem 2: If $\{\beta_k\}$ is non-decreasing, $\eta_i > n\|A_i\|^2$, and $\partial f_i(x)$ is bounded, $i = 1,\dots,n$, then $\sum_{k=1}^{+\infty} \beta_k^{-1} = +\infty$ is also a necessary condition for the global convergence of $\{x_i^k\}$ generated by LADMPSAP to an optimal solution to the model problem. Consequently, when all the subgradients of the component objective functions are bounded, the upper bound $\beta_{\max}$ can be removed.
22 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Define $\mathbf{x} = (x_1^T, \dots, x_n^T)^T$, $\mathbf{x}^* = ((x_1^*)^T, \dots, (x_n^*)^T)^T$ and $f(\mathbf{x}) = \sum_{i=1}^n f_i(x_i)$, where $(x_1^*, \dots, x_n^*, \lambda^*)$ is a KKT point of the model problem. Proposition: $\tilde{\mathbf{x}}$ is an optimal solution to the model problem iff there exists $\alpha > 0$ such that $f(\tilde{\mathbf{x}}) - f(\mathbf{x}^*) + \sum_{i=1}^n \langle A_i^*(\lambda^*), \tilde{x}_i - x_i^* \rangle + \alpha\left\|\sum_{i=1}^n A_i(\tilde{x}_i) - b\right\|^2 = 0$. Our criterion for checking the optimality of a solution is much simpler than that in He & Yuan 2011, which has to compare with all $(x_1, \dots, x_n, \lambda) \in \mathbb{R}^{d_1} \times \cdots \times \mathbb{R}^{d_n} \times \mathbb{R}^m$. B. S. He and X. Yuan, On the O(1/t) convergence rate of alternating direction method, preprint, 2011.
23 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Theorem 3 (ergodic convergence rate): Define $\bar{\mathbf{x}}^K = \sum_{k=0}^K \gamma_k \mathbf{x}^{k+1}$, where $\gamma_k = \beta_k^{-1}\big/\sum_{j=0}^K \beta_j^{-1}$. Then
$f(\bar{\mathbf{x}}^K) - f(\mathbf{x}^*) + \sum_{i=1}^n \langle A_i^*(\lambda^*), \bar{x}_i^K - x_i^* \rangle + \alpha\left\|\sum_{i=1}^n A_i(\bar{x}_i^K) - b\right\|^2 \leq \frac{C_0}{2\sum_{k=0}^K \beta_k^{-1}}$, (1)
where $\alpha^{-1} = (n+1)\max\left(1, \max_i \frac{\|A_i\|^2}{\eta_i - n\|A_i\|^2}\right)$ and $C_0 = \sum_{i=1}^n \eta_i\|x_i^0 - x_i^*\|^2 + \beta_0^{-2}\|\lambda^0 - \lambda^*\|^2$. A much simpler proof of the convergence rate (in the ergodic sense)!
24 Proximal LADMPSAP. An even more general problem: $\min_{x_1,\dots,x_n} \sum_{i=1}^n f_i(x_i)$, s.t. $\sum_{i=1}^n A_i(x_i) = b$, where $f_i(x_i) = g_i(x_i) + h_i(x_i)$, both $g_i$ and $h_i$ are convex, $g_i$ is $C^{1,1}$: $\|\nabla g_i(x) - \nabla g_i(y)\| \leq L_i\|x - y\|$, $\forall x, y \in \mathbb{R}^{d_i}$, and $h_i$ may not be differentiable but its proximal operation is easily solvable.
25 Proximal LADMPSAP. Linearize the augmented term to obtain $x_i^{k+1} = \arg\min_{x_i} h_i(x_i) + g_i(x_i) + \frac{\sigma_i^{(k)}}{2}\left\|x_i - x_i^k + A_i^*(\hat{\lambda}^k)/\sigma_i^{(k)}\right\|^2$, $i = 1,\dots,n$. Further linearize $g_i$:
$x_i^{k+1} = \arg\min_{x_i} h_i(x_i) + g_i(x_i^k) + \langle \nabla g_i(x_i^k) + A_i^*(\hat{\lambda}^k),\, x_i - x_i^k \rangle + \frac{\tau_i^{(k)}}{2}\|x_i - x_i^k\|^2 = \arg\min_{x_i} h_i(x_i) + \frac{\tau_i^{(k)}}{2}\left\|x_i - x_i^k + \left[A_i^*(\hat{\lambda}^k) + \nabla g_i(x_i^k)\right]/\tau_i^{(k)}\right\|^2$.
Convergence condition: $\tau_i^{(k)} = T_i + \eta_i\beta_k$, where $T_i \geq L_i$ and $\eta_i > n\|A_i\|^2$ are both positive constants.
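The doubly linearized step can be sketched for a single block ($n = 1$) with $h = \|\cdot\|_1$ and a smooth quadratic $g(x) = \frac{1}{2}\|Cx - d\|^2$; this is an illustrative sketch with a fixed penalty and our own names:

```python
import numpy as np

def soft(v, eps):
    # prox of eps * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - eps, 0.0)

def prox_ladm(C, d, A, b, beta=1.0, iters=4000):
    # min ||x||_1 + 0.5 ||Cx - d||^2  s.t.  Ax = b  (single block, n = 1)
    L = np.linalg.norm(C, 2) ** 2           # Lipschitz constant of grad g
    eta = 1.02 * np.linalg.norm(A, 2) ** 2  # eta > ||A||^2
    x = np.zeros(C.shape[1]); lam = np.zeros_like(b)
    for _ in range(iters):
        lam_hat = lam + beta * (A @ x - b)
        tau = L + eta * beta                # tau^(k) = T + eta * beta_k, T >= L
        grad = C.T @ (C @ x - d)            # gradient of the smooth part g
        # prox step on h at the doubly linearized point
        x = soft(x - (A.T @ lam_hat + grad) / tau, 1.0 / tau)
        lam = lam + beta * (A @ x - b)
    return x
```

For $C = I$, $d = (2, 0)$ and the constraint $x_1 + x_2 = 1$, a short case analysis gives the optimum $x^* = (1, 0)$, which the sketch can be checked against.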
26 Experiment: Group Sparse Logistic Regression with Overlap.
$\min_{w,b} \frac{1}{s}\sum_{i=1}^s \log\left(1 + \exp\left(-y_i(w^T x_i + b)\right)\right) + \mu\sum_{j=1}^t \|S_j w\|$, (1)
where $x_i$ and $y_i$, $i = 1,\dots,s$, are the training data and labels, respectively, and $w$ and $b$ parameterize the linear classifier. The $S_j$, $j = 1,\dots,t$, are selection matrices, with only one 1 in each row and all other entries zero. The groups of entries $S_j w$, $j = 1,\dots,t$, may overlap each other. Introducing $\bar{w} = (w^T, b)^T$, $\bar{x}_i = (x_i^T, 1)^T$, $z = (z_1^T, z_2^T, \dots, z_t^T)^T$, and $\bar{S} = (S, 0)$, where $S = (S_1^T, \dots, S_t^T)^T$, (1) can be rewritten as
$\min_{\bar{w},z} \frac{1}{s}\sum_{i=1}^s \log\left(1 + \exp\left(-y_i \bar{w}^T \bar{x}_i\right)\right) + \mu\sum_{j=1}^t \|z_j\|$, s.t. $z = \bar{S}\bar{w}$. (2)
The Lipschitz constant of the gradient of the logistic term with respect to $\bar{w}$ can be proven to satisfy $L_{\bar{w}} \leq \frac{1}{4s}\|\bar{X}\|^2$, where $\bar{X} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_s)$.
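The Lipschitz bound on the logistic gradient follows from the Hessian bound $\frac{1}{s}\bar{X}\,\mathrm{diag}(\sigma'(m_i))\,\bar{X}^T \preceq \frac{1}{4s}\bar{X}\bar{X}^T$ (since $\sigma' \leq 1/4$), and it can be sanity-checked numerically; the helper names below are our own:

```python
import numpy as np

def logistic_grad(wbar, Xbar, y):
    # gradient of (1/s) * sum_i log(1 + exp(-y_i * wbar^T xbar_i)) w.r.t. wbar,
    # where the columns of Xbar are the augmented samples xbar_i
    s = len(y)
    m = y * (Xbar.T @ wbar)                  # margins y_i * wbar^T xbar_i
    return -(Xbar @ (y / (1.0 + np.exp(m)))) / s

def logistic_lipschitz_bound(Xbar):
    # claimed bound: L <= ||Xbar||_2^2 / (4 s)
    return np.linalg.norm(Xbar, 2) ** 2 / (4.0 * Xbar.shape[1])
```

Sampling random pairs of weight vectors and checking $\|\nabla g(w_1) - \nabla g(w_2)\| \leq L\|w_1 - w_2\|$ gives a quick empirical confirmation of the bound.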
27 Experiment. Comparison of ADM, LADM, LADMPS, LADMPSAP and pLADMPSAP on group sparse logistic regression: time, \#Iter., and relative errors $\|\hat{\bar{w}} - \bar{w}^*\|/\|\bar{w}^*\|$ and $\|\hat{z} - z^*\|/\|z^*\|$ for sizes $(s, p, t, q) = (300, 901, 100, 10)$, $(450, 1351, 150, 15)$, $(600, 1801, 200, 20)$. [The numerical entries of the table are not recoverable from this transcription.]
28 Conclusions. LADMAP, LADMPSAP, and proximal LADMPSAP (pLADMPSAP) are very general methods for solving various convex programs. The adaptive penalty is important for fast convergence.
29 Thanks!
Warm up D cai.yo.ie p IExrL9CxsYD Sglx.Ddl f E Luo fhlexi.si dbll Fix any a, b, c > 0. 1. What is the x 2 R that minimizes ax 2 + bx + c x a b Ta OH 2 ax 16 0 x 1 Za fhkxiiso3ii draulx.h dp.d 2. What is
More informationSparse Optimization Lecture: Dual Methods, Part I
Sparse Optimization Lecture: Dual Methods, Part I Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know dual (sub)gradient iteration augmented l 1 iteration
More informationAlgorithms for constrained local optimization
Algorithms for constrained local optimization Fabio Schoen 2008 http://gol.dsi.unifi.it/users/schoen Algorithms for constrained local optimization p. Feasible direction methods Algorithms for constrained
More informationMATH 680 Fall November 27, Homework 3
MATH 680 Fall 208 November 27, 208 Homework 3 This homework is due on December 9 at :59pm. Provide both pdf, R files. Make an individual R file with proper comments for each sub-problem. Subgradients and
More information4 Frequent Directions
4 Frequent Directions Edo Liberty[3] discovered a strong connection between matrix sketching and frequent items problems. In FREQUENTITEMS problem, we are given a stream S = hs 1,s 2,...,s n i of n items
More informationDistributed Convex Optimization
Master Program 2013-2015 Electrical Engineering Distributed Convex Optimization A Study on the Primal-Dual Method of Multipliers Delft University of Technology He Ming Zhang, Guoqiang Zhang, Richard Heusdens
More informationStatistical Methods for SVM
Statistical Methods for SVM Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot,
More informationNumerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725
Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple
More informationCoordinate Update Algorithm Short Course Proximal Operators and Algorithms
Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow
More informationLINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input
More information2.3. Clustering or vector quantization 57
Multivariate Statistics non-negative matrix factorisation and sparse dictionary learning The PCA decomposition is by construction optimal solution to argmin A R n q,h R q p X AH 2 2 under constraint :
More informationProbabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationLINEARIZED AUGMENTED LAGRANGIAN AND ALTERNATING DIRECTION METHODS FOR NUCLEAR NORM MINIMIZATION
LINEARIZED AUGMENTED LAGRANGIAN AND ALTERNATING DIRECTION METHODS FOR NUCLEAR NORM MINIMIZATION JUNFENG YANG AND XIAOMING YUAN Abstract. The nuclear norm is widely used to induce low-rank solutions for
More informationKernel Methods and Support Vector Machines
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More informationWarm up. Regrade requests submitted directly in Gradescope, do not instructors.
Warm up Regrade requests submitted directly in Gradescope, do not email instructors. 1 float in NumPy = 8 bytes 10 6 2 20 bytes = 1 MB 10 9 2 30 bytes = 1 GB For each block compute the memory required
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationLecture Notes 10: Matrix Factorization
Optimization-based data analysis Fall 207 Lecture Notes 0: Matrix Factorization Low-rank models. Rank- model Consider the problem of modeling a quantity y[i, j] that depends on two indices i and j. To
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationAn ADMM algorithm for optimal sensor and actuator selection
An ADMM algorithm for optimal sensor and actuator selection Neil K. Dhingra, Mihailo R. Jovanović, and Zhi-Quan Luo 53rd Conference on Decision and Control, Los Angeles, California, 2014 1 / 25 2 / 25
More informationA Proof of Theorem 1. Dual Weighted Generalized Nuclear Norm. C Proof of Theorem 2
A Proof of Theorem Proof or all Z = P i c i i with kak A = P i c i, where i = Au i v T i BT, we can construct the i-th column of W and H as Clearly, we have Z = WH T and w i = p c i Au i and h i = p c
More information1 Computing with constraints
Notes for 2017-04-26 1 Computing with constraints Recall that our basic problem is minimize φ(x) s.t. x Ω where the feasible set Ω is defined by equality and inequality conditions Ω = {x R n : c i (x)
More informationPositive Definite Matrix
1/29 Chia-Ping Chen Professor Department of Computer Science and Engineering National Sun Yat-sen University Linear Algebra Positive Definite, Negative Definite, Indefinite 2/29 Pure Quadratic Function
More informationsublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU)
sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) 0 overview Our Contributions: 1 overview Our Contributions: A near optimal low-rank
More informationarxiv: v1 [math.oc] 23 May 2017
A DERANDOMIZED ALGORITHM FOR RP-ADMM WITH SYMMETRIC GAUSS-SEIDEL METHOD JINCHAO XU, KAILAI XU, AND YINYU YE arxiv:1705.08389v1 [math.oc] 23 May 2017 Abstract. For multi-block alternating direction method
More informationProximal Newton Method. Ryan Tibshirani Convex Optimization /36-725
Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h
More informationDual and primal-dual methods
ELE 538B: Large-Scale Optimization for Data Science Dual and primal-dual methods Yuxin Chen Princeton University, Spring 2018 Outline Dual proximal gradient method Primal-dual proximal gradient method
More informationContraction Methods for Convex Optimization and Monotone Variational Inequalities No.18
XVIII - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No18 Linearized alternating direction method with Gaussian back substitution for separable convex optimization
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationLeast Squares Regression
CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the
More information