Accelerated primal-dual methods for linearly constrained convex problems

Yangyang Xu

SIAM Conference on Optimization
May 24, 2017
Accelerated proximal gradient

For the convex composite problem

  minimize_x F(x) := f(x) + g(x)

- f: convex and Lipschitz differentiable
- g: closed convex (possibly nondifferentiable) and simple

Proximal gradient:

  x^{k+1} = \arg\min_x \langle \nabla f(x^k), x \rangle + \frac{L_f}{2} \|x - x^k\|^2 + g(x)

- convergence rate: F(x^k) - F(x^*) = O(1/k)

Accelerated proximal gradient [Beck-Teboulle 09, Nesterov 14]:

  \hat{x}^k: extrapolated point
  x^{k+1} = \arg\min_x \langle \nabla f(\hat{x}^k), x \rangle + \frac{L_f}{2} \|x - \hat{x}^k\|^2 + g(x)

- convergence rate (with smart extrapolation): F(x^k) - F(x^*) = O(1/k^2)

This talk: ways to accelerate primal-dual methods
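To make the extrapolation concrete, here is a minimal runnable sketch (Python/NumPy) of the accelerated proximal gradient method with the standard FISTA momentum sequence. The test problem, f(x) = \frac{1}{2}\|Ax - d\|^2 with g = \lambda\|x\|_1, and all names are illustrative choices, not taken from the talk.

```python
import numpy as np

def accel_prox_grad(grad_f, prox_g, L_f, x0, iters=500):
    """Accelerated proximal gradient: prox step taken at an extrapolated point."""
    x_prev = x0.copy()
    x = x0.copy()
    t = 1.0
    for _ in range(iters):
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2      # FISTA momentum sequence
        x_hat = x + ((t - 1) / t_next) * (x - x_prev)  # extrapolated point \hat{x}^k
        x_prev = x
        x = prox_g(x_hat - grad_f(x_hat) / L_f, 1.0 / L_f)
        t = t_next
    return x

# Illustrative composite problem: f(x) = 0.5*||Ax - d||^2, g(x) = lam*||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100)); d = rng.standard_normal(40); lam = 0.1
L_f = np.linalg.norm(A, 2) ** 2                        # Lipschitz constant of grad f
grad_f = lambda x: A.T @ (A @ x - d)
prox_g = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - lam * s, 0.0)
x_sol = accel_prox_grad(grad_f, prox_g, L_f, np.zeros(100))
```

Setting x_hat = x (no extrapolation) recovers the plain proximal gradient method and its O(1/k) rate.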
Part I: accelerated linearized augmented Lagrangian
Affinely constrained composite convex problems

  minimize_x F(x) = f(x) + g(x), subject to Ax = b    (LCP)

- f: convex and Lipschitz differentiable
- g: closed convex and simple

Examples:
- nonnegative quadratic programming: f = \frac{1}{2} x^\top Q x + c^\top x, g = \iota_{\mathbb{R}^n_+}
- TV image denoising: \min \{ \frac{1}{2}\|X - B\|_F^2 + \lambda \|Y\|_1, \text{ s.t. } D(X) = Y \}
Augmented Lagrangian method (ALM)

At iteration k,

  x^{k+1} = \arg\min_x f(x) + g(x) - \langle \lambda^k, Ax \rangle + \frac{\beta}{2}\|Ax - b\|^2,
  \lambda^{k+1} = \lambda^k - \gamma (Ax^{k+1} - b)

- the multiplier update is augmented dual gradient ascent with stepsize γ
- β: penalty parameter; the augmented dual gradient has Lipschitz constant 1/β
- 0 < γ < 2β: convergence guaranteed
- also popular for (nonlinear, nonconvex) constrained problems
- but the x-subproblem is as difficult as the original problem
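As a concrete instance, here is a hedged sketch of ALM on an equality-constrained QP. Taking g = 0 makes the x-subproblem an exact linear solve, so the loop matches the slide; the problem data and names are illustrative.

```python
import numpy as np

def alm_qp(Q, c, A, b, beta=10.0, gamma=10.0, iters=200):
    """ALM for min 0.5 x'Qx + c'x  s.t.  Ax = b (g = 0: subproblem solved exactly)."""
    x = np.zeros(Q.shape[0])
    lam = np.zeros(A.shape[0])
    H = Q + beta * A.T @ A                 # Hessian of the augmented Lagrangian in x
    for _ in range(iters):
        x = np.linalg.solve(H, A.T @ lam - c + beta * (A.T @ b))  # x-subproblem
        lam = lam - gamma * (A @ x - b)    # dual ascent step; needs 0 < gamma < 2*beta
    return x, lam
```

For a general g the arg-min has no closed form, which is exactly the difficulty the linearized variants on the next slides remove.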
Linearized augmented Lagrangian method

Linearize the smooth term f:

  x^{k+1} = \arg\min_x \langle \nabla f(x^k), x \rangle + \frac{\eta}{2}\|x - x^k\|^2 + g(x) - \langle \lambda^k, Ax \rangle + \frac{\beta}{2}\|Ax - b\|^2.

Linearize both f and \|Ax - b\|^2:

  x^{k+1} = \arg\min_x \langle \nabla f(x^k), x \rangle + g(x) - \langle \lambda^k, Ax \rangle + \beta \langle A^\top r^k, x \rangle + \frac{\eta}{2}\|x - x^k\|^2,

where r^k = Ax^k - b is the residual.

Easier updates and a nice convergence speed: O(1/k).
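A minimal sketch of the fully linearized variant: with both f and the augmented term linearized, the x-update collapses to a single prox of g. Here prox_g(v, s) denotes the proximal operator of g with stepsize s; that interface is an illustrative assumption.

```python
import numpy as np

def linearized_alm(grad_f, prox_g, A, b, eta, beta, gamma, x0, lam0, iters=500):
    """Fully linearized ALM: each x-update is one prox step with stepsize 1/eta."""
    x, lam = x0.copy(), lam0.copy()
    for _ in range(iters):
        r = A @ x - b                                 # residual r^k
        v = grad_f(x) - A.T @ lam + beta * (A.T @ r)  # gradient of the linearized part
        x = prox_g(x - v / eta, 1.0 / eta)
        lam = lam - gamma * (A @ x - b)
    return x, lam
```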
Accelerated linearized augmented Lagrangian method

At iteration k,

  \hat{x}^k = (1 - \alpha_k)\bar{x}^k + \alpha_k x^k,
  x^{k+1} = \arg\min_x \langle \nabla f(\hat{x}^k) - A^\top \lambda^k, x \rangle + g(x) + \frac{\beta_k}{2}\|Ax - b\|^2 + \frac{\eta_k}{2}\|x - x^k\|^2,
  \bar{x}^{k+1} = (1 - \alpha_k)\bar{x}^k + \alpha_k x^{k+1},
  \lambda^{k+1} = \lambda^k - \gamma_k (Ax^{k+1} - b).

- inspired by [Lan 12] on accelerated stochastic approximation
- reduces to linearized ALM if \alpha_k = 1, \beta_k = \beta, \eta_k = \eta, \gamma_k = \gamma for all k
- convergence rate: O(1/k) if \eta \ge L_f and 0 < \gamma < 2\beta
- adaptive parameters give O(1/k^2) (next slides; a code sketch follows the theorem slide)
Better numerical performance

[Figure: objective error (objective minus optimal value, left) and feasibility violation (right), both on a log scale, versus iteration number (0 to 1000), for Nonaccelerated ALM and Accelerated ALM.]

- tested on quadratic programming (subproblems solved exactly)
- parameters set according to the theorem (see next slide)
- accelerated ALM significantly better
Guaranteed fast convergence

Assumptions:
- there exists a primal-dual solution pair (x^*, \lambda^*)
- \nabla f is Lipschitz continuous: \|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\|

Convergence rate of order O(1/k^2): set the parameters to

  \alpha_k = \frac{2}{k+1}, \quad \gamma_k = k\gamma, \quad \beta_k \ge \gamma_k, \quad \eta_k = \frac{\eta}{k}, \quad \forall k,

where \gamma > 0 and \eta \ge 2L_f. Then

  F(\bar{x}^{k+1}) - F(x^*) \le \frac{1}{k(k+1)} \left( \eta \|x^1 - x^*\|^2 + \frac{4\|\lambda^*\|^2}{\gamma} \right),
  \|A\bar{x}^{k+1} - b\| \le \frac{1}{k(k+1)\max(1, \|\lambda^*\|)} \left( \eta \|x^1 - x^*\|^2 + \frac{4\|\lambda^*\|^2}{\gamma} \right).
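A runnable sketch of the accelerated linearized ALM from two slides back, wired to this theorem's schedule (\alpha_k = 2/(k+1), \gamma_k = k\gamma, \beta_k = \gamma_k, \eta_k = \eta/k with \eta \ge 2L_f). To keep the x-update in closed form it takes g = 0, so the subproblem is a linear solve; that simplification and all names are mine, not the talk's.

```python
import numpy as np

def acc_lin_alm(grad_f, A, b, gamma, eta, x0, iters=300):
    """Accelerated linearized ALM sketch (g = 0) with the adaptive schedule."""
    n = x0.size
    x = x0.copy(); x_bar = x0.copy(); lam = np.zeros(A.shape[0])
    AtA = A.T @ A; Atb = A.T @ b
    for k in range(1, iters + 1):
        alpha = 2.0 / (k + 1)
        gamma_k = k * gamma; beta_k = gamma_k; eta_k = eta / k
        x_hat = (1 - alpha) * x_bar + alpha * x   # extrapolated point
        # x-update: grad of f frozen at x_hat, full augmented term kept (quadratic)
        rhs = eta_k * x - grad_f(x_hat) + A.T @ lam + beta_k * Atb
        x = np.linalg.solve(eta_k * np.eye(n) + beta_k * AtA, rhs)
        x_bar = (1 - alpha) * x_bar + alpha * x   # averaged (output) sequence
        lam = lam - gamma_k * (A @ x - b)         # dual step grows like k*gamma
    return x_bar, lam
```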
Sketch of proof

Let \Phi(\bar{x}, x, \lambda) = F(\bar{x}) - F(x) - \langle \lambda, A\bar{x} - b \rangle.

1. Fundamental inequality (for any λ):

  \Phi(\bar{x}^{k+1}, x, \lambda) - (1 - \alpha_k)\Phi(\bar{x}^k, x, \lambda)
   \le \frac{\alpha_k \eta_k}{2}\left[ \|x^k - x\|^2 - \|x^{k+1} - x\|^2 - \|x^{k+1} - x^k\|^2 \right] + \frac{\alpha_k^2 L_f}{2}\|x^{k+1} - x^k\|^2
    + \frac{\alpha_k}{2\gamma_k}\left[ \|\lambda^k - \lambda\|^2 - \|\lambda^{k+1} - \lambda\|^2 + \|\lambda^{k+1} - \lambda^k\|^2 \right] - \frac{\alpha_k \beta_k}{2\gamma_k^2}\|\lambda^{k+1} - \lambda^k\|^2, \quad \forall k.

2. Plug in \alpha_k = \frac{2}{k+1}, \gamma_k = k\gamma, \beta_k \ge \gamma_k, \eta_k = \frac{\eta}{k} and multiply the inequality by k(k+1):

  k(k+1)\Phi(\bar{x}^{k+1}, x, \lambda) - k(k-1)\Phi(\bar{x}^k, x, \lambda)
   \le \eta\left[ \|x^k - x\|^2 - \|x^{k+1} - x\|^2 \right] + \frac{1}{\gamma}\left[ \|\lambda^k - \lambda\|^2 - \|\lambda^{k+1} - \lambda\|^2 \right].

3. Set \lambda^1 = 0 and sum the inequality over k:

  \Phi(\bar{x}^{k+1}, x, \lambda) \le \frac{1}{k(k+1)}\left( \eta\|x^1 - x\|^2 + \frac{1}{\gamma}\|\lambda\|^2 \right).

4. Take \lambda = \max(1 + \|\lambda^*\|, 2\|\lambda^*\|)\,\frac{A\bar{x}^{k+1} - b}{\|A\bar{x}^{k+1} - b\|} and use the optimality condition \Phi(\bar{x}^{k+1}, x^*, \lambda^*) \ge 0, i.e., F(\bar{x}^{k+1}) - F(x^*) \ge -\|\lambda^*\|\,\|A\bar{x}^{k+1} - b\|.
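To see the telescoping in steps 2-3 explicitly (a worked check under the reconstructed schedule), write s_j = j(j-1)\Phi(\bar{x}^j, x, \lambda), so that s_1 = 0 and step 2 bounds s_{j+1} - s_j. Summing over j = 1, \dots, k:

  k(k+1)\,\Phi(\bar{x}^{k+1}, x, \lambda) = \sum_{j=1}^{k} (s_{j+1} - s_j)
   \le \eta\left( \|x^1 - x\|^2 - \|x^{k+1} - x\|^2 \right) + \frac{1}{\gamma}\left( \|\lambda^1 - \lambda\|^2 - \|\lambda^{k+1} - \lambda\|^2 \right)
   \le \eta\|x^1 - x\|^2 + \frac{1}{\gamma}\|\lambda\|^2,

using \lambda^1 = 0 in the last step; dividing by k(k+1) gives step 3.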
Literature

- [He-Yuan 10]: ALM accelerated to O(1/k^2) for smooth problems
- [Kang et al. 13]: ALM accelerated to O(1/k^2) for nonsmooth problems
- [Huang-Ma-Goldfarb 13]: linearized ALM (with linearization of the augmented term) accelerated to O(1/k^2) for strongly convex problems
- [Li-Lin 16]: with weak convexity only, O(1/k) is optimal if the augmented term is linearized
Part II: accelerated linearized ADMM
Two-block structured problems

The variable is partitioned into two blocks, the smooth part involves one block, and the nonsmooth part is separable:

  minimize_{y,z} h(y) + f(z) + g(z), subject to By + Cz = b    (LCP-2)

- f: convex and Lipschitz differentiable
- g and h: closed convex and simple

Example: total-variation regularized regression:

  \min_{y,z} \lambda\|y\|_1 + f(z), \text{ s.t. } Dz = y
Alternating direction method of multipliers (ADMM)

At iteration k,

  y^{k+1} = \arg\min_y h(y) - \langle \lambda^k, By \rangle + \frac{\beta}{2}\|By + Cz^k - b\|^2,
  z^{k+1} = \arg\min_z f(z) + g(z) - \langle \lambda^k, Cz \rangle + \frac{\beta}{2}\|By^{k+1} + Cz - b\|^2,
  \lambda^{k+1} = \lambda^k - \gamma (By^{k+1} + Cz^{k+1} - b)

- 0 < \gamma < \frac{1+\sqrt{5}}{2}\beta: convergence guaranteed [Glowinski-Marrocco 75]
- updating y and z alternately is easier than updating them jointly
- but the z-subproblem can still be difficult
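A minimal runnable instance: ADMM on an ℓ1-regularized least-squares splitting, \min_{y,z} \lambda_r\|y\|_1 + \frac{1}{2}\|Az - d\|^2 s.t. y - z = 0, i.e., B = I, C = -I, b = 0 in the slide's notation. The example and all names are illustrative.

```python
import numpy as np

def admm_l1_ls(A, d, lam_reg, beta=1.0, gamma=1.0, iters=300):
    """ADMM on min lam_reg*||y||_1 + 0.5*||Az - d||^2  s.t.  y - z = 0."""
    n = A.shape[1]
    y = np.zeros(n); z = np.zeros(n); lam = np.zeros(n)
    H = A.T @ A + beta * np.eye(n)                    # z-subproblem system matrix
    Atd = A.T @ d
    for _ in range(iters):
        v = z + lam / beta                            # y-subproblem: soft-thresholding
        y = np.sign(v) * np.maximum(np.abs(v) - lam_reg / beta, 0.0)
        z = np.linalg.solve(H, Atd - lam + beta * y)  # z-subproblem: linear solve
        lam = lam - gamma * (y - z)                   # dual update on By + Cz - b = y - z
    return y, z
```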
Accelerated linearized ADMM

At iteration k,

  y^{k+1} = \arg\min_y h(y) - \langle \lambda^k, By \rangle + \frac{\beta_k}{2}\|By + Cz^k - b\|^2,
  z^{k+1} = \arg\min_z \langle \nabla f(z^k) - C^\top \lambda^k + \beta_k C^\top r^{k+1/2}, z \rangle + g(z) + \frac{\eta_k}{2}\|z - z^k\|^2,
  \lambda^{k+1} = \lambda^k - \gamma_k (By^{k+1} + Cz^{k+1} - b),

where r^{k+1/2} = By^{k+1} + Cz^k - b.

- reduces to linearized ADMM if \beta_k = \beta, \eta_k = \eta, \gamma_k = \gamma for all k
- convergence rate: O(1/k) if 0 < \gamma \le \beta and \eta \ge L_f + \beta\|C\|^2
- O(1/k^2) with adaptive parameters and strong convexity in the z-block (next two slides; a code sketch follows the theorem slide)
Accelerated convergence speed

Assumptions:
- there exists a primal-dual solution (y^*, z^*, \lambda^*)
- \nabla f is Lipschitz continuous: \|\nabla f(\hat{z}) - \nabla f(\tilde{z})\| \le L_f \|\hat{z} - \tilde{z}\|
- f is strongly convex with modulus \mu_f (not required for the y-block)

Convergence rate of order O(1/k^2): set the parameters as follows (with \gamma > 0 and \gamma \le \eta \le \mu_f/2):

  \beta_k = \gamma_k = (k+1)\gamma, \quad \eta_k = (k+1)\eta + L_f, \quad \forall k.

Then

  \max\left( \|z^k - z^*\|^2, \; F(\bar{y}^k, \bar{z}^k) - F^*, \; \|B\bar{y}^k + C\bar{z}^k - b\| \right) = O(1/k^2),

where F(y, z) = h(y) + f(z) + g(z) and F^* = F(y^*, z^*).
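A minimal Python sketch of the accelerated linearized ADMM with this adaptive schedule. To keep every update in closed form it assumes B = I (so the y-update is a prox of h) and prox-friendly g; prox_h(v, s) and prox_g(v, s) denote proximal operators with stepsize s. The B = I simplification and all names are assumptions for illustration.

```python
import numpy as np

def acc_lin_admm(grad_f, prox_h, prox_g, C, b, L_f, gamma, eta, z0, iters=300):
    """Accelerated linearized ADMM sketch for
    min h(y) + f(z) + g(z)  s.t.  y + Cz = b   (B = I),
    with beta_k = gamma_k = (k+1)*gamma and eta_k = (k+1)*eta + L_f."""
    m = b.size
    y = np.zeros(m); z = z0.copy(); lam = np.zeros(m)
    for k in range(1, iters + 1):
        beta = (k + 1) * gamma; eta_k = (k + 1) * eta + L_f
        # y-update: prox of h at the point completing the square (B = I)
        y = prox_h(b - C @ z + lam / beta, 1.0 / beta)
        # z-update: one prox step on the linearized augmented Lagrangian
        r_half = y + C @ z - b                       # r^{k+1/2}
        v = grad_f(z) - C.T @ lam + beta * (C.T @ r_half)
        z = prox_g(z - v / eta_k, 1.0 / eta_k)
        lam = lam - beta * (y + C @ z - b)           # gamma_k = beta_k here
    return y, z, lam
```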
Sketch of proof

1. A fundamental inequality follows from the optimality conditions of each iterate: for any λ,

  F(y^{k+1}, z^{k+1}) - F(y, z) - \langle \lambda, By^{k+1} + Cz^{k+1} - b \rangle

is bounded by dual cross terms, namely \frac{1}{\gamma_k}\langle \lambda^k - \lambda^{k+1}, \lambda - \lambda^k \rangle and terms involving \lambda^k - \frac{\beta_k}{\gamma_k}(\lambda^k - \lambda^{k+1}) - \beta_k C(z^{k+1} - z^k), plus

  \frac{L_f}{2}\|z^{k+1} - z^k\|^2 - \frac{\mu_f}{2}\|z^k - z\|^2 - \eta_k \langle z^{k+1} - z, z^{k+1} - z^k \rangle.

2. Plug in the parameters and bound the cross terms:

  F(y^{k+1}, z^{k+1}) - F(y^*, z^*) - \langle \lambda, By^{k+1} + Cz^{k+1} - b \rangle
   + \frac{1}{2}\left( \eta(k+1) + L_f \right)\|z^{k+1} - z^*\|^2 + \frac{1}{2\gamma(k+1)}\|\lambda - \lambda^{k+1}\|^2
   \le \frac{1}{2}\left( \eta(k+1) + L_f - \mu_f \right)\|z^k - z^*\|^2 + \frac{1}{2\gamma(k+1)}\|\lambda - \lambda^k\|^2.

3. Multiply by k + k_0 (with k_0 \ge \frac{2L_f}{\mu_f}) and sum the inequality over k:

  F(\bar{y}^{k+1}, \bar{z}^{k+1}) - F(y^*, z^*) - \langle \lambda, B\bar{y}^{k+1} + C\bar{z}^{k+1} - b \rangle \le \frac{\phi(y^*, z^*, \lambda)}{k^2}.

4. Take a special λ and use the KKT conditions.
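The role of k_0 \ge 2L_f/\mu_f in step 3 can be checked directly (a worked verification under the reconstructed parameters): with weights w_k = k + k_0, the weighted z-distance terms telescope provided

  w_k \left( \eta(k+1) + L_f - \mu_f \right) \le w_{k-1} \left( \eta k + L_f \right)
   \iff \eta(2k + k_0) + L_f \le (k + k_0)\,\mu_f,

which holds for every k \ge 1 whenever \eta \le \mu_f/2 and k_0 \ge 2L_f/\mu_f.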
Literature

- [Ouyang et al. 15]: O(L_f/k^2 + C_0/k) with only weak convexity
- [Goldstein et al. 14]: O(1/k^2) with strong convexity on both the y- and z-blocks
- [Li-Lin 16]: O(1/k) is optimal with only weak convexity; impossible to improve O(1/k) without additional assumptions
- [Chambolle-Pock 11, Chambolle-Pock 16, Dang-Lan 14, Bredies-Sun 16]: accelerated first-order methods for bilinear saddle-point problems

Open question: what are the weakest conditions under which O(1/k^2) is attainable?
Numerical experiments

(More results in the paper.)
Accelerated (linearized) ADMM

Tested problem: total-variation regularized image denoising

  minimize_{X,Y} \frac{1}{2}\|X - B\|_F^2 + \mu\|Y\|_1, subject to D(X) = Y.    (TVDN)

- B: observed noisy Cameraman image; D: finite difference operator

Compared methods:
- original ADMM
- accelerated ADMM
- linearized ADMM
- accelerated linearized ADMM
- accelerated Chambolle-Pock
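For intuition, here is a hedged 1-D analogue of (TVDN) solved with plain ADMM: D becomes the forward-difference matrix of a signal, the Y-update is soft-thresholding, and the X-update is a linear solve. The 1-D reduction and all names are illustrative; the talk's experiments use the 2-D Cameraman image.

```python
import numpy as np

def tv_denoise_1d(b, mu, beta=1.0, gamma=1.0, iters=500):
    """ADMM on min_x 0.5*||x - b||^2 + mu*||y||_1  s.t.  Dx = y (1-D differences)."""
    n = b.size
    D = np.diff(np.eye(n), axis=0)              # (n-1) x n forward-difference matrix
    H = np.eye(n) + beta * D.T @ D              # x-subproblem system matrix
    y = np.zeros(n - 1); lam = np.zeros(n - 1)
    for _ in range(iters):
        x = np.linalg.solve(H, b + D.T @ (lam + beta * y))       # x-subproblem
        v = D @ x - lam / beta                                   # y-subproblem:
        y = np.sign(v) * np.maximum(np.abs(v) - mu / beta, 0.0)  # soft-threshold
        lam = lam - gamma * (D @ x - y)                          # dual update
    return x

# Illustrative use: denoise a noisy piecewise-constant signal.
rng = np.random.default_rng(0)
signal = np.repeat([0.0, 1.0, 0.5], 50)
x_hat = tv_denoise_1d(signal + 0.1 * rng.standard_normal(150), mu=0.5)
```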
Performance of compared methods

[Figure: objective minus optimal value, on a log scale, versus iteration number (left, 0 to 500 iterations) and versus running time (right, 0 to 50 seconds), for Accelerated ADMM, Accelerated Linearized ADMM, Nonaccelerated ADMM, Nonaccelerated Linearized ADMM, and Chambolle-Pock.]

- accelerated (linearized) ADMM significantly better than its nonaccelerated counterpart
- (accelerated) ADMM faster than (accelerated) linearized ADMM in iteration count, but the latter takes less time per iteration
Conclusions

- accelerated linearized ALM from O(1/k) to O(1/k^2) under mere convexity
- accelerated (linearized) ADMM from O(1/k) to O(1/k^2) with strong convexity on one block variable
- numerical experiments confirm the acceleration
References

1. Y. Xu. Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM Journal on Optimization, 2017.
2. T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 2014.
3. B. He and X. Yuan. On the acceleration of augmented Lagrangian method for linearly constrained optimization. Optimization Online, 2010.
4. B. Huang, S. Ma, and D. Goldfarb. Accelerated linearized Bregman method. Journal of Scientific Computing, 2013.
5. M. Kang, S. Yun, H. Woo, and M. Kang. Accelerated Bregman method for linearly constrained ℓ1-ℓ2 minimization. Journal of Scientific Computing, 2013.