Accelerated primal-dual methods for linearly constrained convex problems

Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23

Accelerated proximal gradient

For the convex composite problem

  \min_x F(x) := f(x) + g(x)

with f convex and Lipschitz differentiable, and g closed convex (possibly nondifferentiable) and simple.

Proximal gradient:

  x^{k+1} = \arg\min_x \langle \nabla f(x^k), x \rangle + \frac{L_f}{2} \|x - x^k\|^2 + g(x),

convergence rate: F(x^k) - F(x^*) = O(1/k).

Accelerated proximal gradient [Beck-Teboulle 09, Nesterov 14]: with an extrapolated point \hat{x}^k,

  x^{k+1} = \arg\min_x \langle \nabla f(\hat{x}^k), x \rangle + \frac{L_f}{2} \|x - \hat{x}^k\|^2 + g(x),

convergence rate (with smart extrapolation): F(x^k) - F(x^*) = O(1/k^2).

This talk: ways to accelerate primal-dual methods.

2 / 23
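To make the two updates concrete, here is a minimal NumPy sketch of proximal gradient versus an accelerated variant with FISTA-type extrapolation on a LASSO instance; the problem data, the helper names (grad_f, prox_g), and the fixed step size 1/L_f are illustrative assumptions, not taken from the talk.

```python
import numpy as np

# Sketch: proximal gradient vs. accelerated proximal gradient (FISTA-type extrapolation)
# on a LASSO instance: f(x) = 0.5*||Ax - b||^2, g(x) = lam*||x||_1.
rng = np.random.default_rng(0)
m, n = 100, 300
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam = 0.1
Lf = np.linalg.norm(A, 2) ** 2                      # Lipschitz constant of grad f

grad_f = lambda x: A.T @ (A @ x - b)
obj = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.linalg.norm(x, 1)
prox_g = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - lam * t, 0.0)  # soft-threshold

def prox_grad(niter=500):
    x = np.zeros(n)
    for _ in range(niter):
        x = prox_g(x - grad_f(x) / Lf, 1.0 / Lf)
    return x

def accel_prox_grad(niter=500):
    x = x_old = np.zeros(n)
    t = 1.0
    for _ in range(niter):
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        x_hat = x + ((t - 1.0) / t_new) * (x - x_old)   # extrapolated point
        x_old, x = x, prox_g(x_hat - grad_f(x_hat) / Lf, 1.0 / Lf)
        t = t_new
    return x

print(obj(prox_grad()), obj(accel_prox_grad()))
```

On instances like this the accelerated variant typically reaches a given accuracy in far fewer iterations, consistent with the O(1/k) versus O(1/k^2) rates above.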

Part I: accelerated linearized augmented Lagrangian 3 / 23

Affinely constrained composite convex problems

  \min_x F(x) = f(x) + g(x), \quad \text{subject to } Ax = b  (LCP)

f: convex and Lipschitz differentiable; g: closed convex and simple.

Examples:
- nonnegative quadratic programming: f = \frac{1}{2} x^\top Q x + c^\top x, g = \iota_{\mathbb{R}^n_+}
- TV image denoising: \min \{ \frac{1}{2}\|X - B\|_F^2 + \lambda \|Y\|_1, \text{ s.t. } D(X) = Y \}

4 / 23

Augmented Lagrangian method (ALM)

At iteration k,

  x^{k+1} \gets \arg\min_x f(x) + g(x) - \langle \lambda^k, Ax \rangle + \frac{\beta}{2} \|Ax - b\|^2,
  \lambda^{k+1} \gets \lambda^k - \gamma (Ax^{k+1} - b).

- the dual update is augmented dual gradient ascent with step size \gamma
- \beta: penalty parameter; dual gradient Lipschitz constant 1/\beta
- 0 < \gamma < 2\beta: convergence guaranteed
- also popular for (nonlinear, nonconvex) constrained problems
- but the x-subproblem is as difficult as the original problem

5 / 23
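As a small illustration of this template, the sketch below applies ALM to an equality-constrained QP with g absent, so the x-subproblem reduces to a linear solve; the data and the choices beta = gamma = 10 are arbitrary assumptions made for the example.

```python
import numpy as np

# ALM sketch on min 0.5*x'Qx + c'x  s.t. Ax = b (no g), where the x-subproblem
# is solved exactly by a linear solve.
rng = np.random.default_rng(0)
n, m = 50, 10
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)                 # positive definite Hessian
c = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

beta, gamma = 10.0, 10.0                # penalty and dual step size (gamma < 2*beta)
x, lam = np.zeros(n), np.zeros(m)
H = Q + beta * A.T @ A                  # Hessian of the augmented Lagrangian in x
for k in range(200):
    # x-update: minimize f(x) - <lam, Ax> + (beta/2)||Ax - b||^2 exactly
    x = np.linalg.solve(H, A.T @ lam - c + beta * A.T @ b)
    # dual update: augmented dual gradient ascent with step size gamma
    lam = lam - gamma * (A @ x - b)

print("feasibility:", np.linalg.norm(A @ x - b))
```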

Linearized augmented Lagrangian method

Linearize the smooth term f:

  x^{k+1} \gets \arg\min_x \langle \nabla f(x^k), x \rangle + \frac{\eta}{2} \|x - x^k\|^2 + g(x) - \langle \lambda^k, Ax \rangle + \frac{\beta}{2} \|Ax - b\|^2.

Linearize both f and \|Ax - b\|^2:

  x^{k+1} \gets \arg\min_x \langle \nabla f(x^k), x \rangle + g(x) - \langle \lambda^k, Ax \rangle + \beta \langle A^\top r^k, x \rangle + \frac{\eta}{2} \|x - x^k\|^2,

where r^k = Ax^k - b is the residual. Easier updates and a nice convergence speed of O(1/k).

6 / 23
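The following sketch shows the fully linearized update (both f and the augmented term linearized) on the nonnegative QP example from the (LCP) slide, where the prox of g = \iota_{\mathbb{R}^n_+} is simply a projection. The data are synthetic, the dual step is carried over from ALM, and the proximal weight \eta = L_f + \beta\|A\|^2 is an assumption made by analogy with the linearized ADMM condition appearing later in the talk.

```python
import numpy as np

# Fully linearized ALM sketch on a nonnegative QP:
# f(x) = 0.5*x'Qx + c'x, g = indicator of the nonnegative orthant, constraint Ax = b.
rng = np.random.default_rng(1)
n, m = 50, 10
M = rng.standard_normal((n, n))
Q = M @ M.T
c = rng.standard_normal(n)
A = rng.standard_normal((m, n))
x_feas = np.abs(rng.standard_normal(n))
b = A @ x_feas                           # ensures a nonnegative feasible point exists

grad_f = lambda x: Q @ x + c
beta, gamma = 1.0, 1.0                   # 0 < gamma < 2*beta
eta = np.linalg.norm(Q, 2) + beta * np.linalg.norm(A, 2) ** 2  # conservative proximal weight

x, lam = np.zeros(n), np.zeros(m)
for k in range(2000):
    r = A @ x - b                        # residual r^k
    # linearize both f and the augmented term, then take one prox (projection) step
    step = grad_f(x) - A.T @ lam + beta * (A.T @ r)
    x = np.maximum(x - step / eta, 0.0)  # prox of the indicator of R^n_+
    lam = lam - gamma * (A @ x - b)

print("feasibility:", np.linalg.norm(A @ x - b))
```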

Accelerated linearized augmented Lagrangian method

At iteration k,

  \hat{x}^k \gets (1 - \alpha_k) \bar{x}^k + \alpha_k x^k,
  x^{k+1} \gets \arg\min_x \langle \nabla f(\hat{x}^k) - A^\top \lambda^k, x \rangle + g(x) + \frac{\beta_k}{2} \|Ax - b\|^2 + \frac{\eta_k}{2} \|x - x^k\|^2,
  \bar{x}^{k+1} \gets (1 - \alpha_k) \bar{x}^k + \alpha_k x^{k+1},
  \lambda^{k+1} \gets \lambda^k - \gamma_k (Ax^{k+1} - b).

- Inspired by [Lan 12] on accelerated stochastic approximation
- reduces to linearized ALM if \alpha_k = 1, \beta_k = \beta, \eta_k = \eta, \gamma_k = \gamma, \forall k
- convergence rate: O(1/k) if \eta \ge L_f and 0 < \gamma < 2\beta
- adaptive parameters give O(1/k^2) (next slides)

7 / 23

Better numerical performance

[Figure: objective minus optimal value (left, "Objective error") and violation of feasibility (right, "Feasibility violation") versus iteration number, comparing nonaccelerated ALM and accelerated ALM over 1000 iterations.]

- Tested on quadratic programming (subproblems solved exactly)
- Parameters set according to the theorem (see next slide)
- Accelerated ALM significantly better

8 / 23

Guaranteed fast convergence

Assumptions:
- There is a pair of primal-dual solutions (x^*, \lambda^*).
- \nabla f is Lipschitz continuous: \|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\|.

Convergence rate of order O(1/k^2): set the parameters to

  \alpha_k = \frac{2}{k+1}, \quad \gamma_k = k\gamma, \quad \beta_k \ge \frac{\gamma_k}{2}, \quad \eta_k = \frac{\eta}{k}, \quad \forall k,

where \gamma > 0 and \eta \ge 2 L_f. Then

  F(\bar{x}^{k+1}) - F(x^*) \le \frac{1}{k(k+1)} \left( \eta \|x^1 - x^*\|^2 + \frac{4}{\gamma} \|\lambda^*\|^2 \right),
  \|A\bar{x}^{k+1} - b\| \le \frac{1}{k(k+1) \max(1, \|\lambda^*\|)} \left( \eta \|x^1 - x^*\|^2 + \frac{4}{\gamma} \|\lambda^*\|^2 \right).

9 / 23
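Putting the last three slides together, here is a sketch of the accelerated linearized ALM on a small equality-constrained QP with exactly solved subproblems (mirroring the quadratic-programming experiment above), using the parameter schedule of the theorem (\alpha_k = 2/(k+1), \gamma_k = k\gamma, \beta_k \ge \gamma_k/2, \eta_k = \eta/k with \eta \ge 2L_f); the data and the specific constants \gamma and \eta are illustrative assumptions.

```python
import numpy as np

# Accelerated linearized ALM sketch on min 0.5*x'Qx + c'x s.t. Ax = b (g = 0),
# with the x-subproblem solved exactly and the adaptive parameters of the theorem.
rng = np.random.default_rng(0)
n, m = 50, 10
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)
c = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

Lf = np.linalg.norm(Q, 2)
eta0, gamma0 = 2.0 * Lf, 1.0             # eta >= 2*L_f, gamma > 0 (illustrative)
grad_f = lambda x: Q @ x + c

x = x_bar = np.zeros(n)
lam = np.zeros(m)
for k in range(1, 501):
    alpha = 2.0 / (k + 1)
    gamma_k = k * gamma0
    beta_k = gamma_k                     # any beta_k >= gamma_k/2 is allowed
    eta_k = eta0 / k
    x_hat = (1 - alpha) * x_bar + alpha * x
    # x-update: min <grad f(x_hat) - A'lam, x> + (beta_k/2)||Ax-b||^2 + (eta_k/2)||x-x^k||^2
    rhs = A.T @ lam - grad_f(x_hat) + beta_k * (A.T @ b) + eta_k * x
    x = np.linalg.solve(beta_k * (A.T @ A) + eta_k * np.eye(n), rhs)
    x_bar = (1 - alpha) * x_bar + alpha * x
    lam = lam - gamma_k * (A @ x - b)

print("feasibility of averaged iterate:", np.linalg.norm(A @ x_bar - b))
```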

Sketch of proof

Let \Phi(\bar{x}, x, \lambda) = F(\bar{x}) - F(x) - \langle \lambda, A\bar{x} - b \rangle.

1. Fundamental inequality (for any \lambda):

  \Phi(\bar{x}^{k+1}, x, \lambda) \le (1 - \alpha_k) \Phi(\bar{x}^k, x, \lambda)
    - \frac{\alpha_k \eta_k}{2} \left[ \|x^{k+1} - x\|^2 - \|x^k - x\|^2 + \|x^{k+1} - x^k\|^2 \right]
    + \frac{\alpha_k^2 L_f}{2} \|x^{k+1} - x^k\|^2
    + \frac{\alpha_k}{2\gamma_k} \left[ \|\lambda^k - \lambda\|^2 - \|\lambda^{k+1} - \lambda\|^2 + \|\lambda^{k+1} - \lambda^k\|^2 \right]
    - \frac{\alpha_k \beta_k}{\gamma_k^2} \|\lambda^{k+1} - \lambda^k\|^2, \quad \forall k.

2. Plug in \alpha_k = \frac{2}{k+1}, \gamma_k = k\gamma, \beta_k \ge \frac{\gamma_k}{2}, \eta_k = \frac{\eta}{k}, and multiply the inequality by k(k+1):

  k(k+1) \Phi(\bar{x}^{k+1}, x, \lambda) - k(k-1) \Phi(\bar{x}^k, x, \lambda)
    \le \eta \left[ \|x^k - x\|^2 - \|x^{k+1} - x\|^2 \right] + \frac{1}{\gamma} \left[ \|\lambda^k - \lambda\|^2 - \|\lambda^{k+1} - \lambda\|^2 \right].

3. Set \lambda^1 = 0 and sum the inequality over k:

  \Phi(\bar{x}^{k+1}, x, \lambda) \le \frac{1}{k(k+1)} \left( \eta \|x^1 - x\|^2 + \frac{1}{\gamma} \|\lambda\|^2 \right).

4. Take x = x^*, \lambda = \max(1 + \|\lambda^*\|, 2\|\lambda^*\|) \frac{A\bar{x}^{k+1} - b}{\|A\bar{x}^{k+1} - b\|}, and use the optimality condition \Phi(\bar{x}^{k+1}, x^*, \lambda^*) \ge 0, which gives F(\bar{x}^{k+1}) - F(x^*) \ge -\|\lambda^*\| \|A\bar{x}^{k+1} - b\|.

10 / 23

Literature

- [He-Yuan 10]: accelerated ALM to O(1/k^2) for smooth problems
- [Kang et al. 13]: accelerated ALM to O(1/k^2) for nonsmooth problems
- [Huang-Ma-Goldfarb 13]: accelerated linearized ALM (with linearization of the augmented term) to O(1/k^2) for strongly convex problems
- [Li-Lin 16]: with weak convexity only, O(1/k) is optimal if the augmented term is linearized

11 / 23

Part II: accelerated linearized ADMM 12 / 23

Two-block structured problems

The variable is partitioned into two blocks, the smooth part involves one block, and the nonsmooth part is separable:

  \min_{y,z} h(y) + f(z) + g(z), \quad \text{subject to } By + Cz = b  (LCP-2)

f: convex and Lipschitz differentiable; g and h: closed convex and simple.

Example: total-variation regularized regression:

  \min_{y,z} \lambda \|y\|_1 + f(z), \quad \text{s.t. } Dz = y.

13 / 23

Alternating direction method of multipliers (ADMM)

At iteration k,

  y^{k+1} \gets \arg\min_y h(y) - \langle \lambda^k, By \rangle + \frac{\beta}{2} \|By + Cz^k - b\|^2,
  z^{k+1} \gets \arg\min_z f(z) + g(z) - \langle \lambda^k, Cz \rangle + \frac{\beta}{2} \|By^{k+1} + Cz - b\|^2,
  \lambda^{k+1} \gets \lambda^k - \gamma (By^{k+1} + Cz^{k+1} - b).

- 0 < \gamma < \frac{1+\sqrt{5}}{2} \beta: convergence guaranteed [Glowinski-Marrocco 75]
- updating y and z alternatingly is easier than updating them jointly
- but the z-subproblem can still be difficult

14 / 23
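The following sketch instantiates ADMM on the total-variation regularized regression example of the previous slide, written in the (LCP-2) form with h(y) = \mu\|y\|_1, f(z) = \frac{1}{2}\|Xz - w\|^2, g = 0, B = I, C = -D, b = 0; the data, the 1-D difference operator D, and the parameters \beta, \gamma are assumptions chosen for illustration.

```python
import numpy as np

# ADMM sketch for min_{y,z} mu*||y||_1 + 0.5*||Xz - w||^2  s.t.  Dz = y.
rng = np.random.default_rng(0)
m, n = 80, 100
X = rng.standard_normal((m, n))
w = rng.standard_normal(m)
mu, beta, gamma = 0.5, 1.0, 1.0          # 0 < gamma < (1+sqrt(5))/2 * beta
D = (np.eye(n) - np.eye(n, k=1))[:-1]    # 1-D finite-difference operator, (n-1) x n

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

y, z, lam = np.zeros(n - 1), np.zeros(n), np.zeros(n - 1)
Hz = X.T @ X + beta * D.T @ D            # Hessian of the z-subproblem
for k in range(500):
    # y-update: prox of mu*||.||_1 around D z^k + lam/beta
    y = soft(D @ z + lam / beta, mu / beta)
    # z-update: exact solve of the quadratic subproblem
    z = np.linalg.solve(Hz, X.T @ w - D.T @ lam + beta * D.T @ y)
    # dual update
    lam = lam - gamma * (y - D @ z)

print("constraint violation:", np.linalg.norm(D @ z - y))
```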

Accelerated linearized ADMM

At iteration k,

  y^{k+1} \gets \arg\min_y h(y) - \langle \lambda^k, By \rangle + \frac{\beta_k}{2} \|By + Cz^k - b\|^2,
  z^{k+1} \gets \arg\min_z \langle \nabla f(z^k) - C^\top \lambda^k + \beta_k C^\top r^{k+1/2}, z \rangle + g(z) + \frac{\eta_k}{2} \|z - z^k\|^2,
  \lambda^{k+1} \gets \lambda^k - \gamma_k (By^{k+1} + Cz^{k+1} - b),

where r^{k+1/2} = By^{k+1} + Cz^k - b.

- reduces to linearized ADMM if \beta_k = \beta, \eta_k = \eta, \gamma_k = \gamma, \forall k
- convergence rate: O(1/k) if 0 < \gamma \le \beta and \eta \ge L_f + \beta \|C\|^2
- O(1/k^2) with adaptive parameters and strong convexity on the z-part (next two slides)

15 / 23

Accelerated convergence speed

Assumptions:
- existence of a primal-dual solution (y^*, z^*, \lambda^*)
- \nabla f Lipschitz continuous: \|\nabla f(\hat{z}) - \nabla f(\tilde{z})\| \le L_f \|\hat{z} - \tilde{z}\|
- f strongly convex with modulus \mu_f (not required for the y-part)

Convergence rate of order O(1/k^2): set the parameters as

  \beta_k = \gamma_k = (k+1)\gamma, \quad \eta_k = (k+1)\eta + L_f, \quad \forall k,

with \gamma > 0 and \gamma < \eta \le \mu_f/2. Then

  \max\left( \|z^k - z^*\|^2, \; F(\bar{y}^k, \bar{z}^k) - F^*, \; \|B\bar{y}^k + C\bar{z}^k - b\| \right) \le O(1/k^2),

where F(y, z) = h(y) + f(z) + g(z) and F^* = F(y^*, z^*).

16 / 23
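Below is a sketch of the accelerated linearized ADMM with the parameter schedule above, applied to a TV-regularized ridge regression (the ridge term supplies the strong convexity of f that the theorem requires), with B = I, C = -D, b = 0; the data and the constants \gamma, \eta are illustrative choices consistent with the stated conditions, not values from the talk.

```python
import numpy as np

# Accelerated linearized ADMM sketch for
#   min_{y,z} mu*||y||_1 + 0.5*||Xz - w||^2 + 0.5*rho*||z||^2   s.t.  Dz = y,
# with beta_k = gamma_k = (k+1)*gamma and eta_k = (k+1)*eta + L_f.
rng = np.random.default_rng(0)
m, n = 80, 100
X = rng.standard_normal((m, n))
w = rng.standard_normal(m)
mu, rho = 0.5, 1.0
D = (np.eye(n) - np.eye(n, k=1))[:-1]     # finite-difference operator, (n-1) x n

Lf = np.linalg.norm(X, 2) ** 2 + rho      # Lipschitz constant of grad f
mu_f = rho                                # strong convexity modulus of f
eta, gamma = mu_f / 2.0, 0.1              # illustrative: gamma small, eta <= mu_f/2

grad_f = lambda z: X.T @ (X @ z - w) + rho * z
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

y, z, lam = np.zeros(n - 1), np.zeros(n), np.zeros(n - 1)
for k in range(1, 501):
    beta_k = gamma_k = (k + 1) * gamma
    eta_k = (k + 1) * eta + Lf
    # y-update: exact prox step
    y = soft(D @ z + lam / beta_k, mu / beta_k)
    # z-update: linearize f and take one proximal step with weight eta_k (g = 0 here)
    r_half = y - D @ z                    # r^{k+1/2} = B y^{k+1} + C z^k - b
    z = z - (grad_f(z) + D.T @ lam - beta_k * (D.T @ r_half)) / eta_k
    # dual update
    lam = lam - gamma_k * (y - D @ z)

print("constraint violation:", np.linalg.norm(D @ z - y))
```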

Sketch of proof

1. Fundamental inequality from the optimality conditions of each iterate (for any (y, z) with By + Cz = b and any \lambda):

  F(y^{k+1}, z^{k+1}) - F(y, z) - \langle \lambda, By^{k+1} + Cz^{k+1} - b \rangle
    \le \left\langle \tfrac{1}{\gamma_k} (\lambda^{k+1} - \lambda^k), \; \lambda - \lambda^k + \tfrac{\beta_k}{\gamma_k} (\lambda^k - \lambda^{k+1}) - \beta_k C(z^{k+1} - z^k) \right\rangle
    + \tfrac{L_f}{2} \|z^{k+1} - z^k\|^2 - \tfrac{\mu_f}{2} \|z^k - z\|^2 - \eta_k \langle z^{k+1} - z, z^{k+1} - z^k \rangle, \quad \forall k.

2. Plug in the parameters and bound the cross terms:

  F(y^{k+1}, z^{k+1}) - F(y^*, z^*) - \langle \lambda, By^{k+1} + Cz^{k+1} - b \rangle
    + \tfrac{1}{2} \left( \eta(k+1) + L_f \right) \|z^{k+1} - z^*\|^2 + \tfrac{1}{2\gamma(k+1)} \|\lambda - \lambda^{k+1}\|^2
    \le \tfrac{1}{2} \left( \eta(k+1) + L_f - \mu_f \right) \|z^k - z^*\|^2 + \tfrac{1}{2\gamma(k+1)} \|\lambda - \lambda^k\|^2.

3. Multiply by k + k_0 (here k_0 \ge \frac{2L_f}{\mu_f}) and sum the inequality over k:

  F(\bar{y}^{k+1}, \bar{z}^{k+1}) - F(y^*, z^*) - \langle \lambda, B\bar{y}^{k+1} + C\bar{z}^{k+1} - b \rangle \le \frac{\phi(y^*, z^*, \lambda)}{k^2}.

4. Take a special \lambda and use the KKT conditions.

17 / 23

Literature

- [Ouyang et al. 15]: O(L_f/k^2 + C_0/k) with only weak convexity
- [Goldstein et al. 14]: O(1/k^2) with strong convexity on both y and z
- [Li-Lin 16]: O(1/k) is optimal with only weak convexity; impossible to improve O(1/k) without additional assumptions
- [Chambolle-Pock 11, Chambolle-Pock 16, Dang-Lan 14, Bredies-Sun 16]: accelerated first-order methods for bilinear saddle-point problems

Open question: what are the weakest conditions under which O(1/k^2) is achievable?

18 / 23

Numerical experiments (More results in paper) 19 / 23

Accelerated (linearized) ADMM

Tested problem: total-variation regularized image denoising

  \min_{X,Y} \frac{1}{2} \|X - B\|_F^2 + \mu \|Y\|_1, \quad \text{subject to } DX = Y,  (TVDN)

where B is the observed noisy Cameraman image and D is the finite difference operator.

Compared methods:
- original ADMM
- accelerated ADMM
- linearized ADMM
- accelerated linearized ADMM
- accelerated Chambolle-Pock

20 / 23

Performance of compared methods

[Figure: objective minus optimal value versus iteration number (left, 0-500 iterations) and versus running time in seconds (right, 0-50 s) for accelerated ADMM, accelerated linearized ADMM, nonaccelerated ADMM, nonaccelerated linearized ADMM, and Chambolle-Pock.]

- Accelerated (linearized) ADMM is significantly better than the nonaccelerated version
- (Accelerated) ADMM is faster than (accelerated) linearized ADMM in terms of iteration count, but the latter takes less time

21 / 23

Conclusions

- accelerated linearized ALM from O(1/k) to O(1/k^2) with mere convexity
- accelerated (linearized) ADMM from O(1/k) to O(1/k^2) with strong convexity on one block variable
- numerical experiments confirming the acceleration

22 / 23

References

1. Y. Xu. Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming, SIAM Journal on Optimization, 2017.
2. T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk. Fast alternating direction optimization methods, SIAM Journal on Imaging Sciences, 2014.
3. B. He and X. Yuan. On the acceleration of augmented Lagrangian method for linearly constrained optimization, Optimization Online, 2010.
4. B. Huang, S. Ma, and D. Goldfarb. Accelerated linearized Bregman method, Journal of Scientific Computing, 2013.
5. M. Kang, S. Yun, H. Woo, and M. Kang. Accelerated Bregman method for linearly constrained \ell_1-\ell_2 minimization, Journal of Scientific Computing, 2013.

23 / 23