Dual and primal-dual methods

ELE 538B: Large-Scale Optimization for Data Science
Dual and primal-dual methods
Yuxin Chen, Princeton University, Spring 2018

Outline
- Dual proximal gradient method
- Primal-dual proximal gradient method

Dual proximal gradient method

Constrained convex optimization

minimize_x  f(x)   subject to   Ax + b ∈ C

where f is convex and C is a convex set. Projection onto the feasible set {x : Ax + b ∈ C} could be highly nontrivial (even when projection onto C itself is easy).

Constrained convex optimization

More generally,

minimize_x  f(x) + h(Ax)

where f and h are convex. The proximal operator w.r.t. h̃(x) := h(Ax) could be difficult to evaluate (even when prox_h is inexpensive).

A possible route: dual formulation

minimize_x  f(x) + h(Ax)

Add an auxiliary variable z:

minimize_{x,z}  f(x) + h(z)   subject to   Ax = z

Dual formulation:

maximize_λ  min_{x,z}  L(x, z, λ),   where  L(x, z, λ) := f(x) + h(z) + ⟨λ, Ax − z⟩  is the Lagrangian.

A possible route: dual formulation

maximize_λ  min_{x,z}  f(x) + h(z) + ⟨λ, Ax − z⟩        (decouple x and z)

⟺  maximize_λ  min_x { f(x) + ⟨A^⊤λ, x⟩ } + min_z { h(z) − ⟨λ, z⟩ }

⟺  maximize_λ  −f*(−A^⊤λ) − h*(λ)

where f* (resp. h*) is the Fenchel conjugate of f (resp. h).
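
To make the conjugate calculus concrete, here is a small worked instance (my own illustration, not from the slides): take f(x) = (1/2)‖x − c‖₂² and h(z) = ‖z − b‖₁. Then f*(y) = (1/2)‖y‖₂² + ⟨c, y⟩ and h*(λ) = ⟨b, λ⟩ + 1_{‖λ‖_∞ ≤ 1}(λ), so the dual problem maximize_λ −f*(−A^⊤λ) − h*(λ) reads

maximize_λ  −(1/2)‖A^⊤λ‖₂² + ⟨c, A^⊤λ⟩ − ⟨b, λ⟩   subject to   ‖λ‖_∞ ≤ 1

i.e. a smooth concave objective over a box; this same toy instance is reused in the code sketches below.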

Primal vs. dual problems

(primal)  minimize_x  f(x) + h(Ax)
(dual)    minimize_λ  f*(−A^⊤λ) + h*(λ)

The dual formulation is useful if
- the proximal operator w.r.t. h* is cheap (recall the Moreau decomposition prox_h(x) + prox_{h*}(x) = x)
- f* is smooth, which holds whenever f is strongly convex
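
As a quick numerical sanity check of the Moreau decomposition, here is a minimal sketch of my own (assuming h = ‖·‖₁ with parameter 1, so prox_h is soft-thresholding and prox_{h*} is projection onto the ℓ_∞ unit ball):

```python
import numpy as np

# Moreau decomposition check for h = ||.||_1:
# prox_h is soft-thresholding, prox_{h*} is projection onto {||lam||_inf <= 1}
x = np.random.default_rng(0).standard_normal(5)
prox_h = np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)    # soft-threshold with parameter 1
prox_hstar = np.clip(x, -1.0, 1.0)                        # projection onto the l_inf ball
print(np.allclose(prox_h + prox_hstar, x))                # True: prox_h(x) + prox_{h*}(x) = x
```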

Dual proximal gradient methods

Apply the proximal gradient method to the dual problem:

Algorithm 9.1  Dual proximal gradient algorithm
1: for t = 0, 1, ... do
2:   λ^{t+1} = prox_{η_t h*}( λ^t + η_t A ∇f*(−A^⊤λ^t) )

Let Q(λ) := −f*(−A^⊤λ) − h*(λ) and Q_opt := max_λ Q(λ); then Q_opt − Q(λ^t) ≲ 1/t.

Primal representation of dual proximal gradient methods

Algorithm 9.1 admits a more explicit primal representation:

Algorithm 9.2  Dual proximal gradient algorithm (primal representation)
1: for t = 0, 1, ... do
2:   x^t = argmin_x { f(x) + ⟨A^⊤λ^t, x⟩ }
3:   λ^{t+1} = λ^t + η_t A x^t − η_t prox_{η_t^{-1} h}( η_t^{-1} λ^t + A x^t )

{x^t} is a primal sequence, which is nevertheless not always feasible.
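
A minimal runnable sketch of Algorithm 9.2 on the toy instance introduced above (f(x) = (1/2)‖x − c‖₂², h(z) = ‖z − b‖₁; the sizes and variable names are my own choices, not from the slides). For this f the inner argmin has a closed form, and the λ-update is written in the equivalent prox_{η h*} form justified on the next slides:

```python
import numpy as np

# Toy instance:  minimize_x  0.5*||x - c||^2 + ||A x - b||_1
# f(x) = 0.5*||x - c||^2 is 1-strongly convex; h(z) = ||z - b||_1 has a cheap prox.
rng = np.random.default_rng(0)
m, n = 30, 20
A, b, c = rng.standard_normal((m, n)), rng.standard_normal(m), rng.standard_normal(n)

eta = 1.0 / np.linalg.norm(A, 2) ** 2    # step size 1/L, with L = ||A||_2^2 / mu and mu = 1
lam = np.zeros(m)
for t in range(500):
    x = c - A.T @ lam                    # x^t = argmin_x f(x) + <A^T lam^t, x>
    # lam^{t+1} = prox_{eta h*}(lam^t + eta A x^t); here h*(lam) = <lam, b> + indicator{||lam||_inf <= 1},
    # so the prox is a shift followed by a clip
    lam = np.clip(lam + eta * (A @ x - b), -1.0, 1.0)
print(0.5 * np.sum((x - c) ** 2) + np.sum(np.abs(A @ x - b)))   # primal objective at x^t
```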

Justification of primal representation

By the definition of x^t, −A^⊤λ^t ∈ ∂f(x^t). This, together with the conjugate subgradient theorem and the smoothness of f* (guaranteed by the strong convexity of f), yields x^t = ∇f*(−A^⊤λ^t). Therefore, the dual proximal gradient update rule can be rewritten as

λ^{t+1} = prox_{η_t h*}( λ^t + η_t A x^t )

Justification of primal representation (cont.)

Moreover, from the extended Moreau decomposition, we know

prox_{η_t h*}( λ^t + η_t A x^t ) = λ^t + η_t A x^t − η_t prox_{η_t^{-1} h}( η_t^{-1} λ^t + A x^t )

⟹  λ^{t+1} = λ^t + η_t A x^t − η_t prox_{η_t^{-1} h}( η_t^{-1} λ^t + A x^t )

Accuracy of primal sequence

One can control primal accuracy via dual accuracy:

Lemma 9.1  Let x_λ := argmin_x { f(x) + ⟨A^⊤λ, x⟩ }, and suppose f is μ-strongly convex. Then

‖x* − x_λ‖₂²  ≤  2( Q_opt − Q(λ) ) / μ

Consequence: ‖x* − x^t‖₂² ≲ 1/t.

Proof of Lemma 9.1

Recall that the Lagrangian is given by

L(x, z, λ) := f̃(x, λ) + h̃(z, λ),   where  f̃(x, λ) := f(x) + ⟨A^⊤λ, x⟩  and  h̃(z, λ) := h(z) − ⟨λ, z⟩

For any λ, define x_λ := argmin_x f̃(x, λ) and z_λ := argmin_z h̃(z, λ) (non-rigorous). Then by strong convexity,

L(x*, z*, λ) − L(x_λ, z_λ, λ)  ≥  f̃(x*, λ) − f̃(x_λ, λ)  ≥  (μ/2) ‖x* − x_λ‖₂²

In addition, since Ax* = z*, one has

L(x*, z*, λ) = f(x*) + h(z*) + ⟨λ, Ax* − z*⟩ = f(x*) + h(Ax*) = F_opt = Q_opt   (strong duality)

This, combined with L(x_λ, z_λ, λ) = Q(λ), gives Q_opt − Q(λ) ≥ (μ/2) ‖x* − x_λ‖₂², as claimed.

Accelerated dual proximal gradient methods

One can apply FISTA to the dual problem to improve convergence:

Algorithm 9.3  Accelerated dual proximal gradient algorithm
1: for t = 0, 1, ... do
2:   λ^{t+1} = prox_{η_t h*}( w^t + η_t A ∇f*(−A^⊤w^t) )
3:   θ_{t+1} = ( 1 + √(1 + 4θ_t²) ) / 2
4:   w^{t+1} = λ^{t+1} + ((θ_t − 1)/θ_{t+1}) ( λ^{t+1} − λ^t )

Applying FISTA theory and Lemma 9.1 gives Q_opt − Q(λ^t) ≲ 1/t² and ‖x* − x^t‖₂² ≲ 1/t².

Primal representation of accelerated dual proximal gradient methods

Algorithm 9.3 admits a more explicit primal representation:

Algorithm 9.4  Accelerated dual proximal gradient algorithm (primal representation)
1: for t = 0, 1, ... do
2:   x^t = argmin_x { f(x) + ⟨A^⊤w^t, x⟩ }
3:   λ^{t+1} = w^t + η_t A x^t − η_t prox_{η_t^{-1} h}( η_t^{-1} w^t + A x^t )
4:   θ_{t+1} = ( 1 + √(1 + 4θ_t²) ) / 2
5:   w^{t+1} = λ^{t+1} + ((θ_t − 1)/θ_{t+1}) ( λ^{t+1} − λ^t )
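
The momentum steps translate into only a few extra lines of code; here is a sketch of Algorithm 9.4 on the same toy instance as the earlier dual proximal gradient sketch (again my own illustration, with the λ-update in the equivalent prox_{η h*} form):

```python
import numpy as np

# Same toy instance:  f(x) = 0.5*||x - c||^2,  h(z) = ||z - b||_1
rng = np.random.default_rng(0)
m, n = 30, 20
A, b, c = rng.standard_normal((m, n)), rng.standard_normal(m), rng.standard_normal(n)
eta = 1.0 / np.linalg.norm(A, 2) ** 2

lam, w, theta = np.zeros(m), np.zeros(m), 1.0
for t in range(500):
    x = c - A.T @ w                                        # x^t = argmin_x f(x) + <A^T w^t, x>
    lam_next = np.clip(w + eta * (A @ x - b), -1.0, 1.0)   # lam^{t+1} = prox_{eta h*}(w^t + eta A x^t)
    theta_next = (1 + np.sqrt(1 + 4 * theta ** 2)) / 2
    w = lam_next + ((theta - 1) / theta_next) * (lam_next - lam)
    lam, theta = lam_next, theta_next
print(0.5 * np.sum((x - c) ** 2) + np.sum(np.abs(A @ x - b)))   # primal objective at x^t
```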

Primal-dual proximal gradient method

Nonsmooth optimization

minimize_x  f(x) + h(Ax)

where f and h are closed and convex; both f and h might be non-smooth; both f and h admit inexpensive proximal operators.

Primal-dual approaches?

minimize_x  f(x) + h(Ax)

So far we have discussed the proximal method (resp. dual proximal method), which essentially updates only the primal (resp. dual) variables. Question: can we update both primal and dual variables simultaneously and take advantage of both prox_f and prox_h?

A saddle-point formulation

To this end, we first derive a saddle-point formulation that involves both primal and dual variables.

minimize_x  f(x) + h(Ax)

Add an auxiliary variable z:

minimize_{x,z}  f(x) + h(z)   subject to   Ax = z

maximize_λ  min_{x,z}  f(x) + h(z) + ⟨λ, Ax − z⟩

maximize_λ  min_x  f(x) + ⟨λ, Ax⟩ − h*(λ)

minimize_x  max_λ  f(x) + ⟨λ, Ax⟩ − h*(λ)        (saddle-point problem)

A saddle-point formulation

minimize_x  max_λ  f(x) + ⟨λ, Ax⟩ − h*(λ)        (9.1)

One can then consider updating the primal variable x and the dual variable λ simultaneously. We'll first examine the optimality condition for (9.1), which in turn suggests how to jointly update the primal and dual variables.

Optimality condition

minimize_x  max_λ  f(x) + ⟨λ, Ax⟩ − h*(λ)

Optimality condition:

0 ∈ ∂f(x) + A^⊤λ   and   0 ∈ −Ax + ∂h*(λ)

⟺  0 ∈ [[0, A^⊤], [−A, 0]] [x; λ] + [∂f(x); ∂h*(λ)]  =:  F(x, λ)        (9.2)

Key idea: iteratively update (x, λ) to reach a point obeying 0 ∈ F(x, λ).

How to solve 0 ∈ F(x) in general?

In general, finding a solution to 0 ∈ F(x) (called a monotone inclusion problem if F is maximal monotone) is equivalent to x ∈ (I + ηF)(x), i.e. to finding a fixed point of the resolvent (I + ηF)^{-1}:

x = (I + ηF)^{-1}(x)

This suggests the natural fixed-point / resolvent iteration:

x^{t+1} = (I + ηF)^{-1}(x^t),   t = 0, 1, ...

Aside: monotone operators (Ryu, Boyd '16)

- A relation F is called monotone if ⟨u − v, x − y⟩ ≥ 0 for all (x, u), (y, v) ∈ F
- A relation F is called maximal monotone if there is no monotone operator that properly contains it

Proximal point method

x^{t+1} = (I + η_t F)^{-1}(x^t),   t = 0, 1, ...

If F = ∂f for some convex function f, then this proximal point method becomes

x^{t+1} = prox_{η_t f}(x^t),   t = 0, 1, ...

which is useful when prox_{η_t f} is cheap.
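
A one-dimensional sketch of the proximal point iteration (my own toy example, not from the slides): with F = ∂f for f(x) = |x|, the resolvent is soft-thresholding, and the iterates reach the minimizer 0 in finitely many steps:

```python
import numpy as np

# Proximal point method for f(x) = |x| in 1-D: the resolvent (I + eta*∂f)^{-1} is soft-thresholding
x, eta = 5.0, 0.75
for t in range(10):
    x = np.sign(x) * max(abs(x) - eta, 0.0)   # x^{t+1} = prox_{eta f}(x^t)
print(x)                                      # -> 0.0, the minimizer of |x|
```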

Back to primal-dual approaches

Recall that we want to solve

0 ∈ [[0, A^⊤], [−A, 0]] [x; λ] + [∂f(x); ∂h*(λ)]  =:  F(x, λ)

Issue with the proximal point method: computing (I + ηF)^{-1} is in general difficult.

Back to primal-dual approaches

Observation: in practice we may often consider splitting F into two operators,

0 ∈ 𝒜(x, λ) + ℬ(x, λ)   with   𝒜(x, λ) = [[0, A^⊤], [−A, 0]] [x; λ],   ℬ(x, λ) = [∂f(x); ∂h*(λ)]        (9.3)

- (I + η𝒜)^{-1} can be computed by solving linear systems
- (I + ηℬ)^{-1} is easy if prox_f and prox_{h*} are both inexpensive

Solution: design update rules based on (I + η𝒜)^{-1} and (I + ηℬ)^{-1} instead of (I + ηF)^{-1}.

Operator splitting via Cayley operators

We now introduce one principled approach based on operator splitting:

find x  such that  0 ∈ F(x) = 𝒜(x) + ℬ(x)        (operator splitting)

Let R_𝒜 := (I + η𝒜)^{-1} and R_ℬ := (I + ηℬ)^{-1} be the resolvents, and let C_𝒜 := 2R_𝒜 − I and C_ℬ := 2R_ℬ − I be the Cayley operators.

Lemma 9.2
0 ∈ 𝒜(x) + ℬ(x)   (equivalently, x = R_{𝒜+ℬ}(x))   ⟺   C_𝒜 C_ℬ(z) = z with x = R_ℬ(z)        (9.4)

So it comes down to finding a fixed point of C_𝒜 C_ℬ.

Operator splitting via Cayley operators

x = R_{𝒜+ℬ}(x)   ⟺   C_𝒜 C_ℬ(z) = z

Advantage: this allows us to apply C_𝒜 (resp. R_𝒜) and C_ℬ (resp. R_ℬ) sequentially, instead of computing R_{𝒜+ℬ}.

Proof of Lemma 9.2

C_𝒜 C_ℬ(z) = z is equivalent to the existence of x, z̃, x̃ satisfying

x = R_ℬ(z)        (9.5a)
z̃ = 2x − z        (9.5b)
x̃ = R_𝒜(z̃)       (9.5c)
z = 2x̃ − z̃        (9.5d)

From (9.5b) and (9.5d), we see that x̃ = x, which together with (9.5d) gives

2x = z + z̃        (9.6)

Proof of Lemma 9.2 (cont.)

Recall that z ∈ x + ηℬ(x) and z̃ ∈ x̃ + η𝒜(x̃) = x + η𝒜(x). Adding these two facts and using (9.6), we get

2x = z + z̃ ∈ 2x + ηℬ(x) + η𝒜(x)   ⟹   0 ∈ 𝒜(x) + ℬ(x)

Douglas-Rachford splitting

How do we find points obeying z = C_𝒜 C_ℬ(z)? First attempt: the fixed-point iteration

z^{t+1} = C_𝒜 C_ℬ(z^t)

Unfortunately, it may not converge in general. Douglas-Rachford splitting uses the damped (averaged) fixed-point iteration

z^{t+1} = (1/2)( I + C_𝒜 C_ℬ )(z^t)

which converges whenever a solution to 0 ∈ 𝒜(x) + ℬ(x) exists!

More explicit expression for D-R splitting

The Douglas-Rachford update rule z^{t+1} = (1/2)(I + C_𝒜 C_ℬ)(z^t) is essentially:

x^{t+1/2} = R_ℬ(z^t)
z^{t+1/2} = 2x^{t+1/2} − z^t
x^{t+1}   = R_𝒜( z^{t+1/2} )
z^{t+1}   = (1/2)( z^t + 2x^{t+1} − z^{t+1/2} ) = z^t + x^{t+1} − x^{t+1/2}

where x^{t+1/2} and z^{t+1/2} are auxiliary variables.

More explicit expression for D-R splitting (cont.)

Or equivalently,

x^{t+1/2} = R_ℬ(z^t)
x^{t+1}   = R_𝒜( 2x^{t+1/2} − z^t )
z^{t+1}   = z^t + x^{t+1} − x^{t+1/2}
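
Here is a minimal Douglas-Rachford sketch in the three-line form above, applied to a toy basis pursuit problem of my own (not the lecture's example): 𝒜 = ∂‖·‖₁ (so R_𝒜 is soft-thresholding with parameter η) and ℬ is the normal cone of the affine set {x : Mx = y} (so R_ℬ is the projection onto that set); M, y, and the helper names are assumptions for illustration.

```python
import numpy as np

# Toy basis pursuit:  minimize ||x||_1  subject to  M x = y
rng = np.random.default_rng(0)
m, n = 10, 40
M = rng.standard_normal((m, n))
x_true = np.zeros(n); x_true[:3] = [1.0, -2.0, 0.5]
y = M @ x_true

Mpinv = np.linalg.pinv(M)
proj = lambda v: v - Mpinv @ (M @ v - y)                           # R_B: projection onto {Mx = y}
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)    # R_A: prox of t*||.||_1

eta, z = 1.0, np.zeros(n)
for t in range(500):
    x_half = proj(z)                       # x^{t+1/2} = R_B(z^t)
    x_new = soft(2 * x_half - z, eta)      # x^{t+1}   = R_A(2 x^{t+1/2} - z^t)
    z = z + x_new - x_half                 # z^{t+1}   = z^t + x^{t+1} - x^{t+1/2}
print(np.round(x_half, 3))                 # x^{t+1/2} converges to a minimum-l1-norm solution of Mx = y
```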

Douglas-Rachford primal-dual splitting

minimize_x  max_λ  f(x) + ⟨λ, Ax⟩ − h*(λ)

Applying Douglas-Rachford splitting to the operators in (9.3) yields

x^{t+1/2} = prox_{ηf}(p^t)
λ^{t+1/2} = prox_{ηh*}(q^t)
[x^{t+1}; λ^{t+1}] = [[I, ηA^⊤], [−ηA, I]]^{-1} [2x^{t+1/2} − p^t; 2λ^{t+1/2} − q^t]
p^{t+1} = p^t + x^{t+1} − x^{t+1/2}
q^{t+1} = q^t + λ^{t+1} − λ^{t+1/2}
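
Below is a sketch of this primal-dual DR scheme on a toy instance of the first example that follows (minimize ‖x‖₂ + γ‖Ax − b‖₁); the problem sizes, step size, and helper names are my own choices, and the block linear system is solved by caching a dense inverse purely for illustration.

```python
import numpy as np

# Toy instance:  minimize_x  ||x||_2 + gamma * ||A x - b||_1
#   f(x) = ||x||_2  (prox: block soft-thresholding)
#   h(y) = gamma*||y - b||_1, so h*(lam) = <lam, b> + indicator{||lam||_inf <= gamma}
rng = np.random.default_rng(1)
m, n, gamma, eta = 50, 30, 1.0, 0.5
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

def prox_f(v, t):                       # prox of t*||.||_2
    nv = np.linalg.norm(v)
    return np.zeros_like(v) if nv <= t else (1 - t / nv) * v

def prox_hstar(v, t):                   # prox of t*h*: shift by t*b, then project onto the l_inf ball
    return np.clip(v - t * b, -gamma, gamma)

K = np.block([[np.eye(n), eta * A.T], [-eta * A, np.eye(m)]])
Kinv = np.linalg.inv(K)                 # resolvent of the skew part: solve the 2x2 block system

p, q = np.zeros(n), np.zeros(m)
for t in range(1000):
    x_half, lam_half = prox_f(p, eta), prox_hstar(q, eta)
    sol = Kinv @ np.concatenate([2 * x_half - p, 2 * lam_half - q])
    x_new, lam_new = sol[:n], sol[n:]
    p += x_new - x_half
    q += lam_new - lam_half
print(np.linalg.norm(x_half) + gamma * np.sum(np.abs(A @ x_half - b)))   # primal objective
```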

Example

minimize_x  ‖x‖₂ + γ‖Ax − b‖₁

i.e. minimize_x f(x) + g(Ax) with f(x) := ‖x‖₂ and g(y) := γ‖y − b‖₁

[Figure: relative error ‖x^k − x*‖/‖x*‖ vs. iteration number k, comparing ADMM, primal DR, and primal-dual DR; O'Connor, Vandenberghe '14]

Example

minimize_x  ‖Kx − b‖₁ + γ‖Dx‖_iso   subject to   0 ≤ x ≤ 1

where ‖·‖_iso is a certain ℓ₂/ℓ₁ norm; i.e. minimize_x f(x) + g(Ax) with A = [K; D], f(x) := 1_{0 ≤ x ≤ 1}(x), and g(y₁, y₂) := ‖y₁ − b‖₁ + γ‖y₂‖_iso

[Figure: relative objective error (f(x^k) − f*)/f* vs. iteration number k, comparing CP, ADMM, primal DR, and primal-dual DR; O'Connor, Vandenberghe '14]

Reference

[1] Optimization methods for large-scale systems, EE236C lecture notes, L. Vandenberghe, UCLA.
[2] Convex optimization, EE364B lecture notes, S. Boyd, Stanford.
[3] Mathematical optimization, MATH301 lecture notes, E. Candes, Stanford.
[4] First-order methods in optimization, A. Beck, Vol. 25, SIAM, 2017.
[5] Primal-dual decomposition by operator splitting and applications to image deblurring, D. O'Connor, L. Vandenberghe, SIAM Journal on Imaging Sciences, 2014.
[6] On the numerical solution of heat conduction problems in two and three space variables, J. Douglas, H. Rachford, Transactions of the American Mathematical Society, 1956.
[7] A primer on monotone operator methods, E. Ryu, S. Boyd, Appl. Comput. Math., 2016.