ELE 538B: Large-Scale Optimization for Data Science
Dual and primal-dual methods
Yuxin Chen, Princeton University, Spring 2018
Outline
• Dual proximal gradient method
• Primal-dual proximal gradient method
Dual proximal gradient method
Constrained convex optimization

$\text{minimize}_x \; f(x) \qquad \text{subject to} \quad Ax + b \in \mathcal{C}$

where $f$ is convex and $\mathcal{C}$ is a convex set
• projection onto the feasible set $\{x : Ax + b \in \mathcal{C}\}$ could be highly nontrivial (even when projection onto $\mathcal{C}$ is easy)
Constrained convex optimization

More generally,
$\text{minimize}_x \; f(x) + h(Ax)$
where $f$ and $h$ are convex
• the proximal operator w.r.t. $\tilde h(x) := h(Ax)$ could be difficult (even when $\text{prox}_h$ is inexpensive)
A possible route: dual formulation

$\text{minimize}_x \; f(x) + h(Ax)$

add an auxiliary variable $z$:
$\text{minimize}_{x,z} \; f(x) + h(z) \qquad \text{subject to} \quad Ax = z$

dual formulation:
$\text{maximize}_\lambda \; \min_{x,z} \; \underbrace{f(x) + h(z) + \langle \lambda, Ax - z\rangle}_{:= L(x,z,\lambda) \text{ (Lagrangian)}}$
A possible route: dual formulation

$\text{maximize}_\lambda \; \min_{x,z} \; f(x) + h(z) + \langle\lambda, Ax - z\rangle$ (decouple $x$ and $z$)
$\iff \quad \text{maximize}_\lambda \; \min_x \big\{ \langle A^\top\lambda, x\rangle + f(x) \big\} + \min_z \big\{ h(z) - \langle\lambda, z\rangle \big\}$
$\iff \quad \text{maximize}_\lambda \; -f^*(-A^\top\lambda) - h^*(\lambda)$
where $f^*$ (resp. $h^*$) is the Fenchel conjugate of $f$ (resp. $h$)
Primal vs. dual problems

(primal) $\text{minimize}_x \; f(x) + h(Ax)$
(dual) $\text{minimize}_\lambda \; f^*(-A^\top\lambda) + h^*(\lambda)$

The dual formulation is useful if
• the proximal operator w.r.t. $h^*$ is cheap (recall the Moreau decomposition $\text{prox}_h(x) + \text{prox}_{h^*}(x) = x$)
• $f^*$ is smooth, which is guaranteed when $f$ is strongly convex
Dual proximal gradient methods

Apply the proximal gradient method to the dual problem:

Algorithm 9.1 Dual proximal gradient algorithm
1: for t = 0, 1, ... do
2:   $\lambda^{t+1} = \text{prox}_{\eta_t h^*}\big( \lambda^t + \eta_t A \nabla f^*(-A^\top \lambda^t) \big)$

let $Q(\lambda) := -f^*(-A^\top\lambda) - h^*(\lambda)$ and $Q_{\text{opt}} := \max_\lambda Q(\lambda)$; then $Q_{\text{opt}} - Q(\lambda^t) \lesssim 1/t$
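To make the update rule concrete, here is a minimal numerical sketch of Algorithm 9.1 under an assumed toy instance: $f(x) = \frac{1}{2}\|x - c\|_2^2$ (so $\nabla f^*(y) = y + c$) and $h = \gamma\|\cdot\|_1$ (so $\text{prox}_{\eta h^*}$ is projection onto the $\ell_\infty$-ball of radius $\gamma$). The data $A$, $c$ and the step size are illustrative choices, not part of the slides.

```python
import numpy as np

# Toy instance: f(x) = 0.5||x - c||^2  =>  f*(y) = 0.5||y||^2 + <c, y>, grad f*(y) = y + c
#               h(z) = gamma*||z||_1   =>  h*(lam) = indicator{||lam||_inf <= gamma},
#                                          so prox_{eta h*} = projection onto that box
rng = np.random.default_rng(0)
m, n, gamma = 20, 50, 0.1
A = rng.standard_normal((m, n))
c = rng.standard_normal(n)

grad_f_conj = lambda y: y + c
prox_h_conj = lambda lam: np.clip(lam, -gamma, gamma)

eta = 1.0 / np.linalg.norm(A, 2) ** 2  # step 1/L; L = ||A||^2 since f is 1-strongly convex
lam = np.zeros(m)
for t in range(500):
    # dual proximal gradient step (Algorithm 9.1)
    lam = prox_h_conj(lam + eta * A @ grad_f_conj(-A.T @ lam))

x = grad_f_conj(-A.T @ lam)  # recover the primal iterate x^t = grad f*(-A^T lam^t)
```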
Primal representation of dual proximal gradient methods

Algorithm 9.1 admits a more explicit primal representation:

Algorithm 9.2 Dual proximal gradient algorithm (primal representation)
1: for t = 0, 1, ... do
2:   $x^t = \arg\min_x \big\{ f(x) + \langle A^\top\lambda^t, x\rangle \big\}$
3:   $\lambda^{t+1} = \lambda^t + \eta_t A x^t - \eta_t \, \text{prox}_{\eta_t^{-1} h}\big( \eta_t^{-1}\lambda^t + A x^t \big)$

$\{x^t\}$ is a primal sequence, which is nevertheless not always feasible
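For the toy instance used above, the primal representation takes a closed form: the inner minimization gives $x^t = c - A^\top\lambda^t$, and $\text{prox}_{\eta^{-1} h}$ is soft-thresholding at level $\gamma/\eta$. A self-contained sketch under those same assumed choices:

```python
import numpy as np

# Same assumed toy instance: f(x) = 0.5||x - c||^2, h = gamma*||.||_1
rng = np.random.default_rng(0)
m, n, gamma = 20, 50, 0.1
A = rng.standard_normal((m, n))
c = rng.standard_normal(n)
eta = 1.0 / np.linalg.norm(A, 2) ** 2

soft = lambda u, tau: np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

lam = np.zeros(m)
for t in range(500):
    x = c - A.T @ lam  # x^t = argmin_x f(x) + <A^T lam^t, x>  (closed form here)
    # lam^{t+1} = lam^t + eta*A x^t - eta * prox_{eta^{-1} h}(eta^{-1} lam^t + A x^t)
    lam = lam + eta * (A @ x) - eta * soft(lam / eta + A @ x, gamma / eta)
```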
Justification of primal representation

By definition of $x^t$, $\; -A^\top\lambda^t \in \partial f(x^t)$. This together with the conjugate subgradient theorem and the smoothness of $f^*$ yields
$x^t = \nabla f^*(-A^\top\lambda^t)$
Therefore, the dual proximal gradient update rule can be rewritten as
$\lambda^{t+1} = \text{prox}_{\eta_t h^*}\big( \lambda^t + \eta_t A x^t \big)$
Justification of primal representation (cont.)

Moreover, from the extended Moreau decomposition, we know
$\text{prox}_{\eta_t h^*}\big( \lambda^t + \eta_t A x^t \big) = \lambda^t + \eta_t A x^t - \eta_t \, \text{prox}_{\eta_t^{-1} h}\big( \eta_t^{-1}\lambda^t + A x^t \big)$
$\Longrightarrow \quad \lambda^{t+1} = \lambda^t + \eta_t A x^t - \eta_t \, \text{prox}_{\eta_t^{-1} h}\big( \eta_t^{-1}\lambda^t + A x^t \big)$
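The extended Moreau decomposition used here reads $\text{prox}_{\eta h^*}(v) = v - \eta\,\text{prox}_{\eta^{-1} h}(v/\eta)$, and it is easy to sanity-check numerically. A sketch assuming $h = \|\cdot\|_1$, for which $\text{prox}_{\tau h}$ is soft-thresholding and $\text{prox}_{\eta h^*}$ is projection onto the unit $\ell_\infty$-ball:

```python
import numpy as np

# Check: prox_{eta h*}(v) = v - eta * prox_{eta^{-1} h}(v / eta)   for h = ||.||_1
soft = lambda u, tau: np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

rng = np.random.default_rng(1)
v, eta = 3.0 * rng.standard_normal(10), 0.7
lhs = np.clip(v, -1.0, 1.0)               # prox_{eta h*}: project onto {||.||_inf <= 1}
rhs = v - eta * soft(v / eta, 1.0 / eta)  # right-hand side of the decomposition
assert np.allclose(lhs, rhs)
```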
Accuracy of primal sequence

One can control primal accuracy via dual accuracy:

Lemma 9.1
Let $x_\lambda := \arg\min_x \big\{ f(x) + \langle A^\top\lambda, x\rangle \big\}$. Suppose $f$ is $\mu$-strongly convex. Then
$\| x^\star - x_\lambda \|_2^2 \leq \frac{2\big( Q_{\text{opt}} - Q(\lambda) \big)}{\mu}$

• consequence: $\|x^\star - x^t\|_2^2 \lesssim 1/t$
Proof of Lemma 9.1

Recall that the Lagrangian is given by
$L(x, z, \lambda) := \underbrace{f(x) + \langle A^\top\lambda, x\rangle}_{:= \tilde f(x,\lambda)} + \underbrace{h(z) - \langle\lambda, z\rangle}_{:= \tilde h(z,\lambda)}$
For any $\lambda$, define $x_\lambda := \arg\min_x \tilde f(x,\lambda)$ and $z_\lambda := \arg\min_z \tilde h(z,\lambda)$ (non-rigorous). Then by strong convexity,
$L(x^\star, z^\star, \lambda) - L(x_\lambda, z_\lambda, \lambda) \geq \tilde f(x^\star,\lambda) - \tilde f(x_\lambda,\lambda) \geq \frac{\mu}{2}\| x^\star - x_\lambda \|_2^2$
In addition, since $Ax^\star = z^\star$, one has
$L(x^\star, z^\star, \lambda) = f(x^\star) + h(z^\star) + \langle\lambda, Ax^\star - z^\star\rangle = f(x^\star) + h(Ax^\star) = F_{\text{opt}} \overset{\text{duality}}{=} Q_{\text{opt}}$
This combined with $L(x_\lambda, z_\lambda, \lambda) = Q(\lambda)$ gives
$Q_{\text{opt}} - Q(\lambda) \geq \frac{\mu}{2}\| x^\star - x_\lambda \|_2^2$
as claimed
Accelerated dual proximal gradient methods

One can apply FISTA to the dual problem to improve convergence:

Algorithm 9.3 Accelerated dual proximal gradient algorithm
1: for t = 0, 1, ... do
2:   $\lambda^{t+1} = \text{prox}_{\eta_t h^*}\big( w^t + \eta_t A \nabla f^*(-A^\top w^t) \big)$
3:   $\theta_{t+1} = \frac{1 + \sqrt{1 + 4\theta_t^2}}{2}$
4:   $w^{t+1} = \lambda^{t+1} + \frac{\theta_t - 1}{\theta_{t+1}}\big( \lambda^{t+1} - \lambda^t \big)$

apply the FISTA theory and Lemma 9.1 to get $Q_{\text{opt}} - Q(\lambda^t) \lesssim 1/t^2$ and $\|x^\star - x^t\|_2^2 \lesssim 1/t^2$
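A sketch of the accelerated variant on the same assumed toy instance as in the Algorithm 9.1 snippet ($f(x) = \frac{1}{2}\|x-c\|_2^2$, $h = \gamma\|\cdot\|_1$); only the momentum bookkeeping differs from the unaccelerated method:

```python
import numpy as np

# FISTA on the dual (Algorithm 9.3) for the assumed toy instance
rng = np.random.default_rng(0)
m, n, gamma = 20, 50, 0.1
A = rng.standard_normal((m, n))
c = rng.standard_normal(n)
eta = 1.0 / np.linalg.norm(A, 2) ** 2

grad_f_conj = lambda y: y + c                      # f(x) = 0.5||x - c||^2
prox_h_conj = lambda v: np.clip(v, -gamma, gamma)  # h = gamma*||.||_1

lam, w, theta = np.zeros(m), np.zeros(m), 1.0
for t in range(500):
    lam_next = prox_h_conj(w + eta * A @ grad_f_conj(-A.T @ w))
    theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta**2)) / 2.0
    w = lam_next + ((theta - 1.0) / theta_next) * (lam_next - lam)  # momentum step
    lam, theta = lam_next, theta_next
```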
Primal representation of accelerated dual proximal gradient methods

Algorithm 9.3 admits a more explicit primal representation:

Algorithm 9.4 Accelerated dual proximal gradient algorithm (primal representation)
1: for t = 0, 1, ... do
2:   $x^t = \arg\min_x \big\{ f(x) + \langle A^\top w^t, x\rangle \big\}$
3:   $\lambda^{t+1} = w^t + \eta_t A x^t - \eta_t \, \text{prox}_{\eta_t^{-1} h}\big( \eta_t^{-1} w^t + A x^t \big)$
4:   $\theta_{t+1} = \frac{1 + \sqrt{1 + 4\theta_t^2}}{2}$
5:   $w^{t+1} = \lambda^{t+1} + \frac{\theta_t - 1}{\theta_{t+1}}\big( \lambda^{t+1} - \lambda^t \big)$
Primal-dual proximal gradient method
Nonsmooth optimization

$\text{minimize}_x \; f(x) + h(Ax)$
where $f$ and $h$ are closed and convex
• both $f$ and $h$ might be nonsmooth
• both $f$ and $h$ admit inexpensive proximal operators
Primal-dual approaches?

$\text{minimize}_x \; f(x) + h(Ax)$

So far we have discussed the proximal method (resp. dual proximal method), which essentially updates only the primal (resp. dual) variables

Question: can we update both primal and dual variables simultaneously and take advantage of both $\text{prox}_f$ and $\text{prox}_h$?
A saddle-point formulation

To this end, we first derive a saddle-point formulation that includes both primal and dual variables

$\text{minimize}_x \; f(x) + h(Ax)$
add an auxiliary variable $z$:
$\iff \; \text{minimize}_{x,z} \; f(x) + h(z) \quad \text{subject to} \quad Ax = z$
$\Longrightarrow \; \text{maximize}_\lambda \; \min_{x,z} \; f(x) + h(z) + \langle\lambda, Ax - z\rangle$
$\iff \; \text{maximize}_\lambda \; \min_x \; f(x) + \langle\lambda, Ax\rangle - h^*(\lambda)$
$\iff \; \text{minimize}_x \; \max_\lambda \; f(x) + \langle\lambda, Ax\rangle - h^*(\lambda)$ (saddle-point problem)
A saddle-point formulation

$\text{minimize}_x \; \max_\lambda \; f(x) + \langle\lambda, Ax\rangle - h^*(\lambda)$  (9.1)

• one can then consider updating the primal variable $x$ and the dual variable $\lambda$ simultaneously
• we'll first examine the optimality condition for (9.1), which in turn gives ideas about how to jointly update the primal and dual variables
Optimality condition

$\text{minimize}_x \; \max_\lambda \; f(x) + \langle\lambda, Ax\rangle - h^*(\lambda)$

optimality condition:
$\begin{cases} 0 \in \partial f(x) + A^\top\lambda \\ 0 \in -Ax + \partial h^*(\lambda) \end{cases}
\quad\iff\quad 0 \in \begin{bmatrix} 0 & A^\top \\ -A & 0 \end{bmatrix}\begin{bmatrix} x \\ \lambda \end{bmatrix} + \begin{bmatrix} \partial f(x) \\ \partial h^*(\lambda) \end{bmatrix} := F(x, \lambda)$  (9.2)

key idea: iteratively update $(x, \lambda)$ to reach a point obeying $0 \in F(x, \lambda)$
How to solve $0 \in F(x)$ in general?

In general, finding a solution to $0 \in F(x)$ (called a monotone inclusion problem if $F$ is maximal monotone) is equivalent to finding a fixed point of the resolvent $(I + \eta F)^{-1}$ of $F$, i.e. a solution to
$x = (I + \eta F)^{-1}(x)$
since $0 \in F(x) \iff x \in (I + \eta F)(x)$. This suggests a natural fixed-point iteration / resolvent iteration:
$x^{t+1} = (I + \eta F)^{-1}(x^t), \qquad t = 0, 1, \ldots$
Aside: monotone operators (Ryu, Boyd '16)

• a relation $F$ is called monotone if
$\langle u - v, x - y\rangle \geq 0 \qquad \forall (x, u), (y, v) \in F$
• a relation $F$ is called maximal monotone if there is no monotone operator that properly contains it
Proximal point method

$x^{t+1} = (I + \eta_t F)^{-1}(x^t), \qquad t = 0, 1, \ldots$

If $F = \partial f$ for some convex function $f$, then this proximal point method becomes
$x^{t+1} = \text{prox}_{\eta_t f}(x^t), \qquad t = 0, 1, \ldots$
• useful when $\text{prox}_{\eta_t f}$ is cheap
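As a concrete instance, for a quadratic $f(x) = \frac{1}{2}x^\top P x$ the resolvent $(I + \eta\,\partial f)^{-1}$ is just a linear solve, so the proximal point method can be sketched as follows ($P$, $\eta$, and the iteration count are illustrative assumptions):

```python
import numpy as np

# Proximal point method for F = grad f with f(x) = 0.5 x^T P x (P symmetric PSD):
# (I + eta F)^{-1}(x) solves (I + eta P) x_next = x, i.e. x_next = prox_{eta f}(x)
rng = np.random.default_rng(2)
n, eta = 5, 1.0
M = rng.standard_normal((n, n))
P = M.T @ M  # symmetric positive semidefinite
x = rng.standard_normal(n)
for t in range(100):
    x = np.linalg.solve(np.eye(n) + eta * P, x)  # resolvent iteration
# x tends to a minimizer of f (the origin when P is positive definite)
```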
Back to primal-dual approaches

Recall that we want to solve
$0 \in \begin{bmatrix} 0 & A^\top \\ -A & 0 \end{bmatrix}\begin{bmatrix} x \\ \lambda \end{bmatrix} + \begin{bmatrix} \partial f(x) \\ \partial h^*(\lambda) \end{bmatrix} := F(x, \lambda)$

issue with the proximal point method: computing $(I + \eta F)^{-1}$ is in general difficult
Back to primal-dual approaches

observation: in practice, we may often split $F$ into two operators and solve $0 \in A(x, \lambda) + B(x, \lambda)$ with
$A(x, \lambda) = \begin{bmatrix} 0 & A^\top \\ -A & 0 \end{bmatrix}\begin{bmatrix} x \\ \lambda \end{bmatrix}, \qquad B(x, \lambda) = \begin{bmatrix} \partial f(x) \\ \partial h^*(\lambda) \end{bmatrix}$  (9.3)

• $(I + \eta A)^{-1}$ can be computed by solving linear systems
• $(I + \eta B)^{-1}$ is easy if $\text{prox}_f$ and $\text{prox}_{h^*}$ are both inexpensive

solution: design update rules based on $(I + \eta A)^{-1}$ and $(I + \eta B)^{-1}$ instead of $(I + \eta F)^{-1}$
Operator splitting via Cayley operators

We now introduce one principled approach based on operator splitting:
find $x$ s.t. $0 \in F(x) = \underbrace{A(x) + B(x)}_{\text{operator splitting}}$

let $R_A := (I + \eta A)^{-1}$ and $R_B := (I + \eta B)^{-1}$ be resolvents, and $C_A := 2R_A - I$ and $C_B := 2R_B - I$ be Cayley operators

Lemma 9.2
$\underbrace{0 \in A(x) + B(x)}_{\iff \; x = R_{A+B}(x)} \quad\iff\quad C_A C_B(z) = z \text{ with } x = R_B(z)$  (9.4)

it comes down to finding a fixed point of $C_A C_B$
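Lemma 9.2 can be sanity-checked on a pair of scalar affine operators, where the resolvents have closed forms. In the sketch below, $A(x) = x - 3$ and $B(x) = 3x + 1$ are illustrative choices (so $0 \in A(x) + B(x)$ forces $x = 1/2$); the damped iteration from the next slides is used on $C_A C_B$, since it is the variant guaranteed to converge:

```python
# Scalar check of Lemma 9.2 with A(x) = x - 3, B(x) = 3x + 1, eta = 0.5
eta = 0.5
R_A = lambda z: (z + 3.0 * eta) / (1.0 + eta)        # solves x + eta*(x - 3) = z
R_B = lambda z: (z - 1.0 * eta) / (1.0 + 3.0 * eta)  # solves x + eta*(3x + 1) = z
C_A = lambda z: 2.0 * R_A(z) - z
C_B = lambda z: 2.0 * R_B(z) - z

z = 0.0
for _ in range(100):
    z = 0.5 * (z + C_A(C_B(z)))  # damped fixed-point iteration on C_A C_B
x = R_B(z)
assert abs((x - 3.0) + (3.0 * x + 1.0)) < 1e-6  # 0 = A(x) + B(x)  =>  x = 0.5
```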
Operator splitting via Cayley operators

$0 \in A(x) + B(x) \quad\iff\quad C_A C_B(z) = z \text{ with } x = R_B(z)$

advantage: this allows one to apply $C_A$ (resp. $R_A$) and $C_B$ (resp. $R_B$) sequentially, instead of computing $R_{A+B}$
Proof of Lemma 9.2

$C_A C_B(z) = z \quad\iff\quad \begin{cases} x = R_B(z) & (9.5a) \\ \tilde z = 2x - z & (9.5b) \\ \tilde x = R_A(\tilde z) & (9.5c) \\ z = 2\tilde x - \tilde z & (9.5d) \end{cases}$

From (9.5b), $2x = z + \tilde z$; from (9.5d), $2\tilde x = z + \tilde z$. Hence $x = \tilde x$ and
$2x = z + \tilde z$  (9.6)
Proof of Lemma 9.2 (cont.)

Recall that $z \in x + \eta B(x)$ and $\tilde z \in \tilde x + \eta A(\tilde x) = x + \eta A(x)$. Adding these two facts and using (9.6), we get
$2x = z + \tilde z \in 2x + \eta B(x) + \eta A(x) \quad\Longrightarrow\quad 0 \in A(x) + B(x)$
Douglas-Rachford splitting

How to find points obeying $z = C_A C_B(z)$?

First attempt: the fixed-point iteration $z^{t+1} = C_A C_B(z^t)$
• unfortunately, it may not converge in general ($C_A C_B$ is nonexpansive but need not be a contraction)

Douglas-Rachford splitting: a damped fixed-point iteration
$z^{t+1} = \frac{1}{2}\big( I + C_A C_B \big)(z^t)$
• converges whenever a solution to $0 \in A(x) + B(x)$ exists!
More explicit expression for D-R splitting

The Douglas-Rachford splitting update rule $z^{t+1} = \frac{1}{2}\big( I + C_A C_B \big)(z^t)$ is essentially:
$x^{t+1/2} = R_B(z^t)$
$z^{t+1/2} = 2x^{t+1/2} - z^t$
$x^{t+1} = R_A\big( z^{t+1/2} \big)$
$z^{t+1} = \frac{1}{2}\big( z^t + 2x^{t+1} - z^{t+1/2} \big) = z^t + x^{t+1} - x^{t+1/2}$
where $x^{t+1/2}$ and $z^{t+1/2}$ are auxiliary variables
More explicit expression for D-R splitting

or equivalently,
$x^{t+1/2} = R_B(z^t)$
$x^{t+1} = R_A\big( 2x^{t+1/2} - z^t \big)$
$z^{t+1} = z^t + x^{t+1} - x^{t+1/2}$
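In this compact form the method is a few lines of code. A minimal sketch for $\min_x f(x) + g(x)$, assuming $f(x) = \frac{1}{2}\|x - a\|_2^2$ (so $R_A = \text{prox}_{\eta f}$ has a closed form) and $g = \|\cdot\|_1$ (so $R_B$ is soft-thresholding); $a$ and $\eta$ are illustrative choices:

```python
import numpy as np

# Douglas-Rachford for min_x 0.5||x - a||^2 + ||x||_1; the solution is soft(a, 1)
soft = lambda u, tau: np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)
a, eta = np.array([3.0, -0.5, 0.2]), 1.0
prox_f = lambda u: (u + eta * a) / (1.0 + eta)  # prox_{eta f} for f = 0.5||. - a||^2

z = np.zeros_like(a)
for t in range(200):
    x_half = soft(z, eta)         # x^{t+1/2} = R_B(z^t)
    x = prox_f(2.0 * x_half - z)  # x^{t+1}   = R_A(2 x^{t+1/2} - z^t)
    z = z + x - x_half            # z^{t+1}   = z^t + x^{t+1} - x^{t+1/2}

assert np.allclose(x_half, soft(a, 1.0), atol=1e-6)
```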
Douglas-Rachford primal-dual splitting

$\text{minimize}_x \; \max_\lambda \; f(x) + \langle\lambda, Ax\rangle - h^*(\lambda)$

Applying Douglas-Rachford splitting to (9.3) yields
$x^{t+1/2} = \text{prox}_{\eta f}(p^t)$
$\lambda^{t+1/2} = \text{prox}_{\eta h^*}(q^t)$
$\begin{bmatrix} x^{t+1} \\ \lambda^{t+1} \end{bmatrix} = \begin{bmatrix} I & \eta A^\top \\ -\eta A & I \end{bmatrix}^{-1} \begin{bmatrix} 2x^{t+1/2} - p^t \\ 2\lambda^{t+1/2} - q^t \end{bmatrix}$
$p^{t+1} = p^t + x^{t+1} - x^{t+1/2}$
$q^{t+1} = q^t + \lambda^{t+1} - \lambda^{t+1/2}$
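Putting the pieces together, here is a minimal sketch of the primal-dual D-R iteration for $\min_x f(x) + h(Ax)$, assuming $f(x) = \frac{1}{2}\|x - c\|_2^2$ and $h = \gamma\|\cdot\|_1$ (so $\text{prox}_{\eta h^*}$ is projection onto the $\ell_\infty$-ball of radius $\gamma$); the problem data and parameters are illustrative, and the block system is solved densely purely for simplicity:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, gamma, eta = 20, 50, 0.1, 1.0
A = rng.standard_normal((m, n))
c = rng.standard_normal(n)

prox_f = lambda u: (u + eta * c) / (1.0 + eta)     # prox_{eta f}, f = 0.5||. - c||^2
prox_h_conj = lambda v: np.clip(v, -gamma, gamma)  # prox_{eta h*}, h = gamma*||.||_1

# block matrix [[I, eta*A^T], [-eta*A, I]] from the resolvent of the skew term
K = np.block([[np.eye(n), eta * A.T], [-eta * A, np.eye(m)]])

p, q = np.zeros(n), np.zeros(m)
for t in range(300):
    x_half, lam_half = prox_f(p), prox_h_conj(q)  # R_B: the two proximal steps
    rhs = np.concatenate([2.0 * x_half - p, 2.0 * lam_half - q])
    sol = np.linalg.solve(K, rhs)                 # R_A: solve the linear system
    x, lam = sol[:n], sol[n:]
    p, q = p + x - x_half, q + lam - lam_half     # averaging updates
```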
Example (O'Connor, Vandenberghe '14)

$\text{minimize}_x \; \|x\|_2 + \gamma\|Ax - b\|_1$
$\iff \; \text{minimize}_x \; f(x) + g(Ax)$ with $f(x) := \|x\|_2$ and $g(y) := \gamma\|y - b\|_1$

[Figure: relative error $\|x^k - x^\star\|/\|x^\star\|$ vs. iteration number $k$, comparing ADMM, primal DR, and primal-dual DR]
Example (O'Connor, Vandenberghe '14)

$\text{minimize}_x \; \|Kx - b\|_1 + \gamma \underbrace{\|Dx\|_{\text{iso}}}_{\text{certain } \ell_2\text{-}\ell_1 \text{ norm}} \qquad \text{subject to} \quad 0 \leq x \leq 1$
$\iff \; \text{minimize}_x \; f(x) + g(Ax)$ with $f(x) := 1_{\{0 \leq x \leq 1\}}(x)$ and $g(y_1, y_2) := \|y_1 - b\|_1 + \gamma\|y_2\|_{\text{iso}}$

[Figure: relative objective error $(f(x^k) - f^\star)/f^\star$ vs. iteration number $k$, comparing CP, ADMM, primal DR, and primal-dual DR]
Reference

[1] "Optimization methods for large-scale systems," EE236C lecture notes, L. Vandenberghe, UCLA.
[2] "Convex optimization," EE364B lecture notes, S. Boyd, Stanford.
[3] "Mathematical optimization," MATH301 lecture notes, E. Candes, Stanford.
[4] "First-order methods in optimization," A. Beck, Vol. 25, SIAM, 2017.
[5] "Primal-dual decomposition by operator splitting and applications to image deblurring," D. O'Connor, L. Vandenberghe, SIAM Journal on Imaging Sciences, 2014.
[6] "On the numerical solution of heat conduction problems in two and three space variables," J. Douglas, H. Rachford, Transactions of the American Mathematical Society, 1956.
[7] "A primer on monotone operator methods," E. Ryu, S. Boyd, Appl. Comput. Math., 2016.