Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani, Convex Optimization 10-725/36-725

Recall conjugate functions. Given $f : \mathbb{R}^n \to \mathbb{R}$, the function
$$f^*(y) = \max_{x \in \mathbb{R}^n} \; y^T x - f(x)$$
is called its conjugate. Conjugates appear frequently in dual programs, since
$$-f^*(y) = \min_{x \in \mathbb{R}^n} \; f(x) - y^T x$$
If $f$ is closed and convex, then $f^{**} = f$. Also,
$$x \in \partial f^*(y) \;\Longleftrightarrow\; y \in \partial f(x) \;\Longleftrightarrow\; x \in \operatorname*{argmin}_{z \in \mathbb{R}^n} \; f(z) - y^T z$$
and for strictly convex $f$,
$$\nabla f^*(y) = \operatorname*{argmin}_{z \in \mathbb{R}^n} \; f(z) - y^T z$$
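As a quick numerical illustration of the definition above (not from the slides): for $f(x) = \frac{1}{2}x^2$ the conjugate works out to $f^*(y) = \frac{1}{2}y^2$, which we can verify by brute-force maximization over a grid.

```python
import numpy as np

def conjugate(f, y, grid):
    # brute-force f*(y) = max_x { y*x - f(x) } over a grid of candidate x values
    return np.max(y * grid - f(grid))

f = lambda x: 0.5 * x**2
grid = np.linspace(-10.0, 10.0, 20001)

for y in [-2.0, 0.0, 1.5]:
    approx = conjugate(f, y, grid)
    exact = 0.5 * y**2            # known conjugate of 0.5*x^2
    assert abs(approx - exact) < 1e-4
```

The maximizer of $yx - \frac{1}{2}x^2$ is $x = y$, so the grid search recovers $\frac{1}{2}y^2$ up to discretization error.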

Outline. Today:
- Dual gradient methods
- Dual decomposition
- Augmented Lagrangians
- ADMM

Dual gradient methods

What if we can't derive the dual (conjugate) in closed form, but still want to utilize the dual relationship? It turns out we can still use dual-based subgradient or gradient methods. E.g., consider the problem
$$\min_{x \in \mathbb{R}^n} \; f(x) \quad \text{subject to } Ax = b$$
Its dual problem is
$$\max_{u \in \mathbb{R}^m} \; -f^*(-A^T u) - b^T u$$
where $f^*$ is the conjugate of $f$. Defining $g(u) = -f^*(-A^T u)$, note that $\partial g(u) = A\,\partial f^*(-A^T u)$, and recall
$$x \in \partial f^*(-A^T u) \;\Longleftrightarrow\; x \in \operatorname*{argmin}_{z \in \mathbb{R}^n} \; f(z) + u^T A z$$

Therefore the dual subgradient method (for minimizing the negative of the dual objective) starts with an initial dual guess $u^{(0)}$, and repeats for $k = 1, 2, 3, \ldots$
$$x^{(k)} \in \operatorname*{argmin}_{x \in \mathbb{R}^n} \; f(x) + (u^{(k-1)})^T A x$$
$$u^{(k)} = u^{(k-1)} + t_k\,(A x^{(k)} - b)$$
where $t_k$ are step sizes, chosen in standard ways.

Recall that if $f$ is strictly convex, then $f^*$ is differentiable, and so we get dual gradient ascent, which repeats for $k = 1, 2, 3, \ldots$
$$x^{(k)} = \operatorname*{argmin}_{x \in \mathbb{R}^n} \; f(x) + (u^{(k-1)})^T A x$$
$$u^{(k)} = u^{(k-1)} + t_k\,(A x^{(k)} - b)$$
(the difference is that $x^{(k)}$ is unique here)
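The two updates above take closed forms for a simple strongly convex choice such as $f(x) = \frac{1}{2}\|x\|_2^2$, where the $x$-update is $x = -A^T u$. A minimal sketch of dual gradient ascent, with a small made-up $A$ and $b$:

```python
import numpy as np

# Toy equality-constrained problem: min 0.5*||x||^2  s.t.  Ax = b
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

u = np.zeros(2)       # dual variable
t = 0.2               # constant step size (small enough for this A)
for _ in range(500):
    x = -A.T @ u                  # x-update: argmin_x 0.5||x||^2 + u^T A x
    u = u + t * (A @ x - b)       # dual gradient ascent step

# closed-form solution: minimum-norm point satisfying Ax = b
x_star = A.T @ np.linalg.solve(A @ A.T, b)
assert np.linalg.norm(x - x_star) < 1e-8
```

The dual iteration here is a linear fixed-point map, so the iterates converge to the minimum-norm feasible point.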

Fact: if $f$ is strongly convex with parameter $d$, then $\nabla f^*$ is Lipschitz with parameter $1/d$.

Check: if $f$ is strongly convex with minimizer $x$, then
$$f(y) \ge f(x) + \frac{d}{2}\|y - x\|_2^2, \quad \text{for all } y$$
Hence, defining $x_u = \nabla f^*(u)$ and $x_v = \nabla f^*(v)$,
$$f(x_v) - u^T x_v \ge f(x_u) - u^T x_u + \frac{d}{2}\|x_u - x_v\|_2^2$$
$$f(x_u) - v^T x_u \ge f(x_v) - v^T x_v + \frac{d}{2}\|x_u - x_v\|_2^2$$
Adding these together:
$$d\,\|x_u - x_v\|_2^2 \le (u - v)^T (x_u - x_v)$$
Using Cauchy-Schwarz and rearranging:
$$\|x_u - x_v\|_2 \le \frac{1}{d}\,\|u - v\|_2$$

Applying what we know about gradient descent: if $f$ is strongly convex with parameter $d$, then dual gradient ascent with constant step size $t_k = d$ converges at the rate $O(1/k)$. (Note: this is quite a strong assumption leading to a modest rate!)

Dual generalized gradient ascent and the accelerated dual generalized gradient method carry through in a similar manner.

Disadvantages of dual methods:
- Can be slow to converge (think of the subgradient method)
- Poor convergence properties: even though we may achieve convergence in the dual objective value, convergence of $u^{(k)}$, $x^{(k)}$ to solutions requires strong assumptions (primal iterates $x^{(k)}$ can even end up being infeasible in the limit)

Advantage: decomposability

Dual decomposition

Consider
$$\min_{x \in \mathbb{R}^n} \; \sum_{i=1}^B f_i(x_i) \quad \text{subject to } Ax = b$$
Here $x = (x_1, \ldots, x_B)$ is a division into $B$ blocks of variables, so each $x_i \in \mathbb{R}^{n_i}$. We can also partition $A$ accordingly:
$$A = [A_1 \; \ldots \; A_B], \quad \text{where } A_i \in \mathbb{R}^{m \times n_i}$$
A simple but powerful observation, in the calculation of a (sub)gradient:
$$x^+ \in \operatorname*{argmin}_{x \in \mathbb{R}^n} \; \sum_{i=1}^B f_i(x_i) + u^T A x
\;\Longleftrightarrow\;
x_i^+ \in \operatorname*{argmin}_{x_i \in \mathbb{R}^{n_i}} \; f_i(x_i) + u^T A_i x_i, \quad i = 1, \ldots, B$$
i.e., the minimization decomposes into $B$ separate problems.

Dual decomposition algorithm: repeat for $k = 1, 2, 3, \ldots$
$$x_i^{(k)} \in \operatorname*{argmin}_{x_i \in \mathbb{R}^{n_i}} \; f_i(x_i) + (u^{(k-1)})^T A_i x_i, \quad i = 1, \ldots, B$$
$$u^{(k)} = u^{(k-1)} + t_k \Big( \sum_{i=1}^B A_i x_i^{(k)} - b \Big)$$
Can think of these steps as:
- Broadcast: send $u$ to each of the $B$ processors; each optimizes in parallel to find $x_i$
- Gather: collect $A_i x_i$ from each processor, update the global dual variable $u$

(Figure: a central dual variable $u$ exchanging updates with processors holding $x_1$, $x_2$, $x_3$.)
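The broadcast/gather steps above can be sketched on a toy two-block problem with $f_i(x_i) = \frac{1}{2}\|x_i\|_2^2$ (my own made-up $A_1$, $A_2$, $b$); each block update $x_i = -A_i^T u$ could run on its own processor.

```python
import numpy as np

# min 0.5*||x1||^2 + 0.5*||x2||^2  s.t.  A1 x1 + A2 x2 = b
A1 = np.array([[1.0, 0.0], [2.0, 1.0]])
A2 = np.array([[0.0, 1.0], [1.0, 0.0]])
b = np.array([1.0, 2.0])
A = np.hstack([A1, A2])    # full constraint matrix, for checking only

u = np.zeros(2)
t = 0.2
for _ in range(500):
    x1 = -A1.T @ u                          # "broadcast": block 1 solves its subproblem
    x2 = -A2.T @ u                          # "broadcast": block 2, in parallel
    u = u + t * (A1 @ x1 + A2 @ x2 - b)     # "gather": global dual update

x = np.concatenate([x1, x2])
# agrees with solving the coupled problem all at once
x_star = A.T @ np.linalg.solve(A @ A.T, b)
assert np.linalg.norm(x - x_star) < 1e-8
```

Note that the blocks never see each other's variables; they communicate only through the shared prices $u$.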

Example with inequality constraints:
$$\min_{x \in \mathbb{R}^n} \; \sum_{i=1}^B f_i(x_i) \quad \text{subject to } \sum_{i=1}^B A_i x_i \le b$$
Dual decomposition (the projected subgradient method) repeats for $k = 1, 2, 3, \ldots$
$$x_i^{(k)} \in \operatorname*{argmin}_{x_i \in \mathbb{R}^{n_i}} \; f_i(x_i) + (u^{(k-1)})^T A_i x_i, \quad i = 1, \ldots, B$$
$$v^{(k)} = u^{(k-1)} + t_k \Big( \sum_{i=1}^B A_i x_i^{(k)} - b \Big)$$
$$u^{(k)} = (v^{(k)})_+$$
where $(\cdot)_+$ is componentwise thresholding, $(u_+)_i = \max\{0, u_i\}$.

Price coordination interpretation (from Vandenberghe's lecture notes):
- Have $B$ units in a system; each unit chooses its own decision variable $x_i$ (how to allocate its goods)
- Constraints are limits on shared resources (rows of $A$); each component $u_j$ of the dual variable is the price of resource $j$
- Dual update:
$$u_j^+ = (u_j - t s_j)_+, \quad j = 1, \ldots, m$$
where $s = b - \sum_{i=1}^B A_i x_i$ are slacks:
  - Increase price $u_j$ if resource $j$ is over-utilized ($s_j < 0$)
  - Decrease price $u_j$ if resource $j$ is under-utilized ($s_j > 0$)
  - Never let prices get negative

Augmented Lagrangian

Convergence of dual methods can be greatly improved by utilizing the augmented Lagrangian. Start by transforming the primal:
$$\min_{x \in \mathbb{R}^n} \; f(x) + \frac{\rho}{2}\|Ax - b\|_2^2 \quad \text{subject to } Ax = b$$
Clearly the extra term $(\rho/2)\|Ax - b\|_2^2$ does not change the problem. Assuming, e.g., that $A$ has full column rank, the primal objective is strongly convex (with parameter $\rho\,\sigma_{\min}^2(A)$), so the dual objective is differentiable and we can use dual gradient ascent: repeat for $k = 1, 2, 3, \ldots$
$$x^{(k)} = \operatorname*{argmin}_{x \in \mathbb{R}^n} \; f(x) + (u^{(k-1)})^T A x + \frac{\rho}{2}\|Ax - b\|_2^2$$
$$u^{(k)} = u^{(k-1)} + \rho\,(A x^{(k)} - b)$$
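A minimal sketch of these two steps (the method of multipliers) for $f(x) = \frac{1}{2}\|x - c\|_2^2$, where the $x$-update has the closed form $(I + \rho A^T A)x = c - A^T u + \rho A^T b$; the data $A$, $b$, $c$ are made up.

```python
import numpy as np

# min 0.5*||x - c||^2  s.t.  Ax = b, solved via the augmented Lagrangian
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
c = np.array([1.0, 1.0, 1.0])
rho = 1.0

u = np.zeros(2)
M = np.eye(3) + rho * A.T @ A       # Hessian of the augmented Lagrangian in x
for _ in range(100):
    x = np.linalg.solve(M, c - A.T @ u + rho * A.T @ b)   # x-update
    u = u + rho * (A @ x - b)                             # dual update, step size rho

# compare with the projection of c onto {x : Ax = b}
x_star = c - A.T @ np.linalg.solve(A @ A.T, A @ c - b)
assert np.linalg.norm(x - x_star) < 1e-8
```

Notice how quickly the iterates become feasible compared to plain dual ascent; that is the convergence improvement the augmentation buys.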

Note the step size choice $t_k = \rho$, for all $k$, in dual gradient ascent. Why? Since $x^{(k)}$ minimizes $f(x) + (u^{(k-1)})^T A x + \frac{\rho}{2}\|Ax - b\|_2^2$ over $x \in \mathbb{R}^n$,
$$0 \in \partial f(x^{(k)}) + A^T\big(u^{(k-1)} + \rho\,(A x^{(k)} - b)\big) = \partial f(x^{(k)}) + A^T u^{(k)}$$
This is exactly the stationarity condition for the original primal problem; one can show under mild conditions that $A x^{(k)} - b$ approaches zero (primal iterates approach feasibility), hence in the limit the KKT conditions are satisfied and $x^{(k)}$, $u^{(k)}$ approach optimality.

Advantage: much better convergence properties.
Disadvantage: not decomposable (separability is compromised by the augmented Lagrangian!)

ADMM

ADMM (Alternating Direction Method of Multipliers): go for the best of both worlds! I.e., the good convergence properties of augmented Lagrangians, along with decomposability. Consider the minimization problem
$$\min_{x_1 \in \mathbb{R}^{n_1},\, x_2 \in \mathbb{R}^{n_2}} \; f_1(x_1) + f_2(x_2) \quad \text{subject to } A_1 x_1 + A_2 x_2 = b$$
As usual, we augment the objective:
$$\min_{x_1,\, x_2} \; f_1(x_1) + f_2(x_2) + \frac{\rho}{2}\|A_1 x_1 + A_2 x_2 - b\|_2^2 \quad \text{subject to } A_1 x_1 + A_2 x_2 = b$$

Write the augmented Lagrangian as
$$L_\rho(x_1, x_2, u) = f_1(x_1) + f_2(x_2) + u^T(A_1 x_1 + A_2 x_2 - b) + \frac{\rho}{2}\|A_1 x_1 + A_2 x_2 - b\|_2^2$$
ADMM repeats the steps, for $k = 1, 2, 3, \ldots$
$$x_1^{(k)} = \operatorname*{argmin}_{x_1 \in \mathbb{R}^{n_1}} \; L_\rho(x_1, x_2^{(k-1)}, u^{(k-1)})$$
$$x_2^{(k)} = \operatorname*{argmin}_{x_2 \in \mathbb{R}^{n_2}} \; L_\rho(x_1^{(k)}, x_2, u^{(k-1)})$$
$$u^{(k)} = u^{(k-1)} + \rho\,(A_1 x_1^{(k)} + A_2 x_2^{(k)} - b)$$
Note that the usual method of multipliers would have replaced the first two steps by a joint minimization:
$$(x_1^{(k)}, x_2^{(k)}) = \operatorname*{argmin}_{(x_1, x_2)} \; L_\rho(x_1, x_2, u^{(k-1)})$$
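The three steps above can be sketched on a toy consensus instance of my own choosing: $f_1(x_1) = \frac{1}{2}\|x_1 - a\|_2^2$, $f_2(x_2) = \frac{1}{2}\|x_2 - c\|_2^2$, with $A_1 = I$, $A_2 = -I$, $b = 0$ (so the constraint is $x_1 = x_2$). Both block updates have closed forms.

```python
import numpy as np

a = np.array([1.0, 3.0])
c = np.array([5.0, -1.0])
rho = 1.0

x2 = np.zeros(2)
u = np.zeros(2)
for _ in range(200):
    x1 = (a - u + rho * x2) / (1 + rho)   # argmin_x1 L_rho(x1, x2, u)
    x2 = (c + u + rho * x1) / (1 + rho)   # argmin_x2 L_rho(x1, x2, u)
    u = u + rho * (x1 - x2)               # dual update

# the optimum averages the two targets: x1 = x2 = (a + c)/2
assert np.allclose(x1, (a + c) / 2, atol=1e-6)
```

Each block update touches only its own objective term, which is exactly the decomposability that the joint method-of-multipliers minimization would destroy.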

Convergence guarantees

Under modest assumptions on $f_1, f_2$ (note: these do not require $A_1, A_2$ to be full rank), the ADMM iterates satisfy, for any $\rho > 0$:
- Residual convergence: $r^{(k)} = A_1 x_1^{(k)} + A_2 x_2^{(k)} - b \to 0$ as $k \to \infty$, i.e., primal iterates approach feasibility
- Objective convergence: $f_1(x_1^{(k)}) + f_2(x_2^{(k)}) \to f^\star$, where $f^\star$ is the optimal primal criterion value
- Dual convergence: $u^{(k)} \to u^\star$, where $u^\star$ is a dual solution

For details, see Boyd et al. (2010). Note that we do not generically get primal convergence, but this can be shown under more assumptions.

Scaled form

It is often easier to express the ADMM algorithm in scaled form, where we replace the dual variable $u$ by the scaled variable $w = u/\rho$. In this parametrization, the ADMM steps are
$$x_1^{(k)} = \operatorname*{argmin}_{x_1 \in \mathbb{R}^{n_1}} \; f_1(x_1) + \frac{\rho}{2}\|A_1 x_1 + A_2 x_2^{(k-1)} - b + w^{(k-1)}\|_2^2$$
$$x_2^{(k)} = \operatorname*{argmin}_{x_2 \in \mathbb{R}^{n_2}} \; f_2(x_2) + \frac{\rho}{2}\|A_1 x_1^{(k)} + A_2 x_2 - b + w^{(k-1)}\|_2^2$$
$$w^{(k)} = w^{(k-1)} + A_1 x_1^{(k)} + A_2 x_2^{(k)} - b$$
Note that here the $k$th iterate $w^{(k)}$ is just given by a running sum of residuals:
$$w^{(k)} = w^{(0)} + \sum_{i=1}^k \big(A_1 x_1^{(i)} + A_2 x_2^{(i)} - b\big)$$

Practicalities and tricks

In practice, ADMM obtains a relatively accurate solution in a handful of iterations, but requires many, many iterations for a highly accurate solution. Hence it behaves more like a first-order method than a second-order method.

The choice of $\rho$ can greatly influence the practical convergence of ADMM:
- $\rho$ too large: not enough emphasis on minimizing $f_1 + f_2$
- $\rho$ too small: not enough emphasis on feasibility

Boyd et al. (2010) give a strategy for varying $\rho$ that is useful in practice (but comes without convergence guarantees). Like deriving duals, getting a problem into ADMM form often requires a bit of trickery (and different forms can lead to different algorithms).
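The varying-$\rho$ strategy of Boyd et al. (2010, Section 3.4.1) balances the primal and dual residual norms; sketched as a helper below, with their suggested defaults $\mu = 10$, $\tau = 2$ (the function name `update_rho` is my own).

```python
def update_rho(rho, primal_res, dual_res, mu=10.0, tau=2.0):
    """Residual-balancing rho update of Boyd et al. (2010), Sec. 3.4.1."""
    if primal_res > mu * dual_res:
        return rho * tau      # feasibility is lagging: push harder on it
    if dual_res > mu * primal_res:
        return rho / tau      # objective is lagging: relax the penalty
    return rho                # residuals are balanced: leave rho alone

assert update_rho(1.0, primal_res=100.0, dual_res=1.0) == 2.0
assert update_rho(1.0, primal_res=1.0, dual_res=100.0) == 0.5
assert update_rho(1.0, primal_res=3.0, dual_res=2.0) == 1.0
```

One practical caveat: if you run ADMM in scaled form, the scaled dual variable $w = u/\rho$ must be rescaled whenever $\rho$ changes.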

Alternating projections, revisited

Consider finding a point in the intersection of convex sets $C, D \subseteq \mathbb{R}^n$. We solve
$$\min_{x \in \mathbb{R}^n} \; 1_C(x) + 1_D(x)$$
where $1_C, 1_D$ are indicator functions. To get into ADMM form, we write this as
$$\min_{x,\, z \in \mathbb{R}^n} \; 1_C(x) + 1_D(z) \quad \text{subject to } x - z = 0$$
Each ADMM cycle involves two projections:
$$x^{(k)} = P_C\big(z^{(k-1)} - w^{(k-1)}\big)$$
$$z^{(k)} = P_D\big(x^{(k)} + w^{(k-1)}\big)$$
$$w^{(k)} = w^{(k-1)} + x^{(k)} - z^{(k)}$$
This is like the classical alternating projections method, but now with a dual variable $w$ (and it is much more efficient).
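The three projection steps above can be sketched for a toy pair of sets of my own choosing: $C$ the nonnegative orthant and $D$ the hyperplane $\{x : \mathbf{1}^T x = 1\}$, both of which have simple closed-form projections.

```python
import numpy as np

def P_C(x):
    return np.maximum(x, 0.0)                  # project onto {x >= 0}

def P_D(x):
    return x - (x.sum() - 1.0) / x.size        # project onto {x : sum(x) = 1}

z = np.array([3.0, -2.0])    # arbitrary starting point
w = np.zeros(2)
for _ in range(500):
    x = P_C(z - w)
    z = P_D(x + w)
    w = w + x - z            # dual variable accumulates the residual x - z

# x lands (approximately) in the intersection C ∩ D
assert np.all(x >= -1e-6) and abs(x.sum() - 1.0) < 1e-6
```

For this pair of sets the iteration reaches a fixed point in a handful of steps; plain alternating projections would also converge here, but without the dual variable it can be much slower for poorly angled sets.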

Generalized lasso, revisited

Given the usual $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, and an additional matrix $D \in \mathbb{R}^{m \times p}$, the generalized lasso problem solves
$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|D\beta\|_1$$
The generalized lasso is computationally harder than the lasso ($D = I$); recall our previous discussion on algorithms. Rewrite as
$$\min_{\beta \in \mathbb{R}^p,\, \alpha \in \mathbb{R}^m} \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\alpha\|_1 \quad \text{subject to } D\beta - \alpha = 0$$
and ADMM delivers a simple algorithm for the generalized lasso:
$$\beta^{(k)} = (X^T X + \rho D^T D)^{+}\big(X^T y + \rho D^T(\alpha^{(k-1)} - w^{(k-1)})\big)$$
$$\alpha^{(k)} = S_{\lambda/\rho}\big(D\beta^{(k)} + w^{(k-1)}\big)$$
$$w^{(k)} = w^{(k-1)} + D\beta^{(k)} - \alpha^{(k)}$$
where $S_{\lambda/\rho}$ is componentwise soft-thresholding at level $\lambda/\rho$.
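A minimal sketch of these updates for a tiny fused-lasso instance ($X = I$, $D$ the first-difference matrix); the data $y$ and the values of $\lambda, \rho$ are made up. With $X = I$ the matrix $X^T X + \rho D^T D$ is invertible, so an ordinary solve stands in for the pseudoinverse.

```python
import numpy as np

def soft_threshold(v, t):
    # componentwise soft-thresholding S_t(v)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

y = np.array([1.0, 1.2, 0.9, 5.0, 5.1])    # noisy piecewise-constant signal
n = y.size
X = np.eye(n)
D = np.diff(np.eye(n), axis=0)             # (n-1) x n first differences
lam, rho = 1.0, 1.0

beta = np.zeros(n)
alpha = np.zeros(n - 1)
w = np.zeros(n - 1)
M = X.T @ X + rho * D.T @ D
for _ in range(500):
    beta = np.linalg.solve(M, X.T @ y + rho * D.T @ (alpha - w))   # beta-update
    alpha = soft_threshold(D @ beta + w, lam / rho)                # alpha-update
    w = w + D @ beta - alpha                                       # dual update
```

The recovered `beta` is (approximately) piecewise constant: the small jumps in `y` get fused away while the big jump survives, shrunk by the penalty.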

References:
- S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2010), "Distributed optimization and statistical learning via the alternating direction method of multipliers"
- L. Vandenberghe, Lecture Notes for EE 236C, UCLA, Spring 2011-2012