
Uses of Duality
Geoff Gordon & Ryan Tibshirani, Optimization 10-725 / 36-725

Remember conjugate functions

Given $f : \mathbb{R}^n \to \mathbb{R}$, the function
$$f^*(y) = \max_{x \in \mathbb{R}^n} \; y^T x - f(x)$$
is called its conjugate. Conjugates appear frequently in dual programs, as
$$-f^*(y) = \min_{x \in \mathbb{R}^n} \; f(x) - y^T x$$
If $f$ is closed and convex, then $f^{**} = f$. Also,
$$x \in \partial f^*(y) \;\Longleftrightarrow\; y \in \partial f(x) \;\Longleftrightarrow\; x \in \operatorname*{argmin}_{z \in \mathbb{R}^n} \; f(z) - y^T z$$
and for strictly convex $f$,
$$\nabla f^*(y) = \operatorname*{argmin}_{z \in \mathbb{R}^n} \; f(z) - y^T z$$
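As a quick sanity check of these facts, here is a worked example (not in the slides): the conjugate of a strongly convex quadratic.
$$f(x) = \tfrac{1}{2} x^T Q x, \;\; Q \succ 0: \qquad f^*(y) = \max_{x} \; y^T x - \tfrac{1}{2} x^T Q x$$
Setting the gradient to zero gives $y = Qx$, so $x = Q^{-1} y$ and $f^*(y) = \tfrac{1}{2} y^T Q^{-1} y$. Indeed $\nabla f^*(y) = Q^{-1} y = \operatorname*{argmin}_z \, (f(z) - y^T z)$, matching the last display; note also that $f$ strongly convex (parameter $\lambda_{\min}(Q)$) gives $\nabla f^*$ Lipschitz (parameter $1/\lambda_{\min}(Q)$), a fact proved later in this lecture.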

Uses of duality

We already discussed two key uses of duality:

- For $x$ primal feasible and $u, v$ dual feasible, $f(x) - g(u, v)$ is called the duality gap between $x$ and $u, v$. Since $f(x) - f(x^\star) \le f(x) - g(u, v)$, a zero duality gap implies optimality. The duality gap can also be used as a stopping criterion in algorithms.
- Under strong duality, given dual optimal $u^\star, v^\star$, any primal solution minimizes $L(x, u^\star, v^\star)$ over $x \in \mathbb{R}^n$ (i.e., satisfies the stationarity condition). This can be used to characterize or compute primal solutions.

Outline

- Examples
- Dual gradient methods
- Dual decomposition
- Augmented Lagrangians

(And many more uses of duality: e.g., dual certificates in recovery theory, the dual simplex algorithm, dual smoothing)

Lasso and projections onto polyhedra

Recall the lasso problem:
$$\min_{x \in \mathbb{R}^p} \; \tfrac{1}{2} \|y - Ax\|^2 + \lambda \|x\|_1$$
and its dual problem:
$$\min_{u \in \mathbb{R}^n} \; \|y - u\|^2 \quad \text{subject to} \quad \|A^T u\|_\infty \le \lambda$$
According to the stationarity conditions (with respect to the $z, x$ blocks):
$$Ax^\star = y - u^\star, \qquad A_i^T u^\star \in \begin{cases} \{\lambda\} & \text{if } x_i^\star > 0 \\ \{-\lambda\} & \text{if } x_i^\star < 0 \\ [-\lambda, \lambda] & \text{if } x_i^\star = 0 \end{cases} \quad i = 1, \ldots, p$$
where $A_1, \ldots, A_p$ are the columns of $A$. I.e., $|A_i^T u^\star| < \lambda$ implies $x_i^\star = 0$.

Directly from the dual problem,
$$\min_{u \in \mathbb{R}^n} \; \|y - u\|^2 \quad \text{subject to} \quad \|A^T u\|_\infty \le \lambda$$
we see that $u^\star = P_C(y)$, the projection of $y$ onto the polyhedron
$$C = \{u \in \mathbb{R}^n : \|A^T u\|_\infty \le \lambda\} = \bigcap_{i=1}^p \big( \{u : A_i^T u \le \lambda\} \cap \{u : A_i^T u \ge -\lambda\} \big)$$
Therefore the lasso fit is
$$Ax^\star = (I - P_C)(y),$$
the residual from projecting $y$ onto $C$.
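This projection identity is easy to check numerically. Below is a minimal sketch (assuming the cvxpy package is available; the data is made up) that solves the primal lasso and the dual projection separately and compares $Ax^\star$ with $y - P_C(y)$:

```python
import cvxpy as cp
import numpy as np

# Numerical check (made-up data) that the lasso fit equals y - P_C(y),
# where C = {u : ||A^T u||_inf <= lam}.
rng = np.random.default_rng(0)
n, p = 20, 10
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 1.0

# Primal lasso
x = cp.Variable(p)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - A @ x) + lam * cp.norm1(x))).solve()

# Projection of y onto the dual polyhedron C
u = cp.Variable(n)
cp.Problem(cp.Minimize(cp.sum_squares(y - u)),
           [cp.norm_inf(A.T @ u) <= lam]).solve()

print(np.max(np.abs(A @ x.value - (y - u.value))))  # ~ 0, up to solver tolerance
```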


Consider the lasso fit $Ax^\star$ as a function of $y \in \mathbb{R}^n$, for fixed $A, \lambda$. From the dual perspective (and some geometric arguments):

- The lasso fit $Ax^\star$ is nonexpansive with respect to $y$, i.e., it is Lipschitz with constant $L = 1$:
  $$\|Ax^\star(y) - Ax^\star(y')\| \le \|y - y'\| \quad \text{for all } y, y'$$
- Each face of the polyhedron $C$ corresponds to a particular active set $S$ for lasso solutions [1]
- For almost every $y \in \mathbb{R}^n$, if we move $y$ slightly, it will still project to the same face of $C$
- Therefore, for almost every $y$, the active set $S$ of the lasso solution is locally constant [2], and the lasso fit is a locally affine projection map

[1,2] These statements assume that the lasso solution is unique; analogous statements exist for the nonunique case.

Safe rules

For the lasso problem, somewhat amazingly, we have a safe rule: [3]
$$|A_i^T y| < \lambda - \|A_i\| \|y\| \, \frac{\lambda_{\max} - \lambda}{\lambda_{\max}} \;\Longrightarrow\; x_i^\star = 0, \quad \text{all } i = 1, \ldots, p$$
where $\lambda_{\max} = \|A^T y\|_\infty$ (the smallest value of $\lambda$ such that $x^\star = 0$). I.e., we can eliminate features a priori, without solving the problem. (Note: this is not an if and only if statement!)

Why this rule? The construction comes from the lasso dual:
$$\max_{u \in \mathbb{R}^n} \; g(u) \quad \text{subject to} \quad \|A^T u\|_\infty \le \lambda$$
where $g(u) = (\|y\|^2 - \|y - u\|^2)/2$. Suppose $u_0$ is a dual feasible point (e.g., take $u_0 = y \lambda / \lambda_{\max}$). Then $\gamma = g(u_0)$ lower bounds the dual optimal value, so the dual problem is equivalent to
$$\max_{u \in \mathbb{R}^n} \; g(u) \quad \text{subject to} \quad \|A^T u\|_\infty \le \lambda, \; g(u) \ge \gamma$$

[3] L. El Ghaoui et al. (2010), "Safe feature elimination in sparse learning". Safe rules extend to lasso logistic regression and 1-norm SVMs; only $g$ changes.

Now consider computing
$$m_i = \max_{u \in \mathbb{R}^n} \; A_i^T u \quad \text{subject to} \quad g(u) \ge \gamma, \quad \text{for } i = 1, \ldots, p$$
Note that $m_i < \lambda \Rightarrow A_i^T u^\star < \lambda \Rightarrow x_i^\star = 0$. Through another dual argument, we can explicitly compute $m_i$, and
$$m_i < \lambda \;\Longleftrightarrow\; A_i^T y < \lambda - \|A_i\| \sqrt{\|y\|^2 - 2\gamma}$$
Substituting $\gamma = g(y \lambda / \lambda_{\max})$ then gives the safe rule on the previous slide. [4]

[4] From L. El Ghaoui et al. (2010), "Safe feature elimination in sparse learning"
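The resulting screening test is a few lines of NumPy. A minimal sketch (made-up data; `safe_rule` is a hypothetical helper name): any feature it flags is guaranteed to have a zero coefficient at the given $\lambda$.

```python
import numpy as np

def safe_rule(A, y, lam):
    """SAFE screening for the lasso: boolean mask of features i with
    |A_i^T y| < lam - ||A_i|| * ||y|| * (lam_max - lam) / lam_max,
    which are guaranteed to satisfy x_i* = 0."""
    lam_max = np.max(np.abs(A.T @ y))
    thresh = lam - np.linalg.norm(A, axis=0) * np.linalg.norm(y) \
                   * (lam_max - lam) / lam_max
    return np.abs(A.T @ y) < thresh

rng = np.random.default_rng(0)
A, y = rng.standard_normal((50, 200)), rng.standard_normal(50)
lam_max = np.max(np.abs(A.T @ y))
print(safe_rule(A, y, 0.9 * lam_max).sum(), "of 200 features eliminated")
```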

Beyond pure sparsity

Consider something like a reverse lasso problem (also called $\ell_1$ analysis):
$$\min_{x \in \mathbb{R}^n} \; \tfrac{1}{2} \|y - x\|^2 + \lambda \|Dx\|_1$$
where $D \in \mathbb{R}^{m \times n}$ is a given penalty matrix (analysis operator). Note this cannot be turned into a lasso problem if $\mathrm{rank}(D) < m$.

Basic idea: $Dx$ is now sparse, and we choose $D$ so that this gives some type of desired structure in $x$. E.g., the fused lasso (also known as total variation denoising), where $D$ is chosen so that
$$\|Dx\|_1 = \sum_{(i,j) \in E} |x_i - x_j|$$
for some set of pairs $E$. In other words, $D$ is the incidence matrix of the graph $G = (\{1, \ldots, n\}, E)$, with arbitrary edge orientations (a small construction is sketched below).
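As a concrete instance (a minimal sketch, not from the slides), the incidence matrix of a chain graph recovers the 1d fused lasso penalty $\|Dx\|_1 = \sum_i |x_{i+1} - x_i|$:

```python
import numpy as np

def chain_incidence(n):
    """Incidence matrix D of the chain graph on nodes 1..n:
    row k has -1 at position k and +1 at position k+1, so
    (D @ x)[k] = x[k+1] - x[k]."""
    D = np.zeros((n - 1, n))
    D[np.arange(n - 1), np.arange(n - 1)] = -1.0
    D[np.arange(n - 1), np.arange(1, n)] = 1.0
    return D

x = np.array([1.0, 1.0, 4.0, 4.0])
D = chain_incidence(4)
assert np.linalg.norm(D @ x, 1) == sum(abs(x[i + 1] - x[i]) for i in range(3))
```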

[Figure: original image, noisy version, fused lasso solution. Here $D$ is the incidence matrix on a 2d grid]

[Figure: observed data and fused lasso solution. For each state, we have the log proportion of H1N1 cases in 2009 (from the CDC). Here $D$ is the incidence matrix on the graph formed by joining US states to their geographic neighbors]

Using similar steps as in the lasso dual derivation, here the dual problem is:
$$\min_{u \in \mathbb{R}^m} \; \|y - D^T u\|^2 \quad \text{subject to} \quad \|u\|_\infty \le \lambda$$
and the primal-dual relationship is
$$x^\star = y - D^T u^\star, \qquad u_i^\star \in \begin{cases} \{\lambda\} & \text{if } (Dx^\star)_i > 0 \\ \{-\lambda\} & \text{if } (Dx^\star)_i < 0 \\ [-\lambda, \lambda] & \text{if } (Dx^\star)_i = 0 \end{cases} \quad i = 1, \ldots, m$$
Clearly $D^T u^\star = P_C(y)$, where now
$$C = \{D^T u : \|u\|_\infty \le \lambda\},$$
also a polyhedron, and therefore $x^\star = (I - P_C)(y)$.


Same arguments as before show that:

- The primal solution $x^\star$ is Lipschitz continuous as a function of $y$ (for fixed $D, \lambda$) with constant $L = 1$
- Each face of the polyhedron $C$ corresponds to a nonzero pattern in $Dx^\star$
- Almost everywhere in $y$, the primal solution $x^\star$ admits a locally constant structure $S = \mathrm{supp}(Dx^\star)$, and therefore is a locally affine projection map

The dual is also very helpful for algorithmic reasons: it disentangles the linear operator $D$ from the 1-norm. The prox function in the dual problem is now very easy (projection onto the $\infty$-norm ball), so we can use, e.g., generalized gradient descent or the accelerated generalized gradient method on the dual problem.
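A minimal sketch of this dual approach (NumPy, made-up data): projected gradient descent on the dual of the 1d fused lasso, where the prox step is just coordinatewise clipping to $[-\lambda, \lambda]$, followed by recovering $x^\star = y - D^T u^\star$. The step size $1/\sigma_{\max}(D)^2$ is one standard choice, not dictated by the slides.

```python
import numpy as np

def generalized_lasso_dual_pg(y, D, lam, n_iter=1000):
    """Projected gradient on the dual: min_u 0.5*||y - D^T u||^2 s.t. ||u||_inf <= lam."""
    u = np.zeros(D.shape[0])
    t = 1.0 / np.linalg.norm(D, 2) ** 2       # step 1/L, with L = sigma_max(D)^2
    for _ in range(n_iter):
        grad = D @ (D.T @ u - y)              # gradient of the smooth dual objective
        u = np.clip(u - t * grad, -lam, lam)  # projection onto the inf-norm ball
    return y - D.T @ u                        # primal solution x* = y - D^T u*

# 1d fused lasso: denoise a piecewise-constant signal
rng = np.random.default_rng(0)
y = np.concatenate([np.ones(25), 3 * np.ones(25)]) + 0.3 * rng.standard_normal(50)
D = np.diff(np.eye(50), axis=0)               # chain-graph incidence matrix
x_hat = generalized_lasso_dual_pg(y, D, lam=1.0)
```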

Dual gradient methods

What if we can't derive the dual (conjugate) in closed form, but want to utilize the dual relationship? It turns out we can still use dual-based subgradient or gradient methods.

E.g., consider the problem
$$\min_{x \in \mathbb{R}^n} \; f(x) \quad \text{subject to} \quad Ax = b$$
Its dual problem is
$$\max_{u \in \mathbb{R}^m} \; -f^*(-A^T u) - b^T u$$
where $f^*$ is the conjugate of $f$. Defining $g(u) = -f^*(-A^T u)$, note that
$$\partial g(u) = A \, \partial f^*(-A^T u),$$
and recall
$$x \in \partial f^*(-A^T u) \;\Longleftrightarrow\; x \in \operatorname*{argmin}_{z \in \mathbb{R}^n} \; f(z) + u^T A z$$

Therefore the dual subgradient method (for minimizing the negative of the dual objective) starts with an initial dual guess $u^{(0)}$, and repeats for $k = 1, 2, 3, \ldots$
$$x^{(k)} \in \operatorname*{argmin}_{x \in \mathbb{R}^n} \; f(x) + (u^{(k-1)})^T A x$$
$$u^{(k)} = u^{(k-1)} + t_k (A x^{(k)} - b)$$
where $t_k$ are step sizes, chosen in standard ways.

Recall that if $f$ is strictly convex, then $f^*$ is differentiable, and so we get dual gradient ascent, which repeats for $k = 1, 2, 3, \ldots$
$$x^{(k)} = \operatorname*{argmin}_{x \in \mathbb{R}^n} \; f(x) + (u^{(k-1)})^T A x$$
$$u^{(k)} = u^{(k-1)} + t_k (A x^{(k)} - b)$$
(the difference is that $x^{(k)}$ is unique here)
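A minimal sketch of dual gradient ascent (NumPy, made-up data) for an equality-constrained strongly convex quadratic, where the inner argmin has a closed form. We use the conservative constant step $t = d / \|A\|_2^2$, which additionally accounts for the scaling by $A$ (the statement on the next slide takes $t_k = d$):

```python
import numpy as np

# Dual gradient ascent for: min 0.5*x^T Q x - c^T x  s.t.  Ax = b  (Q > 0).
# Inner step: gradient of f(x) + u^T Ax is Qx - c + A^T u = 0 => x = Q^{-1}(c - A^T u).
rng = np.random.default_rng(0)
n, m = 6, 2
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)                 # f strongly convex, parameter d = lambda_min(Q)
c, b = rng.standard_normal(n), rng.standard_normal(m)
A = rng.standard_normal((m, n))

d = np.linalg.eigvalsh(Q).min()
t = d / np.linalg.norm(A, 2) ** 2       # conservative constant step size
u = np.zeros(m)
for _ in range(5000):
    x = np.linalg.solve(Q, c - A.T @ u) # x^(k) = argmin_x f(x) + u^T Ax
    u = u + t * (A @ x - b)             # dual gradient ascent step

print(np.linalg.norm(A @ x - b))        # feasibility residual -> ~ 0
```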

In fact,
$$f \text{ strongly convex with parameter } d \;\Longrightarrow\; \nabla f^* \text{ Lipschitz with parameter } 1/d$$
Check: if $f$ is strongly convex and $x$ is its minimizer, then
$$f(y) \ge f(x) + \frac{d}{2} \|y - x\|^2, \quad \text{all } y$$
Hence, defining $x_u = \nabla f^*(u)$ and $x_v = \nabla f^*(v)$,
$$f(x_v) - u^T x_v \ge f(x_u) - u^T x_u + \frac{d}{2} \|x_u - x_v\|^2$$
$$f(x_u) - v^T x_u \ge f(x_v) - v^T x_v + \frac{d}{2} \|x_u - x_v\|^2$$
Adding these together:
$$d \|x_u - x_v\|^2 \le (u - v)^T (x_u - x_v)$$
Using Cauchy-Schwarz and rearranging:
$$\|x_u - x_v\| \le (1/d) \|u - v\|$$

Applying what we know about gradient descent: if $f$ is strongly convex with parameter $d$, then dual gradient ascent with constant step size $t_k = d$ converges at the rate $O(1/k)$. (Note: this is quite a strong assumption leading to a modest rate!)

Dual generalized gradient ascent and the accelerated dual generalized gradient method carry through in a similar manner.

Disadvantages of dual methods:
- Can be slow to converge (think of the subgradient method)
- Poor convergence properties: even though we may achieve convergence in dual objective value, convergence of $u^{(k)}, x^{(k)}$ to solutions requires strong assumptions (primal iterates $x^{(k)}$ can even end up being infeasible in the limit)

Advantage: decomposability

Dual decomposition

Consider
$$\min_{x \in \mathbb{R}^n} \; \sum_{i=1}^B f_i(x_i) \quad \text{subject to} \quad Ax = b$$
Here $x = (x_1, \ldots, x_B)$ is a division into $B$ blocks of variables, so each $x_i \in \mathbb{R}^{n_i}$. We can also partition $A$ accordingly:
$$A = [A_1, \ldots, A_B], \quad \text{where } A_i \in \mathbb{R}^{m \times n_i}$$
Simple but powerful observation, in the calculation of a (sub)gradient:
$$x^+ \in \operatorname*{argmin}_{x \in \mathbb{R}^n} \; \sum_{i=1}^B f_i(x_i) + u^T A x \;\;\Longleftrightarrow\;\; x_i^+ \in \operatorname*{argmin}_{x_i \in \mathbb{R}^{n_i}} \; f_i(x_i) + u^T A_i x_i, \quad i = 1, \ldots, B$$
I.e., the minimization decomposes into $B$ separate problems.

Dual decomposition algorithm: repeat for $k = 1, 2, 3, \ldots$
$$x_i^{(k)} \in \operatorname*{argmin}_{x_i \in \mathbb{R}^{n_i}} \; f_i(x_i) + (u^{(k-1)})^T A_i x_i, \quad i = 1, \ldots, B$$
$$u^{(k)} = u^{(k-1)} + t_k \Big( \sum_{i=1}^B A_i x_i^{(k)} - b \Big)$$
Can think of these steps as:
- Broadcast: send $u$ to each of the $B$ processors; each optimizes in parallel to find $x_i$
- Gather: collect $A_i x_i$ from each processor, update the global dual variable $u$

[Diagram: dual variable $u$ exchanged between a central node and processors holding $x_1, x_2, x_3$]
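A minimal sketch of dual decomposition (NumPy, made-up data) with two blocks and separable quadratic objectives, so each block subproblem has a closed form:

```python
import numpy as np

# Dual decomposition for: min 0.5*||x1 - c1||^2 + 0.5*||x2 - c2||^2
#                         s.t. A1 x1 + A2 x2 = b.
# Block subproblem: min 0.5*||x_i - c_i||^2 + u^T A_i x_i  =>  x_i = c_i - A_i^T u.
rng = np.random.default_rng(1)
m, n1, n2 = 2, 3, 3
A1, A2 = rng.standard_normal((m, n1)), rng.standard_normal((m, n2))
c1, c2 = rng.standard_normal(n1), rng.standard_normal(n2)
b = rng.standard_normal(m)

u = np.zeros(m)
t = 0.05                                      # small constant step size
for _ in range(5000):
    # Broadcast: each block minimizes f_i(x_i) + u^T A_i x_i in parallel
    x1 = c1 - A1.T @ u
    x2 = c2 - A2.T @ u
    # Gather: dual update with the aggregated residual
    u = u + t * (A1 @ x1 + A2 @ x2 - b)

print(np.linalg.norm(A1 @ x1 + A2 @ x2 - b))  # -> ~ 0 at convergence
```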

Example with inequality constraints:
$$\min_{x \in \mathbb{R}^n} \; \sum_{i=1}^B f_i(x_i) \quad \text{subject to} \quad \sum_{i=1}^B A_i x_i \le b$$
Dual decomposition (the projected subgradient method) repeats for $k = 1, 2, 3, \ldots$
$$x_i^{(k)} \in \operatorname*{argmin}_{x_i \in \mathbb{R}^{n_i}} \; f_i(x_i) + (u^{(k-1)})^T A_i x_i, \quad i = 1, \ldots, B$$
$$v^{(k)} = u^{(k-1)} + t_k \Big( \sum_{i=1}^B A_i x_i^{(k)} - b \Big)$$
$$u^{(k)} = (v^{(k)})_+$$
where $(\cdot)_+$ is componentwise thresholding, $(u_+)_i = \max\{0, u_i\}$.

Price coordination interpretation (from Vandenberghe's lecture notes):
- We have $B$ units in a system; each unit chooses its own decision variable $x_i$ (how to allocate its goods)
- Constraints are limits on shared resources (rows of $A$); each component $u_j$ of the dual variable is the price of resource $j$
- Dual update:
  $$u_j^+ = (u_j - t s_j)_+, \quad j = 1, \ldots, m$$
  where $s = b - \sum_{i=1}^B A_i x_i$ are the slacks
  - Increase price $u_j$ if resource $j$ is over-utilized, $s_j < 0$
  - Decrease price $u_j$ if resource $j$ is under-utilized, $s_j > 0$
  - Never let prices get negative

Augmented Lagrangian

Convergence of dual methods can be greatly improved by utilizing the augmented Lagrangian. Start by transforming the primal:
$$\min_{x \in \mathbb{R}^n} \; f(x) + \frac{\rho}{2} \|Ax - b\|^2 \quad \text{subject to} \quad Ax = b$$
Clearly the extra term $(\rho/2) \|Ax - b\|^2$ does not change the problem. Assuming, e.g., that $A$ has full column rank, the primal objective is strongly convex (with parameter $\rho \, \sigma_{\min}^2(A)$), so the dual objective is differentiable and we can use dual gradient ascent: repeat for $k = 1, 2, 3, \ldots$
$$x^{(k)} = \operatorname*{argmin}_{x \in \mathbb{R}^n} \; f(x) + (u^{(k-1)})^T A x + \frac{\rho}{2} \|Ax - b\|^2$$
$$u^{(k)} = u^{(k-1)} + \rho (A x^{(k)} - b)$$

Note the step size choice $t_k = \rho$, for all $k$, in dual gradient ascent. Why? Since $x^{(k)}$ minimizes $f(x) + (u^{(k-1)})^T Ax + \frac{\rho}{2} \|Ax - b\|^2$ over $x \in \mathbb{R}^n$,
$$0 \in \partial f(x^{(k)}) + A^T \big( u^{(k-1)} + \rho (A x^{(k)} - b) \big) = \partial f(x^{(k)}) + A^T u^{(k)}$$
This is exactly the stationarity condition for the original primal problem; one can show under mild conditions that $Ax^{(k)} - b$ approaches zero (primal iterates approach feasibility), hence in the limit the KKT conditions are satisfied and $x^{(k)}, u^{(k)}$ approach optimality.

Advantage: much better convergence properties.
Disadvantage: not decomposable (separability is compromised by the augmented Lagrangian).

ADMM (Alternating Direction Method of Multipliers): tries for the best of both worlds.
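A minimal sketch (NumPy, made-up data) of the augmented Lagrangian method for a quadratic objective, where the $x$-step has a closed form and the dual step size is fixed at $\rho$:

```python
import numpy as np

# Augmented Lagrangian (method of multipliers) for: min 0.5*||x - c||^2 s.t. Ax = b.
# x-step solves the normal equations (I + rho*A^T A) x = c - A^T u + rho*A^T b.
rng = np.random.default_rng(2)
m, n = 2, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
c = rng.standard_normal(n)

rho = 10.0
u = np.zeros(m)
for _ in range(100):
    # x-step: minimize f(x) + u^T Ax + (rho/2)*||Ax - b||^2
    x = np.linalg.solve(np.eye(n) + rho * A.T @ A,
                        c - A.T @ u + rho * A.T @ b)
    # dual step with fixed step size rho
    u = u + rho * (A @ x - b)

print(np.linalg.norm(A @ x - b))  # primal residual -> 0 quickly
```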

References

- S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2010), "Distributed optimization and statistical learning via the alternating direction method of multipliers"
- L. Vandenberghe, Lecture Notes for EE 236C, UCLA, Spring 2011-2012