Dual Methods. Lecturer: Ryan Tibshirani, Convex Optimization 10-725/36-725


Last time: proximal Newton method

Consider the problem

    min_x g(x) + h(x)

where g, h are convex, g is twice differentiable, and h is "simple." Proximal Newton method: let x^(0) ∈ R^n, and repeat:

    v^(k) = argmin_v ∇g(x^(k−1))^T v + (1/2) v^T ∇²g(x^(k−1)) v + h(x^(k−1) + v)
    x^(k) = x^(k−1) + t_k v^(k),   k = 1, 2, 3, ...

Step sizes are typically chosen by backtracking. Iterations here are typically very expensive (computing v^(k) is typically a formidable task), but typically very few iterations are needed until convergence: under appropriate conditions, we get local quadratic convergence.

Reminder: conjugate functions

Recall that given f : R^n → R, the function

    f*(y) = max_x y^T x − f(x)

is called its conjugate. Conjugates appear frequently in dual programs, since

    −f*(y) = min_x f(x) − y^T x

If f is closed and convex, then f** = f. Also,

    x ∈ ∂f*(y)  ⟺  y ∈ ∂f(x)  ⟺  x ∈ argmin_z f(z) − y^T z

If f is strictly convex, then

    ∇f*(y) = argmin_z f(z) − y^T z
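As a quick worked example (a standard one, not specific to these slides): take f(x) = (1/2) x^T Q x with Q ≻ 0. Then

    f*(y) = max_x y^T x − (1/2) x^T Q x = (1/2) y^T Q^{−1} y

with the maximum attained at x = Q^{−1} y. Consistent with the last fact, f is strictly convex and ∇f*(y) = Q^{−1} y = argmin_z f(z) − y^T z.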

Proof details

We will show that x ∈ ∂f*(y) ⟺ y ∈ ∂f(x), assuming that f is convex and closed.

Proof of ⟸: Suppose y ∈ ∂f(x). Then x ∈ M_y, the set of maximizers of y^T z − f(z) over z. But f*(y) = max_z y^T z − f(z), and ∂f*(y) = cl(conv(∪_{z ∈ M_y} {z})). Thus x ∈ ∂f*(y).

Proof of ⟹: From what we showed above, if x ∈ ∂f*(y), then y ∈ ∂f**(x), but f** = f.

Clearly y ∈ ∂f(x) ⟺ x ∈ argmin_z f(z) − y^T z.

Lastly, if f is strictly convex, then we know that f(z) − y^T z has a unique minimizer over z, and this must be ∇f*(y).

Outline

Today:
- Dual (sub)gradient methods
- Dual decomposition
- Augmented Lagrangians
- A peek at ADMM

Dual (sub)gradient methods

Even if we can't derive the dual (conjugate) in closed form, we can still use dual-based subgradient or gradient methods.

Example: consider the problem

    min_x f(x)  subject to  Ax = b

Its dual problem is

    max_u −f*(−A^T u) − b^T u

where f* is the conjugate of f. Defining g(u) = −f*(−A^T u) − b^T u, note that

    ∂g(u) = A ∂f*(−A^T u) − b

Therefore, using what we know about conjugates:

    Ax − b ∈ ∂g(u),  where  x ∈ argmin_z f(z) + u^T Az

The dual subgradient method (for maximizing the dual objective) starts with an initial dual guess u^(0), and repeats for k = 1, 2, 3, ...:

    x^(k) ∈ argmin_x f(x) + (u^(k−1))^T Ax
    u^(k) = u^(k−1) + t_k (A x^(k) − b)

Step sizes t_k, k = 1, 2, 3, ..., are chosen in standard ways.
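To make the updates concrete, here is a minimal runnable sketch (not from the slides). It assumes f(x) = (1/2)||x||₂², so that the x-update has the closed form x = −A^T u, and uses made-up test data A, b:

```python
import numpy as np

# Minimal sketch of the dual (sub)gradient method for
#   min_x f(x)  subject to  Ax = b,
# specialized to f(x) = (1/2)||x||_2^2, so that the inner step
# x in argmin_x f(x) + u^T Ax has the closed form x = -A^T u.
# For general f, replace the x-update with any such minimizer.

rng = np.random.default_rng(0)
m, n = 3, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Fixed step size, safe for this quadratic dual: 1/L, where
# L = lambda_max(A A^T) is the Lipschitz constant of the dual gradient
t = 1.0 / np.linalg.eigvalsh(A @ A.T).max()

u = np.zeros(m)
for k in range(1000):
    x = -A.T @ u                 # x^(k): minimizes f(x) + u^T Ax
    u = u + t * (A @ x - b)      # dual ascent step

print("primal residual ||Ax - b||:", np.linalg.norm(A @ x - b))
```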

Recall that if f is strictly convex, then f* is differentiable, and so this becomes dual gradient ascent, which repeats for k = 1, 2, 3, ...:

    x^(k) = argmin_x f(x) + (u^(k−1))^T Ax
    u^(k) = u^(k−1) + t_k (A x^(k) − b)

(The difference is that each x^(k) is unique here.) Again, step sizes t_k, k = 1, 2, 3, ... are chosen in standard ways. Also, proximal gradients and acceleration can be applied as they usually would.

Lipschitz gradients and strong convexity

Assume that f is a closed and convex function. Then f is strongly convex with parameter d ⟺ ∇f* is Lipschitz with parameter 1/d.

Proof of ⟹: Recall that if g is strongly convex with minimizer x, then

    g(y) ≥ g(x) + (d/2) ||y − x||₂²,  for all y

Hence, defining x_u = ∇f*(u) and x_v = ∇f*(v),

    f(x_v) − u^T x_v ≥ f(x_u) − u^T x_u + (d/2) ||x_u − x_v||₂²
    f(x_u) − v^T x_u ≥ f(x_v) − v^T x_v + (d/2) ||x_u − x_v||₂²

Adding these together, using Cauchy-Schwarz, and rearranging shows that

    ||x_u − x_v||₂ ≤ ||u − v||₂ / d

Convergence guarantees

The following results follow from combining the last fact with what we already know about gradient descent:

- If f is strongly convex with parameter d, then dual gradient ascent with fixed step sizes t_k = d, k = 1, 2, 3, ..., converges at the rate O(1/ɛ)
- If f is strongly convex with parameter d, and ∇f is Lipschitz with parameter L, then dual gradient ascent with step sizes t_k = 2/(1/d + 1/L), k = 1, 2, 3, ..., converges at the rate O(log(1/ɛ))

Note that these results describe the convergence of the dual objective to its optimal value.

Dual decomposition

Consider

    min_x Σ_{i=1}^B f_i(x_i)  subject to  Ax = b

Here x = (x_1, ..., x_B) ∈ R^n divides into B blocks of variables, with each x_i ∈ R^{n_i}. We can also partition A accordingly:

    A = [A_1 ... A_B],  where A_i ∈ R^{m×n_i}

A simple but powerful observation, in the calculation of the (sub)gradient, is that the minimization decomposes into B separate problems:

    x^+ ∈ argmin_x Σ_{i=1}^B f_i(x_i) + u^T Ax
    ⟺  x_i^+ ∈ argmin_{x_i} f_i(x_i) + u^T A_i x_i,  i = 1, ..., B

Dual decomposition algorithm: repeat for k = 1, 2, 3, ...

    x_i^(k) ∈ argmin_{x_i} f_i(x_i) + (u^(k−1))^T A_i x_i,  i = 1, ..., B
    u^(k) = u^(k−1) + t_k (Σ_{i=1}^B A_i x_i^(k) − b)

Can think of these steps as:
- Broadcast: send u to each of the B processors, each optimizes in parallel to find x_i
- Gather: collect A_i x_i from each processor, update the global dual variable u

(Figure: a central node exchanging u with processors holding x_1, x_2, x_3.)
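A runnable sketch of this loop (not from the slides), under the hypothetical choice f_i(x_i) = (1/2)||x_i − c_i||₂² so that each block update is closed form:

```python
import numpy as np

# Sketch of dual decomposition for
#   min_x sum_i f_i(x_i)  subject to  sum_i A_i x_i = b,
# with the hypothetical choice f_i(x_i) = (1/2)||x_i - c_i||_2^2,
# so each block update is closed form: x_i = c_i - A_i^T u.
# In a real distributed setting, the x_i-updates would run in
# parallel on B processors ("broadcast"), and the u-update would
# aggregate their A_i x_i contributions ("gather").

rng = np.random.default_rng(1)
m, B, ni = 4, 3, 5
As = [rng.standard_normal((m, ni)) for _ in range(B)]
cs = [rng.standard_normal(ni) for _ in range(B)]
b = rng.standard_normal(m)

Afull = np.hstack(As)                                # A = [A_1 ... A_B]
t = 1.0 / np.linalg.eigvalsh(Afull @ Afull.T).max()  # safe fixed step here

u = np.zeros(m)
for k in range(2000):
    xs = [c - A.T @ u for A, c in zip(As, cs)]       # broadcast: parallel block solves
    r = sum(A @ x for A, x in zip(As, xs)) - b       # gather: primal residual
    u = u + t * r                                    # global dual update

print("||sum_i A_i x_i - b|| =", np.linalg.norm(r))
```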

Dual decomposition with inequality constraints

Consider

    min_x Σ_{i=1}^B f_i(x_i)  subject to  Σ_{i=1}^B A_i x_i ≤ b

Dual decomposition (projected subgradient method): repeat for k = 1, 2, 3, ...

    x_i^(k) ∈ argmin_{x_i} f_i(x_i) + (u^(k−1))^T A_i x_i,  i = 1, ..., B
    u^(k) = (u^(k−1) + t_k (Σ_{i=1}^B A_i x_i^(k) − b))_+

where u_+ denotes the positive part of u, i.e., (u_+)_i = max{0, u_i}, i = 1, ..., m.
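In code, the only change from the equality-constrained sketch above is clipping the dual variable at zero after the ascent step; a hypothetical helper (the names As, xs are illustrative, in the spirit of the previous sketch):

```python
import numpy as np

def projected_dual_update(u, t, As, xs, b):
    """One projected dual subgradient step for sum_i A_i x_i <= b:
    take the usual ascent step, then clip at zero componentwise
    (prices never go negative)."""
    r = sum(A @ x for A, x in zip(As, xs)) - b   # constraint violation
    return np.maximum(u + t * r, 0.0)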

Price coordination interpretation (Vandenberghe):
- Have B units in a system; each unit chooses its own decision variable x_i (how to allocate its goods)
- Constraints are limits on shared resources (rows of A); each component u_j of the dual variable is the price of resource j
- Dual update:

      u_j^+ = (u_j − t s_j)_+,  j = 1, ..., m

  where s = b − Σ_{i=1}^B A_i x_i are slacks:
  - Increase price u_j if resource j is over-utilized, s_j < 0
  - Decrease price u_j if resource j is under-utilized, s_j > 0
  - Never let prices get negative

Augmented Lagrangian method (also known as: method of multipliers)

Disadvantage of dual ascent: it requires strong conditions to ensure convergence. This is improved by the augmented Lagrangian method, also called the method of multipliers. We transform the primal problem:

    min_x f(x) + (ρ/2) ||Ax − b||₂²  subject to  Ax = b

where ρ > 0 is a parameter. This is clearly equivalent to the original problem, and the objective is strongly convex when A has full column rank. Use dual gradient ascent: repeat for k = 1, 2, 3, ...

    x^(k) = argmin_x f(x) + (u^(k−1))^T Ax + (ρ/2) ||Ax − b||₂²
    u^(k) = u^(k−1) + ρ (A x^(k) − b)
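A minimal sketch of this loop (not from the slides), again under the hypothetical choice f(x) = (1/2)||x||₂² so the x-update reduces to a linear solve:

```python
import numpy as np

# Augmented Lagrangian method (method of multipliers) sketch for
#   min_x f(x)  subject to  Ax = b,
# with the hypothetical f(x) = (1/2)||x||_2^2. Setting the gradient
# of f(x) + u^T Ax + (rho/2)||Ax - b||_2^2 to zero gives the solve
#   (I + rho A^T A) x = A^T (rho b - u).

rng = np.random.default_rng(2)
m, n = 3, 8
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

rho = 1.0
u = np.zeros(m)
H = np.eye(n) + rho * A.T @ A
for k in range(100):
    x = np.linalg.solve(H, A.T @ (rho * b - u))  # x^(k)
    u = u + rho * (A @ x - b)                    # dual step, size exactly rho

print("||Ax - b|| =", np.linalg.norm(A @ x - b))
```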

Notice the step size choice t_k = ρ, k = 1, 2, 3, ..., in the dual algorithm. Why? Since x^(k) minimizes f(x) + (u^(k−1))^T Ax + (ρ/2)||Ax − b||₂² over x, we have

    0 ∈ ∂f(x^(k)) + A^T (u^(k−1) + ρ(A x^(k) − b))
      = ∂f(x^(k)) + A^T u^(k)

This is the stationarity condition for the original primal problem; one can show under mild conditions that A x^(k) − b approaches zero (i.e., primal iterates approach feasibility), hence in the limit the KKT conditions are satisfied and x^(k), u^(k) approach optimality.

Advantage: much better convergence properties. Disadvantage: we lose decomposability! (Separability is compromised by the augmented Lagrangian...)

Alternating direction method of multipliers

Alternating direction method of multipliers, or ADMM: the best of both worlds, i.e., we get strong convergence properties along with decomposability. Consider

    min_{x,z} f(x) + g(z)  subject to  Ax + Bz = c

As before, we augment the objective:

    min_{x,z} f(x) + g(z) + (ρ/2) ||Ax + Bz − c||₂²  subject to  Ax + Bz = c

for a parameter ρ > 0. We define the augmented Lagrangian

    L_ρ(x, z, u) = f(x) + g(z) + u^T (Ax + Bz − c) + (ρ/2) ||Ax + Bz − c||₂²

ADMM repeats the steps, for k = 1, 2, 3, ...:

    x^(k) = argmin_x L_ρ(x, z^(k−1), u^(k−1))
    z^(k) = argmin_z L_ρ(x^(k), z, u^(k−1))
    u^(k) = u^(k−1) + ρ (A x^(k) + B z^(k) − c)

Note that the usual method of multipliers would have replaced the first two steps by a joint minimization:

    (x^(k), z^(k)) = argmin_{x,z} L_ρ(x, z, u^(k−1))

Convergence guarantees

Under modest assumptions on f, g (these do not require A, B to be full rank), the ADMM iterates satisfy, for any ρ > 0:

- Residual convergence: r^(k) = A x^(k) + B z^(k) − c → 0 as k → ∞, i.e., primal iterates approach feasibility
- Objective convergence: f(x^(k)) + g(z^(k)) → f⋆ + g⋆, where f⋆ + g⋆ is the optimal primal objective value
- Dual convergence: u^(k) → u⋆, where u⋆ is a dual solution

For details, see Boyd et al. (2010). Note that we do not generically get primal convergence, but this is true under more assumptions.

Convergence rate: not known in general; the theory is currently being developed, e.g., in Hong and Luo (2012), Deng and Yin (2012), Iutzeler et al. (2014), Nishihara et al. (2015). Roughly, it behaves like a first-order method (or a bit faster).

ADMM in scaled form

It is often easier to express the ADMM algorithm in a scaled form, where we replace the dual variable u by a scaled variable w = u/ρ. In this parametrization, the ADMM steps are:

    x^(k) = argmin_x f(x) + (ρ/2) ||Ax + B z^(k−1) − c + w^(k−1)||₂²
    z^(k) = argmin_z g(z) + (ρ/2) ||A x^(k) + Bz − c + w^(k−1)||₂²
    w^(k) = w^(k−1) + A x^(k) + B z^(k) − c

Note that here the kth iterate w^(k) is just given by a running sum of residuals:

    w^(k) = w^(0) + Σ_{i=1}^k (A x^(i) + B z^(i) − c)
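As a concrete instance of the scaled form (a standard example, not from these slides): the lasso, min_x (1/2)||Mx − b||₂² + λ||z||₁ subject to x − z = 0, i.e., A = I, B = −I, c = 0 above; M, b, λ are made-up test data:

```python
import numpy as np

# Scaled-form ADMM sketch on the lasso:
#   min (1/2)||Mx - b||_2^2 + lam*||z||_1  subject to  x - z = 0.
# The x-update is a ridge-type linear solve; the z-update is the
# prox of the l1 norm, i.e. elementwise soft-thresholding.

rng = np.random.default_rng(3)
m, n = 20, 10
M = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam, rho = 0.5, 1.0

x, z, w = np.zeros(n), np.zeros(n), np.zeros(n)
H = M.T @ M + rho * np.eye(n)
for k in range(200):
    x = np.linalg.solve(H, M.T @ b + rho * (z - w))          # x-update
    v = x + w
    z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # z-update
    w = w + x - z                                            # running residual sum

print("||x - z|| =", np.linalg.norm(x - z))
```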

Example: alternating projections

Consider finding a point in the intersection of convex sets C, D ⊆ R^n:

    min_x I_C(x) + I_D(x)

To get this into ADMM form, we express it as

    min_{x,z} I_C(x) + I_D(z)  subject to  x − z = 0

Each ADMM cycle involves two projections:

    x^(k) = P_C(z^(k−1) − w^(k−1))
    z^(k) = P_D(x^(k) + w^(k−1))
    w^(k) = w^(k−1) + x^(k) − z^(k)

This is like the classical alternating projections method, but more efficient.
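A runnable sketch of this cycle, with a hypothetical pair of sets chosen so both projections are trivial: C the nonnegative orthant and D the hyperplane {x : a^T x = 1} (their intersection is nonempty):

```python
import numpy as np

# ADMM cycle above, for finding a point in C ∩ D. Hypothetical
# sets with easy projections:
#   C = {x : x >= 0},   D = {x : a^T x = 1}.

n = 5
a = np.ones(n)

def P_C(v):                       # projection onto the orthant
    return np.maximum(v, 0.0)

def P_D(v):                       # projection onto the hyperplane
    return v + (1.0 - a @ v) / (a @ a) * a

x, z, w = np.zeros(n), np.zeros(n), np.zeros(n)
for k in range(100):
    x = P_C(z - w)                # x-update
    z = P_D(x + w)                # z-update
    w = w + x - z                 # scaled dual update

print("x =", x, " a^T x =", a @ x)
```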

References

- S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2010), "Distributed optimization and statistical learning via the alternating direction method of multipliers"
- W. Deng and W. Yin (2012), "On the global and linear convergence of the generalized alternating direction method of multipliers"
- M. Hong and Z. Luo (2012), "On the linear convergence of the alternating direction method of multipliers"
- F. Iutzeler, P. Bianchi, Ph. Ciblat, and W. Hachem (2014), "Linear convergence rate for distributed optimization with the alternating direction method of multipliers"
- R. Nishihara, L. Lessard, B. Recht, A. Packard, and M. Jordan (2015), "A general analysis of the convergence of ADMM"
- L. Vandenberghe, Lecture Notes for EE 236C, UCLA, Spring 2011-2012