Math 273a: Optimization Convex Conjugacy

Instructor: Wotao Yin, Department of Mathematics, UCLA, Fall 2015. Online discussions on piazza.com.

Convex conjugate (the Legendre transform)

Let f be a closed proper convex function. The convex conjugate of f is

f*(y) = sup_{x ∈ dom f} { y^T x − f(x) }.

f* is convex (in y). Reason: for each fixed x, the map y ↦ y^T x − f(x) is linear in y; hence f* is the pointwise supremum, over x ∈ dom f, of functions linear in y. As long as f is proper, f* is proper, closed, and convex.
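A quick numerical sanity check of the definition (my sketch, not from the lecture; NumPy, with a hypothetical helper conjugate_on_grid): approximate the supremum over a grid of x and compare with a known conjugate pair, here f(x) = x²/2, which is its own conjugate.

```python
import numpy as np

def conjugate_on_grid(f, xs, ys):
    """Approximate f*(y) = sup_x {y*x - f(x)} by maximizing over a grid of x."""
    # For each y, evaluate y*x - f(x) at all grid points x and take the max.
    return np.array([np.max(y * xs - f(xs)) for y in ys])

xs = np.linspace(-5, 5, 2001)               # grid standing in for dom f
ys = np.linspace(-2, 2, 9)

f = lambda x: 0.5 * x**2                    # f(x) = x^2/2 is self-conjugate
fstar = conjugate_on_grid(f, xs, ys)
print(np.max(np.abs(fstar - 0.5 * ys**2)))  # ~0: agrees with f*(y) = y^2/2
```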

Geometry

For fixed y and z, consider the linear function g(x) = y^T x − z. The corresponding hyperplane is H = {(x, g(x)) : x ∈ R^n} ⊂ R^{n+1}.
[y; −1] is the downward normal direction of H.
H crosses the (n+1)st axis at g(0) = −z.
y (tilt) and z (height) uniquely define the hyperplane H and the function g.
Our task: for each y, determine z using the function f, thus determining g.

If we set two rules:
1. g(x) = f(x) at some point x, i.e., H intersects epi(f);
2. H is as low as possible, i.e., −z is as small as possible;
then H will be the supporting hyperplane of f.
By rule #1, at some point x ∈ dom f, y^T x − z = f(x), i.e., −z = f(x) − y^T x.
By rule #2, −z = inf_{x ∈ dom f} { f(x) − y^T x }, that is, z = sup_{x ∈ dom f} { y^T x − f(x) }.
Therefore z = f*(y), g(x) = y^T x − f*(y), and H is the supporting hyperplane.

Geometry

1. H intersects epi(f);
2. H is as low as possible
⟹ −f*(y) = inf_{x ∈ dom f} { f(x) − y^T x }.
[Figure: the graph of f(x) and epi(f); the supporting hyperplane with tilt y meets the vertical axis at −f*(y).]
f* almost completely characterizes f. "Almost" because it recovers f only up to closure. This is a consequence of the Hahn-Banach separation theorem.

Relation to Lagrange duality

Consider the convex problem

minimize_x f(x) subject to Ax = b.

Lagrangian: L(x; y) = f(x) − y^T(Ax − b).
Lagrange dual function:

d(y) = inf_{x ∈ dom f} L(x; y) = −sup_{x ∈ dom f} { y^T(Ax − b) − f(x) } = −f*(A^T y) + b^T y.

Lagrange dual problem (given in terms of the convex conjugate f*):

minimize_y f*(A^T y) − b^T y, or equivalently maximize_y b^T y − f*(A^T y).
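As a concrete instance (my example, not from the slides): for f(x) = ½‖x‖², we have f*(z) = ½‖z‖², so the dual of min { ½‖x‖² : Ax = b } is max_y b^T y − ½‖A^T y‖². Both sides have closed-form solutions, and their optimal values coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)

# Primal: min (1/2)||x||^2  s.t. Ax = b   ->   x* = A^T (A A^T)^{-1} b
x_star = A.T @ np.linalg.solve(A @ A.T, b)

# Dual: max b^T y - (1/2)||A^T y||^2      ->   y* solves (A A^T) y = b
y_star = np.linalg.solve(A @ A.T, b)

p_star = 0.5 * x_star @ x_star
d_star = b @ y_star - 0.5 * (A.T @ y_star) @ (A.T @ y_star)
print(p_star, d_star)   # equal: strong duality holds
```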

Exercise

Derive a Lagrange dual problem for

minimize_{x ∈ R^n} f(x) + g(Ax).

Primal and dual subdifferentials

Suppose [y; −1] is the downward normal of the hyperplane touching epi(f) at x*; therefore y = ∇f(x*) when f is differentiable there. In general, for proper closed convex f, y ∈ ∂f(x*). Therefore,

dom f* ⊇ { y : y ∈ ∂f(x), x ∈ dom f }.

Theorem (biconjugation). Let x ∈ dom f and y ∈ dom f*. Then

y ∈ ∂f(x) ⟺ x ∈ ∂f*(y).

If the relation holds, then f(x) + f*(y) = y^T x.
The result is very useful in deriving optimality conditions.

Fenchel's inequality

Theorem (Fenchel's inequality). For arbitrary x ∈ dom f and y ∈ dom f*, we have

f(x) + f*(y) ≥ y^T x.

Proof. Since x is not necessarily the maximizing point in f*(y) = sup_x { y^T x − f(x) }, we have f*(y) ≥ y^T x − f(x).

Theorem. If f is proper, closed, and convex, then (f*)* = f, i.e.,

f(x) = sup_{y ∈ dom f*} { y^T x − f*(y) }.

Proof. Consider the linear function g_{y,z} defined as g_{y,z}(x) = y^T x − z.
Step 1.

g_{y,z} ≤ f ⟺ y^T x − z ≤ f(x), ∀x
⟺ y^T x − f(x) ≤ z, ∀x
⟺ sup_x { y^T x − f(x) } ≤ z
⟺ f*(y) ≤ z
⟺ (y, z) ∈ epi(f*).

Proof (continued).
Step 2. From the Hahn-Banach separation theorem,

f(x) = sup_{y,z} { g_{y,z}(x) : g_{y,z} ≤ f }, ∀x ∈ dom f.

Step 3.

sup_{y,z} { g_{y,z}(x) : g_{y,z} ≤ f } = sup_{y,z} { g_{y,z}(x) : f*(y) ≤ z }  (by Step 1)
= sup_{y,z} { y^T x − z : f*(y) ≤ z }
= sup_y { y^T x − f*(y) }
= (f*)*(x).

Combining Steps 2 and 3, we get f = (f*)*.
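The biconjugation theorem can also be checked numerically (my sketch; a grid variant of the earlier helper that works on sampled values, so equality holds only up to discretization error): conjugating twice returns the function we started from.

```python
import numpy as np

def grid_conjugate(f_vals, xs, ys):
    """Given samples f_vals = f(xs), approximate f*(ys) = sup_x {y*x - f(x)}."""
    return np.array([np.max(y * xs - f_vals) for y in ys])

xs = np.linspace(-4, 4, 1601)
f_vals = np.abs(xs) + 0.1 * xs**2            # a closed proper convex f

ys = np.linspace(-6, 6, 2401)
fstar = grid_conjugate(f_vals, xs, ys)       # f* on a grid of y
fbistar = grid_conjugate(fstar, ys, xs)      # (f*)* back on the x grid

print(np.max(np.abs(fbistar - f_vals)))      # small: (f*)* = f up to grid error
```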

Examples

f(x) = ι_C(x), the indicator function of a nonempty closed convex set C; then

f*(y) = sup_{x ∈ C} y^T x =: σ_C(y)

is the support function of C. Applying the theorem, we get (ι_C)** = ι_C and (σ_C)** = σ_C.

Examples

f(x) = ι_{[−1,1]^n}(x), the indicator function of the unit hypercube; then
f*(y) = sup_{−1 ≤ x ≤ 1} y^T x = ‖y‖_1.
f(x) = ι_{‖x‖_2 ≤ 1}(x); then f*(y) = ‖y‖_2.
f(x) = (1/p)‖x‖_p^p, 1 < p < ∞; then f*(y) = (1/q)‖y‖_q^q, where 1/p + 1/q = 1.
...and lots of smooth examples.
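For instance, the p-norm pair is easy to confirm on a grid in the scalar case (my check, NumPy assumed):

```python
import numpy as np

p, q = 3.0, 1.5                    # conjugate exponents: 1/p + 1/q = 1
xs = np.linspace(-10, 10, 4001)
ys = np.linspace(-3, 3, 13)

# Grid approximation of f*(y) = sup_x { y*x - |x|^p / p }
fstar = np.array([np.max(y * xs - np.abs(xs)**p / p) for y in ys])
print(np.max(np.abs(fstar - np.abs(ys)**q / q)))   # ~0 (up to grid error)
```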

Alternative representation

Previously, we represented f by f* via

f(x) = sup_{y ∈ dom f*} { x^T y − f*(y) }.

We can introduce a more general representation:

f(x) = sup_{y ∈ dom h*} { (Ax − b)^T y − h*(y) } = h(Ax − b),

so that h* may be simpler than f* (or h simpler than f) in form.

Example: f(x) = ‖x‖_1

Let C = { y = [y_1; y_2] : y_1 + y_2 = 1, y_1, y_2 ≥ 0 } (with y_1, y_2 ∈ R^n and 1 the vector of ones) and A = [I; −I]. We have

‖x‖_1 = sup_y { (y_1 − y_2)^T x − ι_C(y) } = sup_y { y^T Ax − ι_C(y) }.

Since (ι_C)*([x_1; x_2]) = 1^T max{x_1, x_2}, where the max is taken entry-wise, the representation sup_{y ∈ dom h*} { (Ax − b)^T y − h*(y) } = h(Ax − b) gives

‖x‖_1 = (ι_C)*(Ax) = 1^T max{x, −x}.

Application: dual smoothing

Idea: strongly convexify h* ⟹ f becomes Lipschitz-differentiable.
Suppose f is represented in terms of h* as

f(x) = sup_{y ∈ dom h*} { y^T(Ax − b) − h*(y) }.

Let us strongly convexify h* by adding a strongly convex function d (a simple choice is d(y) = ½‖y‖²):

ĥ*(y) = h*(y) + μ d(y).

We obtain a Lipschitz-differentiable approximation:

f_μ(x) = sup_{y ∈ dom h*} { y^T(Ax − b) − ĥ*(y) }.

f_μ(x) is differentiable since h*(y) + μ d(y) is strongly convex.

Example: augmented ℓ_1

Primal problem: min { ‖x‖_1 : Ax = b }.
Dual problem: max { b^T y − ι_{[−1,1]^n}(A^T y) }.
f(y) = ι_{[−1,1]^n}(y) is non-differentiable.
Plan: strongly convexify the primal so that the dual becomes smooth and can be quickly solved.
Let f*(x) = ‖x‖_1 and f(y) = ι_{[−1,1]^n}(y) = sup_x { y^T x − f*(x) }. Add (μ/2)‖x‖² to f*(x) and obtain

f_μ(y) = sup_x { y^T x − (‖x‖_1 + (μ/2)‖x‖_2²) } = (1/(2μ)) ‖y − Proj_{[−1,1]^n}(y)‖_2².

f_μ(y) is differentiable; ∇f_μ(y) = (1/μ) shrink(y).
On the other hand, we can also directly smooth f*(x) = ‖x‖_1 and obtain a differentiable f*_μ(x) by adding d(y) to f(y) (see the next slide).
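The closed form for f_μ can be checked coordinate-wise, since the sup is separable (my sketch):

```python
import numpy as np

mu = 0.3
xs = np.linspace(-30, 30, 120001)        # fine 1-D grid (the sup is separable)

def f_mu_grid(y):
    # sup_x { y*x - |x| - (mu/2) x^2 }, approximated over the grid
    return np.max(y * xs - np.abs(xs) - 0.5 * mu * xs**2)

def f_mu_closed(y):
    # (1/(2 mu)) * dist(y, [-1,1])^2
    return max(abs(y) - 1.0, 0.0)**2 / (2.0 * mu)

for y in [-2.5, -0.7, 0.0, 1.4, 3.0]:
    print(y, f_mu_grid(y), f_mu_closed(y))   # the two columns agree
```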

Example: smoothed absolute value

Let x ∈ R. Recall

f(x) = |x| = sup_y { yx − ι_{[−1,1]}(y) }.

Add d(y) = y²/2 to ι_{[−1,1]}(y) and obtain

f_μ(x) = sup_y { yx − (ι_{[−1,1]}(y) + μy²/2) } = { x²/(2μ), |x| ≤ μ;  |x| − μ/2, |x| > μ, }

which is the Huber function.
If |x| ≤ μ, the maximizing y stays within [−1, 1], where ι_{[−1,1]}(y) vanishes, so only the term μy²/2 is left.
If |x| > μ, the maximizing y occurs at ±1, so μy²/2 = μ/2.
The Huber function is used in robust least squares.
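A direct check of the Huber formula (my sketch): maximize yx − μy²/2 over a grid of y ∈ [−1, 1] and compare with the closed form.

```python
import numpy as np

mu = 0.5
ys = np.linspace(-1.0, 1.0, 20001)      # grid over [-1, 1]

def huber(x, mu):
    return x**2 / (2 * mu) if abs(x) <= mu else abs(x) - mu / 2

for x in [-2.0, -0.3, 0.0, 0.2, 1.5]:
    via_sup = np.max(x * ys - mu * ys**2 / 2)   # sup_y { yx - mu y^2/2 }
    print(x, via_sup, huber(x, mu))             # the two values agree
```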

Alternatively, add d(y) = 1 − √(1 − y²), which is well defined and strongly convex on [−1, 1]:

f_μ(x) = sup_y { yx − (ι_{[−1,1]}(y) + μ(1 − √(1 − y²))) } = √(x² + μ²) − μ,

which is used in reweighted least-squares methods.
Recall, for C = { y : y_1 + y_2 = 1, y_1, y_2 ≥ 0 },

|x| = sup_y { (y_1 − y_2)x − ι_C(y) }.

Add the negative entropy d(y) = y_1 log y_1 + y_2 log y_2 + log 2:

f_μ(x) = sup_y { (y_1 − y_2)x − (ι_C(y) + μ d(y)) } = μ log( (e^{x/μ} + e^{−x/μ}) / 2 ).

[Figure: plot of x log x on [0, 2].] x log x is strongly convex on [0, C] for any finite C > 0.

Compare three smoothed functions

[Figure comparing the three smoothed versions of |x| derived above. Courtesy of L. Vandenberghe.]
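In place of the figure, here is a small numerical comparison of the three smoothings (my sketch): all three lie below |x| and deviate from it by at most O(μ).

```python
import numpy as np

mu = 0.5
xgrid = np.linspace(-3, 3, 7)

huber   = np.where(np.abs(xgrid) <= mu, xgrid**2 / (2 * mu), np.abs(xgrid) - mu / 2)
sqrt_sm = np.sqrt(xgrid**2 + mu**2) - mu
lse_sm  = mu * np.log((np.exp(xgrid / mu) + np.exp(-xgrid / mu)) / 2)

for row in zip(xgrid, np.abs(xgrid), huber, sqrt_sm, lse_sm):
    print("x=%5.2f  |x|=%4.2f  huber=%6.3f  sqrt=%6.3f  lse=%6.3f" % row)
# All three smoothings sit below |x| and differ from it by at most O(mu).
```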

Example: smoothed maximum eigenvalue

Let X ∈ S^n. Consider f(X) = λ_max(X), which plays the role of the ℓ_∞ norm on the matrix spectrum. Recall the dual of ℓ_∞ is ℓ_1, and we have

max_i x_i = sup_y { x^T y − ι_{{ y : 1^T y = 1, y ≥ 0 }}(y) }.

Let C = { Y ∈ S^n : tr Y = 1, Y ⪰ 0 }. We have

f(X) = λ_max(X) = sup_Y { ⟨Y, X⟩ − ι_C(Y) }.

Next, strongly convexify ι_C(Y).

Negative entropy of the eigenvalues {λ_i(Y)}:

d(Y) = Σ_{i=1}^n λ_i(Y) log λ_i(Y) + log n.

(Courtesy of L. Vandenberghe.)
Smoothed function:

f_μ(X) = sup_Y { ⟨Y, X⟩ − (ι_C(Y) + μ d(Y)) } = μ log( (1/n) Σ_{i=1}^n e^{λ_i(X)/μ} ).
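A sketch of the smoothed maximum eigenvalue (my code; assumes a symmetric input and uses numpy.linalg.eigvalsh, with the log-sum-exp computed stably):

```python
import numpy as np

def lambda_max_smoothed(X, mu):
    """Soft-max approximation of lambda_max(X) for symmetric X:
       mu * log( (1/n) * sum_i exp(lambda_i(X)/mu) )."""
    lam = np.linalg.eigvalsh(X)          # eigenvalues of the symmetric matrix X
    m = lam.max()
    # factor out the largest eigenvalue for numerical stability
    return m + mu * np.log(np.mean(np.exp((lam - m) / mu)))

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
X = (B + B.T) / 2                        # random symmetric test matrix

for mu in [1.0, 0.1, 0.01]:
    print(mu, lambda_max_smoothed(X, mu), np.linalg.eigvalsh(X).max())
    # as mu -> 0, the smoothed value approaches lambda_max(X) from below
```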

Application: smoothed minimization

Instead of solving min_x f(x), solve

min_x f_μ(x), where f_μ(x) = sup_{y ∈ dom h*} { y^T(Ax + b) − [h*(y) + μ d(y)] },

by gradient descent, with acceleration, line search, etc.

Since h*(y) + μ d(y) is strongly convex, ∇f_μ(x) is given by

∇f_μ(x) = A^T ȳ, where ȳ = argmax_{y ∈ dom h*} { y^T(Ax + b) − [h*(y) + μ d(y)] }.

Theorem. If d(y) is strongly convex with modulus ν > 0, then h*(y) + μ d(y) is strongly convex with modulus at least μν, and ∇f_μ(x) is Lipschitz continuous with constant no more than ‖A‖²/(μν).

Nonsmooth optimization

Examples:

min ‖Ax − b‖_1;  min TV(x) s.t. ‖Ax − b‖_2 ≤ σ.

Worst-case complexity for an ε-approximation:
If f is convex and ∇f is L-Lipschitz, the accelerated gradient method takes O(√(L/ε)) iterations.
If f is convex and nonsmooth with G-Lipschitz f (bounded subgradients), the subgradient method takes O(G²/ε²) iterations.
Smooth optimization has much better complexity.

Nesterov's complexity analysis

1. Construct a smooth approximation satisfying f_μ ≤ f ≤ f_μ + μD, and consequently f(x) − f* ≤ f_μ(x) − f_μ* + μD.
2. Choose μ such that μD ≤ ε/2, i.e., 1/μ ≥ 2D/ε.
3. Minimize f_μ until f_μ(x) − f_μ* ≤ ε/2.
Step 3 has complexity O(√(1/(με))) = O(√D / ε), which can be much better than the subgradient method's O(G²/ε²).
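To make the comparison concrete, here is the arithmetic for illustrative constants (all values below are made-up placeholders, not from the lecture):

```python
import math

G, D, nu = 10.0, 0.5, 1.0                 # illustrative constants only
for eps in [1e-1, 1e-2, 1e-3]:
    mu = eps / (2 * D)                    # step 2: mu * D <= eps/2
    L = 1.0 / (mu * nu)                   # Lipschitz constant of grad f_mu
    k_smooth = math.sqrt(L / (eps / 2))   # ~ O(sqrt(L/eps)) accelerated iterations
    k_subgrad = G**2 / eps**2             # ~ O(G^2/eps^2) subgradient iterations
    print(f"eps={eps:g}  smoothing ~{k_smooth:,.0f} iters  subgradient ~{k_subgrad:,.0f} iters")
```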

Theorem. Consider

h(x) = sup_{y ∈ dom h*} { y^T x − h*(y) },
h_μ(x) = sup_{y ∈ dom h*} { y^T x − (h*(y) + μ d(y)) },

where d(y) ≥ 0 is strongly convex with modulus ν > 0. Then:
1. ∇h_μ is (μν)^{−1}-Lipschitz;
2. if d(y) ≤ D for all y ∈ dom h*, then h_μ(x) ≤ h(x) ≤ h_μ(x) + μD for all x ∈ dom h.

Example: Huber function

Recall

h_μ(x) = sup_y { yx − (ι_{[−1,1]}(y) + μy²/2) } = { x²/(2μ), |x| ≤ μ;  |x| − μ/2, |x| > μ. }

We have

h_μ(x) ≤ |x| ≤ h_μ(x) + μ/2

and

h_μ'(x) = { x/μ, |x| ≤ μ;  sign(x), |x| > μ, }

which is Lipschitz with constant μ^{−1}.
Applied to the ℓ_1 norm:

Σ_i h_μ(x_i) ≤ ‖x‖_1 ≤ Σ_i h_μ(x_i) + nμ/2.
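The gradient is just a clipped identity; a finite-difference check (my sketch):

```python
import numpy as np

def grad_huber(x, mu):
    # x/mu for |x| <= mu, sign(x) for |x| > mu: equivalently a clipped identity
    return np.clip(x / mu, -1.0, 1.0)

mu, h = 0.5, 1e-6
huber = lambda x: np.where(np.abs(x) <= mu, x**2 / (2 * mu), np.abs(x) - mu / 2)
for x in [-1.2, -0.2, 0.3, 2.0]:
    fd = (huber(x + h) - huber(x - h)) / (2 * h)   # numerical derivative
    print(x, fd, grad_huber(x, mu))                # the two columns agree
```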

Robust least squares

Consider

min_x f(x) = ‖Ax − b‖_1.

Representation:

‖Ax − b‖_1 = sup_y { y^T(Ax − b) − ι_C(y) }, where C = { y : ‖y‖_∞ ≤ 1 }.

Add μ d(y) = (μ/2)‖y‖_2² to ι_C(y) and obtain

min_x f_μ(x) = Σ_{i=1}^m h_μ(a_i^T x − b_i).
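Putting the pieces together: a minimal gradient descent on f_μ (my sketch; step size 1/L with L = ‖A‖²/μ from the theorem above, using d(y) = ½‖y‖² so ν = 1; the problem data are synthetic):

```python
import numpy as np

def smoothed_l1_ls(A, b, mu=0.1, iters=500):
    """Minimize f_mu(x) = sum_i huber_mu(a_i^T x - b_i), a smooth
       approximation of ||Ax - b||_1, by gradient descent."""
    L = np.linalg.norm(A, 2)**2 / mu          # Lipschitz constant of grad f_mu
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        r = A @ x - b
        g = A.T @ np.clip(r / mu, -1.0, 1.0)  # grad f_mu(x) = A^T grad_huber(r)
        x -= g / L                            # gradient step of size 1/L
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((40, 10))
x_true = rng.standard_normal(10)
b = A @ x_true
b[::8] += 5.0                                 # corrupt a few entries: gross outliers

x_hat = smoothed_l1_ls(A, b)
print(np.linalg.norm(x_hat - x_true))         # close to x_true: outliers have bounded influence
```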

Other examples and questions

Total variation / analysis ℓ_1 minimization examples
Nuclear norm examples
More than one nonsmooth term...
Stopping criteria