Subgradients: subgradients, strong and weak subgradient calculus, optimality conditions via subgradients, directional derivatives


Subgradients

- subgradients
- strong and weak subgradient calculus
- optimality conditions via subgradients
- directional derivatives

Prof. S. Boyd, EE364b, Stanford University

Basic inequality

- recall the basic inequality for convex differentiable f:

      f(y) ≥ f(x) + ∇f(x)^T (y − x)

- the first-order approximation of f at x is a global underestimator
- (∇f(x), −1) supports epi f at (x, f(x))
- what if f is not differentiable?

Subgradient of a function

g is a subgradient of f (not necessarily convex) at x if

    f(y) ≥ f(x) + g^T (y − x)   for all y

[figure: f with affine underestimators f(x_1) + g_1^T (x − x_1),
f(x_2) + g_2^T (x − x_2), and f(x_2) + g_3^T (x − x_2)]

g_2, g_3 are subgradients at x_2; g_1 is a subgradient at x_1
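The defining inequality is easy to test numerically. A minimal illustrative sketch (not from the slides; the function, points, and grid are my choice), using f(x) = |x|, whose subgradients at 0 are exactly the interval [−1, 1]:

```python
import numpy as np

def is_subgradient(f, g, x, trial_points):
    # numerically check f(y) >= f(x) + g*(y - x) on a grid of trial points
    return all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in trial_points)

f = abs                                      # f(x) = |x|, nondifferentiable at 0
ys = np.linspace(-2.0, 2.0, 401)

assert is_subgradient(f, 0.5, 0.0, ys)       # any g in [-1, 1] works at x = 0
assert is_subgradient(f, -1.0, 0.0, ys)
assert not is_subgradient(f, 1.5, 0.0, ys)   # g outside [-1, 1] fails
assert is_subgradient(f, 1.0, 1.0, ys)       # at x = 1 the only subgradient is 1
```

The tolerance 1e-12 guards against floating-point rounding; the check is of course only over sampled y, not a proof.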

- g is a subgradient of f at x iff (g, −1) supports epi f at (x, f(x))
- g is a subgradient iff f(x) + g^T (y − x) is a global (affine) underestimator of f
- if f is convex and differentiable, ∇f(x) is a subgradient of f at x
- subgradients come up in several contexts:
  - algorithms for nondifferentiable convex optimization
  - convex analysis, e.g., optimality conditions, duality for nondifferentiable problems

(if f(y) ≤ f(x) + g^T (y − x) for all y, then g is a supergradient)

Example

f = max{f_1, f_2}, with f_1, f_2 convex and differentiable

[figure: f_1, f_2, and f = max{f_1, f_2}, with crossing point x_0]

- f_1(x_0) > f_2(x_0): unique subgradient g = ∇f_1(x_0)
- f_2(x_0) > f_1(x_0): unique subgradient g = ∇f_2(x_0)
- f_1(x_0) = f_2(x_0): subgradients form a line segment [∇f_1(x_0), ∇f_2(x_0)]

Subdifferential

- the set of all subgradients of f at x is called the subdifferential of f at x, denoted ∂f(x)
- ∂f(x) is a closed convex set (can be empty)

if f is convex:
- ∂f(x) is nonempty, for x ∈ relint dom f
- ∂f(x) = {∇f(x)}, if f is differentiable at x
- if ∂f(x) = {g}, then f is differentiable at x and g = ∇f(x)

Example

f(x) = |x|

[figure: lefthand plot shows f(x) = |x|; righthand plot shows {(x, g) | g ∈ ∂f(x), x ∈ R},
i.e., ∂f(x) = {−1} for x < 0, ∂f(0) = [−1, 1], and ∂f(x) = {1} for x > 0]

Subgradient calculus

- weak subgradient calculus: formulas for finding one subgradient g ∈ ∂f(x)
- strong subgradient calculus: formulas for finding the whole subdifferential ∂f(x), i.e., all subgradients of f at x
- many algorithms for nondifferentiable convex optimization require only one subgradient at each step, so weak calculus suffices
- some algorithms, optimality conditions, etc., need the whole subdifferential
- roughly speaking: if you can compute f(x), you can usually compute a g ∈ ∂f(x)
- we'll assume that f is convex, and x ∈ relint dom f

Some basic rules

- ∂f(x) = {∇f(x)} if f is differentiable at x
- scaling: ∂(αf) = α ∂f (if α > 0)
- addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2 (RHS is addition of sets)
- affine transformation of variables: if g(x) = f(Ax + b), then ∂g(x) = A^T ∂f(Ax + b)
- finite pointwise maximum: if f = max_{i=1,...,m} f_i, then

      ∂f(x) = Co ∪ {∂f_i(x) | f_i(x) = f(x)},

  i.e., the convex hull of the union of subdifferentials of the active functions at x

f(x) = max{f_1(x), ..., f_m(x)}, with f_1, ..., f_m differentiable:

    ∂f(x) = Co {∇f_i(x) | f_i(x) = f(x)}

example: f(x) = ‖x‖_1 = max{s^T x | s_i ∈ {−1, 1}}

[figure: ∂f(x) at x = (0, 0) is the square [−1, 1]^2; at x = (1, 0) it is the
segment {1} × [−1, 1]; at x = (1, 1) it is the single point (1, 1)]
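As an illustrative sketch of this rule (my own code, not from the slides), np.sign(x) picks one element of ∂‖x‖_1: it returns sign(x_i) on nonzero coordinates and 0 on the ambiguous ones, and 0 ∈ [−1, 1]. The subgradient inequality can then be spot-checked at the three points in the figure:

```python
import numpy as np

def l1_subgradient(x):
    # one subgradient of f(x) = ||x||_1: s_i = sign(x_i) where x_i != 0,
    # any s_i in [-1, 1] where x_i == 0; np.sign picks s_i = 0 there
    return np.sign(x)

def inequality_holds(x, g, trials):
    # check ||y||_1 >= ||x||_1 + g'(y - x) on sampled points y
    fx = np.abs(x).sum()
    return all(np.abs(y).sum() >= fx + g @ (y - x) - 1e-12 for y in trials)

rng = np.random.default_rng(0)
trials = rng.normal(size=(1000, 2))
for pt in ([0.0, 0.0], [1.0, 0.0], [1.0, 1.0]):   # the points in the figure
    x = np.array(pt)
    assert inequality_holds(x, l1_subgradient(x), trials)
```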

Pointwise supremum

if f = sup_{α ∈ A} f_α,

    cl Co ∪ {∂f_β(x) | f_β(x) = f(x)} ⊆ ∂f(x)

(usually get equality, but requires some technical conditions to hold, e.g., A compact, f_α continuous in x and α)

roughly speaking, ∂f(x) is the closure of the convex hull of the union of subdifferentials of the active functions

Weak rule for pointwise supremum

f = sup_{α ∈ A} f_α

- find any β for which f_β(x) = f(x) (assuming the supremum is achieved)
- choose any g ∈ ∂f_β(x)
- then, g ∈ ∂f(x)

example: f(x) = λ_max(A(x)) = sup_{‖y‖_2 = 1} y^T A(x) y

where A(x) = A_0 + x_1 A_1 + ... + x_n A_n, A_i ∈ S^k

- f is the pointwise supremum of g_y(x) = y^T A(x) y over ‖y‖_2 = 1
- g_y is affine in x, with ∇g_y(x) = (y^T A_1 y, ..., y^T A_n y)
- hence,

      ∂f(x) ⊇ Co {∇g_y | A(x) y = λ_max(A(x)) y, ‖y‖_2 = 1}

  (in fact equality holds here)
- to find one subgradient at x, can choose any unit eigenvector y associated with λ_max(A(x)); then (y^T A_1 y, ..., y^T A_n y) ∈ ∂f(x)
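This recipe translates directly into code. A sketch under my own random data (the matrices, dimensions, and seed are invented for illustration), using the fact that np.linalg.eigh returns eigenvalues in ascending order:

```python
import numpy as np

rng = np.random.default_rng(1)
def sym(M):
    return (M + M.T) / 2

k, n = 4, 3
A0 = sym(rng.normal(size=(k, k)))
As = [sym(rng.normal(size=(k, k))) for _ in range(n)]

def A(x):
    return A0 + sum(xi * Ai for xi, Ai in zip(x, As))

def f(x):
    # f(x) = lambda_max(A(x)); eigvalsh returns ascending eigenvalues
    return np.linalg.eigvalsh(A(x))[-1]

def subgrad(x):
    # unit eigenvector y for lambda_max(A(x)); then (y'A1y, ..., y'Any) in df(x)
    _, V = np.linalg.eigh(A(x))
    y = V[:, -1]
    return np.array([y @ Ai @ y for Ai in As])

x = rng.normal(size=n)
g = subgrad(x)
# spot-check the subgradient inequality f(z) >= f(x) + g'(z - x)
for z in rng.normal(size=(200, n)):
    assert f(z) >= f(x) + g @ (z - x) - 1e-9
```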

Expectation

f(x) = E f(x, u), with f convex in x for each u, u a random variable

- for each u, choose any g_u ∈ ∂f(x, u) (so u ↦ g_u is a function)
- then, g = E g_u ∈ ∂f(x)

Monte Carlo method for (approximately) computing f(x) and a g ∈ ∂f(x):

- generate independent samples u_1, ..., u_K from the distribution of u
- f(x) ≈ (1/K) ∑_{i=1}^K f(x, u_i)
- for each i choose g_i ∈ ∂_x f(x, u_i)
- g = (1/K) ∑_{i=1}^K g_i is an (approximate) subgradient (more on this later)
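A small worked instance of the Monte Carlo recipe (the example f(x) = E|x − u| with u ~ N(0,1) is my choice, not from the slides). A subgradient of the integrand at x is sign(x − u), and here the exact expectation is known in closed form, E sign(x − u) = 2Φ(x) − 1 = erf(x/√2), so the sample average can be checked against it:

```python
from math import erf, sqrt

import numpy as np

rng = np.random.default_rng(2)
K = 200_000
u = rng.standard_normal(K)            # samples u_1, ..., u_K

x = 0.7
f_hat = np.abs(x - u).mean()          # (1/K) sum_i f(x, u_i)
g_hat = np.sign(x - u).mean()         # (1/K) sum_i g_i, with g_i = sign(x - u_i)

g_exact = erf(x / sqrt(2))            # E sign(x - u) = 2*Phi(x) - 1
assert abs(g_hat - g_exact) < 0.01    # Monte Carlo error is O(1/sqrt(K))
```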

Minimization

define g(y) as the optimal value of

    minimize    f_0(x)
    subject to  f_i(x) ≤ y_i,  i = 1, ..., m

(f_i convex; variable x)

with λ* an optimal dual variable, we have

    g(z) ≥ g(y) − ∑_{i=1}^m λ*_i (z_i − y_i)

i.e., −λ* is a subgradient of g at y

Composition

f(x) = h(f_1(x), ..., f_k(x)), with h convex nondecreasing, f_i convex

- find q ∈ ∂h(f_1(x), ..., f_k(x)), g_i ∈ ∂f_i(x)
- then, g = q_1 g_1 + ... + q_k g_k ∈ ∂f(x)
- reduces to the standard formula when h, f_i are differentiable

proof:

    f(y) = h(f_1(y), ..., f_k(y))
         ≥ h(f_1(x) + g_1^T (y − x), ..., f_k(x) + g_k^T (y − x))
         ≥ h(f_1(x), ..., f_k(x)) + q^T (g_1^T (y − x), ..., g_k^T (y − x))
         = f(x) + g^T (y − x)

Subgradients and sublevel sets

g is a subgradient at x means f(y) ≥ f(x) + g^T (y − x)

hence f(y) ≤ f(x) ⟹ g^T (y − x) ≤ 0

[figure: sublevel set {y | f(y) ≤ f(x_0)} with a subgradient g ∈ ∂f(x_0) at x_0,
and ∇f(x_1) at a differentiable point x_1]

- f differentiable at x_0: ∇f(x_0) is normal to the sublevel set {x | f(x) ≤ f(x_0)}
- f nondifferentiable at x_0: a subgradient defines a supporting hyperplane to the sublevel set through x_0

Quasigradients

g ≠ 0 is a quasigradient of f at x if

    g^T (y − x) ≥ 0 ⟹ f(y) ≥ f(x)

holds for all y

[figure: the set {y | f(y) ≥ f(x)} and a quasigradient g at x]

quasigradients at x form a cone

example: f(x) = (a^T x + b)/(c^T x + d), dom f = {x | c^T x + d > 0}

g = a − f(x_0) c is a quasigradient at x_0

proof: for c^T x + d > 0,

    a^T (x − x_0) ≥ f(x_0) c^T (x − x_0) ⟹ f(x) ≥ f(x_0)

example: f(a) = degree of a_1 + a_2 t + ... + a_n t^{n−1}, i.e.,

    f(a) = min {i | a_{i+2} = ... = a_n = 0}

g = sign(a_{k+1}) e_{k+1} (with k = f(a)) is a quasigradient at a ≠ 0

proof:

    g^T (b − a) = sign(a_{k+1}) b_{k+1} − |a_{k+1}| ≥ 0

implies b_{k+1} ≠ 0, hence f(b) ≥ k = f(a)

Optimality conditions: unconstrained

recall for f convex, differentiable:

    f(x*) = inf_x f(x)  ⟺  0 = ∇f(x*)

generalization to nondifferentiable convex f:

    f(x*) = inf_x f(x)  ⟺  0 ∈ ∂f(x*)

[figure: f with 0 ∈ ∂f(x_0) at the minimizer x_0]

proof. by definition (!)

    f(y) ≥ f(x*) + 0^T (y − x*) for all y  ⟺  0 ∈ ∂f(x*)

... seems trivial but isn't

Example: piecewise linear minimization

f(x) = max_{i=1,...,m} (a_i^T x + b_i)

x* minimizes f

⟺ 0 ∈ ∂f(x*) = Co {a_i | a_i^T x* + b_i = f(x*)}

⟺ there is a λ with

    λ ⪰ 0,  1^T λ = 1,  ∑_{i=1}^m λ_i a_i = 0

where λ_i = 0 if a_i^T x* + b_i < f(x*)

... but these are the KKT conditions for the epigraph form

    minimize    t
    subject to  a_i^T x + b_i ≤ t,  i = 1, ..., m

with dual

    maximize    b^T λ
    subject to  λ ⪰ 0,  A^T λ = 0,  1^T λ = 1
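For a concrete (invented) 1-D instance of the optimality condition, take three lines with slopes a = (−1, 0.5, 2); the minimizer sits where the first two lines cross, and a convex combination of the two active slopes is zero, with dual value b^T λ equal to the optimal value:

```python
import numpy as np

# f(x) = max(-x, 0.5*x + 0.25, 2*x - 1), an illustrative 1-D instance
a = np.array([-1.0, 0.5, 2.0])
b = np.array([0.0, 0.25, -1.0])

def f(x):
    return np.max(a * x + b)

x_star = -1/6                                     # crossing point of lines 1 and 2
active = np.isclose(a * x_star + b, f(x_star))    # lines 1 and 2 are active

# lambda >= 0, 1'lambda = 1, lambda_i = 0 on inactive lines, sum lambda_i a_i = 0
lam = np.array([1/3, 2/3, 0.0])

assert np.isclose(lam @ a, 0.0)          # sum_i lambda_i a_i = 0
assert np.all(lam[~active] == 0)         # complementary slackness
assert np.isclose(lam @ b, f(x_star))    # dual value b'lambda equals f(x*)
```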

Optimality conditions: constrained

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1, ..., m

we assume
- f_i convex, defined on R^n (hence subdifferentiable)
- strict feasibility (Slater's condition)

x* is primal optimal (λ* is dual optimal) iff

    f_i(x*) ≤ 0,  λ*_i ≥ 0
    0 ∈ ∂f_0(x*) + ∑_{i=1}^m λ*_i ∂f_i(x*)
    λ*_i f_i(x*) = 0

... generalizes KKT for nondifferentiable f_i

Directional derivative

the directional derivative of f at x in the direction δx is

    f′(x; δx) = lim_{h ↘ 0} (f(x + h δx) − f(x)) / h

can be +∞ or −∞

- f convex, finite near x ⟹ f′(x; δx) exists
- f differentiable at x if and only if, for some g (= ∇f(x)) and all δx,
  f′(x; δx) = g^T δx (i.e., f′(x; δx) is a linear function of δx)

Directional derivative and subdifferential

general formula for convex f:

    f′(x; δx) = sup_{g ∈ ∂f(x)} g^T δx

[figure: ∂f(x) and a direction δx; the supremum is attained at the extreme
point of ∂f(x) in the direction δx]
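The formula can be checked on f(x) = ‖x‖_1 at x = 0 (my own example): there ∂f(0) = [−1, 1]^n, so sup_{g ∈ ∂f(0)} g^T δx = ‖δx‖_1, and by positive homogeneity the difference quotient equals the limit exactly:

```python
import numpy as np

def f(x):
    # f(x) = ||x||_1
    return np.abs(x).sum()

rng = np.random.default_rng(4)
x = np.zeros(3)
h = 1e-6
for dx in rng.normal(size=(20, 3)):
    # difference quotient at x = 0; equals ||dx||_1 for every h > 0
    fd = (f(x + h * dx) - f(x)) / h
    assert abs(fd - np.abs(dx).sum()) < 1e-9
```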

Descent directions

δx is a descent direction for f at x if f′(x; δx) < 0

for differentiable f, δx = −∇f(x) is always a descent direction (except when it is zero)

warning: for nondifferentiable (convex) functions, δx = −g, with g ∈ ∂f(x), need not be a descent direction

example: f(x) = |x_1| + 2|x_2|

[figure: level curves of f and a subgradient g at a point x on the x_1-axis]
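The warning can be verified numerically. In the example f(x) = |x_1| + 2|x_2|, at the (arbitrarily chosen) point x = (1, 0) the subdifferential is {1} × [−2, 2]; picking g = (1, 2) from it, a step along −g strictly increases f:

```python
import numpy as np

def f(x):
    return abs(x[0]) + 2 * abs(x[1])

x = np.array([1.0, 0.0])
g = np.array([1.0, 2.0])          # g in df(x) = {1} x [-2, 2]

# g really is a subgradient: f(y) >= f(x) + g'(y - x) on sampled y
rng = np.random.default_rng(3)
for y in rng.normal(size=(500, 2)):
    assert f(y) >= f(x) + g @ (y - x) - 1e-12

# yet -g is not a descent direction: f(x - t*g) = (1 - t) + 4t = 1 + 3t
for t in [1e-3, 1e-2, 1e-1]:
    assert f(x - t * g) > f(x)
```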

Subgradients and distance to sublevel sets

if f is convex, f(z) < f(x), g ∈ ∂f(x), then for small t > 0,

    ‖x − t g − z‖_2 < ‖x − z‖_2

- thus −g is a descent direction for ‖x − z‖_2, for any z with f(z) < f(x) (e.g., x*)
- the negative subgradient is a descent direction for the distance to an optimal point

proof:

    ‖x − t g − z‖_2^2 = ‖x − z‖_2^2 − 2 t g^T (x − z) + t^2 ‖g‖_2^2
                      ≤ ‖x − z‖_2^2 − 2 t (f(x) − f(z)) + t^2 ‖g‖_2^2
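A quick numerical sketch of this property (the function, point, and step are my choice): for f(x) = |x_1| + 2|x_2| with subgradient g = (1, 2) at x = (1, 0), a small step along −g increases f, yet it moves x closer to the minimizer z = 0, exactly as the proof promises for t < 2(f(x) − f(z))/‖g‖_2^2:

```python
import numpy as np

def f(x):
    return abs(x[0]) + 2 * abs(x[1])

x = np.array([1.0, 0.0])
g = np.array([1.0, 2.0])                 # subgradient of f at x
z = np.array([0.0, 0.0])                 # f(z) = 0 < f(x) = 1 (the minimizer)

t = 0.05                                 # small: t < 2*(f(x)-f(z))/||g||^2 = 0.4
assert f(x - t * g) > f(x)                                       # not a descent step for f
assert np.linalg.norm(x - t * g - z) < np.linalg.norm(x - z)     # but closer to z
```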

Descent directions and optimality

fact: for f convex, finite near x, either
- 0 ∈ ∂f(x) (in which case x minimizes f), or
- there is a descent direction for f at x

i.e., x is optimal (minimizes f) iff there is no descent direction for f at x

proof: define δx_sd = −argmin_{z ∈ ∂f(x)} ‖z‖

if δx_sd = 0, then 0 ∈ ∂f(x), so x is optimal; otherwise

    f′(x; δx_sd) = −(inf_{z ∈ ∂f(x)} ‖z‖)^2 < 0,

so δx_sd is a descent direction

[figure: ∂f(x) and its minimum-norm element, which defines δx_sd]

the idea extends to the constrained case (feasible descent direction)