Subgradients

• subgradients
• strong and weak subgradient calculus
• optimality conditions via subgradients
• directional derivatives

Prof. S. Boyd, EE364b, Stanford University
Basic inequality

recall the basic inequality for convex, differentiable f:

    f(y) ≥ f(x) + ∇f(x)^T (y − x)

• the first-order approximation of f at x is a global underestimator
• (∇f(x), −1) supports epi f at (x, f(x))
• what if f is not differentiable?
Subgradient of a function

g is a subgradient of f (not necessarily convex) at x if

    f(y) ≥ f(x) + g^T (y − x)   for all y

[figure: f with affine underestimators f(x_1) + g_1^T (x − x_1), f(x_2) + g_2^T (x − x_2), and f(x_2) + g_3^T (x − x_2)]

g_2, g_3 are subgradients at x_2; g_1 is a subgradient at x_1
• g is a subgradient of f at x iff (g, −1) supports epi f at (x, f(x))
• g is a subgradient iff f(x) + g^T (y − x) is a global (affine) underestimator of f
• if f is convex and differentiable, ∇f(x) is a subgradient of f at x
• subgradients come up in several contexts:
  - algorithms for nondifferentiable convex optimization
  - convex analysis, e.g., optimality conditions, duality for nondifferentiable problems

(if f(y) ≤ f(x) + g^T (y − x) for all y, then g is a supergradient)
Example

f = max{f_1, f_2}, with f_1, f_2 convex and differentiable

[figure: f_1, f_2, and f = max{f_1, f_2} near the crossing point x_0]

• f_1(x_0) > f_2(x_0): unique subgradient g = ∇f_1(x_0)
• f_2(x_0) > f_1(x_0): unique subgradient g = ∇f_2(x_0)
• f_1(x_0) = f_2(x_0): subgradients form a line segment [∇f_1(x_0), ∇f_2(x_0)]
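A minimal sketch, not part of the slides, of this case analysis in code; the two differentiable functions f_1(x) = (1/2)‖x‖_2^2 and f_2(x) = 1^T x are illustrative choices:

```python
import numpy as np

# One subgradient of f = max{f1, f2}, following the case analysis above.
# f1 and f2 are illustrative differentiable convex functions.

def f1(x): return 0.5 * float(x @ x)       # (1/2)||x||_2^2
def grad_f1(x): return x                    # gradient of f1

def f2(x): return float(np.sum(x))          # 1^T x
def grad_f2(x): return np.ones_like(x)      # gradient of f2

def subgrad_max(x, theta=0.5):
    """Return one subgradient of max{f1, f2} at x.

    When f1(x) = f2(x), any point on the segment
    [grad f1(x), grad f2(x)] works; theta in [0, 1] picks one.
    """
    if f1(x) > f2(x):
        return grad_f1(x)
    if f2(x) > f1(x):
        return grad_f2(x)
    return theta * grad_f1(x) + (1.0 - theta) * grad_f2(x)

print(subgrad_max(np.array([1.0, 1.0])))    # f2 is active here: [1., 1.]
```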
Subdifferential

• the set of all subgradients of f at x is called the subdifferential of f at x, denoted ∂f(x)
• ∂f(x) is a closed convex set (can be empty)

if f is convex:

• ∂f(x) is nonempty, for x ∈ relint dom f
• ∂f(x) = {∇f(x)}, if f is differentiable at x
• if ∂f(x) = {g}, then f is differentiable at x and g = ∇f(x)
Example

f(x) = |x|

[figure: lefthand plot shows f(x) = |x|; righthand plot shows {(x, ∂f(x)) | x ∈ R}]
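Written out, the subdifferential plotted on the right is

```latex
\partial f(x) =
\begin{cases}
\{-1\}, & x < 0, \\
[-1,\,1], & x = 0, \\
\{+1\}, & x > 0.
\end{cases}
```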
Subgradient calculus

• weak subgradient calculus: formulas for finding one subgradient g ∈ ∂f(x)
• strong subgradient calculus: formulas for finding the whole subdifferential ∂f(x), i.e., all subgradients of f at x
• many algorithms for nondifferentiable convex optimization require only one subgradient at each step, so weak calculus suffices
• some algorithms, optimality conditions, etc., need the whole subdifferential
• roughly speaking: if you can compute f(x), you can usually compute a g ∈ ∂f(x)
• we'll assume that f is convex, and x ∈ relint dom f
Some basic rules

• ∂f(x) = {∇f(x)} if f is differentiable at x
• scaling: ∂(αf) = α ∂f (if α > 0)
• addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2 (RHS is addition of sets)
• affine transformation of variables: if g(x) = f(Ax + b), then ∂g(x) = A^T ∂f(Ax + b)
• finite pointwise maximum: if f = max_{i=1,...,m} f_i, then

      ∂f(x) = Co ∪ { ∂f_i(x) | f_i(x) = f(x) },

  i.e., convex hull of union of subdifferentials of active functions at x
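A small code sketch, not from the slides, of how these weak-calculus rules compose; the helpers scale, add, affine and the final example h are illustrative:

```python
import numpy as np

# Each rule turns subgradient oracles into a subgradient oracle
# for the new function (weak calculus: one subgradient suffices).

def scale(alpha, sub_f):
    # subgradient of alpha * f (alpha > 0)
    return lambda x: alpha * sub_f(x)

def add(sub_f1, sub_f2):
    # subgradient of f1 + f2
    return lambda x: sub_f1(x) + sub_f2(x)

def affine(A, b, sub_f):
    # subgradient of x -> f(A x + b)
    return lambda x: A.T @ sub_f(A @ x + b)

# example: h(x) = ||A x + b||_1 + 2*||x||_1,
# using sign(.) as one subgradient of the l1 norm
sub_l1 = lambda y: np.sign(y)
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
b = np.array([1.0, -1.0])
sub_h = add(affine(A, b, sub_l1), scale(2.0, sub_l1))
print(sub_h(np.zeros(2)))   # one subgradient of h at 0: [1., 1.]
```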
f(x) = max{f_1(x), ..., f_m(x)}, with f_1, ..., f_m differentiable

    ∂f(x) = Co{ ∇f_i(x) | f_i(x) = f(x) }

example: f(x) = ‖x‖_1 = max{ s^T x | s_i ∈ {−1, 1} }

[figure: ∂f(x) at x = (0, 0), at x = (1, 0), and at x = (1, 1)]
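Concretely, the three subdifferentials pictured are

```latex
\partial f(0,0) = [-1,1]\times[-1,1], \qquad
\partial f(1,0) = \{1\}\times[-1,1], \qquad
\partial f(1,1) = \{(1,1)\}.
```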
Pointwise supremum

if f = sup_{α ∈ A} f_α,

    cl Co ∪ { ∂f_β(x) | f_β(x) = f(x) } ⊆ ∂f(x)

(usually get equality, but this requires some technical conditions to hold, e.g., A compact, f_α continuous in x and α)

roughly speaking, ∂f(x) is the closure of the convex hull of the union of subdifferentials of active functions
Weak rule for pointwise supremum

f = sup_{α ∈ A} f_α

• find any β for which f_β(x) = f(x) (assuming the supremum is achieved)
• choose any g ∈ ∂f_β(x)
• then, g ∈ ∂f(x)
example: f(x) = λ_max(A(x)) = sup_{‖y‖_2 = 1} y^T A(x) y

where A(x) = A_0 + x_1 A_1 + ··· + x_n A_n, A_i ∈ S^k

• f is the pointwise supremum of g_y(x) = y^T A(x) y over ‖y‖_2 = 1
• g_y is affine in x, with ∇g_y(x) = (y^T A_1 y, ..., y^T A_n y)
• hence, ∂f(x) ⊇ Co{ ∇g_y | A(x) y = λ_max(A(x)) y, ‖y‖_2 = 1 } (in fact equality holds here)
• to find one subgradient at x, can choose any unit eigenvector y associated with λ_max(A(x)); then (y^T A_1 y, ..., y^T A_n y) ∈ ∂f(x)
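A numerical sketch of this last recipe, assuming NumPy; the matrices A_0, A_1, A_2 below are illustrative random symmetric data:

```python
import numpy as np

# One subgradient of f(x) = lambda_max(A0 + x1*A1 + ... + xn*An)
# via a unit eigenvector for the largest eigenvalue.

def sym(M):
    return 0.5 * (M + M.T)

rng = np.random.default_rng(0)
A = [sym(rng.standard_normal((4, 4))) for _ in range(3)]   # A[0] = A0, A[1:] = A1, A2

def subgrad_lambda_max(x):
    Ax = A[0] + sum(xi * Ai for xi, Ai in zip(x, A[1:]))
    w, V = np.linalg.eigh(Ax)        # eigenvalues in ascending order
    y = V[:, -1]                     # unit eigenvector for lambda_max
    return np.array([y @ Ai @ y for Ai in A[1:]])

print(subgrad_lambda_max(np.array([0.3, -0.7])))
```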
Expectation

f(x) = E f(x, u), with f convex in x for each u, u a random variable

• for each u, choose any g_u ∈ ∂f(x, u) (so u ↦ g_u is a function)
• then, g = E g_u ∈ ∂f(x)

Monte Carlo method for (approximately) computing f(x) and a g ∈ ∂f(x):

• generate independent samples u_1, ..., u_K from the distribution of u
• f(x) ≈ (1/K) Σ_{i=1}^K f(x, u_i)
• for each i choose g_i ∈ ∂_x f(x, u_i)
• g = (1/K) Σ_{i=1}^K g_i is an (approximate) subgradient (more on this later)
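A sketch of the Monte Carlo estimate, assuming the illustrative choice f(x, u) = ‖x − u‖_1 with u standard normal, for which sign(x − u) is a subgradient in x:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_value_and_subgrad(x, K=1000):
    """Approximate f(x) = E ||x - u||_1 and a subgradient at x."""
    u = rng.standard_normal((K, x.size))      # samples u_1, ..., u_K
    vals = np.abs(x - u).sum(axis=1)          # f(x, u_i)
    subs = np.sign(x - u)                     # g_i in subdiff_x f(x, u_i)
    return vals.mean(), subs.mean(axis=0)     # approx f(x), approx g

fhat, ghat = mc_value_and_subgrad(np.array([0.5, -1.0]))
print(fhat, ghat)
```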
Minimization

define g(y) as the optimal value of

    minimize   f_0(x)
    subject to f_i(x) ≤ y_i, i = 1, ..., m

(f_i convex; variable x)

with λ* an optimal dual variable, we have

    g(z) ≥ g(y) − Σ_{i=1}^m λ_i* (z_i − y_i)

i.e., −λ* is a subgradient of g at y
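A short justification (assuming strong duality holds for the problem with right-hand side y, e.g., via Slater): for any x with f_i(x) ≤ z_i, i = 1, ..., m,

```latex
g(y) = \inf_{x'} \Big( f_0(x') + \sum_{i=1}^m \lambda_i^\star \big(f_i(x') - y_i\big) \Big)
     \;\le\; f_0(x) + \sum_{i=1}^m \lambda_i^\star \big(f_i(x) - y_i\big)
     \;\le\; f_0(x) + \sum_{i=1}^m \lambda_i^\star (z_i - y_i),
```

and taking the infimum over all such x gives g(y) ≤ g(z) + Σ_i λ_i* (z_i − y_i), which rearranges to the inequality above.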
Composition

f(x) = h(f_1(x), ..., f_k(x)), with h convex nondecreasing, f_i convex

• find q ∈ ∂h(f_1(x), ..., f_k(x)), g_i ∈ ∂f_i(x)
• then, g = q_1 g_1 + ··· + q_k g_k ∈ ∂f(x)
• reduces to the standard chain-rule formula when h and the f_i are differentiable

proof:

    f(y) = h(f_1(y), ..., f_k(y))
         ≥ h(f_1(x) + g_1^T (y − x), ..., f_k(x) + g_k^T (y − x))
         ≥ h(f_1(x), ..., f_k(x)) + q^T (g_1^T (y − x), ..., g_k^T (y − x))
         = f(x) + g^T (y − x)

(the first inequality uses h nondecreasing; the second uses q ∈ ∂h)
Subgradients and sublevel sets

g is a subgradient at x means f(y) ≥ f(x) + g^T (y − x)

hence f(y) ≤ f(x) ⟹ g^T (y − x) ≤ 0

[figure: the sublevel set {y | f(y) ≤ f(x_0)}, with a subgradient g ∈ ∂f(x_0) at x_0 and the gradient ∇f(x_1) at x_1]
• f differentiable at x_0: ∇f(x_0) is normal to the sublevel set {x | f(x) ≤ f(x_0)}
• f nondifferentiable at x_0: subgradient defines a supporting hyperplane to the sublevel set through x_0
Quasigradients

g ≠ 0 is a quasigradient of f at x if

    g^T (y − x) ≥ 0 ⟹ f(y) ≥ f(x)

holds for all y

[figure: the halfspace {y | g^T (y − x) ≥ 0} lies inside {y | f(y) ≥ f(x)}]

quasigradients at x form a cone
example: f(x) = (a^T x + b)/(c^T x + d), dom f = {x | c^T x + d > 0}

g = a − f(x_0) c is a quasigradient at x_0

proof: for c^T x + d > 0, g^T (x − x_0) ≥ 0 gives

    a^T (x − x_0) ≥ f(x_0) c^T (x − x_0),

and since a^T x_0 + b = f(x_0)(c^T x_0 + d), this means a^T x + b ≥ f(x_0)(c^T x + d), i.e., f(x) ≥ f(x_0)
example: degree of a_1 + a_2 t + ··· + a_n t^{n−1}, i.e.,

    f(a) = min{ i | a_{i+2} = ··· = a_n = 0 }

g = sign(a_{k+1}) e_{k+1} (with k = f(a)) is a quasigradient at a ≠ 0

proof: g^T (b − a) = sign(a_{k+1}) b_{k+1} − |a_{k+1}| ≥ 0 implies b_{k+1} ≠ 0, hence f(b) ≥ k = f(a)
Optimality conditions: unconstrained

recall for f convex, differentiable,

    f(x*) = inf_x f(x)  ⟺  0 = ∇f(x*)

generalization to nondifferentiable convex f:

    f(x*) = inf_x f(x)  ⟺  0 ∈ ∂f(x*)
[figure: a nondifferentiable convex f with 0 ∈ ∂f(x_0) at a minimizer x_0]

proof. by definition (!)

    f(y) ≥ f(x*) + 0^T (y − x*) for all y  ⟺  0 ∈ ∂f(x*)

... seems trivial but isn't
Example: piecewise linear minimization

f(x) = max_{i=1,...,m} (a_i^T x + b_i)

x* minimizes f

⟺ 0 ∈ ∂f(x*) = Co{ a_i | a_i^T x* + b_i = f(x*) }

⟺ there is a λ with

    λ ⪰ 0,  1^T λ = 1,  Σ_{i=1}^m λ_i a_i = 0

where λ_i = 0 if a_i^T x* + b_i < f(x*)
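A sketch, not from the slides, of checking this condition numerically: look for such a λ by solving a feasibility LP over the active terms, here with scipy.optimize.linprog; the data A, b and the candidate point are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def is_pwl_minimizer(A, b, x, tol=1e-8):
    """Check 0 in Co{a_i : a_i^T x + b_i = f(x)} for f(x) = max_i a_i^T x + b_i."""
    vals = A @ x + b
    fx = vals.max()
    active = vals >= fx - tol            # active terms a_i^T x + b_i = f(x)
    Aact = A[active]                     # active a_i as rows
    m, n = Aact.shape
    # feasibility LP: lambda >= 0, 1^T lambda = 1, sum_i lambda_i a_i = 0
    A_eq = np.vstack([Aact.T, np.ones((1, m))])
    b_eq = np.concatenate([np.zeros(n), [1.0]])
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
    return res.success

A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.zeros(4)                           # f(x) = ||x||_inf
print(is_pwl_minimizer(A, b, np.zeros(2)))   # True: 0 minimizes ||x||_inf
```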
... but these are the KKT conditions for the epigraph form

    minimize   t
    subject to a_i^T x + b_i ≤ t, i = 1, ..., m

with dual

    maximize   b^T λ
    subject to λ ⪰ 0, A^T λ = 0, 1^T λ = 1
Optimality conditions: constrained

    minimize   f_0(x)
    subject to f_i(x) ≤ 0, i = 1, ..., m

we assume

• f_i convex, defined on R^n (hence subdifferentiable)
• strict feasibility (Slater's condition)

x* is primal optimal (λ* is dual optimal) iff

    f_i(x*) ≤ 0,  λ_i* ≥ 0
    0 ∈ ∂f_0(x*) + Σ_{i=1}^m λ_i* ∂f_i(x*)
    λ_i* f_i(x*) = 0

... generalizes KKT for nondifferentiable f_i
Directional derivative

directional derivative of f at x in the direction δx is

    f′(x; δx) = lim_{h↘0} (f(x + h δx) − f(x))/h

(can be +∞ or −∞)

• f convex, finite near x ⟹ f′(x; δx) exists
• f differentiable at x if and only if, for some g (= ∇f(x)) and all δx, f′(x; δx) = g^T δx (i.e., f′(x; δx) is a linear function of δx)
Directional derivative and subdifferential

general formula for convex f:

    f′(x; δx) = sup_{g ∈ ∂f(x)} g^T δx

[figure: the set ∂f(x) and a direction δx]
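For example, for f(x) = |x| at x = 0, using ∂f(0) = [−1, 1] from earlier:

```latex
f'(0;\,\delta x) \;=\; \sup_{g \in [-1,1]} g\,\delta x \;=\; |\delta x|.
```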
Descent directions

δx is a descent direction for f at x if f′(x; δx) < 0

for differentiable f, δx = −∇f(x) is always a descent direction (except when it is zero)

warning: for nondifferentiable (convex) functions, δx = −g, with g ∈ ∂f(x), need not be a descent direction

example: f(x) = |x_1| + 2|x_2|

[figure: level curves of f(x) = |x_1| + 2|x_2|, with a point x and a subgradient g for which −g is not a descent direction]
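A concrete check of the warning at the illustrative point x = (1, 0) (the slide does not fix a point): there ∂f(x) = {1} × [−2, 2], so g = (1, 2) ∈ ∂f(x), yet

```latex
f'\big((1,0);\,-g\big)
  \;=\; \sup_{z \in \{1\}\times[-2,2]} z^T(-g)
  \;=\; \sup_{s \in [-2,2]} (-1 - 2s)
  \;=\; 3 \;>\; 0,
```

so −g is not a descent direction, even though g is a valid subgradient.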
Subgradients and distance to sublevel sets

if f is convex, f(z) < f(x), g ∈ ∂f(x), then for small t > 0,

    ‖x − t g − z‖_2 < ‖x − z‖_2

• thus −g is a descent direction for ‖x − z‖_2, for any z with f(z) < f(x) (e.g., x*)
• negative subgradient is a descent direction for the distance to the optimal point

proof:

    ‖x − t g − z‖_2^2 = ‖x − z‖_2^2 − 2 t g^T (x − z) + t^2 ‖g‖_2^2
                      ≤ ‖x − z‖_2^2 − 2 t (f(x) − f(z)) + t^2 ‖g‖_2^2

which is less than ‖x − z‖_2^2 for 0 < t < 2(f(x) − f(z))/‖g‖_2^2
Descent directions and optimality

fact: for f convex, finite near x, either

• 0 ∈ ∂f(x) (in which case x minimizes f), or
• there is a descent direction for f at x

i.e., x is optimal (minimizes f) iff there is no descent direction for f at x

proof: define δx_sd = −argmin_{z ∈ ∂f(x)} ‖z‖

if δx_sd = 0, then 0 ∈ ∂f(x), so x is optimal; otherwise

    f′(x; δx_sd) = −( inf_{z ∈ ∂f(x)} ‖z‖ )^2 < 0,

so δx_sd is a descent direction
[figure: the set ∂f(x) and the steepest descent direction δx_sd]

idea extends to the constrained case (feasible descent direction)