Subgradients

• subgradients
• strong and weak subgradient calculus
• optimality conditions via subgradients
• directional derivatives

Prof. S. Boyd, EE364b, Stanford University
Basic inequality

recall the basic inequality for convex, differentiable f:

    f(y) ≥ f(x) + ∇f(x)^T (y − x)

• the first-order approximation of f at x is a global underestimator
• (∇f(x), −1) supports epi f at (x, f(x))
• what if f is not differentiable?
Subgradient of a function

g is a subgradient of f (not necessarily convex) at x if

    f(y) ≥ f(x) + g^T (y − x)   for all y

[figure: f with affine underestimators f(x_1) + g_1^T (x − x_1), f(x_2) + g_2^T (x − x_2), and f(x_2) + g_3^T (x − x_2)]

g_2, g_3 are subgradients at x_2; g_1 is a subgradient at x_1
• g is a subgradient of f at x iff (g, −1) supports epi f at (x, f(x))
• g is a subgradient iff f(x) + g^T (y − x) is a global (affine) underestimator of f
• if f is convex and differentiable, ∇f(x) is a subgradient of f at x
• subgradients come up in several contexts:
  - algorithms for nondifferentiable convex optimization
  - convex analysis, e.g., optimality conditions, duality for nondifferentiable problems

(if f(y) ≤ f(x) + g^T (y − x) for all y, then g is a supergradient)
Example

f = max{f_1, f_2}, with f_1, f_2 convex and differentiable

[figure: f_1, f_2, and f = max{f_1, f_2} near the crossing point x_0]

• f_1(x_0) > f_2(x_0): unique subgradient g = ∇f_1(x_0)
• f_2(x_0) > f_1(x_0): unique subgradient g = ∇f_2(x_0)
• f_1(x_0) = f_2(x_0): subgradients form a line segment [∇f_1(x_0), ∇f_2(x_0)]
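A minimal sketch, not part of the slides, of this case analysis in code; the two differentiable functions f_1(x) = (1/2)‖x‖_2^2 and f_2(x) = 1^T x are illustrative choices:

```python
import numpy as np

# One subgradient of f = max{f1, f2}, following the case analysis above.
# f1 and f2 are illustrative differentiable convex functions.

def f1(x): return 0.5 * float(x @ x)       # (1/2)||x||_2^2
def grad_f1(x): return x                    # gradient of f1

def f2(x): return float(np.sum(x))          # 1^T x
def grad_f2(x): return np.ones_like(x)      # gradient of f2

def subgrad_max(x, theta=0.5):
    """Return one subgradient of max{f1, f2} at x.

    When f1(x) = f2(x), any point on the segment
    [grad f1(x), grad f2(x)] works; theta in [0, 1] picks one.
    """
    if f1(x) > f2(x):
        return grad_f1(x)
    if f2(x) > f1(x):
        return grad_f2(x)
    return theta * grad_f1(x) + (1.0 - theta) * grad_f2(x)

print(subgrad_max(np.array([1.0, 1.0])))    # f2 is active here: [1., 1.]
```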
Subdifferential

• the set of all subgradients of f at x is called the subdifferential of f at x, denoted ∂f(x)
• ∂f(x) is a closed convex set (can be empty)

if f is convex:

• ∂f(x) is nonempty, for x ∈ relint dom f
• ∂f(x) = {∇f(x)}, if f is differentiable at x
• if ∂f(x) = {g}, then f is differentiable at x and g = ∇f(x)
Example

f(x) = |x|

[figure: lefthand plot shows f(x) = |x|; righthand plot shows {(x, ∂f(x)) | x ∈ R}]
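Written out, the subdifferential plotted on the right is

```latex
\partial f(x) =
\begin{cases}
\{-1\}, & x < 0, \\
[-1,\,1], & x = 0, \\
\{+1\}, & x > 0.
\end{cases}
```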
Subgradient calculus

• weak subgradient calculus: formulas for finding one subgradient g ∈ ∂f(x)
• strong subgradient calculus: formulas for finding the whole subdifferential ∂f(x), i.e., all subgradients of f at x
• many algorithms for nondifferentiable convex optimization require only one subgradient at each step, so weak calculus suffices
• some algorithms, optimality conditions, etc., need the whole subdifferential
• roughly speaking: if you can compute f(x), you can usually compute a g ∈ ∂f(x)
• we'll assume that f is convex, and x ∈ relint dom f
Some basic rules

• ∂f(x) = {∇f(x)} if f is differentiable at x
• scaling: ∂(αf) = α ∂f (if α > 0)
• addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2 (RHS is addition of sets)
• affine transformation of variables: if g(x) = f(Ax + b), then ∂g(x) = A^T ∂f(Ax + b)
• finite pointwise maximum: if f = max_{i=1,...,m} f_i, then

      ∂f(x) = Co ∪ { ∂f_i(x) | f_i(x) = f(x) },

  i.e., convex hull of union of subdifferentials of active functions at x
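A small code sketch, not from the slides, of how these weak-calculus rules compose; the helpers scale, add, affine and the final example h are illustrative:

```python
import numpy as np

# Each rule turns subgradient oracles into a subgradient oracle
# for the new function (weak calculus: one subgradient suffices).

def scale(alpha, sub_f):
    # subgradient of alpha * f (alpha > 0)
    return lambda x: alpha * sub_f(x)

def add(sub_f1, sub_f2):
    # subgradient of f1 + f2
    return lambda x: sub_f1(x) + sub_f2(x)

def affine(A, b, sub_f):
    # subgradient of x -> f(A x + b)
    return lambda x: A.T @ sub_f(A @ x + b)

# example: h(x) = ||A x + b||_1 + 2*||x||_1,
# using sign(.) as one subgradient of the l1 norm
sub_l1 = lambda y: np.sign(y)
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
b = np.array([1.0, -1.0])
sub_h = add(affine(A, b, sub_l1), scale(2.0, sub_l1))
print(sub_h(np.zeros(2)))   # one subgradient of h at 0: [1., 1.]
```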
f(x) = max{f_1(x), ..., f_m(x)}, with f_1, ..., f_m differentiable

    ∂f(x) = Co{ ∇f_i(x) | f_i(x) = f(x) }

example: f(x) = ‖x‖_1 = max{ s^T x | s_i ∈ {−1, 1} }

[figure: ∂f(x) at x = (0, 0), at x = (1, 0), and at x = (1, 1)]
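Concretely, the three subdifferentials pictured are

```latex
\partial f(0,0) = [-1,1]\times[-1,1], \qquad
\partial f(1,0) = \{1\}\times[-1,1], \qquad
\partial f(1,1) = \{(1,1)\}.
```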
Pointwise supremum

if f = sup_{α ∈ A} f_α,

    cl Co ∪ { ∂f_β(x) | f_β(x) = f(x) } ⊆ ∂f(x)

(usually get equality, but this requires some technical conditions to hold, e.g., A compact, f_α continuous in x and α)

roughly speaking, ∂f(x) is the closure of the convex hull of the union of subdifferentials of active functions
Weak rule for pointwise supremum

f = sup_{α ∈ A} f_α

• find any β for which f_β(x) = f(x) (assuming the supremum is achieved)
• choose any g ∈ ∂f_β(x)
• then, g ∈ ∂f(x)
example: f(x) = λ_max(A(x)) = sup_{‖y‖_2 = 1} y^T A(x) y

where A(x) = A_0 + x_1 A_1 + ··· + x_n A_n, A_i ∈ S^k

• f is the pointwise supremum of g_y(x) = y^T A(x) y over ‖y‖_2 = 1
• g_y is affine in x, with ∇g_y(x) = (y^T A_1 y, ..., y^T A_n y)
• hence, ∂f(x) ⊇ Co{ ∇g_y | A(x) y = λ_max(A(x)) y, ‖y‖_2 = 1 } (in fact equality holds here)
• to find one subgradient at x, can choose any unit eigenvector y associated with λ_max(A(x)); then (y^T A_1 y, ..., y^T A_n y) ∈ ∂f(x)
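A numerical sketch of this last recipe, assuming NumPy; the matrices A_0, A_1, A_2 below are illustrative random symmetric data:

```python
import numpy as np

# One subgradient of f(x) = lambda_max(A0 + x1*A1 + ... + xn*An)
# via a unit eigenvector for the largest eigenvalue.

def sym(M):
    return 0.5 * (M + M.T)

rng = np.random.default_rng(0)
A = [sym(rng.standard_normal((4, 4))) for _ in range(3)]   # A[0] = A0, A[1:] = A1, A2

def subgrad_lambda_max(x):
    Ax = A[0] + sum(xi * Ai for xi, Ai in zip(x, A[1:]))
    w, V = np.linalg.eigh(Ax)        # eigenvalues in ascending order
    y = V[:, -1]                     # unit eigenvector for lambda_max
    return np.array([y @ Ai @ y for Ai in A[1:]])

print(subgrad_lambda_max(np.array([0.3, -0.7])))
```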
Expectation

f(x) = E f(x, u), with f convex in x for each u, u a random variable

• for each u, choose any g_u ∈ ∂f(x, u) (so u ↦ g_u is a function)
• then, g = E g_u ∈ ∂f(x)

Monte Carlo method for (approximately) computing f(x) and a g ∈ ∂f(x):

• generate independent samples u_1, ..., u_K from the distribution of u
• f(x) ≈ (1/K) Σ_{i=1}^K f(x, u_i)
• for each i choose g_i ∈ ∂_x f(x, u_i)
• g = (1/K) Σ_{i=1}^K g_i is an (approximate) subgradient (more on this later)
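A sketch of the Monte Carlo estimate, assuming the illustrative choice f(x, u) = ‖x − u‖_1 with u standard normal, for which sign(x − u) is a subgradient in x:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_value_and_subgrad(x, K=1000):
    """Approximate f(x) = E ||x - u||_1 and a subgradient at x."""
    u = rng.standard_normal((K, x.size))      # samples u_1, ..., u_K
    vals = np.abs(x - u).sum(axis=1)          # f(x, u_i)
    subs = np.sign(x - u)                     # g_i in subdiff_x f(x, u_i)
    return vals.mean(), subs.mean(axis=0)     # approx f(x), approx g

fhat, ghat = mc_value_and_subgrad(np.array([0.5, -1.0]))
print(fhat, ghat)
```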
Minimization

define g(y) as the optimal value of

    minimize   f_0(x)
    subject to f_i(x) ≤ y_i, i = 1, ..., m

(f_i convex; variable x)

with λ* an optimal dual variable, we have

    g(z) ≥ g(y) − Σ_{i=1}^m λ_i* (z_i − y_i)

i.e., −λ* is a subgradient of g at y
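A short justification (assuming strong duality holds for the problem with right-hand side y, e.g., via Slater): for any x with f_i(x) ≤ z_i, i = 1, ..., m,

```latex
g(y) = \inf_{x'} \Big( f_0(x') + \sum_{i=1}^m \lambda_i^\star \big(f_i(x') - y_i\big) \Big)
     \;\le\; f_0(x) + \sum_{i=1}^m \lambda_i^\star \big(f_i(x) - y_i\big)
     \;\le\; f_0(x) + \sum_{i=1}^m \lambda_i^\star (z_i - y_i),
```

and taking the infimum over all such x gives g(y) ≤ g(z) + Σ_i λ_i* (z_i − y_i), which rearranges to the inequality above.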
Composition

f(x) = h(f_1(x), ..., f_k(x)), with h convex nondecreasing, f_i convex

• find q ∈ ∂h(f_1(x), ..., f_k(x)), g_i ∈ ∂f_i(x)
• then, g = q_1 g_1 + ··· + q_k g_k ∈ ∂f(x)
• reduces to the standard chain-rule formula when h and the f_i are differentiable

proof:

    f(y) = h(f_1(y), ..., f_k(y))
         ≥ h(f_1(x) + g_1^T (y − x), ..., f_k(x) + g_k^T (y − x))
         ≥ h(f_1(x), ..., f_k(x)) + q^T (g_1^T (y − x), ..., g_k^T (y − x))
         = f(x) + g^T (y − x)

(the first inequality uses h nondecreasing; the second uses q ∈ ∂h)
Subgradients and sublevel sets

g is a subgradient at x means f(y) ≥ f(x) + g^T (y − x)

hence f(y) ≤ f(x) ⟹ g^T (y − x) ≤ 0

[figure: the sublevel set {y | f(y) ≤ f(x_0)}, with a subgradient g ∈ ∂f(x_0) at x_0 and the gradient ∇f(x_1) at x_1]
• f differentiable at x_0: ∇f(x_0) is normal to the sublevel set {x | f(x) ≤ f(x_0)}
• f nondifferentiable at x_0: subgradient defines a supporting hyperplane to the sublevel set through x_0
Quasigradients

g ≠ 0 is a quasigradient of f at x if

    g^T (y − x) ≥ 0 ⟹ f(y) ≥ f(x)

holds for all y

[figure: the halfspace {y | g^T (y − x) ≥ 0} lies inside {y | f(y) ≥ f(x)}]

quasigradients at x form a cone
example: f(x) = (a^T x + b)/(c^T x + d), dom f = {x | c^T x + d > 0}

g = a − f(x_0) c is a quasigradient at x_0

proof: for c^T x + d > 0, g^T (x − x_0) ≥ 0 gives

    a^T (x − x_0) ≥ f(x_0) c^T (x − x_0),

and since a^T x_0 + b = f(x_0)(c^T x_0 + d), this means a^T x + b ≥ f(x_0)(c^T x + d), i.e., f(x) ≥ f(x_0)
example: degree of a_1 + a_2 t + ··· + a_n t^{n−1}, i.e.,

    f(a) = min{ i | a_{i+2} = ··· = a_n = 0 }

g = sign(a_{k+1}) e_{k+1} (with k = f(a)) is a quasigradient at a ≠ 0

proof: g^T (b − a) = sign(a_{k+1}) b_{k+1} − |a_{k+1}| ≥ 0 implies b_{k+1} ≠ 0, hence f(b) ≥ k = f(a)
Optimality conditions: unconstrained

recall for f convex, differentiable,

    f(x*) = inf_x f(x)  ⟺  0 = ∇f(x*)

generalization to nondifferentiable convex f:

    f(x*) = inf_x f(x)  ⟺  0 ∈ ∂f(x*)
[figure: a nondifferentiable convex f with 0 ∈ ∂f(x_0) at a minimizer x_0]

proof. by definition (!)

    f(y) ≥ f(x*) + 0^T (y − x*) for all y  ⟺  0 ∈ ∂f(x*)

... seems trivial but isn't
Example: piecewise linear minimization

f(x) = max_{i=1,...,m} (a_i^T x + b_i)

x* minimizes f

⟺ 0 ∈ ∂f(x*) = Co{ a_i | a_i^T x* + b_i = f(x*) }

⟺ there is a λ with

    λ ⪰ 0,  1^T λ = 1,  Σ_{i=1}^m λ_i a_i = 0

where λ_i = 0 if a_i^T x* + b_i < f(x*)
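A sketch, not from the slides, of checking this condition numerically: look for such a λ by solving a feasibility LP over the active terms, here with scipy.optimize.linprog; the data A, b and the candidate point are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def is_pwl_minimizer(A, b, x, tol=1e-8):
    """Check 0 in Co{a_i : a_i^T x + b_i = f(x)} for f(x) = max_i a_i^T x + b_i."""
    vals = A @ x + b
    fx = vals.max()
    active = vals >= fx - tol            # active terms a_i^T x + b_i = f(x)
    Aact = A[active]                     # active a_i as rows
    m, n = Aact.shape
    # feasibility LP: lambda >= 0, 1^T lambda = 1, sum_i lambda_i a_i = 0
    A_eq = np.vstack([Aact.T, np.ones((1, m))])
    b_eq = np.concatenate([np.zeros(n), [1.0]])
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
    return res.success

A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.zeros(4)                           # f(x) = ||x||_inf
print(is_pwl_minimizer(A, b, np.zeros(2)))   # True: 0 minimizes ||x||_inf
```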
... but these are the KKT conditions for the epigraph form

    minimize   t
    subject to a_i^T x + b_i ≤ t, i = 1, ..., m

with dual

    maximize   b^T λ
    subject to λ ⪰ 0, A^T λ = 0, 1^T λ = 1
Optimality conditions: constrained

    minimize   f_0(x)
    subject to f_i(x) ≤ 0, i = 1, ..., m

we assume

• f_i convex, defined on R^n (hence subdifferentiable)
• strict feasibility (Slater's condition)

x* is primal optimal (λ* is dual optimal) iff

    f_i(x*) ≤ 0,  λ_i* ≥ 0
    0 ∈ ∂f_0(x*) + Σ_{i=1}^m λ_i* ∂f_i(x*)
    λ_i* f_i(x*) = 0

... generalizes KKT for nondifferentiable f_i
Directional derivative

directional derivative of f at x in the direction δx is

    f′(x; δx) = lim_{h↘0} (f(x + h δx) − f(x))/h

(can be +∞ or −∞)

• f convex, finite near x ⟹ f′(x; δx) exists
• f differentiable at x if and only if, for some g (= ∇f(x)) and all δx, f′(x; δx) = g^T δx (i.e., f′(x; δx) is a linear function of δx)
Directional derivative and subdifferential

general formula for convex f:

    f′(x; δx) = sup_{g ∈ ∂f(x)} g^T δx

[figure: the set ∂f(x) and a direction δx]
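For example, for f(x) = |x| at x = 0, using ∂f(0) = [−1, 1] from earlier:

```latex
f'(0;\,\delta x) \;=\; \sup_{g \in [-1,1]} g\,\delta x \;=\; |\delta x|.
```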
Descent directions

δx is a descent direction for f at x if f′(x; δx) < 0

for differentiable f, δx = −∇f(x) is always a descent direction (except when it is zero)

warning: for nondifferentiable (convex) functions, δx = −g, with g ∈ ∂f(x), need not be a descent direction

example: f(x) = |x_1| + 2|x_2|

[figure: level curves of f(x) = |x_1| + 2|x_2|, with a point x and a subgradient g for which −g is not a descent direction]
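A concrete check of the warning at the illustrative point x = (1, 0) (the slide does not fix a point): there ∂f(x) = {1} × [−2, 2], so g = (1, 2) ∈ ∂f(x), yet

```latex
f'\big((1,0);\,-g\big)
  \;=\; \sup_{z \in \{1\}\times[-2,2]} z^T(-g)
  \;=\; \sup_{s \in [-2,2]} (-1 - 2s)
  \;=\; 3 \;>\; 0,
```

so −g is not a descent direction, even though g is a valid subgradient.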
Subgradients and distance to sublevel sets

if f is convex, f(z) < f(x), g ∈ ∂f(x), then for small t > 0,

    ‖x − t g − z‖_2 < ‖x − z‖_2

• thus −g is a descent direction for ‖x − z‖_2, for any z with f(z) < f(x) (e.g., x*)
• negative subgradient is a descent direction for the distance to the optimal point

proof:

    ‖x − t g − z‖_2^2 = ‖x − z‖_2^2 − 2 t g^T (x − z) + t^2 ‖g‖_2^2
                      ≤ ‖x − z‖_2^2 − 2 t (f(x) − f(z)) + t^2 ‖g‖_2^2

which is less than ‖x − z‖_2^2 for 0 < t < 2(f(x) − f(z))/‖g‖_2^2
Descent directions and optimality

fact: for f convex, finite near x, either

• 0 ∈ ∂f(x) (in which case x minimizes f), or
• there is a descent direction for f at x

i.e., x is optimal (minimizes f) iff there is no descent direction for f at x

proof: define δx_sd = −argmin_{z ∈ ∂f(x)} ‖z‖

if δx_sd = 0, then 0 ∈ ∂f(x), so x is optimal; otherwise

    f′(x; δx_sd) = −( inf_{z ∈ ∂f(x)} ‖z‖ )^2 < 0,

so δx_sd is a descent direction
[figure: the set ∂f(x) and the steepest descent direction δx_sd]

idea extends to the constrained case (feasible descent direction)