Subgradients

- subgradients and quasigradients
- subgradient calculus
- optimality conditions via subgradients
- directional derivatives

Prof. S. Boyd, EE392o, Stanford University
Basic inequality

recall the basic inequality for convex differentiable f:

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)

- the first-order approximation of f at x is a global underestimator
- (∇f(x), −1) supports epi f at (x, f(x))

what if f is not differentiable?
Subgradient of a function

g is a subgradient of f (not necessarily convex) at x if

    f(y) ≥ f(x) + gᵀ(y − x)  for all y

(equivalently, (g, −1) supports epi f at (x, f(x)))

[figure: f with affine underestimators f(x₁) + g₁ᵀ(x − x₁), f(x₂) + g₂ᵀ(x − x₂), f(x₂) + g₃ᵀ(x − x₂); g₂, g₃ are subgradients at x₂; g₁ is a subgradient at x₁]
- a subgradient gives an affine global underestimator of f
- if f is convex, it has at least one subgradient at every point in relint dom f
- if f is convex and differentiable, ∇f(x) is a subgradient of f at x
Example

f = max{f₁, f₂}, with f₁, f₂ convex and differentiable

[figure: f₁, f₂, and f = max{f₁, f₂} near a crossing point x₀]

- f₁(x₀) > f₂(x₀): unique subgradient g = ∇f₁(x₀)
- f₂(x₀) > f₁(x₀): unique subgradient g = ∇f₂(x₀)
- f₁(x₀) = f₂(x₀): the subgradients form the line segment [∇f₁(x₀), ∇f₂(x₀)]
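The active-function rule above is easy to implement: return the gradient of whichever function attains the max. A minimal sketch (the particular f₁, f₂, and test point are illustrative choices, not from the slides):

```python
import numpy as np

def subgrad_max(x, f1, g1, f2, g2):
    """Return one subgradient of f = max(f1, f2) at x.

    f1, f2 are assumed convex and differentiable with gradients g1, g2.
    At a tie any convex combination of g1(x) and g2(x) is valid; here
    we simply return g1(x).
    """
    if f1(x) >= f2(x):
        return g1(x)
    return g2(x)

# illustrative data (not from the slides): f1(x) = ||x||^2, f2(x) = 1'x
f1 = lambda x: float(np.dot(x, x))
g1 = lambda x: 2 * x
f2 = lambda x: float(np.sum(x))
g2 = lambda x: np.ones_like(x)

x = np.array([2.0, 0.0])
g = subgrad_max(x, f1, g1, f2, g2)   # f1(x) = 4 > f2(x) = 2, so g = 2x
```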
Subdifferential

the set of all subgradients of f at x is called the subdifferential of f at x, written ∂f(x)

- ∂f(x) is a closed convex set
- ∂f(x) is nonempty (if f is convex and finite near x)
- ∂f(x) = {∇f(x)} if f is differentiable at x
- if ∂f(x) = {g}, then f is differentiable at x and g = ∇f(x)
- in many applications we don't need the complete ∂f(x); it is sufficient to find one g ∈ ∂f(x)
example: f(x) = |x|

[figure: f(x) = |x| and its subdifferential ∂f(x): {−1} for x < 0, the interval [−1, 1] at x = 0, {1} for x > 0]
Calculus of subgradients

assumption: all functions are finite near x

- ∂f(x) = {∇f(x)} if f is differentiable at x
- scaling: ∂(αf) = α ∂f (if α > 0)
- addition: ∂(f₁ + f₂) = ∂f₁ + ∂f₂ (RHS is addition of sets)
- affine transformation of variables: if g(x) = f(Ax + b), then ∂g(x) = Aᵀ ∂f(Ax + b)
- pointwise maximum: if f = max_{i=1,...,m} fᵢ, then

      ∂f(x) = Co ∪ {∂fᵢ(x) | fᵢ(x) = f(x)},

  i.e., the convex hull of the union of the subdifferentials of the active functions at x
special case: if the fᵢ are differentiable,

    ∂f(x) = Co {∇fᵢ(x) | fᵢ(x) = f(x)}

example: f(x) = ‖x‖₁ = max{sᵀx | sᵢ ∈ {−1, 1}}

[figure: ∂f(x) in R²: the square [−1, 1] × [−1, 1] at x = (0, 0); the segment {1} × [−1, 1] at x = (1, 0); the single point (1, 1) at x = (1, 1)]
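For the ℓ₁-norm example, one subgradient is always available in closed form: take sign(xᵢ) at nonzero coordinates and any value in [−1, 1] (here 0) at zero coordinates. A small sketch verifying the subgradient inequality at a sample point (the points are illustrative):

```python
import numpy as np

def subgrad_l1(x):
    """One subgradient of f(x) = ||x||_1, from the max-of-linear
    representation f(x) = max{s'x : s_i in {-1, 1}}: sign(x), with
    0 chosen at zero coordinates (any value in [-1, 1] would do)."""
    return np.sign(x)

x = np.array([1.0, 0.0])
g = subgrad_l1(x)                 # (1, 0), an element of {1} x [-1, 1]

# check the subgradient inequality  ||y||_1 >= ||x||_1 + g'(y - x)
y = np.array([-0.5, 2.0])
lhs = np.abs(y).sum()
rhs = np.abs(x).sum() + g @ (y - x)
```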
Pointwise supremum

if f = sup_{α ∈ A} f_α, then

    cl Co ∪ {∂f_β(x) | f_β(x) = f(x)} ⊆ ∂f(x)

(usually equality holds, but this requires some technical conditions, e.g., A compact, f_α continuous in x and α)

- roughly speaking, ∂f(x) is the closure of the convex hull of the union of the subdifferentials of the active functions
- in any case, if f_β(x) = f(x), then ∂f_β(x) ⊆ ∂f(x)
example: f(x) = λ_max(A(x)) = sup_{‖y‖₂=1} yᵀA(x)y, where A(x) = A₀ + x₁A₁ + ··· + xₙAₙ, Aᵢ ∈ Sᵏ

- f is the pointwise supremum of g_y(x) = yᵀA(x)y over ‖y‖₂ = 1
- g_y is affine in x, with ∇g_y(x) = (yᵀA₁y, ..., yᵀAₙy)
- hence ∂f(x) = Co {∇g_y | A(x)y = λ_max(A(x))y, ‖y‖₂ = 1} (not hard to verify)
- to find one subgradient at x, choose any unit eigenvector y associated with λ_max(A(x)); then (yᵀA₁y, ..., yᵀAₙy) ∈ ∂f(x)
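The eigenvector recipe translates directly into code: form A(x), take a unit eigenvector for the largest eigenvalue, and evaluate the quadratic forms. A sketch with illustrative 2×2 data (the matrices A₀, A₁ are not from the slides):

```python
import numpy as np

def subgrad_lambda_max(A_list, x):
    """One subgradient of f(x) = lambda_max(A0 + x1 A1 + ... + xn An):
    (y' A1 y, ..., y' An y) for a unit eigenvector y of lambda_max."""
    A0, As = A_list[0], A_list[1:]
    A = A0 + sum(xi * Ai for xi, Ai in zip(x, As))
    w, V = np.linalg.eigh(A)          # symmetric eigendecomposition, ascending
    y = V[:, -1]                      # unit eigenvector for the largest eigenvalue
    return np.array([y @ Ai @ y for Ai in As])

# illustrative instance: A(x) = diag(1, x1), so f(x) = max(1, x1)
A0 = np.diag([1.0, 0.0])
A1 = np.diag([0.0, 1.0])
g = subgrad_lambda_max([A0, A1], [2.0])   # at x1 = 2 the eigenvector is e2
```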
Minimization

define g(y) as the optimal value of

    minimize    f₀(x)
    subject to  fᵢ(x) ≤ yᵢ, i = 1, ..., m

(fᵢ convex; variable x)

with λ* an optimal dual variable, we have

    g(z) ≥ g(y) − Σᵢ₌₁ᵐ λᵢ*(zᵢ − yᵢ)

i.e., −λ* is a subgradient of g at y
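This can be checked on an instance small enough to solve by hand. A sketch with an illustrative problem (not from the slides): minimize x² subject to −x ≤ y; for y < 0 the constraint is active, x*(y) = −y, g(y) = y², and the optimal dual variable is λ* = −2y, so −λ* = 2y = g′(y):

```python
import numpy as np

# perturbed problem: minimize x^2 subject to -x <= y (illustrative instance)
def g(y):
    """Optimal value: x* = -y when y < 0 (constraint active), else x* = 0."""
    return y**2 if y < 0 else 0.0

y = -1.0
lam = -2 * y                  # optimal dual variable at this y (here lam = 2)

# subgradient inequality  g(z) >= g(y) - lam * (z - y)  for all z
zs = np.linspace(-3.0, 3.0, 101)
ok = all(g(z) >= g(y) - lam * (z - y) - 1e-12 for z in zs)
```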
Subgradients and sublevel sets

g is a subgradient at x means f(y) ≥ f(x) + gᵀ(y − x)

hence f(y) ≤ f(x) ⇒ gᵀ(y − x) ≤ 0

[figure: the sublevel set {y | f(y) ≤ f(x₀)}, with g ∈ ∂f(x₀) defining a supporting hyperplane at x₀, and ∇f(x₁) at a point x₁ of differentiability]
- f differentiable at x₀: ∇f(x₀) is normal to the sublevel set {x | f(x) ≤ f(x₀)}
- f nondifferentiable at x₀: a subgradient defines a supporting hyperplane to the sublevel set through x₀
Quasigradients

g ≠ 0 is a quasigradient of f at x if

    gᵀ(y − x) ≥ 0  ⇒  f(y) ≥ f(x)

holds for all y

[figure: the halfspace {y | gᵀ(y − x) ≥ 0} contained in {y | f(y) ≥ f(x)}]

quasigradients at x form a cone
example: f(x) = (aᵀx + b)/(cᵀx + d), with dom f = {x | cᵀx + d > 0}

g = a − f(x₀)c is a quasigradient at x₀

proof: for cᵀx + d > 0,

    aᵀ(x − x₀) ≥ f(x₀) cᵀ(x − x₀)  ⇒  f(x) ≥ f(x₀)
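The quasigradient property is easy to stress-test numerically: sample points in the domain and check that gᵀ(x − x₀) ≥ 0 forces f(x) ≥ f(x₀). A sketch with illustrative data a, b, c, d, x₀ (not from the slides):

```python
import numpy as np

# illustrative linear-fractional data (not from the slides)
a, b = np.array([1.0, -2.0]), 0.5
c, d = np.array([0.5, 1.0]), 2.0

def f(x):
    """f(x) = (a'x + b) / (c'x + d), defined where c'x + d > 0."""
    return (a @ x + b) / (c @ x + d)

x0 = np.array([0.3, 0.1])
g = a - f(x0) * c                    # quasigradient at x0

# sample the domain: g'(x - x0) >= 0 should imply f(x) >= f(x0)
rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0, 2)
    if c @ x + d > 0 and g @ (x - x0) >= 0:
        ok = ok and (f(x) >= f(x0) - 1e-12)
```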
example: degree of a₁ + a₂t + ··· + aₙtⁿ⁻¹:

    f(a) = min{i | a_{i+2} = ··· = aₙ = 0}

g = sign(a_{k+1}) e_{k+1} (with k = f(a)) is a quasigradient at a ≠ 0

proof: gᵀ(b − a) ≥ 0 implies

    sign(a_{k+1}) b_{k+1} ≥ sign(a_{k+1}) a_{k+1} = |a_{k+1}| > 0,

so b_{k+1} ≠ 0, hence f(b) ≥ f(a)
Optimality conditions — unconstrained

recall that for f convex and differentiable,

    f(x*) = inf_x f(x)  ⇔  0 = ∇f(x*)

generalization to nondifferentiable convex f:

    f(x*) = inf_x f(x)  ⇔  0 ∈ ∂f(x*)
[figure: f with a horizontal supporting line at x₀, i.e., 0 ∈ ∂f(x₀)]

proof: by definition (!),

    f(y) ≥ f(x*) + 0ᵀ(y − x*) for all y  ⇔  0 ∈ ∂f(x*)

... seems trivial but isn't
Example: piecewise linear minimization

f(x) = max_{i=1,...,m} (aᵢᵀx + bᵢ)

x* minimizes f

⇔  0 ∈ ∂f(x*) = Co {aᵢ | aᵢᵀx* + bᵢ = f(x*)}

⇔  there is a λ with

    λ ⪰ 0,  1ᵀλ = 1,  Σᵢ₌₁ᵐ λᵢaᵢ = 0

where λᵢ = 0 if aᵢᵀx* + bᵢ < f(x*)
... but these are the KKT conditions for the epigraph form

    minimize    t
    subject to  aᵢᵀx + bᵢ ≤ t, i = 1, ..., m

with dual

    maximize    bᵀλ
    subject to  λ ⪰ 0, Aᵀλ = 0, 1ᵀλ = 1
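The optimality certificate (λ ⪰ 0, 1ᵀλ = 1, Σλᵢaᵢ = 0 over the active set) can be checked mechanically. A sketch with an illustrative instance where f(x) = ‖x‖_∞ and x* = 0 is optimal (the data are not from the slides; here uniform weights on the active set happen to work):

```python
import numpy as np

# f(x) = max_i (a_i'x + b_i); with this data f(x) = ||x||_inf, minimized at 0
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.zeros(4)
x_star = np.zeros(2)

fval = (A @ x_star + b).max()
active = np.isclose(A @ x_star + b, fval)    # all four pieces active at x* = 0

# certificate: lam >= 0, 1'lam = 1, sum_i lam_i a_i = 0, lam_i = 0 off the active set
lam = np.where(active, 1.0 / active.sum(), 0.0)
residual = A.T @ lam                          # should be the zero vector
```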
Optimality conditions — constrained

    minimize    f₀(x)
    subject to  fᵢ(x) ≤ 0, i = 1, ..., m

we assume
- fᵢ convex, defined on Rⁿ (hence subdifferentiable)
- strict feasibility (Slater's condition)

x* is primal optimal (λ* is dual optimal) iff

    fᵢ(x*) ≤ 0,  λᵢ* ≥ 0
    0 ∈ ∂f₀(x*) + Σᵢ₌₁ᵐ λᵢ* ∂fᵢ(x*)
    λᵢ* fᵢ(x*) = 0

... generalizes KKT for nondifferentiable fᵢ
Directional derivative

the directional derivative of f at x in the direction δx is

    f′(x; δx) = lim_{h ↘ 0} (f(x + h δx) − f(x))/h

(can be +∞ or −∞)

- f convex, finite near x ⇒ f′(x; δx) exists
- f is differentiable at x if and only if, for some g (= ∇f(x)) and all δx, f′(x; δx) = gᵀδx (i.e., f′(x; δx) is a linear function of δx)
Directional derivative and subdifferential

general formula for convex f:

    f′(x; δx) = sup_{g ∈ ∂f(x)} gᵀδx

[figure: ∂f(x) with the maximizing subgradient in direction δx]
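The sup formula can be verified numerically for f(x) = ‖x‖₁ at x = (1, 0), where ∂f(x) = {1} × [−1, 1], so sup_{g ∈ ∂f(x)} gᵀδx = δx₁ + |δx₂|. A sketch comparing this with a one-sided difference quotient (the point and direction are illustrative):

```python
import numpy as np

f = lambda x: np.abs(x).sum()          # f(x) = ||x||_1
x = np.array([1.0, 0.0])               # ∂f(x) = {1} x [-1, 1]
dx = np.array([0.5, -2.0])

# numerical one-sided limit  (f(x + h dx) - f(x)) / h  for small h > 0
h = 1e-8
num = (f(x + h * dx) - f(x)) / h

# sup over the subdifferential: 1*dx1 + sup_{|g2|<=1} g2*dx2 = dx1 + |dx2|
sup_val = dx[0] + abs(dx[1])           # = 2.5 here
```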
Descent directions

δx is a descent direction for f at x if f′(x; δx) < 0

for differentiable f, δx = −∇f(x) is always a descent direction (except when it is zero)

warning: for nondifferentiable (convex) functions, δx = −g, with g ∈ ∂f(x), need not be a descent direction

example: f(x) = |x₁| + 2|x₂|

[figure: level curves of f with a subgradient g at a point on the x₁-axis; −g points out of the sublevel set]
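The warning can be made concrete for f(x) = |x₁| + 2|x₂| at x = (1, 0), where ∂f(x) = {1} × [−2, 2]. Taking the subgradient g = (1, 2) and stepping along −g increases f for every step size t > 0 (a sketch; the point and subgradient are one choice among many):

```python
import numpy as np

f = lambda x: abs(x[0]) + 2 * abs(x[1])
x = np.array([1.0, 0.0])
g = np.array([1.0, 2.0])       # valid subgradient: ∂f(x) = {1} x [-2, 2]

# moving along -g increases f: f(x - t g) = |1 - t| + 4t > 1 = f(x) for t > 0
t = 0.1
step_val = f(x - t * g)        # f(0.9, -0.2) = 0.9 + 0.4 = 1.3
```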
Subgradients and distance to sublevel sets

if f is convex, f(z) < f(x), g ∈ ∂f(x), then for small t > 0,

    ‖x − tg − z‖₂ < ‖x − z‖₂

- thus −g is a descent direction for ‖x − z‖₂, for any z with f(z) < f(x) (e.g., x*)
- the negative subgradient is a descent direction for the distance to the optimal point

proof:

    ‖x − tg − z‖₂² = ‖x − z‖₂² − 2t gᵀ(x − z) + t²‖g‖₂²
                   ≤ ‖x − z‖₂² − 2t (f(x) − f(z)) + t²‖g‖₂²
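Continuing the same f(x) = |x₁| + 2|x₂| example: although −g is not a descent direction for f at x = (1, 0), a small step along −g does shrink the distance to the minimizer z = 0. Per the proof, any t < 2(f(x) − f(z))/‖g‖₂² works (here 2/5); a sketch with t = 0.1:

```python
import numpy as np

f = lambda x: abs(x[0]) + 2 * abs(x[1])   # minimized at z = 0
x = np.array([1.0, 0.0])
g = np.array([1.0, 2.0])                  # g in ∂f(x); -g does NOT decrease f
z = np.zeros(2)                           # but f(z) = 0 < f(x) = 1

# for small t the step x - t g moves strictly closer to z
t = 0.1                                   # below the bound 2(f(x)-f(z))/||g||^2 = 0.4
d_before = np.linalg.norm(x - z)          # = 1
d_after = np.linalg.norm(x - t * g - z)   # = ||(0.9, -0.2)|| ~ 0.922
```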
Descent directions and optimality

fact: for f convex, finite near x, either
- 0 ∈ ∂f(x) (in which case x minimizes f), or
- there is a descent direction for f at x

i.e., x is optimal (minimizes f) iff there is no descent direction for f at x

proof: define δx_sd = −argmin_{z ∈ ∂f(x)} ‖z‖

if δx_sd = 0, then 0 ∈ ∂f(x), so x is optimal; otherwise

    f′(x; δx_sd) = −(inf_{z ∈ ∂f(x)} ‖z‖)² < 0,

so δx_sd is a descent direction
[figure: ∂f(x) with the minimum-norm subgradient defining δx_sd]

the idea extends to the constrained case (feasible descent direction)
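When the subdifferential is the segment between two active gradients (as in the max-of-two-functions example), the minimum-norm subgradient δx_sd is the projection of 0 onto that segment, which has a closed form. A sketch with illustrative gradients (not from the slides):

```python
import numpy as np

def min_norm_on_segment(g1, g2):
    """Minimum-norm point on the segment [g1, g2], i.e., the projection
    of 0 onto Co{g1, g2}; its negative is the steepest descent direction
    when the subdifferential is this segment."""
    d = g2 - g1
    theta = np.clip(-(g1 @ d) / (d @ d), 0.0, 1.0)   # clamp to the segment
    return g1 + theta * d

g1 = np.array([1.0, 1.0])            # illustrative active gradients
g2 = np.array([1.0, -1.0])
g_sd = min_norm_on_segment(g1, g2)   # midpoint (1, 0); dx_sd = -g_sd
```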