Math 273a: Optimization. Subgradients of convex functions. Made by: Damek Davis. Edited by: Wotao Yin. Department of Mathematics, UCLA, Fall 2015. Online discussions on piazza.com. 1 / 20
Subgradients: Assumptions and Notation
1. $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ is a closed proper convex function, i.e., $\mathrm{epi}\, f = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : f(x) \le t\}$ is closed and convex.
2. The effective domain of $f$ is $\mathrm{dom}\, f = \{x \in \mathbb{R}^n : f(x) < \infty\}$.
3. The function $f$ is proper, i.e., $\mathrm{dom}\, f \neq \emptyset$.
4. A raised star (e.g., $x^*$) denotes a global minimizer of some function. 2 / 20
$C^1$ convex functions
Recall: a convex function $f \in C^1$ obeys
$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle, \quad \forall x, y \in \mathbb{R}^n$.
Footnote: figure taken from Boyd and Vandenberghe, Convex Optimization. 3 / 20
Non-$C^1$ convex functions
For each $\bar{x} \in \mathbb{R}^n$, define the subdifferential
$\partial f(\bar{x}) := \{g \in \mathbb{R}^n : f(y) \ge f(\bar{x}) + \langle g, y - \bar{x} \rangle, \ \forall y \in \mathbb{R}^n\}$. 4 / 20
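As a quick numerical illustration (not part of the original slides), the sketch below checks the defining inequality for $f(x) = |x|$ at $\bar{x} = 0$ for two candidate vectors $g$; the helper name `is_subgradient` is hypothetical.

```python
import numpy as np

def is_subgradient(f, g, x_bar, trial_points, tol=1e-12):
    """Check the subgradient inequality f(y) >= f(x_bar) + <g, y - x_bar>
    on a finite set of trial points (a numerical sanity check only)."""
    fx = f(x_bar)
    return all(f(y) >= fx + np.dot(g, y - x_bar) - tol for y in trial_points)

f = lambda x: np.abs(x).sum()                      # f(x) = |x| in 1-D
ys = [np.array([t]) for t in np.linspace(-2.0, 2.0, 81)]

print(is_subgradient(f, np.array([0.5]), np.array([0.0]), ys))   # True: 0.5 lies in [-1, 1]
print(is_subgradient(f, np.array([1.5]), np.array([0.0]), ys))   # False: 1.5 lies outside [-1, 1]
```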
The subgradient is defined via a global property, not by taking limits. In contrast, the set of regular subgradients is defined as
$\hat{\partial} f(x) = \{g : \exists \delta > 0 \text{ such that } f(y) \ge f(x) + \langle g, y - x \rangle + o(\|y - x\|), \ \forall y \in B_\delta(x)\}$.
$g$ is a general subgradient of $f$ at $x$ if there is a sequence $x^i \to x$ and $g^i \in \hat{\partial} f(x^i)$ with $g^i \to g$.
Reference: Variational Analysis by Rockafellar and Wets, Definition 8.3. 5 / 20
Which functions have subgradients?
If $f \in C^1$ is convex, then $\nabla f(x) \in \partial f(x)$. In fact, $\partial f(x) = \{\nabla f(x)\}$.
Proof: if $g$ is a subgradient, then for every $y \in \mathbb{R}^n$,
$\langle \nabla f(x), y \rangle = \lim_{t \downarrow 0} \dfrac{f(x + ty) - f(x)}{t}$  (by def. of $\nabla f$)
$\ge \lim_{t \downarrow 0} \dfrac{\langle g, x + ty - x \rangle}{t} = \langle g, y \rangle$  (by def. of subgradient),
so $\langle \nabla f(x), y \rangle \ge \langle g, y \rangle$. Change $y$ to $-y$, and the inequality still holds, so $\langle \nabla f(x), y \rangle = \langle g, y \rangle$. Plugging in the standard basis vectors gives $\nabla f(x) = g$.
Next, the general case. 6 / 20
Which functions have subgradients?
Theorem (Nesterov '03, Thm 3.1.13). Let $f$ be a closed convex function and $x_0 \in \mathrm{int}(\mathrm{dom}(f))$. Then $\partial f(x_0)$ is a nonempty bounded set.
The proof uses supporting hyperplanes of the epigraph to show existence, and local Lipschitz continuity of convex functions to show boundedness.
Footnote: figure taken from Boyd and Vandenberghe, http://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf. 7 / 20
The converse
Lemma (Nesterov '03, Lm 3.1.6). If $\partial f(x) \neq \emptyset$ for all $x \in \mathrm{dom}(f)$, then $f$ is convex.
Proof. Take $x, y \in \mathrm{dom}(f)$, $\alpha \in [0, 1]$, $y_\alpha = (1 - \alpha)x + \alpha y = x + \alpha(y - x)$, and $g \in \partial f(y_\alpha)$. Then
$f(y) \ge f(y_\alpha) + \langle g, y - y_\alpha \rangle = f(y_\alpha) + (1 - \alpha)\langle g, y - x \rangle$  (1)
$f(x) \ge f(y_\alpha) + \langle g, x - y_\alpha \rangle = f(y_\alpha) - \alpha\langle g, y - x \rangle$  (2)
Multiply inequality (1) by $\alpha$ and inequality (2) by $(1 - \alpha)$, and add them to get $\alpha f(y) + (1 - \alpha) f(x) \ge f(y_\alpha)$. 8 / 20
Technicality: int(dom(f))
We cannot relax the assumption $x \in \mathrm{int}(\mathrm{dom}(f))$ to $x \in \mathrm{dom}(f)$.
Example: $f : [0, +\infty) \to \mathbb{R}$, $f(x) = -\sqrt{x}$. Then $\mathrm{dom}(f) = [0, +\infty)$ but $\partial f(0) = \emptyset$. 9 / 20
Compute subgradients: general rules
Smooth functions: $\partial f(x) = \{\nabla f(x)\}$.
Composition with an affine mapping: $\varphi(x) = f(Ax + b) \Rightarrow \partial \varphi(x) = A^T \partial f(Ax + b)$.
Positive sums: for $\alpha, \beta > 0$ and $f(x) = \alpha f_1(x) + \beta f_2(x)$, $\partial f(x) = \alpha \partial f_1(x) + \beta \partial f_2(x)$.
Maximums: $f(x) = \max_{i \in \{1, \dots, n\}} \{f_i(x)\} \Rightarrow \partial f(x) = \mathrm{conv}\{\partial f_i(x) : f_i(x) = f(x)\}$. 10 / 20
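A minimal sketch of the maximum rule (not from the slides; names illustrative): for a pointwise maximum of affine pieces $f(x) = \max_i (\langle a_i, x \rangle + b_i)$, any $a_i$ attaining the maximum is a valid subgradient.

```python
import numpy as np

def subgrad_max_affine(A, b, x):
    """One subgradient of f(x) = max_i (<a_i, x> + b_i).
    By the maximum rule, any a_i attaining the max is a subgradient;
    the full subdifferential is the convex hull of the active a_i."""
    vals = A @ x + b
    i = int(np.argmax(vals))          # index of one active piece
    return A[i]

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])
x = np.array([0.3, -0.2])
print(subgrad_max_affine(A, b, x))    # third piece is active here, so [-1, -1]
```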
Examples
$f(x) = |x|$.
$\partial f(x) = \begin{cases} \{\mathrm{sign}(x)\} & x \neq 0; \\ [-1, 1] & \text{otherwise.} \end{cases}$
As seen, $\partial f$ is a relation, or a point-to-set mapping.
Footnote: figure taken from Boyd and Vandenberghe, http://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf. 11 / 20
Examples
$f(x) = \sum_{i=1}^n |\langle a_i, x \rangle - b_i|$. Define
$I_-(x) = \{i : \langle a_i, x \rangle - b_i < 0\}$
$I_+(x) = \{i : \langle a_i, x \rangle - b_i > 0\}$
$I_0(x) = \{i : \langle a_i, x \rangle - b_i = 0\}$.
Then
$\partial f(x) = \sum_{i \in I_+(x)} a_i - \sum_{i \in I_-(x)} a_i + \sum_{i \in I_0(x)} [-a_i, a_i]$
(the last sum is a Minkowski sum).
The problem $\mathrm{minimize}_x \ \|Ax - b\|_1$ is known as robust fitting; it is less sensitive to outliers than the least-squares problem. 12 / 20
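A minimal sketch (not from the slides) of one subgradient of $f(x) = \|Ax - b\|_1$ using the formula above; the function name `subgrad_l1_residual` is illustrative, and for zero residuals it takes the particular choice $0 \in [-a_i, a_i]$.

```python
import numpy as np

def subgrad_l1_residual(A, b, x):
    """One subgradient of f(x) = ||Ax - b||_1: sum of +a_i over positive
    residuals, -a_i over negative residuals, and 0 (one valid choice
    from [-a_i, a_i]) for zero residuals."""
    r = A @ x - b
    return A.T @ np.sign(r)

A = np.array([[1.0, 2.0], [3.0, -1.0], [0.0, 1.0]])
b = np.array([1.0, 0.0, 2.0])
x = np.array([0.5, 0.5])
print(subgrad_l1_residual(A, b, x))   # one subgradient at x
```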
$f(x) = \max_{i \in \{1, \dots, n\}} x_i$. Then
$\partial f(x) = \mathrm{conv}\{e_i : x_i = f(x)\}$
$\partial f(0) = \mathrm{conv}\{e_i : i \in \{1, \dots, n\}\}$
Note: conv denotes the convex hull: $\mathrm{conv}\{x_i\} := \{\sum_i \alpha_i x_i : \alpha_i \ge 0, \ \sum_i \alpha_i = 1\}$. 13 / 20
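A small sketch (not in the original slides; names illustrative): enumerate the generators $e_i$ of $\partial f(x)$ for $f(x) = \max_i x_i$, i.e., the coordinates attaining the maximum; $\partial f(x)$ is their convex hull.

```python
import numpy as np

def subdiff_max_coordinate(x, tol=1e-12):
    """Generators of the subdifferential of f(x) = max_i x_i:
    the set {e_i : x_i = max(x)}; the subdifferential itself is
    their convex hull (a face of the probability simplex)."""
    active = np.flatnonzero(x >= x.max() - tol)
    return [np.eye(len(x))[i] for i in active]

print(subdiff_max_coordinate(np.array([1.0, 3.0, 3.0])))  # e_2 and e_3 (two active coordinates)
print(subdiff_max_coordinate(np.zeros(3)))                # e_1, e_2, e_3: hull is the whole simplex
```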
Examples
$f(x) = \|x\|_2$. $f$ is differentiable away from $0$, so:
$\partial f(x) = \left\{ \dfrac{x}{\|x\|_2} \right\}$, where $x \neq 0$.
At $0$, go back to the subgradient inequality: $\|y\|_2 \ge 0 + \langle g, y \rangle$.
Thus, $g \in \partial f(0)$ if, and only if, $\dfrac{\langle g, y \rangle}{\|y\|_2} \le 1$ for all $y \neq 0$. Thus, $g$ is in the dual ball: $\partial f(0) = B_2(0, 1)$. This is a common pattern! 14 / 20
How to compute subgradients: Examples
$f(x) = \|x\|_\infty = \max_{i \in \{1, \dots, n\}} |x_i|$.
$\partial f(x) = \mathrm{conv}\{g_i : g_i \in \partial |x_i|, \ |x_i| = f(x)\}$, where $x \neq 0$.
Going back to the subgradient inequality: $\|y\|_\infty \ge 0 + \langle g, y \rangle$.
Thus, $g \in \partial f(0)$ if, and only if, $\dfrac{\langle g, y \rangle}{\|y\|_\infty} \le 1$ for all $y \neq 0$. Thus, $\partial f(0)$ is the dual ball to the $\ell_\infty$ norm: $B_1(0, 1)$. 15 / 20
Examples
$f(x) = \|x\|_1 = \sum_{i=1}^n |x_i|$. Then
$\partial f(x) = \sum_{x_i > 0} e_i - \sum_{x_i < 0} e_i + \sum_{x_i = 0} [-e_i, e_i]$
for all $x$. Then
$\partial f(0) = \sum_{i=1}^n [-e_i, e_i] = B_\infty(0, 1)$. 16 / 20
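A hedged sketch for the $\ell_1$ norm itself (names illustrative): one element of $\partial \|x\|_1$, with a configurable choice in $[-1, 1]$ for the zero coordinates.

```python
import numpy as np

def subgrad_l1(x, zero_choice=0.0):
    """One subgradient of f(x) = ||x||_1: sign(x_i) for nonzero entries,
    and any fixed value in [-1, 1] (here `zero_choice`) for zero entries."""
    g = np.sign(x)
    g[x == 0] = zero_choice
    return g

print(subgrad_l1(np.array([2.0, -0.5, 0.0])))                    # [ 1. -1.  0. ]
print(subgrad_l1(np.array([2.0, -0.5, 0.0]), zero_choice=0.7))   # [ 1. -1.  0.7]
```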
Semi-continuity
Definition (upper semi-continuity). A point-to-set mapping $M$ is upper semi-continuous at $x$ if any convergent sequence $(x^k, s^k) \to (x, s)$ satisfying $s^k \in M(x^k)$ for each $k$ also obeys $s \in M(x)$.
Interpretation: if (i) $x^k \to x$ and (ii) for each $x^k$ you can find $s^k \in M(x^k)$ so that $s^k \to s$, then $s \in M(x)$.
Lower semi-continuity is essentially the opposite.
Definition (lower semi-continuity, lsc). A point-to-set mapping $M$ is lower semi-continuous at $x$ if for any convergent sequence $x^k \to x$ and any $s \in M(x)$, there exists a sequence $s^i \in M(x^{k_i})$ such that $s^i \to s$. 17 / 20
Lemma. Let $f$ be a proper convex function. Then $\partial f$ is upper semi-continuous, and $\partial f(x)$ is a convex set.
Proof: Take the limit of $f(y) \ge f(x^k) + \langle s^k, y - x^k \rangle$, $s^k \in \partial f(x^k)$. The second part is a direct result of the linearity of $\langle \cdot, y - x \rangle$.
However, if $f(x) = |x|$, then $\partial f$ is not lower semi-continuous at $x = 0$. 18 / 20
Directional derivative versus subgradient
Assume that $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is proper, closed (thus lsc), and convex. Then
1. the directional derivative is well defined for every $d \in \mathbb{R}^n$:
$f'(x; d) = \lim_{\alpha \downarrow 0} \dfrac{f(x + \alpha d) - f(x)}{\alpha}$;
2. the directional derivatives at $x$ bound all the subgradient projections:
$\partial f(x) = \{p \in \mathbb{R}^n : f'(x; d) \ge \langle p, d \rangle, \ \forall d \in \mathbb{R}^n\}$;
3. directional derivatives are extreme subgradient projections:
$f'(x; d) = \max\{\langle p, d \rangle : p \in \partial f(x)\}$. 19 / 20
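An illustrative numerical check of item 3 (not part of the slides), specialized to $f(x) = \|x\|_1$, where the maximum over the subdifferential has a closed form; the helper names are hypothetical.

```python
import numpy as np

def dir_derivative(f, x, d, alpha=1e-7):
    """Forward-difference approximation of the directional derivative f'(x; d)."""
    return (f(x + alpha * d) - f(x)) / alpha

def max_over_subdiff_l1(x, d):
    """max{<p, d> : p in the subdifferential of ||.||_1 at x}:
    sign(x_i) * d_i where x_i != 0, and |d_i| where x_i == 0."""
    return float(np.sum(np.where(x != 0, np.sign(x) * d, np.abs(d))))

f = lambda x: np.abs(x).sum()
x = np.array([1.0, 0.0, -2.0])
d = np.array([0.5, -1.0, 0.3])

print(dir_derivative(f, x, d))       # approximately 0.5 + 1.0 - 0.3 = 1.2
print(max_over_subdiff_l1(x, d))     # exactly 1.2, matching item 3
```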
$0 \in$ subgradient $=$ minimum
Suppose that $0 \in \partial f(x)$.
If $f$ is smooth and convex: $0 \in \partial f(x) = \{\nabla f(x)\} \Rightarrow \nabla f(x) = 0$.
In general: if $0 \in \partial f(x)$, then $f(y) \ge f(x) + \langle 0, y - x \rangle = f(x)$ for all $y \in \mathbb{R}^n$ $\Rightarrow$ $x$ is a minimizer!
The converse is also true: if $x^*$ is a minimizer, then $f(y) \ge f(x^*) + 0 = f(x^*) + \langle 0, y - x^* \rangle$ for all $y$, so $0 \in \partial f(x^*)$. 20 / 20
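A small self-contained sanity check of this optimality condition (illustrative only, not from the slides): for $f(x) = \|x - c\|_1$ the minimizer is $x^* = c$, and $g = 0$ satisfies the subgradient inequality there on randomly sampled points.

```python
import numpy as np

rng = np.random.default_rng(0)
c = np.array([1.0, -2.0, 0.5])
f = lambda x: np.abs(x - c).sum()      # minimized at x* = c
x_star = c

# Check f(y) >= f(x*) + <0, y - x*> = f(x*) on random trial points.
ys = [x_star + rng.normal(size=3) for _ in range(1000)]
print(all(f(y) >= f(x_star) for y in ys))   # True: 0 is a subgradient at x*
```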