Coordinate Update Algorithm Short Course
Subgradients and Subgradient Methods
Instructor: Wotao Yin (UCLA Math)
Summer 2016
Notation
- $f : \mathcal{H} \to \mathbb{R} \cup \{\infty\}$ is a closed proper convex function
- $\mathrm{dom}\, f := \{x \in \mathbb{R}^n : f(x) < \infty\}$
- $f$ is closed if $\mathrm{epi}\, f$ is a closed set
- $f$ is proper if $\mathrm{dom}\, f \neq \emptyset$
Definition of subgradient
Differentiable function
- $\nabla f$ (the gradient of $f$) is the vector of partial derivatives: $\nabla f(x) = \big[\tfrac{\partial f}{\partial x_1}(x); \ldots; \tfrac{\partial f}{\partial x_n}(x)\big]$
- If $f$ is convex, then $f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle$ for all $x, y \in \mathcal{H}$ (restricting $x, y \in \mathrm{dom}\, f$ is unnecessary since $f(x) = \infty$ if $x \notin \mathrm{dom}\, f$)
- [Figure taken from Boyd and Vandenberghe, Convex Optimization]
Non-differentiable function
- assumption: $f$ is a proper function (nonconvexity is allowed)
- At $\bar{x} \in \mathrm{dom}\, f$, the subdifferential of $f$ is
  $\partial f(\bar{x}) := \{ g \in \mathbb{R}^n : f(y) \ge f(\bar{x}) + \langle g,\, y - \bar{x} \rangle,\ \forall y \in \mathrm{dom}\, f \}$
  (defined via a global inequality, not locally or by taking limits)
- $g \in \partial f(\bar{x})$ is called a subgradient of $f$ at $\bar{x}$; we may write $\tilde{\nabla} f(\bar{x})$ for a subgradient in some contexts
Existence
- If $f \in C^1$ is proper convex, then $\nabla f(x) \in \partial f(x)$ for $x \in \mathrm{dom}\, f$; in fact, $\partial f(x) = \{\nabla f(x)\}$
- If $f$ is proper closed convex and $x \in \mathrm{ri}(\mathrm{dom}\, f)$, then $\partial f(x)$ is nonempty
- Conversely, if the set $\partial f(x)$ is nonempty for all $x \in \mathrm{dom}\, f$, then $f$ is convex
Computing subgradients
General rules
- smooth functions: $\partial f(x) = \{\nabla f(x)\}$
- chain rule: $\varphi(x) = f(Ax + b) \implies \partial \varphi(x) = A^T \partial f(Ax + b)$
- positive scaling: $\lambda > 0 \implies \partial (\lambda f)(x) = \lambda\, \partial f(x)$
- positive sums: for $\alpha, \beta > 0$ and $f(x) = \alpha f_1(x) + \beta f_2(x)$, $\partial f(x) \supseteq \alpha\, \partial f_1(x) + \beta\, \partial f_2(x)$; under additional conditions, e.g. $0 \in \mathrm{sri}(\mathrm{dom}\, f_1 - \mathrm{dom}\, f_2)$, $\partial f(x) = \alpha\, \partial f_1(x) + \beta\, \partial f_2(x)$
- maximums: $f(x) = \max_{i \in \{1,\ldots,n\}} f_i(x) \implies \partial f(x) = \mathrm{conv} \bigcup \{\partial f_i(x) : f_i(x) = f(x)\}$
- separable sum: $f(x) = \sum_{i=1}^n f_i(x_i) \implies \partial f(x) = \partial f_1(x_1) \times \cdots \times \partial f_n(x_n)$
Examples
- $f(x) = |x|$, $x \in \mathbb{R}$:
  $\partial f(x) = \begin{cases} \{\mathrm{sign}(x)\}, & x \neq 0; \\ [-1, 1], & \text{otherwise} \end{cases}$
- $f(x) = \|x\|_1$, $x \in \mathbb{R}^n$: $\partial f(x) = \partial |x_1| \times \cdots \times \partial |x_n|$
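As a concrete check, here is a minimal Python sketch (the function name and test are my own, not from the course) that returns one element of $\partial \|\cdot\|_1$ via the separable-sum rule and verifies the defining global inequality at random points:

```python
import numpy as np

def subgrad_l1(x):
    """Return one element of the subdifferential of ||x||_1.

    Componentwise: sign(x_i) where x_i != 0, and any value in
    [-1, 1] (here 0) where x_i == 0 -- the separable-sum rule.
    """
    return np.sign(x)  # np.sign(0) == 0, a valid choice from [-1, 1]

# verify f(y) >= f(x) + <g, y - x> for random x, y
rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    g = subgrad_l1(x)
    assert np.abs(y).sum() >= np.abs(x).sum() + g @ (y - x) - 1e-12
```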
Examples
- $f(x) = \sum_{i=1}^n |\langle a_i, x \rangle - b_i|$. Define
  $I_-(x) = \{i : \langle a_i, x \rangle - b_i < 0\}$, $I_+(x) = \{i : \langle a_i, x \rangle - b_i > 0\}$, $I_0(x) = \{i : \langle a_i, x \rangle - b_i = 0\}$.
  Then $\partial f(x) = \sum_{i \in I_+(x)} a_i - \sum_{i \in I_-(x)} a_i + \sum_{i \in I_0(x)} [-a_i, a_i]$
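A hedged sketch of the index-set formula above (names are mine); for $i \in I_0(x)$ it picks $0 \in [-a_i, a_i]$, one valid choice among many:

```python
import numpy as np

def subgrad_abs_residuals(A, b, x, tol=1e-12):
    """One subgradient of f(x) = sum_i |<a_i, x> - b_i|.

    Rows with positive residual contribute +a_i, negative residual
    -a_i, and zero residual any point of [-a_i, a_i]; we take 0.
    """
    r = A @ x - b
    s = np.sign(r)
    s[np.abs(r) <= tol] = 0.0   # the choice 0 from the interval term
    return A.T @ s

# sanity check: the subgradient inequality holds at a random pair
rng = np.random.default_rng(0)
A, b = rng.normal(size=(4, 3)), rng.normal(size=4)
x, y = rng.normal(size=3), rng.normal(size=3)
g = subgrad_abs_residuals(A, b, x)
assert np.abs(A @ y - b).sum() >= np.abs(A @ x - b).sum() + g @ (y - x) - 1e-12
```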
Examples
- $f(x) = \max_{i \in \{1,\ldots,n\}} x_i$. Then $\partial f(x) = \mathrm{conv}\{e_i : x_i = f(x)\}$
- For example, at the origin $0$, $\partial f(0) = \mathrm{conv}\{e_i : i \in \{1,\ldots,n\}\}$
Examples
- $f(x) = \|x\|_2$. $f$ is differentiable away from $0$, so $\partial f(x) = \left\{ \frac{x}{\|x\|_2} \right\}$ for $x \neq 0$.
- At $0$, go back to the subgradient inequality: $\|y\|_2 \ge 0 + \langle g,\, y - 0 \rangle$.
  Thus $g \in \partial f(0)$ if and only if $\frac{\langle g, y \rangle}{\|y\|_2} \le 1$ for all $y \neq 0$, i.e., $g$ is in the dual ball to the $\ell_2$ norm, which is again $B_2(0, 1)$. This is a common pattern!
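The "common pattern" is easy to probe numerically: by Cauchy–Schwarz, any $g$ with $\|g\|_2 \le 1$ satisfies the subgradient inequality at $0$. A quick sketch (my own test, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    g = rng.normal(size=3)
    g /= max(1.0, np.linalg.norm(g))   # force g into the unit 2-norm ball
    y = rng.normal(size=3)
    # subgradient inequality at 0: ||y||_2 >= <g, y>  (Cauchy-Schwarz)
    assert np.linalg.norm(y) >= g @ y - 1e-12
```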
Examples
- $f(x) = \|x\|_\infty = \max_{i \in \{1,\ldots,n\}} |x_{(i)}|$.
- For $x \neq 0$: $\partial f(x) = \mathrm{conv}\{\mathrm{sign}(x_{(i)})\, e_i : |x_{(i)}| = f(x)\}$
- At $0$, go back to the subgradient inequality: $\|y\|_\infty \ge 0 + \langle g, y \rangle$.
  Thus $g \in \partial f(0)$ if and only if $\frac{\langle g, y \rangle}{\|y\|_\infty} \le 1$ for all $y \neq 0$. Thus $\partial f(0)$ is the dual ball to the $\ell_\infty$ norm: $B_1(0, 1)$.
Examples
- Let $C$ be a closed nonempty convex set. Define the indicator function
  $\iota_C(x) = \begin{cases} 0, & \text{if } x \in C, \\ \infty & \text{otherwise.} \end{cases}$
- Subdifferential of $\iota_C$: let $x \in C$,
  $\partial \iota_C(x) = \{ g : \iota_C(y) \ge \iota_C(x) + \langle g,\, y - x \rangle,\ \forall y \} = \{ g : \langle g,\, y - x \rangle \le 0,\ \forall y \in C \}$
  which is a cone, called the normal cone $N_C(x)$
- By convention, if $x \notin C$, then $\partial \iota_C(x) = \emptyset$
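For intuition, the normal cone is easy to write down for simple sets. A minimal sketch for a set not on the slides, the box $C = \{x : l \le x \le u\}$: componentwise, the cone is $\mathbb{R}_+$ at an active upper bound, $\mathbb{R}_-$ at an active lower bound, and $\{0\}$ strictly inside.

```python
import numpy as np

def in_normal_cone_box(g, x, l, u, tol=1e-12):
    """Check g in N_C(x) for the box C = {x : l <= x <= u}, x in C.

    Componentwise: g_i > 0 requires x_i = u_i, g_i < 0 requires
    x_i = l_i, and otherwise g_i must be 0.
    """
    ok_upper = (x >= u - tol) | (g <= tol)    # g_i > 0 only at upper bound
    ok_lower = (x <= l + tol) | (g >= -tol)   # g_i < 0 only at lower bound
    return bool(np.all(ok_upper & ok_lower))
```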
[Figures taken from D. Bertsekas, MIT 253, Spring '12]
Comparisons
- Top left: $\ell_1(x) = |x_1| + |x_2|$; top right: $f(x) = |x_1| + 2|x_2|$
- Bottom left: $\bar{f}(x) = f(Rx)$ (the $\frac{\pi}{4}$-rotated $f$); bottom right: $\{x : \bar{f}(x) \le 2\}$
1-norm
- function: $\ell_1(x) = |x_1| + |x_2|$
- pick any $\alpha, \beta > 0$:
  $\partial \ell_1(\alpha, 0) = \{1\} \times [-1, 1] = \{(1, g_2) : g_2 \in [-1, 1]\}$
  $\partial \ell_1(0, \beta) = [-1, 1] \times \{1\} = \{(g_1, 1) : g_1 \in [-1, 1]\}$
  $\partial \ell_1(\alpha, \beta) = \{(1, 1)\}$
Weighted 1-norm
- function: $f(x) = |x_1| + 2|x_2|$
- pick any $\alpha, \beta > 0$:
  $\partial f(\alpha, 0) = \{1\} \times [-2, 2]$
  $\partial f(0, \beta) = [-1, 1] \times \{2\}$
  $\partial f(\alpha, \beta) = \{(1, 2)\}$
Weighted 1-norm, rotated
- rotation matrix: $R := \begin{bmatrix} \cos\frac{\pi}{4} & \sin\frac{\pi}{4} \\ -\sin\frac{\pi}{4} & \cos\frac{\pi}{4} \end{bmatrix}$
- function: $\bar{f}(x) = f(Rx)$
- pick any $\alpha > 0$:
  $\partial \bar{f}(\alpha, \alpha) = R^T\big(\{1\} \times [-2, 2]\big) = \big\{(g_1, g_2) : g_1 + g_2 = \sqrt{2},\ g_1 \in [-\tfrac{1}{\sqrt{2}}, \tfrac{3}{\sqrt{2}}]\big\}$
  $\partial \bar{f}(-\alpha, \alpha) = R^T\big([-1, 1] \times \{2\}\big) = \big\{(g_1, g_2) : g_2 - g_1 = 2\sqrt{2},\ g_1 \in [-\tfrac{3}{\sqrt{2}}, -\tfrac{1}{\sqrt{2}}]\big\}$
Weighted 1-norm ball, rotated
- set: $C = \{x : \bar{f}(x) \le 2\}$
- function: $\iota_C(x) = 0$ if $x \in C$ and $\iota_C(x) = \infty$ if $x \notin C$
- at the boundary points $(\sqrt{2}, \sqrt{2})$ and $(-\sqrt{2}, \sqrt{2})$, where $\bar{f} = 2$:
  $\partial \iota_C(\sqrt{2}, \sqrt{2}) = N_C(\sqrt{2}, \sqrt{2}) = \{\theta_1(-1, 3) + \theta_2(3, -1) : \theta_1, \theta_2 \ge 0\}$
  $\partial \iota_C(-\sqrt{2}, \sqrt{2}) = N_C(-\sqrt{2}, \sqrt{2}) = \{\theta_1(-1, 3) + \theta_2(-3, 1) : \theta_1, \theta_2 \ge 0\}$
Comparisons
- left: $f(x) = \|x\|_2$. If $x \neq 0$, $\partial f(x) = \left\{\frac{x}{\|x\|_2}\right\}$, which contains only one point
- right: $f(x) = \iota_{\{\|\cdot\|_2 \le C\}}(x)$. If $\|x\|_2 = C$, $\partial f(x) = \left\{\theta \frac{x}{\|x\|_2} : \theta \ge 0\right\}$, which is a ray
Partial subgradient
- Let $f(x_1, x_2)$ be a proper closed convex function
- If $f(x_1, x_2)$ is differentiable:
  $p_1 = \nabla_1 f(x_1, x_2)$, $p_2 = \nabla_2 f(x_1, x_2)$ $\implies$ $[p_1; p_2] = \nabla f(x_1, x_2)$
- If $f(x_1, x_2)$ is non-differentiable:
  $p_1 \in \partial_1 f(x_1, x_2)$, $p_2 \in \partial_2 f(x_1, x_2)$ $\;\centernot\implies$ $[p_1; p_2] \in \partial f(x_1, x_2)$
- In general, $\partial_1 f(x_1, x_2) \times \partial_2 f(x_1, x_2) \supseteq \partial f(x_1, x_2)$
- exception: "=" holds for separable $f(x_1, x_2) = f_1(x_1) + f_2(x_2)$
Counterexample: $\bar{f}$, the rotated weighted $\ell_1$ function
[Contour plot of $\bar{f}$]
- Take $(x_1, x_2) = (\alpha, \alpha)$ for arbitrary $\alpha > 0$:
  $0 \in \partial_{x_1} \bar{f}(\alpha, \alpha)$ and $0 \in \partial_{x_2} \bar{f}(\alpha, \alpha)$, but
  $[0; 0] \notin \partial \bar{f}(\alpha, \alpha) = \big\{g : g_1 + g_2 = \sqrt{2},\ g_1 \in [-\tfrac{1}{\sqrt{2}}, \tfrac{3}{\sqrt{2}}]\big\}$
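The failure is easy to see numerically: $0$ lies in both partial subdifferentials at $(\alpha, \alpha)$, yet $g = 0$ violates the global subgradient inequality, since $\bar{f}(y) < \bar{f}(\alpha, \alpha)$ at $y = 0$. A small sketch, assuming the definitions of $R$ and $\bar{f}$ above:

```python
import numpy as np

c = 1 / np.sqrt(2)
R = np.array([[c, c], [-c, c]])         # rotation by pi/4

def fbar(x):
    y = R @ x
    return abs(y[0]) + 2 * abs(y[1])    # f(Rx) with f(y) = |y1| + 2|y2|

x = np.array([1.0, 1.0])                # (alpha, alpha) with alpha = 1
# g = 0 is a subgradient at x only if fbar(y) >= fbar(x) for all y,
# i.e. only if x is a global minimizer -- but y = 0 already violates it:
print(fbar(np.zeros(2)), "<", fbar(x))  # 0.0 < 1.414...
```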
Subgradient optimality condition
$0 \in \partial f$ for unconstrained minimization
- Let $f$ be a proper function. Convexity is not required.
- The set of minimizers $\arg\min f$ can be empty, a singleton, or a set with infinitely many points
- Lemma: $x^* \in \arg\min f$ if and only if $0 \in \partial f(x^*)$
- Proof: ($\Leftarrow$) Let $0 \in \partial f(x^*)$. For all $y$: $f(y) \ge f(x^*) + \langle 0,\, y - x^* \rangle = f(x^*)$.
  ($\Rightarrow$) Let $x^* \in \arg\min f$. For all $y$: $f(y) \ge f(x^*) = f(x^*) + \langle 0,\, y - x^* \rangle$, thus $0 \in \partial f(x^*)$.
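A classic worked instance of this lemma (standard material, though not on the slide): minimize $g(x) = \tfrac{1}{2}(x - b)^2 + \lambda |x|$ over $\mathbb{R}$. Solving $0 \in (x^* - b) + \lambda\, \partial |x^*|$ case by case gives the soft-thresholding formula:

```python
def soft_threshold(b, lam):
    """Unique minimizer of g(x) = 0.5*(x - b)**2 + lam*|x|, lam > 0.

    From 0 in (x - b) + lam * d|x|:
      x > 0  =>  x = b - lam   (consistent when b >  lam)
      x < 0  =>  x = b + lam   (consistent when b < -lam)
      x = 0  =>  0 in -b + lam*[-1, 1], i.e. |b| <= lam
    """
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0
```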
Variational inequality for constrained minimization
- Let $f$ be a proper closed convex function and $C$ be a nonempty closed convex set.
- Lemma: under proper regularity conditions (e.g., $0 \in \mathrm{sri}(\mathrm{dom}\, f - C)$),
  $0 \in \partial f(x^*) + N_C(x^*) \iff x^* \in \arg\min \{ f(x) : x \in C \}$
- interpretation of the first condition: there exists a subgradient $\tilde{\nabla} f(x^*) \in \partial f(x^*)$ such that the following variational inequality holds:
  $\langle \tilde{\nabla} f(x^*),\, y - x^* \rangle \ge 0, \quad \forall y \in C$
[Three figures taken from D. Bertsekas, MIT 253, Spring '12]
Subgradient method
Negative subgradient is not necessarily a descent direction
- Consider $f(x) = |x|$, $x \in \mathbb{R}$. Recall $\partial f(0) = \{g : |g| \le 1\}$. Subgradients need not vanish at the minimum, and many choices of $g$ give ascent directions. No matter how close $x \neq 0$ is to $0$, $\nabla f(x) = \mathrm{sign}(x)$.
- Consider $f(x) = |x_1| + 2|x_2|$. At $x = (1, 0)$, $g = (1, 2) \in \partial f(x)$, but $-g$ is not a descent direction.
- Seemingly only a zero-measure set of points causes this issue, but your solutions are often exactly there!
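The $(1, 0)$ example is easy to check numerically: stepping along $-g = (-1, -2)$ increases the objective for every positive step size. A quick sketch:

```python
def f(x1, x2):
    return abs(x1) + 2 * abs(x2)

x, g = (1.0, 0.0), (1.0, 2.0)           # g is a subgradient of f at x
for t in (0.5, 0.1, 0.01, 0.001):
    val = f(x[0] - t * g[0], x[1] - t * g[1])
    print(t, val)                       # every value exceeds f(x) = 1
# indeed f(x - t*g) = |1 - t| + 4t = 1 + 3t for t in (0, 1]: pure ascent
```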
Subgradient method
- applications: find a point in the intersection of convex sets; minimize nonsmooth functions; the dual ascent method (dual functions are often nonsmooth)
- iteration: $x^{k+1} = x^k - \alpha_k \tilde{\nabla} f(x^k)$, where $\tilde{\nabla} f(x^k) \in \partial f(x^k)$ (a code sketch follows the step-size rules below)
- the objective sequence $f^k := f(x^k)$ is typically non-monotonic (monotonicity is difficult to ensure since subgradients do not vary continuously)
step size choices: Define $f^k_{\mathrm{best}} := \min\{f^0, f^1, \ldots, f^k\}$ and let $G$ bound the subgradient norms.
- Fix $\alpha_k \equiv \alpha$: during $k = 0, \ldots, O(\alpha^{-2} G^{-2})$, $f^k_{\mathrm{best}} - f^* = O(\alpha^{-1} k^{-1})$
- Reduce $\alpha_k$ so that $\lim_k \alpha_k = 0$ and $\sum_k \alpha_k = \infty$: $f^k_{\mathrm{best}} - f^* = O(k^{-1/2})$
- Several other choices
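The sketch promised above: a minimal implementation of the method and both step-size rules on the toy problem $\min_x \|Ax - b\|_1$ (the problem data, names, and constants are mine, not from the course). Since the iterates are non-monotonic, progress is tracked through $f^k_{\mathrm{best}}$:

```python
import numpy as np

def subgradient_method(A, b, x0, steps, n_iter=2000):
    """Subgradient method for f(x) = ||A x - b||_1.

    steps: callable k -> alpha_k. Returns the best objective seen.
    """
    x, f_best = x0.copy(), np.inf
    for k in range(n_iter):
        r = A @ x - b
        f_best = min(f_best, np.abs(r).sum())
        g = A.T @ np.sign(r)            # a subgradient at x (chain rule)
        x -= steps(k) * g
    return min(f_best, np.abs(A @ x - b).sum())

rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
x0 = np.zeros(5)
print(subgradient_method(A, b, x0, steps=lambda k: 1e-3))                  # fixed
print(subgradient_method(A, b, x0, steps=lambda k: 0.1 / np.sqrt(k + 1)))  # diminishing
```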
Summary
- Subgradients lose several good features of gradients
- Subgradients are easy to compute for some convex functions
- They help define solutions
- The subgradient method works but is slow
Not covered
- The limiting subdifferential, defined by taking limits
- Subgradients of dual functions, computed by minimizing the Lagrangian
- Methods based on subgradients: the cutting plane method and the bundle method
- The proximal map